Part I: Architecture Design of High Performance and Reliable Computer Memory Systems — Abstracts in Chinese and English
Dissertation Abstracts (Chinese and English)
Author: Sun Hongbin
Dissertation title: Architecture Design of High Performance and Reliable Computer Memory Systems
Chinese Abstract
Over the past few decades, the scaling of integrated-circuit feature sizes has brought enormous performance improvements to circuit design. According to Moore's law, processor speed doubles every 18 months, whereas memory speed improves by only about 7% per year. As a result, the processor-memory speed gap doubles roughly every 21 months, a trend known as the "memory wall" problem. The design of the computer memory hierarchy, including the caches and main memory, is the principal means of bridging this performance gap. As CMOS feature sizes continue to shrink, both the reliability and the performance of computer memory systems are under serious threat: the rising rates of hardware defects and soft errors steadily degrade cache yield and reliability. At the same time, the maturing three-dimensional (3D) integration technology offers a better means of attacking the memory wall problem. Designing high-performance, highly reliable memory architectures has therefore become a key technology for computer systems. This dissertation makes the following major contributions toward narrowing the processor-memory performance gap and improving the reliability of the memory system.
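The 21-month figure follows directly from the two growth rates quoted above; the short calculation below (an illustrative sketch, not part of the dissertation) reproduces it.

```python
import math

# Assumed trend figures quoted in the abstract: processor speed doubles every
# 18 months, memory speed improves by about 7% per year.
proc_doubling_months = 18
mem_growth_per_year = 1.07

proc_growth_per_year = 2 ** (12 / proc_doubling_months)           # ~1.59x per year
gap_growth_per_year = proc_growth_per_year / mem_growth_per_year  # ~1.48x per year

# Months for the processor-memory speed gap to double.
gap_doubling_months = 12 * math.log(2) / math.log(gap_growth_per_year)
print(round(gap_doubling_months, 1))  # ~21.1 months, the "memory wall" trend
```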
First, this dissertation proposes an efficient built-in repair analysis (BIRA) method to improve the yield of embedded memories. Embedded memories have become core components of processors and system-on-chip designs and now largely determine the yield of the whole chip. Unlike commodity memories, embedded memories can hardly be tested and repaired through external test equipment; instead, built-in self-test and built-in repair analysis circuits are required to complete memory test and repair. Previous BIRA studies all assumed that hardware defects can only be repaired by on-chip redundant rows or columns, yet in practice most embedded memories already integrate error correction code (ECC) circuits to guard against soft errors. By appropriately leveraging this existing ECC circuitry, the proposed approach realizes a built-in repair analyzer with a high repair rate and low hardware overhead: it uses a very simple block-defect-first repair analysis scheme to reduce hardware cost, uses the on-chip ECC resources to correct the residual hardware defects, and finally applies appropriate measures to restore soft error tolerance. The method effectively reduces the hardware overhead of built-in repair analysis while maintaining the same or an even higher defect repair rate and the same soft error tolerance.
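As a rough illustration of the block-defect-first idea, the sketch below spends spare rows and columns only on rows or columns that contain multi-bit defects and leaves isolated single-bit defects to the existing per-word single-error-correcting ECC. The function name, the greedy allocation order, and the one-ECC-word-per-row simplification are assumptions made for this sketch, not the dissertation's exact algorithm.

```python
from collections import Counter

# Illustrative block-defect-first repair analysis (assumes, for simplicity,
# one ECC word per row and a single-error-correcting on-chip code).
def analyze_repair(defects, spare_rows, spare_cols):
    """defects: set of (row, col) faulty cells reported by the built-in self-test."""
    row_cnt = Counter(r for r, _ in defects)
    col_cnt = Counter(c for _, c in defects)

    used_rows, used_cols = set(), set()
    # Spend redundancy on rows/columns holding more than one fault first,
    # since the SEC ECC can fix at most one bit per word.
    for r, n in row_cnt.most_common():
        if n > 1 and len(used_rows) < spare_rows:
            used_rows.add(r)
    for c, n in col_cnt.most_common():
        if n > 1 and len(used_cols) < spare_cols:
            used_cols.add(c)

    residual = {(r, c) for r, c in defects
                if r not in used_rows and c not in used_cols}
    # Repairable if every remaining word has at most one faulty bit, which the
    # existing ECC corrects; soft error tolerance for those words is then
    # restored by other means, as discussed above.
    repairable = all(n <= 1 for n in Counter(r for r, _ in residual).values())
    return repairable, used_rows, used_cols
```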
Second, this dissertation proposes a method that uses multi-bit error correction codes (ECC) to improve the fault tolerance of the L2 cache. Faults in SRAM fall into two categories: hardware defects and soft errors induced by particle strikes. In conventional memory design, hardware defects are repaired with spare rows and columns, while soft errors are handled by single-bit-correcting ECC. Continued technology scaling has made reliable high-density cache design increasingly difficult, and the traditional reliability techniques can no longer meet yield requirements. Although multi-bit ECC can significantly improve cache reliability, it is commonly considered inapplicable to cache design because it noticeably degrades processor performance and increases area overhead. This dissertation investigates the feasibility of using multi-bit ECC in the L2 cache to tolerate a large number of random hardware defects while still protecting against soft errors, and quantifies the achievable reliability improvement. The work does not aim to develop new multi-bit codes; rather, it focuses on architectural techniques that make multi-bit ECC effective in the L2 cache. Since cache blocks containing one or more defective cells can be identified during memory testing, an intuitively better choice is to protect with multi-bit ECC only the cache blocks that need it, instead of uniformly protecting every block. Such selective protection greatly reduces the impact of the long multi-bit decoding latency on processor performance, and the amount of ECC redundancy that must be stored drops correspondingly. Selective use of multi-bit ECC requires a run-time lookup table, built on content-addressable memory (CAM), to decide whether the block currently being accessed is protected by multi-bit ECC. Although such a direct CAM-based realization appears simple, it cannot cope with high defect densities: (1) as the random defect density rises, most cache accesses end up triggering multi-bit ECC decoding, which degrades overall system performance; (2) a CAM consumes far more power than ordinary SRAM, so continuously searching the defect table incurs excessive energy consumption. By further exploiting the locality of cache accesses and supplementing the cache with several small special-purpose buffers, this dissertation avoids the multi-bit decoding latency for most accesses and greatly reduces the area overhead of multi-bit ECC. Moreover, the proposed L2 cache design improves reliability while maintaining the same level of soft error tolerance.
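The read path of such a selective scheme can be pictured as follows: a block whose address is absent from the defect map takes the fast single-error-correcting path, while a known-defective block either hits in a small buffer of recently corrected blocks or pays the multi-bit decoding latency. The class, names, and buffer policy below are simplified assumptions for illustration, not the exact set of special-purpose buffers in the dissertation.

```python
from collections import OrderedDict

def sec_decode(raw_block):           # placeholder for the fast SEC/DED decoder
    return raw_block

def multibit_ecc_decode(raw_block):  # placeholder for the strong multi-bit decoder
    return raw_block

class SelectiveEccL2:
    def __init__(self, defective_blocks, buffer_entries=16):
        self.defective = set(defective_blocks)  # block addresses found defective at test time
        self.corrected = OrderedDict()          # small buffer of recently corrected blocks
        self.capacity = buffer_entries

    def read(self, addr, raw_block):
        if addr not in self.defective:
            return sec_decode(raw_block)        # common case: fast path
        if addr in self.corrected:              # recently corrected: skip the slow decode
            self.corrected.move_to_end(addr)
            return self.corrected[addr]
        data = multibit_ecc_decode(raw_block)   # rare slow path
        self.corrected[addr] = data
        if len(self.corrected) > self.capacity:
            self.corrected.popitem(last=False)  # evict the least recently used entry
        return data
```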
Three-dimensional (3D) integration has emerged as a promising technology for processor design and offers a viable solution to the memory wall problem of high-performance processors. Targeting 3D integration technology, this dissertation develops a 3D DRAM architecture that adopts a coarse-grained partitioning strategy. Compared with prior work, the architecture fully exploits the benefits of 3D integration without imposing stringent via-fabrication constraints: the global address and data buses are shared sensibly among all the silicon dies, so only a small number of through-silicon vias are needed and the requirements on via dimensions are relatively relaxed. Building on this memory structure, the dissertation further designs a heterogeneous 3D DRAM organization for multi-core computing systems, in which 3D DRAM implements both the L2 cache and the main memory. To improve the performance of the DRAM L2 cache, techniques such as variable sub-array sizes and multi-threshold-voltage circuits are used to reduce access latency. Contrary to the common impression that DRAM is far slower than SRAM, we show with a modified memory modeling tool that the 3D DRAM L2 cache can achieve access speed comparable to, or even faster than, SRAM. With these techniques, the proposed 3D DRAM architecture effectively reduces access latency and thereby improves the overall performance of 3D integrated computing systems.
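A rough back-of-the-envelope count shows why coarse-grained partitioning keeps the vertical wiring modest; all of the bus widths and sub-array counts below are assumed example values, not figures from the dissertation.

```python
# Assumed example parameters for a rough TSV-count comparison.
addr_bits, data_bits, ctrl_bits = 32, 128, 16
subarrays_per_die, wires_per_subarray = 512, 64

# Coarse-grained partitioning: all dies share one global address/data/control
# bus, so the vertical wiring is just that bus, independent of the die count.
coarse_tsvs = addr_bits + data_bits + ctrl_bits

# A fine-grained split (the arrays themselves divided across dies) would
# instead need vertical wires in proportion to the number of sub-arrays.
fine_tsvs = subarrays_per_die * wires_per_subarray

print(coarse_tsvs, fine_tsvs)  # e.g. 176 vs 32768 vertical connections
```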
For future 3D integrated microprocessors, because the vertically stacked dies shield one another, different dies suffer from particle-induced soft errors to different degrees. Studies have shown that the outer dies can shield the inner dies from particle strikes, a phenomenon known as the shielding effect. Motivated by this shielding effect in 3D microprocessor structures, this dissertation proposes a soft error resilient 3D cache architecture. Because the outer dies shield the inner circuits from alpha particles, the inner dies may be naturally immune to alpha-particle-induced soft errors, and their error-tolerance circuits can be removed. Consequently, accessing cache data on the soft-error-immune inner dies incurs much lower latency and energy than accessing the other dies. We further develop several techniques to dynamically move data from the outer dies to the inner dies, so that cache accesses are concentrated on the dies that are immune to soft errors.
For the L1 cache, we propose an inner-die direct-mapped cache organization that maximizes accesses to the inner die while avoiding the energy wasted on unnecessary data accesses; for the lower-level caches, we propose decoupling the direct correspondence between tag entries and data blocks to compensate for their relatively poor access locality. The proposed 3D cache hierarchy significantly improves processor performance and energy efficiency.
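The dynamic data movement described above can be sketched as a simple promotion scheme: blocks that turn out to be frequently accessed are migrated into the ways located on the soft-error-invulnerable inner die, so that later hits skip the error-correction decode. The counters, threshold, and eviction choice below are assumptions made for this sketch rather than the exact mechanisms of the proposed design.

```python
HOT_THRESHOLD = 4  # assumed promotion threshold for this sketch

class ShieldedCacheSet:
    """One set of a 3D cache whose ways are split between an inner,
    soft-error-invulnerable die (no ECC) and ECC-protected outer dies."""

    def __init__(self, sid_ways, outer_ways):
        self.sid, self.outer = {}, {}
        self.sid_ways, self.outer_ways = sid_ways, outer_ways
        self.hits = {}

    def access(self, tag):
        self.hits[tag] = self.hits.get(tag, 0) + 1
        if tag in self.sid:
            return "SID hit: low latency and energy, no ECC decode"
        if tag in self.outer:
            # Promote hot blocks so that subsequent accesses avoid ECC latency.
            if self.hits[tag] >= HOT_THRESHOLD and len(self.sid) < self.sid_ways:
                self.sid[tag] = self.outer.pop(tag)
                return "outer hit, block promoted to SID"
            return "outer hit: ECC decode on the access path"
        # Miss: fill into the SID first, spill to the outer dies when it is full.
        if len(self.sid) < self.sid_ways:
            self.sid[tag] = object()
        else:
            if len(self.outer) >= self.outer_ways:
                self.outer.pop(next(iter(self.outer)))  # naive eviction for the sketch
            self.outer[tag] = object()
        return "miss, block filled"
```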
Finally, this dissertation analyzes the performance and power improvements that 3D integration can bring to future video processing circuits. As video processing algorithms grow ever more complex, memory bandwidth has become the main bottleneck of advanced video coding and display processing systems, and this bandwidth shortage will only worsen. Because 3D logic-memory integration provides an abundance of vertical interconnects, it will have a major impact on video processing applications that demand large memory capacity and high bandwidth. To quantify the performance and power improvement of a 3D integrated video processing system, the dissertation further develops a 3D integrated motion estimation accelerator that can be seamlessly integrated into a multimedia multi-core processing system. We propose a 3D integrated DRAM organization and image-frame storage strategy, and design a fully parallel two-dimensional motion estimation engine that exploits the 3D stacked DRAM to reduce system power. The approach seamlessly supports a variety of motion estimation algorithms, including the variable block-size motion estimation of the H.264/AVC coding standard. Taking multi-frame motion estimation as a case study, we demonstrate the energy efficiency of the accelerator with a hardware design and DRAM modeling tools.
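To make the bandwidth argument concrete, the kernel below is a minimal full-search block-matching routine based on the sum of absolute differences (SAD): every candidate motion vector re-reads an N×N window of the reference frame, which is exactly the kind of repeated frame-memory traffic the 3D-stacked DRAM organization is meant to serve locally. The block size and search range are example values; this is a reference sketch, not the accelerator's parallel datapath.

```python
import numpy as np

def full_search_me(cur_block, ref_frame, x, y, search_range=16):
    """Return the (dx, dy) motion vector minimizing SAD for one block located
    at (x, y) in the current frame; cur_block and ref_frame are 2D uint8 arrays."""
    n = cur_block.shape[0]
    best_sad, best_mv = float("inf"), (0, 0)
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            ry, rx = y + dy, x + dx
            if ry < 0 or rx < 0 or ry + n > ref_frame.shape[0] or rx + n > ref_frame.shape[1]:
                continue  # candidate window falls outside the reference frame
            cand = ref_frame[ry:ry + n, rx:rx + n]
            sad = int(np.abs(cur_block.astype(np.int32) - cand.astype(np.int32)).sum())
            if sad < best_sad:
                best_sad, best_mv = sad, (dx, dy)
    return best_mv, best_sad
```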
Combining the design requirements of memory systems with the latest advances in integrated-circuit technology, this dissertation proposes systematic solutions to several key problems in computer memory system design. All of the proposed architectures and methods are validated with system-level and circuit-level simulation tools: the circuit-level memory designs are evaluated mainly with hardware circuit simulation and memory modeling tools to estimate circuit parameters, while single-core and multi-core processor system simulators are used to comprehensively evaluate the processing capability and power consumption of the proposed architectures.
Key words: Memory architecture; Reliability; Fault tolerance; 3D integration

Architecture Design of High Performance and Reliable Computer Memory Systems
Sun Hongbin
ABSTRACT
Scaling of CMOS devices has provided remarkable improvement in the performance of integrated circuits over the past few decades. Moore's law tells us that processor speed doubles every 18 months thanks to technology scaling, whereas memory speed has increased by only about 7% per year. As a consequence, the processor-memory speed gap doubles every 21 months, a trend known as the "memory wall". To bridge the processor-memory gap, the computer memory hierarchy, including both cache and main memory, has played a key role in alleviating the effect of memory slowness. As CMOS technology continues to scale down, how to design a high-performance and reliable memory hierarchy in a computer system has become a grand challenge. The yield and reliability of cache memory are threatened by both hard faults and soft errors. In the meanwhile, the emerging three-dimensional (3D) integration technology provides better approaches to address the memory wall. Consequently, designing high-performance and reliable memory architectures has become a critical technique in computer systems. This thesis makes several important contributions to mitigating the processor-memory gap and improving the reliability of the memory hierarchy.

First, we present a cost-efficient built-in repair analysis (BIRA) approach to improve the yield of embedded memory. As embedded memories become more and more dominant in system-on-chip (SoC) design, it is crucial to achieve sufficiently high embedded memory yield. Due to the increasing number of diversified embedded memories on chip, external memory testing and redundancy repair analysis have become inadequate, and the use of BIRA has become more attractive and even indispensable. All prior work on BIRA assumed that defects can only be repaired by redundant rows or columns. Motivated by the fact that most embedded memories use an error correction code (ECC) to uniformly protect all memory words from soft errors, we propose to appropriately leverage the existing on-chip error correction circuit to enable very low-cost built-in repair analysis implementations while maintaining the same or even a higher defect repair rate and the same soft error tolerance.

Second, we propose a defect-tolerant L2 cache memory that uses multi-bit error correction codes. Potential faults in SRAM can be parametric/catastrophic defects or transient soft errors, both of which are becoming increasingly serious as the technology feature size shrinks. In conventional design practice, memory defects are handled by using spare (or redundant) rows, columns, and/or words to replace the defective ones, while soft errors are compensated by single-error-correcting codes. As technology continues to scale down, the traditional repair-only defect tolerance strategy may no longer be sufficient to ensure high enough yield. Although strong multi-bit ECCs appear to be a natural choice for improving reliability, it is commonly believed that multi-bit ECCs incur prohibitive performance degradation and silicon/energy cost in cache memory. This work concerns the feasibility and potential of using multi-bit ECC to tolerate a large number of random defects in the L2 cache without loss of soft error tolerance. It does not intend to develop any new multi-bit ECC; instead, we focus on how to enable the effective use of multi-bit ECC in the L2 cache. Since cache blocks containing one or more defective cells can be identified during memory testing, it is intuitive that a better choice is to apply multi-bit ECC only to the cache blocks that need it, instead of uniformly protecting all cache blocks. Such selective use of multi-bit ECC can largely alleviate the impact on overall cache performance and area overhead. Implementation of this selective scheme must perform content-addressable memory (CAM) based run-time table lookup to check whether the cache block being accessed should be protected by multi-bit ECC. However, although a direct CAM-based realization is quite straightforward, its effectiveness may be inadequate in the presence of a relatively high random defect density, for two main reasons: (i) as the random defect density increases, a larger percentage of cache read operations may invoke multi-bit ECC decoding, which directly degrades overall system performance metrics such as IPC; (ii) since the energy consumption of a CAM is much larger than that of normal SRAM, and the size of the CAM grows with the random defect density, a significant energy overhead is incurred. By supplementing a conventional L2 cache core with several special-purpose small caches/buffers, we can greatly reduce the silicon cost and minimize the probability of explicitly executing multi-bit ECC decoding on the cache read critical path. Moreover, the proposed L2 cache design maintains the same level of soft error tolerance.

Three-dimensional (3D) integration is emerging as an attractive technology for microprocessor design and provides a viable and promising option to address the well-known memory wall problem in high-performance computing systems. Based on 3D integration technology, we develop a 3D DRAM design that applies a coarse-grained 3D partitioning strategy, which introduces far fewer through-silicon vias (TSVs) and less stringent constraints on TSV pitch compared with prior work. The key is to share the global routing of the memory address and data buses among all DRAM dies through a small number of coarse-grained TSVs. We also investigate the potential of using 3D DRAM to implement both the L2 cache and the main memory in 3D multi-core processor-DRAM integrated computing systems. In contrast to the common impression that DRAM is much slower than SRAM, using a modified CACTI tool we show that a 3D DRAM L2 cache may achieve comparable or even faster speed than a 2D SRAM L2 cache. By employing these design techniques, the proposed 3D DRAM design can effectively reduce access latency and hence improve overall 3D integrated computing system performance.

3D microarchitecture offers another interesting advantage: circuits on different dies may exhibit heterogeneous soft error vulnerabilities due to the shielding effect of die stacking. Recent research characterized microarchitecture soft error vulnerabilities across 3D-stacked dies and concluded that the inner dies can be shielded from particle strikes by the outer dies. Motivated by this shielding effect, we propose a soft error resilient 3D cache architecture. The underlying idea is to eliminate the error correction circuits on the soft error invulnerable dies (SIDs), since the inner dies may be inherently soft error invulnerable, being implicitly protected by the outer dies from particle strikes. As a result, data accesses on the soft error invulnerable dies incur much lower latency and energy dissipation. Moreover, we develop techniques that enable dynamic data block movement in the cache, which effectively maximizes data accesses on the soft error invulnerable dies. For the L1 cache, we propose an SID direct-mapping cache architecture to maximize accesses on the SIDs while avoiding the energy wasted on useless data accesses. For lower-level caches, we propose to decouple the tag entry from the data block to compensate for the relatively poor locality of lower-level caches. The overall cache hierarchy achieves a significant improvement in performance and energy efficiency.

Finally, we analyze the potential benefits of 3D-stacked video processing circuits in terms of performance and energy consumption. Memory bandwidth has become the primary bottleneck of advanced video coding and display processing systems, and the bandwidth deficiency in video processors may become even worse as more sophisticated algorithms are adopted to further improve quality. We show that 3D integration will have a significant impact on memory-intensive video processing, given the massive logic-memory interconnect bandwidth enabled by die stacking. To quantitatively demonstrate these advantages, we further develop a 3D integrated motion estimation accelerator that can be integrated into a multimedia multi-core processor. We develop a 3D integrated DRAM memory organization and image frame storage strategy geared to motion estimation, and apply a fully parallel 2D motion estimation computation engine that takes advantage of the 3D-stacked DRAM to minimize energy consumption. The proposed approach seamlessly supports various motion estimation algorithms, including the variable block-size motion estimation (VBSME) adopted in H.264/AVC. We present a case study on multi-frame motion estimation, applying the proposed accelerator design together with DRAM performance modeling and ASIC design to demonstrate its energy efficiency.

Focusing on the design requirements of the memory hierarchy and new advances in semiconductor technology, this thesis proposes several efficient architectural solutions to critical problems in computer memory systems. All the architectures and approaches proposed in this thesis are extensively evaluated using system-level and/or circuit-level simulation tools. The electrical properties of the memory circuit designs are estimated with circuit design and simulation tools, while single-core and multi-core microprocessor simulators are used to extensively evaluate the computational capability and energy consumption of the proposed architectures.

Key words: Memory hierarchy; Reliability; Defect tolerance; 3D integration