Original: Everything you know about disks is wrong

2007-3-10 22:08 | Category: Engineer's Workplace

From February 13 to 16, 2007, at the USENIX FAST '07 File and Storage Technologies conference in San Jose, California, several papers challenged the MTBF figures published by hard drive manufacturers. The two most influential were Disk failures in the real world: What does an MTTF of 1,000,000 hours mean to you? by Bianca Schroeder and Garth Gibson of Carnegie Mellon University, and Failure Trends in a Large Disk Drive Population by a Google research team. Both papers reported the same finding: real-world drive failure rates are many times higher than the manufacturers' numbers, and the MTBF specifications that drive makers publish simply do not stand up to the test of time.


As a popular-science writer, I found this an excellent topic. Together with Ms. Dong Xin of the University of Malaysia, I therefore wrote the article 《我们被硬盘制造商耍了》 (We've Been Played by the Hard Drive Manufacturers), which analyzes the issue in depth.


The provocative blog post below was the first to catch my eye, and it was this post that sparked my interest in writing about the topic. I am therefore reprinting it here in full, as a way of thanking Mr. Robin Harris.


 


Everything you know about disks is wrong


By Robin Harris on Wed, 02/21/2007


Filed under: Storage | Robin Harris's blog


 

Two bombshell papers released at the Usenix FAST '07 (File And Storage Technology) conference this week bring a welcome dose of reality to the basic building block of storage: the disk drive.


 


Together the two papers are 29 pages of dense computer science with lots of info on populations, statistical analysis, and related arcana. I recommend both papers. The following summary, and two longer analyses at StorageMojo, cover what I found most interesting.


 


The first conference paper, from researchers at Google, Failure Trends in a Large Disk Drive Population (pdf) looks at a 100,000-drive population of Google PATA and SATA drives. Remember that these drives are in professionally managed, Class A data centers, and once powered on, are almost never powered down. So conditions should be nearly ideal for maximum drive life.


 


The most interesting results came in five areas:


· The validity of manufacturer's MTBF specs
· The usefulness of SMART statistics
· Workload and drive life
· Age and drive failure
· Temperature and drive failure


 


MTBF


 


Google found that Annual Failure Rates were quite a bit higher than vendor MTBF specs suggest. For a 300,000-hour MTBF, one would expect an AFR of 1.46%, but the best the Googlers observed was 1.7% in the first year, rising to over 8.6% in the third year.
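
Harris doesn't show the arithmetic, so here is a minimal sketch (my own, not from either paper) of the usual nominal conversion, AFR ≈ powered-on hours per year / MTBF. Note that 1.46% only falls out of a 300,000-hour MTBF if one assumes roughly 4,380 powered-on hours per year; for drives running 24/7, as Google's do, the nominal figure would be about 2.92%.

```python
# Sketch of the nominal MTBF-to-AFR conversion (assumptions are mine,
# not figures from the Google or CMU papers).
HOURS_PER_YEAR = 8760  # 24/7 operation, as in a data center

def nominal_afr(mtbf_hours, powered_on_hours_per_year=HOURS_PER_YEAR):
    """Annualized failure rate implied by a datasheet MTBF figure."""
    return powered_on_hours_per_year / mtbf_hours

print(f"{nominal_afr(300_000):.2%}")         # ~2.92% if powered on all year
print(f"{nominal_afr(300_000, 4_380):.2%}")  # ~1.46% under a 50% duty-cycle assumption
print(f"{nominal_afr(1_000_000):.2%}")       # ~0.88%, the 'enterprise' datasheet figure
```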


 


SMART: not very smart


 


SMART (Self-Monitoring, Analysis, and Reporting Technology) is supposed to capture drive error data to predict failure. The authors found that several SMART errors were strong predictors of ensuing failure:


· scan errors
· reallocation count
· offline reallocation
· probational count


 


For example, after the first scan error, they found a drive was 39 times more likely to fail in the next 60 days. The other three correlations are less striking, but still significant. The problem: even these four predictors miss over 50% of drive failures. If you get one of these errors, replace your drive, but not getting one doesn't mean you are safe. SMART is simply not reliable.
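
To make the "replace on the first scan error" advice concrete, here is a minimal, hypothetical policy sketch built on the four SMART signals the Google paper highlights. The attribute names and the any-nonzero threshold are my illustrative assumptions, not values from the paper; on a real system the raw counters would come from a tool such as smartctl.

```python
# Hypothetical replacement policy based on the four SMART signals Google
# found predictive. Thresholds are illustrative assumptions, not the paper's.
PREDICTIVE_SIGNALS = (
    "scan_errors",            # surface scan errors
    "reallocated_sectors",    # reallocation count
    "offline_reallocations",  # offline reallocation
    "pending_sectors",        # "probational" (current pending sector) count
)

def should_replace(smart_counters: dict) -> bool:
    """Flag the drive once any predictive counter goes non-zero.

    Caveat from the paper: even together, these signals miss more than
    half of failures, so a clean report is not a guarantee of safety.
    """
    return any(smart_counters.get(name, 0) > 0 for name in PREDICTIVE_SIGNALS)

# A drive that just logged its first scan error would be flagged:
print(should_replace({"scan_errors": 1, "reallocated_sectors": 0}))  # True
```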


 


Workload and drive life


 


Defining workload isn't easy, but the good news is that the Googlers didn't find much of a correlation.


 


"After the first year, the AFR of high utilization drives is at most moderately higher than that of low utilization drives. The three-year group in fact appears to have the opposite of the expected behavior, with low utilization drives having slightly higher failure rates than high utilization ones."


 


They did find infant mortality was higher among high-utilization drives. So burn those babies in!


 


Age and drive failure


 


The authors note that their data doesn't really answer this question due to the mix of drive types and vendors. Nonetheless their drive population does show AFR increases with age.


 


Hot drives = dead drives?


 


 Possibly the biggest surprise in the Google study is that failure rates do not increase when the average temperature increases. At very high temperatures there is a negative effect, but even that is slight. This might mean cooling costs could be significantly reduced at data centers.


 


Beyond Google

Google's paper wasn't the only cool storage paper, or even the best: Bianca Schroeder and Garth Gibson of CMU's Parallel Data Lab won a "Best Paper" award for Disk failures in the real world: What does an MTTF of 1,000,000 hours mean to you?


 


They looked at 100,000 drives, including HPC clusters at Los Alamos and the Pittsburgh Supercomputer Center, as well as several unnamed Internet service providers. The drives had different workloads, different definitions of "failure", and different levels of data collection, so the data isn't quite as smooth or complete as Google's. Yet it probably looks more like a typical enterprise data center, IMHO. Also, she included "enterprise" drives in her sample.


 


Key observations from the CMU paper:

High-end "enterprise" drives versus "consumer" drives?


 

". . . we observe little difference in replacement rates between SCSI, FC and SATA drives, . . . ."


So how much of that 1,000,000 hour MTBF are you actually getting?


 


Infant mortality?


 

". . . failure rate is not constant with age, and that, rather than a significant infant mortality effect, we see a significant early onset of wear-out degradation."


 


The infant mortality effect is slightly different from what Google reported. Both agree on the more important issue of early wear-out.

Vendor MTBF reliability?


 


"While the datasheet AFRs are between 0.58% and 0.88%, the observed ARRs [Average Replacement Rate] range from 0.5% to as high as 13.5%. . . . up to a factor of 15 higher than datasheet AFRs. Most commonly, the observed ARR values are in the 3% range."


 


Actual MTBFs?


 

"The weighted average ARR was 3.4 times larger than 0.88%, corresponding to a datasheet MTTF of 1,000,000 hours."


 


In other words, that 1 million hour MTBF is really about 300,000 hours - about what consumer drives are spec'd at.
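
The 300,000-hour figure follows directly from the quoted numbers; here is the back-of-the-envelope arithmetic (my own, assuming 8,760 powered-on hours per year):

```python
# Rough check of the "1M-hour MTTF is really ~300k hours" claim.
HOURS_PER_YEAR = 8760

datasheet_afr = HOURS_PER_YEAR / 1_000_000    # ~0.88% for a 1,000,000-hour MTTF spec
observed_arr = 3.4 * datasheet_afr            # CMU's weighted average ARR, ~3.0%
implied_mttf = HOURS_PER_YEAR / observed_arr  # ~294,000 hours

print(f"datasheet AFR ~ {datasheet_afr:.2%}, observed ARR ~ {observed_arr:.2%}")
print(f"implied MTTF ~ {implied_mttf:,.0f} hours")
```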


 


Drive reliability after burn-in?


 

"Contrary to common and proposed models, hard drive replacement rates do not enter steady state after the first year of operation. Instead replacement rates seem to steadily increase over time."


Drives are mechanical devices and wear out like machines do, not like electronics.
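
One way to state the distinction formally (a standard reliability-theory framing, not something taken from the paper): a single MTBF number implicitly assumes an exponential lifetime with a constant hazard rate, whereas mechanical wear-out corresponds to a hazard that rises with age, for example a Weibull lifetime with shape parameter greater than 1.

```latex
% Constant-failure-rate model implied by a single MTBF number (no wear-out):
\[ h(t) = \lambda = \frac{1}{\mathrm{MTBF}} \]
% Wear-out model, e.g. a Weibull lifetime with shape $k > 1$ (hazard grows with age):
\[ h(t) = \frac{k}{\eta}\left(\frac{t}{\eta}\right)^{k-1}, \qquad k > 1 \]
```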


 


Data safety under RAID 5?


 


The assumption of data safety behind RAID 5 is that drive failures are independent so that the likelihood of two drive failures in a single RAID 5 LUN is vanishingly low. The authors found that this assumption is incorrect.


 


". . . the probability of seeing two drives in the cluster fail within one hour is four times larger under the real data . . . ."


In fact, they found that a disk replacement made another disk replacement much more likely.
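
A minimal illustration of why this matters for RAID 5 (my own sketch with made-up numbers, not figures from the paper): under the independence assumption, the chance of losing a second drive during a rebuild is just the per-drive hazard spread over the surviving drives; if, as the CMU data suggests for short windows, correlated failures are roughly four times more likely, the double-failure risk scales up by about the same factor.

```python
# Illustrative only: the AFR, group size, rebuild window, and the 4x factor
# applied here are assumptions for the sketch, not results from the paper.
afr = 0.03             # assumed 3% annual failure rate per drive
surviving_drives = 7   # e.g. an 8-drive RAID 5 group after the first failure
rebuild_hours = 24     # assumed rebuild window

p_one = afr * rebuild_hours / 8760                  # per-drive chance within the window
p_double_independent = 1 - (1 - p_one) ** surviving_drives
p_double_correlated = 4 * p_double_independent      # crude scaling by CMU's ~4x

print(f"independent model: {p_double_independent:.4%}")
print(f"with a ~4x correlation penalty: {p_double_correlated:.4%}")
```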


 


Independence of drive failures in an array?


 

"The distribution of time between disk replacements exhibits decreasing hazard rates, that is, the expected remaining time until the next disk was replaced grows with the time it has been since the last disk replacement."


Translation: one array drive failure means a much higher likelihood of another drive failure. The longer since the last failure, the longer to the next failure. Magic!
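
In probability terms (my gloss, not the authors' wording): the exponential model behind a constant MTBF is memoryless, so the expected wait to the next replacement never changes; a decreasing hazard rate means that conditional expectation grows the longer the array has been quiet, which is exactly what clustered, correlated failures look like.

```latex
% Memoryless (exponential) model: the expected remaining wait is constant,
\[ \mathbb{E}[T - t \mid T > t] = \frac{1}{\lambda} \quad \text{for all } t. \]
% With a decreasing hazard rate $h(t)$, the mean residual life increases with $t$:
\[ t_1 < t_2 \;\Rightarrow\; \mathbb{E}[T - t_1 \mid T > t_1] \le \mathbb{E}[T - t_2 \mid T > t_2], \]
% so a recent replacement predicts another one soon, and a long quiet spell
% predicts an even longer wait -- the clustering the CMU data shows.
```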


 


Let the dialogue begin!


 


 The importance of these papers is that they present real-world results from large drive populations. Vendors have kept drive-reliability data to themselves for what now seem obvious reasons: they've been inflating their numbers. With good field numbers coming out, smart storage and systems folks can start designing for the real world. It's about time.


Comments


User 143084  2008-4-18 12:31

I support Mr. "探长". This student thinks the likes of SystemARM are just bugs playing petty tricks~~~ and what they eat away at is more than just 21IC's reputation~~~ Being "condemned by all" is the insecticide for bugs of this kind~~~

程序匠人 (the Program Craftsman)  2008-4-17 11:32

avan: Hello! First, my sympathies for what happened to you. The Craftsman has run into this sort of thing many times as well, never found a good countermeasure, and simply learned to take it calmly. Later I hit on one trick: in my own articles I always refer to myself as "匠人" (the Craftsman) rather than "I" or "me", which stamps a distinctive mark on everything I write. That way, no matter who reprints it, even without citing the source or the author, anyone with a discerning eye can tell at a glance that it is the Craftsman's work, heh. As for your suspicion that the commenter in question is being paid by 21ic: I can tell you responsibly that 21ic does not pay its blog users and would never instruct them to do such things. I believe people reprint your articles mostly because they like them, not for money. If your rights have been infringed, it is better to settle the matter through communication rather than turning it into a dispute with the site. The Craftsman very much enjoys your articles, and I hope this kind of thing will not dampen your enthusiasm for blogging.

User 31132  2008-4-16 09:52

Down with unauthorized copying. +1

User 854823  2008-4-15 22:33

Yep, support original work; reposts should credit the source.

ash_riple_768180695  2007-3-15 13:19

Thanks, brother avan. Following the leads you gave, I found the FAST website and came across a great deal of the latest research on file systems and storage. It has been a huge help for my work.

Thank you very much!
