From February 13 to 16, 2007, at the USENIX FAST '07 conference on file and storage technologies held in San Jose, California, several papers challenged the MTBF figures published by hard-disk manufacturers. The two most influential were Disk failures in the real world: What does an MTTF of 1,000,000 hours mean to you? by Bianca Schroeder and Garth Gibson of Carnegie Mellon University, and Failure Trends in a Large Disk Drive Population from a Google research team. Both papers reported the same finding: real-world disk failure rates are many times higher than the manufacturers' figures, and the MTBF numbers quoted by disk makers simply do not stand up to the test of time.
As a popular-science writer, I thought this was an excellent topic. So, together with Ms. Dong Xin of the University of Malaysia, I wrote the article 《我们被硬盘制造商耍了》 ("We've Been Played by the Hard-Disk Manufacturers"), which analyzes the problem in depth.
The provocative blog post below was the first piece on this topic to catch my eye, and it was what sparked my interest in writing about it, so I am reprinting it here in full as a way of thanking Mr. Robin Harris.
By Robin Harris on Wed, 02/21/2007
Filed under: Storage · Robin Harris's blog
Two bombshell papers released at the Usenix FAST '07 (File And Storage Technology) conference this week bring a welcome dose of reality to the basic building block of storage: the disk drive.
Together the two papers are 29 pages of dense computer science with lots of info on populations, statistical analysis, and related arcana. I recommend both papers. The following summary, and two longer analyses at StorageMojo, cover what I found most interesting.
The first conference paper, from researchers at Google, Failure Trends in a Large Disk Drive Population (pdf) looks at a 100,000-drive population of Google PATA and SATA drives. Remember that these drives are in professionally managed, Class A data centers, and once powered on, are almost never powered down. So conditions should be nearly ideal for maximum drive life.
The most interesting results came in five areas:
· The validity of manufacturer's MTBF specs
· The usefulness of SMART statistics
· Workload and drive life
· Age and drive failure
· Temperature and drive failure
MTBF
Google found that Annual Failure Rates were quite a bit higher than vendor MTBF specs suggest. For a 300,000-hour MTBF, one would expect an AFR of 1.46%, but the best the Googlers observed was 1.7% in the first year, rising to over 8.6% in the third year.
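For readers who want to check these numbers, the nominal relationship is simple: AFR is roughly the annual power-on hours divided by the MTBF. The sketch below is my own illustration, not from either paper, and the duty-cycle assumptions in it are mine; it shows where a figure like 1.46% for a 300,000-hour spec comes from if the drive is assumed to be powered on only about half the time, versus roughly 2.92% for 24x7 operation.
```python
# Nominal AFR implied by a vendor MTBF spec: AFR ~= power-on hours per year / MTBF.
# The duty-cycle assumptions below are illustrative, not taken from either paper.
HOURS_PER_YEAR = 8760

def nominal_afr(mtbf_hours: float, power_on_hours: float = HOURS_PER_YEAR) -> float:
    """Annual failure rate implied by an MTBF spec at a given yearly power-on time."""
    return power_on_hours / mtbf_hours

for mtbf in (300_000, 1_000_000, 1_200_000):
    always_on = nominal_afr(mtbf)                       # 24x7 data-center operation
    half_duty = nominal_afr(mtbf, HOURS_PER_YEAR / 2)   # ~50% duty cycle
    print(f"MTBF {mtbf:>9,} h -> AFR {always_on:.2%} (24x7) / {half_duty:.2%} (50% duty)")
```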
SMART: not very smart
SMART (Self-Monitoring, Analysis, and Reporting Technology) is supposed to capture drive error data to predict failure. The authors found that several SMART errors were strong predictors of ensuing failure:
· scan errors
· reallocation count
· offline reallocation
· probational count
For example, after the first scan error, they found a drive was 39 times more likely to fail in the next 60 days. The other three correlations are less striking, but still significant. The problem: even these four predictors miss over 50% of drive failures. If you get one of these errors, replace your drive, but not getting one doesn't mean you are safe. SMART is simply not reliable.
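If you want to watch these counters on your own machines, here is a minimal sketch, mine rather than Google's, that shells out to smartmontools' smartctl and flags nonzero raw values on SMART attributes that roughly correspond to the predictors above. The attribute-ID mapping is an approximation (the paper's "scan errors" have no single standard attribute), and the /dev/sda device path is an assumption for illustration.
```python
# Minimal sketch (not from the Google paper): flag a drive if any SMART counter that
# roughly maps to the predictors above has a nonzero raw value.
# Assumes smartmontools is installed and the script can read SMART data (root).
import re
import subprocess

# Approximate mapping from the paper's terms to conventional SMART attribute IDs.
WATCHED = {
    5:   "Reallocated_Sector_Ct",    # ~ reallocation count
    196: "Reallocated_Event_Count",  # ~ offline reallocation (vendor-dependent)
    197: "Current_Pending_Sector",   # ~ probational count
    198: "Offline_Uncorrectable",    # ~ sectors failing the background surface scan
}

def suspicious_attributes(device: str = "/dev/sda") -> dict:
    """Return the watched SMART attributes of `device` whose raw value is nonzero."""
    out = subprocess.run(["smartctl", "-A", device],
                         capture_output=True, text=True, check=False).stdout
    flagged = {}
    for line in out.splitlines():
        m = re.match(r"\s*(\d+)\s+\S+\s+.*\s(\d+)\s*$", line)
        if m and int(m.group(1)) in WATCHED and int(m.group(2)) > 0:
            flagged[WATCHED[int(m.group(1))]] = int(m.group(2))
    return flagged

if __name__ == "__main__":
    errors = suspicious_attributes()
    if errors:
        print("Consider replacing this drive soon:", errors)
    else:
        print("No watched errors reported (which, per the paper, is still no guarantee).")
```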
Workload and drive life
Defining workload isn't easy, but the good news is that the Googlers didn't find much of a correlation.
After the first year, the AFR of high utilization drives is at most moderately higher than that of low utilization drives. The three-year group in fact appears to have the opposite of the expected behavior, with low utilization drives having slightly higher failure rates than high utilization ones.
They did find infant mortality was higher among high-utilization drives. So burn those babies in!
Age and drive failure
The authors note that their data doesn't really answer this question due to the mix of drive types and vendors. Nonetheless their drive population does show AFR increases with age.
Hot drives = dead drives?
Possibly the biggest surprise in the Google study is that failure rates do not increase when the average temperature increases. At very high temperatures there is a negative effect, but even that is slight. This might mean cooling costs could be significantly reduced at data centers.
Beyond Google
Google's paper wasn't the only cool storage paper, or even the best: the paper by Bianca Schroeder and Garth Gibson of CMU's Parallel Data Lab, Disk failures in the real world: What does an MTTF of 1,000,000 hours mean to you?, won a "Best Paper" award.
They looked at 100,000 drives, including HPC clusters at Los Alamos and the Pittsburgh Supercomputing Center, as well as several unnamed Internet service providers. The drives had different workloads, different definitions of "failure" and different levels of data collection, so the data isn't quite as smooth or complete as Google's. Yet it probably looks more like a typical enterprise data center, IMHO. She also included "enterprise" drives in her sample.
Key observations from the CMU paper:
High-end "enterprise" drives versus "consumer" drives?
. . . we observe little difference in replacement rates between SCSI, FC and SATA drives, . . . ."
So how much of that 1,000,000 hour MTBF are you actually getting?
Infant mortality?
. . . failure rate is not constant with age, and that, rather than a significant infant mortality effect, we see a significant early onset of wear-out degradation.
The infant mortality effect is slightly different from what Google reported. Both agree on the more important issue of early wear-out.
Vendor MTBF reliability?
While the datasheet AFRs are between 0.58% and 0.88%, the observed ARRs [Annual Replacement Rate] range from 0.5% to as high as 13.5%. . . . up to a factor of 15 higher than datasheet AFRs. Most commonly, the observed ARR values are in the 3% range.
Actual MTBFs?
The weighted average ARR was 3.4 times larger than 0.88%, corresponding to a datasheet MTTF of 1,000,000 hours."
In other words, that 1 million hour MTBF is really about 300,000 hours - about what consumer drives are spec'd at.
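For anyone who wants the arithmetic spelled out, here is the back-of-the-envelope version, using only the numbers quoted above (my restatement, not code from the paper):
```python
# Back-of-the-envelope restatement of the CMU numbers quoted above.
HOURS_PER_YEAR = 8760

datasheet_mttf = 1_000_000                        # hours, vendor spec
datasheet_afr = HOURS_PER_YEAR / datasheet_mttf   # ~0.88% per year
observed_arr = 3.4 * datasheet_afr                # weighted-average ARR from the paper
implied_mttf = HOURS_PER_YEAR / observed_arr      # field MTTF consistent with that ARR

print(f"datasheet AFR: {datasheet_afr:.2%}")        # ~0.88%
print(f"observed ARR : {observed_arr:.2%}")         # ~3.0%
print(f"implied MTTF : {implied_mttf:,.0f} hours")  # roughly 300,000 hours
```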
Drive reliability after burn-in?
Contrary to common and proposed models, hard drive replacement rates do not enter steady state after the first year of operation. Instead replacement rates seem to steadily increase over time.
Drives are mechanical devices and wear out like machines do, not like electronics.
Data safety under RAID 5?
The assumption of data safety behind RAID 5 is that drive failures are independent so that the likelihood of two drive failures in a single RAID 5 LUN is vanishingly low. The authors found that this assumption is incorrect.
. . . the probability of seeing two drives in the cluster fail within one hour is four times larger under the real data . . . .
In fact, they found that a disk replacement made another disk replacement much more likely.
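To put a number on what independence would predict, here is a minimal sketch with made-up inputs (the array size and the per-drive ARR are illustrative assumptions, not data from the paper): it computes the chance that a second drive in the same LUN fails within an hour of the first, assuming independent exponential failure times, and then applies the roughly 4x factor the CMU data showed.
```python
import math

# Illustrative assumptions, not figures from the paper.
drives_in_lun = 8            # one RAID 5 LUN
annual_failure_rate = 0.03   # ~3% per drive per year, in line with the field ARRs above
lam = annual_failure_rate / 8760   # per-drive failure rate per hour

# Independence assumption: after one drive fails, the chance that any of the
# remaining drives in the LUN fails within the next hour.
p_independent = 1 - math.exp(-(drives_in_lun - 1) * lam * 1.0)

# The CMU data suggests the real-world probability is roughly four times higher.
p_field = 4 * p_independent

print(f"independent model: {p_independent:.5%} chance of a second failure within 1 hour")
print(f"with the ~4x factor observed in the field: {p_field:.5%}")
```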
Independence of drive failures in an array?
The distribution of time between disk replacements exhibits decreasing hazard rates, that is, the expected remaining time until the next disk was replaced grows with the time it has been since the last disk replacement.
Translation: one array drive failure means a much higher likelihood of another drive failure. The longer since the last failure, the longer to the next failure. Magic!
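A decreasing hazard rate is easier to believe with a concrete model in front of you. The sketch below is my own illustration, not the distribution fitted in the paper: a Weibull distribution with shape parameter below 1 has exactly this property, so the instantaneous replacement rate keeps falling the longer it has been since the last replacement.
```python
# Illustration of a decreasing hazard rate using a Weibull model (my choice of
# model and parameters, not the paper's fit).
k = 0.7          # shape < 1  => decreasing hazard
scale = 1000.0   # characteristic time between replacements, in hours

def hazard(t: float) -> float:
    """Instantaneous replacement rate at time t since the last replacement."""
    return (k / scale) * (t / scale) ** (k - 1)

for t in (1, 10, 100, 1000):
    print(f"{t:>5} h since the last replacement: hazard = {hazard(t):.6f} per hour")
```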
Let the dialogue begin!
The importance of these papers is that they present real-world results from large drive populations. Vendors have kept drive-reliability data to themselves for what now seem obvious reasons: they've been inflating their numbers. With good field numbers coming out, smart storage and systems folks can start designing for the real world. It's about time.
ash_riple_768180695 2007-3-15 13:19
Thank you, avan. Following the leads you gave, I found the FAST website and a lot of the latest research on file systems and storage there. It has been a huge help for my work.
Many thanks!