Disk Annualized Failure Rate (AFR)
Oracle Database Tips by Donald Burleson
Doug Burns noted a very interesting study from Google relating to disk
failures.
Unlike the traditional measures of Mean Time Between Failures (MTBF) and Mean
Time to Failure (MTTF), this study uses the Annualized Failure Rate (AFR). It
also attempts to validate the predictive value of SMART (Self-Monitoring,
Analysis and Reporting Technology) for predicting disk failure.
Interestingly, SMART is similar to
proprietary predictive models for Oracle failures, using scientific
correlations to warn of failures before they occur.
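To make the AFR measure concrete, here is a minimal Python sketch. It assumes AFR is computed as observed failures divided by drive-years of service, and it assumes a constant failure rate when converting a quoted MTBF into a one-year failure probability. The study's own bookkeeping may differ, and the function names and sample figures are invented for illustration.

```python
import math

HOURS_PER_YEAR = 8766.0  # 365.25 days * 24 hours
DAYS_PER_YEAR = 365.25


def afr_from_counts(failures: int, drive_days: float) -> float:
    """Failures per drive-year of observed service (assumed AFR definition)."""
    if drive_days <= 0:
        raise ValueError("drive_days must be positive")
    drive_years = drive_days / DAYS_PER_YEAR
    return failures / drive_years


def afr_from_mtbf(mtbf_hours: float) -> float:
    """Chance of failing within one year, assuming a constant failure rate."""
    if mtbf_hours <= 0:
        raise ValueError("mtbf_hours must be positive")
    return 1.0 - math.exp(-HOURS_PER_YEAR / mtbf_hours)


# Made-up fleet: 1,000 drives observed for one year each, 20 failures.
observed = afr_from_counts(failures=20, drive_days=1000 * DAYS_PER_YEAR)

# A datasheet MTBF of 1,000,000 hours implies a little under 0.9% per year.
implied = afr_from_mtbf(1_000_000)
```

Comparing an observed rate with the rate implied by a vendor MTBF figure is one way to see why a field study reports AFR directly rather than relying on datasheet numbers.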
The study claims to
be one of the largest and most comprehensive studies of disk drives, and it
highlights the importance of redundancy in disk technology. The paper concludes:
- Heat does not matter - Higher
temperatures were not correlated with higher disk failure rates.
- Early warnings count for predicting
disk failure - Checking the syslogs for sporadic I/O errors has high
predictive value: "After their first scan error, drives are 39 times
more likely to fail within 60 days than drives with no such errors."
- SMART is not predictive - The study
noted that SMART data on its own did not provide statistically
significant correlations for predicting failure. However, some SMART
values have more predictive value than others:
"Some SMART parameters (scan
errors, reallocation counts, offline reallocation counts, and
probational counts) have a large impact on failure probability.
Given the lack of occurrence of predictive SMART signals on a large
fraction of failed drives, it is unlikely that an accurate
predictive failure model can be built based on these signals alone."
- Infant mortality - The study
suggests that disks show a form of infant mortality: "It is interesting
to note that our 3-month, 6-months and 1-year data points do seem to
indicate a noticeable influence of infant mortality phenomena, with
1-year AFR dropping significantly from the AFR observed in the first [...]"
[Figure: Google study - disk failure rate and disk age]
- Disk utilization factor -
The study showed that high utilization is clearly a failure factor
for young disks, which seems similar to the old "burn-in" tests
on motherboards. While we might expect high-utilization disks
to have a higher failure rate at every age, the study noted that
3-year-old disks had a higher failure rate at low utilization.
[Figure: Google study - failure rate as a function of ...]
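The early-warning findings above can be turned into a simple check. The following Python sketch flags a drive when any of the four SMART counters the study singles out is non-zero, and computes how much more often flagged drives fail than clean ones (the kind of ratio behind the "39 times more likely" figure). The counter key names, the function names, and the sample numbers are hypothetical and are not real SMART attribute names or study data.

```python
from typing import Dict, List

# The four counters named in the study's quote. These keys are
# hypothetical labels for this sketch, not real tool output fields.
WARNING_COUNTERS = (
    "scan_errors",
    "reallocation_count",
    "offline_reallocation_count",
    "probational_count",
)


def warning_signals(counters: Dict[str, int]) -> List[str]:
    """Return the watched counters that are non-zero for one drive."""
    return [name for name in WARNING_COUNTERS if counters.get(name, 0) > 0]


def relative_risk(failed_flagged: int, total_flagged: int,
                  failed_clean: int, total_clean: int) -> float:
    """Failure rate of flagged drives divided by that of clean drives."""
    if total_flagged <= 0 or total_clean <= 0 or failed_clean <= 0:
        raise ValueError("need non-empty groups and at least one clean failure")
    # Cross-multiply so only one floating-point division is performed.
    return (failed_flagged * total_clean) / (failed_clean * total_flagged)


# Invented drive with a single scan error recorded.
signals = warning_signals({"scan_errors": 1, "reallocation_count": 0})

# Invented fleet: 30 of 100 flagged drives failed, 10 of 1,000 clean ones.
ratio = relative_risk(30, 100, 10, 1000)  # 30.0
```

An empty result from `warning_signals` should not be read as a clean bill of health: as the study's quote points out, a large fraction of failed drives showed no predictive SMART signal at all, which is why redundancy still matters.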