
HDD metrics and why mean time to failure is not terribly useful


In this podcast, we speak to Rainer Kaese, senior manager of business development for hard disk drives at Toshiba Electronics Europe, about hard disk drive (HDD) metrics.

In particular, he takes apart mean time to failure (MTTF) and shows why it is not a very useful measure. Instead, he suggests annualised failure rate (AFR) as more useful, and shows why with reference to human lifespans.

He also talks about why mean time between failures is not applicable to hard drives, and why enterprise storage systems need enterprise drives.

What is mean time to failure (MTTF)?

MTTF is a statistical measure of how long it will take a hard disk drive to fail.

But it's not a very useful metric. Let's say a typical enterprise hard disk drive MTTF is 2.5 million hours. That would suggest it takes 2.5 million hours until a drive fails. But 2.5 million hours, if you do the maths, is 285 years.

That's not the correct interpretation of that value, and there's a lot of misunderstanding, so I want to clarify it here and get back to a more useful value.



This MTTF of 2.5 million hours can be converted into an annualised failure rate, and this annualised failure rate for enterprise drives is 0.35%. This is a more useful value because it means that 0.35% of the HDDs you are running may fail within a year.

Let's say you have a datacentre with 1,000 drives. [That means] 0.35% or 3.5 drives per year may fail. That would be within the reliability specification. So, you would have to budget for three to four replacements, and you can expect three to four failures per year.
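The datacentre arithmetic can be sketched in a few lines of Python. This is only an illustration of the figures quoted above; the function names are hypothetical, and the AFR is treated as a simple per-drive, per-year failure fraction.

```python
import math
from typing import Tuple


def expected_annual_failures(drive_count: int, afr: float) -> float:
    """Expected number of drive failures per year for a fleet.

    afr is the annualised failure rate as a fraction (0.35% -> 0.0035).
    """
    return drive_count * afr


def replacement_budget(drive_count: int, afr: float) -> Tuple[int, int]:
    """Round the expectation down and up to get a whole-drive budget range."""
    expected = expected_annual_failures(drive_count, afr)
    return math.floor(expected), math.ceil(expected)


# Figures from the interview: 1,000 enterprise drives at an AFR of 0.35%.
expected = expected_annual_failures(1000, 0.0035)
low, high = replacement_budget(1000, 0.0035)
print(f"Expected failures per year: {expected:.1f}")
print(f"Budget for {low} to {high} replacements")
```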

That means hard disk drives are quite reliable, with only three to four failures per year. Of course, this all assumes you operate the hard disk drives within the agreed specification. That means 24×7 operation all year with a temperature of less than 42°C on average and a workload of less than 550TB [terabytes] per year, and also only within a warranty period of five years.

From this 0.35%, if you divide the number of hours per year, which is 8,760, by this AFR, you come to the mean time to failure.

So, 8,760 hours divided by 0.35%, or 0.0035: this equation gives you 2.5 million hours. If you have only one hard disk drive, it will take 285 years for this one to fail on average, but only under the agreed conditions, and the agreed conditions apply within the five years of warranty.
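As a minimal sketch, the conversion in both directions looks like this in Python. It uses the same simple division described in the interview; the function names and the constant are illustrative, not taken from any data sheet or library.

```python
HOURS_PER_YEAR = 8760  # 24 hours x 365 days


def afr_to_mttf_hours(afr: float) -> float:
    """Convert an annualised failure rate (as a fraction) to MTTF in hours."""
    if not 0 < afr <= 1:
        raise ValueError("afr must be a fraction between 0 and 1")
    return HOURS_PER_YEAR / afr


def mttf_hours_to_afr(mttf_hours: float) -> float:
    """Convert an MTTF in hours back to an annualised failure rate."""
    if mttf_hours <= 0:
        raise ValueError("mttf_hours must be positive")
    return HOURS_PER_YEAR / mttf_hours


mttf = afr_to_mttf_hours(0.0035)    # roughly 2.5 million hours
years = mttf / HOURS_PER_YEAR       # roughly 285 years
afr = mttf_hours_to_afr(2_500_000)  # roughly 0.0035, i.e. 0.35%

print(f"MTTF: {mttf:,.0f} hours (about {years:.0f} years)")
print(f"AFR:  {afr:.2%}")
```

The two functions are inverses of each other, so a data sheet that quotes only one of the two values can be read in terms of the other.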

This 2.5 million hours, or 285 years, would mean that if you change your hard disk drive every five years, then after 285 years, you will encounter a random failure. But again, 285 years is way too high. You could phrase it this way: if you have 2.5 million drives, you would have one failure per hour.

Or if you have 2,500 drives, you would have one failure every thousand hours. That would be a more realistic interpretation of this 2.5 million hours.
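The fleet-level reading of MTTF is just a division of the MTTF by the number of drives running. A short Python sketch, with a hypothetical function name, reproduces the figures above:

```python
HOURS_PER_YEAR = 8760


def hours_between_fleet_failures(mttf_hours: float, drive_count: int) -> float:
    """Average time between failures anywhere in a fleet of identical drives."""
    if drive_count < 1:
        raise ValueError("drive_count must be at least 1")
    return mttf_hours / drive_count


MTTF_HOURS = 2_500_000  # enterprise drive figure used in the interview

per_hour = hours_between_fleet_failures(MTTF_HOURS, 2_500_000)  # 1 hour
per_thousand = hours_between_fleet_failures(MTTF_HOURS, 2_500)  # 1,000 hours
single_drive_years = hours_between_fleet_failures(MTTF_HOURS, 1) / HOURS_PER_YEAR

print(per_hour, per_thousand, round(single_drive_years))
```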

But if you only have the mean time to failure value, and you take 8,760 hours per year divided by this MTTF, you will have the annualised failure rate, which is a more useful value.

MTTF is not a very useful value, and for low-failure-rate products like hard disk drives, it often leads to misunderstanding.

A better analogy to explain it with is another low-failure-rate kind of product: the human being. My failure rate within the next year is quite low. Most people of my age operating under the specification "office worker" will survive next year.

I asked my life insurance company, "What is the probability that I will fail within the next year?" They know this value because if I fail, if I die next year, they have to pay. They know this value very well, and they told me it is 0.16%. Out of 1,000 life insurance contracts of people like me that they have on their books, they are calculating for 1.6 deaths in the next year.

If I do the maths and calculate from 0.16%, this gives an MTTF of about 5.5 million hours, which means I'm roughly twice as reliable as an HDD. That is 625 years and, of course, I will not live 625 years. The life insurance company told me they are calculating for 82 years.
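The human comparison uses the same division. The sketch below, with illustrative variable names, works the 0.16% figure through and sets the result beside the 82-year expectation quoted by the insurer.

```python
HOURS_PER_YEAR = 8760

human_afr = 0.0016  # 0.16% chance of dying within the next year
hdd_afr = 0.0035    # 0.35% enterprise drive AFR

human_mttf_hours = HOURS_PER_YEAR / human_afr  # about 5.5 million hours
human_mttf_years = human_mttf_hours / HOURS_PER_YEAR  # about 625 years
reliability_ratio = hdd_afr / human_afr  # about 2.2, i.e. "twice as reliable"

insurer_life_expectancy_years = 82  # figure quoted in the interview

print(f"Human 'MTTF': {human_mttf_hours:,.0f} hours, "
      f"about {human_mttf_years:.0f} years")
print(f"Human vs HDD reliability ratio: {reliability_ratio:.1f}x")
print(f"Actual life expectancy used by the insurer: "
      f"{insurer_life_expectancy_years} years")
```

The gap between 625 and 82 years is the point of the analogy: the value describes the failure rate over the next year, not how long an individual unit lasts.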

That's the reliability. It tells us how many failures there will be within the next year, and that's all. It is not a statement about lifetime.

Can you explain the difference between MTTF and MTBF (mean time between failures)?

We talked about MTTF, mean time to failure. Sometimes in data sheets, it is written as MTBF, mean time between failures.

Strictly speaking, mean time between failures is meant for technical products that can be repaired. With cars, you can have a mean time to first failure. After the car is repaired, you then have a mean time to the next failure.

As hard disk drives cannot be repaired, the correct term for the hard disk drive is MTTF, mean time to failure.

The next question is: what causes drives to fail?

Anything. Drives are mechanical components with a lot of electronics.

There could be an electronic failure, electro-migration, or some of the wires in the chip could tear off. There can be mechanical problems like the glue of the head failing or a head crash. There are many different failure modes. Fortunately, drives are very reliable.

The aforementioned 0.35% is highly reliable. It happens rarely. It happens seldom. It takes a long time on average for that to happen.

Most drives in their five-year warranty period, or even after seven, eight or nine years of operation, won't fail. The vast majority won't fail, but it can happen.

This is why we have these statistical reliability values. Although it may happen rarely, it may happen late, or it may not happen at all, there is still a remaining probability that a failure could happen to your particular drive at any time.

Even on the first or second day, it could happen. The probability is lower, but it still could happen. This is why a backup is always important.

What is the difference in terms of failure between a 10-drive setup and a 60-drive setup?

Failures can happen to any drive with very low probability. But there is a difference if you have only one drive or 10 drives, or if you have 60 or 120 drives. The more drives you have, the probability for any single drive is no higher, but the higher the [probability] that you encounter a failure somewhere in the system.

Let's say if you have one or 10 drives, you could run the system with drives of lower reliability. Let's say desktop drives, for example. They have an annual failure rate of 1.5%, but if you have only one or two or four with an annual failure rate of 1.5%, you won't have many failures.

Most of the systems will be stable. If you take this 1.5% annual failure rate into a 60-bay system, every system may see a drive failure every year. If you want to do that and you are fine with it, you can, but most drive failures cause interruptions in service and require manual interaction when replacing drives.

And when you operate a 60-bay system, you cannot afford that many failures or manual interventions with your system. You need to rely on low-failure-probability enterprise drives. That's basically the difference.

Smaller systems can be run with lower-reliability drives because of the lower number of drives. With many drives in enterprise environments, you should use proper enterprise drives.
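To put rough numbers on the small-system versus 60-bay comparison, here is a hedged Python sketch. It assumes drive failures are independent and that each drive fails within a year with probability equal to its AFR. That independence assumption is not from the interview, and the function names are hypothetical.

```python
def expected_failures_per_year(drive_count: int, afr: float) -> float:
    """Average number of drive failures per system per year."""
    return drive_count * afr


def prob_at_least_one_failure(drive_count: int, afr: float) -> float:
    """Chance that a system sees one or more drive failures in a year.

    Assumes independent failures: 1 - (1 - afr) ** drive_count.
    """
    return 1.0 - (1.0 - afr) ** drive_count


DESKTOP_AFR = 0.015      # 1.5%, desktop drives
ENTERPRISE_AFR = 0.0035  # 0.35%, enterprise drives

scenarios = [
    ("4 desktop drives", 4, DESKTOP_AFR),
    ("60 desktop drives", 60, DESKTOP_AFR),
    ("60 enterprise drives", 60, ENTERPRISE_AFR),
]

for label, count, afr in scenarios:
    expected = expected_failures_per_year(count, afr)
    probability = prob_at_least_one_failure(count, afr)
    print(f"{label:22s} expected {expected:.2f}/year, "
          f"P(at least one) = {probability:.0%}")
```

With 60 desktop drives the expectation comes out close to one failure per system per year, which matches the point made above; the same chassis filled with enterprise drives expects about a fifth of a failure per year.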

How should storage systems be set up to minimise the risk of hard disk drive failure?

Again, operate the hard disk drives within the reliability conditions given in the data sheet. A hard disk drive that is not rated for 24×7 should not be operated 24×7.

Hard disk drives should be operated within the temperature range. Hard disk drives should not exceed the workload that is set in the data sheets. The workload is just an indication.

It is not like an endurance limit. For enterprise drives, we say 550TB a year. If you read or write a little bit more, it doesn't matter, but if you read or write double or triple that, which you could do if you load the hard disk drive as much as you can, you will have lower reliability.

As long as you keep to these operating conditions and stay within the temperature range (an average of no more than 42°C gives the best reliability), you can enjoy a long lifetime for your hard disk drives.
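As a rough illustration, the operating conditions mentioned in the interview can be written down as a simple check. The class and field names below are hypothetical, the default limits are only the enterprise figures quoted above (42°C average, 550TB per year, 24×7 rating, five-year warranty), and a real data sheet should always take precedence.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ReliabilitySpec:
    """Data-sheet conditions; defaults are the enterprise figures quoted above."""
    max_avg_temperature_c: float = 42.0
    rated_workload_tb_per_year: float = 550.0
    rated_for_24x7: bool = True
    warranty_years: float = 5.0


@dataclass
class OperatingConditions:
    """How a drive is actually being run (hypothetical monitoring values)."""
    avg_temperature_c: float
    workload_tb_per_year: float
    runs_24x7: bool
    age_years: float


def spec_warnings(cond: OperatingConditions,
                  spec: Optional[ReliabilitySpec] = None) -> List[str]:
    """Return the conditions under which the quoted AFR no longer applies."""
    spec = spec or ReliabilitySpec()
    warnings = []
    if cond.avg_temperature_c > spec.max_avg_temperature_c:
        warnings.append("average temperature above the specified value")
    if cond.workload_tb_per_year > spec.rated_workload_tb_per_year:
        # The workload rating is an indication, not a hard endurance limit.
        warnings.append("workload above the rated value")
    if cond.runs_24x7 and not spec.rated_for_24x7:
        warnings.append("drive is run 24x7 but is not rated for it")
    if cond.age_years > spec.warranty_years:
        warnings.append("drive is older than the warranty period")
    return warnings


in_spec = OperatingConditions(38.0, 400.0, True, 3.0)
overloaded = OperatingConditions(48.0, 1500.0, True, 6.0)

print(spec_warnings(in_spec))      # no warnings expected
print(spec_warnings(overloaded))   # temperature, workload and age warnings
```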