ZFS End-to-End Data Integrity
By bonwick on Dec 08, 2005
The job of any filesystem boils down to this: when asked to read a block, it should return the same data that was previously written to that block. If it can't do that because the disk is offline or the data has been damaged or tampered with, it should detect this and return an error.
Incredibly, most filesystems fail this test. They depend on the underlying hardware to detect and report errors. If a disk simply returns bad data, the average filesystem won't even detect it.
Even if we could assume that all disks were perfect, the data would still be vulnerable to damage in transit: controller bugs, DMA parity errors, and so on. All you'd really know is that the data was intact when it left the platter. If you think of your data as a package, this would be like UPS saying, "We guarantee that your package wasn't damaged when we picked it up." Not quite the guarantee you were looking for.
In-flight damage is not a mere academic concern: even something as mundane as a bad power supply can cause silent data corruption.
Arbitrarily expensive storage arrays can't solve the problem. The I/O path remains just as vulnerable, but becomes even longer: after leaving the platter, the data has to survive whatever hardware and firmware bugs the array has to offer.
And if you're on a SAN, you're using a network designed by disk firmware writers. God help you.
What to do? One option is to store a checksum with every disk block. Most modern disk drives can be formatted with sectors that are slightly larger than the usual 512 bytes, typically 520 or 528. These extra bytes can be used to hold a block checksum. But making good use of this checksum is harder than it sounds: the effectiveness of a checksum depends tremendously on where it's stored and when it's evaluated.
In many storage arrays (see the Dell|EMC PowerVault paper for a typical example with an excellent description of the issues), the data is compared to its checksum inside the array. Unfortunately this doesn't help much. It doesn't detect common firmware bugs such as phantom writes (the previous write never made it to disk) because the data and checksum are stored as a unit, so they're self-consistent even when the disk returns stale data. And the rest of the I/O path from the array to the host remains unprotected. In short, this type of block checksum provides a good way to ensure that an array product is not any less reliable than the disks it contains, but that's about all.
NetApp's block-appended checksum approach appears similar but is in fact much stronger. Like many arrays, NetApp formats its drives with 520-byte sectors. It then groups them into 8-sector blocks: 4K of data (the WAFL filesystem blocksize) and 64 bytes of checksum. When WAFL reads a block it compares the checksum to the data just like an array would, but there's a key difference: it does this comparison after the data has made it through the I/O path, so it validates that the block made the journey from platter to memory without damage in transit.
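To make the ordering concrete, here is a minimal sketch of that kind of read path. It is not NetApp's actual code; the CRC32 checksum and the placement of the 64 checksum bytes after the 4K of data are assumptions made for illustration. The point is simply that verification happens in host memory, only after the read has completed.

```python
import zlib

SECTOR = 520            # formatted sector size
SECTORS_PER_BLOCK = 8   # 8 x 520 bytes = 4160 bytes per group
DATA_SIZE = 4096        # WAFL filesystem block size
CKSUM_SIZE = 64         # checksum area appended to the data

def read_wafl_block(dev, block_number):
    """Read one 4K block plus its appended checksum and verify it in
    host memory, i.e. only after the data has crossed the I/O path."""
    raw_size = SECTOR * SECTORS_PER_BLOCK
    dev.seek(block_number * raw_size)
    raw = dev.read(raw_size)
    data = raw[:DATA_SIZE]
    cksum_area = raw[DATA_SIZE:DATA_SIZE + CKSUM_SIZE]

    # Hypothetical checksum format: a CRC32 in the first 4 bytes of the
    # 64-byte checksum area (the real on-disk format is not shown here).
    stored = int.from_bytes(cksum_area[:4], "big")
    if zlib.crc32(data) != stored:
        raise IOError(f"checksum mismatch reading block {block_number}")
    return data
```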
This is a major improvement, but it's still not enough. A block-level checksum only proves that a block is self-consistent; it doesn't prove that it's the right block. Reprising our UPS analogy, "We guarantee that the package you received is not damaged. We do not guarantee that it's your package."
The fundamental problem with all of these schemes is that they don't provide fault isolation between the data and the checksum that protects it.
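A small sketch (hypothetical, using CRC32 for brevity) of why that lack of fault isolation matters: when the checksum travels with the data, a phantom write leaves behind a stale block that still verifies perfectly.

```python
import zlib

def self_checksummed(data: bytes) -> bytes:
    """The scheme criticized above: the checksum lives with the data it
    protects, so the pair is self-consistent by construction."""
    return zlib.crc32(data).to_bytes(4, "big") + data

def verify(block: bytes) -> bytes:
    stored = int.from_bytes(block[:4], "big")
    data = block[4:]
    assert zlib.crc32(data) == stored, "checksum mismatch"
    return data

# Phantom write: "version 2" never reaches the disk, so a later read
# returns the stale "version 1" block -- which still verifies, because
# data and checksum were written (and lost) together.
disk_block = self_checksummed(b"version 1")
# ...the write of self_checksummed(b"version 2") is silently dropped...
print(verify(disk_block))  # b'version 1', and no error is raised
```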
ZFS Data Authentication
End-to-end data integrity requires that each data block be verified against an independent checksum, after the data has arrived in the host's memory. It's not enough to know that each block is merely consistent with itself, or that it was correct at some earlier point in the I/O path. Our goal is to detect every possible form of damage, including human mistakes like swapping on a filesystem disk or mistyping the arguments to dd(1). (Have you ever typed of= when you meant if=?)
A ZFS storage pool is really just a tree of blocks. ZFS provides fault isolation between data and checksum by storing the checksum of each block in its parent block pointer, not in the block itself. Every block in the tree contains the checksums for all its children, so the entire pool is self-validating. [The uberblock (the root of the tree) is a special case because it has no parent; more on how we handle that in another post.]
When the data and checksum disagree, ZFS knows that the checksum can be trusted because the checksum itself is part of some other block that's one level higher in the tree, and that block has already been validated.
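A minimal sketch of the idea, with SHA-256 standing in for the configurable ZFS checksum and a plain dictionary standing in for the disk; the class and function names are hypothetical, not ZFS's on-disk structures:

```python
import hashlib

class BlockPointer:
    """A parent's reference to a child block: the child's address *and*
    its expected checksum live here, one level up, not in the child."""
    def __init__(self, address: int, checksum: bytes):
        self.address = address
        self.checksum = checksum

def write_block(disk: dict, address: int, data: bytes) -> BlockPointer:
    disk[address] = data
    return BlockPointer(address, hashlib.sha256(data).digest())

def read_block(disk: dict, bp: BlockPointer) -> bytes:
    """Read through an already-validated parent: the checksum we compare
    against came from higher in the tree, so it can be trusted."""
    data = disk[bp.address]
    if hashlib.sha256(data).digest() != bp.checksum:
        raise IOError(f"checksum mismatch at address {bp.address}")
    return data
```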
ZFS uses its end-to-end checksums to detect and correct silent data corruption. If a disk returns bad data transiently, ZFS will detect it and retry the read. If the disk is part of a mirror or RAID-Z group, ZFS will both detect and correct the error: it will use the checksum to determine which copy is correct, provide good data to the application, and repair the damaged copy.
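A rough sketch of that self-healing read path, assuming an n-way mirror and again using SHA-256 as the checksum; the names are illustrative only:

```python
import hashlib

def mirror_read(copies: list, expected_checksum: bytes) -> bytes:
    """Return the first mirror copy whose checksum matches, then repair
    any copies that were found to be damaged."""
    good = None
    for data in copies:
        if hashlib.sha256(data).digest() == expected_checksum:
            good = data
            break
    if good is None:
        raise IOError("all copies failed the checksum: unrecoverable")
    for i, data in enumerate(copies):
        if data != good:
            copies[i] = good   # stands in for rewriting the bad device
    return good
```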
As always, note that ZFS end-to-end data integrity doesn't require any special hardware. You don't need pricey disks or arrays, you don't need to reformat drives with 520-byte sectors, and you don't have to modify applications to benefit from it. It's entirely automatic, and it works with cheap disks.
But wait, there's more!
The blocks of a ZFS storage pool form a Merkle tree in which each block validates all of its children. Merkle trees have been proven to provide cryptographically-strong authentication for any component of the tree, and for the tree as a whole. ZFS employs 256-bit checksums for every block, and offers checksum functions ranging from the simple-and-fast fletcher2 (the default) to the slower-but-secure SHA-256. When using a cryptographic hash like SHA-256, the uberblock checksum provides a constantly up-to-date digital signature for the entire storage pool.
Which comes in handy if you ask UPS to move it.
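A tiny demonstration of the Merkle property being relied on here, using SHA-256 throughout (the fletcher2 default is omitted for brevity): because each interior checksum covers its children's checksums, changing any block anywhere changes the root digest.

```python
import hashlib

def interior_checksum(children: list) -> bytes:
    """An interior block's checksum covers its children's checksums,
    so the root digest covers every block below it."""
    h = hashlib.sha256()
    for child in children:
        h.update(hashlib.sha256(child).digest())
    return h.digest()

blocks = [b"block A", b"block B", b"block C"]
root_before = interior_checksum(blocks)
blocks[1] = b"block B, silently corrupted"
root_after = interior_checksum(blocks)
assert root_before != root_after   # any change anywhere moves the root digest
```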
Technorati Tags: OpenSolaris Solaris ZFS
Comments:
Is there a file-system/RAID equivalent of the Fallacies of Distributed Computing? Sounds like there should be. Maybe something like: (*) Data redundancy guarantees data reliability, (*) checksums guarantee data accuracy, (*) bits only flip on disc platters, (*) firmware is bug-free, (*) NVRAM persists in the long term, (*) only the file-system can write data.
Posted by Chris Rijk on December 08, 2005 at 08:08 PM PST #
While we're at it, Chris, maybe we can add "multiple writes issued at the same time can't fail separately" because that's how RAID-Z supposedly fixes the RAID-5 write hole. We could also add "calculating checksums is free" and "caching metadata on the host alone is better than caching on both the host and the storage" because those also seem to be common myths. Then there's "I/O bandwidth is the scarcest commodity on systems today so we're going to use up to half of it writing parity and break layering to save a few CPU cycles" but that's really more of an inconsistency than a myth.
But seriously, Jeff, thanks for writing about this. The transactional-write strategy and checksumming are IMO the truly magical parts of ZFS. Even if the former is foreshadowed by other systems (such as WAFL, which you finally mention but still fail to credit as an inspiration), it's still very cool to see it in a general-purpose filesystem, and it does solve data-integrity problems that can't possibly be solved at the external-hardware level. It's a great application of the end-to-end principle.
P.S. I'd love to see what you have to say about transactional updates and batching and so on. Are you planning to do that one next?
Posted by Platypus on December 08, 2005 at 08:46 PM PST #
Hmm, some comments for Platypus:
"Calculating checksums is free": it's not free, just essential if you want your data to actually be correct. Platypus, why aren't you railing about the absurd cost of adding ECC or parity to RAM? Remember: Performance is a goal, correctness is a constraint.
"I/O bandwidth is the scarcest commodity on systems today so we're going to use up to half of it writing parity and break layering to save a few CPU cycles." Half of it writing parity? Sounds like you're using a mirror, not RAID-Z. Note that we're going to see a doubling of the number of CPUs on a single chip every 18 to 24 months, and that networking bandwidth is growing even faster.
And breaking layering? Layers are fine, but when they have outlived their usefulness, it's time to consolidate functions together. Is Intel going to complain that AMD's integration of a memory controller on their Opteron is breaking layering? Layering throws away information, in this case very valuable information. Forcing all redundancy in the I/O subsystem to be hidden behind a block interface was an expedient design decision for the first volume managers, not something engraved in stone. Layering is for cakes, not for software.
Posted by Bart on December 09, 2005 at 01:07 AM PST #
"Half of it writing parity?"
According to a comment Jeff B left on my website explaining how RAID-Z varies the stripe size, if you're doing single-block writes each upper-layer write will be done as one data plus one parity on disk.
"Note that we're going to see a doubling of the number of CPUs on a single chip every 18 to 24 months, and that networking bandwidth is growing even faster."
All of which only reinforces the point about system balance. If the rate of improvement is lower for I/O bandwidth than for other factors such as CPU or networking, then a feature that uses up to 50% of that I/O bandwidth writing parity will become more and more of an issue as systems continue to evolve. That's how things can work out with RAID-Z, as a couple of people have noted in the OpenSolaris forums, but ZFS would work without RAID-Z. If any part of a full-stripe write into as-yet-unattached space fails, it can simply be retried with no ill effect. That's the beauty of transactional updates. That leaves one free to do wider writes all the time, which could potentially waste a bit of space temporarily, except that (a) disk space is likely to be the most expendable resource available, and (b) the waste will often be only temporary as the stripe is filled and earlier partial versions are reclaimed, so that the final state will actually be more compact.
By the way, whenever the stripe width is two it's silly to write parity anyway. Falling back to mirroring in that case would be better.
"Layers are fine, but when they have outlived their usefulness, it's time to consolidate functions together."
It has yet to be proven that the layers involved in this case have outlived their usefulness. I reject that assumption, for reasons I'll get to in a moment.
"Forcing all redundancy in the I/O subsystem to be hidden behind a block interface was an expedient design decision for the first volume managers, not something engraved in stone."
One can change the details of an interface without rearranging the layers entirely. The interface between filesystems and volume managers can, and in my opinion should, be made richer, not abandoned, and that would have been sufficient to meet ZFS's goals while retaining greater compatibility with the rest of the storage ecosystem. Of course, that would have meant negotiating with others instead of reinventing the wheel.
"Layering is for cakes, not for software."
It's a good thing there are smarter people than you, who don't think in simplistic slogans. Collapsing layers is a common trick in the embedded world, to wring every last bit of performance out of a CPU- or memory-constrained system at the cost of adaptability. In a system that is constrained by neither CPU nor memory, and where future extension should be expected, it's a bad approach. If networking folks had thought as you do, it would be a lot harder to implement or deploy IPv6 or the current crop of transport protocols, not to mention packet filters and such that insert themselves between layers. In other words, it would be harder for technology to progress. In storage that same story has been played out with SCSI/FC and PATA/SATA/SAS.
Of course a layered implementation might require more CPU cycles, but we've already established that those are not the constraining resource. More importantly, it can be hard work to figure out how the interface should look to maximize functionality while minimizing performance impact, but that's why they pay us the big bucks. A designer who picks the easy way out isn't innovating; he's failing to do his job.
Layering also makes more rigorous testing possible, and provides other benefits, which (and here's the real kicker) is probably why ZFS itself is still layered. It's different layering, to be sure, but it's still layering. I'm sure Jeff B will claim there are good reasons why the new layering is better than the old, and those reasons might even be valid, but right now those justifications are not apparent. So far it looks a lot like when a junior engineer rewrites code before they fully understand it, and introduces unnecessary regressions in the process. A more experienced engineer will try to understand what's there first, and might end up rewriting it (properly) anyway, but will more often find a less disruptive fix. Which kind of engineer are you, Bart?
Posted by Platypus on December 09, 2005 at 03:44 AM PST #
Another thought also occurred to me with respect to the "most filesystems fail this test" and "correctness is a constraint" and such. Sun still sells a filesystem that fails this test, and even ZFS users still have to boot off of one that violates this constraint. It might not be wise to be too critical of every other filesystem while that's the case.
The real fact is that data integrity is about probabilities, not absolutes. Any hardware capable of corrupting your data between the time it's checked by your HBA and the time it's checked by ZFS is also in all probability capable of corrupting it after it's checked by ZFS, especially if any kind of caching is involved (as is almost certain to be the case). Someone at, say, Oracle might say that the integrity checking's not truly end to end unless one end is the application, not the filesystem. In the end there's always going to be a window of vulnerability somewhere, so the best one can do is try to address the most common causes of failure. I think Fibre Channel is overdesigned, for example, and I've never been a firmware engineer so if that dig was directed at me it was misplaced, but those awful FC SANs do manage to reduce the occurrence of what has in my experience been one of the most common failure modes: someone pulling out a cable while they're doing something totally unrelated in the data center. If you can re-zone everything from your desk, you spend a lot less time anywhere near the cables.
RAID does a good job dealing with bit-rot on idle media, FC and other protocols do a pretty good job of safeguarding the physical data path while bits are in transit, and ZFS adds yet another bit of protection from (roughly) the hardware up to the top of the filesystem interface. Yay. All good. Now what about something that rots in the buffer/page cache? What about an application that writes something other than what it meant to, as I've seen databases do on many occasions? Oh well? Too bad? Not My Problem? Maybe. You could say the filesystem did its job, but that would be cold comfort to the person whose data was lost, and a little disingenuous from anyone who claims layers don't matter. There are still more holes to be plugged, and more ends to which integrity checking could be extended. The game's not over yet.
Posted by Platypus on December 11, 2005 at 02:26 AM PST #
Platypus wrote:
"According to a comment Jeff B left on my website explaining how RAID-Z varies the stripe size, if you're doing single-block writes each upper-layer write will be done as one data plus one parity on disk."
How often do single-block writes happen? This is a fallback; analyzing it requires knowing what the actual write patterns are. My guess is that the vast majority of writes are of the full-stripe variety, not the mirroring fallback.
"By the way, whenever the stripe width is two it's silly to write parity anyway. Falling back to mirroring in that case would be better."
If you use even parity (i.e. XORing all the blocks yields a zero block), then when the stripe-width is two, you get mirroring.
Posted by Jonathan Adams on December 11, 2005 at 07:53 AM PST #
"If you use even parity (i.e. XORing all the blocks yields a zero block), then when the stripe-width is two, you get mirroring."
True, but in most systems doing it that way would mean touching data that could otherwise be left alone, with all of the obvious effects on cache pollution, memory bandwidth, etc. In the ZFS case, of course, the data has to be touched anyway to calculate a checksum, so that price has already been paid, which was sort of one of my earlier points.
Posted by Platypus on December 15, 2005 at 11:06 PM PST #
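For what it's worth, a minimal sketch of the even-parity observation in the exchange above: with a single data block, the XOR parity is byte-for-byte identical to the data, so a two-wide stripe degenerates into a mirror.

```python
def xor_parity(data_blocks: list) -> bytes:
    """Even parity: the XOR of all data blocks, so XORing the whole
    stripe (data plus parity) yields a zero block."""
    parity = bytearray(len(data_blocks[0]))
    for block in data_blocks:
        for i, byte in enumerate(block):
            parity[i] ^= byte
    return bytes(parity)

data = bytes([0x12, 0x34, 0x56, 0x78])
# With one data block, the parity equals the data: a two-wide stripe
# (one data disk plus one parity disk) is effectively a mirror.
assert xor_parity([data]) == data
```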
The description of NetApp's block-appended checksum technology is incorrect: not only does the checksum guarantee that the block is self-consistent, it also ensures that it is the right block. The checksum includes the logical identity of the block ("I am block 15 of LUN foo") and contains enough information to distinguish the most recently written contents from the previous (self-consistent but stale) contents of that block.
In terms of the UPS analogy, the NetApp checksum mechanism guarantees that the package is undamaged -and- that it's the right package.
Posted by Eric Hamilton on May 30, 2006 at 03:03 AM PDT #
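A rough sketch of what such an identity-bearing checksum might look like; the record layout and field names here are hypothetical, not NetApp's actual format. Folding the block's logical identity and write generation into the digest means a stale or misdirected block fails verification even though it is self-consistent.

```python
import hashlib

def identity_checksum(lun: str, block_number: int, generation: int,
                      data: bytes) -> bytes:
    """Hypothetical identity-bearing checksum: the block's logical
    identity and write generation are folded into the digest."""
    h = hashlib.sha256()
    h.update(f"{lun}:{block_number}:{generation}".encode())
    h.update(data)
    return h.digest()

# The same (stale) data no longer verifies against the checksum recorded
# for the newer write, so a phantom write or misdirected read is caught.
old = identity_checksum("foo", 15, generation=7, data=b"stale contents")
new = identity_checksum("foo", 15, generation=8, data=b"stale contents")
assert old != new
```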