ECS 150 Filesystem Implementation
Prof. Joel Porquet-Lupine
UC Davis 2020/2021
Copyright 2017-2021 Joel Porquet-Lupine CC BY-NC-SA 4.0 International License /
1 / 45
Introduction
Concepts
Volume
Disk, or partition on a disk Large array of data blocks
Filesystem
Methods and data structures to organize files on a volume
Volume
Directory
Hierarchy of named files
directory
music/ 320 work/ 219 foo.txt 871
File
index structure
Storage sectors
File name
foo.txt
File number
871
Metadata to describe files characteristics
Actual sequence of data
Metadata
size: 42 KiB
created: 1970-01-01
Data
010111011001
101001110001
110111011000
.
2 / 45
/
Introduction
Objectives
Performance
In spite of underlying storage devices limitations Achieved by maintaining spatial locality
Blocks that are logically related should be stored near one another
Flexibility
Handle all file sizes: small vs large
Handle all access types: sequential vs random Handle all access frequencies: rare vs frequent Handle all file lifetimes: temporary vs permanent
Persistence
Maintain both user data and internal data structures Survive system crashes and power failures
Reliability
Store data reliability over time
In spite of crash during updates, or hardware errors
3 / 45
/
Design considerations
Workload
File size and storage space
$ du -sh /usr/bin/
1.1G /usr/bin/
$ ls -1 /usr/bin/ | wc -l
4566
$ du -h VirtualBox/Machines/Ubuntu/Ubuntu.vdi 20G VirtualBox/Machines/Ubuntu/Ubuntu.vdi
File access and I/O transfer
Most files are small
Large files account for more storage
Most accesses are to small files
Accesses to large file account for more I/O transfer
$ strace chrome |& grep open | wc -l 557
$ dd if=Ubuntu.vdi of=copy.vdi bs=4K
20981043200 bytes (21 GB, 20 GiB) copied, 62.74 s, 334 MB/s
File access pattern and usage
Most files are read/written sequentially (e.g., config files, executables) Some files are read/written randomly (e.g., database files, swap files)
Some files have pre-defined size at creation (e.g., downloaded files) Some files start small and grow over time (e.g., system logs)
4 / 45
/
Design considerations
Blocks vs disk sectors
Rationale
OS can allocate blocks of disk sectors rather than individual sectors
Does not cost much more to access few consecutive disk sectors
Big block size
E.g., 32 KiB
Management requires less space Performance improvement Wasted space if block not full
Trade-off
Make block size multiple of memory page size 4KiB on most systems
Small block size
E.g., size of single sector Management requires more space More separate accesses to data Less wasted space if block not full
5 / 45
/
Design considerations
Review
Small files
Small blocks for efficient storage
Files used together should be stored together
Large files
Large blocks for efficient storage
Contiguous data allocation for sequential access Efficient lookup for random access
Problems
May not know at file creation
Whether file will become small or large
Whether file is persistent or temporary
Whether file will be used sequentially or randomly
6 / 45
/
Design considerations
Implementation overview
From pair
File name, Offset
foo.txt, 42
Data structures
Directories
directory
music/ 320 work/ 219 foo.txt 871
File number
871
index structure
Storage sectors
Map filenames to file numbers (to find metadata)
Index structure
Part of a files metadata Map data blocks of file
Free space map
Manage the list of free disk blocks Allow files to grow/shrink
Data structures organization
Storage devices often have non- uniform performance
Use of locality heuristics to optimize data placement
Defragmentation Files grouping
7 / 45
/
Directories
Design
A directory is simply a file
List of mappings from filenames to file numbers Each mapping is a directory entry
Only directly accessible by OS
Ensure integrity of mapping Accessible for processes via syscalls
e.g., opendir()/readdir()
Hexdump of directory
8 / 45
/
Directories
Organization strategies Flat hierarchy
One unique namespace for the entire volume
Use special area of the disk to hold root directory
Two files can never be named the same
Multi-user, Multi-level hierarchy
One special root directory Hierarchy of subdirectories
Permissions to distinguish between users
Multi-user flat hierarchy
Separate root directory for each user
But all users files must still have unique names
9 / 45
/
Directories
Implementation
Linear layout
Simple array of entries E.g., MS-FAT
Pros/Cons
Simple
Need to scan all entries
10 / 45
/
Directories
List layout
Linked-list of entries E.g., ext2
Pros/Cons
Jump over blocks of free entries Linear traversal
11 / 45
/
Directories
Tree layout
Tree of entries
Filenames hashed into keys used to traverse tree E.g., XFS, NTFS
Pros/Cons
Fast search
More complicated
12 / 45
/
Index structures
Metadata to data
From directory mapping, find metadata From metadata, find files contents
File name
foo.txt
Goals
music/ 320 work/ 219 foo.txt 871
Metadata File
Content M File ???
Support sequential data placement to maximize sequential file access Provide efficient random access to any file block
Limit overheads to be efficient for small files
Be scalable to support large files
Provide space to hold metadata itself
13 / 45
/
Index structures
Contiguous allocation
Files stored as a sequence of contiguous blocks File-to-blocks mapping includes first block (and size) Require allocation policy (e.g., first-fit, best-fit, worst-fit)
Example
Metadata Metadata Content
Content File 2
File 1 File 2 MM
File 1
Pros
Cons
Very simple
Best performance
Efficient sequential and random access
Change in size likely to require entire reallocation
External fragmentation
Usage
ISO 9660 (CD-ROM, DVD, BD)
14 / 45
/
Index structures
Digression about fragmentation
Internal fragmentation Waster space inside blocks Issue if blocks are too large
File 1 File 2 File 3
External fragmentation
Free space scattered instead of being contiguous Issue if blocks need to be allocated contiguously
File 1 File 2 File 3 File 4
(needs 4 free blocks: they exist but arent contiguous)
15 / 45
/
Index structures
Linked-list allocation
Files stored as linked lists of blocks
File-to-blocks mapping includes pointer to the first block
Pointer to the last block to optimize file growth For each block, pointer to the next block in chain
Example
Metadata Metadata File 1 File 2
MM
Pros
Cons
No (true) random access
Potentially inefficient sequential access
Usage
MS-FAT
Fairly simple
File size flexibility
No external fragmentation Easy sequential access
16 / 45
/
ECS 150 Filesystem Implementation
Prof. Joel Porquet-Lupine
UC Davis 2020/2021
Copyright 2017-2021 Joel Porquet-Lupine CC BY-NC-SA 4.0 International License /
17 / 45
Recap
Implementation overview
From pair
File name, Offset
foo.txt, 42
Directories
directory
music/ 320 work/ 219 foo.txt 871
File number
871
index structure
Storage sectors
Index structures
Contiguous allocation
Metadata Metadata Content File 1 File 2 File 1
MM
Linked-list allocation
Metadata Metadata File 1 File 2
MM
Content File 2
18 / 45
/
Index structures
Direct allocation
File-to-blocks mapping includes direct pointers to each data block
Example
Metadata File
M
Pros
Cons
Limited file size Non-scalable index structure
File size flexibility
Supports true random access
19 / 45
/
Index structures
Indexed allocation
File-to-blocks mapping includes a pointer to an index block An index block contains an array of data block pointers Data blocks allocated only on demand
Example
Metadata File
M
IB
Pros
Cons
Same as indexed allocation Overhead for small files
Same as direct allocation
Decouple index structure from metadata
20 / 45
/
Index structures
Linked index blocks (IB + IB + )
Last index blocks pointer can point to next index block
Example
Metadata File
M
File size flexibility
IB
IB
IB
Pros
Cons
Traversal for very large files
21 / 45
/
Index structures
Multilevel index blocks (IB x IB x )
First-level index block points onto second-level index blocks
Example
Metadata File
M
Great support for very large files
IB
IB
IB
IB
Pros
Cons
Wasteful for small files
22 / 45
/
Case study: MS-FAT
Introduction
Microsoft File Allocation Table
Originally created for floppy disks, in the late 70s
Used on MS-DOS, and early version of Windows (before NTFS)
Still very popular on certain systems (e.g., thumb drives, camera SD-cards, embedded systems, etc.)
Different versions over time: FAT12, FAT16, FAT32, and now exFAT
File Allocation Table
Index structure for files
Each file is a linked list of blocks Free space map
Linear map of all data blocks on disk
FAT Data blocks
Data
Next
Data
Next
Data
Next
23 / 45
/
Case study: MS-FAT
FAT structure
1 entry per data block
Index structures
Directory entry maps name to first block index
FAT Data blocks
0 1 2 3 4 5 6 7 8 9
foo.txt (block #3)
foo.txt (block #0)
foo.txt (block #1)
foo.txt (block #2)
bar.txt (block #0)
bar.txt (block #1)
foo.txt (block #4)
17
File
Size
Index
foo.txt
18000
9
bar.txt
5000
12
0
Indicates next block in chain
10 11 12 13 Locality heuristics 14 15 16 17 18 19
10
or EOC for last block of a file Free space tracking
11
0 indicates free block
3
16
Usually simple allocation strategy (e.g. next-fit)
0
EOC
EOC
0
24 / 45
/
Case study: MS-FAT
Directory structure
Directory is a file containing an array of 32-byte entries. Each entry composed of
8-byte name + 3-byte extension (ASCII)
Long file names were later supported by allowing to chain multiple directory entries
Creation date and time
Last modification date and time Index of first data block in FAT Size of the file
25 / 45
/
Case study: MS-FAT
Layout on disk
26 / 45
/
Case study: MS-FAT
Locality heuristics
Data blocks for a file may be scattered across the disk Defragmentation can rearrange data blocks and improve spatial locality
Before After
FAT Data blocks FAT
00 11 220 3 17 foo.txt (block #3) 3 0 440 55 66 707 88
Data blocks
9 10 10 11 11 3 12 16 13
14 0 15
16 EOC 17 EOC 18
19 0
foo.txt (block #0) foo.txt (block #1) foo.txt (block #2) bar.txt (block #0)
bar.txt (block #1) foo.txt (block #4)
9 10
10 11
11 12
12 13
13 EOC
14
15
16 17
17 EOC
18 19
foo.txt (block #0) foo.txt (block #1) foo.txt (block #2) foo.txt (block #3) foo.txt (block #4)
bar.txt (block #0) bar.txt (block #1)
27 / 45
/
Case study: MS-FAT
Conclusion
Pros
Simple
State required per file: start block and size
Widely supported (maybe even the most popular FS ever!) No external fragmentation
Cons
Limited performance
Many seeks if FAT cannot be cached in memory
Poor locality for sequential access if files are fragmented Poor random access
Limited metadata No access control
No support for hard links Limited volume and file sizes
E.g., 2-TiB max volume and 4-GiB max file size with FAT32 No support for reliability strategies
28 / 45
/
ECS 150 Filesystem Implementation
Prof. Joel Porquet-Lupine
UC Davis 2020/2021
Copyright 2017-2021 Joel Porquet-Lupine CC BY-NC-SA 4.0 International License /
29 / 45
Recap
MS-FAT
Design
Index structure
Linked list
Implemented via FAT (File Allocation Table) FAT[cur_block] == next_block
Free space
Via FAT as well
FAT[i] == 0
Locality heuristics
Next fit for FAT allocation Data block defragmentation
FAT Data blocks
0 1 2 3 4 5 6 7 8 9
10 11 12 13 14
foo.txt (block #3)
foo.txt (block #0)
foo.txt (block #1)
foo.txt (block #2)
bar.txt (block #0)
bar.txt (block #1)
foo.txt (block #4)
17
0
10
11
3
16
Optimize sequential layout 15 16 17 18 19
Discussion
Very simple, still widely used and supported
Poor locality, poor random access, limited metadata
0
EOC
EOC
0
30 / 45
/
Case study: FFS
Introduction
Berkeley Fast File System
Originally created as improvement of UFS (Unix
File System), in the early 80s Inspiration for the ext2/3/4 family
Inodes
Inode contains files metadata and a set of pointers to locate its data blocks
Index structure as combination of all the indexed- based approaches
File represented as a fixed, asymmetric tree, with 4-KiB data blocks as leaves
Inodes are stored consecutively in a big array, and indexed via an i-number
31 / 45
/
Case study: FFS
Files metadata
Type:
Ordinary file Directory Symbolic link Etc.
Directory inode (128B)
Permissions and owners
Size in bytes
Number of (hard-)links to the inode Timestamps:
Created, last modified, last accessed
File inode (128B)
Type
Mode
User ID
Group ID
File size
# blocks
# links
Flags
Timestamps (3)
Direct blocks (12)
Single indirect
Double indirect
Triple indirect
Type
Mode
User ID
Group ID
File size
# blocks
# links
Flags
Timestamps (3)
Direct blocks (12)
Single indirect
Double indirect
Triple indirect
Directory block
File data block
Data
.
inode #
..
inode #
passwd
inode #
fstab
inode #
32 / 45
/
Case study: FFS
Files index structure
Fixed, asymmetrical tree index structure (Multilevel index) Combination of: direct, indexed and multilevel indexed allocation
33 / 45
/
Case study: FFS
Characteristics Tree structure
File represented as a tree
Efficient to find any data block (e.g., random access)
High degree
Each indirect block points to 100s of blocks
Minimize number of seeks
Fixed structure
Byte n of a file always accessible via the same pointer(s)
Simple to implement
Asymmetric
Not all data blocks at the same level Efficiently supports small and large files
34 / 45
/
Case study: FFS
Flexible file size
15 pointers per inode
12 direct pointers to data blocks
With 4 KiB data blocks: max size of 48 KiB
1 pointer to a block of 1024 direct pointers to data blocks (single indirection) With 4 KiB data blocks: 4 MiB (+ 48 KiB)
1 pointer to a block of pointers to blocks of 1024 direct pointers (double indirection) With 4 KiB data blocks: 4 GiB (+ 4 MiB + 48 KiB)
1 pointer to a block of pointers to blocks of pointers to blocks of 1024 direct pointers (triple indirection)
With 4 KiB data blocks: 4 TiB (+ 4 GiB + 4 MiB + 48 KiB)
In total, the structure can point to: 12 + 1024 + 1024^2 + 1024^3 blocks *
*/
In practice, limited to 2 TiB 35 / 45
Case study: FFS
Small files support
All the blocks are reached via direct pointers
Two accesses to read data
Inode + data block
Fixed-depth tree?
With a fixed 3-level tree (instead of asymmetric tree)
A 4 KiB file would consume ~16 KiB!
4 KiB data + 3 levels of 4 KiB indirect blocks
5 accesses to read data
Inode + 3 indirection blocks + data block
36 / 45
/
Case study: FFS
Sparse files support
Sparse files contain large empty areas (holes)
fd = open(/path/to/sql.db, O_RDWR|O_CREAT, 0644);
ftruncate(fd, 1 << 30);write(fd, buf, sizeof(buf)); lseek(fd, -sizeof(buf), SEEK_END); write(fd, buf, sizeof(buf));// Grow the file to 1GiB// Write something at the beginning// Write something at the very endEfficient supportRead from hole 0-filled bufferWrite to holeData blocks and indirect blocksdynamically allocatedExample (above)2 writes: 4 KiB at offset 0 and 4 KiB at offset 230File size of 1.1 GiBBut space on disk of 16 KiB! 37 / 45/ Case study: FFSDirectory structureEntry structureOriginally, array of 16-byte entries14 bytes for file name2 bytes for i-numberLater, linked list in which each entry contains4 bytes for i-number Length of file name Variable-length file nameDirectory contentsFirst entry always . Points to selfSecond entry always .. Points to parent’s i-number 38 / 45/ Case study: FFSFree space managementNeed to keep track of which inode entries and data blocks are free Use of bitmapsBitmap data structureKeeping track of n resources requires a bitmap of n bits.Each bit in the bitmap tracks a single resource0 if resource is free1 if resource is allocated 011100001 110101101 000101001 100100010 … 111100010 39 / 45/ Case study: FFSLayout on diskSuperblockInformation about the file system volume (type, size, etc.) Inode bitmapData block bitmap Inode arrayInode #0 and #1 are reservedInode #2 is always the root directoryData blocks (includes actual file data blocks, but also directory contents and indirection blocks) Super BlockInode BitmapBlock BitmapReserved Root directory Other inodesInode ArrayData blocks FFS 11101110…0 1000100…1 40 / 45/ Case study: FFSExample: reading a file1. Inode #2 (/, root directory)Find root directory’s data block (912)2. Browse root directory’s content Find foo’s i-number (31)3. Inode #31Find directory’s data block (194)4. Browse directory’s content Find bar’s i-number (73)5. Inode #73Find directory’s data block (991)Inode arrayData blocks fd = open(“/foo/bar/toto”);read(fd, buf, len); Reserved Reserved Type: directory …912their boss. The person who knows HOW will always have afoobar 73 mp3 23 tmp 81 fooType: directory …194 bin 47 foo 31 usr 98 job. The person who knows WHY will always bebar 6. Browse Finddirectory’s content ‘s i-number (40)#0 #1 #2#31#40#73#194#301#302#912#913#991totoType: file … 302 913 301bar7. Inode #40Find toto file’s data block (302, 913, 301)8. Read data blocksType: directory …991 bit 80 dat 87 toto 4041 / 45/Case study: FFSLocality heuristics: block groupsDivide volume into block groups Sets of consecutive cylindersSeek time between blocks in a group is smallDistribute metadataDistribute into block groups File’s metadata close to its dataFile placementFiles belonging to same directory in same groupNew directory in different group than parent’s directoryData block placement First-fit strategyShort-term vs long-term locality42 / 45/Case study: FFSFirst-fit data placement43 / 45/ Case study: FFSLocality heuristics: reserved spaceBlock group heuristic is efficient but needs significant amount of free spaceIf volume is near full, little room for locality optimization FFS reserves a fraction of volume’s space (~10%)Treats volume as full before it actually is Leaves room for locality optimizationChoice motivated by (disk) technology trends Sacrifices disk spaceBut known to steadily increase To reduce seek timesKnown to only improve very slowly 44 / 45/ Case study: FFSConclusionProsEfficient storage for both small and large files Locality for both small and large filesLocality for metadata and dataConsInefficient for tiny filese.g., 1-byte file requires both an inode and a data blockOptimization possible for symbolic links (ext family) Inefficient encoding when a file is mostly contiguous on disk Need to reserve fraction of free space to optimize locality 45 / 45/
Reviews
There are no reviews yet.