Linux VFS internals and dentries
There is often a lot of confusion around how directories are represented on the filesystem. This
post tries to clarify how ext4 uses directory entry (dirent) structures to store directory information.
Linux VFS
The Linux VFS subsystem implements filesystem-related operations by exporting the
usual open/read/write interface to userspace processes, regardless of the underlying
filesystem or physical device. The main objects implemented by the VFS layer are
superblock, inode, dentry and file.
superblock
The super_block object is a collection of metadata describing a filesystem.
On ext filesystems, this object reflects the information stored in the
filesystem superblock at the beginning of each partition, just
after the initial boot block. The VFS superblock contains information such as:
- The filesystem block size
- The filesystem magic number
- An object representing the mount point within the system (see the dentry discussion further down)
- A pointer to a pool of inodes
The superblock is critical, so ext filesystems keep backup copies of it in several
block groups spread across the medium. dumpe2fs lists all the backup copies available.
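To make the first two items of the list concrete, here is a minimal sketch that pulls the magic number and the block size out of a raw ext superblock. It assumes the caller has already read the 1024-byte superblock (which starts 1024 bytes into the partition) into a buffer; the function name sb_block_size and the constant names are made up for this example, while the field offsets follow the on-disk ext2/3/4 layout.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Byte offsets inside the on-disk ext2/3/4 superblock. */
#define SB_LOG_BLOCK_SIZE_OFF 24      /* s_log_block_size, __le32 */
#define SB_MAGIC_OFF          56      /* s_magic, __le16 */
#define EXT_SUPER_MAGIC       0xEF53  /* shared by ext2, ext3 and ext4 */

/* Little-endian decoders, so the code also works on big-endian hosts. */
static uint16_t le16(const uint8_t *p)
{
    return (uint16_t)(p[0] | (p[1] << 8));
}

static uint32_t le32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Returns the block size in bytes, or 0 if the buffer does not look
 * like an ext superblock. Hypothetical helper, not a kernel API. */
uint32_t sb_block_size(const uint8_t *sb, size_t len)
{
    assert(sb != NULL);
    if (len < SB_MAGIC_OFF + 2 || le16(sb + SB_MAGIC_OFF) != EXT_SUPER_MAGIC)
        return 0;

    uint32_t log_size = le32(sb + SB_LOG_BLOCK_SIZE_OFF);
    if (log_size > 6)               /* 1024 << 6 = 64K, the largest sane value */
        return 0;
    return 1024u << log_size;       /* block size = 1024 * 2^s_log_block_size */
}
```

With s_log_block_size set to 2 this yields 4096, which matches the "Block size" line that dumpe2fs prints later in this post.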
inode and dentry
The VFS treats directories as special files and represents each component
of a filesystem path with a dentry object. Dentries are not physically stored
on the medium, but they are rather created on-the-fly by the kernel when needed.
The most relevant fields of a dentry are the following:
- struct dentry *d_parent, a pointer to the parent dentry. When building a path, the kernel chains the corresponding dentries together.
- struct inode *d_inode, a pointer to the inode implementing the current element of the path, be it a directory or a regular file.
- struct qstr d_name, the name of the path element. Names are not part of the inode but rather of the dentry.
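The chaining through d_parent can be illustrated with a small userspace sketch. The toy_dentry structure and toy_path function below are hypothetical stand-ins written for this post, not kernel code; the only kernel convention borrowed is that the root dentry is its own parent.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, heavily simplified stand-in for struct dentry. */
struct toy_dentry {
    struct toy_dentry *parent;  /* plays the role of d_parent (root points to itself) */
    const char *name;           /* plays the role of d_name */
};

/* Writes the absolute path of d into buf, walking parent pointers up to
 * the root. Returns 0 on success, -1 if buf (of size n) is too small. */
int toy_path(const struct toy_dentry *d, char *buf, size_t n)
{
    assert(d != NULL && buf != NULL);
    if (n == 0)
        return -1;

    size_t pos = n;             /* the path is assembled right to left */
    buf[--pos] = '\0';

    while (d->parent != d) {    /* stop once the root is reached */
        size_t len = strlen(d->name);
        if (pos < len + 1)
            return -1;
        pos -= len;
        memcpy(buf + pos, d->name, len);
        buf[--pos] = '/';
        d = d->parent;
    }

    if (pos == n - 1) {         /* d was the root itself: emit a lone "/" */
        if (pos == 0)
            return -1;
        buf[--pos] = '/';
    }

    memmove(buf, buf + pos, n - pos);   /* shift the result to the front */
    return 0;
}
```

Chaining three such objects (root, mguerri, directory1) and calling toy_path on the last one produces /mguerri/directory1. Note that the names come from the dentries alone, no inode is consulted.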
Directory structure on the filesystem
As mentioned before, directories are also represented on the filesystem by
inodes, which in turn contain pointers to blocks on the storage device. The layout of
a directory on the storage device consists of a list of <inode, name> pairs that
represent the entities contained in that directory. Linux exposes this information to
userspace through the dirent structure (see readdir(3)):
struct dirent {
    ino_t          d_ino;       /* inode number */
    off_t          d_off;       /* not an offset; see NOTES */
    unsigned short d_reclen;    /* length of this record */
    unsigned char  d_type;      /* type of file; not supported
                                   by all filesystem types */
    char           d_name[256]; /* filename */
};
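As a quick illustration of how userspace consumes this structure, the sketch below lists the <inode, name> pairs of a directory with opendir(3) and readdir(3). The list_dir helper is a name invented for this example.

```c
#include <assert.h>
#include <dirent.h>
#include <stdio.h>

/* Prints "<inode> <name>" for every entry of the directory at path.
 * If out is NULL nothing is printed and the entries are only counted.
 * Returns the number of entries, or -1 if the directory cannot be opened. */
long list_dir(const char *path, FILE *out)
{
    assert(path != NULL);

    DIR *dir = opendir(path);
    if (dir == NULL)
        return -1;

    long count = 0;
    struct dirent *e;
    while ((e = readdir(dir)) != NULL) {
        if (out != NULL)
            fprintf(out, "%10llu %s\n",
                    (unsigned long long)e->d_ino, e->d_name);
        count++;
    }

    closedir(dir);
    return count;
}
```

Calling list_dir("directory1", stdout) on the hierarchy used below should print the same inode numbers that ls -i reports, including the entries for . and .. on Linux.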
It is straightforward to verify the layout of a directory directly on the storage device. Consider, for example, the following hierarchy:
directory1
|- [13320526] directory2
|- [13328753] directory3
`- [13238920] file1
debugfs can report the block numbers referenced by an inode representing a
file or a directory. In this case, directory1 has inode number 13279608
(ls with the -i flag displays this information).
$ sudo debugfs /dev/mapper/debian-debian--home
debugfs: stat <13279608>
Inode: 13279608 Type: directory Mode: 0755 Flags: 0x80000
Generation: 791866861 Version: 0x00000000:0000000c
User: 1000 Group: 1000 Size: 4096
File ACL: 0 Directory ACL: 0
Links: 4 Blockcount: 8
Fragment: Address: 0 Number: 0 Size: 0
ctime: 0x583aaaed:81e380d8 -- Sun Nov 27 10:44:13 2016
atime: 0x583aaaf4:84c24ee8 -- Sun Nov 27 10:44:20 2016
mtime: 0x583aaaed:81e380d8 -- Sun Nov 27 10:44:13 2016
crtime: 0x57eac34a:8c8b2808 -- Tue Sep 27 21:06:50 2016
Size of extra inode fields: 28
EXTENTS:
(0):52965455
This directory is stored in block 52965455, whose sector position on the storage device depends on the filesystem block size.
$ sudo dumpe2fs /dev/mapper/debian-debian--home | grep "Block size"
dumpe2fs 1.42.12 (29-Aug-2014)
Block size: 4096
Block 52965455 corresponds to sector 52965455*8 = 423723640 when using 4K blocks
(assuming 512-byte sectors). dd can dump it directly when given the block number and
a 4K block size. stat reports a size of 4K, which is the allocation unit at the
filesystem level, but the relevant data is most likely less than 4096 bytes.
$ sudo dd if=/dev/mapper/debian-debian--home of=dump.bin skip=52965455 bs=4K count=1
1+0 records in
1+0 records out
4096 bytes (4.1 kB) copied, 0.000197388 s, 20.8 MB/s
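The block-to-sector arithmetic used above is simple enough to sketch in a few lines. The helper names are made up for this example, and the 512-byte sector size is an assumption (it is the usual logical sector size, but the post does not state it for this device).

```c
#include <assert.h>
#include <stdint.h>

/* Converts a filesystem block number into the index of its first sector.
 * sector_size must be non-zero and must divide block_size. */
uint64_t block_to_sector(uint64_t block, uint32_t block_size, uint32_t sector_size)
{
    assert(sector_size != 0 && block_size % sector_size == 0);
    return block * (block_size / sector_size);
}

/* Converts a filesystem block number into a byte offset on the device. */
uint64_t block_to_byte_offset(uint64_t block, uint32_t block_size)
{
    return block * (uint64_t)block_size;
}
```

For block 52965455 with 4096-byte blocks and 512-byte sectors, block_to_sector gives 423723640, the value quoted above. The dd invocation sidesteps the conversion by using bs=4K, so skip is expressed in filesystem blocks rather than sectors.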
The first part of the hex dump is the following:
00000000 78 a1 ca 00 0c 00 01 02 2e 00 00 00 01 00 ca 00 |x...............|
00000010 0c 00 02 02 2e 2e 00 00 88 02 ca 00 10 00 05 01 |................|
00000020 66 69 6c 65 31 62 61 64 4e 41 cb 00 14 00 0a 02 |file1badNA......|
00000030 64 69 72 65 63 74 6f 72 79 32 ca 00 71 61 cb 00 |directory2..qa..|
00000040 c4 0f 0a 02 64 69 72 65 63 74 6f 72 79 33 05 07 |....directory3..|
00000050 74 65 6d 70 38 73 74 32 35 36 2e 73 62 73 74 6f |temp8st256.sbsto|
00000060 72 65 00 00 1e 62 ca 00 24 00 1c 01 67 6f 6f 67 |re...b..$...goog|
00000070 70 75 62 2d 70 68 69 73 68 2d 73 68 61 76 61 72 |pub-phish-shavar|
00000080 2e 73 62 73 74 6f 72 65 25 62 ca 00 24 00 1a 01 |.sbstore%b..$...|
00000090 74 65 73 74 2d 66 6f 72 62 69 64 2d 73 69 6d 70 |test-forbid-simp|
000000a0 6c 65 2e 73 62 73 74 6f 72 65 00 00 29 62 ca 00 |le.sbstore..)b..|
Indeed, this 4K block contains the directory information, but also stale data
left over from previous allocations of the same block. The first thing that
immediately stands out is the names of the files and directories contained in
directory1. Next to each file name we would expect an ino_t representing the
inode number; at the VFS layer ino_t is an unsigned long, which translates to
64 bits on my system. However, it is up to the actual filesystem implementation
to decide how to map Linux VFS fields to its internal structures. As an example,
ext4 uses ext4_dir_entry, which consists of the following fields:
struct ext4_dir_entry {
    __le32 inode;                /* Inode number */
    __le16 rec_len;              /* Directory entry length */
    __le16 name_len;             /* Name length */
    char   name[EXT4_NAME_LEN];  /* File name */
};
The binary dump from 0x18 to 0x27 represents the entry for file1 and maps directly
to the fields above (the bytes after the name, up to 0x27, are padding).
88 02 ca 00     __le32 representing the inode number (13238920)
10 00           __le16 representing the directory entry length (16)
05 01           __le16 representing the file name length (261)
66 69 6c 65 31  char* representing the file name
Does this look right? Not really. inode seems correct, and so does rec_len (the
fields add up to 13 bytes, but the record length is rounded up to the next multiple
of 4, hence 16). name_len, instead, is definitely not correct: it should be simply
0x05, with the high-order byte (the second one on disk, since the field is little
endian) set to 0x00. This mismatch is explained by the ext4_dir_entry_2 structure,
which is defined in fs/ext4/ext4.h as follows:
/*
* The new version of the directory entry. Since EXT4 structures are
* stored in intel byte order, and the name_len field could never be
* bigger than 255 chars, it's safe to reclaim the extra byte for the
* file_type field.
*/
struct ext4_dir_entry_2 {
    __le32 inode;                /* Inode number */
    __le16 rec_len;              /* Directory entry length */
    __u8   name_len;             /* Name length */
    __u8   file_type;
    char   name[EXT4_NAME_LEN];  /* File name */
};
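To check this layout against the dump, here is a minimal sketch that decodes the 8-byte fixed header of an entry and computes the smallest record length for a given name. The dir_entry_hdr structure and both function names are invented for this example; the rounding to a multiple of 4 mirrors what ext4 does for its directory records.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Host-side copy of the fixed part of ext4_dir_entry_2 (hypothetical name). */
struct dir_entry_hdr {
    uint32_t inode;
    uint16_t rec_len;
    uint8_t  name_len;
    uint8_t  file_type;   /* 1 = regular file, 2 = directory */
};

/* Decodes the first 8 bytes of an on-disk entry; multi-byte fields are
 * little endian regardless of the host byte order. */
struct dir_entry_hdr decode_hdr(const uint8_t *p)
{
    assert(p != NULL);
    struct dir_entry_hdr h;
    h.inode = (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
              ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
    h.rec_len = (uint16_t)(p[4] | (p[5] << 8));
    h.name_len = p[6];
    h.file_type = p[7];
    return h;
}

/* Smallest valid record length: 8 header bytes plus the name,
 * rounded up to the next multiple of 4. */
uint16_t min_rec_len(uint8_t name_len)
{
    return (uint16_t)((8u + name_len + 3u) & ~3u);
}
```

Feeding it the bytes 88 02 ca 00 10 00 05 01 from offset 0x18 gives inode 13238920, rec_len 16, name_len 5 and file_type 1, and min_rec_len(5) confirms the 16-byte record length seen in the dump.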
EXT4_NAME_LEN is defined as 255, so using 2 bytes to represent the length does
not make much sense. As a consequence, the byte that was previously the
high-order half of the __le16 name length is now used to designate the file type,
and name_len becomes a __u8, i.e. simply 0x05. The same analysis can be applied to
the remaining entries. The analysis started from the first entry that could be
easily identified by its filename, but the block contains some more data from
0x00 to 0x17.
00000000 78 a1 ca 00 0c 00 01 02 2e 00 00 00 01 00 ca 00 |x...............|
00000010 0c 00 02 02 2e 2e 00 00 |........|
What is this data? The byte sequences 78 a1 ca 00 and 01 00 ca 00 look like
little endian inode numbers, i.e. 0x00caa178 (13279608) and 0x00ca0001 (13238273).
A quick check reveals what these entries are:
$ sudo debugfs /dev/mapper/debian-debian--home
debugfs 1.42.12 (29-Aug-2014)
debugfs: ncheck 13238273
Inode Pathname
13238273 //mguerri
debugfs: ncheck 13279608
Inode Pathname
13279608 /mguerri/directory1
These are the directory entries representing . and .. (the directory itself and its parent)!
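Putting everything together, the whole block can be walked by hopping from one entry to the next through rec_len. The sketch below does that over the first 78 bytes of the dump shown earlier. It assumes a plain linear directory block (no hashed index and no checksum tail); the entry structure and walk_dir_block are names made up for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* First 0x4e bytes of the directory block dumped above. */
static const uint8_t sample_block[] = {
    0x78, 0xa1, 0xca, 0x00, 0x0c, 0x00, 0x01, 0x02, 0x2e, 0x00, 0x00, 0x00,
    0x01, 0x00, 0xca, 0x00, 0x0c, 0x00, 0x02, 0x02, 0x2e, 0x2e, 0x00, 0x00,
    0x88, 0x02, 0xca, 0x00, 0x10, 0x00, 0x05, 0x01, 0x66, 0x69, 0x6c, 0x65,
    0x31, 0x62, 0x61, 0x64, 0x4e, 0x41, 0xcb, 0x00, 0x14, 0x00, 0x0a, 0x02,
    0x64, 0x69, 0x72, 0x65, 0x63, 0x74, 0x6f, 0x72, 0x79, 0x32, 0xca, 0x00,
    0x71, 0x61, 0xcb, 0x00, 0xc4, 0x0f, 0x0a, 0x02, 0x64, 0x69, 0x72, 0x65,
    0x63, 0x74, 0x6f, 0x72, 0x79, 0x33
};

/* Decoded entry (hypothetical type, not the on-disk layout). */
struct entry {
    uint32_t inode;
    uint8_t  file_type;
    char     name[256];   /* name_len is a __u8, so 255 bytes + NUL always fit */
};

/* Walks the entries in blk, storing up to max of them in out.
 * Returns the number of entries stored. */
size_t walk_dir_block(const uint8_t *blk, size_t len, struct entry *out, size_t max)
{
    assert(blk != NULL && out != NULL);
    size_t off = 0, n = 0;

    while (n < max && off + 8 <= len) {
        const uint8_t *p = blk + off;
        uint32_t inode = (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
                         ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
        uint16_t rec_len = (uint16_t)(p[4] | (p[5] << 8));
        uint8_t name_len = p[6];

        /* Stop on a corrupt record or a name that runs past the buffer. */
        if (rec_len < 8 || off + 8 + name_len > len)
            break;

        if (inode != 0) {               /* inode 0 marks an unused entry */
            out[n].inode = inode;
            out[n].file_type = p[7];
            memcpy(out[n].name, p + 8, name_len);
            out[n].name[name_len] = '\0';
            n++;
        }
        off += rec_len;                 /* rec_len leads to the next entry */
    }
    return n;
}
```

On the sample data this finds five entries: ., .., file1, directory2 and directory3. The rec_len of directory3 is 0x0fc4 (4036), which stretches from offset 0x3c to the end of the 4K block, so the bytes that follow its name in the dump are never reached by the walk. That is consistent with them being stale data from earlier allocations.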