Complexity of FreeBSD VFS using ZFS

I spend a lot of time hacking on the ZFS port to FreeBSD and fixing various bugs. Quite often the bugs are specific to the port and not to the OpenZFS core. A good share of those bugs are caused by differences between the VFS model of FreeBSD and that of Solaris and its descendants such as illumos. I would like to talk about those differences.

But first, a few words about VFS in general. VFS stands for “virtual file system”. It is an interface that all concrete filesystem drivers must implement so that higher-level code can be agnostic of any implementation details. More strictly, VFS is a contract between an operating system and a filesystem driver. In a wider sense, VFS also includes the higher-level, filesystem-independent code that provides more convenient interfaces to its consumers. For example, a filesystem must implement an interface for looking up an entry by name in a directory, while VFS provides a more convenient interface that performs a lookup using an absolute path, or a relative path and a starting directory. Additionally, VFS in the wider sense includes utility code that can be shared between different filesystem drivers.

Common VFS models for UNIX and UNIX-like operating systems place some common requirements on the structure of a filesystem. First, it is assumed that there are special filesystem objects called directories that provide a mapping from names to other filesystem objects, which are called directory entries. All other filesystem objects contain data or provide other utilities. The directories form a directed rooted tree starting with a specially designated root directory. More generally, the filesystem as a whole is a connected rooted directed acyclic graph where each edge has a name associated with it. Non-directory objects may be reachable by multiple paths. In other words, a non-directory object can be a directory entry in more than one directory, or it can appear as multiple entries with different names in a single directory.
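
The split between a per-directory lookup and a path-based lookup can be sketched with a toy in-memory namespace. This is only an illustration: every `toy_*` name is hypothetical and none of it is FreeBSD or Solaris code. `toy_dir_lookup` plays the role of the filesystem driver, `toy_path_lookup` plays the role of the generic layer, and `toy_link` adds a named edge, so calling it twice with the same target gives one object two paths.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOY_MAX_ENTRIES 8

struct toy_node;

/* One named edge of the graph: a directory entry. */
struct toy_dirent {
    const char      *name;
    struct toy_node *target;
};

/* A filesystem object; only directories carry entries. */
struct toy_node {
    int               is_dir;
    size_t            nentries;
    struct toy_dirent entries[TOY_MAX_ENTRIES];
};

/* Driver side: look up ONE name (of length len) in ONE directory. */
static struct toy_node *
toy_dir_lookup(struct toy_node *dir, const char *name, size_t len)
{
    if (dir == NULL || !dir->is_dir)
        return NULL;
    for (size_t i = 0; i < dir->nentries; i++) {
        const char *n = dir->entries[i].name;
        if (strlen(n) == len && memcmp(n, name, len) == 0)
            return dir->entries[i].target;
    }
    return NULL;
}

/* Generic side: resolve a whole path, absolute or relative to start. */
static struct toy_node *
toy_path_lookup(struct toy_node *root, struct toy_node *start,
    const char *path)
{
    struct toy_node *cur = (path[0] == '/') ? root : start;

    while (cur != NULL && *path != '\0') {
        while (*path == '/')            /* skip separators */
            path++;
        size_t len = strcspn(path, "/");
        if (len == 0)                   /* trailing slash or end of path */
            break;
        cur = toy_dir_lookup(cur, path, len);
        path += len;
    }
    return cur;
}

/* Add a named edge from a directory to any object. */
static void
toy_link(struct toy_node *dir, const char *name, struct toy_node *target)
{
    assert(dir->is_dir && dir->nentries < TOY_MAX_ENTRIES);
    dir->entries[dir->nentries].name = name;
    dir->entries[dir->nentries].target = target;
    dir->nentries++;
}
```

Linking the same file object as `hosts` under `/etc` and as `hosts.alias` under `/` makes both `/etc/hosts` and `/hosts.alias` resolve to the same node. The sketch ignores symbolic links, locking, and error codes.
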
Conventionally, multiple paths to a single object are referred to as hard links. Additionally, there is a designated object type called a symbolic link that can contain a relative or an absolute path. When traversing such a symbolic link object, a filesystem consumer may jump to the contained path, called the symbolic link destination. It is not required to do so, however. Symbolic links make it possible to create the appearance of arbitrary topologies, including loops and broken paths that lead nowhere.

A directory must always contain two special entries: “.”, which refers to the directory itself, and “..”, which refers to its parent directory (or to the directory itself if it is the root).

Each filesystem object is customarily referred to as an inode, especially in the context of a filesystem driver implementation. VFS requires that each filesystem object have a unique integer identifier referred to as an inode number.

At the VFS API layer the inodes are represented as vnodes, where ‘v’ stands for virtual. In object-oriented terms the vnodes can be thought of as interfaces or abstract base classes for the inodes of the concrete filesystems. The vnode interface has abstract methods known as vnode operations, or VOPs, that dispatch calls to concrete implementations. Typically an OS kernel is implemented in C, so object-oriented facilities have to be emulated. In particular, a one-to-one relation between a vnode and an inode is established via pointers rather than by using an is-a relationship.

For example, here is how a method for creating a new directory looks in FreeBSD VFS:

    int VOP_MKDIR(
        struct vnode *dvp,
        struct vnode **vpp,
        struct componentname *cnp,
        struct vattr *vap);

dvp (“directory vnode pointer”) is a vnode that represents an existing directory; the method is dispatched to the implementation associated with this vnode. If the call is successful, then vpp (“vnode pointer to pointer”) points to a vnode representing the newly created directory. cnp defines the name of the new directory and vap its various attributes.
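
To make the emulation of object-oriented dispatch more concrete, here is a minimal sketch in plain C. It is not the FreeBSD or Solaris implementation: the `toy_*` and `memfs_*` names, the reduced argument list, and the `-1` error return are all assumptions made for illustration. It shows an operations table acting as the virtual method table, and a vnode and an inode tied together by pointers in both directions.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

struct toy_vnode;

/* Table of "virtual methods"; each concrete filesystem supplies its own. */
struct toy_vnodeops {
    int (*vop_mkdir)(struct toy_vnode *dvp, struct toy_vnode **vpp,
        const char *name);
};

/* The filesystem-independent object handed around by the generic layer. */
struct toy_vnode {
    const struct toy_vnodeops *v_ops;   /* which implementation to call */
    void                      *v_data;  /* the filesystem's private inode */
};

/* Generic entry point: dispatch through the directory vnode's ops table. */
static int
toy_vop_mkdir(struct toy_vnode *dvp, struct toy_vnode **vpp, const char *name)
{
    assert(dvp != NULL && dvp->v_ops != NULL && dvp->v_ops->vop_mkdir != NULL);
    return dvp->v_ops->vop_mkdir(dvp, vpp, name);
}

/* ---- A hypothetical concrete filesystem, "memfs" ---- */

struct memfs_inode {
    unsigned long     ino;       /* inode number */
    char              name[32];
    struct toy_vnode *vnode;     /* back pointer: inode -> vnode */
};

static unsigned long memfs_next_ino = 2;   /* 1 is reserved for the root */

static int memfs_mkdir(struct toy_vnode *dvp, struct toy_vnode **vpp,
    const char *name);

static const struct toy_vnodeops memfs_ops = { .vop_mkdir = memfs_mkdir };

/* Concrete implementation: create the inode and its paired vnode. */
static int
memfs_mkdir(struct toy_vnode *dvp, struct toy_vnode **vpp, const char *name)
{
    struct memfs_inode *ip = calloc(1, sizeof(*ip));
    struct toy_vnode *vp = calloc(1, sizeof(*vp));

    if (ip == NULL || vp == NULL || strlen(name) >= sizeof(ip->name)) {
        free(ip);
        free(vp);
        return -1;               /* toy error code, not a real errno */
    }
    ip->ino = memfs_next_ino++;
    strcpy(ip->name, name);
    ip->vnode = vp;              /* inode -> vnode */
    vp->v_ops = dvp->v_ops;      /* child lives on the same filesystem */
    vp->v_data = ip;             /* vnode -> inode */
    *vpp = vp;
    return 0;
}
```

A caller only ever sees `toy_vop_mkdir` and `struct toy_vnode`; which filesystem handles the call is decided by the ops table attached to the directory vnode. The sketch does not record the new entry in the parent directory and does no locking.
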
In Solaris VFS, VOP_MKDIR has a few additional parameters, but otherwise it is equivalent to the FreeBSD one.

It would be wasteful or even plain impossible to have vnode objects in memory for every filesystem object that could potentially be accessed, so vnodes are created upon access and destroyed when they are no longer needed. Given that C does not provide any sort of smart pointers, the vnode life cycle must be maintained explicitly. Since in modern operating systems multiple threads may concurrently access a filesystem, and potentially the same vnodes, the life cycle must be controlled by a reference count. All VFS calls that produce a vnode, such as lookups or new object creation, return the vnode referenced. Once a caller is done using the vnode, it must explicitly drop its reference. When the reference count goes to zero, the concrete filesystem is notified and should take an appropriate action. In the Solaris VFS model the concrete filesystem must free both its implementation-specific object and the vnode. In FreeBSD VFS the filesystem must handle its private implementation object, but the vnode is handled by the VFS code.

In practice an application may perform multiple accesses to a file without having any persistent handle open for it. For example, the application may call access(2), stat(2), and similar system calls. Likewise, lookups by different applications may frequently traverse the same directories. As a result, it would be inefficient to destroy a vnode and its associated inode as soon as its use count reaches zero. All VFS implementations cache vnodes to avoid the expense of their frequent destruction and construction. VFS implementations also tend to cache path-to-vnode relationships to avoid the expense of looking up a directory entry via a call into the filesystem driver, VOP_LOOKUP. Obviously, there can be different strategies for maintaining the caches.
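
The reference-counting rule described above (a vnode is returned referenced, the caller drops its reference, and the filesystem is notified when the count reaches zero) can be sketched as follows. This is a simplified illustration with hypothetical `rc_*` names, not kernel code. It uses C11 atomics for the counter and deliberately leaves out what a real kernel must also handle, such as a cache lookup racing with the last release.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

struct rc_vnode;

/* Hook the generic layer calls when the last reference is dropped. */
typedef void (*rc_inactive_fn)(struct rc_vnode *vp);

struct rc_vnode {
    atomic_int     refs;       /* number of active users */
    rc_inactive_fn inactive;   /* filesystem's "last reference gone" hook */
    void          *data;       /* filesystem-private object */
};

/* A freshly produced vnode is handed to the caller already referenced. */
static void
rc_vnode_init(struct rc_vnode *vp, rc_inactive_fn inactive, void *data)
{
    atomic_init(&vp->refs, 1);
    vp->inactive = inactive;
    vp->data = data;
}

/* Take an additional reference on a vnode the caller already holds. */
static void
rc_vref(struct rc_vnode *vp)
{
    int old = atomic_fetch_add(&vp->refs, 1);
    assert(old > 0);           /* must not revive an unreferenced vnode */
    (void)old;
}

/* Drop one reference; notify the filesystem if it was the last one. */
static void
rc_vrele(struct rc_vnode *vp)
{
    int old = atomic_fetch_sub(&vp->refs, 1);
    assert(old > 0);
    if (old == 1 && vp->inactive != NULL)
        vp->inactive(vp);
}

/* Stand-in filesystem hook: records the notification and detaches
 * its private object. What happens to the vnode itself is left to
 * whoever owns it in the model being followed. */
static int rc_inactive_calls;

static void
rc_count_inactive(struct rc_vnode *vp)
{
    rc_inactive_calls++;
    vp->data = NULL;
}
```

In this sketch the hook only deals with the private object, which is closer in spirit to the FreeBSD division of labour described above. In a Solaris-style model the hook would also be responsible for freeing the vnode.
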
A caching strategy could limit the lifetime of a cache entry, or it could limit the total size of the cache and purge any excess entries in least-recently-used or least-frequently-used order, and so on. Solaris VFS combines the name cache and the vnode cache. The name cache maintains an extra reference on a…