An elephant never forgets, at least not if that elephant is Hadoop. The Hadoop Distributed File System (HDFS) can store petabytes of data. Services that run on top of HDFS often want to cache or index some of that data. When files in HDFS change, or when more files are added, these services need to update their caches and indices.
Until recently, the only practical way to monitor HDFS for changes was to periodically rescan the filesystem. However, rescanning is time-consuming and inefficient. It requires the client to ask the namenode to list many directories, which generates a high RPC load. Because rescans are costly, they can only be run infrequently, which leads to long stretches of time between updates. As more data is added, full rescans get even costlier.
The new HDFS inotify API solves these problems. Instead of rescanning, applications can simply receive notifications about changes to the filesystem. In this talk, I will go over the design goals for inotify and how we accomplished them, the challenges we faced, and our plans for the future.