Keep Me in the Loop: Inotify in HDFS
Colin P McCabe
Software Engineer


Tuesday, August 18, 2015
02:00 PM - 02:45 PM

Level:  Technical - Intermediate

An elephant never forgets, at least not if that elephant is Hadoop. The Hadoop Distributed Filesystem (HDFS) can store petabytes of data. Services that run on top of HDFS often want to cache or index some of that data. When files in HDFS change, or when more files are added, these services need to update their caches and indices.

Until recently, the only practical way to monitor HDFS for changes was to periodically rescan the filesystem. However, rescanning is time-consuming and inefficient. It requires the client to ask the namenode to list multiple directories, which generates a high RPC load. Because rescans are costly, they can only be run infrequently, which leaves long gaps between updates. As more data is added, full rescans get even costlier.

The new HDFS inotify API solves these problems. Instead of rescanning, applications can simply receive notifications about changes to the filesystem. In this talk, I will go over the design goals for inotify and how we accomplished them, the challenges we faced, and our plans for the future.
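The difference between the two approaches can be illustrated with a toy model. The sketch below is not the HDFS API: `ToyNamenode`, `rescan`, `follow`, and the RPC counter are hypothetical names invented for illustration, and the model assumes one RPC per directory listing and one RPC per read of the event log. It only shows why diffing full listings costs work proportional to the number of directories, while following a log of changes costs work proportional to the number of changes.

```python
from dataclasses import dataclass, field


@dataclass
class ToyNamenode:
    """Hypothetical in-memory stand-in for a namenode; not the HDFS API."""
    dirs: dict = field(default_factory=dict)    # directory -> set of file names
    events: list = field(default_factory=list)  # append-only log of (kind, path)
    rpc_count: int = 0

    def create(self, directory, name):
        self.dirs.setdefault(directory, set()).add(name)
        self.events.append(("CREATE", f"{directory}/{name}"))

    def list_dir(self, directory):
        self.rpc_count += 1                      # assumed: one RPC per listing
        return sorted(self.dirs[directory])

    def events_since(self, cursor):
        self.rpc_count += 1                      # assumed: one RPC per log read
        return self.events[cursor:], len(self.events)


def rescan(nn, known):
    """Polling approach: list every directory and diff against the last snapshot."""
    current = {f"{d}/{n}" for d in list(nn.dirs) for n in nn.list_dir(d)}
    return sorted(current - known), current


def follow(nn, cursor):
    """Notification approach: read only the events recorded after `cursor`."""
    batch, new_cursor = nn.events_since(cursor)
    return [path for kind, path in batch if kind == "CREATE"], new_cursor


nn = ToyNamenode()
for i in range(100):
    nn.create(f"/data/d{i}", "part-0")

known = rescan(nn, set())[1]       # initial snapshot of all paths
cursor = len(nn.events)            # current position in the event log
nn.create("/data/d7", "part-1")    # a single change after the snapshot

nn.rpc_count = 0
added_by_rescan, known = rescan(nn, known)
rescan_rpcs = nn.rpc_count

nn.rpc_count = 0
added_by_events, cursor = follow(nn, cursor)
event_rpcs = nn.rpc_count

print(added_by_rescan, rescan_rpcs)
print(added_by_events, event_rpcs)
```

Both approaches find the same new file, but in this model the rescan lists all 100 directories to do so, while the event reader makes a single request. A real client would also have to handle cases this sketch ignores, such as events that are no longer available by the time the client asks for them.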

Colin McCabe is a Platform Software Engineer at Cloudera, where he works on HDFS and related technologies. He is a Hadoop committer and PMC member. Prior to joining Cloudera, he worked on the Ceph Distributed Filesystem and the Linux kernel, among other things. He studied Computer Science and Computer Engineering at Carnegie Mellon.
