bigData_wcloudNEWExtended attributes in HDFS will facilitate at-rest encryption for Project Rhino, but they have many other uses, too.

Many mainstream Linux filesystems implement extended attributes, which let you associate metadata with a file or directory beyond common “fixed” attributes like filesize, permissions, modification dates, and so on. Extended attributes are key/value pairs in which the values are optional; generally, the key and value sizes are limited to some implementation-specific limit. A filesystem that implements extended attributes also provides system calls and shell commands to get, list, set, and remove attributes (and values) to/from a file or directory.

Recently, my Intel colleague Yi Liu led the implementation of extended attributes for HDFS (HDFS-2006). This work is largely motivated by Cloudera and Intel contributions to bringing at-rest encryption to Apache Hadoop (HDFS-6134; also see this post) under Project Rhino – extended attributes will be the mechanism for associating encryption key metadata with files and encryption zones — but it’s easy to imagine lots of other places where they could be useful.

For instance, you might want to store a document’s author and subject in sometime like user.author=cwl and user.subject=HDFS. You could store a file checksum in an attribute called user.checksum. Even just comments about a particular file or directory can be saved in an extended attribute.

In this post, you’ll learn some of the details of this feature from an HDFS user’s point of view.

Inside Extended Attributes

Extended attribute keys are java.lang.Strings and the values are byte[]s. By default, there is a maximum of 32 extended attribute key/value pairs per file or directory, and the (default) maximum size of the combined lengths of name and value is 16,384. You can configure these two limits with the dfs.namenode.fs-limits.max-xattrs-per-inode and dfs.namenode.fs-limits.max-xattr-size config parameters.

Every extended attribute name must include a namespace prefix, and just like in the Linux filesystem implementations, there are four extended attribute namespaces: user, trusted, system, and security. The system and security namespaces are for HDFS internal use only; only the HDFS super user can access trusted namespace. So, user extended attributes will generally reside in the user namespace (for example, “user.myXAttrkey”). Namespaces are case-insensitive and extended attribute names are case-sensitive.

Extended attributes can be accessed using the hdfs dfs command. To set an extended attribute on a file or directory, use the -setfattr subcommand. Read more