How-to: Implement Role-based Security in Impala using Apache Sentry
Apache Sentry (incubating) is the Apache Hadoop ecosystem tool for role-based access control (RBAC). In this how-to, I will demonstrate how to implement Sentry for RBAC in Impala. I feel this introduction is best motivated by a use case.
Data warehouse optimization is one of the most common Hadoop use cases. After migrating data transformation workloads to Hadoop, customers typically want to provide self-service business intelligence access on Hadoop. Self-service BI results in many distinct users logging in and executing queries, each under their own user ID. Once end users start using the cluster, fine-grained authorization becomes a requirement for satisfying internal controls and governmental regulations. Sentry was originally created for this use case.
I won’t go into detail here about why fine-grained authorization is useful; my colleague Shreepadma Venugopalan covered this topic in her post “With Sentry, Cloudera Fills Hadoop’s Enterprise Security Gap.” Furthermore, Sravya Tirukkovalur wrote a post about using Sentry with Apache Hive (“How-to: Get Started with Sentry in Hive”).
Sentry and Impala work together in much the same way as Sentry and Hive. In fact, since the policy file syntax is identical, users who use both Hive and Impala are encouraged to share the same policy file.
The two systems have different architectures, however, which leads to some divergence in how they interact with Sentry. For example, Hive is typically configured with one or a small number of HiveServer2 instances. Impala works differently: every Impala daemon accepts queries, one of the many design features that help Impala scale to a large number of concurrent queries.
In the Hive case, a small number of HiveServer2 instances will read the policy file from HDFS, whereas in the Impala case, each daemon will. (Since many Impala daemons will be reading the file from HDFS and the file is small, setting the replication count equal to the number of slave nodes is reasonable.) One additional difference is that while Hive reads and parses the policy file for each query, Impala checks to see if the policy file has been updated every five minutes.
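To make the difference in refresh behavior concrete, here is a minimal Python sketch of the two strategies. It is only an illustration of the pattern, not how Hive or Impala are actually implemented: the class names, the load callback, and the injectable clock are all hypothetical, and the interval check is done lazily at query time rather than on a background timer to keep the example short.

```python
import time
from typing import Callable


class PerQueryPolicy:
    """Re-reads the policy for every query (the Hive-style behavior described above)."""

    def __init__(self, load: Callable[[], str]) -> None:
        self._load = load  # hypothetical callback that reads the policy file

    def policy_for_query(self) -> str:
        # Every query pays the cost of a fresh read, so edits are seen immediately.
        return self._load()


class IntervalRefreshedPolicy:
    """Caches the policy and re-reads it at most once per interval (Impala-style)."""

    def __init__(
        self,
        load: Callable[[], str],
        refresh_seconds: float = 300.0,  # five minutes, as stated in the text
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._load = load
        self._refresh_seconds = refresh_seconds
        self._clock = clock
        self._cached = load()
        self._loaded_at = clock()

    def policy_for_query(self) -> str:
        now = self._clock()
        # Only go back to the file once the refresh interval has elapsed.
        if now - self._loaded_at >= self._refresh_seconds:
            self._cached = self._load()
            self._loaded_at = now
        return self._cached
```

The practical consequence of the second strategy is that a change to the policy file may not be visible to Impala queries until the next refresh, whereas the per-query strategy picks it up on the very next query.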