The new Python client for Impala will bring smiles to Pythonistas!

As a data scientist, I love using the Python data stack. I also love using Impala to work with very large data sets. But anything that takes me out of my Python workflow is a hassle, so it’s annoying that my main options for working with Impala are to write shell scripts, use the Impala shell, or shuttle query results through local files on disk.

To remedy this, we have written the (unpredictably named) impyla Python package (not officially supported). Impyla communicates with Impala using the same standard Impala protocol as the ODBC/JDBC drivers. This RPC library is then wrapped by the commonly used database API specified in PEP 249 (“DB API v2.0”). Below you’ll find a quick tour of its functionality. Note that this is still a 0.x.y release: the PEP 249 client is beta, while the sklearn and udf submodules are pre-alpha.

Installation

Impyla works with Python 2.6 and 2.7 (Python 3 support is planned). At a minimum, it depends on the Thrift Python package. Install it with pip:

$ [sudo] pip install impyla

or clone the repository.

Querying Impala from Python

Start by connecting to the Impala instance and getting a cursor. (Learn more about the API by checking out PEP 249.)

>>> from impala.dbapi import connect
>>> conn = connect(host='my.impala.host', port=21050)
>>> cur = conn.cursor()

Note: make sure to set the port to the HS2 service, rather than the Beeswax service. The default in Cloudera-managed clusters is port 21050 for HS2. (The Impala shell defaults to port 21000, which is for Beeswax.)
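Once you have a cursor, the rest is plain PEP 249. Here is a minimal sketch of running a query and pulling back the column names and rows. The helper names fetch_rows and query_impala are hypothetical (they are not part of impyla), and the host is the same placeholder used above. fetch_rows relies only on the standard DB API calls (cursor(), execute(), description, fetchall(), close()), so it should work with any PEP 249 connection.

```python
from contextlib import closing


def fetch_rows(conn, sql):
    """Run `sql` on any PEP 249 connection; return (column_names, rows)."""
    with closing(conn.cursor()) as cur:
        cur.execute(sql)
        # cursor.description holds one 7-item sequence per column;
        # the first item of each is the column name.
        columns = [col[0] for col in cur.description]
        rows = cur.fetchall()
    return columns, rows


def query_impala(sql, host='my.impala.host', port=21050):
    """Connect to the HS2 port, run one query, and close the connection.

    The host is a placeholder; 21050 is the HS2 default noted above.
    """
    # Imported lazily so fetch_rows stays usable without impyla installed.
    from impala.dbapi import connect
    conn = connect(host=host, port=port)
    try:
        return fetch_rows(conn, sql)
    finally:
        conn.close()
```

Calling query_impala('SHOW TABLES') against a reachable cluster should hand back the column names along with a list of row tuples. Keeping the query logic in fetch_rows means you can try it out against any other DB API connection before pointing it at Impala.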