Using Ibis with Impala
One goal of Ibis is to provide an integrated Python API for an Impala cluster without requiring you to switch back and forth between Python code and the Impala shell (where one would be using a mix of DDL and SQL statements).
If you find an Impala task that you cannot perform with Ibis, please get in touch on the GitHub issue tracker.
While interoperability between the Hadoop / Spark ecosystems and pandas / the PyData stack is overall poor (but improving), we also show some ways that you can use pandas with Ibis and Impala.
The Impala client object
To use Ibis with Impala, you must first connect to a cluster using the ibis.impala.connect function, optionally supplying an HDFS connection:
import ibis
hdfs = ibis.hdfs_connect(host=webhdfs_host, port=webhdfs_port)
client = ibis.impala.connect(host=impala_host, port=impala_port,
                             hdfs_client=hdfs)
You can accomplish many tasks directly through the client object, but we additionally provide objects to streamline tasks involving a single Impala table or database.
If you’re doing analytics on a single table, you can get going by using the table method on the client:
table = client.table(table_name, database=db_name)
Database and Table objects
ImpalaClient.database([name]) | Create a Database object for a given database name that can be used for
ImpalaClient.table(name[, database]) | Create a table expression that references a particular table in the
The client’s table method allows you to create an Ibis table expression referencing a physical Impala table:
In [1]: table = client.table('functional_alltypes', database='ibis_testing')
While you can get by fine with only table and client objects, Ibis has a notion of a “database object” that simplifies interactions with a single Impala database. It also gives you IPython tab completion of table names (that are valid Python variable names):
In [2]: db = client.database('ibis_testing')
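To make the tab-completion remark concrete, here is a minimal sketch of how attribute-style access to table names can work in plain Python. It is not Ibis's actual implementation: the DatabaseSketch class, its table method returning a string, and the table name 2015_sales are all hypothetical and exist only for illustration. The sketch assumes the completer consults dir() on the object, which is how IPython typically discovers attribute names.

```python
import keyword


class DatabaseSketch:
    """Hypothetical stand-in for a database object (not the Ibis class)."""

    def __init__(self, name, table_names):
        self.name = name
        self._table_names = list(table_names)

    def _completable(self):
        # Only names that are valid, non-keyword Python identifiers
        # can be reached with dotted attribute access.
        return [
            t for t in self._table_names
            if t.isidentifier() and not keyword.iskeyword(t)
        ]

    def __dir__(self):
        # Advertise completable table names alongside the normal attributes.
        return sorted(set(super().__dir__()) | set(self._completable()))

    def __getattr__(self, attr):
        # Invoked only when normal attribute lookup fails.
        if attr.startswith("_"):
            raise AttributeError(attr)
        if attr in self._completable():
            return self.table(attr)
        raise AttributeError(attr)

    def table(self, table_name):
        # A real implementation would return a table expression;
        # this sketch returns the qualified name as a string.
        return f"{self.name}.{table_name}"


db_sketch = DatabaseSketch("ibis_testing", ["functional_alltypes", "2015_sales"])
qualified = db_sketch.functional_alltypes
```

In this sketch a name such as 2015_sales never appears in completion because it is not a valid identifier, but it stays reachable through the explicit table method.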
In [3]: db