Ibis: Python Data Analysis Productivity Framework
Ibis is a toolbox that bridges the gap between local Python environments (such as pandas and scikit-learn) and remote storage and execution systems, including Hadoop components (HDFS, Impala, Hive, Spark) and SQL databases (Postgres, etc.). Its goal is to simplify analytical workflows and make you more productive.
We have a handful of specific priority focus areas:
- Enable data analysts to translate local, single-node data idioms to scalable computation representations (e.g. SQL or Spark)
- Integrate with pandas and other Python data ecosystem components
- Provide high-level analytics APIs and workflow tools to enhance productivity and streamline common or tedious tasks
- Integrate with community standard data formats (e.g. Parquet and Avro)
- Abstract away database-specific SQL differences
As the Apache Arrow project develops, we will look to use Arrow to enable computational code written in Python to be executed natively within other systems like Apache Spark and Apache Impala (incubating).
To learn more about Ibis’s vision, roadmap, and updates, please follow http://ibis-project.org.
Source code is on GitHub: http://github.com/pandas-dev/ibis
Install Ibis from PyPI with:
pip install ibis-framework
Or from conda-forge with:
conda install ibis-framework -c conda-forge
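After either install command, one way to confirm that the package is visible to the current interpreter is to query the installed distribution metadata. The snippet below is a minimal sketch using only the Python standard library (Python 3.8 or newer); the helper name `installed_version` is hypothetical and not part of Ibis.

```python
from importlib import metadata


def installed_version(distribution="ibis-framework"):
    """Return the installed version string, or None if the distribution is absent."""
    try:
        return metadata.version(distribution)
    except metadata.PackageNotFoundError:
        return None


version = installed_version()
if version is None:
    print("ibis-framework is not installed in this environment")
else:
    print("ibis-framework version:", version)
```

The lookup uses the distribution name passed to pip or conda (`ibis-framework`), which differs from the name of the module you import.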
At this time, Ibis offers some level of support for the following systems:
- Apache Impala (incubating)
- Apache Kudu (incubating)
- Hadoop Distributed File System (HDFS)
- PostgreSQL (Experimental)
- SQLite
- Direct execution of Ibis expressions against pandas objects (Experimental)
Coming from SQL? Check out Ibis for SQL Programmers.
Architecturally, Ibis features:
- A pandas-like domain-specific language (DSL) designed specifically for analytics, aka Ibis expressions, that enables composable, reusable analytics on structured data. If you can express something with a SQL SELECT query, you can write it with Ibis.
- Integrated user interfaces to HDFS and other storage systems.
- An extensible translator-compiler system that targets multiple SQL systems.
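To make the idea of a deferred expression compiled for several SQL systems more concrete, here is a toy sketch in plain Python. It is not Ibis's API or implementation: the `Query` class, the `compile_query` function, and the `DIALECTS` table are hypothetical names invented for this illustration, and the example only covers column selection and a single comparison filter. It assumes nothing beyond the standard library and runs the SQLite variant against an in-memory database.

```python
import sqlite3
from dataclasses import dataclass, replace
from typing import Optional, Tuple

# Illustrative dialect table (an assumption of this sketch, not Ibis internals).
# Identifier quoting follows each database; placeholders follow the usual
# Python driver convention (sqlite3 uses "?", psycopg2 and PyMySQL use "%s").
DIALECTS = {
    "sqlite": {"quote": '"', "placeholder": "?"},
    "postgres": {"quote": '"', "placeholder": "%s"},
    "mysql": {"quote": "`", "placeholder": "%s"},
}
ALLOWED_OPS = {"=", "!=", "<", "<=", ">", ">="}


@dataclass(frozen=True)
class Query:
    """A deferred query description; nothing runs until it is compiled."""
    table: str
    columns: Tuple[str, ...] = ()
    predicate: Optional[Tuple[str, str, object]] = None  # (column, op, value)

    def select(self, *columns):
        # Return a new Query so that expressions stay composable and reusable.
        return replace(self, columns=columns)

    def where(self, column, op, value):
        if op not in ALLOWED_OPS:
            raise ValueError(f"unsupported operator: {op!r}")
        return replace(self, predicate=(column, op, value))


def compile_query(query, dialect):
    """Translate a Query into (sql_text, parameters) for the given dialect."""
    rules = DIALECTS[dialect]
    q = rules["quote"]

    def ident(name):
        # Quote identifiers, doubling any embedded quote character.
        return f"{q}{name.replace(q, q + q)}{q}"

    cols = ", ".join(ident(c) for c in query.columns) or "*"
    sql = f"SELECT {cols} FROM {ident(query.table)}"
    params = []
    if query.predicate is not None:
        column, op, value = query.predicate
        sql += f" WHERE {ident(column)} {op} {rules['placeholder']}"
        params.append(value)
    return sql, params


# One expression, compiled for two targets.
expr = Query("events").select("user", "amount").where("amount", ">", 10)
sqlite_sql, params = compile_query(expr, "sqlite")
mysql_sql, _ = compile_query(expr, "mysql")
print(sqlite_sql)
print(mysql_sql)

# Execute the SQLite version against a small in-memory table.
con = sqlite3.connect(":memory:")
con.execute('CREATE TABLE events ("user" TEXT, amount INTEGER)')
con.executemany("INSERT INTO events VALUES (?, ?)", [("a", 5), ("b", 20), ("c", 30)])
rows = con.execute(sqlite_sql, params).fetchall()
print(rows)
con.close()
```

The point of the sketch is the separation of concerns: the expression object records what to compute, and a per-dialect compiler decides how to spell it. A real translator has to handle far more, such as joins, aggregations, type differences, and function name mismatches between engines.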
SQL engine support needing code contributors:
- Redshift
- Vertica
- Spark SQL
- Presto
- Hive
- MySQL / MariaDB
Since this is a young project, the documentation is patchy in places, but it will improve as the project matures.