Ibis for SQL Programmers
Among other things, Ibis provides a full-featured replacement for SQL SELECT queries, but expressed with Python code that is:
- Type-checked and validated as you go. No more debugging cryptic database errors; Ibis catches your mistakes right away.
- Easier to write. Pythonic function calls with tab completion in IPython.
- More composable. Break complex queries down into easier-to-digest pieces.
- Easier to reuse. Mix and match Ibis snippets to create expressions tailored for your analysis.
We intend for all SELECT queries to be fully portable to Ibis. Coverage of other statements, such as DDL and DML (e.g. CREATE TABLE or INSERT), may vary from engine to engine.
This document will use the Impala SQL compiler (i.e. ibis.impala.compile) for convenience, but the code here is portable to whichever system you are using Ibis with.
Note: If you find any SQL idioms or use cases in your work that are not represented here, please reach out so we can add more to this guide!
Projections: select/add/remove columns
All tables in Ibis are immutable. To select a subset of a table’s columns, or to add new columns, you must produce a new table by means of a projection.
In [1]: import ibis
   ...: t = ibis.table([('one', 'string'),
   ...:                 ('two', 'double'),
   ...:                 ('three', 'int32')], 'my_data')
   ...:
In [2]: t
Out[2]:
UnboundTable[table]
  name: my_data
  schema:
    one : string
    two : double
    three : int32
In SQL, you might write something like:
SELECT two, one
FROM my_data
In Ibis, this is
In [3]: proj = t['two', 'one']
or
In [4]: proj = t.projection(['two', 'one'])
This generates the expected SQL:
In [5]: print(ibis.impala.compile(proj))
SELECT `two`, `one`
FROM my_data
What about adding new columns? To form a valid projection, all column expressions must be named. Let’s look at the SQL:
SELECT two, one, three * 2 AS new_col
FROM my_data
The last expression is written:
In [6]: new_col = (t.three * 2).name('new_col')
Now, we have:
In [7]: proj = t['two', 'one', new_col]
In [8]: print(ibis.impala.compile(proj))
SELECT `two`, `one`, `three` * 2 AS `new_col`
FROM my_data
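To see what this projection returns on actual rows, the SQL above can be run against a throwaway database. The sketch below is not Ibis code: it uses Python's built-in sqlite3 module as a stand-in engine (an assumption, since this guide compiles to Impala SQL), and the sample rows are invented for illustration.

```python
import sqlite3

# In-memory SQLite database standing in for the real engine (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE my_data (one TEXT, two REAL, three INTEGER)")

# Hypothetical sample rows, made up for this illustration.
conn.executemany(
    "INSERT INTO my_data VALUES (?, ?, ?)",
    [("A", 1.5, 10), ("B", -2.0, 20)],
)

# The projection from the text: two existing columns plus a named expression.
cursor = conn.execute(
    "SELECT two, one, three * 2 AS new_col FROM my_data ORDER BY one"
)
column_names = [d[0] for d in cursor.description]
rows = cursor.fetchall()

print(column_names)
print(rows)
```

The result has exactly the columns listed in the projection, in that order, and the source table is left unchanged.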
mutate: Add or modify columns easily
Since adding new columns or modifying existing columns is so common, there is a convenience method, mutate:
In [9]: mutated = t.mutate(new_col=t.three * 2)
Notice that calling the name method was not necessary here, because the Python keyword argument provides the name. Indeed:
In [10]: print(ibis.impala.compile(mutated))
SELECT *, `three` * 2 AS `new_col`
FROM my_data
If you modify an existing column with mutate, the generated SQL lists out all the other columns explicitly:
In [11]: mutated = t.mutate(two=t.two * 2)
In [12]: print(ibis.impala.compile(mutated))
SELECT `one`, `two` * 2 AS `two`, `three`
FROM my_data
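This guide does not say why the columns are listed out in this case. One plausible reading is that SELECT * cannot replace a column: adding an expression under an existing name yields a second column with that name. The sketch below checks this with Python's built-in sqlite3 module as a stand-in engine (an assumption; other engines may behave differently), using an invented sample row and a hypothetical helper, result_columns.

```python
import sqlite3

# In-memory SQLite database standing in for the real engine (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE my_data (one TEXT, two REAL, three INTEGER)")
conn.execute("INSERT INTO my_data VALUES ('A', 1.5, 10)")  # made-up row


def result_columns(sql):
    """Hypothetical helper: return the result column names of a query."""
    return [d[0] for d in conn.execute(sql).description]


# SELECT * plus an expression reusing an existing name: the name appears twice.
star_columns = result_columns("SELECT *, two * 2 AS two FROM my_data")

# Listing the columns explicitly replaces the column in place.
explicit_columns = result_columns(
    "SELECT one, two * 2 AS two, three FROM my_data"
)

print(star_columns)
print(explicit_columns)
```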
SELECT * equivalent
Especially in combination with relational joins, it’s convenient to be able to select all columns in a table using the SELECT * construct. To do this, use the table expression itself in a projection:
In [13]: proj = t[t]
In [14]: print(ibis.impala.compile(proj))
SELECT *
FROM my_data
This is how mutate is implemented. The example above, t.mutate(new_col=t.three * 2), can be written as a normal projection:
In [15]: proj = t[t, new_col]
In [16]: print(ibis.impala.compile(proj))
SELECT *, `three` * 2 AS `new_col`
FROM my_data
Let’s consider a table we might wish to join with t:
In [17]: t2 = ibis.table([('key', 'string'),
   ....:                  ('value', 'double')], 'dim_table')
   ....:
Now let’s take the SQL:
SELECT t0.*, t0.two - t1.value AS diff
FROM my_data t0
INNER JOIN dim_table t1
ON t0.one = t1.key
In Ibis, this is written:
In [18]: diff = (t.two - t2.value).name('diff')
In [19]: joined = t.join(t2, t.one == t2.key)[t, diff]
And verify the generated SQL:
In [20]: print(ibis.impala.compile(joined))
SELECT t0.*, t0.`two` - t1.`value` AS `diff`
FROM my_data t0
INNER JOIN dim_table t1
ON t0.`one` = t1.`key`
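As a quick check of what this join produces, the same SQL can be run on a few rows. This is a sketch only, using Python's built-in sqlite3 module in place of Impala (an assumption) and invented sample data; note that the row with no match in dim_table drops out of the inner join.

```python
import sqlite3

# In-memory SQLite database standing in for the real engine (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE my_data (one TEXT, two REAL, three INTEGER)")
conn.execute("CREATE TABLE dim_table (key TEXT, value REAL)")

# Hypothetical sample rows; 'C' has no matching key in dim_table.
conn.executemany(
    "INSERT INTO my_data VALUES (?, ?, ?)",
    [("A", 5.0, 1), ("B", 3.0, 2), ("C", 9.0, 3)],
)
conn.executemany(
    "INSERT INTO dim_table VALUES (?, ?)",
    [("A", 1.0), ("B", 4.0)],
)

# All columns of the left table plus one computed column, as in the text.
joined_rows = conn.execute(
    """
    SELECT t0.*, t0.two - t1.value AS diff
    FROM my_data t0
      INNER JOIN dim_table t1
        ON t0.one = t1.key
    ORDER BY t0.one
    """
).fetchall()

print(joined_rows)
```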
Using functions in projections
If you pass a function instead of a string or Ibis expression in any projection context, it will be invoked with the “parent” table as its argument. This can help significantly when composing complex operations. Consider this SQL:
SELECT one, avg(abs(the_sum)) AS mad
FROM (
  SELECT one, three, sum(two) AS the_sum
  FROM my_data
  GROUP BY 1, 2
) t0
GROUP BY 1
This can be written as one chained expression:
In [21]: expr = (t.group_by(['one', 'three'])
   ....:         .aggregate(the_sum=t.two.sum())
   ....:         .group_by('one')
   ....:         .aggregate(mad=lambda x: x.the_sum.abs().mean()))
   ....:
Indeed:
In [22]: print(ibis.impala.compile(expr))
SELECT `one`, avg(abs(`the_sum`)) AS `mad`
FROM (
  SELECT `one`, `three`, sum(`two`) AS `the_sum`
  FROM my_data
  GROUP BY 1, 2
) t0
GROUP BY 1
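To make the two aggregation levels concrete, the nested query can be run on a handful of rows. The sketch below uses Python's built-in sqlite3 module as a stand-in engine (an assumption) and invented data: the inner query sums two per (one, three) pair, and the outer query averages the absolute sums per value of one.

```python
import sqlite3

# In-memory SQLite database standing in for the real engine (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE my_data (one TEXT, two REAL, three INTEGER)")

# Hypothetical sample rows. Inner sums: (A,1) -> 3.0, (A,2) -> -5.0, (B,1) -> -6.0
conn.executemany(
    "INSERT INTO my_data VALUES (?, ?, ?)",
    [("A", 1.0, 1), ("A", 2.0, 1), ("A", -5.0, 2), ("B", -6.0, 1)],
)

mad_rows = conn.execute(
    """
    SELECT one, avg(abs(the_sum)) AS mad
    FROM (
      SELECT one, three, sum(two) AS the_sum
      FROM my_data
      GROUP BY 1, 2
    ) t0
    GROUP BY 1
    ORDER BY 1
    """
).fetchall()

# Outer averages: A -> (3.0 + 5.0) / 2, B -> 6.0
print(mad_rows)
```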
A useful pattern is the function factory, which lets you create functions that reference a field of interest:
def mad(field):
    def closure(table):
        return table[field].abs().mean()
    return closure
Now you can do:
In [23]: expr = (t.group_by(['one', 'three'])
   ....:         .aggregate(the_sum=t.two.sum())
   ....:         .group_by('one')
   ....:         .aggregate(mad=mad('the_sum')))
   ....:
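The closure mechanics of the factory can be exercised without Ibis. In the sketch below, StandInColumn and stand_in_table are hypothetical stand-ins invented for illustration (they are not Ibis classes); they only mimic the table[field], abs() and mean() calls that the closure relies on, and compute values eagerly rather than building an expression.

```python
from statistics import mean


class StandInColumn:
    """Hypothetical stand-in for a numeric column; not part of Ibis."""

    def __init__(self, values):
        self.values = list(values)

    def abs(self):
        # Return a new column holding the absolute values.
        return StandInColumn(abs(v) for v in self.values)

    def mean(self):
        return mean(self.values)


def mad(field):
    # 'field' is captured by the inner function and looked up later,
    # when the closure is finally called with a table.
    def closure(table):
        return table[field].abs().mean()
    return closure


# A plain dict plays the role of the "parent" table here.
stand_in_table = {
    "the_sum": StandInColumn([3.0, -5.0]),
    "other": StandInColumn([1.0]),
}

mad_of_sum = mad("the_sum")          # no table involved yet
result = mad_of_sum(stand_in_table)  # the table is supplied at call time
print(result)
```

The same factory can be reused for any field name, which is what makes it convenient to pass into different aggregations.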
Filtering / WHERE
You can add filter clauses to a table expression either by indexing with [] (like pandas) or by using the filter method:
In [24]: filtered = t[t.two > 0]
In [25]: print(ibis.impala.compile(filtered))
SELECT *
FROM my_data
WHERE `two` > 0
filter can take a list of expressions, which must all be satisfied for a row to be included in the result:
In [26]: filtered = t.filter([t.two > 0,
   ....:                      t.one.isin(['A', 'B'])])
   ....:
In [27]: print(ibis.impala.compile(filtered))
SELECT *
FROM my_data
WHERE `two` > 0 AND
`one` IN ('A', 'B')
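To confirm that both conditions must hold, the generated WHERE clause can be tried on a few rows. This is a sketch with Python's built-in sqlite3 module as a stand-in engine (an assumption) and invented data in which only one row satisfies both predicates.

```python
import sqlite3

# In-memory SQLite database standing in for the real engine (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE my_data (one TEXT, two REAL, three INTEGER)")

# Hypothetical rows: 'B' fails two > 0, 'C' fails the IN test.
conn.executemany(
    "INSERT INTO my_data VALUES (?, ?, ?)",
    [("A", 1.0, 1), ("B", -1.0, 2), ("C", 2.0, 3)],
)

filtered_rows = conn.execute(
    """
    SELECT *
    FROM my_data
    WHERE two > 0 AND
          one IN ('A', 'B')
    """
).fetchall()

print(filtered_rows)
```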
To compose boolean expressions with AND or OR, use the respective & and | operators:
In [28]: cond = (t.two < 0) | ((t.two > 0) | t.one.isin(['A', 'B']))
In [29]: filtered = t[cond]
In [30]: print(ibis.impala.compile(filtered))
SELECT *
FROM my_data
WHERE ((`two` < 0) OR ((`two` > 0) OR `one` IN ('A', 'B')))
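The parentheses around each comparison in the example above matter because of Python's operator precedence: & and | bind more tightly than comparison operators such as < and >. The sketch below uses only the standard ast module to show how Python parses the two spellings; the names a and b are placeholders, and nothing here depends on Ibis.

```python
import ast

# Without parentheses, Python groups this as the chained comparison
# a < (0 | b) > 0, which is not a boolean OR of two comparisons.
unparenthesized = ast.parse("a < 0 | b > 0", mode="eval").body

# With parentheses, the top-level operation is the | operator,
# applied to two separate comparisons.
parenthesized = ast.parse("(a < 0) | (b > 0)", mode="eval").body

print(type(unparenthesized).__name__)
print(type(parenthesized).__name__, type(parenthesized.op).__name__)
```

Wrapping every comparison in its own parentheses, as the example does, avoids this grouping problem.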
Aggregation / GROUP BY
To aggregate a table, you need:
- Zero or more grouping expressions (these can be column names)
- One or more aggregation expressions
Let’s look at the aggregate method on tables:
In [31]: stats = [t.two.sum().name('total_two'),
   ....:          t.three.mean().name('avg_three')]
   ....:
In [32]: agged = t.aggregate(stats)
If you don’t use any group expressions, the result will have a single row with your statistics of interest:
In [33]: agged.schema()
Out[33]:
ibis.Schema {
total_two double
avg_three double
}
In [34]: print(ibis.impala.compile(agged))