Highlights of the Ibis 1.3 release
Ibis 1.3 was just released, after 8 months of development work, with 104 new commits from 16 unique contributors. What is new? In this blog post we will discuss some of the important features of this new version!
First, if you are new to the Ibis framework world, you can check out this blog post I wrote last year, which gives some introductory information about it.
Some highlighted features of this new version are:
- Addition of a `PySpark` backend
- Improvement of geospatial support
- Addition of `JSON`, `JSONB` and `UUID` data types
- Initial support for Python 3.8 added and support for Python 3.5 dropped
- Added new backends and geospatial methods to the documentation
- Renamed the `mapd` backend to `omniscidb`
This blog post is divided into different sections:
- OmniSciDB
- PostgreSQL
- PySpark
- Geospatial support
- Python versions support
import ibis
import pandas as pd
OmniSciDB
The `mapd` backend is now named `omniscidb`!

An important feature of `omniscidb` is that you can now define whether the connection uses IPC (Inter-Process Communication), and you can also specify the GPU device ID you want to use (if you have an NVIDIA card supported by `cudf`).

IPC is used to provide shared data support between processes. OmniSciDB uses Apache Arrow to provide IPC support.
con_omni = ibis.omniscidb.connect(
host='localhost',
port='6274',
user='admin',
password='HyperInteractive',
database='ibis_testing',
ipc=False,
gpu_device=None
)
con_omni.list_tables()
You can also now pass `ipc` or `gpu_device` directly to the `execute` method:
t = con_omni.table('functional_alltypes')
expr = t[['id', 'bool_col']].head(5)
df = expr.execute(ipc=False, gpu_device=None)
df
As you can imagine, the result, `df`, is a `pandas.DataFrame`:
type(df)
But if you are using `gpu_device`, the result would be a `cudf` DataFrame :)
Note: when `ipc=True` is used, the code needs to be executed on the same machine where the database is running.

Note: when `gpu_device` is used, 1) it uses IPC (see the note above) and 2) it needs an NVIDIA card supported by `cudf`.
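For example, here is a hedged sketch of a GPU execution call (not run here); it assumes you are on the database host, with an NVIDIA card and `cudf` installed, and that GPU device 0 is available:

# A sketch only: requires running on the database host, with an NVIDIA GPU and cudf installed.
# Device id 0 is just an example.
df_gpu = expr.execute(gpu_device=0)
type(df_gpu)  # expected to be a cudf DataFrame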
Another interesting feature is that `omniscidb` now also supports `shapefile` input and `geopandas` output!
Check out the Geospatial support section below to see more details!
The new version also adds translations for more window operations for the `omniscidb` backend, such as: `DenseRank`, `RowNumber`, `MinRank`, `Count` and `PercentRank`/`CumeDist`.
For more information about window operations, check the Window functions documentation section.
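As a quick illustration, here is a minimal sketch of a window expression over the same `functional_alltypes` table; it assumes the `rank`/`dense_rank` column methods map to the `MinRank`/`DenseRank` operations listed above:

# Minimal window-function sketch (rank()/dense_rank() assumed to map to MinRank/DenseRank).
w = ibis.window(group_by=t.string_col, order_by=t.id)
rank_expr = t[
    t.string_col,
    t.id,
    t.id.rank().over(w).name('rank'),
    t.id.dense_rank().over(w).name('dense_rank'),
]
rank_expr.head(5).execute()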
PostgreSQL
Some of the highlighted features of the `PostgreSQL` backend are the newly supported data types: `JSON`, `JSONB` and `UUID`.
from uuid import uuid4
uuid_value = ibis.literal(uuid4(), type='uuid')
uuid_value == ibis.literal(uuid4(), type='uuid')
import json
json_value = ibis.literal(json.dumps({"id": 1}), type='json')
json_value
jsonb_value = ibis.literal(json.dumps({"id": 1}).encode('utf8'), type='jsonb')
jsonb_value
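For completeness, you can inspect the Ibis data type attached to each of these literals:

# Inferred Ibis data types of the literals defined above.
uuid_value.type(), json_value.type(), jsonb_value.type()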
Another important new feature of the `PostgreSQL` backend is support for new `geospatial` operations, such as:
- GeometryType
- GeometryN
- IsValid
- LineLocatePoint
- LineMerge
- LineSubstring
- OrderingEquals
- Union
Also, it now has support for two geospatial data types: `MULTIPOINT` and `MULTILINESTRING`.
Check out the Geospatial support section below to see some usage examples of geospatial operations!
PySpark
This new version also includes support for a new backend: PySpark!
Let's take the first steps with this new backend, starting by creating a Spark session.
import os
import pyspark
from pyspark.sql import SparkSession
import pyspark.sql.types as pt
from pathlib import Path
# spark session and pyspark connection
spark_session = SparkSession.builder.getOrCreate()
con_pyspark = ibis.pyspark.connect(session=spark_session)
We can use either `spark` or `pandas` to read from a CSV file. In this example, we will use `pandas`.
data_directory = Path(
os.path.join(
os.path.dirname(ibis.__path__[0]),
'ci',
'ibis-testing-data'
)
)
pd_df_alltypes = pd.read_csv(data_directory / 'functional_alltypes.csv')
pd_df_alltypes.info()
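As a side note, the same file could also be read directly with Spark; here is a minimal sketch, letting Spark infer the schema instead of the explicit casts used below:

# Alternative (not used below): read the CSV directly with Spark and let it infer the schema.
spark_df = spark_session.read.csv(
    str(data_directory / 'functional_alltypes.csv'),
    header=True,
    inferSchema=True,
)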
Now we can create a Spark DataFrame and register a temporary view from it. We should also enforce the desired type for each column.
def pyspark_cast(df, col_types):
for col, dtype in col_types.items():
df = df.withColumn(col, df[col].cast(dtype))
return df
ps_df_alltypes = spark_session.createDataFrame(pd_df_alltypes)
ps_df_alltypes = pyspark_cast(
ps_df_alltypes, {
'index': 'integer',
'Unnamed: 0': 'integer',
'id': 'integer',
'bool_col': 'boolean',
'tinyint_col': 'byte',
'smallint_col': 'short',
'int_col': 'integer',
'bigint_col': 'long',
'float_col': 'float',
'double_col': 'double',
'date_string_col': 'string',
'string_col': 'string',
'timestamp_col': 'timestamp',
'year': 'integer',
'month': 'integer'
}
)
# use ``SparkSession`` to create a table
ps_df_alltypes.createOrReplaceTempView('functional_alltypes')
con_pyspark.list_tables()
Check if all columns were created with the desired data type:
t = con_pyspark.table('functional_alltypes')
t
Unlike a SQL backend, which returns a SQL statement, the PySpark `compile` method returns a PySpark `DataFrame`:
expr = t.head()
expr_comp = expr.compile()
type(expr_comp)
expr_comp
To convert the compiled expression to a pandas `DataFrame`, you can use the `toPandas` method. The result should be the same as that returned by the `execute` method.
assert all(expr.execute() == expr_comp.toPandas())
expr.execute()
To finish this section, we can play a little bit with some aggregation operations.
expr = t
expr = expr.groupby('string_col').aggregate(
int_col_mean=t.int_col.mean(),
int_col_sum=t.int_col.sum(),
int_col_count=t.int_col.count(),
)
expr.execute()
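The same idea can be combined with a filter; here is a hedged sketch (assuming boolean filtering is supported by the new backend) that restricts the aggregation to rows where `bool_col` is true:

# Sketch: aggregate only the rows where bool_col is True.
filtered = t[t.bool_col]
agg_expr = filtered.groupby('string_col').aggregate(
    int_col_mean=filtered.int_col.mean(),
)
agg_expr.execute()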
Check out the PySpark Ibis backend API documentation and the tutorials for more details.
Geospatial support
Currently, `ibis.omniscidb` and `ibis.postgres` are the only Ibis backends that support geospatial features.
In this section we will check some geospatial features using the `PostgreSQL` backend.
con_psql = ibis.postgres.connect(
host='localhost',
port=5432,
user='postgres',
password='postgres',
database='ibis_testing'
)
con_psql.list_tables()
Two important features are that it supports `shapely` geometries as input and `geopandas` DataFrames as output!
So, let's import `shapely` to create a simple point and a polygon.
import shapely
shp_point = shapely.geometry.Point((20, 10))
shp_point
shp_polygon_1 = shapely.geometry.Polygon([(20, 10), (40, 30), (40, 20), (20, 10)])
shp_polygon_1
Now, let's create an Ibis table expression to manipulate the "geo" table:
t_geo = con_psql.table('geo')
df_geo = t_geo.execute()
df_geo
And the type of `df_geo` is... a `geopandas` DataFrame!
type(df_geo)
So you can take advantage of GeoPandas features too!
df_geo.set_geometry('geo_multipolygon').head(1).plot();
Now, let's check whether any `geo_multipolygon` contains the `shapely` point we just created.
t_geo[
t_geo.geo_multipolygon,
t_geo['geo_multipolygon'].contains(shp_point).name('contains_point_1')
].execute()
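To close the section, here is one more hedged sketch with a couple of the newly supported operations applied to the same column; the snake_case method names (`geometry_type`, `is_valid`) are assumed here to correspond to the `GeometryType` and `IsValid` operations listed earlier:

# Assumed method names: geometry_type() and is_valid() for the GeometryType/IsValid operations.
t_geo[
    t_geo['geo_multipolygon'].geometry_type().name('geom_type'),
    t_geo['geo_multipolygon'].is_valid().name('valid'),
].execute()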
Final words
Do you want to play more with Ibis framework?
You can install it from PyPI:
python -m pip install --upgrade ibis-framework==1.3.0
Or from conda-forge:
conda install ibis-framework=1.3.0 -c conda-forge
Check out some interesting tutorials to help you get started with Ibis: https://docs.ibis-project.org/tutorial.html. If you are coming from the SQL world, maybe the Ibis for SQL Programmers documentation section will be helpful. Have fun!