With Jupyter Notebooks
In this tutorial, we'll show you how to use Hydra (an open-source, Postgres data warehouse) and JupySQL to analyze large datasets efficiently.
Using JupySQL to combine Jupyter Notebooks with Hydra allows data scientists to easily query data in SQL directly from Jupyter. This opens up a wide array of potential use cases to combine data across a number of sources using Postgres, or access hosted data sources which could contain hundreds or thousands of gigabytes of data. Locally, you can experiment with importing and combining datasets using both Hydra's columnar tables, Postgres built-in row tables, and external data using FDWs.
Requirements
To run this tutorial, you need to install the following Python packages:
Note: to run the \copy
command, you need pgspecial<2
, we recommend you getting it via conda install
; since pip install
might give you some issues.
You also need Docker installed and running to start the Hydra instance.
Starting a Hydra instance
Let's fetch the docker image and start the Hydra instance. The next command will take a couple of minutes as it needs to fetch the image from the repository:
Console output (1/1):
Data download
Now, let's fetch some sample data. We'll be using the NYC taxi dataset:
Console output (1/1):
As you can see, this dataset contains ~1.4M rows.
Uploading data to Hydra
We're ready to upload our data; we'll load the JupySQL extension and start the database connection:
Let's create the table on Hydra. Note that Hydra uses columnar tables by default, which will optimize storage for analytical queries. Below, we add the USING columnar
clause explicitly for clarity.
Note: to execute SQL on Jupyter via JupySQL, we must add %%sql
at the beginning of the cell:
Console output (1/2):
Console output (2/2):
Let's now upload the data:
Console output (1/2):
Console output (2/2):
Let's now query our data:
Console output (1/2):
Console output (2/2):
1369769
We see that the ~1.4M are there. Let's take a look at the first rows:
Console output (1/2):
Console output (2/2):
1
2021-01-01 00:30:10
2021-01-01 00:36:12
1.0
2.1
1.0
False
142
43
2
8.0
3.0
0.5
0.0
0.0
0.3
11.8
2.5
None
1
2021-01-01 00:51:20
2021-01-01 00:52:19
1.0
0.2
1.0
False
238
151
2
3.0
0.5
0.5
0.0
0.0
0.3
4.3
0.0
None
1
2021-01-01 00:43:30
2021-01-01 01:11:06
1.0
14.7
1.0
False
132
165
1
42.0
0.5
0.5
8.65
0.0
0.3
51.95
0.0
None
Hydra allows us to perform analytical queries efficiently. Let's compute the average trip distance given the passenger count:
Console output (1/2):
Console output (2/2):
0.0
2.5424466811344758
1.0
2.6805563237138753
2.0
2.7948325921160876
3.0
2.7576410606577899
4.0
2.8681984015618327
5.0
2.6940995207308051
6.0
2.5745177825092658
7.0
11.1340000000000000
8.0
1.05000000000000000000
None
29.6651257727346673
JupySQL comes with powerful plotting capabilities. Let's create a histogram of trip distance:
Console output (1/2):
Console output (2/2):
We cannot see much since there are some outliers. Let's find the 99th percentile:
Console output (1/2):
Console output (2/2):
19.24
Now, let's use this value as a cutoff:
Console output (1/1):
Console output (1/2):
Console output (2/2):
Much better! We just created a histogram of 1.4M observations!
Where to go from here
Clean up
To finish the tutorial, let's shut down the container:
Console output (1/1):
Console output (1/1):
Console output (1/1):
Console output (1/1):
Console output (1/1):
Package versions
For reproducibility, these are the package versions we used:
Console output (1/1):
Console output (1/1):
Console output (1/1):
Console output (1/1):
Console output (1/1):
Last updated