With Jupyter Notebooks
Last updated
Last updated
In this tutorial, we'll show you how to use (an open-source, Postgres data warehouse) and to analyze large datasets efficiently.
Using JupySQL to combine Jupyter Notebooks with Hydra allows data scientists to easily query data in SQL directly from Jupyter. This opens up a wide array of potential use cases to combine data across a number of sources using Postgres, or access hosted data sources which could contain hundreds or thousands of gigabytes of data. Locally, you can experiment with importing and combining datasets using both Hydra's columnar tables, Postgres built-in row tables, and external data using FDWs.
To run this tutorial, you need to install the following Python packages:
Note: to run the \copy
command, you need pgspecial<2
, we recommend you getting it via conda install
; since pip install
might give you some issues.
You also need Docker installed and running to start the Hydra instance.
Let's fetch the docker image and start the Hydra instance. The next command will take a couple of minutes as it needs to fetch the image from the repository:
Console output (1/1):
Console output (1/1):
As you can see, this dataset contains ~1.4M rows.
Note: to execute SQL on Jupyter via JupySQL, we must add %%sql
at the beginning of the cell:
Console output (1/2):
Console output (2/2):
Let's now upload the data:
Console output (1/2):
Console output (2/2):
Let's now query our data:
Console output (1/2):
Console output (2/2):
1369769
We see that the ~1.4M are there. Let's take a look at the first rows:
Console output (1/2):
Console output (2/2):
1
2021-01-01 00:30:10
2021-01-01 00:36:12
1.0
2.1
1.0
False
142
43
2
8.0
3.0
0.5
0.0
0.0
0.3
11.8
2.5
None
1
2021-01-01 00:51:20
2021-01-01 00:52:19
1.0
0.2
1.0
False
238
151
2
3.0
0.5
0.5
0.0
0.0
0.3
4.3
0.0
None
1
2021-01-01 00:43:30
2021-01-01 01:11:06
1.0
14.7
1.0
False
132
165
1
42.0
0.5
0.5
8.65
0.0
0.3
51.95
0.0
None
Hydra allows us to perform analytical queries efficiently. Let's compute the average trip distance given the passenger count:
Console output (1/2):
Console output (2/2):
0.0
2.5424466811344758
1.0
2.6805563237138753
2.0
2.7948325921160876
3.0
2.7576410606577899
4.0
2.8681984015618327
5.0
2.6940995207308051
6.0
2.5745177825092658
7.0
11.1340000000000000
8.0
1.05000000000000000000
None
29.6651257727346673
JupySQL comes with powerful plotting capabilities. Let's create a histogram of trip distance:
Console output (1/2):
Console output (2/2):
We cannot see much since there are some outliers. Let's find the 99th percentile:
Console output (1/2):
Console output (2/2):
19.24
Now, let's use this value as a cutoff:
Console output (1/1):
Console output (1/2):
Console output (2/2):
Much better! We just created a histogram of 1.4M observations!
To finish the tutorial, let's shut down the container:
Console output (1/1):
Console output (1/1):
Console output (1/1):
Console output (1/1):
Console output (1/1):
For reproducibility, these are the package versions we used:
Console output (1/1):
Console output (1/1):
Console output (1/1):
Console output (1/1):
Console output (1/1):
Now, let's fetch some sample data. We'll be using the :
We're ready to upload our data; we'll load the extension and start the database connection:
Let's create the table on Hydra. Note that Hydra uses columnar tables by default, which will optimize storage for analytical queries. Below, we add the clause explicitly for clarity.