Handling Large Data with Pandas v2, Polars, and DuckDB

Let's learn how to handle large data with Pandas, Polars, and DuckDB.

This post is based on a recent talk I gave at PyCon Italia. Feel free to check out the slide deck of my talk here.

Pandas is a fantastic Python library for cleaning, transforming, and exploring data. However, when you start working with larger datasets (think over 2GB), you might find yourself running into "OutOfMemory" errors or having your Jupyter notebook crash unexpectedly.

The good news is you don't have to torture yourself with dying Jupyter notebook kernels anymore. In the past few years, new Python packages have entered the scene to help us handle large data.

Poor guy trying to analyze a 10GB dataset locally with Pandas version 1.

In this post, we'll cover three Python packages that are used for handling large data:

  1. Pandas >= v2.0
  2. Polars
  3. DuckDB

These packages are built to run on a single-node system (e.g., your local machine) as opposed to a distributed system (e.g., Spark cluster). As we walk through each package, let's assume the primary use case is an exploratory data analysis on a 5-10GB dataset. Let's jump in!

Pandas (version 2)

In 2017, Wes McKinney, the creator of Pandas, published an article called "Apache Arrow and the 10 Things I Hate About Pandas" where he walked through the shortcomings of Pandas. The TL;DR of the article is that Pandas (version 1) was not built to handle large non-numerical data like strings. The plan was to replace Pandas' NumPy backend with a brand new columnar memory format called Apache Arrow.

In April 2023, Pandas v2.0 was released and, as promised, it introduced Apache Arrow as a backend memory format alongside NumPy. This transition to Arrow brought several big improvements to Pandas:

  • Better support of missing values
  • More efficient data types: string[pyarrow] replaces the object data type.
  • Speed! Pandas version 2.0 is a whole lot faster than version 1.0, especially for string-related operations. For example, the endswith() method is 31.6 times faster with an Apache Arrow backend than with NumPy (see the sketch below). Incredible. 🏎️
First three rows are from Marc Garcia's pandas 2.0 and the Arrow revolution. The bottom two rows are from my own benchmarking exercise using aggregation operations.
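If you're curious what that kind of string benchmark looks like in practice, here's a minimal sketch you can adapt. The series contents and size below are made up; run the %timeit lines in a Jupyter notebook to compare the two backends on your own machine.

import pandas as pd

# A throwaway string column, materialized with both backends
words = pd.Series(["alpha", "beta", "gamma", "delta"] * 1_000_000)
s_object = words.astype(object)             # classic NumPy/object backend
s_arrow = words.astype("string[pyarrow]")   # Arrow-backed string dtype (requires pyarrow)

# In a notebook, time the same operation on each backend:
# %timeit s_object.str.endswith("a")
# %timeit s_arrow.str.endswith("a")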

It's very simple to set up a Pandas DataFrame with an Apache Arrow backend. You simply need to specify "pyarrow" as the backend when you read in your data file:

import pandas as pd 

df = pd.read_csv("data.csv", dtype_backend="pyarrow")

On the surface, you may not notice any changes to your DataFrame. But if you inspect the data, you'll see that the column data types are all pyarrow 🎉.

New data types with a PyArrow backend (left), as compared to the NumPy backend (right)
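If you want to see this for yourself, here's a quick sketch that reads the same (placeholder) file with both backends and prints the resulting dtypes:

import pandas as pd

df_numpy = pd.read_csv("data.csv")                           # default NumPy backend
df_arrow = pd.read_csv("data.csv", dtype_backend="pyarrow")  # Arrow backend

print(df_numpy.dtypes)  # e.g. object, int64, float64
print(df_arrow.dtypes)  # e.g. string[pyarrow], int64[pyarrow], double[pyarrow]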

The default dtype_backend is still NumPy in order to preserve backwards compatibility. Pandas v3 is in the works, and my bet is that Apache Arrow will be the default by then.

Polars

Polars is a much newer Python package than Pandas. It was first released in 2020 by a software engineer named Ritchie Vink and has gained a ton of popularity over the past couple of years. A quick search on YouTube returns at least a dozen videos about Polars. I've counted at least 7 conference talks about Polars, all given in 2023. That's pretty wild!

A sample of the Polars conference talks I found on YouTube, all published in 2023.

Similar to Pandas v2, Polars uses Apache Arrow as its memory format and is built around "data frames" and "series" data structures. What makes Polars stand out is that it supports lazy evaluation and query optimization. The streaming API is a newer feature which allows you to process larger-than-memory data in batches. It's still under active development but seems very promising.

In my opinion, Polars' syntax is very similar to PySpark. This is great news for PySpark users because the learning curve will be very short. To demonstrate the similarities, here's a comparison of Polars and PySpark:

  • Polars: group_by(), filter(), with_columns()
  • PySpark: groupBy(), filter() or where(), withColumn()
Polars and PySpark syntaxes look very similar. 👀
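To make the comparison concrete, here's a small sketch of the same filter-and-aggregate operation in both libraries. The tiny inline DataFrame is made up, and the PySpark half assumes you already have a running SparkSession and an equivalent DataFrame called df_spark.

import polars as pl

df_pl = pl.DataFrame({"bed": [1, 2, 2, 3], "price": [100_000, 250_000, 300_000, 410_000]})

# Polars
(
  df_pl
  .filter(pl.col("bed") > 1)
  .group_by("bed")
  .agg(pl.col("price").mean())
)

# PySpark equivalent:
# from pyspark.sql import functions as F
# (
#   df_spark
#   .filter(F.col("bed") > 1)
#   .groupBy("bed")
#   .agg(F.mean("price"))
# )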

When inspecting a Polars DataFrame, you'll notice that it looks a lot like a Pandas DataFrame. The main differences are 1) a Pandas DataFrame has indices while Polars does not and 2) the datatypes of a Polars DataFrame are displayed below the column names. Similar to Pandas v2, Polars also supports a string datatype.

Polars and Pandas DataFrames have some slight differences.

With Polars' lazy evaluation, each line of code gets added to a query plan instead of being executed straightaway. This is called a LazyFrame and only gets materialized to a DataFrame once you call the collect() method. Behind the scenes, a LazyFrame query plan has several optimizations like:

  • Predicate pushdown: filtering the data as soon as possible before doing heavy operations like aggregations
  • Projection pushdown: selecting only the columns needed when loading in the dataset
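You can actually see these optimizations by printing a LazyFrame's query plan. Here's a minimal sketch (the column names are borrowed from the realtor dataset used later in this post):

import polars as pl

lazy_query = (
  pl.scan_csv("realtor_data.csv")     # nothing is read from disk yet
  .filter(pl.col("bed") > 1)          # candidate for predicate pushdown
  .select(["bed", "price"])           # candidate for projection pushdown
)

print(lazy_query.explain())           # prints the optimized query plan
# lazy_query.collect()                # only here does the query actually run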

It's pretty straightforward to activate lazy evaluation in Polars. You simply need to 1) use scan_csv() instead of read_csv() and 2) call the collect() method at the end of your code. With these two small changes, you'll likely see significant improvements in your code's runtime. For example, when I used lazy evaluation to perform a basic aggregation on a dataset, the execution time dropped from 206 milliseconds to 61 milliseconds.
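Here's a sketch of that eager-vs-lazy comparison, assuming the same realtor dataset; exact timings will vary by machine, so run the %timeit lines in a notebook:

import polars as pl

# Eager: read the whole CSV, then filter and aggregate
def eager_agg():
    return (
        pl.read_csv("realtor_data.csv")
        .filter(pl.col("bed") > 1)
        .group_by("bed")
        .agg(pl.col("price").mean())
    )

# Lazy: build a query plan that Polars optimizes before executing
def lazy_agg():
    return (
        pl.scan_csv("realtor_data.csv")
        .filter(pl.col("bed") > 1)
        .group_by("bed")
        .agg(pl.col("price").mean())
        .collect()
    )

# In a notebook:
# %timeit eager_agg()
# %timeit lazy_agg()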

If you want to layer on Polars' streaming capabilities, you simply need to pass streaming=True into the collect() method like this:

import polars as pl

(
  pl.scan_csv("realtor_data.csv")        # lazily scan the CSV instead of reading it eagerly
  .filter(pl.col("bed") > 1)
  .group_by("bed")
  .agg(pl.col("price").mean())
  .collect(streaming=True)               # execute the optimized plan in batches
)

The streaming API allows for larger-than-memory data processing because it executes the query plan in batches rather than all at once. This feature is still in active development so it doesn't support all Polars operations yet. That being said, it's comprehensive enough to support most use cases and is certainly worth trying out! Check out the documentation here.

DuckDB

DuckDB is an in-process online analytical processing (OLAP) database. It was created in 2019 by Mühleisen and Raasveldt and uses a columnar-vectorized SQL query processing engine. DuckDB offers client APIs in several languages (Julia, R, Node.js, etc.) but its primary focus is on the Python API.

In my opinion, DuckDB is a great option for SQL enthusiasts. To get things started, you first need to load your dataset into DuckDB:

import duckdb

df = duckdb.read_csv("data.csv")

Now you can use SQL syntax on your newly created relation, df:
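For example, a minimal sketch that builds on the snippet above (bed and price are stand-in column names):

# The relation `df` created above can be referenced directly in SQL
duckdb.sql("SELECT bed, AVG(price) AS avg_price FROM df GROUP BY bed").show()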

DuckDB uses general-purpose data types, so string columns are represented as varchar and floats are represented as double.

The output of duckdb.sql() is a DuckDBPyRelation object which is a symbolic representation of a SQL query and does not store any actual data. This is pretty neat because you're able to experiment with queries and inspect the first 5 rows of its output without having to materialize your SQL query into a proper DataFrame.

Once you're ready to convert your query into a Pandas or Polars DataFrame, you simply need to append .to_df() or .pl(), respectively.
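Continuing with the hypothetical query from above, that looks something like this:

rel = duckdb.sql("SELECT bed, AVG(price) AS avg_price FROM df GROUP BY bed")

pandas_df = rel.to_df()   # materialize as a Pandas DataFrame
polars_df = rel.pl()      # materialize as a Polars DataFrame
arrow_tbl = rel.arrow()   # or as an Apache Arrow table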

One of my favorite features of DuckDB is that you can store queries as variables and reference them in other queries. This blew my mind!
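Here's a small sketch of what that looks like, again with the made-up bed and price columns:

# Store an intermediate query as a Python variable...
multi_bedroom_homes = duckdb.sql("SELECT bed, price FROM df WHERE bed > 1")

# ...and reference that variable by name inside another query
duckdb.sql("""
    SELECT bed, AVG(price) AS avg_price
    FROM multi_bedroom_homes
    GROUP BY bed
""").show()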

I can see myself using this feature to keep my SQL queries clean and readable. DuckDB for the win. 👏

DuckDB uses PostgreSQL-style syntax, which offers a wide range of functions. If you ever need a function that isn't built in, DuckDB lets you create a Python user-defined function (UDF).

Python 🤝 SQL via UDFs
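As a rough sketch of how that works (the shout function is just a made-up example; check the DuckDB docs for the full create_function signature):

import duckdb
from duckdb.typing import VARCHAR

def shout(text: str) -> str:
    # Made-up helper: upper-case the input and add an exclamation mark
    return text.upper() + "!"

con = duckdb.connect()
con.create_function("shout", shout, [VARCHAR], VARCHAR)   # register the Python UDF
con.sql("SELECT shout('duckdb') AS greeting").show()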

The best part about DuckDB is that it's interoperable with Apache Arrow. This makes it super easy and efficient to switch your analysis from DuckDB to Polars to Pandas v2.

Side-by-Side Comparison

So far, we've delved into Pandas, Polars, and DuckDB separately. Let's now take a step back and do a high-level side-by-side comparison of these packages.

General Stats

Data queried from GitHub on May 15th, 2024.

Pandas is by far the most mature package of the three. It's been around since 2008, in contrast to Polars and DuckDB which were only introduced in the past four to five years. Polars and DuckDB are still very new so these packages haven't reached their version 1 release yet. From what I've heard, this will soon change as Polars and DuckDB are both planning major releases in the next few months.

Downloads

Pandas also has the most downloads and GitHub stars, which is a strong indicator of its popularity over the past 16 years. I was curious to see the download trends by continent and found it pretty interesting that Pandas and DuckDB have a primarily North American user base, while Polars' usership is spread more evenly across regions.

Data queried from PyPI Stats on May 15th, 2024.

Interestingly, Pandas has significantly more contributors than Polars and DuckDB (3.2K contributors vs. 306 and 417, respectively). Despite their smaller contributor bases, Polars and DuckDB have each turned into full-fledged companies with a team dedicated to maintaining the code base.

Features

When comparing the features of these packages, Polars and DuckDB offer lazy evaluation, query optimization, and multi-threading support while Pandas does not. DuckDB's interface is SQL-oriented while Pandas and Polars are more Pythonic.

Benchmarking

Benchmarking is not always a reliable measurement of a package's performance. Proceed with caution!

It's really hard to run fair benchmarks. In fact, the DuckDB team wrote a paper all about the pitfalls of benchmarking. Still, benchmarking is sometimes helpful for getting a sense of how well a package performs.

Polars has published its benchmark results on its website, along with the code you can run to reproduce them. In the figure below, you can see that Polars and DuckDB both outperform Pandas by a long shot. Polars and DuckDB query performance is comparable, and in some cases DuckDB is actually faster than Polars.

Benchmarking results brought to you by the Polars team.

I did my own very, very unofficial version of a benchmark test and had a similar finding: Polars and DuckDB are super fast compared to Pandas v2.

I was particularly surprised to see the performance on the aggregated mean operations: Pandas took ~45 seconds to run, while Polars and DuckDB had a runtime on the order of milliseconds. In fact, with DuckDB, if you just inspect the output as a DuckDBPyRelation object, the code will run on the order of microseconds (!!!). That's because we save time by not materializing the entire output to a DataFrame or Arrow object.
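In rough terms, the split looks like this (reusing the hypothetical relation from the DuckDB section; run the %timeit lines in a notebook if you want to reproduce the comparison):

rel = duckdb.sql("SELECT bed, AVG(price) AS avg_price FROM df GROUP BY bed")   # builds the plan only
out = rel.to_df()                                                              # materialization does the actual work

# %timeit duckdb.sql("SELECT bed, AVG(price) AS avg_price FROM df GROUP BY bed")
# %timeit duckdb.sql("SELECT bed, AVG(price) AS avg_price FROM df GROUP BY bed").to_df()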

So, which package is best?

By now, you should have a pretty good sense of what Pandas v2, Polars, and DuckDB are all about. You might be wondering where to start and which package to use first. Each package has its pros and cons, and you should take into consideration several factors when making a decision:

  • Size of the data. If you're dealing with a particularly large dataset, you might want to consider using Polars or DuckDB which run faster than Pandas v2.
  • Learning curve and amount of spare time to learn a new syntax.
  • Knowledge base of contributors and peer reviews. If you're working in industry and need your peers to review your code, you might want to consider using a package that your reviewers are familiar with.

That being said, I strongly recommend that you try them all! Pandas, Polars, and DuckDB all have excellent documentation and learning materials to get you up to speed.

Good luck!

Pandas, Polars, and DuckDB trifecta. Courtesy of MidJourney.