How to Move From pandas to Polars : Evgenia Verbina

June 19, 2024

Blog post content copied from PyCharm: The Python IDE for data science and web development | The JetBrains Blog

This is a guest post from Cheuk Ting Ho, a data scientist who contributes to multiple open-source libraries, such as pandas and Polars.


You’ve probably heard about Polars – it is now firmly in the spotlight in the data science community.

Are you still using pandas and would like to try out Polars? Are you worried that it will take a lot of effort to migrate your projects from pandas to Polars? You might be concerned that Polars won’t be compatible with your existing pipeline or the other tools you are currently using.

Fear not! In this article, I will answer these questions so you can decide whether to migrate to Polars. I will also provide some tips for those of you who have already decided to migrate.

How is Polars different from pandas?

Polars is known for its speed and security, as it is written in Rust and based on Apache Arrow. For details about Polars vs. pandas, you can see our other blog post here. In short, while Polars’ backend architecture is different from pandas’, the creator and community around Polars have tried to maintain a Python API that is very similar to pandas’. At first glance, Polars code is very similar to pandas code. Fun fact – some contributors to pandas are also contributors to Polars. Due to this, the barrier for pandas users to start using Polars is relatively low. However, as it is still a different library, it is worth double-checking the differences between the two.

Advantages of using Polars

Have you struggled when using pandas for a relatively large data set? Do you think pandas is using too much RAM and slowing your computer down while working locally? Polars may solve this problem with its lazy API: intermediate steps are not executed until their results are needed, which in some cases saves the memory those intermediate results would otherwise take up.

Another advantage Polars has is that, since it is written in Rust, it can make use of concurrency much better than pandas. Python is traditionally single-threaded, and although pandas uses the NumPy backend to speed up some operations, it is still mainly written in Python and has certain limitations in its multithreading capabilities.

Tools that make the switch easy

As Polars’ popularity grows, there is more and more support for Polars in popular tools for data scientists, including scikit-learn and HoloViz.

PyCharm, the most popular IDE used by data scientists, provides a similar experience when you work with pandas and Polars. This makes the process of migration smoother. For example, interactive tables allow you to easily see the information about your DataFrame, such as the number of rows and columns.


PyCharm has an excellent pagination feature – if you want to see more results per page, you can easily configure that via a drop-down menu:

You can see the statistical summary for the data when you hover the cursor over the column name:

You can also sort the data for inspection with a few clicks in the header. You can also use the multi-sorting functionality – after sorting the table once, press and hold ⌥ (macOS) or Alt (Windows) and click on the second column you want the table to be sorted by. For example, here, we can sort by island and bill_length_mm in the table.

To get more insights from the DataFrame, you can switch to chart view with the icon on the left:

You can also change how the data is shown in the settings, showing different columns and using different graph types:

PyCharm also auto-completes methods when you use Polars, which is very handy when you are just starting out and are not yet familiar with all of the methods that it provides. To understand more about full line code completion in JetBrains IDEs, please check out this article.

You can also access the official documentation quickly by clicking the Polars icon in the top-right corner of the table, which is really handy.

How to migrate from pandas to Polars

If you’re now convinced to migrate to Polars, your final questions might be about the extent of changes needed for your existing code and how easy it is to learn Polars, especially considering your years of experience and muscle memory with pandas.

Similarities between pandas and Polars

Polars provides APIs similar to pandas, most notably read_csv(), head(), tail(), and describe() for a quick look at the data. It also provides similar data manipulation functions like join() and groupby() / group_by(), and aggregation functions like mean() and sum().

Before going into the migration, let’s look at these code examples in Polars and pandas.

Example 1 – Calculating the mean score for each class

pandas

import pandas as pd

df_student = pd.read_csv("student_info.csv")

print(df_student.dtypes)

df_score = pd.read_csv("student_score.csv")

print(df_score.head())

df_class = df_student.join(df_score.set_index("name"), on="name").drop("name", axis=1)

df_mean_score = df_class.groupby("class").mean()

print(df_mean_score)

Polars

import polars as pl

df_student = pl.read_csv("student_info.csv")

print(df_student.dtypes)

df_score = pl.read_csv("student_score.csv")

print(df_score.head())

df_class = df_student.join(df_score, on="name").drop("name")

df_mean_score = df_class.group_by("class").mean()

print(df_mean_score)

Polars provides similar I/O methods like read_csv. You can also inspect the dtypes, clean data with drop, and group rows with group_by followed by aggregation functions like mean.

Example 2 – Calculating the rolling mean of temperatures

pandas

import pandas as pd

df_temp = pd.read_csv("temp_record.csv", index_col="date", parse_dates=True, dtype={"temp":int})

print(df_temp.dtypes)

print(df_temp.head())

df_temp.rolling(2).mean()

Polars

import polars as pl

df_temp = pl.read_csv("temp_record.csv", try_parse_dates=True, dtypes={"temp":int}).set_sorted("date")

print(df_temp.dtypes)

print(df_temp.head())

df_temp.rolling("date", period="2d").agg(pl.mean("temp"))

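Polars DataFrames have no index, so the date stays a regular column: read_csv parses it with try_parse_dates (a slight difference in the function arguments), and set_sorted tells Polars that the column is already sorted. Rolling mean (or other types of aggregation) can then be computed with rolling.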

As you can see, these code examples are very similar, with only slight differences. If you are an experienced pandas user, I am sure your journey using Polars will be quite smooth.

Tips for migrating from pandas to Polars

As for code that was previously written in pandas, how can you migrate it to Polars? What are the differences in syntax that may trip you up? Here are some tips that may be useful:

Selecting and filtering

In pandas, we use .loc / .iloc and [] to select part of the data in a data frame. However, in Polars, we use .select to do so. For example, in pandas df["age"] or df.loc[:,"age"] becomes df.select("age") in Polars.

In pandas, we can also create a mask to filter out data. However, in Polars, we will use .filter instead. For example, in pandas df[df["age"] > 18] becomes df.filter(pl.col("age") > 18) in Polars.

All of the code that involves selecting and filtering data needs to be rewritten accordingly.

Use .with_columns instead of .assign

A slight difference between pandas and Polars is that, in pandas we use .assign to create new columns by applying certain logic and operations to existing columns. In Polars, this is done with .with_columns. For example:

In pandas

df_rec.assign(

    diameter = lambda df: (df.x + df.y) * 2,

    area = lambda df: df.x * df.y

)

becomes

df_rec.with_columns(

    diameter = (pl.col("x") + pl.col("y")) * 2,

    area = pl.col("x") * pl.col("y")

)

in Polars.

.with_columns can replace groupby

In addition to assigning a new column with simple logic and operations, .with_columns offers more advanced capabilities. With a little trick, you can perform operations similar to groupby in pandas by using window functions:

In pandas

df = pd.DataFrame({

    "class": ["a", "a", "a", "b", "b", "b", "b"],

    "score": [80, 39, 67, 28, 77, 90, 44],

})

df["avg_score"] = df.groupby("class")["score"].transform("mean")

becomes

df.with_columns(

    pl.col("score").mean().over("class").alias("avg_score")

)

in Polars.

Use scan_csv instead of read_csv if you can

Although read_csv also works in Polars, using scan_csv instead switches to lazy evaluation mode, so your code can benefit from the lazy API mentioned above.

Building pipelines properly with lazy API

In pandas, we usually use .pipe to build data pipelines. However, since Polars works a bit differently, especially when using the lazy API, we want the pipeline to be executed only once. So, we need to adjust the code accordingly. For example:

Instead of this pandas code snippet:

def discount(df):

    df["30_percent_off"] = df["price"] * 0.7

    return df

def vat(df):

    df["vat"] = df["price"] * 0.2

    return df

def total_cost(df):

    df["total"] = df["30_percent_off"] + df["vat"]

    return df

(df

 .pipe(discount)

 .pipe(vat)

 .pipe(total_cost)

)

We will have the following one in Polars:

def discount(input_col):

    return pl.col(input_col).mul(0.7).alias("30_percent_off")

def vat(input_col):

    return pl.col(input_col).mul(0.2).alias("vat")

def total_cost(input_col1, input_col2):

    return pl.col(input_col1).add(pl.col(input_col2)).alias("total")

# Expressions in one with_columns call cannot see each other's new columns,

# so "total" is computed in a second call.

df.with_columns(

    discount("price"),

    vat("price"),

).with_columns(

    total_cost("30_percent_off", "vat"),

)

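Missing data: null instead of NaN

Do you find NaN in pandas confusing? In Polars, missing data is always represented as null rather than NaN. pandas inherits NaN as its missing-value marker from NumPy; Polars is built on Apache Arrow instead, where missing values are null. NaN still exists in Polars as an ordinary floating-point value, but it is not treated as missing data. For details about null and NaN in Polars, check out the documentation.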

Exploratory data analysis with Polars

Polars provides a similar API to pandas, and with hvPlot, you can easily do simple plotting as part of exploratory data analysis in Polars. Here I will show two examples, one creating simple statistical information from your data set, and the other plotting simple graphs to understand the data.

Summary statistics from dataset

When using pandas, the most common way to get a summary statistic is to use describe. In Polars, we can also use describe in a similar manner. For example, we have a DataFrame with some numerical data and missing data:

We can use describe to get summary statistics:

Notice how object types are treated – in this example, the column name gives a different result compared to pandas. In pandas, a column with object type will result in categorical data like this:

In Polars, the result is similar to numeric data, which makes less sense:

Simple plotting with Polars DataFrame

To better visualize the data, we might want to plot some graphs to help us evaluate the data more efficiently. Here is how to do so with the plot method in Polars.

First of all, since Polars uses hvPlot as backend, make sure that it is installed. You can find the hvPlot User Guide here. Next, since hvPlot will output the graph as an interactive Bokeh graph, we need to use output_notebook from bokeh.plotting to make sure it will show inline in the notebook. Add this code at the top of your notebook:

from bokeh.plotting import output_notebook

output_notebook()

Also, make sure your notebook is trusted. This is done by simply checking the checkbox in the top-right of the display when using PyCharm.

Next, you can use the plot method in Polars. For example, to make a scatter plot, you have to specify the columns to be used as the x- and y-axis, and you can also specify the column to be used as color of the points:

df.plot.scatter(x="body_mass_g", y="bill_length_mm", color="species")

This will give you a nice plot of the different data points of different penguin species for inspection:

Of course, scatter plots aren’t your only option. In Polars, you can use similar steps to create any type of plot that is supported by hvPlot. For example, hist can be done like this:

df.plot.hist("body_mass_g", by=["species","sex"])

For a full list of plot types supported by hvPlot, you can have a look at the hvPlot reference gallery.

Conclusion

I hope the information provided here will help you on your way with using Polars. Polars is an open-source project that is actively maintained and developed. If you have suggestions or questions, I recommend reaching out to the Polars community.

About the author

Cheuk Ting Ho

Cheuk has been a Data Scientist at various companies – a job that demands high numerical and programming skills, especially in Python. Following her passion for the tech community, Cheuk has been a Developer Advocate for three years. She also contributes to multiple open-source libraries like Hypothesis, Pytest, pandas, Polars, PyO3, Jupyter Notebook, and Django. Cheuk is currently a consultant and trainer at CMD Limes.
