The fastest way to read a CSV file in Pandas 2.0

It turns out that the fastest way to create a Pandas DataFrame from a CSV file is to use an entirely different library.

Finn Andersen
ITNEXT

--

In my previous article I explored in detail how the different options of Pandas’ pd.read_csv() function define its behaviour, and now I will look at how they impact performance. The results are surprising, and I discovered that the fastest way to create a Pandas DataFrame from a CSV file doesn’t involve using the pd.read_csv() function at all!

Test Data

For this benchmark test I’m going to use this Crime Data CSV dataset because it’s a decent size and has a good range of different data types. I cut the file down to the first 200,000 rows since my ChromeOS tablet was having memory issues running benchmarks on the full file. It has 28 columns in total, but for many of the tests I chose to only use 15 of them. Here’s what the first 5 rows of these columns look like:

First 5 rows of sample data for the 15 selected columns from the CSV file

Test Method

I investigated how the following options affected CSV reading speed:

  • Parser engines
  • Selecting a subset of columns
  • Specifying datatypes for fields
  • Different dtype_backend options (introduced in Pandas 2.0)

The many possible permutations of these parameters resulted in a bit of a complex multi-variate analysis, but I’ll do my best to present my findings in a clear and useful way.

Engines

As shown in my previous article, the pd.read_csv() function has 3 engine options to choose from, and they have a big impact on performance. Here I’m not even going to consider the python engine because it is notoriously slow (due to doing its processing in pure Python code). Just know that if you need to use any of the features the other engines don’t support, you’re going to be in for a bad (or at least slow) time.
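
For reference, the parser is chosen with the engine keyword of pd.read_csv(). Here's a minimal sketch of the two engines benchmarked below (the file name is just a placeholder):

import pandas as pd

df_c = pd.read_csv('crime_data.csv', engine='c')         # the default C parser
df_pa = pd.read_csv('crime_data.csv', engine='pyarrow')  # multi-threaded Arrow parser
# engine='python' also exists, but it's too slow to be worth benchmarking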

Instead, I decided to test an additional entirely different approach — using the read_csv() function of the Polars library. If you’re not familiar with Polars already (where have you been?), it’s the new DataFrame library kid on the block that puts Pandas to shame in terms of performance and has a nice expressions API. It can be used to read a CSV and then convert the resulting Polars DataFrame to a Pandas DataFrame, like:

import polars as pl
df = pl.read_csv('file.csv').to_pandas()

Datatype Backends

Pandas 2.0 introduced the dtype_backend option to pd.read_csv() to choose the class of datatypes that will be used by default. This influences the behaviour of the data conversion process and therefore should impact performance.
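
As a quick illustration, the backends compared in this article are selected like this (file name is a placeholder; leaving dtype_backend unset gives the default NumPy dtypes):

import pandas as pd

df_numpy = pd.read_csv('file.csv')                                      # default: NumPy dtypes
df_nullable = pd.read_csv('file.csv', dtype_backend='numpy_nullable')   # nullable extension dtypes
df_arrow = pd.read_csv('file.csv', dtype_backend='pyarrow')             # PyArrow-backed dtypes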

When using Polars to read the CSV and then convert to Pandas using DataFrame.to_pandas(), by default the data will be converted to NumPy dtypes (matching pd.read_csv()'s default behaviour when no dtype_backend is specified).

If the method is called with use_pyarrow_extension_array=True, the resulting Pandas DataFrame will be backed by PyArrow extension arrays, which means the data does not need to be copied. This is equivalent to using dtype_backend='pyarrow' in pd.read_csv().
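
In code, that conversion looks like this (file name is a placeholder):

import polars as pl

# Zero-copy handover: the Pandas DataFrame wraps the same Arrow buffers
df = pl.read_csv('file.csv').to_pandas(use_pyarrow_extension_array=True)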

In order to replicate the behaviour of dtype_backend='numpy_nullable' with Polars, the datatype mapping found here can be provided to DataFrame.to_pandas() using the types_mapper keyword argument.
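
As a rough sketch of that approach (the mapping below only covers a few common types and is purely illustrative, not the full mapping used in the benchmark):

import pandas as pd
import polars as pl
import pyarrow as pa

# Partial, illustrative mapping from Arrow types to Pandas nullable dtypes
nullable_mapping = {
    pa.int64(): pd.Int64Dtype(),
    pa.float64(): pd.Float64Dtype(),
    pa.string(): pd.StringDtype(),
    pa.bool_(): pd.BooleanDtype(),
}

# types_mapper is forwarded to pyarrow.Table.to_pandas() under the hood
df = pl.read_csv('file.csv').to_pandas(types_mapper=nullable_mapping.get)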

Results

Now the time has finally come to feast your eyes on some sweet, sweet bar charts. Firstly, here’s how selecting a subset of columns (15 from a total of 28) from the CSV file impacts read performance (with all other options default):

As expected, performance is improved by a similar amount with all engines since data from the other columns does not need to be processed. So if you don’t need to use all the data in a CSV file, you should specify the columns you’re interested in while reading it instead of selecting them later on.
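
In practice that just means passing the column names at read time, along these lines (the column names here are placeholders, not the actual dataset's):

import pandas as pd
import polars as pl

cols = ['col_a', 'col_b', 'col_c']  # placeholder column names

df_pd = pd.read_csv('file.csv', usecols=cols)               # Pandas: usecols
df_pl = pl.read_csv('file.csv', columns=cols).to_pandas()   # Polars: columns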

Next up, I wanted to compare the performance difference between explicitly specifying the dtypes of each column versus having them automatically inferred depending on the dtype_backend choice. For this test only the C engine was used, since the pyarrow engine does not actually use the provided dtype value(s) during parsing.
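
Concretely, the two approaches being compared look something like this (column names and dtypes are illustrative):

import pandas as pd

# Explicitly provide (nullable) dtypes per column...
df_explicit = pd.read_csv('file.csv', engine='c',
                          dtype={'col_a': 'Int32', 'col_b': 'string[pyarrow]'})

# ...versus letting the parser infer the types and only choosing the backend
df_inferred = pd.read_csv('file.csv', engine='c', dtype_backend='numpy_nullable')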

This result is very surprising to me, since based on my analysis of the C engine parser, specifying dtypes of columns allows skipping the data type inference step and therefore should improve performance. However, this shows that explicitly providing nullable extension types or ArrowDtypes actually degrades performance significantly compared to just setting dtype_backend appropriately and letting the parser infer the data types itself.

One explanation for this might be that when an extension type is explicitly provided, the array will be constructed from the raw data using ExtensionArray._from_sequence_of_strings(), whereas the automatic conversion for a non-default dtype_backend constructs the various built-in ExtensionArrays directly from the source NumPy arrays, which is probably much more efficient.

However, it’s still desirable to explicitly specify the dtypes of your columns where possible, in order to use appropriately-sized numeric types or the categorical type which can lead to significant memory savings. Therefore, I did some further testing to see the performance of using DataFrame.astype() to cast to desired dtypes after reading with read_csv(), for different dtype_backend options.

This shows that when using the default C parser engine, it's best to just leave dtype_backend and dtype as default and use DataFrame.astype() to cast to your desired dtypes afterwards, whether they're nullable or PyArrow-backed types.
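
A sketch of that recommended pattern (column names and target dtypes are again just illustrative):

import pandas as pd

# Read with default dtype handling, then cast once the DataFrame exists
df = pd.read_csv('file.csv', engine='c')
df = df.astype({'col_a': 'int32', 'col_b': 'category', 'col_c': 'float64[pyarrow]'})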

Finally, I ran a benchmark across all engines for the different dtype backends:

This shows that both the PyArrow and Polars engines are much faster than the C engine in all cases, probably because they parse the CSV file in parallel. These two engines are especially fast at creating a DataFrame with PyArrow datatypes, since that's their native format and doesn't require any further data conversion or copying. And it appears that the Polars CSV reader is even more optimised than PyArrow's, roughly halving its read time and coming in about 10x faster than the C engine.

Conclusion

For best performance when reading a large CSV file, specify just the columns you need so the rest don't have to be parsed. And contrary to intuition, it's best not to specify column dtypes in read_csv(), but instead to cast with DataFrame.astype() afterwards.
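
Putting the pieces together, the fastest end-to-end recipe from these benchmarks looks roughly like this (file and column names are placeholders):

import polars as pl

cols = ['col_a', 'col_b', 'col_c']  # only the columns you actually need

# Parse with Polars, then hand over to Pandas without copying the data
df = pl.read_csv('crime_data.csv', columns=cols).to_pandas(use_pyarrow_extension_array=True)

# Cast to more specific dtypes afterwards if needed, e.g.
# df = df.astype({'col_b': 'category'})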

Since the standardised, feature-rich and efficient (especially for string data) PyArrow format is the direction Pandas is heading, it seems the best performance comes from using the CSV reader of its nemesis. Or maybe it's time to just migrate over to Polars entirely!
