The ultimate guide to Pandas’ read_csv() function
Here I unravel the mysteries behind the omnipotent and overwhelmingly complicated read_csv() function, including new features added in Pandas 2.0.

With 3 engines and over 50 parameters to choose from, Pandas’ miraculous read_csv() function is the one-stop-shop for all your text file reading needs. But how does it actually work its magic under the hood? No-one actually knows.
Until now.
CSV Data and Pandas
Comma-Separated Value (CSV) files are probably still the most ubiquitous format for storing tabular data, even though much more efficient alternatives such as Parquet exist. The values are encoded as text, which is convenient for human readability and for viewing in programs such as Excel, but must be converted to appropriate Pandas datatypes when creating a DataFrame. This process can be rather convoluted given Pandas' complex datatype ecosystem, especially in more recent versions with the introduction of Extension Types enabling new nullable datatypes and the ArrowExtensionArray, which stores data in the Apache Arrow columnar memory format.
The read_csv() function is a magical black box which handles the creation of a DataFrame from a CSV data source using one of 3 possible engines and a myriad of configuration parameters. Even after reading the function documentation and IO tools guide, you'll probably still be left confused about which engine to use and how the various configuration parameters influence its behaviour. Well, luckily for you, I have wasted hours of my precious time diving deep into the horrors of Pandas' source code to try and figure it out, and will present my findings here in a way that will hopefully clear things up a bit.

How It Works
Calling read_csv() creates a TextFileReader instance, which acts as a wrapper around the desired parser engine. The input passed to read_csv() (file path, URL, file-like object) is converted to a file handle and used to initialise the parser engine along with the configuration options. If the chunksize or iterator arguments are provided, the TextFileReader instance is returned; its get_chunk() method can be called repeatedly to produce a DataFrame from the next desired number of rows of the CSV file. Otherwise, the entire file (or nrows if specified) is read and a corresponding DataFrame is returned.
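Here's a minimal sketch of the chunked-reading path, assuming a local file called data.csv:

```python
import pandas as pd

# Reading in chunks returns a TextFileReader rather than a DataFrame
total_rows = 0
with pd.read_csv("data.csv", chunksize=100_000) as reader:
    for chunk in reader:            # each chunk is a DataFrame of up to 100,000 rows
        total_rows += len(chunk)

# Alternatively, pull chunks on demand with iterator=True
reader = pd.read_csv("data.csv", iterator=True)
first_10_rows = reader.get_chunk(10)  # DataFrame built from the next 10 rows
reader.close()
```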
Pandas 2.0 (which isn't actually released at time of writing) introduces an additional dtype_backend argument which controls the class of datatypes to use when the dtype of a particular column is not specified (see the sketch after this list). The choices are:
- numpy (default) — Standard native NumPy array data types (intX, floatX, bool, etc)
- numpy_nullable — Pandas nullable extension arrays (IntegerArray, BooleanArray, FloatingArray, StringDtype)
- pyarrow — PyArrow-backed nullable ArrowDtype
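Here's a quick sketch of how the options look in practice (assuming Pandas 2.0 and PyArrow are installed, and that a local data.csv exists):

```python
import pandas as pd

# Default: native NumPy dtypes (int64, float64, object, ...)
df_numpy = pd.read_csv("data.csv")

# Pandas nullable extension dtypes (Int64, Float64, boolean, string)
df_nullable = pd.read_csv("data.csv", dtype_backend="numpy_nullable")

# PyArrow-backed ArrowDtype columns (e.g. int64[pyarrow])
df_arrow = pd.read_csv("data.csv", dtype_backend="pyarrow")

print(df_arrow.dtypes)
```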
That’s just a very high level overview of what read_csv() does; the important details depend on the implementation of each of the 3 available parser engines. Let’s see how they work now.
Python Engine
This engine uses a pure Python parser, which is the most feature-rich but also the slowest. It's used as a fallback from the default C engine when any of the following options are provided:
- skipfooter > 0
- sep set to None or longer than 1 character
- quotechar longer than 1 character
- on_bad_lines as a callable
The parser can attempt to automatically detect the separator character if sep is not provided (set to None). If sep is a single character, the parser uses the standard Python csv.reader() to read and split lines from the source file. If sep is longer than one character, it is treated as a regular expression pattern and the re.split() function is used to split lines (which I presume is much slower).
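For example, all of the following calls end up on the Python parser (the file name and separator here are just assumptions for illustration):

```python
import pandas as pd

# sep=None triggers automatic separator detection (csv.Sniffer), Python engine only
df = pd.read_csv("data.csv", sep=None, engine="python")

# A multi-character sep is treated as a regular expression and split with re.split()
df = pd.read_csv("data.csv", sep=r"\s*;\s*", engine="python")

# skipfooter > 0 is only supported by the Python engine
df = pd.read_csv("data.csv", skipfooter=2, engine="python")
```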
Here's the end-to-end process (a parameter sketch follows the list):
- Get rows from reader (as lists of strings)
- Skip rows matching values or callable in skiprows (if provided)
- Skip lines or remove comment content using comment character (if provided)
- Replace thousands separator (if provided) and decimal character (if not parser default)
- Construct 2D NumPy object array from row content (rows as arrays)
- Transpose 2D array to get columns as arrays
- Remove any columns not in usecols list (if provided)
- Perform data type inference and conversion on all columns not in parse_dates list (if provided)
- Convert date data in parse_dates columns (if provided) using pd.to_datetime() on the array of string data
- Create index from column(s) in index_col (if provided)
- Construct a DataFrame from column arrays, column names and index
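Most of these steps are driven directly by read_csv() parameters. A hedged sketch, assuming a hypothetical ';'-separated file with id, price and date columns:

```python
import pandas as pd

df = pd.read_csv(
    "data.csv",
    engine="python",
    sep=";",
    skiprows=lambda i: i in (1, 2),   # step 2: skip rows via values or a callable
    comment="#",                      # step 3: strip comment content
    thousands=".", decimal=",",       # step 4: normalise European-style numbers
    usecols=["id", "price", "date"],  # step 7: drop columns before conversion
    parse_dates=["date"],             # step 9: pd.to_datetime() on the string data
    index_col="id",                   # step 10: build the index
)
```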
This is a nice high level overview of the parser process, but what are the juicy specifics of the data type conversion in step 8? Well I’m glad you asked, because I’ve spent far too much time digging through source code to put together this detailed flowchart:

Hopefully this provides a comprehensive insight into how the parser works and how various configuration parameters might impact its functionality and performance. A few takeaways (illustrated in the sketch after this list) are:
- Specifying only the columns you need in usecols skips the data conversion process for the others, which will improve performance.
- If you want to use nullable or PyArrow data types, it's probably best to specify them explicitly using the dtype parameter, instead of setting dtype_backend. For example, it appears that when dtype_backend="pyarrow" and dtype is not provided, data is first converted to a NumPy numeric array, then to a nullable datatype, and finally to an ArrowExtensionArray.
- Steps 2, 3 and 4 (processing lines before construction of arrays) involve lots of looping and conditional checks performed in Python, so are probably quite expensive.
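Acting on the first two takeaways might look like the following sketch (the column names are made up, and pd.ArrowDtype assumes Pandas 2.0 with PyArrow installed):

```python
import pandas as pd
import pyarrow as pa

df = pd.read_csv(
    "data.csv",
    engine="python",
    usecols=["id", "price"],                   # skip conversion of unused columns
    dtype={
        "id": "Int64",                         # nullable extension dtype
        "price": pd.ArrowDtype(pa.float64()),  # PyArrow-backed dtype, set directly
    },
)
```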
C Engine
The default C engine aims to improve performance: it implements most of the data conversion logic in Cython and uses a pure C parser to read, tokenize and extract data from the CSV file.
The high-level process is similar to the Python parser:
- Read and tokenize (split into fields) rows of CSV data using C parser
- Convert column data into arrays of appropriate dtypes
- Perform date parsing on columns in parse_dates (if provided) using pd.to_datetime() on the array of string data
- Create index from column(s) in index_col (if provided)
- Construct DataFrame from dictionary of arrays and index.
When low_memory=True (the default), CSV data is processed in chunks by repeating steps 1 and 2 with a buffer size that depends on the table width. This reduces memory usage, since most data types are stored more efficiently in an appropriate NumPy array than in their original string format. The chunks are then concatenated and processing continues from step 3.
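A small sketch of toggling this behaviour (file name assumed):

```python
import pandas as pd

# Default: the C parser processes the file in internal chunks to cap memory usage
df = pd.read_csv("data.csv", engine="c", low_memory=True)

# Read everything in one pass: uses more memory, but avoids per-chunk dtype
# inference that can otherwise produce mixed-type columns (DtypeWarning)
df = pd.read_csv("data.csv", engine="c", low_memory=False)
```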
Again, I’m sure a measly 5-step explanation isn’t going to satisfy your insatiable hunger for intricate understanding, so here’s another painstakingly constructed flowchart of the data conversion in step 2:

This shows that the process is quite similar to the Python engine, so the same optimisation strategies should apply, but it will generally be much faster due to being implemented in Cython and C.
PyArrow Engine
Added in Pandas 1.4.0, the PyArrow engine continues the trend of increased performance but with fewer features (see the list of unsupported options here). It uses PyArrow's read_csv() function, which is implemented in C++ and supports multi-threaded processing.
This is what the engine does (a usage sketch follows the list):
- Call pyarrow.csv.read_csv() with options constructed from the provided arguments, to get an Arrow Table
- Convert the Arrow Table to a Pandas DataFrame using Table.to_pandas(), with the types_mapper argument provided if the dtype_backend option is not the numpy default. The type mapping here is used for the "numpy_nullable" backend, whereas the single ArrowDtype is used for "pyarrow"
- Perform date parsing on columns in parse_dates (if provided) using pd.to_datetime() on the array of string data
- Create and set index from column(s) in index_col (if provided)
- Cast columns to desired type(s) in dtype (if provided).
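Using it is just a matter of passing engine="pyarrow"; a minimal sketch (file and column names assumed):

```python
import pandas as pd

# Requires the pyarrow package to be installed
df = pd.read_csv(
    "data.csv",
    engine="pyarrow",
    usecols=["id", "price", "date"],  # passed through to PyArrow's reader options
    parse_dates=["date"],             # applied afterwards via pd.to_datetime()
)
```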
Sorry to let you down, but I'm not going to risk my sanity attempting to build a flowchart for PyArrow's C++ read_csv() function.
The usecols argument is used in the options provided to PyArrow's read_csv(), so specifying it should probably improve performance. Strangely, the ConvertOptions for PyArrow's CSV reader has a column_types parameter which appears to do the same thing as Pandas' dtype parameter, but it's not currently used. This means that type inference must be performed on all columns, which is not ideal for performance. It seems like this would be a rather straightforward capability to add, so I opened an issue and perhaps there will be an update in the near future.
However, one exciting observation here is the behaviour introduced in Pandas 2.0 when dtype_backend="pyarrow". The ArrowDtype is used to construct an ArrowExtensionArray from each column in the source PyArrow Table, which should be a zero-copy operation since they can both share the same underlying data representation. This means that if you want to work with PyArrow data types in Pandas (which you probably should), the pyarrow engine should be a very high-performance solution for reading CSV files.
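Combining the two looks like this (assuming Pandas 2.0 and PyArrow are installed):

```python
import pandas as pd

df = pd.read_csv("data.csv", engine="pyarrow", dtype_backend="pyarrow")

# Columns are backed by ArrowExtensionArray, e.g. int64[pyarrow], string[pyarrow]
print(df.dtypes)
```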
Conclusion
If you've made it this far, I hope you've learned something useful about how Pandas' read_csv() function works under the hood. If you have control over how the CSV files you're working with are generated, then you should look into using the Parquet format instead, as it is much better optimised for use by tools such as Pandas. If not, you should now know about some ways to improve the performance of reading the CSV files you're stuck with.
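As a rough illustration of that switch (requires a Parquet engine such as pyarrow or fastparquet):

```python
import pandas as pd

# One-off conversion of an existing CSV file to Parquet
pd.read_csv("data.csv").to_parquet("data.parquet")

# Subsequent reads are faster and preserve column dtypes
df = pd.read_parquet("data.parquet")
```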

Stay tuned for my next article where I will compare the performance of different engines and configuration options when reading CSV files!