3 simple ways to handle large data with Pandas
Pandas has become one of the most popular Data Science libraries out there. It’s easy to use, the documentation is fantastic, and its capabilities are powerful.
Yet regardless of what library one uses, large datasets always present an extra challenge that needs to be handled with care.
You start to run into hardware roadblocks since you don’t have enough RAM to hold all the data in memory. Enterprise companies store datasets that reach hundreds or even thousands of GB.
Even if you do happen to buy a machine that has enough RAM to store all that data, just reading it into memory is very slow.
But once again the Pandas library is going to help us out. This article will talk about 3 techniques you can use to reduce the memory footprint and read-in time for your large dataset. I’ve used these techniques for datasets of over 100GB in size, squeezing them onto machines with 64 and sometimes 32GB of RAM.
Let’s check them out!
Chunking your data
CSV is a very convenient format for storing data, being both easy to write and human-readable. Plus, there’s a nice pandas function, read_csv(), for loading data that’s stored as CSV.
But what happens when your CSV is so big that you run out of memory?
There’s a very simple pandas trick to handle that! Instead of trying to handle our data all at once, we’re going to do it in pieces. Typically, these pieces are referred to as chunks.
A chunk is just a part of our dataset. We can make that chunk as big or as small as we want. It just depends on how much RAM we have.
The process then works as follows:
- Read in a chunk
- Process the chunk
- Save the results of the chunk
- Repeat steps 1 to 3 until we have all chunk results
- Combine the chunk results
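We can perform all of the above steps using a handy parameter of the read_csv() function called chunksize. The chunksize refers to how many CSV rows pandas will read at a time. When it is set, read_csv() returns an iterator that yields one dataframe per chunk instead of a single dataframe. The right value will of course depend on how much RAM you have and how big each row is.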
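As a minimal sketch of the read, process, save, and combine loop: the in-memory CSV and the column names `price` and `volume` below are invented for illustration, standing in for a file too large to load at once. Each chunk contributes a partial sum and a row count, which are combined at the end into an overall mean.

```python
import io

import pandas as pd

# Hypothetical data: a small in-memory CSV stands in for a file too big for RAM.
csv_text = "price,volume\n" + "\n".join(f"{i * 0.5},{i}" for i in range(1, 1001))

total = 0.0
n_rows = 0

# With chunksize set, read_csv yields one DataFrame of up to 100 rows at a time.
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=100):
    total += chunk["price"].sum()  # process the chunk and save its partial result
    n_rows += len(chunk)

# Combine the per-chunk results into one answer.
mean_price = total / n_rows
print(n_rows, mean_price)
```

Only one chunk is held in memory at any time, so peak memory use is governed by chunksize rather than by the size of the file.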
If we think that our data follows a well-behaved distribution such as a Gaussian, then we can perform our desired processing and visualisations on one chunk at a time without too much loss in accuracy.
If our distribution is a bit more complex, like a Poisson, then it’s best to filter each chunk and put all of the small pieces together before processing. Most of the time, you’ll end up dropping many irrelevant columns or removing rows that have missing values. We can do that for each chunk to make them smaller, then put them all together and perform our data analysis on the final dataframe.
The code below performs all of these steps.
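The article's original listing is not reproduced here, so the following is only a sketch of the approach under stated assumptions: the CSV contents and the column names (`name`, `account`, `price`, `volume`) are hypothetical, and the chunk size is tiny so the example stays readable. Each chunk is shrunk by dropping unneeded columns and rows with missing values, and the pieces are then concatenated.

```python
import io

import pandas as pd

# Hypothetical data standing in for a large CSV file.
csv_text = (
    "name,account,price,volume\n"
    "alice,001,10.5,100\n"
    "bob,002,,200\n"
    "carol,003,12.0,\n"
    "dave,004,9.75,400\n"
    "erin,005,11.25,500\n"
)

kept = []
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2):
    chunk = chunk.drop(columns=["name", "account"])  # drop columns we don't need
    chunk = chunk.dropna()                           # drop rows with missing values
    kept.append(chunk)                               # save the reduced chunk

# Combine the reduced chunks into the final dataframe.
df = pd.concat(kept, ignore_index=True)
print(df)
```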
Dropping data
Sometimes, we’ll know right off the bat which columns of our dataset we want to analyse. In fact, it’s often the case that there are several columns that we don’t care about, like names, account numbers, etc.
Skipping those columns before reading in the data can save a lot of memory. Pandas allows us to specify the columns we would like to read in through the usecols parameter of read_csv().
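A minimal sketch of selecting columns at read time, assuming read_csv's usecols parameter is the mechanism meant here. The data and column names are invented for illustration.

```python
import io

import pandas as pd

# Hypothetical data: only "price" and "volume" matter for the analysis.
csv_text = (
    "name,account,price,volume\n"
    "alice,001,10.5,100\n"
    "bob,002,11.0,200\n"
)

# usecols tells the parser which columns to keep; the rest are never loaded.
df = pd.read_csv(io.StringIO(csv_text), usecols=["price", "volume"])
print(df.columns.tolist())
```

The same argument can be combined with chunksize, so each chunk already arrives without the unwanted columns.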
Throwing away the columns containing that useless miscellaneous information is going to be one of your biggest memory savings.
The other thing we can do is filter out any rows with missing or NA values. This is easiest with the dropna() function.
There are a few really useful parameters that we can pass to dropna():
- how: this will let you specify either “any” (drop a row if any of its columns are NA) or “all” (drop a row only if all its columns are NA)
- thresh: sets the minimum number of non-NA values a row must have to be kept; rows with fewer are dropped
- subset: Selects a subset of columns that will be considered for checking for NA values
You can use those arguments, especially thresh and subset, to get really specific about which rows will be dropped.
Pandas doesn’t come with a way to do this at read time like with the columns, but we can always do it on each chunk as we did above.
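To make the three options concrete, here is a small sketch on an invented four-row dataframe. In practice the same calls would be applied to each chunk. Note that thresh counts non-NA values: thresh=2 keeps only rows with at least two values present.

```python
import numpy as np
import pandas as pd

# Hypothetical data with different amounts of missing values per row.
df = pd.DataFrame({
    "price":  [10.5, np.nan, np.nan, 9.75],
    "volume": [100.0, 200.0, np.nan, np.nan],
    "note":   ["a", "b", None, None],
})

no_missing = df.dropna(how="any")         # keep rows with no NA at all
not_empty = df.dropna(how="all")          # drop rows where every column is NA
at_least_two = df.dropna(thresh=2)        # keep rows with >= 2 non-NA values
has_price = df.dropna(subset=["price"])   # only check the "price" column

print(no_missing.index.tolist())    # [0]
print(not_empty.index.tolist())     # [0, 1, 3]
print(at_least_two.index.tolist())  # [0, 1]
print(has_price.index.tolist())     # [0, 3]
```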
Set specific data types for each column
For many beginner Data Scientists, data types aren’t given much thought. But once you start dealing with very large datasets, dealing with data types becomes essential.
The standard practice tends to be to read in the dataframe and then convert the data type of a column as needed. But with a big dataset, we really have to be memory-space conscious.
There may be columns in our CSV, such as floating point numbers, which will take up way more space than they need to. For example, if we downloaded a dataset for predicting stock prices, pandas would read our prices in as 64-bit floating point by default!
But do we really need float64? Most of the time, stocks are bought at prices specified to two decimal places. float32 keeps about seven significant digits, which is more than enough for that and takes half the memory. float16 is smaller still, but it keeps only about three significant digits, so a price like 123.45 would get rounded.
So instead of reading in our dataset with pandas’ default data types, we’re going to specify the data types we want pandas to use when reading in our columns. That way, we never use up more memory than we actually need.
This is easily done using the dtype parameter of the read_csv() function. We can specify a dictionary where each key is a column in our dataset and each value is the data type we want to use for that column.
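A minimal sketch of passing such a dictionary, with invented data and column names. The choice of float32 and int32 is an assumption that suits this toy data, not a general recommendation; the narrower type has to be able to hold every value in the column.

```python
import io

import pandas as pd

# Hypothetical data: 1000 rows of ticker, price and volume.
csv_text = "ticker,price,volume\n" + "\n".join(
    f"T{i},{100 + i * 0.01:.2f},{i}" for i in range(1000)
)

# Default read: numeric columns come back as float64 and int64.
default_df = pd.read_csv(io.StringIO(csv_text))

# Same data, with narrower types requested per column at read time.
small_df = pd.read_csv(
    io.StringIO(csv_text),
    dtype={"price": "float32", "volume": "int32"},
)

cols = ["price", "volume"]
default_bytes = default_df[cols].memory_usage(index=False).sum()
small_bytes = small_df[cols].memory_usage(index=False).sum()
print(default_bytes, small_bytes)
```

For these two numeric columns, the narrower types take half the memory of the defaults.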