Quick Tip – Speed up Pandas using Modin

I recently ran across a neat little library called Modin that claims to make pandas run faster. The one-line description the project uses is:

Speed up your Pandas workflows by changing a single line of code

Interesting…and important if true.

Using modin only requires importing modin instead of pandas, and that’s it…no other changes are required to your existing code.

One caveat: modin currently uses pandas 0.20.3 (at least, it installs pandas 0.20.3 when modin is installed with pip install modin). If you’re using the latest version of pandas and need functionality that doesn’t exist in previous versions, you might need to hold off on checking out modin, or play around with getting it to work with the latest version of pandas (I haven’t done that yet).

To install modin:

pip install modin

To use modin:

import modin.pandas as pd

That’s it. Rather than import pandas as pd, you import modin.pandas as pd, and you get all the advantages of additional speed.
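If the same script has to run on machines where modin may not be installed, one option is to fall back to plain pandas. The snippet below is only a sketch of that idea; the helper name load_dataframe_library is made up for illustration and is not part of modin or pandas.

```python
import importlib


def load_dataframe_library(candidates=("modin.pandas", "pandas")):
    """Return (module, name) for the first candidate that imports.

    Hypothetical helper: returns (None, None) if nothing can be imported.
    """
    for name in candidates:
        try:
            return importlib.import_module(name), name
        except ImportError:
            # This candidate is not installed; try the next one.
            continue
    return None, None


# Prefer modin, fall back to pandas, and report which one was picked.
pd, backend = load_dataframe_library()
print("using backend: %s" % backend)
```

The rest of the script can keep calling pd.read_csv and friends without caring which library was loaded.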

Figure: a read_csv benchmark provided by Modin

According to the documentation, modin takes advantage of the multiple cores on modern machines, which pandas does not do. From their website:

In pandas, you are only able to use one core at a time when you are doing computation of any kind. With Modin, you are able to use all of the CPU cores on your machine. Even in read_csv, we see large gains by efficiently distributing the work across your entire machine.

Let’s give it a shot and see how it works.

For this test, I’m going to try out their read_csv method, since it’s something they highlight. I have a 105 MB csv file, so let’s time both pandas and modin on it and see how they compare.

We’ll start with pandas.

from functools import reduce
from timeit import default_timer as timer

import pandas as pd

# run 25 iterations of read_csv to get an average
times = []
for i in range(25):
    start = timer()
    df = pd.read_csv('OSMV-20190206.csv')
    end = timer()
    times.append(end - start)

# print out the average time taken
# I *think* I got this little trick
# from https://stackoverflow.com/a/9039992/2887031
print(reduce(lambda x, y: x + y, times) / len(times))

With pandas, it seems to take – on average – 1.26 seconds to read a 105MB csv file.

Now, let’s take a look at modin.

Before continuing, I should share that I had to do a couple of extra steps to get modin to work beyond just pip install modin. I had to install typing and dask as well.

pip install "modin[dask]"
pip install typing

I used the exact same code as above, with one minor change: import modin.pandas as pd instead of import pandas as pd.

from functools import reduce
from timeit import default_timer as timer

import modin.pandas as pd

# run 25 iterations of read_csv to get an average
times = []
for i in range(25):
    start = timer()
    df = pd.read_csv('OSMV-20190206.csv')
    end = timer()
    times.append(end - start)

# print out the average time taken
# I *think* I got this little trick
# from https://stackoverflow.com/a/9039992/2887031
print(reduce(lambda x, y: x + y, times) / len(times))

With modin, it seems to take – on average – 0.96 seconds to read a 105MB csv file.
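The two timing loops above differ only in the import, so the loop itself could be pulled into a small helper that times any callable. This is just a sketch; average_runtime and sample_workload are names I made up, and the stand-in workload is there only so the snippet runs without pandas or the csv file.

```python
from timeit import default_timer as timer


def average_runtime(func, repeats=25):
    """Call func() `repeats` times and return the mean wall-clock seconds."""
    durations = []
    for _ in range(repeats):
        start = timer()
        func()
        durations.append(timer() - start)
    return sum(durations) / len(durations)


def sample_workload():
    # Stand-in for pd.read_csv(...) so the sketch is self-contained.
    return sum(range(100000))


mean_seconds = average_runtime(sample_workload, repeats=5)
print("average: %.6f s" % mean_seconds)
```

To repeat the benchmark from this post, you would pass something like lambda: pd.read_csv('OSMV-20190206.csv') as func, once with pandas imported as pd and once with modin.pandas.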

Using modin, in this example, I was able to shave 0.3 seconds off the time it takes to read that 105 MB csv file. That may not seem like a lot of time, but if you’ve got 5000 csv files of similar size to read in, that’s a savings of 1500 seconds on average…that’s 25 minutes saved just reading files.
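For anyone who wants to check that arithmetic or plug in their own numbers, here is a quick sketch. The two averages are the ones measured above; the 5000 files are the hypothetical from the previous paragraph, and the variable names are mine.

```python
pandas_seconds = 1.26  # average read time measured above with pandas
modin_seconds = 0.96   # average read time measured above with modin
file_count = 5000      # hypothetical number of similarly sized csv files

saved_per_file = pandas_seconds - modin_seconds
total_saved_seconds = saved_per_file * file_count
total_saved_minutes = total_saved_seconds / 60
speedup = pandas_seconds / modin_seconds

print("saved per file: %.2f s" % saved_per_file)
print("total saved: %.0f s (%.0f min)" % (total_saved_seconds, total_saved_minutes))
print("speedup: %.2fx" % speedup)
```

Put another way, the modin run was roughly 1.3 times as fast as the pandas run on this one file and this one machine.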

Modin uses Ray to speed pandas up, so there could be even more savings if you get in and play around with some of the settings of Ray.
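Since I ended up installing dask rather than Ray, it’s worth knowing how to pick between the two. As far as I can tell from the current Modin documentation, the engine can be selected with a MODIN_ENGINE environment variable set before modin.pandas is imported. Treat the snippet below as a sketch under that assumption; I haven’t verified that the version used in this post honors the variable.

```python
import os

# Assumption: Modin reads MODIN_ENGINE when modin.pandas is first used,
# so it is set here before the import. "ray" is the other documented value.
os.environ["MODIN_ENGINE"] = "dask"

try:
    import modin.pandas as pd
    print("modin imported, requested engine: %s" % os.environ["MODIN_ENGINE"])
except ImportError:
    # modin is not installed in this environment; nothing to configure.
    print("modin is not installed")
```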

I’ll be looking at modin more in the future to see whether it can help me gain some efficiencies in my own projects. Take a look at it and let me know what you think.

The post Quick Tip – Speed up Pandas using Modin appeared first on Python Data.