7

I'm using the Pandas pivot_table function on a large dataset (10 million rows, 6 columns). As execution time is paramount, I am trying to speed up the process. Currently it takes around 8 seconds to process the whole dataset, which is way too slow, and I hope to find alternatives to improve speed/performance.

My current Pandas pivot_table:

df_pivot = df_original.pivot_table(index="industry", columns="months",
                                   values=["orders", "client_name"],
                                   aggfunc={"orders": np.sum, "client_name": pd.Series.nunique})

df_original contains all the data (10 million rows, imported from a CSV file). industry is the client's industry, months are the order months (Jan to Dec), orders is the number of orders. All columns were converted to categorical dtype, except the number of orders (int dtype). Originally industry, months and client_name were strings.

I tried using pandas.DataFrame.unstack, which was even slower. I also experimented with Dask. The Dask pivot_table yielded some improvement (6 seconds execution time, so 2 seconds less). However, it is still pretty slow. Are there any faster alternatives (for large datasets)? Maybe a recreation of the pivot table with groupby, crosstab, ...? Unfortunately, I did not get the alternatives to work at all, and I am still quite new to Python and Pandas. Looking forward to your suggestions. Thanks in advance!

Update:

I figured out the groupby way with:

df_new = (df_original.groupby(["months", "industry"])
                     .agg({"orders": np.sum, "client_name": pd.Series.nunique})
                     .unstack(level="months")
                     .fillna(0))

This is much faster now with about 2-3 secs. Are there still some options to improve speed further?

pythoneer
  • What's your execution environment? Multiple cores, cluster, or a single machine? Large RAM, or lots of fast disk-access? Tuning for performance means knowing what resources you have at your particular disposal. – Thomas Kimber Mar 28 '19 at 18:43
  • You are right, sorry!! Single machine, 16GB RAM, 8 cores (i7-8650U CPU @ 1.90GHz). I run the code within my IDE (Visual Studio Code). Ideally my module should also run "fast" on a machine with less RAM (e.g. 8GB) and less CPU power... – pythoneer Mar 28 '19 at 19:11
  • OK next question, is the data ordered? i.e. can you rely on the splits you want to create in your pivots to be reflected in the underlying data? It'd also be interesting to find a lower bound of expectation by timing some moderately simpler function, just to see what a best result might look like. e.g. how long does it take to read each line of the file into memory without any additional processing? – Thomas Kimber Mar 28 '19 at 21:32
  • Data (i.e. number of `orders`) is not ordered. Should it be? How does it help with performance? I read the CSV file once and convert it to HDF5 format for faster loading the next time I work with the dataset. I don't worry so much about reading each line of the file into memory. Maybe I misunderstood your question... – pythoneer Mar 30 '19 at 21:18
  • 1
    I figured out the `groupby` way with: ```df_new = df_original.groupby(["months", "industry"]).agg({"orders": np.sum, "client_name": pd.Series.nunique}).unstack(level="months").fillna(0)``` This is much faster now with about 2 secs. Are there still some options to improve speed further? – pythoneer Mar 30 '19 at 21:24
  • Just change `import pandas as pd` to `import modin.pandas as pd` and see the difference. – BhishanPoudel Mar 31 '19 at 16:39
  • @astro123: I use Visual Studio Code on a Windows machine with Python 3.7.2 64bit. When I run `pip install modin` I get the error message "could not find a version that satisfies the requirement ray (from versions: ) No matching distribution found for ray". Why is that? I would really like to try out modin / Ray and could not find an answer on: https://ray.readthedocs.io/en/latest/installation.html#latest-stable-version – pythoneer Mar 31 '19 at 21:22
  • @pythoneer I strongly suggest you install Anaconda and create new environments. I have also encountered these module installation errors multiple times, and the best solution I have found is to use conda and create a new environment whenever I run into issues installing a module in a given environment. Hope that helps. – BhishanPoudel Mar 31 '19 at 21:26
  • Following up on the `groupby` way - how can I add a total/subtotal row and column comparable to the `pandas.pivot_table` `margins`? I can add a total row with `df_new["all"] = df_new.sum(axis=0)` and a column with `df_new["orders", "all"] = df_new["orders"].sum(axis=1)`. However, it does not work for the `nunique` values. How can I accomplish this? I.e., I want to show the sum for the columns with the number of orders and the unique count of clients in the columns with `client_name`. Do you have an idea? Thanks – pythoneer Apr 08 '19 at 22:31

3 Answers

2

Convert the columns months and industry to categorical columns: https://pandas.pydata.org/pandas-docs/stable/user_guide/categorical.html This way you avoid a lot of string comparisons.
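A minimal sketch of that conversion (untested against the question's real dataset; the toy frame below simply stands in for `df_original` and reuses the column names from the question):

import pandas as pd

# Toy frame standing in for df_original from the question.
df_original = pd.DataFrame({
    "industry": ["retail", "retail", "tech"],
    "months": ["Jan", "Feb", "Jan"],
    "client_name": ["A", "B", "A"],
    "orders": [3, 1, 5],
})

# Convert the string columns to categorical dtype so that pivoting and
# grouping work on small integer codes instead of repeated string comparisons.
for col in ["industry", "months", "client_name"]:
    df_original[col] = df_original[col].astype("category")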

  • Thanks, I did that already and the execution time is around 8 sec. All data was converted to categorical columns before, e.g. ```df_original["industry"] = df_original["industry"].astype("category")``` – pythoneer Mar 28 '19 at 19:23
  • Any thoughts on why I get a TypeError after doing this and then `.reset_index()`? – Ivan Sudos Jun 07 '21 at 16:59
2

You can use sparse matrices. They are fast to implement, although a little bit restricted: for example, you can't index into a `coo_matrix` directly.

I recently needed to train a recommender system (LightFM) and it accepted sparse matrices as input, which made my job a lot easier. See it in action:

import numpy as np
from scipy import sparse

# Coordinates and values of the non-zero entries.
row = np.array([0, 3, 1, 0])
col = np.array([0, 3, 1, 2])
data = np.array([4, 5, 7, 9])
mat = sparse.coo_matrix((data, (row, col)), shape=(4, 4))
>>> print(mat)
  (0, 0)    4
  (3, 3)    5
  (1, 1)    7
  (0, 2)    9
>>> print(mat.toarray())
[[4 0 9 0]
 [0 7 0 0]
 [0 0 0 0]
 [0 0 0 5]]

As you can see, it automatically creates a pivot-table-like structure for you using the rows and columns of the data you have and fills the rest with zeros. You can convert the sparse matrix into an array or a DataFrame as well (`df = pd.DataFrame.sparse.from_spmatrix(mat, index=..., columns=...)`).
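As a rough sketch of how this could map onto the question's data (untested against the real dataset; the toy frame and column names are assumptions taken from the question, and this only covers the sum of orders, not the unique client count), the categorical codes can serve directly as matrix coordinates:

import pandas as pd
from scipy import sparse

# Toy stand-in for df_original from the question; "industry" and "months"
# are categorical and contain no missing values.
df_original = pd.DataFrame({
    "industry": pd.Categorical(["retail", "retail", "tech", "tech"]),
    "months": pd.Categorical(["Jan", "Jan", "Jan", "Feb"]),
    "orders": [3, 1, 5, 2],
})

rows = df_original["industry"].cat.codes.to_numpy()
cols = df_original["months"].cat.codes.to_numpy()
vals = df_original["orders"].to_numpy()

shape = (df_original["industry"].cat.categories.size,
         df_original["months"].cat.categories.size)

# Duplicate (row, col) pairs are summed when the COO matrix is converted to
# CSR, which reproduces the "sum of orders per industry and month" part of
# the pivot table.
orders_mat = sparse.coo_matrix((vals, (rows, cols)), shape=shape).tocsr()

df_orders = pd.DataFrame.sparse.from_spmatrix(
    orders_mat,
    index=df_original["industry"].cat.categories,
    columns=df_original["months"].cat.categories)
print(df_orders)

The unique-client count would still need a separate pass (for example the groupby/nunique approach from the question's update), since the duplicate summation above only matches a sum aggregation.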

Anmol Deep
0

When you read the CSV file into a DataFrame, you could pass a converter function (via the read_csv parameter converters) to transform client_name into a hash, and downcast orders to an appropriate int type, in particular an unsigned one.
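A hedged sketch of that idea (the file name and the uint32 choice are assumptions; uint32 is only safe if the order counts fit in its range):

import pandas as pd

df_original = pd.read_csv(
    "orders.csv",                         # hypothetical file name
    dtype={"industry": "category",
           "months": "category",
           "orders": "uint32"},           # downcast instead of the default int64
    converters={"client_name": hash},     # replace each name string with an integer hash
)

Hashing client_name keeps the values distinguishable for the nunique count while avoiding large string objects, at the cost of losing the readable names.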

This function lists the types and their ranges:

import numpy as np

def list_np_types():
    for k, v in np.sctypes.items():
        for i, d in enumerate(v):
            if np.dtype(d).kind in 'iu':
                # only int and uint have a definite range
                fmt = '{:>7}, {:>2}: {:>26}  From: {:>20}\tTo: {}'
                print(fmt.format(k, i, str(d),
                                 str(np.iinfo(d).min),
                                 str(np.iinfo(d).max)))

            else:
                print('{:>7}, {:>2}: {:>26}'.format(k, i, str(d)))


list_np_types()

Output:

    int,  0:       <class 'numpy.int8'>  From:                 -128 To: 127
    int,  1:      <class 'numpy.int16'>  From:               -32768 To: 32767
    int,  2:      <class 'numpy.int32'>  From:          -2147483648 To: 2147483647
    int,  3:      <class 'numpy.int64'>  From: -9223372036854775808 To: 9223372036854775807
   uint,  0:      <class 'numpy.uint8'>  From:                    0 To: 255
   uint,  1:     <class 'numpy.uint16'>  From:                    0 To: 65535
   uint,  2:     <class 'numpy.uint32'>  From:                    0 To: 4294967295
   uint,  3:     <class 'numpy.uint64'>  From:                    0 To: 18446744073709551615
  float,  0:    <class 'numpy.float16'>
  float,  1:    <class 'numpy.float32'>
  float,  2:    <class 'numpy.float64'>
complex,  0:  <class 'numpy.complex64'>
complex,  1: <class 'numpy.complex128'>
 others,  0:             <class 'bool'>
 others,  1:           <class 'object'>
 others,  2:            <class 'bytes'>
 others,  3:              <class 'str'>
 others,  4:       <class 'numpy.void'>
MCC
  • Thanks for the suggestion. Would the downcast to `uint` improve speed? What do you mean by transforming `client_name` into a hash? What's the idea/improvement behind that? – pythoneer Mar 30 '19 at 21:08
  • My bad on `client_name`: it's already categorical. To downcast `orders`, you need to know what the max or upper bound is; my guess is that you should not need `int64` (the pandas default). See [this](https://pbpython.com/pandas_dtypes.html) post. The idea is that, if you want your code to be as fast on a machine with half the RAM, this should help. – MCC Apr 01 '19 at 05:06
  • 1
    The minus score is super strange given the OP thanked me. – MCC Apr 20 '21 at 20:34
  • I have provided an upvote for you – Sumit Sep 28 '22 at 13:30