
I need to create a pivot table of 2000 columns by around 30-50 million rows from a dataset of around 60 million rows. I've tried pivoting in chunks of 100,000 rows, and that works, but when I try to recombine the DataFrames by doing a .append() followed by .groupby('someKey').sum(), all my memory is taken up and Python eventually crashes.

How can I do a pivot on data this large with a limited amount of RAM?

EDIT: adding sample code

The following code includes various test outputs along the way, but the last print is what we're really interested in. Note that if we change segMax to 3, instead of 4, the code will produce a false positive for correct output. The main issue is that if a shipmentid entry is not present in each and every chunk that the sum(...) over chunks looks at, it doesn't show up in the output.
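(As a standalone illustration, and not part of the sample below: the symptom matches ordinary index alignment. Plain + between DataFrames, which is what sum() relies on, produces NaN for any index label missing from either operand, whereas .add(..., fill_value=0) keeps it. The toy frames here are made up for this note only.)

import pandas as pd

a = pd.DataFrame({'qty': [1]}, index=pd.Index([1], name='shipmentid'))
b = pd.DataFrame({'qty': [2, 3]}, index=pd.Index([1, 2], name='shipmentid'))

print(a + b)                   # shipmentid 2 comes back as NaN because it is missing from a
print(a.add(b, fill_value=0))  # fill_value=0 keeps it: 3.0 for id 1 and 3.0 for id 2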

import pandas as pd
import numpy as np
import random
from pandas.io.pytables import *
import os

pd.set_option('io.hdf.default_format','table') 

# create a small dataframe to simulate the real data.
def loadFrame():
    frame = pd.DataFrame()
    frame['shipmentid']=[1,2,3,1,2,3,1,2,3] #evenly distributing shipmentid values for testing purposes
    frame['qty']= np.random.randint(1,5,9) #random quantity is ok for this test
    frame['catid'] = np.random.randint(1,5,9) #random category is ok for this test
    return frame

def pivotSegment(segmentNumber,passedFrame):
    segmentSize = 3 #take 3 rows at a time
    frame = passedFrame[(segmentNumber*segmentSize):(segmentNumber*segmentSize + segmentSize)] #slice the input DF

    # ensure that all chunks are identically formatted after the pivot by appending a dummy DF with all possible category values
    span = pd.DataFrame() 
    span['catid'] = range(1,5+1)
    span['shipmentid']=1
    span['qty']=0

    frame = frame.append(span)

    return frame.pivot_table(['qty'],index=['shipmentid'],columns='catid', \
                             aggfunc='sum',fill_value=0).reset_index()

def createStore():

    store = pd.HDFStore('testdata.h5')
    return store

segMin = 0
segMax = 4

store = createStore()
frame = loadFrame()

print('Printing Frame')
print(frame)
print(frame.info())

for i in range(segMin,segMax):
    segment = pivotSegment(i,frame)
    store.append('data',frame[(i*3):(i*3 + 3)])
    store.append('pivotedData',segment)

print('\nPrinting Store')   
print(store)
print('\nPrinting Store: data') 
print(store['data'])
print('\nPrinting Store: pivotedData') 
print(store['pivotedData'])

print('**************')
print(store['pivotedData'].set_index('shipmentid').groupby('shipmentid',level=0).sum())
print('**************')
print('$$$')
for df in store.select('pivotedData',chunksize=3):
    print(df.set_index('shipmentid').groupby('shipmentid',level=0).sum())

print('$$$')
store['pivotedAndSummed'] = sum((df.set_index('shipmentid').groupby('shipmentid',level=0).sum() for df in store.select('pivotedData',chunksize=3)))
print('\nPrinting Store: pivotedAndSummed') 
print(store['pivotedAndSummed'])

store.close()
os.remove('testdata.h5')
print('closed')
PTTHomps
  • It's worth noting that not only does Python crash, it crashes the whole computer. Clearly not a case where I can just let it cook for a couple of days. – PTTHomps Apr 03 '15 at 21:17
  • Depending on the nature of your data, you might want to try using [sparse DataFrames](http://pandas.pydata.org/pandas-docs/version/0.16.0/sparse.html). It could save you a lot of RAM. – dmvianna Apr 07 '15 at 22:50
  • Since my values for shipmentid are all numeric, I'm now experimenting with manually selecting from the pivotedData table one integer value of shipmentid at a time, incrementing from 0 to 5 million or so, then executing the sum() on the result, and appending it to a result table in the store. However, each select is taking a very long time, especially when no entries exist for a particular shipmentid. Will continue playing with compression settings to see if that might help. – PTTHomps Apr 10 '15 at 17:31
  • Why not use an RDBMS to aggregate your dataset? An SQL engine is designed to store millions of records and handle basic processing like Sum() by groups. And as your pivot indicates, with what I assume are two byte-size fields (ids) and one integer (qty) field, a temp db table should not be too expensive to store and query. Consider aggregating inside SQL Server, Oracle, MySQL, PostgreSQL or any other and passing the result into the Python [dataframe](http://pandas.pydata.org/pandas-docs/dev/generated/pandas.io.sql.read_sql.html). – Parfait Apr 11 '15 at 00:30
  • From where is the data sourced? A database (if so, which?), .csv file, HDF5, etc. – Alexander Apr 11 '15 at 19:44
  • Originally comes from gbq, which sadly doesn't have a pivot function – PTTHomps Apr 12 '15 at 17:04
  • [Google BigQuery](https://cloud.google.com/bigquery/what-is-bigquery)? Why not run an equivalent [query](https://cloud.google.com/bigquery/docs/reference/v2/jobs/query) to your pivot: `SELECT [catid], Sum(qty) FROM [datasource] GROUP BY [catid]`? Then, import the result into a data frame. Also, Python has an experimental pandas library for [gbq](http://pandas.pydata.org/pandas-docs/version/0.13.1/generated/pandas.io.gbq.read_gbq.html) with query option. – Parfait Apr 13 '15 at 03:45
  • Note that SQL Server and others don't allow more than 1024 columns in a table, and he needs 2000 columns, so pivoting inside the database may not work here. – user798719 Aug 07 '17 at 09:42

1 Answer


You could do the appending with an HDFStore (HDF5/PyTables), which keeps the data out of RAM.

Use the table format:

store = pd.HDFStore('store.h5')
for ...:
    ...
    chunk = ...  # the piece of the DataFrame you want to append this iteration
    store.append('df', chunk)

Now you can read it in as a DataFrame in one go (assuming this DataFrame can fit in memory!):

df = store['df']

You can also query, to get only subsections of the DataFrame.
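For example (a minimal sketch; it assumes the column you filter on was declared as a data column when appending, and 'shipmentid' here just stands in for whatever your key column is):

# when appending, declare the key as a data column so it is queryable on disk
store.append('df', chunk, data_columns=['shipmentid'])

# later: pull back only the rows for one key, without loading the whole frame
subset = store.select('df', where='shipmentid == 1')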

Aside: You should also buy more RAM, it's cheap.


Edit: you can groupby/sum from the store iteratively since this "map-reduces" over the chunks:

# note: this doesn't work, see below
sum(df.groupby().sum() for df in store.select('df', chunksize=50000))
# equivalent to (but doesn't read in the entire frame)
store['df'].groupby().sum()

Edit2: Using sum as above doesn't actually work in pandas 0.16 (I thought it did in 0.15.2); instead you can use reduce with add:

reduce(lambda x, y: x.add(y, fill_value=0),
       (df.groupby().sum() for df in store.select('df', chunksize=50000)))

In Python 3 you must import reduce from functools.
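For example (the same pattern as above with the explicit import; 'shipmentid' stands in for whatever the group key column is):

from functools import reduce

total = reduce(lambda x, y: x.add(y, fill_value=0),
               (df.groupby('shipmentid').sum()
                for df in store.select('df', chunksize=50000)))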

Perhaps it's more pythonic/readable to write this as:

chunks = (df.groupby().sum() for df in store.select('df', chunksize=50000))
res = next(chunks)  # will raise if there are no chunks!
for c in chunks:
    res = res.add(c, fill_value=0)

If performance is poor, or if there are a large number of new groups, then it may be preferable to start res as zeros of the correct size (by collecting the unique group keys, e.g. by looping through the chunks first), and then add in place.
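A rough sketch of that idea (hedged: the store key 'df', the key column 'shipmentid', and the flat column labels are assumptions; the first pass over the store only discovers the keys and columns, the second accumulates):

import pandas as pd

# first pass: collect every group key and every value column seen in any chunk
keys, cols = set(), set()
for df in store.select('df', chunksize=50000):
    keys.update(df['shipmentid'].unique())
    cols.update(c for c in df.columns if c != 'shipmentid')

# preallocate the full-size result as zeros, then accumulate each chunk's partial sums
res = pd.DataFrame(0, index=sorted(keys), columns=sorted(cols))
res.index.name = 'shipmentid'
for df in store.select('df', chunksize=50000):
    partial = df.groupby('shipmentid').sum()
    res.loc[partial.index, partial.columns] += partial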

Andy Hayden
  • Agreed on the ram. Unfortunately I'm already maxed out at 32 gb. Can the groupby().sum() be done in the hdf5 store? – PTTHomps Apr 03 '15 at 21:30
  • aws is unfortunately not an option, due to political reasons. – PTTHomps Apr 03 '15 at 21:39
  • 1
    @TraxusIV Hmmm, if you had a small number of groups, you could do it iteratively (by selecting each group and summing) - this'll be slow if you have lots of rows. I think this would make a great (new) question. A little google only found [this](http://sourceforge.net/p/pytables/mailman/message/6180494/) (from 2006!) suggesting no, you need to go the iteration way (I suggested)... things may have improved in the last 9 years?? – Andy Hayden Apr 03 '15 at 22:48
  • 3
    @TraxusIV For a groupby sum, that map/reduces, in the sense you can groupby and sum on chunks then add up the results. So chunk through df in the store. Something like: sum(df.groupby().sum() for df in store.select('df', chunksize=50000)) ? see http://pandas.pydata.org/pandas-docs/stable/io.html#iterator – Andy Hayden Apr 03 '15 at 22:49
  • 1
    @TraxusIV note: this'll be fast regardless of how many groups you have, the slowness was if you had to extract *each group* iteratively, doing chunks is what you want/fast. – Andy Hayden Apr 03 '15 at 23:04
  • Will the `sum(df.groupby().sum() for ...)` method properly group and sum when elements of the group occur in multiple chunks? Or will it potentially create multiple entries in the resulting df? – PTTHomps Apr 06 '15 at 15:14
  • 1
    @TraxusIV sum will add up the multiple entries - which is what you want. – Andy Hayden Apr 06 '15 at 16:12
  • The sum trick isn't working. If a key is not contained in each and every chunk, it's being dropped. Any ideas? I'll try to get some sample code up. – PTTHomps Apr 06 '15 at 22:36
  • 1
    @TraxusIV what version of pandas are you using, I thought I tested this and it worked – Andy Hayden Apr 07 '15 at 01:01
  • Pandas 0.16.0, python 3.4.3, via Anaconda 2.1.0 (64-bit). Sample code added to the question above. – PTTHomps Apr 07 '15 at 14:42
  • 1
    @TraxusIV hmmm, this worked in 0.15.2 but doesn't in 0.16!!! You can do the same with `reduce(lambda x, y: x.add(y, fill_value=0), (df.groupby().sum() for df in store.select('df', chunksize=50000)))`. Does that work for you? – Andy Hayden Apr 08 '15 at 02:13
  • Well, the reduce(lambda) method works for small datasets, but for my set of 50 million + rows, my memory gets used up by about the 4th iteration. – PTTHomps Apr 10 '15 at 17:16
  • New snag: I tried importing the data into Oracle DB to do the summation, but it failed due to a limit of 1000 columns. I have 2k in my data. – PTTHomps Apr 16 '15 at 14:59
  • @TraxusIV I don't see how your memory gets used up by the 4th iteration; I missed that comment, and that doesn't sound right to me. Did you try the final few-lines version? (Not sure why you'd *ever* want to import to OracleDB :p) – Andy Hayden Apr 16 '15 at 15:19
  • @TraxusIV I know this post is outdated, but i'm facing a similar problem and was wondering how you solved your problem in the end? – Hidden_NB May 26 '20 at 10:59