How to use pivot in pandas on chunks of data

Question

I am trying to pivot a dataset that has 1 billion rows and 3 columns. To do this, I read the file in chunks and apply pivot to each chunk. The following script pivots only the last row rather than the entire file. Does anyone know how to apply this to the complete data?

input data

r_id    g_id    exp
c1      g1      1
c2      g1      2
c3      g1      3
c1      g2      4
c2      g2      5
c3      g2      6
c1      g3      7
c2      g3      8
c3      g3      9

Script - Working

import pandas as pd

my_data1 = pd.read_csv("test.input", sep='\t')

my_data2 = pd.DataFrame(my_data1)

my_data3 = my_data2.pivot(index='r_id', columns='g_id', values='exp')

my_data3.to_csv("test.output", sep='\t')

Chunk Script - not working

import pandas as pd

chunker = pd.read_csv('test.input',sep='\t', chunksize=1)

tot = pd.DataFrame()

for piece in chunker:
        tot = piece.pivot(index='r_id', columns='g_id', values='exp')

tot.to_csv('test.output', sep='\t')

Desired output

r_id    g1      g2      g3
c1      1       4       7
c2      2       5       8
c3      3       6       9

I think it depends on the data, so what do `print (df['r_id'].nunique())` and `print (df['g_id'].nunique())` return? — jezrael, May 29 '17 at 10:35
Unfortunately not, because I understand `pivot`. But I need to know more about your one-billion-row DataFrame. Can you add the output of `print (df['r_id'].nunique())`, `print (df['g_id'].nunique())` and `print (len(df[['r_id','g_id']].drop_duplicates().index))` to your question? Thank you. — jezrael, May 29 '17 at 10:44
Sorry, do you mean the number of unique row IDs and column IDs? Unique row IDs: 200 million; unique column IDs: 10k. — user1703276, May 29 '17 at 11:08
OK, thank you. Unfortunately I don't know a nice solution for large data. — jezrael, May 29 '17 at 11:09
No problem. Do you have any suggestions for the example data above? Thanks. — user1703276, May 29 '17 at 11:12
It may help to check [this](https://stackoverflow.com/questions/14262433/large-data-work-flows-using-pandas) — jezrael, May 29 '17 at 11:19

user1703276 · Answer 1 · 2017-05-31T09:39:42.337

2

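I solved it myself. Thanks for the comments. The original loop overwrote `tot` on every iteration, so only the last chunk survived; adding each chunk's pivot to a running total keeps all of them.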

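>>> import pandas as pd
>>> chunker = pd.read_csv('test.input', sep='\t', chunksize=3)
>>> tot = pd.DataFrame()
>>> for piece in chunker:
...     # add this chunk's pivot to the running total; a cell missing on one side counts as 0
...     tot = tot.add(piece.pivot(index='r_id', columns='g_id', values='exp'), fill_value=0)
...
>>> tot.to_csv('test.output', sep='\t')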
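
For anyone who wants to try this without creating a file, here is a minimal self-contained sketch of the same idea. It assumes the sample data from the question, read from an in-memory string instead of `test.input`. The helper name `pivot_in_chunks` and the `total is None` initialisation are illustrative choices, not part of the original snippet. Note that `add` sums values, so an `(r_id, g_id)` pair that appears in more than one chunk would be summed rather than rejected, and integer values come back as floats.

```python
import io

import pandas as pd

# Sample data from the question; io.StringIO stands in for 'test.input'.
SAMPLE = (
    "r_id\tg_id\texp\n"
    "c1\tg1\t1\n"
    "c2\tg1\t2\n"
    "c3\tg1\t3\n"
    "c1\tg2\t4\n"
    "c2\tg2\t5\n"
    "c3\tg2\t6\n"
    "c1\tg3\t7\n"
    "c2\tg3\t8\n"
    "c3\tg3\t9\n"
)


def pivot_in_chunks(source, chunksize):
    """Pivot a tab-separated source chunk by chunk and merge the partial pivots.

    Hypothetical helper for illustration only.
    """
    total = None
    for piece in pd.read_csv(source, sep="\t", chunksize=chunksize):
        # Pivot only the rows of the current chunk.
        part = piece.pivot(index="r_id", columns="g_id", values="exp")
        if total is None:
            total = part
        else:
            # Align on the union of rows and columns; a cell present on
            # only one side is treated as 0 on the other side.
            total = total.add(part, fill_value=0)
    return total


result = pivot_in_chunks(io.StringIO(SAMPLE), chunksize=3)
print(result)
```

On the sample data this should reproduce the desired output table from the question. Keep in mind that the full pivoted table still has to fit in memory: with the roughly 200 million unique `r_id` values and 10k unique `g_id` values mentioned in the comments, a dense result would have about 2 × 10^12 cells (around 16 TB as float64), so this approach is only practical for much smaller inputs.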

edited May 31 '17 at 09:39

answered May 29 '17 at 14:35

user1703276

