I am trying to pivot a dataset that has 1 billion rows and 3 columns. To do this, I am reading the file in chunks and applying pivot to each chunk. The following script pivots only the last row instead of the entire file. Does anyone know how to apply this to the complete data?

Input data

r_id       g_id exp
c1      g1      1
c2      g1      2
c3      g1      3
c1      g2      4
c2      g2      5
c3      g2      6
c1      g3      7
c2      g3      8
c3      g3      9

Script - Working

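import pandas as pd

# read_csv already returns a DataFrame, so no extra pd.DataFrame() wrapper is needed
my_data1 = pd.read_csv("test.input", sep='\t')

# keyword arguments: pandas 2.0 and later no longer accept these positionally
my_data3 = my_data1.pivot(index='r_id', columns='g_id', values='exp')

my_data3.to_csv("test.output", sep='\t')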

Chunk Script - not working

import pandas as pd

chunker = pd.read_csv('test.input', sep='\t', chunksize=1)

tot = pd.DataFrame()

for piece in chunker:
    tot = piece.pivot(index='r_id', columns='g_id', values='exp')

tot.to_csv('test.output', sep='\t')

Desired output

r_id       g1      g2      g3
c1      1       4       7
c2      2       5       8
c3      3       6       9
user1703276

1 Answer

I solved it myself. Thanks for the comments.

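>>> import pandas as pd
>>> chunker = pd.read_csv('test.input', sep='\t', chunksize=3)
>>> tot = pd.DataFrame()
>>> for piece in chunker:
...     # add each chunk's pivot to the running total; a cell missing on one side counts as 0
...     tot = tot.add(piece.pivot(index='r_id', columns='g_id', values='exp'), fill_value=0)
...
>>> tot.to_csv('test.output', sep='\t')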
user1703276