
I have 100-300 GB of data in CSV format (numerical + Unicode text) and need to run regular pivot-table jobs on it. After googling/StackOverflow-ing, I could not find a satisfactory answer (only partial ones). Which of these solutions is the fastest on a single machine (64 GB RAM)?

1) Convert and insert into PostgreSQL (or MySQL...) and process everything in the database through SQL?

2) Load the CSV in chunks in pandas and process them one by one manually?

3) Load the CSV, convert it to HDF5, and process the HDF5 file by chunk? (See the sketch after this list.)

4) Another solution?
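A minimal sketch of option 3, assuming a hypothetical input file data.csv with columns time/Server/Label/value (names, dtypes and chunk size are placeholders to adapt): the CSV is converted to an on-disk HDF5 table chunk by chunk, then queried slice by slice, so the full file never has to fit in RAM.

import pandas as pd

# Convert the CSV to an on-disk HDF5 table one chunk at a time;
# data_columns=["time"] makes "time" queryable with where=... later.
with pd.HDFStore("data.h5", mode="w", complevel=5, complib="blosc") as store:
    for chunk in pd.read_csv("data.csv", chunksize=2000000):
        store.append("data", chunk, data_columns=["time"], index=False)

# Pull one time slice at a time and pivot it; only the slice is in memory.
part = pd.read_hdf("data.h5", "data", where="(time >= 0) & (time < 1000000)")
pivot = part.pivot_table(index="time", columns="Label", values="value", fill_value=0)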

tensor

2 Answers


For working on a single machine, PostgreSQL may be the best fit of the three options you list.

To keep memory usage in check, you can partition the tables and handle the data in chunks.
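A rough sketch of what this could look like, assuming psycopg2 and a hypothetical table measurements(time, server, label, value) on PostgreSQL 10+ (declarative partitioning; older versions would use inheritance-based partitioning). Table names, time ranges and the connection string are placeholders.

import pandas as pd
import psycopg2

conn = psycopg2.connect("dbname=mydb user=me")  # placeholder connection string

with conn.cursor() as cur:
    # Range-partition by time so each query only touches one manageable slice.
    cur.execute("""
        CREATE TABLE IF NOT EXISTS measurements (
            time   bigint,
            server text,
            label  text,
            value  double precision
        ) PARTITION BY RANGE (time);
    """)
    cur.execute("""
        CREATE TABLE IF NOT EXISTS measurements_2017_01
        PARTITION OF measurements
        FOR VALUES FROM (1483228800) TO (1485907200);
    """)
conn.commit()

# Let PostgreSQL do the heavy grouping, then pivot the much smaller
# aggregate in pandas, one time range (partition) at a time.
query = """
    SELECT time, server || '|' || label AS label, SUM(value) AS value
    FROM measurements
    WHERE time >= %(lo)s AND time < %(hi)s
    GROUP BY 1, 2;
"""
agg = pd.read_sql(query, conn, params={"lo": 1483228800, "hi": 1485907200})
pivot = agg.pivot_table(index="time", columns="label", values="value", fill_value=0)

The CSV data would first have to be bulk-loaded, e.g. with COPY ... FROM ... CSV, which is usually much faster than row-by-row inserts.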

PabTorre
  • Any real experience with pivot tables on it? – tensor Jan 09 '17 at 04:58
  • Yeah, I've used those 3 tools to build pivot tables as well as OLAP cubes before. At the range of a few hundred GBs, PostgreSQL works well as long as one partitions the data so that it can work in memory and avoid using swap. – PabTorre Jan 09 '17 at 14:11
  • What if your pivot table has 2000 columns? How do you pivot using a relational database, given that the limit is 1024 columns for many of them? – user798719 Aug 10 '17 at 15:39
  • You should post that as a new question @user798719 – PabTorre Aug 10 '17 at 20:50

The questioner has probably solved the problem by now, but for others who might land on this question, my answer may help. Try this solution (adapt it to your dataset); I tried it on 50-80 GB and it worked. Adding numpy will improve the performance.

import pandas as pd
from datetime import datetime

print("1 Load records =", datetime.now())
df6 = pd.read_csv('sampleDataframeFlux/sampleDataframeFlux.csv',
                  low_memory=False, memory_map=True, engine='c',
                  na_filter=False, index_col=False,
                  usecols=["time", "Label", "Server", "value"])
print("shape", df6.shape)

print("2 Create dataframe =", datetime.now())
# Merge Server and Label into a single column, e.g. "server01|cpu"
df6["Label"] = df6["Server"] + "|" + df6["Label"]
df6.drop(['Server'], axis=1, inplace=True)

print("3 Time trim =", datetime.now())
# Nanosecond epoch -> second epoch
df6['time'] = df6['time'] // 1000000000
print("shape", df6.shape)

print("4 Round epoch to nearest multiple of 5 =", datetime.now())
df6['time'] = 5 * round(df6['time'] / 5)
print("shape", df6.shape)

print("5 Pivot dataframe =", datetime.now())
df6 = df6.pivot_table(index='time', columns=["Label"], values="value", fill_value=0)
print("shape", df6.shape)

print("6 Epoch to UTC =", datetime.now())
df6.index = pd.to_datetime(df6.index, unit='s')

print("7 Convert to category dtype to reduce memory =", datetime.now())
df6 = df6.astype('category')
print("shape", df6.shape)

print("8 Start writing to a file =", datetime.now())
df6.to_csv('file_11.csv', header=True, chunksize=500000)
print("9 Finish =", datetime.now())
Kimi