
I currently have a simple data-processing job involving groupby, merge, and parallel column-to-column operations. The not-so-simple part is the massive number of rows involved (it's detailed cost/financial data), 300-400 GB in size.

Due to limited RAM, I'm currently doing out-of-core computing with Dask. However, it's really slow.

I've previously read that cuDF can improve performance for map_partitions and groupby, but most examples use a mid- to high-end GPU (a 1050 Ti at least, and most are run on GV-based cloud VMs) and data that fits in GPU RAM.

My machine specs are an E5-2620 v3 (6C/12T), 128 GB of RAM, and a K620 (with only 2 GB of dedicated VRAM).

Intermediate dataframes are stored as Parquet.

Will it be faster if I use a low-end GPU with cuDF? And is it possible to do out-of-core computing on the GPU? (I've been looking all around for an example but have yet to find one.)

Below is simplified pseudo-code of what I'm trying to do.

a.csv is ~300 GB in size and consists of four columns (Hier1, Hier2, Hier3, value). Hier1-3 are hierarchy levels, as strings; value is the sales value. b.csv is ~50 GB in size and consists of four columns (Hier1, Hier2, valuetype, cost). Hier1-2 are hierarchy levels, as strings; valuetype is the cost type, as a string; cost is the cost value.

Basically, I need to prorate each cost in b.csv top-down based on the sales values in a.csv. The end goal is to have every cost available at the Hier3 level (the more detailed level).

The first step is to create the proration ratio:

import dask.dataframe as dd
# read the raw data, repartition, and convert both files to parquet
raw_reff = dd.read_csv('data/a.csv')
raw_reff = raw_reff.map_partitions(lambda df: df.assign(PartGroup=df['Hier1']+df['Hier2']))
raw_reff = raw_reff.set_index('PartGroup')
raw_reff.to_parquet("data/raw_a.parquet")

cost_reff = dd.read_csv('data/b.csv')
cost_reff = cost_reff.map_partitions(lambda df: df.assign(PartGroup=df['Hier1']+df['Hier2']))
cost_reff = cost_reff.set_index('PartGroup')
cost_reff.to_parquet("data/raw_b.parquet")

# create reference ratio
ratio_reff = dd.read_parquet("data/raw_a.parquet").reset_index()

# to reduce RAM usage, I use a groupby on each partition instead of a Dask-level groupby.
# This should be fine since the data was already partitioned by group via the set_index above

ratio_reff = ratio_reff.map_partitions(lambda df: df.groupby(['PartGroup'])['value'].sum().reset_index())
ratio_reff = ratio_reff.set_index('PartGroup')
ratio_reff = ratio_reff.map_partitions(lambda df: df.rename(columns={'value':'value_on_group'}))
ratio_reff.to_parquet("data/reff_a.parquet")

Then do the merge to get the ratio:

raw_data = dd.read_parquet("data/raw_a.parquet").reset_index()
reff_data = dd.read_parquet("data/reff_a.parquet").reset_index()
ratio_data = raw_data.merge(reff_data, on=['PartGroup'], how='left')
ratio_data['RATIO'] = ratio_data['value'].fillna(0)/ratio_data['value_on_group'].fillna(0)
ratio_data = ratio_data[['PartGroup','Hier3','RATIO']]
ratio_data = ratio_data.set_index('PartGroup')
ratio_data.to_parquet("data/ratio_a.parquet")

Then merge the cost data to the ratio on PartGroup and multiply to get the prorated value:

reff_stg = dd.read_parquet("data/ratio_a.parquet").reset_index()
cost_stg = dd.read_parquet("data/raw_b.parquet").reset_index()
final_stg = reff_stg.merge(cost_stg, on=['PartGroup'], how='left')
final_stg['allocated_cost'] = final_stg['RATIO']*final_stg['cost']
final_stg = final_stg.set_index('PartGroup')
final_stg.to_parquet("data/result_pass1.parquet")

In the real case there will be residual values caused by missing reference data and so on, and the allocation is done in several passes using several references, but the above covers the basic steps.
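Purely for illustration, a residual pass might look roughly like this (a sketch only; data/ratio_a_hier1.parquet is a hypothetical coarser ratio built at the Hier1 level, and the column selection is simplified):

import dask.dataframe as dd

# sketch of a hypothetical second pass:
# cost rows whose PartGroup found no ratio in pass 1 get re-allocated
# using a coarser, Hier1-level ratio
ratio = dd.read_parquet("data/ratio_a.parquet").reset_index()
cost = dd.read_parquet("data/raw_b.parquet").reset_index()

merged = cost.merge(ratio, on=['PartGroup'], how='left')
residual = merged[merged['RATIO'].isna()][['Hier1', 'Hier2', 'valuetype', 'cost']]

# placeholder file: a ratio table with columns Hier1, Hier3, RATIO
coarse = dd.read_parquet("data/ratio_a_hier1.parquet")
pass2 = residual.merge(coarse, on=['Hier1'], how='left')
pass2['allocated_cost'] = pass2['RATIO'] * pass2['cost']
pass2.to_parquet("data/result_pass2.parquet")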

Even with strictly Parquet-to-Parquet operations, it still takes ~80 GB of my 128 GB of RAM, keeps all of my cores running at 100%, and needs 3-4 days to run. I'm looking for ways to get this done faster on my current hardware. As you can see, it's a massively parallel problem that fits the definition of GPU-based processing.

Thanks

Ditto

1 Answer


@Ditto, unfortunately, this cannot be done with your current hardware. Your K620 predates the Pascal architecture and is below the minimum requirements for RAPIDS; you will need a Pascal card or better to run RAPIDS. The good news is that if purchasing a RAPIDS-compatible video card is not a viable option, there are many inexpensive cloud provisioning options. Honestly, for what you're asking to do, I'd want a little extra GPU processing power and would recommend a multi-GPU setup.

As for the dataset being larger than GPU RAM, you can use dask_cudf so that your dataset can be processed. There are several examples in our docs and notebooks. Please be advised that the resulting dataset after dask.compute() needs to fit in GPU RAM.

https://rapidsai.github.io/projects/cudf/en/0.12.0/10min.html#10-Minutes-to-cuDF-and-Dask-cuDF

https://rapidsai.github.io/projects/cudf/en/0.12.0/dask-cudf.html#multi-gpu-with-dask-cudf
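As a very rough, untested sketch of what that could look like for your first pass (assuming a supported GPU, dask_cuda installed, and the Parquet files you already produce), the code stays almost identical to your Dask version:

from dask_cuda import LocalCUDACluster
from dask.distributed import Client
import dask_cudf

# one worker per GPU; device_memory_limit controls spilling from GPU to host memory
cluster = LocalCUDACluster()
client = Client(cluster)

# read the Parquet you already write, but into GPU dataframes
raw = dask_cudf.read_parquet("data/raw_a.parquet").reset_index()

# group-level totals, equivalent to the per-partition groupby in your code
reff = raw.groupby('PartGroup').agg({'value': 'sum'}).reset_index()
reff = reff.rename(columns={'value': 'value_on_group'})

# merge back and compute the ratio on the GPU
ratio = raw.merge(reff, on='PartGroup', how='left')
ratio['RATIO'] = ratio['value'] / ratio['value_on_group']
ratio[['PartGroup', 'Hier3', 'RATIO']].to_parquet("data/ratio_a_gpu.parquet")

The output path here is just a placeholder so it does not clash with your existing file.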

Once you have a working, RAPIDS-compatible, multi-GPU setup and use dask_cudf, you should get a very worthwhile speed-up, especially for data exploration at that scale.

Hope this helps!

TaureanDyerNV
  • Can KMeans be trained on data that doesn't fit on GPU RAM (or normal RAM, for that matter), using a single GPU? This is possible on CPU, but I thought it is not possible in cuML. – HappyFace Jan 27 '22 at 22:01
  • I think it can, as it works with Dask for multi-GPU (see the sketch below). https://docs.rapids.ai/api/cuml/stable/api.html#id41 – TaureanDyerNV Feb 09 '22 at 22:38
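For reference, a minimal sketch of the multi-GPU cuML KMeans API mentioned in that comment (assuming a dask_cuda cluster and a numeric feature table in Parquet; the file path and n_clusters value are placeholders). As far as I know, the data is partitioned across the GPUs in the cluster, so it generally needs to fit in the combined GPU memory of the workers:

from dask_cuda import LocalCUDACluster
from dask.distributed import Client
import dask_cudf
from cuml.dask.cluster import KMeans

# one Dask worker per GPU
cluster = LocalCUDACluster()
client = Client(cluster)

# hypothetical feature table, read as a dask_cudf DataFrame partitioned across GPUs
X = dask_cudf.read_parquet("data/features.parquet")

km = KMeans(n_clusters=8)
km.fit(X)
labels = km.predict(X)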