
I have a large dataframe (200+ million rows) in the following format:

DeviceID    Date_Time
50135487    2018-03-01 00:00:44
50135487    2018-03-02 01:01:21
50135487    2018-03-01 02:01:58
50135484    2018-03-01 02:01:58
50135484    2018-03-01 02:50:13
50090879    2018-03-01 02:50:13
50090879    2018-03-01 02:50:13
50090860    2018-03-01 02:50:13
50090860    2018-03-01 02:50:13

Since the data frame has about 7700 unique 'DeviceID' values, I want to split it into 8 smaller dataframes so that I can run the analysis on them more quickly.

I've tried using numpy:

import numpy as np
np.array_split(df, 3)

but it produced dataframes where a given DeviceID can appear in multiple dataframes.
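
For illustration, a minimal reproduction of the problem (a toy frame with made-up device IDs, not the real data):

import numpy as np
import pandas as pd

# DeviceID 1 occupies rows 0-2, DeviceID 2 occupies rows 3-4.
toy = pd.DataFrame({"DeviceID": [1, 1, 1, 2, 2],
                    "Date_Time": pd.date_range("2018-03-01", periods=5, freq="h")})

# array_split cuts by row position only, ignoring group boundaries,
# so DeviceID 1 ends up in both chunks[0] and chunks[1].
chunks = np.array_split(toy, 3)   # chunk sizes 2, 2, 1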

I'm imagining that the solution would include an if statement combined with groupby, but I'm not sure how to go about it.

Renaldo Moon
  • Why do you assume that splitting it would allow you to run your analysis faster? I can only imagine that this is true if you're using multiprocessing, where each CPU core performs the analysis in parallel – Tabbakhh Oct 14 '19 at 10:17
  • Use more_itertools... check https://stackoverflow.com/questions/48770013/pandas-split-dataframe-into-multiple-when-condition-is-true – spYder Oct 14 '19 at 10:23
  • I thought of splitting it because I'm having trouble converting the timestamp with `pd.to_datetime` and then splitting the formatted date and time into two separate columns with the formats `%d/%m/%Y` and `%H:%M:%S`. The conversion is taking a lot of time. – Renaldo Moon Oct 14 '19 at 10:33
  • @RenaldoMoon - Either way you need to go through the entire data frame, so it shouldn't make any difference (I might be wrong, I just don't see how it would result in faster computations). However, I have run into this project called [modin](https://github.com/modin-project/modin) that reads data frames using multiple cores rather than only one (as pandas does), and it really is fast. Take a look at it, it might help you achieve your goal – Tabbakhh Oct 14 '19 at 10:43
  • @RenaldoMoon I suggest you skip `pd.to_datetime` and use `df["Date_Time"].astype("M8")` instead (see the sketch after these comments). – rpanai Oct 14 '19 at 11:22
  • Then you can have a look at [dask](https://docs.dask.org/en/latest/dataframe.html) or [vaex](https://vaex.readthedocs.io/en/latest/index.html) for out-of-core dataframes. These tools manage the splitting for you, so you can just run a groupby. – rpanai Oct 14 '19 at 11:30
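
A sketch of the conversion discussed in the comments above, assuming `df` is the frame from the question and `Date_Time` holds ISO-formatted strings as in the sample (the `[ns]` unit is spelled out because newer pandas versions require an explicit unit for `astype`):

# Parse the ISO-like strings; astype("M8[ns]") can be faster than a bare
# pd.to_datetime call because it skips per-row format inference.
dt = df["Date_Time"].astype("M8[ns]")

# Split the parsed timestamps into the two formatted columns
# mentioned in the comments.
df["Date"] = dt.dt.strftime("%d/%m/%Y")
df["Time"] = dt.dt.strftime("%H:%M:%S")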

1 Answer


To split your DataFrame into a number of "bins", keeping each DeviceID in a single bin, take the following approach:

  • Compute value_counts for DeviceID. The result is a Series sorted with the most numerous groups first.
  • Convert it to a DataFrame and add a column of bin numbers, cycling from 0 to binNo - 1.

The code to do it is:

binNo = 3    # Number of bins
# Count rows per DeviceID; value_counts sorts the largest groups first.
vc = df.DeviceID.value_counts().rename('cnt')
# Deal DeviceIDs into bins round-robin, so bin sizes stay balanced.
vc = vc.to_frame().assign(bin=[i % binNo for i in range(vc.size)])

For my example data (your sample, slightly changed and expanded), the result is:

          cnt  bin
50135487    2    0
50135484    2    1
50090860    2    2
50090879    2    0
50090869    1    1
50090850    1    2
50135488    1    0

cnt is the number of occurrences of each DeviceID and bin is the bin assignment for this DeviceID.

Then define a function that generates the list of DeviceIDs assigned to bin n:

def genDf(n):
    # DeviceIDs that were assigned to bin n
    return vc[vc.bin == n].index

And to generate the list of DataFrames (bins), execute:

dfLst = [ df[df.DeviceID.isin(genDf(i))] for i in range(binNo) ]
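
As a quick sanity check (and a usage sketch), one can verify that the bins are disjoint by device and then process them independently, using the dfLst and binNo defined above:

# Each DeviceID should appear in exactly one bin.
seen = [set(part.DeviceID.unique()) for part in dfLst]
for i in range(binNo):
    for j in range(i + 1, binNo):
        assert not (seen[i] & seen[j]), "a DeviceID leaked into two bins"

# The bins can then be analysed one at a time (or in parallel).
for i, part in enumerate(dfLst):
    print(f"bin {i}: {part.DeviceID.nunique()} devices, {len(part)} rows")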
Valdi_Bo