I have 34 million rows and only one column. I want to split the string into 4 columns.

Here is my sample dataset (df):

    Log
0   Apr  4 20:30:33 100.51.100.254 dns,packet user: --- got query from 10.5.14.243:30648:
1   Apr  4 20:30:33 100.51.100.254 dns,packet user: id:78a4 rd:1 tc:0 aa:0 qr:0 ra:0 QUERY 'no error'
2   Apr  4 20:30:33 100.51.100.254 dns,packet user: question: tracking.intl.miui.com:A:IN
3   Apr  4 20:30:33 dns user: query from 9.5.10.243: #4746190 tracking.intl.miui.com. A

I want to split it into four columns using this code:

df1 = df['Log'].str.split(n=3, expand=True)
df1.columns=['Month','Date','Time','Log']
df1.head()

Here is the result that I expected:

     Month Date      Time                                              Log
0      Apr    4  20:30:33  100.51.100.254 dns,packet user: --- go...
1      Apr    4  20:30:33  100.51.100.254 dns,packet user: id:78a...
2      Apr    4  20:30:33  100.51.100.254 dns,packet user: questi...
3      Apr    4  20:30:33  dns transjakarta: query from 9.5.10.243: #474...
4      Apr    4  20:30:33  100.51.100.254 dns,packet user: --- se...

but the response is like this:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-36-c9b2023fbf3e> in <module>
----> 1 df1 = df['Log'].str.split(n=3, expand=True)
      2 df1.columns=['Month','Date','Time','Log']
      3 df1.head()

TypeError: split() got an unexpected keyword argument 'expand'

Is there any solution to split the string using Dask?

OctavianWR

1 Answer


Edit: this works now

Dask dataframe does support the str.split method's expand= keyword, provided you also pass the n= keyword to tell it how many splits to expect.
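For example, a minimal runnable sketch (the sample rows below are hypothetical, shortened from the question's log format):

    import pandas as pd
    import dask.dataframe as dd

    # Hypothetical sample matching the question's log layout
    pdf = pd.DataFrame({
        "Log": [
            "Apr  4 20:30:33 100.51.100.254 dns,packet user: --- got query",
            "Apr  4 20:30:33 100.51.100.254 dns,packet user: id:78a4",
        ]
    })
    ddf = dd.from_pandas(pdf, npartitions=1)

    # n=3 caps the number of splits, so expand=True yields exactly 4 columns
    df1 = ddf["Log"].str.split(n=3, expand=True)
    df1.columns = ["Month", "Date", "Time", "Log"]
    print(df1.compute())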

Old answer

It looks like Dask dataframe's str.split method doesn't implement the expand= keyword. You might raise an issue if one does not already exist.

As a short-term workaround, you could write a Pandas function and then use the map_partitions method to apply it across your Dask dataframe:

import pandas

def f(df: pandas.DataFrame) -> pandas.DataFrame:
    """This is your code from above, as a function"""
    df1 = df['Log'].str.split(n=3, expand=True)
    df1.columns = ['Month', 'Date', 'Time', 'Log']
    return df1

ddf = ddf.map_partitions(f)  # apply to every pandas dataframe within the dask dataframe

Because Dask dataframes are just collections of Pandas dataframes it's relatively easy to build things yourself when Dask dataframe doesn't support them.
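As an end-to-end sketch of that workaround (the sample row is hypothetical; the meta= argument is optional but tells Dask the output schema up front instead of letting it infer one):

    import pandas as pd
    import dask.dataframe as dd

    def f(df: pd.DataFrame) -> pd.DataFrame:
        df1 = df["Log"].str.split(n=3, expand=True)
        df1.columns = ["Month", "Date", "Time", "Log"]
        return df1

    pdf = pd.DataFrame({"Log": ["Apr  4 20:30:33 host dns user: query"]})
    ddf = dd.from_pandas(pdf, npartitions=1)

    # meta declares the 4-column output schema for map_partitions
    meta = pd.DataFrame({c: pd.Series(dtype="object")
                         for c in ["Month", "Date", "Time", "Log"]})
    out = ddf.map_partitions(f, meta=meta)
    print(out.compute())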

MRocklin