
I have two questions about dask. First: The documentation for dask clearly states that you can rename columns with the same syntax as pandas. I am using dask 1.0.0. Any reason why I am getting these errors below?

df = pd.DataFrame(dictionary)
df


# I am not sure how to choose values for divisions, meta, and name. I am also pretty unsure about what these really do.
ddf = dd.DataFrame(dictionary, divisions=[8], meta=pd.DataFrame(dictionary), name='ddf')    
ddf


cols = {'Key':'key', '0':'Datetime','1':'col1','2':'col2','3':'col3','4':'col4','5':'col5'}

ddf.rename(columns=cols, inplace=True)

TypeError: rename() got an unexpected keyword argument 'inplace'

OK, so I removed inplace=True and tried this:

ddf = ddf.rename(columns=cols)

ValueError: dictionary update sequence element #0 has length 6; 2 is required

The pandas dataframe is showing a real dataframe, but when I call ddf.compute() I get an empty dataframe.


My second question is that I am slightly confused about how to assign divisions, meta, and name. How is this useful/hurtful if I use dask to parallelize on a single machine vs a cluster?

Matt Elgazar
    FWIW, creating a dictionary to remap each column name (even the ones I don't want to change) and then using `ddf = ddf.rename(columns=cols)` worked just fine for me. – SummerEla Dec 22 '18 at 06:14

4 Answers


Regarding the renaming, this is how I usually change feature names when I'm using dask. Perhaps it will work for you too:

new_columns = ['key', 'Datetime', 'col1', 'col2', 'col3', 'col4', 'col5']
df = df.rename(columns=dict(zip(df.columns, new_columns)))
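To make it clearer what the zip call builds, here is a minimal sketch in plain Python. The old column labels ('Key' followed by the integers 0 to 5) are an assumption based on the question's rename dictionary, not something taken from real data:

```python
# Assumed original labels: 'Key' plus integer labels 0..5 (hypothetical).
old_columns = ['Key', 0, 1, 2, 3, 4, 5]
new_columns = ['key', 'Datetime', 'col1', 'col2', 'col3', 'col4', 'col5']

# zip stops silently at the shorter input, so check the lengths first.
assert len(old_columns) == len(new_columns)

# Pair each old label with its new name, in order.
mapping = dict(zip(old_columns, new_columns))
print(mapping)
```

Because the mapping is built from df.columns itself, the keys always have the right type, whether the labels are strings or integers.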

As for determining the number of partitions, the documentation gives a pretty good example, using time series data, of how to decide where to divide the dataframe: http://docs.dask.org/en/latest/dataframe-design.html#partitions.

Sam Comber

I could not get this line to work, because I was passing dictionary as a basic Python dictionary, which is not the right input for dd.DataFrame:

ddf = dd.DataFrame(dictionary, divisions=[2], meta=pd.DataFrame(dictionary,
                                              index=list(range(2))), name='ddf')

print(ddf.compute())
() # this is the output of ddf.compute(); clearly something is not right

So, I had to create some dummy data and use that in my approach to creating a dask dataframe.

Generate dummy data in a dictionary

d = {0: [388]*2,
 1: [387]*2,
 2: [386]*2,
 3: [385]*2,
 5: [384]*2,
 '2012-06-13': [389]*2,
 '2012-06-14': [389]*2,}

Create a Dask dataframe from the dictionary via a dask bag

  • this means you must first use pandas to convert the dictionary to a pandas DataFrame and then use .to_dict(..., orient='records') to get the sequence (list of row-wise dictionaries) you need to create a dask bag

So, here is how I created the required sequence

d = pd.DataFrame(d, index=list(range(2))).to_dict('records')

print(d)
[{0: 388,
  1: 387,
  2: 386,
  3: 385,
  5: 384,
  '2012-06-13': 389,
  '2012-06-14': 389},
 {0: 388,
  1: 387,
  2: 386,
  3: 385,
  5: 384,
  '2012-06-13': 389,
  '2012-06-14': 389}]

Now I use the list of dictionaries to create a dask bag

dask_bag = db.from_sequence(d, npartitions=2)

print(dask_bag)
dask.bag<from_se..., npartitions=2>

Convert dask bag to dask dataframe

df = dask_bag.to_dataframe()

Rename columns in dask dataframe

cols = {0:'Datetime',1:'col1',2:'col2',3:'col3',5:'col5'}
df = df.rename(columns=cols)

print(df)
Dask DataFrame Structure:
              Datetime   col1   col2   col3   col5 2012-06-13 2012-06-14
npartitions=2                                                           
                 int64  int64  int64  int64  int64      int64      int64
                   ...    ...    ...    ...    ...        ...        ...
                   ...    ...    ...    ...    ...        ...        ...
Dask Name: rename, 6 tasks

Compute the dask dataframe (this time the output is not ())

print(df.compute())
   Datetime  col1  col2  col3  col5  2012-06-13  2012-06-14
0       388   387   386   385   384         389         389
0       388   387   386   385   384         389         389

Notes:

  1. Also from the .rename documentation: inplace is not supported.
  2. I think your renaming dictionary contained strings '0', '1', etc. for the column names that were integers. It could be the case for your data (as is the case with the dummy data here) that the dictionary should just have been integers 0, 1, etc.
  3. Per the dask docs, I used a 1-1 renaming dictionary; column names not included in the renaming dict are left unchanged
    • this means you don't need to pass in column names that you do not need to be renamed
edesz

If you only want to lowercase the names and replace spaces with underscores, you can do:

data = dd.read_csv('*.csv').rename(columns=lambda x: x.lower().replace(' ', '_'))
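Note that the lambda calls .lower() on every label, so it raises an AttributeError if any column name is not a string (for example, integer labels like those in the question). A minimal sketch of a more defensive mapper, using made-up column names, might look like this:

```python
def normalize(name):
    # str() guards against non-string labels such as the integers 0, 1, ...
    return str(name).lower().replace(' ', '_')

# Hypothetical column labels, including one integer label.
sample_columns = ['Key', 'Unit Price', 0]

normalized = [normalize(c) for c in sample_columns]
print(normalized)  # ['key', 'unit_price', '0']
```

A function like this can be passed as rename(columns=normalize) in place of the lambda. Keep in mind that it turns integer labels into strings.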
Pepe

You can build a dict like this:

columns = {0:'Datetime',1:'col1', ...}

After you read your data:

# you can use dask to read your data
import dask.dataframe as dd
# dd.read_json takes a file path, glob, or URL, so `dictionary` must point to JSON data
df = dd.read_json(dictionary)
df = df.rename(columns=columns).compute()

Your problem is the 'Key' entry and also the type of the original column names:

cols = {'Key':'key', '0':'Datetime','1':'col1','2':'col2','3':'col3','4':'col4','5':'col5'}

You should delete 'Key':'key' and use integer keys (0, 1, 2) instead of string keys ('0', '1', '2').
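To illustrate why the key type matters, here is a small plain-Python sketch. It assumes the column labels are the integers 0 to 2 and imitates the rename behaviour of leaving unmatched labels unchanged; the names are made up for the example:

```python
# Assumed column labels: integers, not strings (hypothetical).
columns = [0, 1, 2]

str_keyed = {'0': 'Datetime', '1': 'col1', '2': 'col2'}
int_keyed = {0: 'Datetime', 1: 'col1', 2: 'col2'}

# Labels with no matching key are left as they are.
renamed_with_str = [str_keyed.get(c, c) for c in columns]
renamed_with_int = [int_keyed.get(c, c) for c in columns]

print(renamed_with_str)  # [0, 1, 2]
print(renamed_with_int)  # ['Datetime', 'col1', 'col2']
```

The string-keyed dictionary matches nothing, because '0' and 0 are different keys, so the names silently stay the same.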

Newt