2

I have a csv which is too large to read into memory so I am attempting to use Dask to solve my problem. I am a regular Pandas user but lack experience with Dask. In my data I have a column "MONTHSTART" which I wish to interact with as a datetime object. However I cannot seem to get an output from the Dask dataframe despite my code working in a Pandas example.

I have read my csv in using dask

df = dd.read_csv(filename, dtype='str')

Then I am converting the dtype of the column to a datetime object

def parse_dates(df):
return pd.to_datetime(df['MONTHSTART'], format='%d%b%Y')

meta = ('time', pd.Timestamp)
df.map_partitions(parse_dates, meta=meta)

Finally I am trying to apply a lambda function to create a new column based on my datetime column

 df['MONTHS_AGO'] = df.apply(
                        lambda y: (dt.date.today().year - y['MONTHSTART'].dt.year) * 12 +
                        (dt.date.today().month - y['MONTHSTART'].dt.month),
                        axis=1,
                        meta=meta)

I am unsure about the use of meta here, without it my code failed asking me to specify a meta.

Without meta I get

ValueError: Metadata inference failed, please provide `meta` keyword

With meta I get

AttributeError: ("'str' object has no attribute 'dt'", 'occurred at index 0')

Am I approaching this problem incorrectly? Is there a trick to applying lambda functions in Dask that I am missing?

Edit: I have obfuscated this information, and dropped lots of columns. I have tried to keep enough that the issue can be solved. df.head(2).to_dict example:

{'AGE_1': {0: '57', 1: '57'},
 'APREM': {0: '347.08581006', 1: '347.08581006'},
 'BUSINESS_1': {0: 'COMPUTERSERVICES', 1: 'COMPUTERSERVICES'},
 'COMPULSORYEXCESSAD': {0: '0', 1: '0'},
 'COVERTYPE': {0: 'Comprehensive', 1: 'Comprehensive'},
 'DRIVINGRESTRICTION': {0: 'IOD', 1: 'IOD'},
 'EARNEDTECH': {0: '35.438383793', 1: '15.356632977'},
 'ENDDATE': {0: '13AUG2017', 1: '13AUG2017'},
 'EXPMONTH': {0: 'EVY01APR2017', 1: 'EVY01AUG2017'},
 'INFORCEATEOM': {0: '1', 1: '0'},
 'LICENCETYPE_1': {0: 'FullUKCarLicence', 1: 'FullUKCarLicence'},
 'MARITALSTATUS_1': {0: 'Partnered', 1: 'Partnered'},
 'MILEAGERESTRICTION': {0: '8000', 1: '8000'},
 'MIN_AGE': {0: '57', 1: '57'},
 'MIN_EXP': {0: '18', 1: '18'},
 'MIN_EXP_AGE': {0: '57', 1: '57'},
 'MIN_EXP_LICENCETYPE': {0: 'FullUKCarLicence', 1: 'FullUKCarLicence'},
 'MONTHEND': {0: '30APR2017', 1: '31AUG2017'},
 'MONTHSTART': {0: '01APR2017', 1: '01AUG2017'},
 'REGION': {0: 'East Anglia', 1: 'East Anglia'},
 'STARTDATE': {0: '16FEB2017', 1: '16FEB2017'},
 'TENURE': {0: '4th Renewal', 1: '4th Renewal'},
 'TotalIncurredExclRI': {0: nan, 1: nan},
 'VEHICLECOUNT': {0: '1', 1: '1'},
 'VEHICLEKEPTOVERNIGHT': {0: 'Drive', 1: 'Drive'},
 'VEHICLEMODIFICATION': {0: 'false', 1: 'false'},
 'VEHICLENUMBER': {0: '1', 1: '1'},
 'VEHICLEUSAGE': {0: 'Personal Business Use', 1: 'Personal Business Use'},
 'VOLUNTARYEXCESS': {0: '250', 1: '250'}}
Violatic
  • 374
  • 2
  • 18
  • Do you mind to provide a [mcve](/help/mcve)? It will be great to have a sample of the original df and the expected output. – rpanai Mar 28 '19 at 13:00

1 Answers1

2

You might want to rename columns and convert the date in your favorite format but this is working for me:

# First we create our df
import pandas as pd
import numpy as np
import dask.dataframe as dd
import datetime as dt

N = 10
df =  pd.DataFrame({"date":pd.date_range(start='2017-01-01', periods=N),
                    "y":np.random.rand(N)})

df["date"] = df["date"].dt.strftime("%d%b%Y")
df.to_csv("data.csv", index=False)

Then

# read
df = dd.read_csv("data.csv", dtype='str')
# convert date to datetime
df["date"] = df["date"].astype("M8[us]")
# assign today date
td = dt.datetime.today()
# assign months_ago
df = df.assign(months_ago=((td.year - df["date"].dt.year)*12 +
                            td.month - df["date"].dt.month))

Using assign you don't need to deal with meta

rpanai
  • 12,515
  • 2
  • 42
  • 64
  • This worked, can you explain the difference between assign and apply? Is there a place I should be using apply? – Violatic Mar 29 '19 at 08:55
  • I am still getting an error with ValueError: Error parsing datetime string "01NOV2017" at position 2. Because my date is in the format '%d%b%Y' – Violatic Mar 29 '19 at 09:24
  • The file data.csv has date in that format. Could you please check if in your file there are whitespace on the left or on the right? Or post a small sample of your df as df.head().to_dict() – rpanai Mar 29 '19 at 10:32
  • 1
    Added df.head(2).to_dict() with a chunk of columns dropped to hide private info – Violatic Mar 29 '19 at 11:40
  • I tried my code with the data you provided and I couldn't find any errors. Do you mind to share a chuck of `df` that produce this error? You can use just 'MONTHEND', 'MONTHSTART' as columns so there is no problems with private data. – rpanai Mar 29 '19 at 12:40
  • Regarding `assign` vs `apply` you might want to read [this](https://stackoverflow.com/questions/33557022/variable-column-name-in-dask-assign-or-apply) and relative documentation. – rpanai Mar 29 '19 at 12:56
  • ValueError: Error parsing datetime string "01OCT2018" at position 2 is the error I am getting. Is there anything significant about that date? Unfortunately I cannot share any more of the df. – Violatic Mar 29 '19 at 14:48
  • @Violatic I tried with these dates but i can't reproduce the error. Could you split your csv in several parts and check if you can convert 'MONTHSTART' to `datetime`? – rpanai Mar 29 '19 at 15:33