I am trying to reduce the time taken to apply a complex function from the cantools library on each row within a dataframe (up to 2 million rows):
   Timestamp     Type  ID   Data
0  16T122109957  0     522  b'0006'
1  16T122109960  0     281  b'0000ce52d2290000'
2  16T122109960  0     279  b'0000000000000000'
3  16T122109960  0     304  b'0000'
4  16T122109961  0     277  b'400000'
I am using the above dataframe together with a DBC file that has been read in. A DBC file is a set of rules describing how to encode/decode the data.
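For reference, the preview above can be rebuilt by hand like this (values copied straight from the printout; the real data is read from a log file):

import pandas as pd

# small hand-built copy of the preview, just for reproducing the problem
df = pd.DataFrame({
    'Timestamp': ['16T122109957', '16T122109960', '16T122109960', '16T122109960', '16T122109961'],
    'Type': [0, 0, 0, 0, 0],
    'ID': [522, 281, 279, 304, 277],
    'Data': [b'0006', b'0000ce52d2290000', b'0000000000000000', b'0000', b'400000'],
})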
Using DataFrame apply can take up to 10 minutes:
df['decoded'] = df.apply(lambda x: dbc.decode_message(x['ID'], x['Data']), axis=1)
Putting the two columns into lists and then iterating over the lists only takes about a minute to complete, but when the result is saved to a dataframe I get the error ValueError: array is too big.
That is expected, because the result is huge.
Example loop code:
id_list = df['ID'].tolist()
datalist = df['Data'].tolist()
listOfDicts = []
for i in range(len(id_list)):
    listOfDicts.append(dbc.decode_message(id_list[i], datalist[i]))
Data = pd.DataFrame(listOfDicts)
I tried vectorization, which is apparently the fastest approach, and was greeted with the error TypeError: 'Series' objects are mutable, thus they cannot be hashed, which I can't seem to fix.
example:
Data['dict'] = dbc.decode_message(df['ID'], df['Data'])
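As far as I can tell, decode_message expects one frame ID and one payload per call, so passing whole Series is presumably what triggers the hashing error. The closest per-row equivalent I can write is a plain comprehension (shown only for comparison with the loop above):

# sketch: call decode_message once per row instead of passing whole Series
decoded = [dbc.decode_message(i, d) for i, d in zip(df['ID'], df['Data'])]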
Are there any other ways to speed up the apply process, or should I keep working on the vectorization?
MINIMAL example:
import cantools
import pandas as pd
df = pd.read_csv('file.log', skiprows=11, sep=';')
dbc = cantools.database.load_file('file.dbc')
# option 1: SLOW
df['decoded'] = df.apply(lambda x: dbc.decode_message(x['ID'], x['Data']), axis=1)
# option 2: faster...
id_list = df['ID'].tolist()
datalist = df['Data'].tolist()
listOfDicts = []
for i in range(len(id_list)):
    listOfDicts.append(dbc.decode_message(id_list[i], datalist[i]))
Data = pd.DataFrame(listOfDicts)  # <-- ValueError: array is too big
# option 3
df['dict'] = dbc.decode_message(df['ID'], df['Data'])  # <-- TypeError: 'Series' objects are mutable