
I am trying to reduce the time taken to apply a complex function from the cantools library to each row of a dataframe (up to 2 million rows):

          Timestamp  Type   ID                 Data
0      16T122109957     0  522              b'0006'
1      16T122109960     0  281  b'0000ce52d2290000'
2      16T122109960     0  279  b'0000000000000000'
3      16T122109960     0  304              b'0000'
4      16T122109961     0  277            b'400000'

The decoding uses the above dataframe together with a dbc file that has been read in. A dbc file is a set of rules describing how to encode/decode the data.
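For context, a single decode with cantools looks roughly like this (the frame ID and payload bytes below are made up for illustration; decode_message returns a dict mapping signal names to decoded values):

import cantools

dbc = cantools.database.load_file('file.dbc')
# hypothetical frame ID and payload; in the dataframe these come from the ID and Data columns
decoded = dbc.decode_message(0x115, b'\x00\x06')
# decoded is a dict like {'SignalA': 0, 'SignalB': 6}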

Using DataFrame apply can take up to 10 minutes:

df['decoded'] = df.apply(lambda x: dbc.decode_message(df['ID'][x], df['Data']))

Putting the two columns into lists and then iterating over the lists takes only about a minute, but when the resulting list of dictionaries is saved to a dataframe I get the error ValueError: array is too big, which is expected as it is huge.

Example loop code:

id_list = df['ID'].tolist()
datalist = df['Data'].tolist()
listOfDicts = []
for i in range(len(id_list)):
    listOfDicts.append(dbc.decode_message(id_list[i], datalist[i]))
Data = pd.DataFrame(listOfDicts)

I tried vectorization, which is apparently the fastest approach, and was greeted with the error TypeError: 'Series' objects are mutable, thus they cannot be hashed, which I can't seem to fix. Example:

Data['dict'] = dbc.decode_message(df['ID'], df['Data'])

Are there any other ways to speed up the apply process, or should I keep working on the vectorization?

MINIMAL example:

import cantools
import pandas as pd

df = pd.read_csv('file.log', skiprows=11, sep=';')
dbc = cantools.database.load_file('file.dbc')

# option 1 SLOW
df['decoded'] = df.apply(lambda x: dbc.decode_message(x['ID'], x['Data']), axis=1)

# option 2 Faster...
id_list = df['ID'].tolist()
datalist = df['Data'].tolist()
listOfDicts = []
for i in range(len(id_list)):
    listOfDicts.append(dbc.decode_message(id_list[i], datalist[i]))
Data = pd.DataFrame(listOfDicts) #< -- causes error for being too big

#option 3
df['dict'] = dbc.decode_message(df['ID'], df['Data']) #< --Error
  • What library does the `.decode_message()` method come from? We need to know a bit about the function we're dealing with here, right? Not directly related to your question, but why in your loop code did you convert the Series to lists? – AMC Jan 17 '20 at 16:21
  • The library used is cantools. A dbc file is read in with the rules and these can be used to decode messages using the id and data. I converted it to lists as it seemed to perform quicker.... I think I read that sometimes iterating over the rows can be faster. – RMRiver Jan 17 '20 at 16:25
  • Can you make a [mcve]? _I think I read that sometimes iterating over the rows can be faster._ That depends entirely on what you mean by _iterating_. – AMC Jan 17 '20 at 16:26
  • Added example - Would you also like example dbcs and dataframe? – RMRiver Jan 17 '20 at 16:38
  • I guess so? _I tried python vectorization which is apparently the fastest and was greeted with the error TypeError: 'Series' objects are mutable, thus they cannot be hashed which I can't seem to fix._ The function probably isn't designed to work that way. – AMC Jan 17 '20 at 17:25
  • In the `.apply()`, why are you doing `df['ID'][x]`? Shouldn't it be `x['ID']`? – AMC Jan 17 '20 at 17:26
  • Vectorization can only work when your functions support working on whole Series. How complicated is the dbc? Could you possibly implement the conversions yourself by means of pandas-supported functions? – MSpiller Jan 18 '20 at 20:13
  • @M.Spiller the dbc file is complicated and very big so creating a dbc file and decoder is a last resort.... – RMRiver Jan 20 '20 at 10:02
  • @AMC You are correct it should be that – RMRiver Jan 20 '20 at 10:03

2 Answers


Posting this as an answer, but YMMV:

As long as the cantools library does not support working on Series or DataFrame objects, vectorization will not work. So using apply is the only way to go.

Since the dbc conversion works row by row without any inter-row dependencies you should be able to parallelize it.

You need to

  • Write a function that does the conversion, taking a dataframe:

    def decode(df):
        # assign the result of apply to a column, otherwise the decoded values are discarded
        df['decoded'] = df.apply(lambda x: dbc.decode_message(x['ID'], x['Data']), axis=1)
        return df
    
  • call it like this:

    import pandas as pd
    import numpy as np
    import multiprocessing as mp
    
    def parallelApply(df, func, numChunks=4):
        df_split = np.array_split(df, numChunks)
        pool = mp.Pool(numChunks)
        df = pd.concat(pool.map(func, df_split))
        pool.close()
        pool.join()
        return df
    
    df = parallelApply(df, decode)
    

What parallelApply does is split the dataframe into numChunks chunks and create a multiprocessing pool with that many worker processes.

The function func (which is decode in your case) is then applied to each of the chunks in a separate process.

decode returns the dataframe chunk it has updated and pd.concat will merge them again.
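One caveat if this is run on Windows (an assumption on my part, based on the comments below): multiprocessing uses the spawn start method there, so the pool must only be created from the main module, roughly like this:

if __name__ == '__main__':
    df = parallelApply(df, decode)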


There is also a very convenient library called pandarallel that will do this for you, but you would be forced to use WSL when running on Windows:

pip install pandarallel

After calling

from pandarallel import pandarallel
pandarallel.initialize()

you simply convert the call from

df.apply(...)

to

df.parallel_apply(func)

The library will spin up multiple processes and let each process handle a subset of data.
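Put together, a minimal sketch for this case might look as follows (assuming parallel_apply accepts the same arguments as apply, as noted above):

from pandarallel import pandarallel

pandarallel.initialize()

# the same row-wise lambda as before, now spread across worker processes
df['decoded'] = df.parallel_apply(
    lambda x: dbc.decode_message(x['ID'], x['Data']), axis=1)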

  • Would `func` be the lambda function? or do I need to create a function that takes the dataframe and dbc and calls the `dbc.decode_message()` from the `cantools` library – RMRiver Jan 20 '20 at 12:19
  • lambda should be fine. Whatever works with `df.apply` should work with `df.parallel_apply`. Please note, in case you are running this with Windows, the python interpreter should be run from WSL. See the note on the pandarallel site. – MSpiller Jan 20 '20 at 12:21
  • Hmmmm.. The application will be run on Windows so though it is a solution to the question, it would be a hassle to deploy this to multiple users - it would've been ideal if the WSL wasn't required :( Thanks for the help – RMRiver Jan 20 '20 at 12:25
  • It is definitely possible to do this on _bare-metal_ without using pandarallel and thus not being forced to WSL. Let me change my answer – MSpiller Jan 20 '20 at 12:55
  • should the `decode` function use `df.apply(lambda x: dbc.decode_message(x['ID'], x['Data']))` rather than decoding using a series? The `decode_message` can't handle series. I also used `pool.starmap` to pass the dbc to the function but am still having issues. Thanks for your help - I'll take it from here :) – RMRiver Jan 20 '20 at 14:20
  • Yes, of course. That should be `df.apply(lambda ...` and most likely also `axis=1`. I fixed that. – MSpiller Jan 20 '20 at 14:23
  • To get it to work you need to pass the `dbc` variable to the decode message. This was completed by using `pd.concat(pool.starmap(func, zip(df_split, repeat(dbc))))` and passing it to the `parallel_apply` call before hand. An import `from itertools import repeat` is needed. I'm not sure what the etiquette is for changing peoples answers.... – RMRiver Jan 20 '20 at 15:01
  • Can't you just make sure that `dbc` is imported when running `func`? E.g. by adding `import dbc` into the implementation of `func`. That makes more sense then duplicating the module. All this depends only on how your code is structured. I think it is not suitable to add this to any answer. If your working solution is completely different, feel free to add your own answer. – MSpiller Jan 20 '20 at 15:08
  • `dbc` is an object not a module. It just contains the set of rules needed to decode the data. I believe it would just be easier to pass it twice... I'll add an additional answer. Again - thanks for your help :D – RMRiver Jan 20 '20 at 15:16

Adapted from M. Spiller's answer - differences are shown in brackets:

(imports) These must be imported (numpy is needed for array_split, and freeze_support comes from the multiprocessing package):

import cantools
import pandas as pd
import numpy as np
from itertools import repeat
from multiprocessing import freeze_support
import multiprocessing as mp

Write a function that does the conversion, taking a dataframe (and passing the dbc into decode):

def decode(df, dbc):
    # decode row by row; each call returns a dict of signal names to values
    decoded = df.apply(lambda x: dbc.decode_message(x['ID'], x['Data']), axis=1)
    df2 = pd.DataFrame(decoded.tolist())
    return df2

call it like this (passing the dbc through the functions):

def parallel_apply(df, func, dbc=None, numChunks=mp.cpu_count()):
    df_split = np.array_split(df, numChunks)
    pool = mp.Pool(numChunks)

    df2 = pd.concat(pool.starmap(func, zip(df_split, repeat(dbc))))
    pool.close()
    pool.join()
    return df2

if __name__ == '__main__':
    freeze_support()
    #read in dbc
    #read in df with encoded CAN messages
    df2 = parallel_apply(df, decode, dbc)

Implement the reading steps where the comments have been placed. This solution will use all cores on the CPU, splitting the task into one chunk per core, processing the chunks in parallel and rejoining the dataframes at the end.
