
I know this question has been asked multiple times, but I could not find a case similar to mine.

I have this function:

def load_data(list_of_files, INP_DIR, return_features=False):
    data = []

    # ------- I want to multithread this block ------ #
    for file_name in tqdm(list_of_files):
        subject, features = load_subject(INP_DIR, file_name)
        data.append(subject.reset_index())
    # ------------- #

    data = pd.concat(data, axis=0, ignore_index=True)
    target = data['label']

    if return_features:
        return data, target, features
    else:
        return data, target

The above function uses load_subject; for your reference, it is defined as follows:

def load_subject(INP_DIR, file_name):
    subject = pd.read_csv(INP_DIR + file_name, sep='|')

    < do some processing ...>

    return subject, features

I have 64 CPU cores, but I am not able to use all of them.

I tried this with multiprocessing:

from multiprocessing import Pool

train_files = ['p011431.psv', 'p008160.psv', 'p007253.psv', 'p018373.psv', 'p017040.psv']

if __name__ == '__main__':
    with Pool(processes=64) as pool:
        pool.map(load_data, train_files)

As you can see, train_files is a list of file names.

When I run the above lines, I get this error:

---------------------------------------------------------------------------
RemoteTraceback                           Traceback (most recent call last)
RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/anaconda3/lib/python3.6/multiprocessing/pool.py", line 119, in worker
    result = (True, func(*args, **kwds))
  File "/anaconda3/lib/python3.6/multiprocessing/pool.py", line 44, in mapstar
    return list(map(*args))
TypeError: load_subject() missing 1 required positional argument: 'file_name'
"""

The above exception was the direct cause of the following exception:

TypeError                                 Traceback (most recent call last)
<ipython-input-24-96a3ce89ebb8> in <module>()
      2 if __name__ == '__main__':
      3     with Pool(processes=2) as pool:
----> 4         pool.map(load_subject, train_files)  # process data_inputs iterable with pool

/anaconda3/lib/python3.6/multiprocessing/pool.py in map(self, func, iterable, chunksize)
    264         in a list that is returned.
    265         '''
--> 266         return self._map_async(func, iterable, mapstar, chunksize).get()
    267 
    268     def starmap(self, func, iterable, chunksize=None):

/anaconda3/lib/python3.6/multiprocessing/pool.py in get(self, timeout)
    642             return self._value
    643         else:
--> 644             raise self._value
    645 
    646     def _set(self, i, obj):

TypeError: load_subject() missing 1 required positional argument: 'file_name'

Updates:

After Tom's answer, I found another way to pass only one argument.

Here are the updated functions, and below is the error I am getting:

def load_data(list_of_files):
    data = []

    # ------- I want to multithread this block ------ #
    for file_name in tqdm(list_of_files):
        subject, features = load_subject(file_name)
        data.append(subject.reset_index())
    # ------------- #

    data = pd.concat(data, axis=0, ignore_index=True)
    target = data['label']

    return data, target


def load_subject(file_name):
    subject = pd.read_csv(file_name, sep='|')

    < do some processing ...>

    return subject, features




from multiprocessing import Pool

train_files = ['p011431.psv', 'p008160.psv', 'p007253.psv', 'p018373.psv']

if __name__ == '__main__':
    with Pool(processes=64) as pool:
        pool.map(load_data, train_files)

When I run the above lines, I get a new error:

---------------------------------------------------------------------------
RemoteTraceback                           Traceback (most recent call last)
RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/anaconda3/lib/python3.6/multiprocessing/pool.py", line 119, in worker
    result = (True, func(*args, **kwds))
  File "/anaconda3/lib/python3.6/multiprocessing/pool.py", line 44, in mapstar
    return list(map(*args))
  File "<ipython-input-21-494105028a08>", line 407, in load_data
    subject , features = load_subject(file_name)
  File "<ipython-input-21-494105028a08>", line 170, in load_subject
    subject= pd.read_csv(file_name, sep='|')
  File "/anaconda3/lib/python3.6/site-packages/pandas/io/parsers.py", line 678, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/anaconda3/lib/python3.6/site-packages/pandas/io/parsers.py", line 440, in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)
  File "/anaconda3/lib/python3.6/site-packages/pandas/io/parsers.py", line 787, in __init__
    self._make_engine(self.engine)
  File "/anaconda3/lib/python3.6/site-packages/pandas/io/parsers.py", line 1014, in _make_engine
    self._engine = CParserWrapper(self.f, **self.options)
  File "/anaconda3/lib/python3.6/site-packages/pandas/io/parsers.py", line 1708, in __init__
    self._reader = parsers.TextReader(src, **kwds)
  File "pandas/_libs/parsers.pyx", line 539, in pandas._libs.parsers.TextReader.__cinit__
  File "pandas/_libs/parsers.pyx", line 737, in pandas._libs.parsers.TextReader._get_header
  File "pandas/_libs/parsers.pyx", line 932, in pandas._libs.parsers.TextReader._tokenize_rows
  File "pandas/_libs/parsers.pyx", line 2112, in pandas._libs.parsers.raise_parser_error
pandas.errors.ParserError: Error tokenizing data. C error: Calling read(nbytes) on source failed. Try engine='python'.
"""

The above exception was the direct cause of the following exception:

ParserError                               Traceback (most recent call last)
<ipython-input-22-d6dcc5840b63> in <module>()
      4 
      5 with Pool(processes=3) as pool:
----> 6     pool.map(load_data, files)

/anaconda3/lib/python3.6/multiprocessing/pool.py in map(self, func, iterable, chunksize)
    264         in a list that is returned.
    265         '''
--> 266         return self._map_async(func, iterable, mapstar, chunksize).get()
    267 
    268     def starmap(self, func, iterable, chunksize=None):

/anaconda3/lib/python3.6/multiprocessing/pool.py in get(self, timeout)
    642             return self._value
    643         else:
--> 644             raise self._value
    645 
    646     def _set(self, i, obj):

ParserError: Error tokenizing data. C error: Calling read(nbytes) on source failed. Try engine='python'.

What am I missing here? How can I make this work properly?

smerllo
  • you did not pass `INP_DIR`. – LiuXiMin Jun 26 '19 at 01:05
  • check new updates – smerllo Jun 26 '19 at 01:15
  • Your `load_data` accepts `list_of_files`, so you cannot pass `list_of_files` to `pool.map`. It should be a list of `list_of_files`. – LiuXiMin Jun 26 '19 at 01:18
  • In this case, I don't think python will use all the cores. will it? – smerllo Jun 26 '19 at 01:19
  • All right I guess you are right. I passed `list of list_of_files` and it seems working. How can I give you an upvote for this? ;) thank you btw – smerllo Jun 26 '19 at 01:24
  • I post it as an answer, so you can upvote, thanks :-) – LiuXiMin Jun 26 '19 at 01:26
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/195546/discussion-between-chami-soufiane-and-liuximin). – smerllo Jun 26 '19 at 01:37
  • By the way, using multiprocessing, python will use all the cores because it spawns as many new processes (not just threads) as you specify in the ```Pool()``` constructor. Each process will have its own GIL. – Tom Lubenow Jun 26 '19 at 16:28
  • tbh, even though this code worked for me without error, I did not see any speedup in my script or reduction of the run time. It doesn't seem to be working, unfortunately – smerllo Jun 26 '19 at 16:31
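The mistake LiuXiMin points out in the comments can be reproduced with a minimal sketch (using a hypothetical stand-in for `load_data`): when `pool.map` is given a flat list of filename strings, each worker's `list_of_files` argument is a single string, and iterating a string yields its characters, which is why `pd.read_csv` ends up being handed one-character "filenames".

```python
from multiprocessing import Pool

def load_data(list_of_files):
    # hypothetical stand-in for the real load_data: collect what the loop sees
    return [file_name for file_name in list_of_files]

if __name__ == '__main__':
    train_files = ['p011431.psv', 'p008160.psv']
    with Pool(processes=2) as pool:
        res = pool.map(load_data, train_files)
    # each worker received a whole *string*, so the loop split it into characters
    print(res[0][:4])  # ['p', '0', '1', '1']
```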

2 Answers


multiprocessing's `Pool.map()` can only pass one argument at a time. I believe there is a "proper" workaround for this in Python 3, but I used the following hack in Python 2 all the time and see no reason why it would not still work.

Define a wrapper for load_subject which takes only one argument, and define a small parameter object to use for that argument.

def wrapped_load_subject(param):
    return load_subject(param.inp_dir, param.file_name)

class LoadSubjectParam:
    def __init__(self, inp_dir, file_name):
        self.inp_dir = inp_dir
        self.file_name = file_name

train_files = []  # Make this a list of LoadSubjectParam objects

with Pool(processes=64) as pool:
    pool.map(wrapped_load_subject, train_files)

edit: Also, there's this post.
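The Python 3 "proper" workaround alluded to above is `functools.partial`, which freezes `INP_DIR` so that `pool.map` only has to supply the file name. A sketch, using a hypothetical stand-in for the real `load_subject` and an assumed `data/` input directory:

```python
from functools import partial
from multiprocessing import Pool

def load_subject(inp_dir, file_name):
    # hypothetical stand-in for the real load_subject
    return inp_dir + file_name

if __name__ == '__main__':
    INP_DIR = 'data/'  # assumed input directory
    train_files = ['p011431.psv', 'p008160.psv']
    # partial freezes INP_DIR, leaving the one-argument callable pool.map needs
    with Pool(processes=2) as pool:
        res = pool.map(partial(load_subject, INP_DIR), train_files)
    print(res)  # ['data/p011431.psv', 'data/p008160.psv']
```

`Pool.starmap` with `(INP_DIR, file_name)` tuples is an equivalent alternative.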

Tom Lubenow
  • Sorry, Tom, your answer was very helpful, but I did not pay attention to this error: it should be `pool.map(load_data, train_files)`, not `pool.map(load_subject, train_files)` – smerllo Jun 26 '19 at 01:36

Your `load_data` accepts `list_of_files`, so you cannot pass `list_of_files` to `pool.map`. It should be a list of `list_of_files`.

Get the result like this:

with Pool(processes=64) as pool:  
    res = pool.map(load_data, train_files)
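As discussed in the comments below, for this to use the cores, `train_files` must here be a list of *groups* of file names, one group per worker. A sketch of splitting roughly 5000 files into 64 groups, mapping, and flattening the per-group results, with a hypothetical stand-in for `load_data`:

```python
from multiprocessing import Pool

def load_data(list_of_files):
    # hypothetical stand-in for the real load_data: handles a *group* of files
    return [name.upper() for name in list_of_files]

def chunk(lst, n_chunks):
    """Split lst into at most n_chunks roughly equal sublists."""
    size = -(-len(lst) // n_chunks)  # ceiling division
    return [lst[i:i + size] for i in range(0, len(lst), size)]

if __name__ == '__main__':
    train_files = ['p%06d.psv' % i for i in range(5000)]  # e.g. 5000 file names
    groups = chunk(train_files, 64)  # one group of files per core
    with Pool(processes=64) as pool:
        per_group = pool.map(load_data, groups)
    # flatten the per-group results back into a single list
    results = [item for group in per_group for item in group]
    print(len(results))  # 5000
```

Alternatively, passing the flat file list with an explicit `chunksize` to `pool.map` lets the pool do the grouping itself.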
LiuXiMin
  • One last question, how to get the output `data`? I used to get data from `load_data`. How can I get this in here? – smerllo Jun 26 '19 at 01:27
  • @CHAMISoufiane in a multiprocessing pool, you don't have to use `res` outside the `with` block; sorry for the first version. – LiuXiMin Jun 26 '19 at 01:32
  • tbh, even though this code worked for me without error, I did not see any speedup in my script or reduction of the run time. It doesn't seem to be working, unfortunately – smerllo Jun 26 '19 at 16:32
  • @CHAMISoufiane How did you pass the train_files? If your `len(train_files)` is too small, the map cannot help you speed up. – LiuXiMin Jun 26 '19 at 22:51
  • Absolutely not small. `len(train_files)` is equal to `5000` elements. It took me 3 hours to finish the job :( . I feel a bit frustrated – smerllo Jun 27 '19 at 00:56
  • @CHAMISoufiane Are you sure? `train_files` should be a list of `list_of_files`; it should not be too small or too big. I think you are saying you have 5000 files in total. You should split them into `64` groups; fewer or more is okay, but of the same magnitude. – LiuXiMin Jun 27 '19 at 01:13
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/195599/discussion-between-chami-soufiane-and-liuximin). – smerllo Jun 27 '19 at 01:19