1

I have been using the following in pandas to replace certain characters using a regular expression:

df = df.replace(r'\t|\r|\n', '', regex=True)

As mentioned here, dask has mask instead, but I cannot find how to use a regex with that function. Any help is appreciated.

Nabin

1 Answer

4

The most common way to handle row-wise operations like this is map_partitions, which applies a function to each chunk of the dask dataframe; each chunk is a real pandas dataframe.

In this example:

df2 = df.map_partitions(lambda d: d.replace(r'\t|\r|\n', '', regex=True))

Here, df is a dask dataframe. Note that the function passed to map_partitions receives a pandas dataframe and returns a pandas dataframe.

mdurant
  • Looks promising. I will try and let you know. Thanks – Nabin Jul 08 '18 at 09:10
  • I tried your solution and it worked. Just wondering though, what do you mean when you say that map_partitions expects pandas dataframe? In your example, you used a dask dataframe. – Yousuf May 05 '20 at 23:25
  • The input to the lambda function (`d`) is a pandas dataframe, parts of the larger dask dataframe. – mdurant May 06 '20 at 12:14