data.table is popular in R, and it also has a Python version, datatable. However, I don't see anything in the docs for applying a user-defined function over a datatable.

Here's a toy example (in pandas) where a user function is applied over a dataframe to look for po-box addresses:

import re
import pandas as pd

df = pd.DataFrame({'customer': [101, 102, 103],
                   'address': ['12 main st', '32 8th st, 7th fl', 'po box 123']})

customer | address
----------------------------
101      | 12 main st
102      | 32 8th st, 7th fl
103      | po box 123


# User-defined function:
def is_pobox(s):
    rslt = re.search(r'^p(ost)?\.? *o(ffice)?\.? *box *\d+', s)
    if rslt:
        return True
    else:
        return False

# Using .apply() for this example:
df['is_pobox'] = df.apply(lambda x: is_pobox(x['address']), axis = 1)

# Expected Output:
customer | address           | is_pobox
-----------------------------|---------
101      | 12 main st        | False
102      | 32 8th st, 7th fl | False
103      | po box 123        | True

Is there a way to do this .apply operation in datatable? Would be nice, because datatable seems to be quite a bit faster than pandas for most operations.

Pasha
AdmiralWen
  • There is an open issue, https://github.com/h2oai/datatable/issues/1960, to support such functionality. You can give it a thumbs-up and subscribe to be notified when the feature is implemented. – Pasha Oct 28 '20 at 20:27
  • Also note that your example can be implemented without a UDF: `DT["is_pobox"] = f.address.re_match(r'^p(ost)?\.? *o(ffice)?\.? *box *\d+')` – Pasha Oct 28 '20 at 20:33
  • Also note that `datatable.FExpr.re_match()` is marked as deprecated as of 0.11; from version 1.0 this function will be available in the `re.` submodule. – topchef Oct 29 '20 at 03:46

1 Answer

If you're just applying the function to a column and then adding the result back to the table, here is an approach that I still haven't found documented anywhere online:

import datatable as dt
DT = dt.Frame(df)  # convert the pandas DataFrame from the question into a datatable Frame
is_pobox_dt = dt.Frame(list(map(is_pobox, DT[:, 'address'].to_list()[0])))
DT[:, dt.update(is_pobox=is_pobox_dt)]

You can add a range, another datatable, or even a NumPy array, just not a list. So this works as well:

import numpy as np
is_pobox_array = np.array(list(map(is_pobox, DT[:, 'address'].to_list()[0])))
DT[:, dt.update(is_pobox=is_pobox_array)]
Anthony Lam
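
The workaround discussed above boils down to three steps: pull the column out as a Python list, map the function over it, and attach the result as a new column. A minimal sketch of that pattern follows. The helper names `apply_udf` and `add_udf_column` are hypothetical and do not come from datatable or from the thread. This sketch attaches the result with `Frame.cbind()` instead of `update()`, and it imports datatable lazily so that the pure-Python part runs even where datatable is not installed.

```python
import re

# Same pattern as in the question, compiled once.
POBOX_RE = re.compile(r'^p(ost)?\.? *o(ffice)?\.? *box *\d+')

def is_pobox(s):
    """Return True if the address string starts like a PO box."""
    return bool(POBOX_RE.search(s))

def apply_udf(values, func):
    """Apply func to every element of a plain Python list."""
    return [func(v) for v in values]

def add_udf_column(DT, src, dest, func):
    """Hypothetical helper: append func(DT[src]) to DT as column dest."""
    import datatable as dt  # imported lazily; assumed to be installed by the caller
    values = DT[:, src].to_list()[0]  # single column -> Python list
    DT.cbind(dt.Frame({dest: apply_udf(values, func)}))  # cbind modifies DT in place
    return DT

addresses = ['12 main st', '32 8th st, 7th fl', 'po box 123']
flags = apply_udf(addresses, is_pobox)
print(flags)  # [False, False, True]
```

With datatable installed, `add_udf_column(dt.Frame(df), 'address', 'is_pobox', is_pobox)` would be the corresponding call. Note that this still runs the function in Python one row at a time, so it will not be faster than pandas `.apply()`; for this particular regex, the built-in expression from the comments avoids the round trip.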