0

I am looking to apply a function to multiple columns to a datatable in Python. With R's data.table one would:

# columns to apply function to
x <- c('col_1', 'col_2')

# apply
df[, (x) := lapply(.SD, function(x) as.Date(x, "%Y-%m-%d")), .SDcols=x]

How would one do the same using Python's datatable? I have some knowledge of apply and lambda with pandas e.g.:

# create dummy data
df = pd.DataFrame({'col_1': ['2021-12-01']
                   , 'col_2': ['2021-12-02']
                   , 'col_3': ['foobar']
                   }
                  )

# columns to apply function to
x = ['col_1', 'col_2']

# apply
df[x] = df[x].apply(lambda x: pd.to_datetime(x, format='%Y-%m-%d'))

but what is its equivalent in Python's datatable? This is assuming I insist on the use of apply and lambda. Thank you.

edit* I have changed from an UDF to a standard function pd.to_datetime as some of us mentioned the former is not possible while the latter is. Feel free to use any examples to illustrate apply in conjunction with datatable. Thank you

Sweepy Dodo
  • 1,761
  • 9
  • 15
  • `py-datatable` is a third-party library just like Pandas; it is not "Python's" package. If you want to know how to use it, you should start by reading the [documentation](https://datatable.readthedocs.io/en/latest/manual/comparison_with_pandas.html). Alternately, consider why you aren't just using Pandas, if you know how to solve the problem that way. – Karl Knechtel Aug 23 '21 at 21:03
  • 2
    Currently, it's not possible. See: https://github.com/h2oai/datatable/issues/1960 – Mohammad Aug 23 '21 at 21:06
  • @KarlKnechtel Thank you for your link. That documentation is one among many I have reviewed. I could not find a solution on it. I am looking for a `py-datatable` solution due to its speed [URL](https://github.com/Rdatatable/data.table/wiki/Benchmarks-%3A-Grouping) @Mohammad Thank you too for the function request – Sweepy Dodo Aug 23 '21 at 21:51
  • can you create a reproducible example; folks here can then answer using solutions within pydatatable. using anonymous functions via apply is not yet supported. applying a function to multiple functions is possible; not anonymous functions though – sammywemmy Aug 24 '21 at 03:53
  • 1
    For the example shared, you can use the `as_type` method, : ``DT[:, dt.as_type(f[:], dt.Type.date32)]``. datatable is not as robust as pandas (still some missing features); with time, and contributions from the community, it can only get better – sammywemmy Sep 07 '21 at 21:14
  • @sammywemmy Apologies for late reply Your solution works great. Below is mine to specify which cols: `x = ['col_1', 'col_2']` `df[:,x] = df[:, dt.as_type(f[x], dt.Type.date32)]` Though this merely prints it but not reassign. In R it would be `df[, (x) := foobar]` What is `:=` 's equivalent? I could not get `update` to work. Thank you – Sweepy Dodo Sep 20 '21 at 21:35
  • @sammywemmy And yes, I'm 1 of the early(ish) R `data.table` adopters. Loved its simple syntax and speed , thus, am trying out `Pydatatable` too. Can't wait for more features. – Sweepy Dodo Sep 20 '21 at 21:48
  • Added an answer, hopefully it is helpful. Also made a documentation PR on datatable. Please feel free to comment and update it; the more docs the better it will be for the community to effectively use datatable – sammywemmy Sep 20 '21 at 22:10

1 Answers1

1

I recently made a PR showing ways to transform columns in datatable; it should be merged soon. Please feel free to comment and update it.

To the question, you can directly assign, and also use the update method:

from datatable import dt, f, update, Type, as_type

DT0 = dt.Frame({'col_1': ['2021-12-01']
                   , 'col_2': ['2021-12-02']
                   , 'col_3': ['foobar']
                   }
                  )

cols = ['col_1', 'col_2']

DT0
   | col_1       col_2       col_3 
   | str32       str32       str32 
-- + ----------  ----------  ------
 0 | 2021-12-01  2021-12-02  foobar
[1 row x 3 columns]

Via reassignment:

DT = DT0.copy()

DT[:, cols] = DT[:, as_type(f[cols], Type.date32)]

DT
   | col_1       col_2       col_3 
   | date32      date32      str32 
-- + ----------  ----------  ------
 0 | 2021-12-01  2021-12-02  foobar
[1 row x 3 columns]

With Direct assignment, you can assign the f-expression to the column; this only works for single assignment:

DT = DT0.copy()

DT['col_1'] = as_type(f.col_1, Type.date32)

DT['col_2'] = as_type(f.col_2, Type.date32)

DT
 
   | col_1       col_2       col_3 
   | date32      date32      str32 
-- + ----------  ----------  ------
 0 | 2021-12-01  2021-12-02  foobar
[1 row x 3 columns]

The update function works as well; I like the feature, especially for SQL window like operations, where I do not want the the order of the columns to change (datatable sorts when performing a groupby):

DT = DT0.copy()

DT[:, update(col_1 = dt.as_type(f.col_1, Type.date32), 
             col_2 = dt.as_type(f.col_2, Type.date32))]
DT
   | col_1       col_2       col_3 
   | date32      date32      str32 
-- + ----------  ----------  ------
 0 | 2021-12-01  2021-12-02  foobar
[1 row x 3 columns]

Note that update is in-place; no re-assignment is required. For multiple columns, a dictionary can help to automate the process:

columns = {col : as_type(f[col], Type.date32) for col in cols}

print(columns)
{'col_1': FExpr<as_type(f['col_1'], date32)>,
 'col_2': FExpr<as_type(f['col_2'], date32)>}

# unpack the dictionary within the datatable brackets
DT = DT0.copy()
DT[:, update(**columns)]

DT
   | col_1       col_2       col_3 
   | date32      date32      str32 
-- + ----------  ----------  ------
 0 | 2021-12-01  2021-12-02  foobar
[1 row x 3 columns]
sammywemmy
  • 27,093
  • 4
  • 17
  • 31
  • `DT[:, cols] = DT[:, as_type(f[cols], Type.date32)]` is perhaps the neatest option for now. I had to ask as I was hoping for R `data.table`'s neat `dt[, (x) := foobar]` I had come across the dictionary expression but did not think it would be applicable here. Big thank you for listing out all (current) options. Great way to learn. About the vastly different methods in applying changes to multiple cols vs R `data.table`. Why is this? Is it not technically possible to R `data.table`'s syntax? And should this be reserved for the Github link you sent? Also thanks for the link – Sweepy Dodo Sep 22 '21 at 20:44
  • they are different languages; for one, R has this delayed execution, where you can pass variables and wont be applied until much later; compared to python that has an eager execution. what pydatatable does is try to mimic R's datatable's syntax as much as possible, while keeping true to python's idiomatic use, and working within python's limitations. – sammywemmy Sep 22 '21 at 21:51
  • I had the suspicion it wasn't possible to exactly replicate R `data.table`'s syntax. Thank you a bunch @sammywemmy – Sweepy Dodo Sep 23 '21 at 20:42