-1

I am using python's datatable.

I have 2 csv files.

CSV 1

A,B
1,2
3,4
5,6

CSV 2

NAME,EXPR
A_GREATER_THAN_B, A>B
A_GREATER_THAN_10, A>10
B_GREATER_THAN_5, B>5

Expected Output

A,B,A_GREATER_THAN_B,A_GREATER_THAN_10,B_GREATER_THAN_5
1,2,0,0,0
3,4,0,0,0
5,6,0,0,1

Code

exprdt = dt.fread("csv_2.csv")
exprdict = dict(exprdt.to_tuples())
dt1[:, dt.update(**exprdict)]
print(dt1)

Current output

   |     A      B      C   A_G_B          A_G_1     B_G_4   
   | int32  int32  int32   str32          str32     str32   
-- + -----  -----  -----   -------------  --------  --------
 0 |     0      1      1   dt.f.A>dt.f.B  dt.f.A>1  dt.f.B>4
 1 |     1      5      6   dt.f.A>dt.f.B  dt.f.A>1  dt.f.B>4

I am trying to use the extend function to process first datatable using the expressions from second datatable. When i use fread to read the csv files, the expression is processed as string and not as expression.

How do i use the 2nd datatable (csv) to update the first datatable using the NAME and EXPR columns?

l a s
  • 3,836
  • 10
  • 42
  • 61
  • Can you provide an example of the expected output? Am I right in assuming you expect something like `A_GREATER_THAN_B, False` for the first line of the second table? Note that `datatable` is not "Python's", it is a third party library that users need to install to be able to use your script - are you using it for a specific reason, or were you mistakenly thinking it was standard Python. I have nothing against it, but if all you need is a list or an array, you may want to steer clear - or use something that better aligns with wherever you want to use the result? (you may have a good reason) – Grismar Jan 27 '22 at 22:23
  • I have added the output i am trying to achieve. I am trying to use datatable instead of dataframe. – l a s Jan 27 '22 at 23:23
  • I feel it would be easier to get csv2 as a dictionary, then build your expression, that you can pass to `if else` to get you desired output – sammywemmy Jan 28 '22 at 00:34
  • Yes, i can convert the 2nd csv to dict but still the value will be of datatype string and not an expression. – l a s Jan 28 '22 at 00:42
  • kindly share the dictionary, let's see. If you can find a way to attach datatable's `f` function , you are half way there – sammywemmy Jan 28 '22 at 00:43
  • looks unnecessarily complicated though; you still need to convert the `>` to an operator (operator.gt) or some other way – sammywemmy Jan 28 '22 at 00:48
  • I feel you're making it very hard on yourself, by having expressions as text in your data, needing to evaluate those expressions against the column names in another table, whilst using a library that doesn't directly support text expressions like these - what have you tried yourself so far? Is the solution you propose required in all its complexity and flexibility? – Grismar Jan 28 '22 at 00:49
  • pandas dataframe eval is an option. i am looking for similar functionality in datatable and my ony problem is converting string to FExpr. – l a s Jan 28 '22 at 01:04

1 Answers1

0

You can do what you want, but just that you can do something doesn't mean it's a good idea. Any solution that requires eval() is probably more complicated than it needs to be, and introduces great risks if you don't have complete control over the data going in.

Having said that, this script shows a naive approach without fancy expressions from a table, and the approach you suggest - which I strongly recommend you not use and figure out a better way to achieve what you need:

from io import StringIO
import re
import datatable as dt

csv1 = """A,B
1,2
3,4
5,6"""

csv2 = """NAME,EXPR
A_GREATER_THAN_B, A>B
A_GREATER_THAN_10, A>10
B_GREATER_THAN_5, B>5"""


def naive():
    # naive approach
    d = dt.fread(StringIO(csv1))
    d['A_GREATER_THAN_B'] = d[:, dt.f.A > dt.f.B]
    d['A_GREATER_THAN_10'] = d[:, dt.f.A > 10]
    d['B_GREATER_THAN_5'] = d[:, dt.f.B > 5]

    print(d)


def update_with_expressions(d, expressions):
    for n in range(expressions.nrows):
        col = expressions[n, :][0, 'NAME']
        expr = re.sub('([A-Za-z]+)', r'dt.f.\1', expressions[n, :][0, 'EXPR'])
        # here's hoping that expression is trustworthy...
        d[col] = d[:, eval(expr)]


def fancy():
    # fancy, risky approach
    d = dt.fread(StringIO(csv1))
    update_with_expressions(d, dt.fread(StringIO(csv2)))
    print(d)
    

if __name__ == '__main__':
    naive()
    fancy()

Result (showing you get the same result from either approach):

   |     A      B  A_GREATER_THAN_B  A_GREATER_THAN_10  B_GREATER_THAN_5
   | int32  int32             bool8              bool8             bool8
-- + -----  -----  ----------------  -----------------  ----------------
 0 |     1      2                 0                  0                 0
 1 |     3      4                 0                  0                 0
 2 |     5      6                 0                  0                 1
[3 rows x 5 columns]

   |     A      B  A_GREATER_THAN_B  A_GREATER_THAN_10  B_GREATER_THAN_5
   | int32  int32             bool8              bool8             bool8
-- + -----  -----  ----------------  -----------------  ----------------
 0 |     1      2                 0                  0                 0
 1 |     3      4                 0                  0                 0
 2 |     5      6                 0                  0                 1
[3 rows x 5 columns]

Note: if someone knows of a nicer way to iterate over rows in a datatable.Frame, please leave a comment, because I'm not a fan of this part:

    for n in range(expressions.nrows):
        col = expressions[n, :][0, 'NAME']

Note that StringIO is only imported to have the .csv files in the code, you wouldn't need them.

Grismar
  • 27,561
  • 4
  • 31
  • 54