python datatable, string manipulation on column

Question

from datatable import dt, f, g, by, update, join, sort

tt = dt.Frame({'a' : ['A1','A2','A3'], 'b':[100,200,300]})

print(tt)
   | a     b
-- + --  ---
 0 | A1  100
 1 | A2  200
 2 | A3  300

[3 rows x 2 columns]

How can I remove the 'A' in the a column and assign it to a new column 'c' as a number in the datatable way (w/o pandas that is)?

It would look like this with the help of pandas

tt['c'] = tt.to_pandas()['a'].str.replace('A','').astype(int)

A datatable native version does not quite work

tt[:, update(c = [int(x.replace('A','')) for x in f.a])]
TypeError: 'datatable.FExpr' object is not iterable

By the way, for a frequent user of python pandas and R data.table, is there an advanced/complete cookbook that can help the transition from R data.table to py-datatable? There is a page on the website, but not quite enough.

Hi @jf328; at the moment, python's datatable has sparse support for string operations. You could raise an issue or a thumbs up for existing issues that are important to you, so that the library maintainers can see its importance and pay more attention to it. Have a look at the [user guide](https://datatable.readthedocs.io/en/latest/manual/index-manual.html) for examples on using the library. I intend adding more documentation, so if you have examples you would like to have in a cookbook, please feel free to list them on the [github](https://github.com/h2oai/datatable/issues/2604) page. — sammywemmy, Oct 13 '20 at 01:26

sammywemmy · Answer 1 · 2020-10-13T01:41:06.483

Here is a hack that does not particularly scale well :

Step 1 : dump the a column into native python and create a tuple of values :

tuples = [(entry[0], entry[-1]) for entry  in tt['a'].to_list()[0]]

Step 2: cbind back to the tt frame :

tt.cbind(dt.Frame(tuples))

tt


    a   b   C0  C1
0   A1  100 A   1
1   A2  200 A   2
2   A3  300 A   3

If you only need the A's, then you could use the code below, which, still does not scale well (imagine you have null values in the columns), and is gnarly(we have to index into the list to get what we want) :

tt["A_only"] = dt.Frame([entry[0] for entry in tt['a'].to_list()[0]])

tt

     a   b   A_only
0   A1  100     A
1   A2  200     A
2   A3  300     A

As stated earlier, this does not scale well. Besides, it does not offer the speed that datatable aims for.

At the moment, datatable does not have great string operations support (the library maintainers, I believe are currently working on that, as well as some other requested features)

score 1 · Answer 2 · answered Jul 03 '21 at 13:43

I know this is an old question, but in case someone is still looking for this -- in the just-released 1.0.0 one can do the following:

tt = dt.Frame({'a' : ['A1','A2','A3'], 'b':[100,200,300]})
tt["A_only"] = tt[:, f.a[0:1]]
tt["num_only"] = tt[:, f.a[1:]]
tt["num_only"] = dt.Type.int8  # Change the type of the column to `int`
tt.ltypes

The above uses slices on the string column, i.e. relies on a fixed format. There is a .re section too for regular expressions, but I only see match, not extract.

Joe Ferndz · Answer 3 · 2020-10-12T16:33:03.343

Here's a hack I did to get what you wanted. I am still learning Datatables so bear with me while I get fully into it.

First, convert the datatable into a dataframe. perform the operations i listed earlier, then convert the dataframe back to a datatable. Walla, you now have a datatable with the desired results.

So here's how i did it.

from datatable import dt, f, g, by, update, join, sort
tt = dt.Frame({'a' : ['A1','A2','A3'], 'b':[100,200,300]})
df = tt.to_pandas()
df = df.join(df.a.str.extract('([a-zA-Z])([0-9])', expand=True).add_prefix('a'))
df = df.rename(columns = {'a0': 'c', 'a1': 'd'})
tt = dt.Frame(df)
tt

Output of this will be:

You can split the column and rename the field.

import pandas as pd
df = pd.DataFrame({'a' : ['A1','A2','A3'], 'b':[100,200,300]})
print (df)
df = df.join(df['a'].str.split(r'(\d.*)', expand=True).add_prefix('a'))
df.drop('a2',axis = 1,inplace=True)
df = df.rename(columns = {'a0': 'c', 'a1': 'd'})
print (df)

The output will be:

The initial DataFrame will be:

The new DataFrame will look like this:

    a    b  c  d
0  A1  100  A  1
1  A2  200  A  2
2  A3  300  A  3

Alternate, you can also do it using extract and regular expression.

import pandas as pd
df1 = pd.DataFrame({'a' : ['A1','A2','A3'], 'b':[100,200,300]})
df1 = df1.join(df1.a.str.extract('([a-zA-Z])([0-9])', expand=True).add_prefix('a'))
df1 = df1.rename(columns = {'a0': 'c', 'a1': 'd'})
print (df1)

It will give you the same results:

In this option, it does not create the additional column that need to be deleted

    a    b  c  d
0  A1  100  A  1
1  A2  200  A  2
2  A3  300  A  3

Thanks Joe. I've edited the question to make it clearer that I want a datatable native solution — jf328, Oct 12 '20 at 08:45
I don't know enough information on datatables to manipulate and get this done natively using datatable. The only options I find are using pandas or dictionary or lists. Convert them then manipulate then return it back to datatable. — Joe Ferndz, Oct 12 '20 at 17:18

python datatable, string manipulation on column

3 Answers3