Extract dictionary value from column in data frame with Vaex

Question

I ran the following command on my dataframe:

df['date_article'] = df.pagePath.str.extract_regex(pattern=r'(?P<digit>/\d{4}/\d{2}/\d{2}/)')

This created the column 'date_article':

pagePath	date_article
'/empresas/2021/10/22/tiendas-no-participan-buen'	{'digit': '/2021/10/22/'}
'/finanzas-personales/2021/10/22/pueden-cobrar-c	{'digit': '/2021/10/22/'}

Now I want to keep only the date in 'date_article'.

Expected output

pagePath	date_article
'/empresas/2021/10/22/tiendas-no-participan-buen'	'/2021/10/22/'
'/finanzas-personales/2021/10/22/pueden-cobrar-c	'/2021/10/22/'

I tried many things, but nothing seems to work.

Thank you in advance for your help.

It seems to be a dict. The documentation says: "Returns: an expression containing a struct with field names corresponding to capture group identifiers." — Jorge Guzmán, Feb 02 '22 at 05:41
https://vaex.io/docs/api.html#vaex.expression.StringOperations.extract_regex — Jorge Guzmán, Feb 02 '22 at 05:42

score 1 · Answer 1 · answered Feb 02 '22 at 05:47

It appears that extract_regex returns a struct series:

Extract substrings defined by a regular expression using Apache Arrow (Google RE2 library).

Parameters
pattern (str) – A regular expression which needs to contain named capture groups, e.g. ‘letter’ and ‘digit’ for the regular expression
‘(?P<letter>[ab])(?P<digit>\d)’.

Returns
an expression containing a struct with field names corresponding to capture group identifiers.

So you will need to extract the field you want out of the struct. I'm not a Vaex expert but maybe something like:

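# extract_regex returns a struct expression with one field per named capture group
struct_series = df.pagePath.str.extract_regex(pattern=r'(?P<digit>/\d{4}/\d{2}/\d{2}/)')
# Keep only the 'digit' field of the struct
df['date_article'] = struct_series.struct.get('digit')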
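
For intuition, here is a minimal plain-Python sketch of the same idea. It uses the standard-library re module rather than Vaex or RE2, so it is only an analogy: groupdict() stands in for the struct, and looking up 'digit' stands in for struct.get('digit'). The helper name extract_named_field and the third sample path are made up for illustration.

```python
import re

# Same pattern as in the question; a raw string keeps the backslashes literal.
PATTERN = re.compile(r'(?P<digit>/\d{4}/\d{2}/\d{2}/)')


def extract_named_field(text, field):
    """Hypothetical helper: return one named capture group, or None if no match."""
    match = PATTERN.search(text)
    if match is None:
        return None
    # groupdict() maps each group name to its captured text,
    # which is the dict-like shape seen in the 'date_article' column.
    return match.groupdict()[field]


paths = [
    '/empresas/2021/10/22/tiendas-no-participan-buen',
    '/finanzas-personales/2021/10/22/pueden-cobrar-c',
    '/no-date-here',  # illustrative row without a date
]
dates = [extract_named_field(p, 'digit') for p in paths]
print(dates)
```

Each element of dates is either the captured string or None when the path contains no date, so a real pipeline would probably want to decide how to treat the non-matching rows.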

Thank you! As you said, df['date_article'] = df['date_article'].struct.get('digit') solved the problem. — Jorge Guzmán, Feb 02 '22 at 06:03

score 0 · Answer 2 · answered Feb 02 '22 at 05:29

How about the following:

df['date_article'] = df.apply(lambda x: x['digit'], axis=1)

oreopot

I tried this and it gave me this error: TypeError: apply() got an unexpected keyword argument 'axis' – Jorge Guzmán Feb 02 '22 at 05:39

keramat · Answer 3 · 2022-02-02T08:07:37.630

Use:

import pandas as pd
df = pd.DataFrame({'date_article':[{'digit': '/2021/10/22/'}]})
# Pull the 'digit' value out of each dict in the column
df['date_article'] = df['date_article'].apply(lambda x: x['digit'])

This uses a lambda function that returns the value of the 'digit' key for each row of the column and assigns the result back to it. Alternatively, why not extract the date directly:

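df = pd.DataFrame({'date_article':['sdfsdf/2021/10/22/']})
# Raw string for the regex; expand=False returns a Series for the single capture group
df['date_article'] = df['date_article'].str.extract(r'(/\d{4}/\d{2}/\d{2}/)', expand=False)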
