
I ran the following command on my dataframe:

df['date_article'] = df.pagePath.str.extract_regex(pattern='(?P<digit>/\d{4}/\d{2}/\d{2}/)')

This created the column 'date_article':

pagePath date_article
'/empresas/2021/10/22/tiendas-no-participan-buen' {'digit': '/2021/10/22/'}
'/finanzas-personales/2021/10/22/pueden-cobrar-c {'digit': '/2021/10/22/'}

Now I want to keep only the date string in 'date_article', without the surrounding struct.

Expected output

pagePath date_article
'/empresas/2021/10/22/tiendas-no-participan-buen' '/2021/10/22/'
/finanzas-personales/2021/10/22/pueden-cobrar-c '/2021/10/22/'

I tried many things, but nothing seems to work.

Thank you in advance for your help.

3 Answers


It appears that extract_regex returns a struct series:

Extract substrings defined by a regular expression using Apache Arrow (Google RE2 library).

Parameters

pattern (str) – A regular expression which needs to contain named capture groups, e.g. ‘letter’ and ‘digit’ for the regular expression ‘(?P<letter>[ab])(?P<digit>\d)’.

Returns

an expression containing a struct with field names corresponding to capture group identifiers.

So you will need to extract the field you want out of the struct. I'm not a Vaex expert but maybe something like:

struct_series = df.pagePath.str.extract_regex(pattern='(?P<digit>/\d{4}/\d{2}/\d{2}/)')
df['date_article'] = struct_series.struct.get('digit')
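To illustrate why each cell ends up holding something like {'digit': '/2021/10/22/'}, here is a plain-Python analogue using the standard `re` module. It is only a sketch of the idea (named capture groups produce one record per row, and you then pick a field out of it); it does not use Vaex, and the names `PATTERN`, `structs` and `dates` are made up for the example.

```python
import re

# Same pattern as in the question; a raw string keeps the \d escapes intact.
PATTERN = re.compile(r'(?P<digit>/\d{4}/\d{2}/\d{2}/)')

paths = [
    '/empresas/2021/10/22/tiendas-no-participan-buen',
    '/finanzas-personales/2021/10/22/pueden-cobrar-c',
]

# Step 1: named groups give one dict-like record per row (the "struct").
structs = [PATTERN.search(p).groupdict() for p in paths]

# Step 2: pull the single field of interest out of each record.
dates = [s['digit'] for s in structs]

print(structs[0])
print(dates)
```

Step 2 is the part that the `struct.get('digit')` call above is meant to perform on the Vaex side.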
Pace

How about the following, assuming a pandas DataFrame whose 'date_article' column holds dicts:

df['date_article'] = df.apply(lambda row: row['date_article']['digit'], axis=1)
oreopot

Use:

import pandas as pd

df = pd.DataFrame({'date_article': [{'digit': '/2021/10/22/'}]})
df['date_article'] = df['date_article'].apply(lambda x: x['digit'])

This uses a lambda function that returns the value of the 'digit' key for each entry in the column and assigns the result back to the same column. Alternatively, why not extract the date directly with an unnamed capture group:

df = pd.DataFrame({'date_article':['sdfsdf/2021/10/22/']})
df['date_article'] = df['date_article'].str.extract(r'(/\d{4}/\d{2}/\d{2}/)', expand=False)
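One thing worth checking with the direct approach is what happens to rows that contain no date. The sketch below mimics the single-group extraction with the standard `re` module rather than pandas; `extract_date` is a hypothetical helper written for this example, and it returns None where pandas would give NaN.

```python
import re

# One unnamed capture group, as in the str.extract call above.
DATE_RE = re.compile(r'(/\d{4}/\d{2}/\d{2}/)')

def extract_date(path):
    """Return the first /YYYY/MM/DD/ segment, or None when there is none."""
    match = DATE_RE.search(path)
    return match.group(1) if match else None

samples = ['sdfsdf/2021/10/22/', '/no-date-here/']
results = [extract_date(s) for s in samples]
print(results)
```

Because the group is unnamed, the result is the matched string itself rather than a dict, so no second unpacking step is needed.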
keramat