1

I am currently rearranging a data frame based on the characters in the first column. I have used functions like the one below to rearrange the data.

df['RegionName'] = df.loc[df.text.str.contains('(', regex=False), 'text'].str.extract(r'(.*?)\s*[\(\[]+.*[\n]*', expand=False)

The issue I am having is that the last step requires choosing the left over data after I have done my initial rearranging. I believe I need an if else statement where the else will allow me to complete the final step. In my attempts I keep getting an error that my Boolean statements are ambiguous. How can I use the code above in an if else statement to complete my task?

Thanks!

Vince Miller
  • 190
  • 1
  • 2
  • 15

1 Answers1

1

It seems you need:

#if need only values where mask is True, else get NaNs
mask = df.text.str.contains('(', regex=False)
df.loc[mask, 'RegionName'] = df.loc[mask, 'text'].str.extract(r'(.*?)\s*[\(\[]+.*[\n]*', 
                                                               expand=False)

Or:

#if need processes values only where mask is True, else get original data
mask = df.text.str.contains('(', regex=False)
df['RegionName'] = df['text'].mask(mask, df['text'].str.extract(r'(.*?)\s*[\(\[]+.*[\n]*', 
                                                             expand=False))

Or:

#if need processes values only if mask is True, else get another value like aaa or df['col']
mask = df.text.str.contains('(', regex=False)
df['RegionName']=np.where(mask,df['text'].str.extract(r'(.*?)\s*[\(\[]+.*[\n]*',expand=0),
                              'aaa')

For better understanding:

df = pd.DataFrame({'text':[' (1', '(', '4', '[7', '{8', '{7', ' [1']})
print (df)
  text
0   (1
1    (
2    4
3   [7
4   {8
5   {7
6   [1

mask1 = df.text.str.contains('(', regex=False)
mask2 = df.text.str.contains('{', regex=False)
mask3 = df.text.str.contains('[', regex=False)

df['d'] = np.where(mask1, 1, 
          np.where(mask2, 3,
          np.where(mask3, 2, 4)))
print (df)
  text  d
0   (1  1
1    (  1
2    4  4
3   [7  2
4   {8  3
5   {7  3
6   [1  2

Another more complicated sample:

df = pd.DataFrame({'text':[' (1', '(', '4', '[ur', '{dFd', '{fGf', ' [io']})
print (df)

mask1 = df.text.str.contains('(', regex=False)
mask2 = df.text.str.contains('{', regex=False)
mask3 = df.text.str.contains('[', regex=False)

df['parsed'] = np.where(mask1, df.text.str.extract(r'(\d+)', expand=False), 
               np.where(mask2, df.text.str.extract(r'([A-Z]+)', expand=False),
               np.where(mask3, df.text.str.extract('([uo])+', expand=False), 4)))
print (df)

   text parsed
0    (1      1
1     (    NaN
2     4      4
3   [ur      u
4  {dFd      F
5  {fGf      G
6   [io      o
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
  • Can you please elaborate further.... what do I need to check mask with to be True? @jezrael – Vince Miller Jun 02 '17 at 13:07
  • mask is condition like `df.text.str.contains('(', regex=False)` - it return `True / False` Series and is is called mask. – jezrael Jun 02 '17 at 13:15
  • Also very nice comment about it is [here](https://stackoverflow.com/questions/44329734/filtering-dataframe-based-on-column-value-counts-pandas#comment75663797_44329751) – jezrael Jun 02 '17 at 13:20
  • Although I don't understand why I realize that this works without using the if/else statement. I left a part out of the question because I wanted it to be clear, and I figured I could figure it out after some direction(this is not the case). The issue is that now I actually need some of the values that return false but not all for example. I need all the items with '[' in one column, and all the items with '(' and neither in one column. The issue is the neither. Its converting the neither in the first one to NaN, and in the second one it keeps the original so its keeping the '[' ones. – Vince Miller Jun 02 '17 at 13:24
  • Ok, try explain more. In pandas process series (simplifying 1d array) and `DataFrame` (simplifying 2d array). So for classic `if (condition): code else code` is not possible use, because it works with arrays. – jezrael Jun 02 '17 at 13:26
  • Ok I have edited my comment, also thank you for the comment on if statements. – Vince Miller Jun 02 '17 at 13:28
  • Yes, now it is more clear. So do youneed precess data by scalars? Something like [this](https://stackoverflow.com/a/39566156/2901002) ? – jezrael Jun 02 '17 at 13:32
  • Or is possible use more `numpy.where` with more masks (conditions) like [here](https://stackoverflow.com/a/40769792/2901002) – jezrael Jun 02 '17 at 13:38
  • I found the solution! Thank you for the help jezrael! After processing the data by scalars I cleaned the data using df.str.split('[').str[0].str.split(' \(').str[0].str.rstrip() My other extract was only operating on the items with '(' – Vince Miller Jun 02 '17 at 14:38
  • Glad can help, nice weekend. – jezrael Jun 02 '17 at 14:46