4

I have a pandas dataframe with a text column.

I'd like to create a new column in which values are conditional on the start of the text string from the text column.

So if the 30 first characters of the text column:

== 'xxx...xxx' then return value 1

== 'yyy...yyy' then return value 2

== 'zzz...zzz' then return value 3

if none of the above return 0
theletz
  • 1,713
  • 2
  • 16
  • 22
Pierre
  • 43
  • 1
  • 6
  • I forgot to mention that the number of characters we want to look at in the text column varies with each condition. So for value 1 we look at the first 30 characters / value 2, 25 characters, value 3, 35 characters – Pierre Feb 14 '17 at 12:25

2 Answers2

8

There is possible use multiple numpy.where but if more conditions use apply:

For select strings from strats use indexing with str.

df = pd.DataFrame({'A':['xxxss','yyyee','zzzswee','sss'],
                   'B':[4,5,6,8]})

print (df)
         A  B
0    xxxss  4
1    yyyee  5
2  zzzswee  6
3      sss  8
#check first 3 values
a = df.A.str[:3]
df['new'] = np.where(a == 'xxx', 1, 
            np.where(a == 'yyy', 2, 
            np.where(a == 'zzz', 3, 0)))

print (df)
         A  B  new
0    xxxss  4    1
1    yyyee  5    2
2  zzzswee  6    3
3      sss  8    0
def f(x):
    #print (x)
    if x == 'xxx':
        return 1
    elif x == 'yyy':
        return 2
    elif x == 'zzz':
        return 3
    else:
        return 0

df['new'] = df.A.str[:3].apply(f)
print (df)
         A  B  new
0    xxxss  4    1
1    yyyee  5    2
2  zzzswee  6    3
3      sss  8    0

EDIT:

If length is different, only need:

df['new'] = np.where(df.A.str[:3] == 'xxx', 1, 
            np.where(df.A.str[:2] == 'yy', 2, 
            np.where(df.A.str[:1] == 'z', 3, 0)))

print (df)
         A  B  new
0    xxxss  4    1
1    yyyee  5    2
2  zzzswee  6    3
3      sss  8    0

EDIT1:

Thanks for idea to Quickbeam2k1 use str.startswith for check starts of each string:

df['new'] = np.where(df.A.str.startswith('xxx'), 1, 
            np.where(df.A.str.startswith('yy'), 2, 
            np.where(df.A.str.startswith('z'), 3, 0)))

print (df)
         A  B  new
0    xxxss  4    1
1    yyyee  5    2
2  zzzswee  6    3
3      sss  8    0
Community
  • 1
  • 1
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
  • thanks - it works but only with a fixed number of characters (in your example 3) but what would you do if the number of characters vary with each condition? – Pierre Feb 14 '17 at 12:27
  • @jezrael, you even might want to use (startswith)[http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.str.startswith.html] – Quickbeam2k1 Feb 14 '17 at 12:33
  • @Quickbeam2k1 - Thank you, I add it to answer. – jezrael Feb 14 '17 at 12:35
  • @jezrael thank you - with your np.where method why do I get value 1 when the text string is NaN? Shouldn't it be 0? – Pierre Feb 14 '17 at 13:27
  • Yes, it is numpy a there is bug. But this work nice `df['new'] = np.where(df.A.str.startswith('xxx', na=False), 1, np.where(df.A.str.startswith('yy', na=False), 2, np.where(df.A.str.startswith('z', na=False), 3, 0)))` – jezrael Feb 14 '17 at 13:30
  • You need parameter `na` set to `False` – jezrael Feb 14 '17 at 13:31
  • @ jezrael, any idea on how to make the solution below more pandaic or numpyesce? – Quickbeam2k1 Feb 14 '17 at 15:34
  • @Quickbeam2k1 - Maybe it can be better with numpy, but I have only basic knowledge about numpy, so I cannot help you unfortunately. But maybe if create question in site codereview some numpy guru can help. – jezrael Feb 14 '17 at 16:11
0

A different and slower solution: However, the advantage is that the mapping from patterns is a function parameter (with implicit default 0 value)

def map_starts_with(pat_map):
    def map_string(t):
        pats = [pat for pat in pat_map.keys() if t.startswith(pat)]
        return pat_map.get(pats[0]) if len(pats) > 0 else 0 
# get only value of "first" pattern if at least one pattern is found
    return map_string

df = pd.DataFrame({'col':[ 'xx', 'aaaaaa', 'c']})
      col
0      xx
1  aaaaaa
2       c

mapping = { 'aaa':4 ,'c':3}
df.col.apply(lambda x: map_starts_with(mapping)(x))

0    0
1    4
2    3

Note the we also used currying here. I'm wondering if this approach can be implemented using additional pandas or numpy functionality.

Note that the "first" pattern match may depend on the traversal order of the dict keys. This is irrelephant if there is no overlap in the keys. (Jezrael's solution, or its direct generalization thereof, will also choose one element for the match, but in a more predictable manner)

Quickbeam2k1
  • 5,287
  • 2
  • 26
  • 42
  • just wanted to share some improvements on my answer from [code review](http://codereview.stackexchange.com/questions/155405/curried-function) – Quickbeam2k1 Mar 02 '17 at 08:29