4

I want to extend the question asked here

The solutions to the above question return True or False, and those boolean values can be used to subset the matching values.

However, I want to get the search term that matched in each string.

For example (borrowing from the above question):

s = pd.Series(['cat','hat','dog','fog','pet'])
searchfor = ['og', 'at']

I want to know that 'cat' matched 'at' and 'dog' matched 'og'.

cs95
Sharvari Gc

2 Answers

5

IIUC, you want the values to reflect the index of the item in the searchfor list that matched your word. You can start by modifying your searchfor object -

m = {'^.*{}.*$'.format(s) : str(i) for i, s in enumerate(searchfor)}

This is a dictionary of <pattern : index> mappings. Now, call pd.Series.replace with regex=True -

s = s.replace(m, regex=True)
s[:] = np.where(s.str.isdigit(), pd.to_numeric(s, errors='coerce'), -1)

s

0    1
1    1
2    0
3    0
4   -1
dtype: int64
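
For reference, here is a self-contained sketch of the same idea. Two things in it are assumptions, not part of the steps above: the search terms are passed through re.escape in case they contain regex metacharacters, and the numeric conversion uses pd.to_numeric + fillna in place of np.where. The names replaced and idx are only illustrative.

```python
import re
import pandas as pd

s = pd.Series(['cat', 'hat', 'dog', 'fog', 'pet'])
searchfor = ['og', 'at']

# Map "whole string containing the term" -> position of the term in searchfor.
# re.escape is an added precaution (assumption) for terms such as 'a.b' or 'c+'.
m = {'^.*{}.*$'.format(re.escape(term)): str(i)
     for i, term in enumerate(searchfor)}

# Strings that match a pattern are replaced by the index (as text);
# strings that match nothing are left unchanged.
replaced = s.replace(m, regex=True)

# Anything that is not a number after the replace did not match: mark it -1.
idx = pd.to_numeric(replaced, errors='coerce').fillna(-1).astype(int)
print(idx.tolist())
```

Like the original, this treats a string that is itself numeric as if it were a match index, so it is only safe when the data contains no purely numeric strings.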

If you want a list of matched values by pattern, you'll need str.extract + groupby + apply. Run this on the original s, before the replace above overwrites it -

p = '(^.*({}).*$)'.format('|'.join(searchfor))

s.str.extract(p, expand=True)\
 .groupby([1])[0]\
 .apply(list)

1
at    [cat, hat]
og    [dog, fog]
Name: 0, dtype: object
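
If all you need is the search term that matched each element, a single capture group should be enough. This is a minimal sketch of that variant, assuming at most one term is wanted per string (the first match wins) and that escaping the terms with re.escape is acceptable. The names pattern, matched and by_term are only illustrative.

```python
import re
import pandas as pd

s = pd.Series(['cat', 'hat', 'dog', 'fog', 'pet'])
searchfor = ['og', 'at']

# One capture group holding the alternation of all (escaped) search terms.
pattern = '({})'.format('|'.join(re.escape(term) for term in searchfor))

# expand=False returns a Series: the matched term, or NaN when nothing matched.
matched = s.str.extract(pattern, expand=False)

# Group the original strings by the term they matched; NaN keys are dropped.
by_term = s.groupby(matched).apply(list)

print(matched.tolist())
print(by_term.to_dict())
```

Here matched lines up with s element by element, so 'cat' pairs with 'at' and 'dog' with 'og', while 'pet' gets NaN.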
cs95
  • Thanks. That worked. However, I realized what I actually want is to return all the matched strings as comma separated values. I will ask another question for it. – Sharvari Gc Feb 05 '18 at 02:55
  • @SharvariGc No, don't worry about it. I'll edit my answer. Edit: Done, see my latest edit. – cs95 Feb 05 '18 at 03:11
  • Thanks. How would I get ['og,at'] for the first element of the series, s = pandas.Series(['cat dog','hat cat','dog','fog cat','pet']) while searching for searchfor = ['og', 'at'] – Sharvari Gc Feb 05 '18 at 03:44
  • It is really hard to think outside the np.where :-) – BENY Feb 05 '18 at 03:51
  • @SharvariGc Ah... that's quite hard. Can you open another q? – cs95 Feb 05 '18 at 03:53
  • asked here [link](https://stackoverflow.com/questions/48615699/pandas-return-all-matched-keys-for-each-strings-value-in-a-series) @COLDSPEED – Sharvari Gc Feb 05 '18 at 04:03
2

This uses defaultdict + replace. I finally got it to work:

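import collections

# Step 1: strip every search term from each string, leaving the unmatched remainder.
d = dict.fromkeys(searchfor, '')
s1 = s.replace(d, regex=True)

# Step 2: for each row, map its remainder to '' so that removing it leaves only the matched term.
d = collections.defaultdict(dict)
for x, y in zip(s1.index, s1):
    d[x][y] = ''

# Step 3: transpose so each row becomes a column, apply the per-column replacements, transpose back.
s.to_frame('a').T.replace(dict(d), regex=True).T.a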


Out[765]: 
0    at
1    at
2    og
3    og
4      
Name: a, dtype: object
BENY