In Python, how do I check whether a string is an element of a list of strings?

The example data I am working with is:

import pandas as pd
testData = pd.DataFrame({'value': ['abc', 'cde', 'fgh']})

Why, then, is the result of the following code False?

testData['value'][0] in testData['value']
EdChum
cone001
  • Sorry, in your sample df the data will be stored as a Series containing individual strings, but is your real df data really a list of strings for each row? That is fundamentally different – EdChum Oct 28 '16 at 14:11
  • @EdChum's answer is a good one. To fix your original error, you simply need to check the values of `testData['value']`, so your last line becomes `testData['value'][0] in testData['value'].values` and you will get True – A.Kot Oct 28 '16 at 14:14
  • @EdChum, I guess the example data is a more accurate description of my problem. The fundamental difference you mentioned might be the thing I overlooked. – cone001 Oct 28 '16 at 14:16
  • Actually, I can't explain `testData['value'][0] in testData['value']`: somehow, when the scalar value is on the LHS, it's able to evaluate the `Series` array into a scalar boolean, which is weird – EdChum Oct 28 '16 at 14:19
  • I found the answer to your last question – EdChum Oct 28 '16 at 14:28
  • @EdChum What's the confusion? Changing `testData['value']` to `testData['value'].values` corrects the error – A.Kot Oct 28 '16 at 15:03
  • @A.Kot The confusion is why `testData['value'][0] in testData['value']` gives `False`, not why `testData['value'][0] in testData['value'].values` works. The latter uses `np.ndarray.__contains__`, which does what you expect, whilst for a pandas `Series` you're checking for membership of the index; see the bottom of my answer – EdChum Oct 28 '16 at 15:05
  • @EdChum Oh yeah makes sense. – A.Kot Oct 28 '16 at 15:10

1 Answer

You can use the vectorised str.contains to test whether a string is contained in each row (note that this is a substring match, and the pattern is treated as a regular expression by default):

In [262]:
testData['value'].str.contains(testData['value'][0])

Out[262]:
0     True
1    False
2    False
Name: value, dtype: bool

If you only need to know whether it's present in any row, then use any:

In [264]:
testData['value'].str.contains(testData['value'][0]).any()

Out[264]:
True
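
Because str.contains matches substrings, it can report a hit where an exact element match would not. The sketch below contrasts it with a few ways of testing exact membership; the variable names and the 'ab' probe string are illustrative assumptions, not part of the original question:

```python
import pandas as pd

# Same sample data as in the question (renamed here for illustration).
test_data = pd.DataFrame({'value': ['abc', 'cde', 'fgh']})
values = test_data['value']

# Substring match: 'ab' occurs inside 'abc', so this is True.
# regex=False treats the pattern as a literal string.
substring_hit = bool(values.str.contains('ab', regex=False).any())

# Exact element match: no row is exactly 'ab', so this is False.
exact_hit = bool(values.isin(['ab']).any())

# Other ways to test whether 'abc' is exactly one of the elements.
eq_hit = bool((values == 'abc').any())
in_array = 'abc' in values.values    # membership test on the underlying array
in_list = 'abc' in values.tolist()   # membership test on a plain Python list

print(substring_hit, exact_hit, eq_hit, in_array, in_list)
```

If the goal is "is this string one of the elements", isin or a comparison with == is probably closer to the intent than str.contains.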

OK, to address your last question:

In [270]:
testData['value'][0] in testData['value']

Out[270]:
False

This is because pd.Series.__contains__ is implemented as follows:

def __contains__(self, key):
    """True if the key is in the info axis"""
    return key in self._info_axis

If we look at what _info_axis actually is:

In [269]:
testData['value']._info_axis

Out[269]:
RangeIndex(start=0, stop=3, step=1)

So when we evaluate 'abc' in testData['value'], we're really testing whether 'abc' is in the index, which is why it returns False.

Example:

In [271]:
testData=pd.DataFrame({'value':['abc','cde','fgh']}, index=[0, 'turkey',2])
testData

Out[271]:
       value
0        abc
turkey   cde
2        fgh

In [272]:
'turkey' in testData['value']

Out[272]:
True

We can see that it returns True now, because we're testing whether 'turkey' is present in the index.
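
As a minimal sketch of the same point, the snippet below puts the index check and the values check side by side. It builds the Series directly rather than through a DataFrame, and the variable names are assumptions made for illustration:

```python
import pandas as pd

# Series with the same mixed index as the example above.
s = pd.Series(['abc', 'cde', 'fgh'], index=[0, 'turkey', 2], name='value')

# `in` on a Series looks at the index labels, much like `in` on a dict
# looks at the keys rather than the values.
label_in_series = 'turkey' in s          # 'turkey' is an index label
value_in_series = 'abc' in s             # 'abc' is data, not a label

# `in` on the underlying array looks at the data itself.
label_in_values = 'turkey' in s.values
value_in_values = 'abc' in s.values

# The Series check agrees with an explicit check against the index.
matches_index_check = ('turkey' in s) == ('turkey' in s.index)

print(label_in_series, value_in_series, label_in_values, value_in_values)
```

To test the data rather than the labels, check against s.values (or use isin) instead of the Series itself.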

EdChum