Python pandas check if the last element of a list in a cell contains specific string

Question

My DataFrame df:

index                        url
1           [{'url': 'http://bhandarkarscollegekdp.org/'}]
2             [{'url': 'http://cateringinyourhome.com/'}]
3                                                     NaN
4                  [{'url': 'http://muddyjunction.com/'}]
5                       [{'url': 'http://ecskouhou.jp/'}]
6                     [{'url': 'http://andersrice.com/'}]
7       [{'url': 'http://durager.cz/'}, {'url': 'http:andersrice.com'}]
8            [{'url': 'http://milenijum-osiguranje.rs/'}]
9       [{'url': 'http://form-kind.org/'}, {'url': 'https://osiguranje'},{'url': 'http://beseka.com.tr'}]

I would like to select the rows where the last item of the list in the url column contains 'https', while skipping missing values.

My current script

df[df['url'].str[-1].str.contains('https',na=False)]

returns False for every row, even though some of them actually contain https.

Can anybody help with it?

as your dtype is list you'd have to use `apply`: `df['url'].apply(lambda x: 'https' in x[-1])` — EdChum, Oct 03 '16 at 12:21
@EdChum I have tried that. It does not work either. TypeError: 'float' object is not subscriptable — UserYmY, Oct 03 '16 at 12:23
that means you have missing values so you need to drop them first: `df['url'].dropna().apply(lambda x: 'https' in x[-1])` — EdChum, Oct 03 '16 at 12:23
try with axis=1 such as `df['url'].apply(lambda x: "https" in x[-1], axis=1)` — Steven G, Oct 03 '16 at 12:23
@EdChum it works, but it gives False for all the rows although some of them have https. My own script has the same problem as well. — UserYmY, Oct 03 '16 at 12:25
try `df['url'].dropna().apply(lambda x: 'https' in x[-1]['url'])` — EdChum, Oct 03 '16 at 12:27
@EdChum This works Thanks. Can you post it as an answer? Also can I modify it in a way to get columns that satisfy the condition only? — UserYmY, Oct 03 '16 at 12:28
You want `df[df['url'].dropna().apply(lambda x: 'https' in x[-1]['url'])]` I think — EdChum, Oct 03 '16 at 12:47
@EdChum no, that does not work, I have already tried it. IndexingError: Unalignable boolean Series key provided — UserYmY, Oct 03 '16 at 12:48
I think the problem here is that by dropping the null rows the series returned will be a different length so you can either fill those using `fillna` or maybe `df[df['url'].apply(lambda x: pd.notnull(x) and 'https' in x[-1]['url'])]` will work — EdChum, Oct 03 '16 at 12:50
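
One caveat on the last comment: `pd.notnull` applied to a list returns an element-wise array rather than a single boolean, so combining it with `and` can raise a ValueError for cells that hold more than one item. Below is a minimal sketch of the same idea with an `isinstance` guard instead. The helper name `last_url_has_https` and the sample frame are illustrative assumptions, not part of the original thread. Because the mask is computed on the full column, it keeps the original index and so avoids the IndexingError.

```python
import numpy as np
import pandas as pd

# Small frame modelled on the data in this thread (illustrative subset).
df = pd.DataFrame(
    {"url": [
        [{"url": "http://muddyjunction.com/"}],
        np.nan,
        [{"url": "http://durager.cz/"}, {"url": "https:andersrice.com"}],
        [{"url": "http://form-kind.org/"},
         {"url": "https://osiguranje"},
         {"url": "http://beseka.com.tr"}],
    ]},
    index=[1, 2, 3, 4],
)


def last_url_has_https(cell):
    """Hypothetical helper: True if the last dict in the list has 'https' in its 'url'."""
    # NaN (a float) and empty lists are treated as "no match".
    if not isinstance(cell, list) or not cell:
        return False
    return "https" in cell[-1].get("url", "")


# The mask has the same index as df, so boolean indexing aligns.
mask = df["url"].apply(last_url_has_https)
selected = df[mask]
print(selected)
```

Row 4 is not selected here: it contains an https link, but not as its last item.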

jezrael · Accepted Answer · 2016-10-03T12:50:06.757

I think you can first replace NaN with an empty url and then use apply:

import numpy as np
import pandas as pd

df = pd.DataFrame({'url': [[{'url': 'http://bhandarkarscollegekdp.org/'}],
                           np.nan,
                           [{'url': 'http://cateringinyourhome.com/'}],
                           [{'url': 'http://durager.cz/'}, {'url': 'https:andersrice.com'}]]},
                  index=[1, 2, 3, 4])

print (df)
                                                 url
1     [{'url': 'http://bhandarkarscollegekdp.org/'}]
2                                                NaN
3        [{'url': 'http://cateringinyourhome.com/'}]
4  [{'url': 'http://durager.cz/'}, {'url': 'https...

df.loc[df.url.isnull(), 'url'] = [[{'url':''}]]
print (df)
                                                 url
1     [{'url': 'http://bhandarkarscollegekdp.org/'}]
2                                      [{'url': ''}]
3        [{'url': 'http://cateringinyourhome.com/'}]
4  [{'url': 'http://durager.cz/'}, {'url': 'https...

print (df.url.apply(lambda x: 'https' in x[-1]['url']))
1    False
2    False
3    False
4     True
Name: url, dtype: bool
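
The `.loc` assignment above depends on how pandas interprets a nested list on the right-hand side, which is worth checking on your own version, especially when several rows are missing. As a hedged alternative, the sketch below builds the filled column with a list comprehension, so every missing cell gets its own placeholder list. The name `filled` and the second NaN row are assumptions added for illustration.

```python
import numpy as np
import pandas as pd

# Same shape of data as in the answer, plus a second missing row.
df = pd.DataFrame(
    {"url": [
        [{"url": "http://bhandarkarscollegekdp.org/"}],
        np.nan,
        [{"url": "http://cateringinyourhome.com/"}],
        [{"url": "http://durager.cz/"}, {"url": "https:andersrice.com"}],
        np.nan,
    ]},
    index=[1, 2, 3, 4, 5],
)

# Keep real lists as they are; replace anything else (NaN) with a placeholder.
filled = pd.Series(
    [cell if isinstance(cell, list) else [{"url": ""}] for cell in df["url"]],
    index=df.index,
)

# Every cell is now a non-empty list, so indexing the last item is safe.
has_https = filled.apply(lambda x: "https" in x[-1]["url"])
print(df[has_https])
```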

First solution (apply only to the non-null rows, then fill the rest with False):

df.loc[df.url.notnull(), 'a'] = df.loc[df.url.notnull(), 'url'].apply(lambda x: 'https' in x[-1]['url'])

df['a'] = df['a'].fillna(False)
print (df)
                                                 url      a
1     [{'url': 'http://bhandarkarscollegekdp.org/'}]  False
2                                                NaN  False
3        [{'url': 'http://cateringinyourhome.com/'}]  False
4  [{'url': 'http://durager.cz/'}, {'url': 'https...   True
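
The attempt in the question returns False everywhere because `.str[-1]` yields a dict, and `.str.contains` on a non-string value gives NaN, which `na=False` then turns into False. A sketch of a version that stays with the `.str` accessor is shown below. It assumes the accessor works element-wise on lists and dicts in an object column, and the intermediate names `last_item` and `last_url` are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"url": [
        [{"url": "http://bhandarkarscollegekdp.org/"}],
        np.nan,
        [{"url": "http://cateringinyourhome.com/"}],
        [{"url": "http://durager.cz/"}, {"url": "https:andersrice.com"}],
    ]},
    index=[1, 2, 3, 4],
)

# Last element of each list; NaN cells stay NaN.
last_item = df["url"].str[-1]

# Pull the value stored under the 'url' key of each dict.
last_url = last_item.str.get("url")

# Now the values are real strings, so contains() can match; NaN becomes False.
mask = last_url.str.contains("https", na=False)
print(df[mask])
```

The extra `.str.get("url")` step is the only difference from the original one-liner.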

score 0 · Answer 2 · answered Oct 03 '16 at 12:34

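I'm not sure whether url holds str or some other type, so convert it to str first.

You can do it like this (note that this tests only the last row of df, and it matches 'https' anywhere in that cell, not just in its last list item):

"https" in str(df.url.iloc[-1])

or

str(df.iloc[-1].url).__contains__("https")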

Howardyan
