remove URL id from URL in pandas column to leave the base url without id

Question

I am trying to remove the IDs from URLs so that they can be counted in reporting. With the IDs included, each URL is counted as unique when it is not, i.e. there are thousands of distinct URLs instead of tens.

so I would like to take a URL like this

https://www.website.co.uk/page/home-page/id93847562

and cut off the ID so it is like this

https://www.website.co.uk/page/home-page/

The length of the URL varies, so I cannot cut a fixed number of characters from the start or end, or rely on a set number of forward slashes.

I am trying to change the URLs in a column of a pandas DataFrame.

The closest to an answer I could find here was this: extract id from the URL using Python

but I haven't been able to translate it to my scenario.

here's my code

df.loc[df['URL'].str.contains('id'),'URL' = 'URL'[:id]

What I've tried to write is: if the URL string contains 'id', replace it with the URL from the start up to 'id'.

the error I get is:

File "<ipython-input-18-42dc8b2df1ff>", line 3
    df.loc[df['URL'].str.contains('id'),'URL' = 'URL'[:id]
                                              ^
SyntaxError: invalid syntax

any ideas what I can do to make it work?

thank you in advance for any help and advice

I should mention the datatype is 'object' – Mizz H Sep 27 '20 at 11:44
So the `id` always comes at the end of the URL? – Shubham Sharma Sep 27 '20 at 11:44
Yes, but it can be different lengths – Mizz H Sep 27 '20 at 11:45

score 2 · Accepted Answer · answered Sep 27 '20 at 11:49

You can use str.replace

df['url'] = df['url'].str.replace(r'/id.*', '/', regex=True)

Output:

                                         url
0  https://www.website.co.uk/page/home-page/

deadshot

perfect deadshot thank you. worked first attempt and is in a format I can use to do similar edits. – Mizz H Sep 27 '20 at 12:02
which comma are you talking about? this may help [str.replace](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.replace.html) – deadshot Sep 27 '20 at 12:14

score 1 · Answer 2 · answered Sep 27 '20 at 11:47

You can do rsplit with optional parameter n=1 to limit the number of splits:

df['URL'] = df['URL'].str.rsplit('/', n=1).str[0]

0    https://www.website.co.uk/page/home-page
Name: URL, dtype: object

Shubham Sharma

how does it know which forward slash to use when there are many? – Mizz H Sep 27 '20 at 12:05
@MizzH That's why we are using `rsplit` instead of `split` as it splits the string in the Series from the end.. – Shubham Sharma Sep 27 '20 at 12:07
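
To illustrate the difference on a single plain Python string (a small sketch, not part of the original answer): with a limit of one split, `split` cuts at the first `/` and `rsplit` cuts at the last one.

```python
url = "https://www.website.co.uk/page/home-page/id93847562"

# split works from the left: with maxsplit=1 it cuts at the first "/".
left = url.split("/", 1)

# rsplit works from the right: with maxsplit=1 it cuts at the last "/".
right = url.rsplit("/", 1)

# Re-append the slash if the trailing "/" from the question is wanted.
base = right[0] + "/"
```

Keep in mind that this drops the last path segment of every URL, whether or not it is an ID, so it only fits if every row ends in one.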

score 1 · Answer 3 · answered Sep 27 '20 at 11:52

Are the IDs always preceded by a forward slash and at the end of the string? The following code works for me (under those assumptions). I also added a restriction that "id" must be followed by 2-10 digits, but you can of course edit that to fit your use case. Good luck! :)

import re
import pandas as pd

df = pd.DataFrame({"url": ["https://www.website.co.uk/page/home-page/id93847562"]})
df["url"] = df["url"].map(lambda x: re.sub(r"/id[0-9]{2,10}$", "/", x))
df

SOwla

df["URL"] = df["URL"].map(lambda x: re.sub(r"/id[0-9]{2,10}$", "/", x)) but it gives: TypeError: expected string or bytes-like object – Mizz H Sep 27 '20 at 11:59
strange.. it works in my notebook, but I'm new to python, so I'm not sure where our differences might've come from.. Glad the other answers worked (I didn't see them before submitting mine)! – SOwla Sep 27 '20 at 12:05
could be me, I'm new too! – Mizz H Sep 27 '20 at 12:13
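
One possible cause of the TypeError above (an assumption, since the asker's data is not shown): `re.sub` only accepts strings, and an `object` column can hold non-string values such as `NaN`. The `.str` methods used in the other answers pass missing values through, which may be why those worked. A minimal sketch that skips non-string entries, using a hypothetical helper named `strip_id`:

```python
import re

import numpy as np
import pandas as pd

# Hypothetical data: the URL from the question plus a missing value.
df = pd.DataFrame({
    "URL": ["https://www.website.co.uk/page/home-page/id93847562", np.nan]
})


def strip_id(value):
    # Leave non-string entries (e.g. NaN) untouched instead of passing them to re.sub.
    if not isinstance(value, str):
        return value
    return re.sub(r"/id[0-9]{2,10}$", "/", value)


df["URL"] = df["URL"].map(strip_id)
```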