1

I am trying to remove the IDs from URLs so that they can be counted in reporting. With the IDs included, they are counted as unique URLs when they are not, i.e. there are thousands instead of tens.

so I would like to take a URL like this

https://www.website.co.uk/page/home-page/id93847562

and cut off the ID so it is like this

https://www.website.co.uk/page/home-page/

The length of the URL varies, so I cannot cut a fixed number of characters from the start or end, or rely on a set number of forward slashes.

I am trying to change the URLs in a column of a pandas DataFrame.

The closest thing to an answer I could find here was this: extract id from the URL using Python

but I haven't been able to translate it to my scenario.

Here's my code:

df.loc[df['URL'].str.contains('id'),'URL' = 'URL'[:id]

What I tried to write is: if the URL string contains 'id', replace it with the URL from the start up to the id.

the error I get is:

File "<ipython-input-18-42dc8b2df1ff>", line 3
    df.loc[df['URL'].str.contains('id'),'URL' = 'URL'[:id]
                                              ^
SyntaxError: invalid syntax

any ideas what I can do to make it work?

thank you in advance for any help and advice

Mizz H

3 Answers

2

You can use str.replace with a regular expression:

df['url'] = df['url'].str.replace(r'/id.*', '/', regex=True)

Output:

                                         url
0  https://www.website.co.uk/page/home-page/
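One caveat with a pattern like `/id.*` is that it also matches any path segment that merely starts with "id" (for example `/ideas/`), and cuts everything after it. A minimal sketch of a stricter variant follows, assuming the ID is always the last path segment and consists of "id" followed by digits; the sample URLs are made up for illustration.

```python
import pandas as pd

# Hypothetical sample data: one URL with a trailing ID, one without,
# and one where "id" also appears at the start of an earlier segment.
df = pd.DataFrame({"url": [
    "https://www.website.co.uk/page/home-page/id93847562",
    "https://www.website.co.uk/page/about-us/",
    "https://www.website.co.uk/ideas/list/id12",
]})

# Only strip a final segment of the form "id<digits>"; "$" anchors the
# match to the end of the string so "/ideas/" is left alone.
df["url"] = df["url"].str.replace(r"/id\d+$", "/", regex=True)

print(df["url"].tolist())
```

If the IDs can contain letters as well, the `\d+` part would need to be loosened accordingly.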
deadshot
  • 1
    perfect deadshot thank you. worked first attempt and is in a format I can use to do similar edits. – Mizz H Sep 27 '20 at 12:02
  • 1
    which comma are you talking about? this may help [str.replace](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.replace.html) – deadshot Sep 27 '20 at 12:14
1

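You can use rsplit with the optional parameter n=1 to limit the number of splits. Note that this removes the last path segment whatever it contains, and the result has no trailing slash: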

df['URL'] = df['URL'].str.rsplit('/', n=1).str[0]

0    https://www.website.co.uk/page/home-page
Name: URL, dtype: object
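Because rsplit removes the last segment on every row, URLs that have no ID would lose a real path component. The sketch below restricts the change to rows whose last segment looks like an ID, using a boolean mask with `.loc` (which is roughly what the code in the question was aiming for), and then counts the cleaned URLs. The sample data and the "id followed by digits" rule are assumptions for illustration.

```python
import pandas as pd

# Hypothetical sample data: two IDs for the same page, plus a URL without an ID.
df = pd.DataFrame({"URL": [
    "https://www.website.co.uk/page/home-page/id93847562",
    "https://www.website.co.uk/page/home-page/id11112222",
    "https://www.website.co.uk/page/contact",
]})

# Boolean mask: True where the last path segment is "id" followed by digits.
# na=False treats missing values as "no match".
has_id = df["URL"].str.contains(r"/id\d+$", regex=True, na=False)

# Drop the last segment on those rows only, then restore the trailing slash.
df.loc[has_id, "URL"] = df.loc[has_id, "URL"].str.rsplit("/", n=1).str[0] + "/"

# The two home-page rows now collapse into a single URL for reporting.
counts = df["URL"].value_counts()
print(counts)
```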
Shubham Sharma
1

Are the IDs always preceded by a forward slash and at the end of the string? The following code works for me under those assumptions. I also added a restriction that "id" must be followed by 2-10 digits, but you can of course edit that to fit your use case. Good luck! :)

import re
import pandas as pd

df = pd.DataFrame({"url": ["https://www.website.co.uk/page/home-page/id93847562"]})
df["url"] = df["url"].map(lambda x: re.sub(r"/id[0-9]{2,10}$", "/", x))
df
SOwla
  • df["URL"] = df["URL"].map(lambda x: re.sub(r"/id[0-9]{2,10}$", "/", x)) but it gives: TypeError: expected string or bytes-like object – Mizz H Sep 27 '20 at 11:59
  • strange.. it works in my notebook, but I'm new to python, so I'm not sure where our differences might've come from.. Glad the other answers worked (I didn't see them before submitting mine)! – SOwla Sep 27 '20 at 12:05
  • could be me, I'm new too! – Mizz H Sep 27 '20 at 12:13
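One plausible cause of the TypeError reported in the comments, though the thread does not confirm it, is that the real column contains non-string values such as missing entries (NaN or None). `re.sub` only accepts strings, so mapping it over such a value fails. The sketch below uses a made-up column with a missing value and shows two ways around it: guarding the lambda with `isinstance`, or using the vectorised `.str.replace`, which passes missing values through.

```python
import re
import pandas as pd

# Hypothetical column with a missing value alongside a normal URL.
df = pd.DataFrame({"URL": [
    "https://www.website.co.uk/page/home-page/id93847562",
    None,
]})

pattern = r"/id[0-9]{2,10}$"

# Option 1: only call re.sub on real strings; leave everything else untouched.
cleaned_map = df["URL"].map(
    lambda x: re.sub(pattern, "/", x) if isinstance(x, str) else x
)

# Option 2: the .str accessor skips missing values without raising.
cleaned_str = df["URL"].str.replace(pattern, "/", regex=True)

print(cleaned_map.tolist())
print(cleaned_str.tolist())
```

Checking `df["URL"].map(type).value_counts()` on the real data would show whether non-string values are actually present.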