
I've tried several regular expressions to remove links from tweets, but none of them seem to work.

Original link example : https:/ /t.co/WfWWOukD9l/

How it appears after cleaning: httpstcowfwwoukd9l

The entire function:

def cleantext(text):
    text = re.sub(r'@[A-Za-z0-9]+', '', text) 
    text = re.sub(r'[^0-9A-Za-z \t]+', '', text)
    text = re.sub(r'#', '', text)
    text = re.sub(r'RT[\s]+', '', text)
    text = re.sub(r'https?://\S+', '', text)
    text = re.sub(r'(<a href[\s\S]*?>[\s\S]*?)|(\b(http|https):\/\/.*[^ alt]\b)', '', text)
    text = re.sub(r'http[s]?:\/\/\S+', '', text)
    text = text.lower() 

    return text

Everything else in the text is cleaned; only the links remain. I'm using Python 3.10.9 and regex 2022.10.31.

Code I used:

    text = re.sub(r'https?://\S+', '', text)
    text = re.sub(r'(<a href[\s\S]*?>[\s\S]*?)|(\b(http|https):\/\/.*[^ alt]\b)', '', text)
    text = re.sub(r'http[s]?:\/\/\S+', '', text) 
  • Please make sure in the future to make your minimal reproducible example (or examples) actually reproducible. Running this function on `"https://t.co(something)"` returns `"httpstcosomething"`, not `"httpstcowfwwoukd9l"`. We could be dumb and wonder where the hell `"wfwwoukd9l"` came from and where `"something"` went, completely missing the point of your question. Thankfully this one is simple; next time something like this might get your question closed. – Amadan Mar 14 '23 at 17:51
  • Thank you for pointing this out for me! It all works now. – user21385645 Mar 14 '23 at 18:10

2 Answers


The issue is that you replace various special characters before you replace the link. By the time you reach the link replace, your string does not contain : or /, so http[s]?:\/\/\S+ cannot match. Move it to the start of the function, so that the link is intact before you try to match it.
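
A minimal sketch of the reordering described above, assuming the rest of the question's substitutions stay as they are and a single link pattern is enough. The sample tweet is made up for illustration:

```python
import re

def cleantext(text):
    # Remove links first, while ':' and '/' are still in the string.
    text = re.sub(r'https?://\S+', '', text)
    # The remaining steps are the ones from the question, in their original order.
    text = re.sub(r'@[A-Za-z0-9]+', '', text)      # mentions
    text = re.sub(r'[^0-9A-Za-z \t]+', '', text)   # other special characters
    text = re.sub(r'#', '', text)                  # hashtags (already covered by the line above)
    text = re.sub(r'RT[\s]+', '', text)            # retweet marker
    return text.lower()

# Hypothetical sample input, not taken from the question's data.
print(cleantext("RT @user Check this https://t.co/WfWWOukD9l/ #python"))
```

Removing tokens can leave doubled spaces behind; if that matters, collapsing whitespace at the end (for example with `' '.join(text.split())`) is one option.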

Also, depending on what you want (which I can't see, on account of a bad example), \S+ might or might not be matching more than you want. If so, you will have to change it into something more restrictive, like [^\s()]+.
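
To illustrate the difference, here is a small comparison on a made-up string where the link sits inside parentheses. The variable names are only for this sketch:

```python
import re

# Hypothetical input: a link wrapped in parentheses.
text = "see (https://t.co/abc123) for details"

# \S+ runs up to the next whitespace, so it also eats the closing parenthesis.
greedy = re.sub(r'https?://\S+', '', text)

# [^\s()]+ stops at whitespace or a parenthesis, leaving both parentheses alone.
strict = re.sub(r'https?://[^\s()]+', '', text)

print(repr(greedy))  # 'see ( for details'
print(repr(strict))  # 'see () for details'
```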

Finally, as a note, [s]? is equivalent to s? - the brackets here are unnecessary.
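
A quick check of that equivalence on a few made-up URLs (the `samples` list is purely illustrative). In Python the forward slash also does not need escaping, so the shorter pattern drops the backslashes as well:

```python
import re

# Hypothetical sample strings for comparing the two spellings of the pattern.
samples = [
    "http://a.example/x tail",
    "https://a.example/x tail",
    "no link here",
]

bracketed = [re.sub(r'http[s]?:\/\/\S+', '', s) for s in samples]
plain = [re.sub(r'https?://\S+', '', s) for s in samples]

# Both patterns produce identical results on every sample.
print(bracketed == plain)
```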

Amadan

If the characters you want to remove are always the same, you can simply strip the first x characters from the string.

linuxchr