I've tried several regex patterns to remove links from tweets, but none of them seem to work.
Original link example: https://t.co/WfWWOukD9l/
How it appears after cleaning: httpstcowfwwoukd9l
The entire function:
import re

def cleantext(text):
    text = re.sub(r'@[A-Za-z0-9]+', '', text)
    text = re.sub(r'[^0-9A-Za-z \t]+', '', text)
    text = re.sub(r'#', '', text)
    text = re.sub(r'RT[\s]+', '', text)
    text = re.sub(r'https?://\S+', '', text)
    text = re.sub(r'(<a href[\s\S]*?>[\s\S]*?)|(\b(http|https):\/\/.*[^ alt]\b)', '', text)
    text = re.sub(r'http[s]?:\/\/\S+', '', text)
    text = text.lower()
    return text
Everything except the links is cleaned from the text. I'm using Python 3.10.9 with regex 2022.10.31.
The lines I used to remove links:
text = re.sub(r'https?://\S+', '', text)
text = re.sub(r'(<a href[\s\S]*?>[\s\S]*?)|(\b(http|https):\/\/.*[^ alt]\b)', '', text)
text = re.sub(r'http[s]?:\/\/\S+', '', text)
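Judging from the output shown above, the likely cause is the order of the substitutions: the `[^0-9A-Za-z \t]+` step runs before the link patterns and strips `:`, `/`, and `.`, so by the time the link patterns run there is no `://` left to match. Below is a minimal sketch with the link removal moved first. The name `clean_text`, the underscore added to the mention pattern, the `\b` before `RT`, and the final whitespace collapse are my own assumptions, not part of the original function.

```python
import re

def clean_text(text: str) -> str:
    # Remove links first, while ':', '/' and '.' are still present to match on.
    text = re.sub(r'https?://\S+', '', text)
    # Remove the retweet marker and @mentions (underscore added as an assumption).
    text = re.sub(r'\bRT\s+', '', text)
    text = re.sub(r'@[A-Za-z0-9_]+', '', text)
    # Strip everything that is not a letter, digit, space, or tab.
    # This also drops '#', so a separate '#' substitution is not needed.
    text = re.sub(r'[^0-9A-Za-z \t]+', '', text)
    # Collapse leftover runs of whitespace (an extra step, not in the original).
    text = re.sub(r'\s+', ' ', text).strip()
    return text.lower()

print(clean_text("RT @user: Check this https://t.co/WfWWOukD9l/ #python"))
```

With this ordering a single link pattern should be enough; the two additional link substitutions in the original function are left out of the sketch for that reason.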