Cannot extract URLs from a text file

Question

I'm trying to download an online text file and extract all URLs from its contents. Everything works except the URL extraction step: no URLs are extracted. The same process works on a local file. What is wrong?

COMMAND

import requests
import re
from io import StringIO

link = "https://pastebin.com/raw/B8QauiXU"
urls = requests.get(link)

with open(urls.text) as file, io.StringIO() as output:
    for line in file:
        urls = re.findall('https?://[^\s<>"]+[|www\.^\s<>"]+', line)
        print(*urls, file=output)

urls = output.getvalue()

print(urls)

OUTPUT

https://google.com and https://bing.com are both the two largest search engines in the world. They are followed by https://duckduckgo.com.

Maurice Meyer · Accepted Answer · 2022-02-02T12:37:04.193

2

Making your regular expression a raw string and searching the response text directly works fine:

import re
from io import StringIO

import requests

with StringIO() as output:
    link = "https://pastebin.com/raw/B8QauiXU"
    # The response body is already a string, so search it directly
    data = requests.get(link).text
    urls = re.findall(r'https?://[^\s<>"]+[|www\.^\s<>"]+', data)

    # Write the numbered URLs to the in-memory buffer, then print it
    for i, url in enumerate(urls):
        output.write(f"{i}: {url}\n")
    print(output.getvalue())

Out:

0: https://google.com
1: https://bing.com
2: https://duckduckgo.com.
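
A likely reason the code in the question fails is that open() expects a file path, while requests.get(link).text already holds the file's contents as a string. Below is a minimal sketch of the line-by-line approach, assuming the downloaded text is wrapped in io.StringIO so it can be iterated like a file. The sample_text string stands in for the response body so the example runs offline, extract_urls is a hypothetical helper name, and the pattern is a simplified one rather than the pattern from the question.

```python
import io
import re

# Simplified pattern (an assumption, not the pattern from the question):
# match up to the next whitespace, angle bracket, or double quote.
URL_PATTERN = re.compile(r'https?://[^\s<>"]+')


def extract_urls(text):
    """Return all URLs found in text, scanning it line by line."""
    urls = []
    # StringIO makes an in-memory string iterable like an open file.
    with io.StringIO(text) as file:
        for line in file:
            urls.extend(URL_PATTERN.findall(line))
    return urls


# Stand-in for requests.get(link).text, so no network access is needed.
sample_text = (
    "https://google.com and https://bing.com are both search engines.\n"
    "They are followed by https://duckduckgo.com.\n"
)

print(extract_urls(sample_text))
```

With this pattern the sentence-ending period stays attached to the last URL, just as in the output above, so trailing punctuation may still need to be stripped separately.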

edited Feb 02 '22 at 12:37

answered Feb 02 '22 at 11:54

THANK YOU! Quick question: how can I print to an in-memory buffer (the `io.StringIO() as output` part)? – facialrecognition Feb 02 '22 at 12:00
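
On the follow-up about printing to an in-memory buffer: print() accepts a file argument, so a file-like object such as io.StringIO can receive the output. A minimal sketch, assuming the URLs have already been extracted into a list (the urls values here are illustrative, not fetched from the remote file):

```python
import io

urls = ["https://google.com", "https://bing.com"]  # illustrative values

# print() writes to any file-like object passed as file=.
with io.StringIO() as output:
    for i, url in enumerate(urls):
        print(f"{i}: {url}", file=output)
    # Read the buffer while it is still open.
    result = output.getvalue()

print(result, end="")
```

Calling getvalue() after the with block has exited raises ValueError, because the buffer is closed at that point, so the value is read inside the block.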

score 1 · Answer 2 · answered Feb 02 '22 at 11:51

You did not escape the // in the pattern.

I fixed the regex for you:

https?:\/\/[^\s<>"]+[|www\.^\s<>"]+

By the way, you should import re.
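
For reference, a forward slash has no special meaning in Python's re syntax, so the escaped and unescaped forms should match the same text. A quick check, assuming a short sample sentence in place of the remote file and a simplified pattern:

```python
import re

sample = 'See https://google.com and http://bing.com for details.'

unescaped = re.findall(r'https?://[^\s<>"]+', sample)
escaped = re.findall(r'https?:\/\/[^\s<>"]+', sample)

# Both patterns find the same URLs, so escaping the slashes changes nothing.
print(unescaped == escaped)
print(unescaped)
```

This suggests the missing escape is not what prevents extraction in the question's code.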

Alik.Koldobsky

It still doesn't work :( URL extraction works okay on local files, just not remote files. – facialrecognition Feb 02 '22 at 11:53
