I'm working on a project that extracts all the information from news articles on media websites. For this I'm using the newspaper3k library, which works quite well.

However, I have a problem with some URLs (redirect links). According to my research, newspaper3k does not follow the redirect; it only processes the URL passed to it as a parameter.

Here is an example of a link I would like to deal with:

url = "wtm.actualite.20minutes.fr/redirection.html?m=3e2b20a2f1f6dd3c60608f54d7ad4dc5&c=fr&u=https%3A%2F%2Fwww.20minutes.fr%2Fmonde%2F2943823-20210103-bahamas-disparition-bateau-20-personnes-bord%3Fxtor%3DEREC-182-%5Bactualite%5D&dc=yt0U%2FI8COMJyjwQQ1fA2kVEXpoP0nsZydMTZS6jTm2DdKasFuV%2FVA7rEphhqMfGAy%2FlztUlVN4MJt5tg%2FQXfJwmXMRQL8g3Gfwhl%2BsjkkYmd%2BDxDUhb%2BpPRL%2BNsiDETNQeP3MmrQ6ATGJT%2Blf46Zg4DHd%2FzaXy%2B7UAuxatp2UcVd39HKuuMfQHmyDV%2BAxSAJrd4x5CxHqy3uTtZoQEjwGdZ%2FRtoa7YLOWLKhN9tg4TM%3D"

So the goal with this URL is to get the final URL (after redirection) and then pass it to newspaper3k.

I have tried the following solutions, but they don't work for me:

1 - using the requests library as follows: response = requests.get(url, verify=False, allow_redirects=True)

2 - using the mechanize library as follows:

br = mechanize.Browser()
resp = br.open(url)

I would like the same behavior as the webbrowser module, but without actually opening a browser:

import webbrowser
webbrowser.open_new(url)

and end up with the right URL:

https://www.20minutes.fr/monde/2943823-20210103-bahamas-disparition-bateau-20-personnes-bord?xtor=EREC-182-[actualite]

Thank you in advance for your reply :)

Nounes MEZ

2 Answers

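The redirect does not happen at the HTTP level (the server does not answer with a redirect response); it is triggered by the HTML content of the page itself. That is why allow_redirects=True has no effect here. You can verify this by saving the text of the response with the following code: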

with open("actualite.html", "w") as f:
    f.write(response.text)

If you open the saved local file in a browser, it will redirect. In other words, the browser performs the redirect, not the server.
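One common way a page can redirect from its own HTML is a meta refresh tag. The sketch below is only an illustration of how such a target could be read without a browser: it assumes the page uses `<meta http-equiv="refresh">`, which may not be the case for the 20minutes page (it could just as well use JavaScript). The names `MetaRefreshFinder` and `find_meta_refresh` and the sample HTML are made up for this example.

```python
from html.parser import HTMLParser


class MetaRefreshFinder(HTMLParser):
    """Remembers the target of the first <meta http-equiv="refresh"> tag."""

    def __init__(self):
        super().__init__()
        self.target = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.target is not None:
            return
        attrs = dict(attrs)
        if (attrs.get("http-equiv") or "").lower() != "refresh":
            return
        # The content attribute looks like "0; url=https://example.com/page"
        content = attrs.get("content") or ""
        _, _, rest = content.partition(";")
        key, _, value = rest.strip().partition("=")
        if key.strip().lower() == "url":
            self.target = value.strip().strip("'\"")


def find_meta_refresh(html):
    """Return the meta refresh target found in html, or None."""
    finder = MetaRefreshFinder()
    finder.feed(html)
    return finder.target


# Hypothetical page content, not the real response
sample = (
    '<html><head>'
    '<meta http-equiv="refresh" content="0; url=https://example.com/article">'
    '</head></html>'
)
print(find_meta_refresh(sample))
```

If the function returns None for the real response text, the redirect is probably done some other way (for example in a script), and the saved file is worth inspecting by hand.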

To solve this, you could use a tool that drives a real browser, such as Selenium.

Edit: Here is how you could use Selenium to do this:

from selenium import webdriver
from selenium.webdriver.chrome.service import Service

url = "https://wtm.actualite.20minutes.fr/redirection.html?m=3e2b20a2f1f6dd3c60608f54d7ad4dc5&c=fr&u=https%3A%2F%2Fwww.20minutes.fr%2Fmonde%2F2943823-20210103-bahamas-disparition-bateau-20-personnes-bord%3Fxtor%3DEREC-182-%5Bactualite%5D&dc=yt0U%2FI8COMJyjwQQ1fA2kVEXpoP0nsZydMTZS6jTm2DdKasFuV%2FVA7rEphhqMfGAy%2FlztUlVN4MJt5tg%2FQXfJwmXMRQL8g3Gfwhl%2BsjkkYmd%2BDxDUhb%2BpPRL%2BNsiDETNQeP3MmrQ6ATGJT%2Blf46Zg4DHd%2FzaXy%2B7UAuxatp2UcVd39HKuuMfQHmyDV%2BAxSAJrd4x5CxHqy3uTtZoQEjwGdZ%2FRtoa7YLOWLKhN9tg4TM%3D"

options = webdriver.ChromeOptions()
options.add_argument('ignore-certificate-errors')
# Selenium 4 style: the driver path goes through Service and the options through options=
# (the older chrome_options= and executable_path= arguments were removed)
service = Service(r"C:/Users/james/Documents/Selenium/chromedriver.exe")
driver = webdriver.Chrome(service=service, options=options)
driver.get(url)
# current_url is the address the browser ended up on after the redirect
print(driver.current_url)
driver.quit()
James

@James Thank you very much for your answer! It helped me a lot.

I'm currently working on AWS Glue, so I can only use certain libraries (Selenium is not available, I guess). However, here is my way of finding the link (following your logic, of course):

import requests
from bs4 import BeautifulSoup
from urllib.parse import unquote

url = "https://wtm.actualite.20minutes.fr/redirection.html?m=3e2b20a2f1f6dd3c60608f54d7ad4dc5&c=fr&u=https%3A%2F%2Fwww.20minutes.fr%2Fmonde%2F2943823-20210103-bahamas-disparition-bateau-20-personnes-bord%3Fxtor%3DEREC-182-%5Bactualite%5D&dc=yt0U%2FI8COMJyjwQQ1fA2kVEXpoP0nsZydMTZS6jTm2DdKasFuV%2FVA7rEphhqMfGAy%2FlztUlVN4MJt5tg%2FQXfJwmXMRQL8g3Gfwhl%2BsjkkYmd%2BDxDUhb%2BpPRL%2BNsiDETNQeP3MmrQ6ATGJT%2Blf46Zg4DHd%2FzaXy%2B7UAuxatp2UcVd39HKuuMfQHmyDV%2BAxSAJrd4x5CxHqy3uTtZoQEjwGdZ%2FRtoa7YLOWLKhN9tg4TM%3D"
response = requests.get(url, verify=False, allow_redirects=True)

if response.status_code == 200:
    # parse the HTML using BeautifulSoup
    soup = BeautifulSoup(response.text, 'html.parser')

    # take the first <link> tag that has an href attribute
    link = soup.find("link", href=True)
    if link is not None:
        # unquote twice, as the href value may be percent-encoded more than once
        new_url = unquote(unquote(link['href']))
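For this particular link there may be an even simpler route: the target address already appears, percent-encoded, in the `u` query parameter of the redirect URL, so it can be read with the standard library alone and without any HTTP request. This is a minimal sketch that assumes the `u` parameter always carries the final URL, which the question's example suggests but does not guarantee. The helper name `extract_target` is made up, and the URL below is a shortened version of the one from the question (the `m` and `dc` parameters are left out).

```python
from urllib.parse import urlsplit, parse_qs


def extract_target(redirect_url, param="u"):
    """Return the decoded value of one query parameter, or None if it is absent."""
    query = urlsplit(redirect_url).query
    # parse_qs splits the query string and percent-decodes every value
    values = parse_qs(query).get(param)
    return values[0] if values else None


# Shortened version of the redirect URL from the question
url = (
    "https://wtm.actualite.20minutes.fr/redirection.html?c=fr"
    "&u=https%3A%2F%2Fwww.20minutes.fr%2Fmonde%2F2943823-20210103-bahamas-"
    "disparition-bateau-20-personnes-bord%3Fxtor%3DEREC-182-%5Bactualite%5D"
)

print(extract_target(url))
# https://www.20minutes.fr/monde/2943823-20210103-bahamas-disparition-bateau-20-personnes-bord?xtor=EREC-182-[actualite]
```

The resulting string can then be passed to newspaper3k like any other article URL. If a redirect link does not contain a `u` parameter, the function returns None and the HTML-based approach above is still needed.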

Thanks again for your help, you are a hero :)

Nounes MEZ