
I need to scrape all headlines on the topic of autism from Le Monde newspaper's archive (back to 1980). I'm not a programmer but a humanitarian who is trying to go "digital"...

I managed to get a list of all (daily) issues and, separately, parsing one URL at a time with BeautifulSoup and extracting its headlines works as well. But the two together don't. I feel my problem is in the parsing+iteration step, but I am not able to solve it.

from bs4 import BeautifulSoup
import requests
import re
from datetime import date, timedelta

start = date(2018, 1, 1)
end = date.today()
all_url =[]

#this chunk is working and returns a nice list of all url of all issues
day = timedelta(days=1)
one_url = "https://www.lemonde.fr/archives-du-monde/"
mydate = start

while mydate < end:
    mydate += day
    if one_url not in all_url:
        all_url.append(one_url + "{date.day:02}/{date.month:02}/{date.year}".format(date=mydate) + '/')

#this function is working as well when applied with one single url
def titles(all_url):
    
    for url in all_url:
        page = BeautifulSoup(requests.get(url).text, "lxml")
        
        regexp = re.compile(r'^.*\b(autisme|Autisme)\b.*$')
        
        for headlines in page.find_all("h3"):
            h = headlines.text
        
            for m in regexp.finditer(h):
                print(m.group())
        
titles(all_url)

This script is just stuck...

  • The start date in this code is the beginning of 2018, to make the script quicker to run... – Mary Gland May 12 '19 at 16:50
  • The script is going to take a while to run, it is not stuck. You can add a `print` inside the `titles` function's outer loop in order to check if it is still running. I did notice, however, that Le Monde seems to use URLs ending with dates in the format `01-01-2018` for its archive, so changing the separators there might help. – bla May 12 '19 at 17:34
  • I just ran the script for all the days from last year to today (496 urls), it took almost 5 minutes on my machine. – bla May 12 '19 at 17:35

2 Answers

The script is not stuck. I have added print statements so that you can see that it is working. Initially I thought the issue might be in your regex pattern.

When I actually opened one of those links (https://www.lemonde.fr/archives-du-monde/25/03/2018/), the server responded with 404 because that page does not exist. Since you generate the page URLs in code, it is likely that many of them do not correspond to anything on the server.
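
As a quick sanity check, you could skip pages that do not exist before parsing them. Here is a minimal sketch of that idea (my addition, not part of the original script; it assumes the archive answers missing dates with a plain HTTP 404):

import requests

url = "https://www.lemonde.fr/archives-du-monde/25/03/2018/"
response = requests.get(url)

# requests exposes the HTTP status code of the response
if response.status_code == 404:
    print("[-] Page does not exist, skipping:", url)
else:
    print("[+] Page found, safe to parse:", url)

Below is your script with the print statements added: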

from bs4 import BeautifulSoup
import requests
import re
from datetime import date, timedelta

start = date(2018, 1, 1)
end = date.today()
all_url =[]

#this chunk is working and returns a nice list of all url of all issues
day = timedelta(days=1)
one_url = "https://www.lemonde.fr/archives-du-monde/"
mydate = start

while mydate < end:
    mydate += day
    if one_url not in all_url:
        all_url.append(one_url + "{date.day:02}/{date.month:02}/{date.year}".format(date=mydate) + '/')

#this function is working as well when applied with one single url
def titles(all_url):

    # compile the pattern once instead of on every iteration through the urls
    regexp = re.compile(r'^.*\b(autisme|Autisme)\b.*$')

    counter = 0
    for url in all_url:
        print("[+] (" + str(counter) + ") Fetching URL " + url)
        counter += 1
        page = BeautifulSoup(requests.get(url).text, "lxml")

        found = False
        for headlines in page.find_all("h3"):
            h = headlines.text

            for m in regexp.finditer(h):
                found = True
                print(m.group())

        if not found:
            print("[-] Couldn't find anything relevant on this page....")
            print()

titles(all_url)

Script Output:

[+] (0) Fetching URL https://www.lemonde.fr/archives-du-monde/02/01/2018/
[-] Couldn't find anything relevant on this page....

[+] (1) Fetching URL https://www.lemonde.fr/archives-du-monde/03/01/2018/
[-] Couldn't find anything relevant on this page....

[+] (2) Fetching URL https://www.lemonde.fr/archives-du-monde/04/01/2018/
[-] Couldn't find anything relevant on this page....

[+] (3) Fetching URL https://www.lemonde.fr/archives-du-monde/05/01/2018/
[-] Couldn't find anything relevant on this page....

[+] (4) Fetching URL https://www.lemonde.fr/archives-du-monde/06/01/2018/
[-] Couldn't find anything relevant on this page....

[+] (5) Fetching URL https://www.lemonde.fr/archives-du-monde/07/01/2018/
[-] Couldn't find anything relevant on this page....

[+] (6) Fetching URL https://www.lemonde.fr/archives-du-monde/08/01/2018/
[-] Couldn't find anything relevant on this page....

[+] (7) Fetching URL https://www.lemonde.fr/archives-du-monde/09/01/2018/
[-] Couldn't find anything relevant on this page....

[+] (8) Fetching URL https://www.lemonde.fr/archives-du-monde/10/01/2018/
[-] Couldn't find anything relevant on this page....

[+] (9) Fetching URL https://www.lemonde.fr/archives-du-monde/11/01/2018/
[-] Couldn't find anything relevant on this page....

[+] (10) Fetching URL https://www.lemonde.fr/archives-du-monde/12/01/2018/
[-] Couldn't find anything relevant on this page....

You can inspect each URL in a web browser. Let me know if you need more help.

Zaid Afzal

The main problem is that the date format used in Le Monde's archive URLs is day-month-year, not day/month/year. To fix it, change:

all_url.append(one_url + "{date.day:02}/{date.month:02}/{date.year}".format(date=mydate) + '/')

to

all_url.append(one_url + "{date.day:02}-{date.month:02}-{date.year}".format(date=mydate) + '/')

The feeling that the program is stuck is simply due to lack of feedback. @Zaid's answer shows how to solve that in an elegant manner.

If you need a faster approach to making a bunch of HTTP requests, you should consider something asynchronous. I suggest Scrapy, a framework built for this kind of task (web scraping).

I made a simple spider to fetch all of the headlines containing 'autism' (which, as a substring, also matches the French 'autisme') in the archive (from the beginning of 2018 to today):

import re
from datetime import date
from datetime import timedelta

import scrapy

BASE_URL = 'https://www.lemonde.fr/archives-du-monde/'


def date_range(start, stop):
    # yield every date from start (inclusive) up to stop (exclusive)
    for d in range((stop - start).days):
        yield start + timedelta(days=d)


class LeMonde(scrapy.Spider):
    name = 'LeMonde'

    def start_requests(self):
        # the archive expects dates in dd-mm-YYYY format
        for day in date_range(date(2018, 1, 1), date.today()):
            url = BASE_URL + '{d.day:02}-{d.month:02}-{d.year}'.format(d=day) + '/'
            yield scrapy.Request(url)

    def parse(self, response):
        for headline in response.xpath('//h3/a/text()').getall():
            headline = headline.strip()

            # a lowercase substring match also catches the French 'autisme'
            if 'autism' in headline.lower():
                yield {'headline': headline}

I was able to scrape the headlines in 47 seconds using the above code. If you are interested, you may run it with:

scrapy runspider spider_file.py -o headlines.csv

This will generate a CSV file (headlines.csv) containing the headlines.
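
One caveat: fetching hundreds of pages concurrently can put noticeable load on the site. Scrapy can throttle itself through its standard settings; here is a sketch of how that could look on the spider above (the particular values are arbitrary choices of mine):

import scrapy

class LeMonde(scrapy.Spider):
    name = 'LeMonde'

    # standard Scrapy settings: pause between requests and cap concurrency
    custom_settings = {
        'DOWNLOAD_DELAY': 0.5,
        'CONCURRENT_REQUESTS_PER_DOMAIN': 4,
    }

    # ... start_requests and parse as above ...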

bla
  • Many thanks. Indeed, I was wrong about the date format: well spotted! Thanks as well for this chunk of Scrapy code, which seems so easy... I saw that Scrapy was a solution but understood it to be much tougher (than BeautifulSoup). Now I will dare to try it! – Mary Gland May 12 '19 at 18:57
  • bla, just by chance...what Scrapy version do you use? I downloaded 1.6.0 and can't run your script as it doesn't find some modules ("twisted" for instance). – Mary Gland May 14 '19 at 08:30
  • I used scrapy 1.6.0 with python 3.7.3. That is strange... How did you install scrapy? Twisted should be installed as its dependency. – bla May 14 '19 at 11:47
  • Sorry for the late answer. I installed it with pip. And it doesn't work from Jupyter Notebook either... I wonder if there is any interference with Anaconda. – Mary Gland May 16 '19 at 18:55
  • Does `pip freeze` list `Twisted`? If it doesn't you may want to try and reinstall scrapy. [This page](https://docs.scrapy.org/en/latest/intro/install.html) has more information regarding scrapy installation, it may also help. – bla May 16 '19 at 19:01
  • According to the installation docs you can install scrapy from the `conda-forge` channel using `conda install -c conda-forge scrapy`. I hope it helps. I have no experience with anaconda at all, :(. – bla May 16 '19 at 19:43
  • pip freeze lists Twisted==19.2.0; when run in IDLE, the shell shows the following: Traceback (most recent call last): File "/Users/admin/Desktop/CS6507/scrapy.py", line 5, in import scrapy File "/Users/admin/Desktop/CS6507/scrapy/__init__.py", line 27, in from . import _monkeypatches File "/Users/admin/Desktop/CS6507/scrapy/_monkeypatches.py", line 20, in import twisted.persisted.styles # NOQA ModuleNotFoundError: No module named 'twisted' Maybe it's because I use Python 3.7.0b4? – Mary Gland May 17 '19 at 08:05
  • It looks like some other people had similar problems. Check out [this](https://stackoverflow.com/questions/33772381/importerror-no-module-named-twisted-persisted-styles) and [this](https://stackoverflow.com/questions/30763614/installing-twisted-through-pip-broken-on-one-server) questions. – bla May 17 '19 at 13:31