My Python code successfully scrapes text from https://www.groupeactual.eu/offre-emploi and saves it in a CSV file.

However, the site has multiple pages of results, and I would like to scrape all of them.

For example, when I click the link to "page 2", the URL in the address bar changes, but when I use that URL in my code, I still get the results from page 1.

How can my code be changed to scrape data from all the available listed pages?

My code:

from bs4 import BeautifulSoup
import requests
import pandas as pd 

response = requests.get('https://www.groupeactual.eu/offre-emploi').text

soup = BeautifulSoup(response, "html.parser")

[Rest of the code goes here .... ]
mustaqSHAH

1 Answer

The data is loaded via Ajax from a different URL. This script goes through all the pages and prints the title and link of each listing on every page:

import re
import requests
from bs4 import BeautifulSoup


data = {
    '_token': "",
    'limit': "21",
    'order': "",
    'adresse': "",
    'google_adresse': "",
    'distance': "",
    'niveau-experience': "0;10",
    'relations[besoin][contrat][debut]': "",
    'js_range_demarrage_dates': "",
    'informations[remunerations]': "10000;100000",
    'page': ""
}

headers = {
    'X-Requested-With': 'XMLHttpRequest'
}

url = 'https://www.groupeactual.eu/offre-emploi?limit=21&order=&adresse=&distance=&niveau-experience=0%3B10&relations%5Bbesoin%5D%5Bcontrat%5D%5Bdebut%5D=&js_range_demarrage_dates=&informations%5Bremunerations%5D=10000%3B100000&page=1'
api_url = 'https://www.groupeactual.eu/offre-emploi/search'


urls = []
with requests.Session() as s:
    # Load the normal listing page once to pick up the CSRF token (and cookies)
    soup = BeautifulSoup(s.get(url).content, 'html.parser')
    data['_token'] = soup.select_one('meta[name="csrf-token"]')['content']

    page = 1
    while True:
        # Ask the search endpoint for the next page of results
        data['page'] = page
        print('Page {}...'.format(page))
        soup = BeautifulSoup(s.post(api_url, data=data, headers=headers).content, 'html.parser')
        cards = soup.select('.card')
        # An empty response means we are past the last page
        if not cards:
            break

        for i, card in enumerate(cards, 1):
            # The job URL is the first single-quoted string in the onclick attribute
            u = re.search(r"'(.*?)'", card['onclick']).group(1)
            print('{:<5} {:<60} {}'.format(i, card.h3.text, u))
            urls.append(u)

        page += 1

print(urls)

Prints:

Page 1...
1     Coffreur bancheur (H/F)                                      https://www.groupeactual.eu/offre-emploi/coffreur-bancheur-hf-ernee-RE0046450A46458?utm_medium=api&utm_campaign=Coffreur+bancheur+%28H%2FF%29-46458
2     PEINTRE H/F                                                  https://www.groupeactual.eu/offre-emploi/peintre-hf-laval-RE0046827A50628?utm_medium=api&utm_campaign=PEINTRE+H%2FF-50628
3     PEINTRE H/F                                                  https://www.groupeactual.eu/offre-emploi/peintre-hf-augny-AG5640208TAA50789?utm_medium=api&utm_campaign=PEINTRE+H%2FF-50789
4     Technicien Fibre Optique (h/f)                               https://www.groupeactual.eu/offre-emploi/technicien-fibre-optique-hf-forbach-AG5640208BCA50790?utm_medium=api&utm_campaign=Technicien+Fibre+Optique+%28h%2Ff%29-50790
5     CONDUCTEUR D'ENGINS H/F                                      https://www.groupeactual.eu/offre-emploi/conducteur-dengins-hf-amblainville-RE0047896A51376?utm_medium=api&utm_campaign=CONDUCTEUR+D%27ENGINS+H%2FF-51376
6     Technicien Informatique (h/f)                                https://www.groupeactual.eu/offre-emploi/technicien-informatique-hf-metz-RE0047858A52066?utm_medium=api&utm_campaign=Technicien+Informatique+%28h%2Ff%29-52066
7     Opérateur Traitement de Surface H/F                          https://www.groupeactual.eu/offre-emploi/operateur-traitement-de-surface-hf-bressuire-RE0050805A53145?utm_medium=api&utm_campaign=Op%C3%A9rateur+Traitement+de+Surface+H%2FF-53145
8     CHAUFFEUR PL SPL (H/F)                                       https://www.groupeactual.eu/offre-emploi/chauffeur-pl-spl-hf-boulogne-sur-mer-RE0047560A53509?utm_medium=api&utm_campaign=CHAUFFEUR+PL+SPL+%28H%2FF%29-53509
9     Technicien d'Installations Électriques (H/F)                 https://www.groupeactual.eu/offre-emploi/technicien-dinstallations-electriques-hf-metz-RE0048762A53801?utm_medium=api&utm_campaign=Technicien+d%27Installations+%C3%89lectriques+%28H%2FF%29-53801
10    Cuisinier en industrie agroalimentaire (H/F)                 https://www.groupeactual.eu/offre-emploi/cuisinier-en-industrie-agroalimentaire-hf-talmont-saint-hilaire-RE0073692A93442?utm_medium=api&utm_campaign=Cuisinier+en+industrie+agroalimentaire+%28H%2FF%29-93442
11    Préparateur de commandes (H/F)                               https://www.groupeactual.eu/offre-emploi/preparateur-de-commandes-hf-sevremoine-RE0074893A94943?utm_medium=api&utm_campaign=Pr%C3%A9parateur+de+commandes+%28H%2FF%29-94943

... and so on (until page 135)
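
If you want to stop after a fixed number of pages and save the results to a CSV file, as in the question, the loop above can be reshaped roughly like this. It is only a sketch: `iter_pages`, `write_csv`, `fake_fetch` and `FAKE_SITE` are names made up for illustration, and the fake data stands in for the real request so the example runs offline.

```python
import csv
import io
from typing import Callable, Iterator, List, Optional, Tuple

Row = Tuple[str, str]  # (title, url)


def iter_pages(fetch_page: Callable[[int], List[Row]],
               max_pages: Optional[int] = None) -> Iterator[Tuple[int, List[Row]]]:
    """Yield (page_number, rows) until a page is empty or max_pages is reached."""
    page = 1
    while max_pages is None or page <= max_pages:
        rows = fetch_page(page)
        if not rows:  # an empty page means there is nothing left to read
            break
        yield page, rows
        page += 1


def write_csv(pages, out) -> int:
    """Write one CSV line per listing and return how many were written."""
    writer = csv.writer(out)
    writer.writerow(["page", "title", "url"])
    count = 0
    for page, rows in pages:
        for title, url in rows:
            writer.writerow([page, title, url])
            count += 1
    return count


# Hypothetical stand-in for the real POST request (assumption, not site data)
FAKE_SITE = {
    1: [("Job A", "https://example.com/a"), ("Job B", "https://example.com/b")],
    2: [("Job C", "https://example.com/c")],
}


def fake_fetch(page: int) -> List[Row]:
    return FAKE_SITE.get(page, [])


buffer = io.StringIO()  # use open("jobs.csv", "w", newline="") for a real file
total = write_csv(iter_pages(fake_fetch, max_pages=10), buffer)
print(total)
```

In the real script, `fetch_page` would presumably wrap the `s.post(api_url, data=data, headers=headers)` call and return `(card.h3.text, u)` for each card on that page.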
Andrej Kesely
  • Thank you for the answer. I actually want to store the URLs of the different pages in a list. How can I do that? Sorry if my question is a bit naive, I am new to coding! – mustaqSHAH Jul 17 '20 at 15:04
  • Thank you for the reply, but can I store the URL of each individual page number in a list? – mustaqSHAH Jul 17 '20 at 15:18
  • Thank you very much! Actually, I was wondering whether it is possible to get the links of the page numbers themselves in a list, for example: www.example.com/page=1, www.example.com/page=2, ... and so on. I would really appreciate your help! – mustaqSHAH Jul 17 '20 at 15:27
  • When I try to increase the last number (page=...) in the URL, it just redirects back to page 1! This is the main problem I am facing. – mustaqSHAH Jul 17 '20 at 15:32
  • @mustaqSHAH That's what I'm trying to say...the data you're looking for is loaded from elsewhere (see my script, the URL is `https://www.groupeactual.eu/offre-emploi/search`) – Andrej Kesely Jul 17 '20 at 15:33
  • Thank you very much for the help. If possible, could you explain how the code works? I really want to learn this kind of advanced coding. If that's not possible, no problem. Have a great day :) – mustaqSHAH Jul 17 '20 at 16:06
  • @mustaqSHAH You can open the Network tab in the Firefox developer tools (Chrome has something similar) and see where the page is making requests. Click on the next page and you will see POST requests to `https://www.groupeactual.eu/offre-emploi/search` with some parameters. – Andrej Kesely Jul 17 '20 at 16:09
  • Okay, I understand; you may be referring to Chrome DevTools! Do you have any social media where we can connect? I have lots of questions about these topics. If that's not okay with you, no problem. I want to thank you again for your help! – mustaqSHAH Jul 17 '20 at 16:14
  • I have one more question: can we put a limit on the page number where the search ends? – mustaqSHAH Jul 17 '20 at 16:15
  • @mustaqSHAH Of course, put `if page == 10: break` inside the while loop. Change the `10` to the number you want. – Andrej Kesely Jul 17 '20 at 16:16
  • Can you please explain the last part of the code (`cards = soup.select(".card")` and the rest below it)? Where can I find these parameters? – mustaqSHAH Jul 23 '20 at 16:03
  • @mustaqSHAH `cards = soup.select('.card')` uses a CSS selector; it selects all tags whose `class` attribute includes `card`. You can see the tags if you do, for example, `print(soup.prettify())` and look at the HTML structure. – Andrej Kesely Jul 23 '20 at 16:06
  • I went through the requests in dev tools. I have a few more questions: (1) How did you form the data dictionary? Did you use a cURL-to-Python method? (2) How did you form the headers dictionary? (3) How and why did you assign a value to the '_token' key in the data dictionary? (4) Can you please explain what is happening in the while True loop? – mustaqSHAH Jul 23 '20 at 19:07
  • If possible, please help. – mustaqSHAH Jul 24 '20 at 10:55
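
On the part of question (4) that concerns the inner `for` loop: the regex `r"'(.*?)'"` from the answer simply pulls out the first single-quoted string in each card's `onclick` attribute. Here is a small sketch of that step on its own. The `sample` attribute value and the helper name `url_from_onclick` are assumptions for illustration; the thread never shows the site's actual `onclick` text.

```python
import re
from typing import Optional


def url_from_onclick(onclick: str) -> Optional[str]:
    """Return the first single-quoted substring of an onclick value, or None."""
    match = re.search(r"'(.*?)'", onclick)  # non-greedy: stop at the next quote
    return match.group(1) if match else None


# Hypothetical attribute value (assumption, not copied from the site)
sample = "window.location.href='https://www.example.com/offre-emploi/some-job?utm_medium=api'"

print(url_from_onclick(sample))
print(url_from_onclick("return false;"))
```

Returning `None` when nothing matches is a small departure from the answer's script, which calls `.group(1)` directly and would raise an error on a card without a quoted URL.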