
I am looking to scrape multiple website domains for various hrefs within their careers pages.

I only want the links to the jobs and nothing else, and the easiest way I have found to do that is to parse the Scrapy response and pull the hrefs from a specific CSS path.

So far my solution is to create two dictionaries, one per site, each with the generic keys URL and Attribute; the values are the careers page URL and a pre-identified CSS path.

In the future I am going to create these dictionaries automatically from a file of data.
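
For example, something like this is what I have in mind (just a sketch; the file name and column names are placeholders):

import csv

# Sketch only: build the {"URL": ..., "Att": ...} dictionaries from a
# hypothetical careers.csv with "url" and "css" columns.
with open("careers.csv", newline="") as f:
    configs = [{"URL": row["url"], "Att": row["css"]} for row in csv.DictReader(f)]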

I am storing all of these dictionaries in a list in Python. My plan was to take each dictionary from the list, one at a time, and use its URL and attribute as the input for Scrapy.

# Each list contains two strings:
# the website's careers URL, and
# the CSS path of the jobs container on that page.
# The below is an example, but I will name the lists 1, 2, 3, etc. so I can refer to them more easily from a database.
List1 = ["https://exampleurl.com/careers", ".joblist a::attr(href)"]
List2 = ["https://exampleurl.com/en/Company/Career-Opportunities", ".content a::attr(href)"]
Dicti = {"URL": List1[0], "Att": List1[1]}

This is essentially how I have the list of dictionaries set up.

I am then using

start_urls = [
        List1[Dicti["URL"]],
        List2[Dicti["URL"]]
    ]

I am then also parsing the data like so,

jobs = response.css(Dicti["Att"]).extract()

I think this is potentially where I am going wrong. Although it does load each URL and scrape the HTML from each, it then isn't parsing the attributes correctly.

I tried scraping the lists one at a time, with only one list in the start URLs, and this works perfectly; the problem appears when I try to put more than one list into start_urls.

What exactly am I doing wrong? Maybe I misunderstand how the spider works. I essentially want to run List1, then stop the spider and run a new instance for List2, all while saving the extracted data.
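
In other words, something along these lines is what I imagine happening (only a rough sketch of the intent; run_spider is a made-up placeholder, not a real Scrapy call):

# Rough sketch of the intent only -- run_spider is a made-up placeholder.
for careers_url, css_path in [List1, List2]:
    run_spider(careers_url, css_path)  # scrape one site, save its job links, then move on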

Any advice on how to overcome this would be massively appreciated.

  • The proper solution is to edit your earlier post, or just delete it (but please understand that *repeatedly* deleting your questions is problematic behavior, too). For now, I have nominated your earlier question as a duplicate of this one, though I'm afraid this one too is rather vague. What problem *exactly* are you running into? – tripleee Oct 16 '20 at 08:50
  • I am looking to scrape and pull the attribute of the CSS path and URL simultaneously, but it's only pulling URLs and scraping the wrong attribute. I want to run the dictionaries in the spider one by one. – Tommy543 Oct 16 '20 at 09:21
  • I doubt that this `List1[Dicti["URL"]]` actually works? You need an index (integer in list range) to access a list -- or am I missing something? – Timus Oct 16 '20 at 10:54
  • It doesn't work; this is what I need help with! I can get it to pull from a single list, but not from a list of lists, and that is what I need. – Tommy543 Oct 16 '20 at 11:04
  • Try `start_urls = [Dicti["URL"], Dicti["URL"]]` or simpler `start_urls = [List1[0], List2[0]]` ... – Timus Oct 16 '20 at 11:08

1 Answer


Either organize your data as a list of short lists

urls = [
    ["https://exampleurl.com/careers", ".joblist a::attr(href)"],
    ["https://exampleurl.com/en/Company/Career-Opportunities", ".content a::attr(href)"],
    ...
]

and then iterate over urls and access the components like

for entry in urls:
    url = entry[0]
    attribute = entry[1]

or shorter

for url, attribute in urls:
    ...

or make a list of small dictionaries

urls = [
    {'URL': "https://exampleurl.com/careers", 'ATT': ".joblist a::attr(href)"},
    {'URL': "https://exampleurl.com/en/Company/Career-Opportunities", 'ATT': ".content a::attr(href)"},
    ...
] 

and then iterate over urls and access the components like

for dict_ in urls:
    url = dict_['URL']
    attribute = dict_['ATT']
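
Either way, you can then feed this into your spider by overriding start_requests and passing the CSS path along in the request's meta, so that parse knows which selector to use for each page. A rough sketch, assuming the list-of-dictionaries variant from above (the spider name and the item key are just placeholders):

import scrapy

class JobsSpider(scrapy.Spider):
    name = "jobs"

    def start_requests(self):
        # urls is the list of dictionaries defined above (assumed to be in this module)
        for dict_ in urls:
            yield scrapy.Request(
                dict_['URL'],
                callback=self.parse,
                meta={'attribute': dict_['ATT']},  # carry the CSS path along with the request
            )

    def parse(self, response):
        # read the CSS path back from the request's meta and extract the job hrefs
        for href in response.css(response.meta['attribute']).getall():
            yield {'job_url': response.urljoin(href)}
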
  • This looks like a very promising answer, I will give it a go and let you know if it works! Thank you very much. – Tommy543 Oct 16 '20 at 12:27