
I am looking to scrape multiple website domains for various hrefs within their careers pages.

I only want the links to the jobs and nothing else, and the easiest way I have found to do that is to parse the Scrapy response and pull the hrefs from a specific CSS path.

So far my solution is to create two dictionaries, one per site, each with the generic keys URL and Attribute; the values are the careers page URL and a pre-identified CSS path.

In the future I am going to create these dictionaries automatically from a file of data.
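
For example, something like this is what I have in mind (just a sketch; the file name and column names are placeholders):

import csv

# Sketch only: build the {"URL": ..., "Att": ...} dictionaries from a
# hypothetical careers.csv with "url" and "css" columns.
with open("careers.csv", newline="") as f:
    configs = [{"URL": row["url"], "Att": row["css"]} for row in csv.DictReader(f)]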

I am storing all of these dictionaries in a list in Python. My plan was to take each dictionary from the list, one at a time, and use its URL and attribute as the input for Scrapy.

# Each list contains two strings:
# the website's careers URL, and
# the CSS path of the jobs container on that page.
# The below is an example, but I will name the lists 1, 2, 3, etc. so I can refer to them more easily from a database.
List1 = ["https://exampleurl.com/careers", ".joblist a::attr(href)"]
List2 = ["https://exampleurl.com/en/Company/Career-Opportunities", ".content a::attr(href)"]
Dicti = {"URL": List1[0], "Att": List1[1]}

This is essentially how I have the list of dictionaries set up.

I am then using

start_urls = [
        List1[Dicti["URL"]],
        List2[Dicti["URL"]]
    ]

I am then also parsing the data like so,

jobs = response.css(Dicti["Att"]).extract()

I think this is potentially where I am going wrong. Although it does load each URL and scrape the HTML from each, it then isn't parsing the attributes correctly.

I tried scraping the lists one at a time, with only one list in the start URLs, and this works perfectly; the problem appears when I try to put more than one list into start_urls.

What exactly am I doing wrong? Maybe I misunderstand how the spider works. I essentially want to run List1, then stop the spider and run a new instance for List2, all while saving the extracted data.
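
In other words, something along these lines is what I imagine happening (only a rough sketch of the intent; run_spider is a made-up placeholder, not a real Scrapy call):

# Rough sketch of the intent only -- run_spider is a made-up placeholder.
for careers_url, css_path in [List1, List2]:
    run_spider(careers_url, css_path)  # scrape one site, save its job links, then move on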

Any advice on how to overcome this would be massively appreciated.

  • The proper solution is to edit your earlier post, or just delete it (but please understand that *repeatedly* deleting your questions is problematic behavior, too). For now, I have nominated your earlier question as a duplicate of this one, though I'm afraid this one too is rather vague. What problem *exactly* are you running into? – tripleee Oct 16 '20 at 08:50
  • I am looking to scrape and pull the attribute of the CSS path and URL simultaneously, but it's only pulling URLs and scraping the wrong attribute. I want to run the dictionaries in the spider one by one. – Tommy543 Oct 16 '20 at 09:21
  • I doubt that this `List1[Dicti["URL"]]` actually works? You need an index (integer in list range) to access a list -- or am I missing something? – Timus Oct 16 '20 at 10:54
  • It doesn't work; this is what I need help with! I can get it to pull from a single list, but not from a list of lists, and that is what I need. – Tommy543 Oct 16 '20 at 11:04
  • Try `start_urls = [Dicti["URL"], Dicti["URL"]]` or simpler `start_urls = [List1[0], List2[0]]` ... – Timus Oct 16 '20 at 11:08

1 Answer


Either organize your data as a list of short lists

urls = [
    ["https://exampleurl.com/careers", ".joblist a::attr(href)"],
    ["https://exampleurl.com/en/Company/Career-Opportunities", ".content a::attr(href)"],
    ...
]

and then iterate over urls and access the components like

for entry in urls:
    url = entry[0]
    attribute = entry[1]

or shorter

for url, attribute in urls:
    ...

or make a list of small dictionaries

urls = [
    {'URL': "https://exampleurl.com/careers", 'ATT': ".joblist a::attr(href)"},
    {'URL': "https://exampleurl.com/en/Company/Career-Opportunities", 'ATT': ".content a::attr(href)"},
    ...
] 

and then iterate over urls and access the components like

for dict_ in urls:
    url = dict_['URL']
    attribute = dict_['ATT']
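
Either way, you can then feed this into your spider by overriding start_requests and passing the CSS path along in the request's meta, so that parse knows which selector to use for each page. A rough sketch, assuming the list-of-dictionaries variant from above (the spider name and the item key are just placeholders):

import scrapy

class JobsSpider(scrapy.Spider):
    name = "jobs"

    def start_requests(self):
        # urls is the list of dictionaries defined above (assumed to be in this module)
        for dict_ in urls:
            yield scrapy.Request(
                dict_['URL'],
                callback=self.parse,
                meta={'attribute': dict_['ATT']},  # carry the CSS path along with the request
            )

    def parse(self, response):
        # read the CSS path back from the request's meta and extract the job hrefs
        for href in response.css(response.meta['attribute']).getall():
            yield {'job_url': response.urljoin(href)}
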
  • This looks like a very promising answer, I will give it a go and let you know if it works! Thank you very much. – Tommy543 Oct 16 '20 at 12:27