I'm scraping ~10 websites for the same information, and currently have a script for each one of them that works on its own. These scripts all have the same base (iterate over available pages, scrape information, save it), but different attributes.

For example, this is how I'm extracting the author element from two of the pages:

page.at('b[itemprop="author"]').children.text.strip
page.at('.author-username').text.strip

My goal is to refactor this so the main logic is handled in a class, but I'm having trouble figuring out how to pass in the above extractors depending on the source. I'm aware that I can pass CSS selectors as arguments, but as you can see, each extraction needs some additional logic beyond the selector.

While I could have a separate method to handle this (as outlined in the previous link), this would quickly get out of hand with ~10 sources.

What is the best way to refactor this code?

Manonthemoon

1 Answer

I would probably go with a hash.

Assuming there's not too much detail, put it all in a sort of Rosetta Stone hash that supplies the relevant info for each page. This can be used in conjunction with a case...when statement to load the relevant details.

Something like:

site_attributes = {
  site_1: ['attribute_1', 'attribute_2', ... ],
  site_2: ['attribute_3', 'attribute_4', ... ],
  ...
}

It may need to be a little more complex if you need to call different methods on the results of different attributes. Then your array of attributes for each site would need to be hashes instead of strings. Something like:

[
  {
    attr: 'attribute_1',
    methods: [:children, :text, :strip]
  }, {
    attr: 'attribute_2',
    methods: [:text, :strip]
  },
  ...
]

Then you can iterate over the attributes with each, pass each one to page.at(), and call the additional methods on the result in order.
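The idea above could be sketched roughly as follows. This is a minimal illustration, not a definitive implementation: `FakeNode` and `FakePage` are hypothetical stand-ins so the snippet runs without any gems, and `SITE_ATTRIBUTES`, `extract`, and the `:author` key are names assumed for the example. The two selectors are the ones from the question.

```ruby
# Hypothetical stand-ins for the real page and node objects, so the
# sketch runs with plain Ruby. In a real scraper you would pass the
# actual page object instead.
FakeNode = Struct.new(:text, :children)

class FakePage
  def initialize(nodes)
    @nodes = nodes
  end

  # Mimics page.at(selector): returns the matching node or nil.
  def at(selector)
    @nodes[selector]
  end
end

# One entry per site; each field has a selector plus the methods to
# call, in order, on whatever page.at returns. (Assumed layout.)
SITE_ATTRIBUTES = {
  site_1: {
    author: { attr: 'b[itemprop="author"]', methods: [:children, :text, :strip] }
  },
  site_2: {
    author: { attr: '.author-username', methods: [:text, :strip] }
  }
}.freeze

# Looks up the config for a site and field, then chains the listed
# methods onto the node found by the selector.
def extract(page, site, field)
  config = SITE_ATTRIBUTES.fetch(site).fetch(field)
  node = page.at(config[:attr])
  return nil if node.nil?

  config[:methods].inject(node) { |result, name| result.public_send(name) }
end

page_1 = FakePage.new({ 'b[itemprop="author"]' => FakeNode.new(nil, FakeNode.new("  Alice ")) })
page_2 = FakePage.new({ '.author-username' => FakeNode.new(" Bob\n") })

puts extract(page_1, :site_1, :author)
puts extract(page_2, :site_2, :author)
```

With this layout, adding an eleventh source means adding one more entry to the hash rather than another method or script.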

Scott Schupbach
  • Thanks for your answer Scott, that's helpful. Could you please elaborate how I would do the method calls that are supplied through those strings (`['children', 'text', 'strip']`)? As an example, how would I run `attribute_1.children.text.strip`? – Manonthemoon Nov 25 '16 at 13:06
  • Ah I figured it out, I'm using `page.send()` to call each method. – Manonthemoon Nov 25 '16 at 17:20
  • Right. And really, those lists of methods should have been symbols rather than strings. Either should work the same, but symbols are more memory efficient. – Scott Schupbach Nov 28 '16 at 23:16
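
As a small illustration of the comments above, here is a sketch of dynamic method calls with both symbols and strings. The `AuthorNode` class and its methods are hypothetical. It also shows one difference worth knowing: `send` will call private methods, while `public_send` respects visibility, so `public_send` may be the safer choice when the method names come from configuration.

```ruby
# Hypothetical class used only for this illustration.
class AuthorNode
  def text
    "  Jane Doe  "
  end

  private

  def raw_html
    "<b>Jane Doe</b>"
  end
end

node = AuthorNode.new

# Symbols and strings are both accepted as method names.
from_symbol = node.public_send(:text).public_send(:strip)
from_string = node.public_send("text").public_send("strip")

# send bypasses visibility and reaches the private method.
via_send = node.send(:raw_html)

# public_send refuses to call a private method.
blocked = begin
  node.public_send(:raw_html)
  false
rescue NoMethodError
  true
end
```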