
I'm using scrapely to extract data from some HTML, but I'm having difficulties extracting a list of items.

The scrapely GitHub project only describes a simple example:

from scrapely import Scraper
s = Scraper()

s.train(url, data)
s.scrape(another_url)

This is nice if, for example, you are trying to extract data as described:

Usage (API)

Scrapely has a powerful API, including a template format that can be edited externally, that you can use to build very capable scrapers.

What follows that section is a quick example of the simplest possible usage, that you can run in a Python shell.

However, I'm not sure how to extract the data when the page contains something like this:

Ingredientes

- 50 gr de hojas de albahaca
- 4 cucharadas (60 ml) de piñones
- 2 - 4 dientes de ajo
- 120 ml (1/2 vaso) de aceite de oliva virgen extra
- 115 gr de queso parmesano recién rallado
- 25 gr de queso pecorino recién rallado ( o queso de leche de oveja curado)

I know I can't extract this by using XPath or CSS selectors, but I'm more interested in using parsers that can extract data from similar pages.

rkmax

  • Isn't this a bit too broad? Can you post a link to the page, for example, and explain what exactly you want from it? Then someone answering can generalize – e4c5 Jun 04 '16 at 05:52
  • As e4c5 says, this is too broad to answer right now... If you can't link the sample page, could you explain why your list isn't suitable for normal xpath/css selectors - e.g. is this list inside a block of preformatted text? – Peter Brittain Jun 04 '16 at 10:25
  • I've tidied up the grammar of your question, but you really need to add: 1. the exact HTML of a small sample page with the data you are trying to extract; 2. the data you expect to retrieve from that page (e.g. quantity, name of ingredient, maybe?); 3. what Python code you've tried so far. As it stands, the question can't be answered without a lot of guesswork. See [how to ask](http://stackoverflow.com/help/how-to-ask) and how to construct a [Minimal, complete and verifiable example](http://stackoverflow.com/help/mcve) for help asking the question in a way that gets the best help. Good luck! – J Richard Snape Jun 06 '16 at 09:54

2 Answers


Scrapely can be trained to extract a list of items. The trick is to pass the first and last items of the list you want extracted as a Python list when training. Here is an example inspired by the question (training: the 10-item ingredient list from url1; test: the 7-item list from url2):

from scrapely import Scraper

s = Scraper()

url1 = 'http://www.sabormediterraneo.com/recetas/postres/leche_frita.htm'
data = {'ingreds': ['medio litro de leche',   # first and last items
  u'canela y az\xfacar para espolvorear']}
s.train(url1, data)

url2 = 'http://www.sabormediterraneo.com/recetas/cordero_horno.htm'
print(s.scrape(url2))

Here is the output:

[{u'ingreds': [
  u' 2 piernas o dos paletillas de cordero lechal o recental ',
  u'3 dientes de ajo',
  u'una copita de vino tinto / o / blanco',
  u'una copita de agua',
  u'media copita de aceite de oliva',
  u'or\xe9gano, perejil',
  u'sal, pimienta negra y aceite de oliva']}]

Training on the question's ingredient list (http://www.sabormediterraneo.com/cocina/salsas6.htm) did not generalize directly to the "recetas" pages. One solution would be to train several scrapers and then check which one works on a given page. (Training one scraper on several pages did not give a general solution in a quick test of mine.)
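
A minimal sketch of that idea (the training pairs, URLs, and the looks_valid check below are placeholders, not code I have tested):

from scrapely import Scraper

def build_scrapers(training_sets):
    # training_sets: list of (url, data) pairs, one per page family
    scrapers = []
    for url, data in training_sets:
        s = Scraper()
        s.train(url, data)
        scrapers.append(s)
    return scrapers

def scrape_with_fallback(scrapers, url, looks_valid=bool):
    # return the first result that passes the (placeholder) sanity check
    for s in scrapers:
        try:
            result = s.scrape(url)
        except Exception:
            continue
        if looks_valid(result):
            return result
    return None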

Ulrich Stern

Scrapely can extract lists of items from structured lists (e.g. <ul> or <ol>); see the other answer. However, because it extracts content using HTML/document fragments, it is not able to split up text-formatted data contained within a single tag with no delimiting tags (such as <li></li>), which seems to be what you're trying to do here.

But if you are able to select the ingredients block as a whole, you can easily post-process the data you receive to get the required output. For example, in your case .split('\n')[3:-2] would give you the ingredients as the following list:

['- 50 gr de hojas de albahaca',
 '- 4 cucharadas (60 ml) de piñones',
 '- 2 - 4 dientes de ajo',
 '- 120 ml (1/2 vaso) de aceite de oliva virgen extra',
 '- 115 gr de queso parmesano recién rallado',
 '- 25 gr de queso pecorino recién rallado ( o queso de leche de oveja curado)']
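
If the surrounding lines vary from page to page, a slightly more defensive variant (just a sketch, not needed for the example above) is to keep only the lines that look like ingredient bullets instead of relying on fixed indices:

def extract_ingredients(block):
    # keep only the "- ..." bullet lines from the scraped text block
    return [line.strip() for line in block.split('\n')
            if line.strip().startswith('- ')]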

If you want to do this regularly (or need to add post-processing to multiple fields), you could subclass the Scraper class as follows to add a custom method:

from scrapely import Scraper

class PostprocessScraper(Scraper):

    def scrape_page_postprocess(self, page, processors=None):
        # processors maps a field name to a function applied to each
        # extracted item for that field
        if processors is None:
            processors = {}

        result = self.scrape_page(page)
        for r in result:
            for field, items in r.items():
                if field in processors:
                    fn = processors[field]
                    r[field] = [fn(i) for i in items]

        return result

This new method, scrape_page_postprocess, accepts a dictionary of post-processors to run over the returned data, keyed by field. For example:

processors = {'ingredients': lambda s: s.split('\n')[3:-2]}
result = scraper.scrape_page_postprocess(page, processors)
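
For completeness, scraper and page in the snippet above could be set up roughly like this (a sketch; the URLs and training data are placeholders, and url_to_page is the helper scrapely's own Scraper.scrape uses to turn a URL into the page object that scrape_page expects, so check that it is available in your version):

from scrapely.htmlpage import url_to_page

scraper = PostprocessScraper()
# placeholder training call: use a real page and its real first/last list items
scraper.train('http://example.com/recipe1',
              {'ingredients': ['50 gr de hojas de albahaca']})

# placeholder page to scrape with scrape_page_postprocess
page = url_to_page('http://example.com/recipe2')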

mfitzp