
This is my Scrapy custom regex pipeline code:

for p in item['code']:
    for search_type, pattern in RegEx.regexp.iteritems():
        s = re.findall(pattern, p)
        if s:
            return item
        else:
            raise DropItem

And this is my RegEx code:

class RegEx(object):
    regexp = {
        'email': re.compile('liczba'),
        'whatever': re.compile(r'mit'),
        'blu': re.compile(r'houseLocked'),
    }

These are not the real patterns; they are just for demo purposes.

This works, but once a match is found, and "return item" is triggered, the rest is dropped.

Is it possible to continue iterating in the Scrapy pipeline?

I've been at this for 4 days and tried every permutation you can imagine, but always the same result.

I'm either missing the obvious or this is not straightforward.

If this is not possible in this manner, any recommendations for a different approach would be greatly appreciated.

Stuart
  • "and 'return item' is triggered, the rest is dropped": what do you mean by "the rest is dropped"? The rest of the items? Or that the loop just breaks and stops at that point? – Granitosaurus Jan 30 '17 at 18:57
  • If I set 3 regexes as above, and I already know that a match for each exists somewhere in the web pages being scraped, only one match is returned, along with one URL of scraped data. I just don't know why this is happening. I believe it continues scraping but simply marks the items as "dropped". It's weird. – Stuart Jan 30 '17 at 19:02

1 Answer


The process_item() method of a Scrapy pipeline is called once per item and should process only that item. As soon as it returns something or raises DropItem, the method exits, so the remaining strings and patterns are never checked.

In your code, both branches of the if/else leave the method: return item on a match, raise DropItem otherwise. The loops therefore never get past the first pattern tested against the first string, and an item whose first check fails is dropped even if a later pattern would have matched.
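To make the early exit concrete, here is a minimal sketch that runs without Scrapy. The DropItem class is a hypothetical stand-in for scrapy.exceptions.DropItem, and PATTERNS, check_original and outcome are illustrative names, not part of the question's code. It assumes Python 3.7+, where dicts keep insertion order, so 'email' is always tried first.

```python
import re


class DropItem(Exception):
    """Hypothetical stand-in for scrapy.exceptions.DropItem."""


# Demo patterns borrowed from the question; order matters for this sketch.
PATTERNS = {
    'email': re.compile(r'liczba'),
    'whatever': re.compile(r'mit'),
}


def check_original(item):
    """Mirror the question's logic: the first pattern tried decides everything."""
    for text in item['code']:
        for name, pattern in PATTERNS.items():
            if pattern.findall(text):
                return item
            else:
                # Exits the function on the first miss; later patterns never run.
                raise DropItem(name)


def outcome(item):
    """Report whether check_original kept or dropped the item."""
    try:
        check_original(item)
        return 'kept'
    except DropItem:
        return 'dropped'
```

With these patterns, outcome({'code': ['mit license']}) gives 'dropped' although 'mit' is one of the patterns, because the 'email' pattern is tried first and its failure raises immediately.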

To remedy that, move raise DropItem outside the loops, so that it runs only after every pattern has been tried against every string:

# requires: from scrapy.exceptions import DropItem
def process_item(self, item, spider):
    for p in item['code']:
        for search_type, pattern in RegEx.regexp.items():
            if pattern.search(p):
                return item  # one match found == item is valid, return
    # if this is reached, it means no pattern matched any string
    # and we don't want this item
    raise DropItem("no pattern matched")
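If you also want to know which patterns matched, rather than stopping at the first hit, one option is to finish both loops before deciding. The following is only a sketch under stated assumptions: DropItem is again a hypothetical stand-in for scrapy.exceptions.DropItem so the code runs without Scrapy, and matched_names and process are illustrative helpers, not Scrapy API.

```python
import re


class DropItem(Exception):
    """Hypothetical stand-in for scrapy.exceptions.DropItem."""


# Demo patterns from the question.
PATTERNS = {
    'email': re.compile(r'liczba'),
    'whatever': re.compile(r'mit'),
    'blu': re.compile(r'houseLocked'),
}


def matched_names(item):
    """Return the sorted names of all patterns matching any string in item['code']."""
    found = set()
    for text in item['code']:
        for name, pattern in PATTERNS.items():
            if pattern.search(text):
                found.add(name)  # keep going: no return inside the loops
    return sorted(found)


def process(item):
    """Keep the item if at least one pattern matched, otherwise drop it."""
    if not matched_names(item):
        raise DropItem('no pattern matched')
    return item
```

In a real pipeline the body of process would live in process_item(self, item, spider), and the list from matched_names could be logged or stored on the item if the item declares a field for it.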
Granitosaurus