In my Scrapy project I'm storing the scraped data in MongoDB using PyMongo. Crawling the site page by page produces duplicate records, and I want to drop records that share the same name at the time they are inserted into the database. Here is my code in "pipelines.py". How can I remove duplicates in the "process_item" method? I found a few queries online for removing duplicates from the database, but I want a solution in Python.

from pymongo import MongoClient
from scrapy.conf import settings
class MongoDBPipeline(object):

    def __init__(self):
        connection = MongoClient(
            settings['MONGODB_SERVER'],
            settings['MONGODB_PORT'])
        db = connection[settings['MONGODB_DB']]
        self.collection = db[settings['MONGODB_COLLECTION']]

    def process_item(self, item, spider):
        self.collection.insert(dict(item))
        return item
Krish Allamraju

1 Answer


It depends somewhat on what's in the item, but I would use an update with upsert, like this:

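def process_item(self, item, spider):
    # Pseudo example: 'website' stands in for whatever field identifies a duplicate.
    website = item.get('website')
    if website:
        # update_one takes a filter document and an update-operator document, e.g.
        # self.collection.update_one(
        #     {"website": "abc"},
        #     {"$set": {"div foo": "sometext"}},
        #     upsert=True
        #     )

        # Insert the item if no document matches; otherwise overwrite its fields.
        self.collection.update_one(
            {'website': website},
            {'$set': dict(item)},
            upsert=True,
        )
    return item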

You could also play around with the filter. Basically, you wouldn't even have to remove dupes: applied properly, an upsert works like an if-else condition. If the document doesn't exist, it is created; otherwise the given keys are updated with the given values, like assigning into a dictionary. In the worst case it rewrites a document with the same values. So it should be faster than inserting, then querying for and deleting the duplicates you find.
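
As a rough illustration of that insert-or-update behaviour, here is a minimal sketch of a helper that a pipeline could call from process_item. The helper name upsert_item and the default key "name" (taken from the question) are assumptions for illustration; the only PyMongo call used is update_one with upsert=True.

```python
def upsert_item(collection, item, key="name"):
    """Insert `item` if no document shares its `key` value; otherwise update it.

    `collection` is assumed to be a PyMongo collection (anything exposing
    update_one with the same signature works). `upsert_item` is a
    hypothetical helper name, not part of PyMongo or Scrapy.
    """
    doc = dict(item)
    value = doc.get(key)
    if value is None:
        # Nothing to match on, so skip the write instead of guessing.
        return None
    # The filter decides what counts as "the same record";
    # $set overwrites only the listed fields when a match is found.
    return collection.update_one({key: value}, {"$set": doc}, upsert=True)
```

Inside the pipeline, process_item would then reduce to calling upsert_item(self.collection, item) and returning item. With a real collection the return value is an UpdateResult, whose upserted_id is None when an existing document was matched rather than a new one inserted.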

See the PyMongo docs for update_one.

There's no literal if-else in MongoDB, and @tanaydin's advice about automatically dropping dupes also works from Python. It could be better than my advice, depending on what you really need.

If you really want to remove documents matching some criteria, there are delete_one and delete_many in PyMongo.
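
For duplicates that are already stored, one possible cleanup is sketched below: scan the collection once, remember which values of the key have been seen, and pass the _id of every later repeat to delete_many. The function name remove_duplicates_by_key and the default key "name" are assumptions for illustration, and the sketch assumes the key values are hashable (for example strings) and that the set of seen values fits in memory.

```python
def remove_duplicates_by_key(collection, key="name"):
    """Delete documents whose `key` value was already seen, keeping the first.

    `collection` is assumed to be a PyMongo collection; the function name is
    hypothetical. Returns the number of documents deleted.
    """
    seen = set()
    duplicate_ids = []
    # Project only the key field (plus _id, which is returned by default).
    for doc in collection.find({}, {key: 1}):
        if key not in doc:
            continue  # documents without the key are left alone
        value = doc[key]
        if value in seen:
            duplicate_ids.append(doc["_id"])
        else:
            seen.add(value)
    if not duplicate_ids:
        return 0
    result = collection.delete_many({"_id": {"$in": duplicate_ids}})
    return result.deleted_count
```

This is a one-off repair rather than something to run per item; for new items, the upsert approach avoids creating the duplicates in the first place.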

See the PyMongo docs for delete_one and delete_many.

Tom Wojcik