In my Scrapy project I'm storing the scraped data in MongoDB using PyMongo. Because the pages are crawled page by page, duplicate records end up in the database. I want to skip records that have the same name at the time they are inserted. Please suggest the best solution.
Here is my code in "pipelines.py". Please guide me on how to remove duplicates in the "process_item" method. I found a few queries on the internet for removing duplicates from the database itself, but I want a solution in Python.
from pymongo import MongoClient
from scrapy.conf import settings


class MongoDBPipeline(object):

    def __init__(self):
        connection = MongoClient(
            settings['MONGODB_SERVER'],
            settings['MONGODB_PORT'])
        db = connection[settings['MONGODB_DB']]
        self.collection = db[settings['MONGODB_COLLECTION']]

    def process_item(self, item, spider):
        self.collection.insert(dict(item))
        return item
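One common approach is to replace the plain insert with an upsert keyed on the name, so a second item with the same name overwrites the first instead of creating a duplicate. In the pipeline that would mean changing the body of `process_item` to something like `self.collection.update_one({'name': dict(item)['name']}, {'$set': dict(item)}, upsert=True)` (this assumes every item has a `'name'` field). The sketch below demonstrates the same upsert-by-name logic with a plain dict standing in for the MongoDB collection, so the de-duplication behaviour is easy to follow without a running database:

```python
# Sketch of name-based de-duplication via upsert semantics.
# A plain dict stands in for the pymongo collection; in the real
# pipeline the equivalent call would be (assuming items have 'name'):
#   self.collection.update_one({'name': data['name']},
#                              {'$set': data}, upsert=True)

def upsert_by_name(store, item):
    """Store the item keyed by its 'name'. A later item with the
    same name replaces the earlier one, so duplicates never
    accumulate."""
    data = dict(item)
    store[data['name']] = data

store = {}
upsert_by_name(store, {'name': 'alpha', 'page': 1})
upsert_by_name(store, {'name': 'alpha', 'page': 2})  # same name: replaces
upsert_by_name(store, {'name': 'beta', 'page': 1})

print(len(store))            # 2 distinct names survive
print(store['alpha']['page'])  # 2: the later record won
```

If you would rather keep the first record and reject later ones, another option is a unique index on the field (`collection.create_index('name', unique=True)` once at startup) and catching `pymongo.errors.DuplicateKeyError` around the insert in `process_item`.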