Im writing a scrapy spider to crawl youtube vids and capture, name, subsrciber count, link, etc. I copied this SQLalchemy code from a tutorial and got it working, but every time i run the crawler i get duplicated info in the DB.
How do i check if the scraped data is already in the DB and if so, dont enter into the DB....
Here is my pipeline.py code
from sqlalchemy.orm import sessionmaker
from models import Channels, db_connect, create_channel_table
# -*- coding: utf-8 -*-
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
class YtscraperPipeline(object):
"""YTscraper pipeline for storing scraped items in the database"""
def __init__(self):
#Initializes database connection and sessionmaker.
#Creates deals table.
engine = db_connect()
create_channel_table(engine)
self.Session = sessionmaker(bind=engine)
def process_item(self, item, spider):
"""Save youtube channel in the database.
This method is called for every item pipeline component.
"""
session = self.Session()
channel = Channels(**item)
try:
session.add(channel)
session.commit()
except:
session.rollback()
raise
finally:
session.close()
return item
Here is my models.py
from sqlalchemy import create_engine, Column, Integer, String, DateTime
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.engine.url import URL
import settings
DeclarativeBase = declarative_base()
def db_connect():
"""
Performs database connection using database settings from settings.py.
Returns sqlalchemy engine instance
"""
return create_engine(URL(**settings.DATABASE))
def create_channel_table(engine):
""""""
DeclarativeBase.metadata.create_all(engine)
class Channels(DeclarativeBase):
"""Sqlalchemy deals model"""
__tablename__ = "ytchannels"
id = Column(Integer, primary_key=True)
ctitle = Column('title', String)
clink = Column('link', String, nullable=True)
csubs = Column('subs', String, nullable=True)
date = Column('date', DateTime, nullable=True)
I would like to not have duplicates added to the DB. How can i do that?
this is what i get when i dump the table each run, basically adds the same info over and over.
id | title | link | subs | date
----+----------------------+----------------------------------------------------------+---------+----------------------------
1 | Ivan on Tech | https://www.youtube.com/user/LiljeqvistIvan | 195,249 | 2019-02-02 15:09:48.236281
2 | DataDash | https://www.youtube.com/channel/UCCatR7nWbYrkVXdxXb4cGXw | 315,691 | 2019-02-02 15:09:49.517085
3 | Tone Vays | https://www.youtube.com/channel/UCbiWJYRg8luWHnmNkJRZEnw | 82,588 | 2019-02-02 15:09:52.502221
4 | Crypt0 | https://www.youtube.com/user/obham001 | 119,046 | 2019-02-02 15:09:52.895278
5 | The Modern Investor | https://www.youtube.com/channel/UC-5HLi3buMzdxjdTdic3Aig | 122,228 | 2019-02-02 15:09:52.990033
6 | Decentralized TV | https://www.youtube.com/channel/UCueLJ4vLHTwMpYILmdBjRlg | 79,211 | 2019-02-02 15:09:53.108132
7 | Crypto Daily | https://www.youtube.com/channel/UC67AEEecqFEc92nVvcqKdhA | 121,341 | 2019-02-02 15:09:53.138157
8 | RoadtoRoota | https://www.youtube.com/user/RoadtoRoota | 54,954 | 2019-02-02 15:09:54.386956
9 | Altcoin Buzz | https://www.youtube.com/channel/UCGyqEtcGQQtXyUwvcy7Gmyg | 210,547 | 2019-02-02 15:09:54.412399
10 | TheChartGuys | https://www.youtube.com/channel/UCnqZ2hx679DqRi6khRUNw2g | 113,431 | 2019-02-02 15:09:55.36888
11 | Ivan on Tech | https://www.youtube.com/user/LiljeqvistIvan | 195,249 | 2019-02-02 15:09:55.563061
12 | Altcoin Daily | https://www.youtube.com/channel/UCbLhGKVY-bJPcawebgtNfbw | 62,543 | 2019-02-02 15:09:56.327525
13 | The Moon | https://www.youtube.com/channel/UCc4Rz_T9Sb1w5rqqo9pL1Og | 37,291 | 2019-02-02 15:09:56.376596
14 | Alessio Rastani | https://www.youtube.com/user/alessiorastani | 176,025 | 2019-02-02 15:09:56.439162
15 | CryptosRUs | https://www.youtube.com/channel/UCI7M65p3A-D3P4v5qW8POxQ | 51,387 | 2019-02-02 15:09:56.482699
16 | Crypto Zombie | https://www.youtube.com/channel/UCiUnrCUGCJTCC7KjuW493Ww | 46,715 | 2019-02-02 15:09:56.582438
17 | Crypto Love | https://www.youtube.com/channel/UCu7Sre5A1NMV8J3s2FhluCw | 93,999 | 2019-02-02 15:09:56.792019
18 | Crypto Kirby Trading | https://www.youtube.com/channel/UCOaew10hdmtfa0MinTjOBqg | 31,333 | 2019-02-02 15:09:58.092356
19 | sunny decree | https://www.youtube.com/user/d3cr33 | 80,294 | 2019-02-02 15:09:58.127674
20 | Crypto Jebb | https://www.youtube.com/channel/UCviqt5aaucA1jP3qFmorZLQ | 17,531 | 2019-02-02 15:09:58.396679
21 | Chico Crypto | https://www.youtube.com/channel/UCHop-jpf-huVT1IYw79ymPw | 29,144 | 2019-02-02 15:09:58.467988
22 | Ivan on Tech | https://www.youtube.com/user/LiljeqvistIvan | 195,249 | 2019-02-02 15:44:46.905164
23 | DataDash | https://www.youtube.com/channel/UCCatR7nWbYrkVXdxXb4cGXw | 315,688 | 2019-02-02 15:44:49.13279
24 | Crypto Daily | https://www.youtube.com/channel/UC67AEEecqFEc92nVvcqKdhA | 121,342 | 2019-02-02 15:44:50.450665
25 | The Modern Investor | https://www.youtube.com/channel/UC-5HLi3buMzdxjdTdic3Aig | 122,226 | 2019-02-02 15:44:50.513322
26 | Tone Vays | https://www.youtube.com/channel/UCbiWJYRg8luWHnmNkJRZEnw | 82,589 | 2019-02-02 15:44:50.546499
27 | Crypt0 | https://www.youtube.com/user/obham001 | 119,040 | 2019-02-02 15:44:50.642958
28 | Ivan on Tech | https://www.youtube.com/user/LiljeqvistIvan | 195,249 | 2019-02-02 15:44:50.951154
29 | Decentralized TV | https://www.youtube.com/channel/UCueLJ4vLHTwMpYILmdBjRlg | 79,211 | 2019-02-02 15:44:51.191991
30 | Altcoin Buzz | https://www.youtube.com/channel/UCGyqEtcGQQtXyUwvcy7Gmyg | 210,546 | 2019-02-02 15:44:51.266842
31 | Alessio Rastani | https://www.youtube.com/user/alessiorastani | 176,027 | 2019-02-02 15:44:51.420558
32 | The Moon | https://www.youtube.com/channel/UCc4Rz_T9Sb1w5rqqo9pL1Og | 37,294 | 2019-02-02 15:44:52.020989
33 | RoadtoRoota | https://www.youtube.com/user/RoadtoRoota | 54,954 | 2019-02-02 15:44:52.177793
34 | TheChartGuys | https://www.youtube.com/channel/UCnqZ2hx679DqRi6khRUNw2g | 113,437 | 2019-02-02 15:44:52.245701
35 | Altcoin Daily | https://www.youtube.com/channel/UCbLhGKVY-bJPcawebgtNfbw | 62,538 | 2019-02-02 15:44:52.864349
36 | Crypto Zombie | https://www.youtube.com/channel/UCiUnrCUGCJTCC7KjuW493Ww | 46,716 | 2019-02-02 15:44:53.042814
37 | CryptosRUs | https://www.youtube.com/channel/UCI7M65p3A-D3P4v5qW8POxQ | 51,388 | 2019-02-02 15:44:53.246394
38 | Crypto Kirby Trading | https://www.youtube.com/channel/UCOaew10hdmtfa0MinTjOBqg | 31,333 | 2019-02-02 15:44:53.54117
39 | sunny decree | https://www.youtube.com/user/d3cr33 | 80,294 | 2019-02-02 15:44:54.288063
40 | Crypto Love | https://www.youtube.com/channel/UCu7Sre5A1NMV8J3s2FhluCw | 93,998 | 2019-02-02 15:44:54.591665
41 | Crypto Jebb | https://www.youtube.com/channel/UCviqt5aaucA1jP3qFmorZLQ | 17,531 | 2019-02-02 15:44:54.769744
42 | Chico Crypto | https://www.youtube.com/channel/UCHop-jpf-huVT1IYw79ymPw | 29,148 | 2019-02-02 15:44:55.791358