
I'm trying to build a small app for a university project with Scrapy. The spider is scraping the items, but my pipeline is not inserting the data into the MySQL database. To test whether the pipeline or the pymysql implementation is at fault, I wrote a test script:

Code Start

#!/usr/bin/python3

import pymysql

str1 = "hey"
str2 = "there"
str3 = "little"
str4 = "script"

db = pymysql.connect("localhost","root","**********","stromtarife" )

cursor = db.cursor()

cursor.execute("SELECT * FROM vattenfall")
cursor.execute("INSERT INTO vattenfall (tarif, sofortbonus, treuebonus, jahrespreis) VALUES (%s, %s, %s, %s)", (str1, str2, str3, str4))
cursor.execute("SELECT * FROM vattenfall")
data = cursor.fetchone()
print(data)
db.commit()
cursor.close()

db.close()

Code End

After I run this script my database has a new record, so it's not my pymysql.connect() call that is broken.

I'll provide my scrapy code:

vattenfall_form.py

# -*- coding: utf-8 -*-
import scrapy
from scrapy.crawler import CrawlerProcess
from stromtarife.items import StromtarifeItem

from scrapy.http import FormRequest

class VattenfallEasy24KemptenV1500Spider(scrapy.Spider):
    name = 'vattenfall-easy24-v1500-p87435'

    def start_requests(self):
        return [
            FormRequest(
                "https://www.vattenfall.de/de/stromtarife.htm",
                formdata={"place": "87435", "zipCode": "87435", "cityName": "Kempten",
                          "electricity_consumptionprivate": "1500", "street": "", "hno": ""},
                callback=self.parse
            ),
        ]

    def parse(self, response):
        item = StromtarifeItem()
        item['jahrespreis'] = response.xpath('/html/body/main/div[1]/div[2]/div/div[3]/div[2]/div/div[2]/form[1]/div/div[2]/table/tbody/tr[3]/td[2]/text()').extract_first()
        item['treuebonus'] = response.xpath('/html/body/main/div[1]/div[2]/div/div[3]/div[2]/div/div[2]/form[1]/div/div[2]/table/tbody/tr[2]/td/strong/text()').extract_first()
        item['sofortbonus'] = response.xpath('/html/body/main/div[1]/div[2]/div/div[3]/div[2]/div/div[2]/form[1]/div/div[2]/table/tbody/tr[1]/td/strong/text()').extract_first()
        item['tarif'] = response.xpath('/html/body/main/div[1]/div[2]/div/div[3]/div[2]/div/div[1]/h2/span/text()').extract_first()
        yield item



class VattenfallEasy24KemptenV2500Spider(scrapy.Spider):
    name = 'vattenfall-easy24-v2500-p87435'

    def start_requests(self):
        return [
            FormRequest(
                "https://www.vattenfall.de/de/stromtarife.htm",
                formdata={"place": "87435", "zipCode": "87435", "cityName": "Kempten",
                          "electricity_consumptionprivate": "2500", "street": "", "hno": ""},
                callback=self.parse
            ),
        ]

    def parse(self, response):
        item = StromtarifeItem()
        item['jahrespreis'] = response.xpath('/html/body/main/div[1]/div[2]/div/div[3]/div[2]/div/div[2]/form[1]/div/div[2]/table/tbody/tr[3]/td[2]/text()').extract_first()
        item['treuebonus'] = response.xpath('/html/body/main/div[1]/div[2]/div/div[3]/div[2]/div/div[2]/form[1]/div/div[2]/table/tbody/tr[2]/td/strong/text()').extract_first()
        item['sofortbonus'] = response.xpath('/html/body/main/div[1]/div[2]/div/div[3]/div[2]/div/div[2]/form[1]/div/div[2]/table/tbody/tr[1]/td/strong/text()').extract_first()
        item['tarif'] = response.xpath('/html/body/main/div[1]/div[2]/div/div[3]/div[2]/div/div[1]/h2/span/text()').extract_first()
        yield item



process = CrawlerProcess()
process.crawl(VattenfallEasy24KemptenV1500Spider)
process.crawl(VattenfallEasy24KemptenV2500Spider)
process.start()

pipelines.py

import pymysql
from stromtarife.items import StromtarifeItem


class StromtarifePipeline(object):
    def __init__(self):
        self.connection = pymysql.connect("localhost","root","**********","stromtarife")
        self.cursor = self.connection.cursor()


    def process_item(self, item, spider):
        self.cursor.execute("INSERT INTO vattenfall (tarif, sofortbonus, treuebonus, jahrespreis) VALUES (%s, %s, %s, %s)", (item['tarif'], item['sofortbonus'], item['treuebonus'], item['jahrespreis']))
        self.connection.commit()
        self.cursor.close()
        self.connection.close()

settings.py (I changed only this line)

ITEM_PIPELINES = {
   'stromtarife.pipelines.StromtarifePipeline': 300,
}

So what is wrong with my code? I couldn't figure it out and would be really happy if someone spots something I'm missing. Thanks in advance!

3 Answers

You should not close your pymysql connection every time you process an item.

You should write a close_spider method in your pipeline like this, so the connection is closed just once, at the end of the crawl:

    def close_spider(self, spider):
        self.cursor.close()
        self.connection.close()

Moreover, you need to return the item at the end of process_item.

Your file pipelines.py should look like this:

import pymysql
from stromtarife.items import StromtarifeItem


class StromtarifePipeline(object):
    def __init__(self):
        self.connection = pymysql.connect("localhost","root","**********","stromtarife")
        self.cursor = self.connection.cursor()


    def process_item(self, item, spider):
        self.cursor.execute("INSERT INTO vattenfall (tarif, sofortbonus, treuebonus, jahrespreis) VALUES (%s, %s, %s, %s)", (item['tarif'], item['sofortbonus'], item['treuebonus'], item['jahrespreis']))
        self.connection.commit()
        return item

    def close_spider(self, spider):
        self.cursor.close()
        self.connection.close()
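A variant of the same pipeline, as a sketch: opening the connection in open_spider rather than __init__ is a common Scrapy pattern, so the connection's lifetime matches the crawl exactly. Credentials and table name are taken from the question; adjust them to your setup.

```python
class StromtarifePipeline:
    def open_spider(self, spider):
        import pymysql  # imported here so the module loads even without pymysql installed
        self.connection = pymysql.connect(
            host="localhost", user="root",
            password="**********", db="stromtarife",
        )
        self.cursor = self.connection.cursor()

    def process_item(self, item, spider):
        # Parameterized query: pymysql escapes the values for us
        self.cursor.execute(
            "INSERT INTO vattenfall (tarif, sofortbonus, treuebonus, jahrespreis)"
            " VALUES (%s, %s, %s, %s)",
            (item['tarif'], item['sofortbonus'], item['treuebonus'], item['jahrespreis']),
        )
        self.connection.commit()
        return item  # hand the item on to any later pipeline

    def close_spider(self, spider):
        self.cursor.close()
        self.connection.close()
```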

UPDATE:

I tried your code; the problem is in the pipeline. There are two problems:

  • You try to insert the euro symbol (€), and I think MySQL does not like it.
  • Your query string is not well built.

I managed to get things done by writing the pipeline like this:

def process_item(self, item, spider):
    query = """INSERT INTO vattenfall (tarif, sofortbonus, treuebonus, jahrespreis) VALUES (%s, %s, %s, %s)""" % ("1", "2", "3", "4")
    self.cursor.execute(query)
    self.connection.commit()
    return item

I think you should remove the € symbol from the prices you try to insert.
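One way to do that, as a sketch: clean_price below is a hypothetical helper, and the German number format ("1.234,56 €") is an assumption about what the site returns.

```python
def clean_price(raw):
    """Strip the € sign and convert a German-format price string to a float."""
    if raw is None:
        return None
    # "1.234,56 €" -> "1234,56" -> "1234.56" -> 1234.56
    cleaned = raw.replace("\u20ac", "").strip()
    cleaned = cleaned.replace(".", "").replace(",", ".")
    return float(cleaned)
```

Each price field could then be passed through clean_price before the INSERT.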

Hope this helps, let me know.

Adrien Blanquer
  • Thank you for your answer. I changed process_item() and added close_spider(), but I still don't get anything into my database. If I get the results I could make the next step and follow rrschmidt's advice. I really cannot figure out what's wrong with my code. – tolgaIsThere Apr 27 '17 at 17:04
  • I replaced, in process_item()'s cursor.execute() call, this part: ...%s,%s)", (item['tarif'], item['sofortbonus'], item['treuebonus'], item['jahrespreis'])) with plain strings: ..%s, %s)", ("hey", "how", "are", "you")) and it's still not working. – tolgaIsThere Apr 27 '17 at 17:28
  • I updated the previous answer, let me know if this works for you – Adrien Blanquer Apr 28 '17 at 08:07
  • is closing the `cursor` only necessary after using `noCursorTimeout()`???? – oldboy Jun 30 '18 at 23:30

There is another problem with your scraper besides the fact that your SQL Pipeline closes the SQL connection after writing the first item (as Adrien pointed out).

The other problem is: your scraper scrapes only one single item per results page (and also visits only one results page). I checked Vattenfall's site and there are usually multiple results displayed, and I guess you want to scrape them all.

That means you'll also have to iterate over the results on the page and create multiple items as you go. The Scrapy tutorial gives a good explanation of how to do this: https://doc.scrapy.org/en/latest/intro/tutorial.html#extracting-quotes-and-authors

Done Data Solutions

First of all, in Code Start, print(data) must come after db.commit(); otherwise the row you just inserted into your database will not show up in the print output.

Lastly, judging by the names of your columns, it's probably an encoding issue if the idea above doesn't work.

oldboy