I'm currently working on writing a web scraper for a website called Mountain Project and have come across an issue while inserting items into a MySQL (MariaDB) database.
The basic flow of my crawler is this:
- Get response from link extractor (I'm subclassing CrawlSpider)
- Extract data from response and send extracted items down the item pipeline
- Items get picked up by the SqlPipeline and inserted into the database
An important note about step 2 is that the crawler is sending multiple items down the pipeline. The first item will be the main resource (either a route or an area), and then any items after that will be other important data about that resource. For areas, those items will be data on different weather conditions, for routes those items will be the different grades assigned to those routes.
My area and route tables look like this:
CREATE TABLE `area` (
`area_id` INT(10) UNSIGNED NOT NULL,
`parent_id` INT(10) UNSIGNED NULL DEFAULT NULL,
`name` VARCHAR(200) NOT NULL DEFAULT '',
`latitude` FLOAT(12) NULL DEFAULT -1,
`longitude` FLOAT(12) NULL DEFAULT -1,
`elevation` INT(11) NULL DEFAULT '0',
`link` VARCHAR(300) NOT NULL,
PRIMARY KEY (`area_id`) USING BTREE
)
CREATE TABLE `route` (
`route_id` INT(10) UNSIGNED NOT NULL,
`parent_id` INT(10) UNSIGNED NOT NULL,
`name` VARCHAR(200) NOT NULL DEFAULT '' COLLATE 'utf8_general_ci',
`link` VARCHAR(300) NOT NULL COLLATE 'utf8_general_ci',
`rating` FLOAT(12) NULL DEFAULT '0',
`types` VARCHAR(50) NULL DEFAULT NULL COLLATE 'utf8_general_ci',
`pitches` INT(3) NULL DEFAULT '0',
`height` INT(5) NULL DEFAULT '0',
`length` VARCHAR(5) NULL DEFAULT NULL COLLATE 'utf8_general_ci',
PRIMARY KEY (`route_id`) USING BTREE,
INDEX `fk_parent_id` (`parent_id`) USING BTREE,
CONSTRAINT `fk_parent_id` FOREIGN KEY (`parent_id`) REFERENCES `mountainproject`.`area` (`area_id`) ON UPDATE CASCADE ON DELETE CASCADE
)
And here's an example of one of my condition tables:
CREATE TABLE `temp_avg` (
`month` INT(2) UNSIGNED NOT NULL,
`area_id` INT(10) UNSIGNED NOT NULL,
`avg_high` INT(3) NOT NULL,
`avg_low` INT(3) NOT NULL,
PRIMARY KEY (`month`, `area_id`) USING BTREE,
INDEX `fk_area_id` (`area_id`) USING BTREE,
CONSTRAINT `fk_area_id` FOREIGN KEY (`area_id`) REFERENCES `mountainproject`.`area` (`area_id`) ON UPDATE CASCADE ON DELETE CASCADE
)
Here's where things start to get troublesome. If I run my crawler and just extract areas, everything works fine. The area is inserted into the database, and all the conditions data gets inserted without a problem. However, when I try to extract areas and routes, I get foreign key constraint failures when trying to insert routes because the area that the route belongs to (parent_id) doesn't exist. Currently to work around this I've been running my crawler twice. Once to extract area data, and once to extract route data. If I do that, everything goes smoothly.
My best guess as to why this doesn't work currently is that the areas that are being inserted haven't been committed and so when I attempt to add a route that belongs to an uncommitted area, it can't find the parent area. This theory quickly falls apart though because I'm able to insert condition data in the same run that I insert the area that the data belongs to.
My insertion code looks like this:
def insert_item(self, table_name, item):
encoded_vals = [self.sql_encode(val) for val in item.values()]
sql = "INSERT INTO %s (%s) VALUES (%s)" % (
table_name,
", ".join(item.keys()),
", ".join(encoded_vals)
)
logging.debug(sql)
self.cursor.execute(sql)
# EDIT: As suggested by @tadman I have moved to using the built in sql value
# encoding. I'm leaving this here because it doesn't affect the issue
def sql_encode(self, value):
"""Encode provided value and return a valid SQL value
Arguments:
value {Any} -- Value to encode
Returns:
str -- SQL encode value as a str
"""
encoded_val = None
is_empty = False
if isinstance(value, str):
is_empty = len(value) == 0
encoded_val = "NULL" if is_empty or value is None else value
if isinstance(encoded_val, str) and encoded_val is not "NULL":
encode_val = encoded_val.replace("\"", "\\\"")
encoded_val = "\"%s\"" % encoded_val
return str(encoded_val)
The rest of the project lives in a GitHub repo if any more code/context is needed