
I'm currently writing a web scraper for a website called Mountain Project and have come across an issue while inserting items into a MySQL (MariaDB) database.

The basic flow of my crawler is this:

  1. Get response from link extractor (I'm subclassing CrawlSpider)
  2. Extract data from response and send extracted items down the item pipeline
  3. Items get picked up by the SqlPipeline and inserted into the database

An important note about step 2 is that the crawler sends multiple items down the pipeline. The first item will be the main resource (either a route or an area), and any items after that will be other important data about that resource. For areas, those items are data on different weather conditions; for routes, they are the different grades assigned to the route.
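
To make that concrete, the area callback yields something along these lines (the item classes, field names, and extract_* helpers here are illustrative placeholders, not the exact code from the project):

# Illustrative sketch of step 2 for an area page. AreaItem, TempAvgItem and the
# extract_* helpers are placeholders; the real names live in the GitHub repo.
def parse_area(self, response):
    area_id = extract_area_id(response)

    # First item down the pipeline: the main resource itself
    yield AreaItem(
        area_id=area_id,
        parent_id=extract_parent_area_id(response),
        name=response.css("h1::text").get(),
        link=response.url,
    )

    # Follow-up items: monthly weather conditions tied to that area
    for month, high, low in extract_avg_temps(response):
        yield TempAvgItem(month=month, area_id=area_id,
                          avg_high=high, avg_low=low)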

My area and route tables look like this:

CREATE TABLE `area` (
    `area_id` INT(10) UNSIGNED NOT NULL,
    `parent_id` INT(10) UNSIGNED NULL DEFAULT NULL,
    `name` VARCHAR(200) NOT NULL DEFAULT '',
    `latitude` FLOAT(12) NULL DEFAULT -1,
    `longitude` FLOAT(12) NULL DEFAULT -1,
    `elevation` INT(11) NULL DEFAULT '0',
    `link` VARCHAR(300) NOT NULL,
    PRIMARY KEY (`area_id`) USING BTREE
)

CREATE TABLE `route` (
    `route_id` INT(10) UNSIGNED NOT NULL,
    `parent_id` INT(10) UNSIGNED NOT NULL,
    `name` VARCHAR(200) NOT NULL DEFAULT '' COLLATE 'utf8_general_ci',
    `link` VARCHAR(300) NOT NULL COLLATE 'utf8_general_ci',
    `rating` FLOAT(12) NULL DEFAULT '0',
    `types` VARCHAR(50) NULL DEFAULT NULL COLLATE 'utf8_general_ci',
    `pitches` INT(3) NULL DEFAULT '0',
    `height` INT(5) NULL DEFAULT '0',
    `length` VARCHAR(5) NULL DEFAULT NULL COLLATE 'utf8_general_ci',
    PRIMARY KEY (`route_id`) USING BTREE,
    INDEX `fk_parent_id` (`parent_id`) USING BTREE,
    CONSTRAINT `fk_parent_id` FOREIGN KEY (`parent_id`) REFERENCES `mountainproject`.`area` (`area_id`) ON UPDATE CASCADE ON DELETE CASCADE
)

And here's an example of one of my condition tables:

CREATE TABLE `temp_avg` (
    `month` INT(2) UNSIGNED NOT NULL,
    `area_id` INT(10) UNSIGNED NOT NULL,
    `avg_high` INT(3) NOT NULL,
    `avg_low` INT(3) NOT NULL,
    PRIMARY KEY (`month`, `area_id`) USING BTREE,
    INDEX `fk_area_id` (`area_id`) USING BTREE,
    CONSTRAINT `fk_area_id` FOREIGN KEY (`area_id`) REFERENCES `mountainproject`.`area` (`area_id`) ON UPDATE CASCADE ON DELETE CASCADE
)

Here's where things start to get troublesome. If I run my crawler and just extract areas, everything works fine: the area is inserted into the database, and all the condition data gets inserted without a problem. However, when I try to extract areas and routes, I get foreign key constraint failures when trying to insert routes because the area the route belongs to (parent_id) doesn't exist yet. To work around this, I've currently been running my crawler twice: once to extract area data, and once to extract route data. If I do that, everything goes smoothly.

My best guess as to why this doesn't work currently is that the areas that are being inserted haven't been committed and so when I attempt to add a route that belongs to an uncommitted area, it can't find the parent area. This theory quickly falls apart though because I'm able to insert condition data in the same run that I insert the area that the data belongs to.
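
For what it's worth, that part of the theory is easy to test in isolation: on a single connection, InnoDB's foreign key checks do see rows inserted earlier in the same (still uncommitted) transaction, so a back-to-back area/route insert on one cursor should succeed before any commit. A minimal standalone check, assuming the pymysql driver and placeholder credentials:

# Standalone sanity check (assumes pymysql and placeholder credentials).
# Within one connection/transaction, InnoDB sees its own uncommitted rows,
# so the FK check on route.parent_id should pass before any commit.
import pymysql

conn = pymysql.connect(host="localhost", user="user", password="secret",
                       database="mountainproject")
with conn.cursor() as cursor:
    cursor.execute(
        "INSERT INTO area (area_id, name, link) VALUES (%s, %s, %s)",
        (999999, "Test Area", "https://example.com/test-area"))
    cursor.execute(
        "INSERT INTO route (route_id, parent_id, name, link) VALUES (%s, %s, %s, %s)",
        (999999, 999999, "Test Route", "https://example.com/test-route"))
conn.rollback()  # clean up; neither row is kept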

My insertion code looks like this:

def insert_item(self, table_name, item):
    encoded_vals = [self.sql_encode(val) for val in item.values()]
    sql = "INSERT INTO %s (%s) VALUES (%s)" % (
        table_name,
        ", ".join(item.keys()),
        ", ".join(encoded_vals)
    )

    logging.debug(sql)
    self.cursor.execute(sql)

# EDIT: As suggested by @tadman, I have moved to using the built-in SQL value
# encoding. I'm leaving this here because it doesn't affect the issue.
def sql_encode(self, value):
    """Encode provided value and return a valid SQL value

    Arguments:
        value {Any} -- Value to encode

    Returns:
        str -- SQL encode value as a str
    """
    encoded_val = None
    is_empty = False

    if isinstance(value, str):
        is_empty = len(value) == 0

    encoded_val = "NULL" if is_empty or value is None else value

    if isinstance(encoded_val, str) and encoded_val != "NULL":
        encoded_val = encoded_val.replace("\"", "\\\"")
        encoded_val = "\"%s\"" % encoded_val

    return str(encoded_val)
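
For reference, the parameterized version mentioned in the edit looks roughly like this (only values can be bound; the table and column names still have to come from the trusted, hard-coded item definitions rather than from scraped input):

# Rough sketch of insert_item using parameter binding instead of string
# formatting. Table and column names cannot be bound as parameters, so they
# must come from the hard-coded item/table definitions, not from scraped data.
def insert_item(self, table_name, item):
    columns = ", ".join(item.keys())
    placeholders = ", ".join(["%s"] * len(item))
    sql = "INSERT INTO %s (%s) VALUES (%s)" % (table_name, columns, placeholders)

    logging.debug(sql)
    self.cursor.execute(sql, tuple(item.values()))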

The rest of the project lives in a GitHub repo if any more code/context is needed.

  • **WARNING**: It's very important you **DO NOT** use `%` to format SQL strings. You *must* [properly escape values](https://bobby-tables.com). Your home-rolled escaping method here is inadequate. Let Python do the work for you, it already supports binding in `execute` if you pass a second argument as a tuple. – tadman Mar 30 '20 at 16:56
  • It's worth asking at this junction why you're not using some kind of [ORM](https://www.fullstackpython.com/object-relational-mappers-orms.html) to handle this for you because you're going to get a whole ton of things wrong if you're not deeply familiar with how SQL works. – tadman Mar 30 '20 at 16:58
  • Thanks for pointing that out, I’ve opened an issue and will get that changed. – Jacob Mar 30 '20 at 20:12
  • As for ORMs I'm not familiar with them but I'll look into it – Jacob Mar 30 '20 at 22:27

0 Answers