
I'm using the Scrapy library to scrape a website where people sell cars.

I'm using Python, IBM Cloud Functions and Scrapy to achieve that. The idea is to scrape the site every day with IBM Cloud actions and add every vehicle to the vehicles table in a Postgres database. That part works fine.

The structure of the vehicles table is as follows:

[screenshot of the vehicles table structure; it includes, among others, the reference, reference_url and data columns]

In the first step, everything except the data column (those are the details of the car, which need to be added in the second step) is added to the vehicles table. That works fine.

The second step is to check every day, for every vehicle added to the vehicles table, whether it still exists on the site (it can be deleted or sold). In this step I add every processed vehicle to the daily_run_vehicle table. The structure of daily_run_vehicle is as follows:

[screenshot of the daily_run_vehicle table structure; it includes, among others, the reference, handled and retries columns]

If the vehicle still exists, I scrape the details, update the data column of the vehicles table and set the handled column to TRUE in the daily_run_vehicle table. If it's sold or deleted, I increment the retries column in the daily_run_vehicle table.
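
For context, the update half of that step looks roughly like the following. This is only a sketch: it assumes psycopg2, that db_connection can be imported from common.db like the other helpers, and that the data column stores the scraped details as JSON text; the function name mark_vehicle_result is mine.

import json
from common.db import db_connection  # assumed to be the same helper the other DB functions use


def mark_vehicle_result(reference, details=None):
    # details is the scraped data if the vehicle still exists, otherwise None
    conn = db_connection()
    cur = conn.cursor()
    try:
        if details is not None:
            # Vehicle is still on the site: store the details and mark it handled
            cur.execute("UPDATE vehicles SET data = %s WHERE reference = %s",
                        (json.dumps(details), reference))
            cur.execute("UPDATE daily_run_vehicle SET handled = TRUE WHERE reference = %s",
                        (reference,))
        else:
            # Vehicle is sold or deleted: count another attempt
            cur.execute("UPDATE daily_run_vehicle SET retries = retries + 1 WHERE reference = %s",
                        (reference,))
        conn.commit()
    finally:
        conn.close()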

The second step should run every day.

First I loop through every vehicle from vehicles for which there is no record in the daily_run_vehicle table that is either already handled (handled is TRUE) or has failed five or more times (handled is FALSE and retries >= 5). For every such vehicle I add a new record to the daily_run_vehicle table.

The action is prepare-get-vehicles and the code is as follows:

import json
import requests
from common.db import add_logs, get_vehicle_references
from common.db import capture_error
from common.common import APIHOST, NAMESPACE, USER_PASS


def execute_reference(reference, reference_url):
    action = "prepare-get-vehicle"
    url = APIHOST + "/api/v1/namespaces/" + NAMESPACE + "/actions/" + action
    response = requests.post(url,
                             data=json.dumps({"reference": reference, 'reference_url': reference_url}),
                             params={"blocking": "false"},
                             auth=(USER_PASS[0], USER_PASS[1]),
                             headers={"Content-Type": "application/json"})
    print(response.json())


def main(params):
    try:
        for reference in get_vehicle_references():
            execute_reference(reference[0], reference[1])

        return {"Success": "prepare-get-vehicles action executed successfully."}
    except Exception as e:
        capture_error(str(e))
        return {"Failure": "prepare-get-vehicles action NOT executed successfully."}

The get_vehicle_references function is as follows:

def get_vehicle_references():
    conn = db_connection()
    cur = conn.cursor()
    try:
        # Select every vehicle that has no daily_run_vehicle record which is either
        # handled or has already failed 5 or more times
        s = "SELECT reference, reference_url FROM vehicles v WHERE (NOT EXISTS (select reference from daily_run_vehicle WHERE (handled = %s or (handled = %s and retries >= %s)) AND reference = v.reference))"
        cur.execute(s, (True, False, 5))
        return cur.fetchall()
    except Exception as e:
        capture_error(str(e))
    finally:
        conn.close()

The prepare-get-vehicle action does nothing except add a new record to the daily_run_vehicle table, and it is as follows:

def main(params):
    try:
        insert_daily_run_vehicle(params.get("reference", None), params.get("reference_url", None))
        return {"Success.": "The DB filler (daily_run_vehicle) is successfully executed."}
    except Exception as e:
        capture_error(str(e))
        return {"Failure": "The DB filler (daily_run_vehicle) action NOT executed successfully."}

But the problem is that the vehicles table has more than 300k records, and it grows bigger every day. The for loop in the prepare-get-vehicles action therefore takes too much time to execute on IBM Cloud. The timeout is 600 s, but the for loop needs much more than that.

Any advice on how I can solve my problem, i.e. how I can loop through a table with more than 300k records and add a new row to the daily_run_vehicle table for every record?

Thanks in advance.

Boky
  • Are you sure your problem is the for loop and not the per-record subselect in your query? Those are usually really slow. The query can be optimized in several ways... E.g. `where reference not in (select ...)` or you could just join with `daily_run_vehicle` and then filter. Also you could improve `prepare-get-vehicle` to take more than one record and send them in batches. – swenzel Jan 05 '19 at 13:43
  • @swenzel Any idea what I can try? At the moment I'm not sure about anything. :-) – Boky Jan 05 '19 at 13:56
  • 1
  • I'd start with the query. Connect to your Postgres server and try your current version. If it takes forever, you know that this is your bottleneck. There really are several ways you can filter one table's data by using data from another table; basically everything is better than `WHERE ([NOT] EXISTS ...)` (a join-based rewrite is sketched after these comments). If the query is not the problem, try batching the data you send to `prepare-get-vehicles`. This way you don't have the REST API overhead for every record but only for every batch. You could even pseudo-parallelize this with asyncio/multithreading (no need for multiprocessing here). – swenzel Jan 05 '19 at 14:21
  • SQL is orderless; using LIMIT without ORDER BY is pretty much meaningless. – Raymond Nijland Jan 05 '19 at 16:06
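
As an illustration of the join-based rewrite mentioned in the comments, get_vehicle_references could look roughly like this (a sketch only, assuming the same db_connection helper; the anti-join keeps a vehicle only when no daily_run_vehicle row excludes it):

def get_vehicle_references():
    conn = db_connection()
    cur = conn.cursor()
    try:
        # Anti-join instead of a correlated NOT EXISTS sub-select per vehicle row
        s = """
            SELECT v.reference, v.reference_url
            FROM vehicles v
            LEFT JOIN daily_run_vehicle d
                   ON d.reference = v.reference
                  AND (d.handled = TRUE OR (d.handled = FALSE AND d.retries >= 5))
            WHERE d.reference IS NULL
        """
        cur.execute(s)
        return cur.fetchall()
    finally:
        conn.close()

Whether this (or an equivalent NOT IN) is actually faster depends on the indexes on daily_run_vehicle.reference, so it is worth running the query directly against the Postgres server first, as the comment suggests.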

2 Answers


If the id of the vehicle doesn't change, you can do this:

INSERT INTO vehicle (id, reference, ...etc...) VALUES (1, 'ref', ...etc...) ON CONFLICT DO NOTHING;

This inserts without looping over existing rows. You can even do an update on conflict: http://www.postgresqltutorial.com/postgresql-upsert/
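
For example, an upsert version of that statement, executed from Python, might look like this (a sketch only; the column list and the choice of reference as the conflict target are my assumptions, and they require a unique constraint on that column):

def upsert_vehicle(cur, reference, reference_url, data):
    # Insert the vehicle, or refresh its data if a row with this reference already exists
    cur.execute("""
        INSERT INTO vehicles (reference, reference_url, data)
        VALUES (%s, %s, %s)
        ON CONFLICT (reference)
        DO UPDATE SET data = EXCLUDED.data
    """, (reference, reference_url, data))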

Rémi Desgrange
  • Thanks for the answer, but that doesn't help me. The problem is in the for loop where I need to iterate over more than 300k records. It takes more than 600s, which is the limit of IBM Cloud Functions. – Boky Jan 05 '19 at 13:00

For processing a large table, you can read the table rows in batches and, for each batch, run a new action programmatically. The actions can run in parallel, each with its own 600 s limit.

For example, if the table has 300k rows, loop over the rows in a defined order in batches of, say, 100k each. Then you can invoke 3 actions programmatically in parallel, each action processing one batch of 100k.
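
A rough sketch of that dispatcher, reusing the invocation style from prepare-get-vehicles (the worker action name process-vehicle-batch and the batch size are placeholders, not something defined in the question):

import json
import requests
from common.db import db_connection
from common.common import APIHOST, NAMESPACE, USER_PASS

BATCH_SIZE = 100000  # tune so one batch comfortably fits in the 600 s limit


def main(params):
    conn = db_connection()
    cur = conn.cursor()
    try:
        cur.execute("SELECT count(*) FROM vehicles")
        total = cur.fetchone()[0]
    finally:
        conn.close()

    # One non-blocking invocation per batch; the batches then run in parallel,
    # each with its own 600 s budget
    action = "process-vehicle-batch"  # hypothetical worker action
    url = APIHOST + "/api/v1/namespaces/" + NAMESPACE + "/actions/" + action
    for offset in range(0, total, BATCH_SIZE):
        requests.post(url,
                      data=json.dumps({"offset": offset, "limit": BATCH_SIZE}),
                      params={"blocking": "false"},
                      auth=(USER_PASS[0], USER_PASS[1]),
                      headers={"Content-Type": "application/json"})
    return {"Success": "dispatched batches for " + str(total) + " vehicles"}

The worker action would then select its slice with ORDER BY reference LIMIT %s OFFSET %s so that, as Raymond Nijland's comment points out, the batches stay stable between invocations.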

csantanapr