
I am trying to loop over an ndarray and record its index and each value in PostgreSQL. Here is my code:

    for idx, val in enumerate(data):
        cur.execute("INSERT INTO public.spams(review_id, label, confidence_level, aoc, created_at) VALUES (%s, %s, %s, %s, %s)", (idx+1, spamlabel, 0, 0, dt.now()))

The ndarray has 762k elements, and it took more than 8 hours to insert those values. Is there a more efficient way to do this?

Mert Koç
  • That doesn't have anything to do with numpy, only with the strategy you use with the database library. Which library do you use here? Almost any up-to-date library should support batched INSERTs, which is the way to go here. – Ancoron Mar 24 '19 at 11:27
  • I am using psycopg2 for PostgreSQL. How can I do batched INSERTs with it for my ndarray? – Mert Koç Mar 24 '19 at 11:39

1 Answer


Use psycopg2's execute_values helper method, and also provide constants in the template to limit the data that has to be transferred, e.g.:

    from psycopg2 import extras

    extras.execute_values(
        cur,
        "INSERT INTO public.spams(review_id, label, confidence_level, aoc, created_at) VALUES %s",
        enumerate(data),
        template="(%s + 1, %s, 0, 0, CURRENT_TIMESTAMP)")

You can also experiment with the page_size parameter for further throughput tuning.
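For instance, a variation of the call above with an explicit page size (the value of 1000 here is just an arbitrary starting point to experiment with; the default is 100):

    from psycopg2 import extras

    # page_size controls how many rows are packed into each generated
    # INSERT ... VALUES (...), (...), ... statement before it is sent.
    extras.execute_values(
        cur,
        "INSERT INTO public.spams(review_id, label, confidence_level, aoc, created_at) VALUES %s",
        enumerate(data),
        template="(%s + 1, %s, 0, 0, CURRENT_TIMESTAMP)",
        page_size=1000)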

Ancoron
  • I have tried what you suggested, but extras.execute_values accepts only one %s placeholder in the query. How can I add %s+1 as review_id and %s as the value from data? – Mert Koç Mar 24 '19 at 23:01
  • Sorry, I was using `execute_batch` first. Updated for the `VALUES` template. – Ancoron Mar 25 '19 at 05:44
  • Thanks for the answer; I think this will work and I will try it soon, but I have another problem here. During the execution, if the server closes the connection or any error occurs, the inserts won't be committed and the data will be lost. Do you have any suggestions on how I can commit the records part by part (thousands at a time)? – Mert Koç Mar 25 '19 at 15:09
  • Sorry for the late reply. In that case, you have to split the data into chunks beforehand and iterate over them. You'd also have to make sure that your Python app can "remember" where it left off (which chunk has been committed already). – Ancoron Mar 28 '19 at 21:59
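A minimal sketch of the chunked-commit idea from the last comment, assuming conn is the open psycopg2 connection behind cur; the chunk size and the checkpoint file name are arbitrary choices, and depending on the ndarray's dtype you may need to cast each value to a plain Python type before inserting:

    import os
    from psycopg2 import extras

    CHUNK_SIZE = 10000                  # arbitrary; tune to your workload
    CHECKPOINT = "insert_progress.txt"  # remembers how many rows have been committed

    # Resume from the last committed position, if a checkpoint exists.
    start = 0
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT) as f:
            start = int(f.read().strip() or 0)

    for chunk_start in range(start, len(data), CHUNK_SIZE):
        chunk = data[chunk_start:chunk_start + CHUNK_SIZE]
        extras.execute_values(
            cur,
            "INSERT INTO public.spams(review_id, label, confidence_level, aoc, created_at) VALUES %s",
            ((chunk_start + i, val) for i, val in enumerate(chunk)),
            template="(%s + 1, %s, 0, 0, CURRENT_TIMESTAMP)")
        conn.commit()  # commit each chunk so a failure loses at most the current chunk
        with open(CHECKPOINT, "w") as f:
            f.write(str(chunk_start + len(chunk)))

If the script crashes mid-run, re-running it picks up at the recorded position, so rows that were already committed are not inserted twice.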