
I have a web crawler that should save/insert the HTML page content into a PostgreSQL database along with some other metadata fields.

When inserting the HTML content field using mogrify, I get the error message "list index out of range". If I use a static dummy text for the HTML content, e.g. "Hello World ö ü ä ß" (I am dealing with a German character set), the insert works fine.

This is my function:

import psycopg2


def batch_insert(df, table, no_cols, conn=None, upsert_qry=""):
    """
    Using cursor.mogrify() to build the bulk insert query,
    then cursor.execute() to execute the query
    """
    if conn is None:
        conn = get_connection()

    col_str = "(" + (no_cols-1)*"%s," + "%s)"
    # Create a list of tuples from the dataframe values
    tuples = [tuple(x) for x in df.to_numpy()]
    # Comma-separated dataframe columns
    cols = ','.join(list(df.columns))
    cursor = conn.cursor()

    values = [cursor.mogrify(col_str, tup).decode('utf8') for tup in tuples]

    # SQL query to execute
    query = "INSERT INTO %s(%s) VALUES " % (table, cols) + ",".join(values) + upsert_qry
    # " ON CONFLICT (hk_portal_pid) DO UPDATE SET crawled_last=" + str(datetime.now())

    try:
        cursor.execute(query, tuples)
        conn.commit()
    except (Exception, psycopg2.DatabaseError) as error:
        print("Error: %s" % error)
        conn.rollback()
        cursor.close()
        return 1
    print("execute_mogrify() done")
    cursor.close()
    conn.close()
The Postgres column is declared as text, as suggested in this post. The HTML content in my case has an average length of 40,000 characters, so there shouldn't be any storage issues.

I also checked whether the generated tuples in the function get split for the HTML content field because of escape characters etc. But this doesn't seem to be the case:

>>> [len(a) for a in tuples]
[9, 9, 9, 9]  # 4 rows / 9 columns

Data type for the 8th column (page_content) is string:

>>> type(tuples[0][7])
<class 'str'>

Target table for batch insert:

CREATE TABLE public.as_portal_parent_content (
    hk_page varchar(100) NULL,
    hk_offer varchar(100) NULL,
    portal_id varchar(3) NULL,
    pid varchar(10) NULL,
    page_category varchar(30) NULL,
    page_link bpchar(300) NULL,
    status varchar(3) NULL,
    page_content text NULL,
    last_crawled timestamp NULL
);

Any help much appreciated!

Edit:

Traceback:

Traceback (most recent call last):
  File "C:/Users/PycharmProjects/pythonProject/project/playground.py", line 5, in <module>
    test.save_to_db(table='as_portal_parent_content')
  File "C:/Users/PycharmProjects/pythonProject/project/util/crawl.py", line 163, in save_to_db
    batch_insert(self.df, table)
  File "C:/Users/PycharmProjects/pythonProject/project/util/db.py", line 60, in batch_insert
    cursor.execute(query, tuples)
IndexError: list index out of range

Process finished with exit code 1
Maeaex1
1 Answer


You shouldn't pass `tuples` in `cursor.execute(query, tuples)`.
When you use `mogrify`, you are already generating the after-VALUES part of the SQL query, with the values inlined as literals. So there's no need to pass query parameters (`tuples` in your case) to `cursor.execute()` again. If you do, psycopg2 scans the finished query string for placeholders, and real HTML content is likely to contain literal `%s` sequences; that is a plausible reason the dummy text happened to work while the real pages did not.
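
In `batch_insert`, that means executing the already-built query string with no second argument:

    cursor.execute(query)  # the values are already inlined in the query string
    conn.commit()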

Shiva
    Thanks! It worked. But strange that it worked with the `tuples` argument for a dummy content field. – Maeaex1 Jul 29 '21 at 08:40
  • Because you didn't use `mogrify` with dummy data? – Shiva Jul 29 '21 at 08:51
  • I did. The only difference was that the passed argument `df` did not have the real HTML content but dummy content as a value instead. The `mogrify` batch insert worked with dummy data for the page_content field. However, just happy that it does work now =) – Maeaex1 Jul 29 '21 at 08:54
  • I had the same phenomenon. I noticed that the code works on smaller datasets, and fails on larger ones. I don't know the reason why. – andexte Aug 22 '23 at 19:28
  • @andexte Does it fail with a different error when using large datasets? – Shiva Aug 23 '23 at 04:23
  • The error I got is "list index out of range". So this is the same. @Shiva, your solution works for me as well! – andexte Aug 23 '23 at 11:25
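
For the larger datasets mentioned in the last comments, a common alternative is to let psycopg2 build the VALUES list itself via `psycopg2.extras.execute_values` instead of assembling it with `mogrify`. A minimal sketch, assuming the same DataFrame and connection setup as in the question (the function name `batch_insert_ev` is illustrative):

    from psycopg2.extras import execute_values

    def batch_insert_ev(df, table, conn):
        # Build the row tuples exactly as in the original function
        tuples = [tuple(x) for x in df.to_numpy()]
        cols = ','.join(df.columns)
        # table and cols are interpolated directly, as in the original code;
        # the single %s is expanded into the full VALUES list by execute_values
        query = "INSERT INTO " + table + " (" + cols + ") VALUES %s"
        with conn.cursor() as cursor:
            execute_values(cursor, query, tuples)
        conn.commit()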