
Our team uses software that is heavily reliant on dumping NumPy data into files, which slows our code quite a lot. If we could store our NumPy arrays directly in PostgreSQL we would get a major performance boost.

Other performant methods of storing NumPy arrays in any database or searchable database-like structure are welcome, but PostgreSQL would be preferred.

My question is very similar to one asked previously. However, I am looking for a more robust and performant answer and I wish to store any arbitrary NumPy array.

Daniel Marchand
  • See if this extension is of any use to you: https://pgxn.org/dist/pandapost/ – Raj Verma Feb 21 '20 at 06:00
  • *any arbitrary NumPy array* ... relational databases, even free, open-source, enterprise-level ones like Postgres, should be planned and designed projects, with all schemas, tables, fields, users, and other components ideally known and prepared for in advance. Dumping *any* data on the fly is not advisable and may end up producing a messy system. Ask any DBA! Maybe consider the [.npy binary format](https://docs.scipy.org/doc/numpy/reference/generated/numpy.save.html), [pickle](https://docs.python.org/3/library/pickle.html), or other non-text formats? – Parfait Feb 21 '20 at 17:08
  • Could you elaborate on what makes the system messy? That might make a good answer to the post. – Daniel Marchand Feb 22 '20 at 09:12

2 Answers


Not sure if this is what you are after, but assuming you have read/write access to an existing postgres DB:

import numpy as np
import psycopg2 as psy
import pickle

db_connect_kwargs = {
    'dbname': '<YOUR_DBNAME>',
    'user': '<YOUR_USRNAME>',
    'password': '<YOUR_PWD>',
    'host': '<HOST>',
    'port': '<PORT>'
}

connection = psy.connect(**db_connect_kwargs)
connection.set_session(autocommit=True)
cursor = connection.cursor()

cursor.execute(
    """
    DROP TABLE IF EXISTS numpy_arrays;
    CREATE TABLE numpy_arrays (
        uuid VARCHAR PRIMARY KEY,
        np_array_bytes BYTEA
    )
    """
)

The gist of this approach is to store any NumPy array (of arbitrary shape and data type) as a row in the numpy_arrays table, where uuid is a unique identifier used to retrieve the array later. The actual array is saved in the np_array_bytes column as bytes.

Inserting into the database:

some_array = np.random.rand(1500, 550)
some_array_uuid = 'some_array'

cursor.execute(
    """
    INSERT INTO numpy_arrays(uuid, np_array_bytes)
    VALUES (%s, %s)
    """,
    (some_array_uuid, pickle.dumps(some_array))
)

Querying from the database:

uuid = 'some_array'
cursor.execute(
    """
    SELECT np_array_bytes
    FROM numpy_arrays
    WHERE uuid=%s
    """,
    (uuid,)
)
some_array = pickle.loads(cursor.fetchone()[0])

Performance?

If we could store our NumPy arrays directly in PostgreSQL we would get a major performance boost.

I haven't benchmarked this approach in any way, so I can neither confirm nor refute this...
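
If you want to measure it for your own data, here is a rough timing sketch (not a rigorous benchmark); it reuses the cursor, np, and pickle from above, and the 'bench_array' uuid and the .npy filename are made up for illustration:

import time

arr = np.random.rand(1500, 550)

# Round trip through PostgreSQL (autocommit is already enabled above).
start = time.perf_counter()
cursor.execute(
    "INSERT INTO numpy_arrays(uuid, np_array_bytes) VALUES (%s, %s)",
    ('bench_array', pickle.dumps(arr))
)
cursor.execute(
    "SELECT np_array_bytes FROM numpy_arrays WHERE uuid=%s",
    ('bench_array',)
)
restored = pickle.loads(cursor.fetchone()[0])
print('PostgreSQL round trip:', time.perf_counter() - start)

# Round trip through an .npy file for comparison.
start = time.perf_counter()
np.save('bench_array.npy', arr)
restored = np.load('bench_array.npy')
print('.npy file round trip:', time.perf_counter() - start)

The relative numbers will depend on your disk, the network latency to the Postgres server, and the array sizes involved.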

Disk Space?

My guess is that this approach takes as much disk space as dumping the arrays to a file using np.save('some_array.npy', some_array). If this is an issue, consider compressing the bytes before insertion.
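
If disk space does turn out to be a concern, here is a minimal sketch of compressing the bytes before insertion, using np.save into an in-memory buffer plus zlib instead of pickle; it reuses the cursor from above, and the 'compressed_array' uuid is made up:

import io
import zlib

arr = np.random.rand(1500, 550)

# Serialize with np.save into an in-memory buffer (no pickle involved),
# then compress the bytes before inserting them.
buf = io.BytesIO()
np.save(buf, arr, allow_pickle=False)
payload = zlib.compress(buf.getvalue())

cursor.execute(
    "INSERT INTO numpy_arrays(uuid, np_array_bytes) VALUES (%s, %s)",
    ('compressed_array', payload)
)

# Read it back: decompress, then np.load from a fresh buffer.
cursor.execute(
    "SELECT np_array_bytes FROM numpy_arrays WHERE uuid=%s",
    ('compressed_array',)
)
raw = cursor.fetchone()[0]  # psycopg2 returns BYTEA as a memoryview
restored = np.load(io.BytesIO(zlib.decompress(raw)), allow_pickle=False)
assert np.array_equal(arr, restored)

Note that arrays of random floats barely compress; the savings depend on how much redundancy your data has. Using np.save/np.load with allow_pickle=False also sidesteps the unpickling security concern raised in the comments below.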

Vlad
    Unpickling allows arbitrary python code execution, so you should never unpickle something untrustworthy. If different processes are allowed to save arbitrary values for this column, then there is an (albeit small) exploitable security flaw. To close this, you can use `bf = io.BytesIO(); some_array = np.load(bf, allow_pickle=False)` to load the data, instead (and similarly for saving) – Multihunter Jan 29 '21 at 04:51
  • 1
    By the way, there is an extra (minimum) 37 byte overhead to using `pickle.dump` instead of `np.save`. – Multihunter Jan 29 '21 at 04:55

You can use subprocess.run() to execute a shell command that bulk-copies from CSV files to the server using PostgreSQL's COPY (via psql's \copy). I'm more familiar with MSSQL, which has the bcp utility, and I was unable to fully test this solution, but I imagine it's a similar approach of calling the tool through the terminal. The terminal command is based on the third link below, which uses the same method, though that solution calls subprocess.call(), which has since been superseded by subprocess.run().

https://docs.python.org/3/library/subprocess.html#subprocess.run
https://ieftimov.com/post/postgresql-copy/

Python psql \copy CSV to remote server

import os
import subprocess

# user, hostname, password, dbname all defined elsewhere above;
# file_location is a placeholder for the path to the CSV file being copied.
psql_command = (
    "\\copy table (col1, col2) "
    "FROM file_location CSV HEADER QUOTE '\"' NULL ''"
)

command = [
    "psql",
    "-U", user,
    "-h", hostname,
    "-d", dbname,
    "-w",              # never prompt for a password
    "-c", psql_command,
]

# psql does not accept the password as a command-line argument;
# pass it via the PGPASSWORD environment variable (or use ~/.pgpass).
subprocess.run(command, env={**os.environ, "PGPASSWORD": password}, check=True)
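
This assumes the CSV files already exist; for completeness, here is a minimal, hypothetical sketch of writing a NumPy array to such a CSV with np.savetxt (the col1/col2 names and the file_location path are placeholders matching the \copy command above):

import numpy as np

# Hypothetical two-column array matching table (col1, col2) above.
some_array = np.random.rand(1500, 2)

np.savetxt(
    "file_location",        # same CSV path the \copy command reads from
    some_array,
    delimiter=",",
    header="col1,col2",     # HEADER in the \copy command skips this row
    comments="",            # do not prefix the header with '#'
)
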
hSin