
I am developing a PostgreSQL procedure using the plpython3u extension, which lets you use the Python procedural language inside PostgreSQL.

With the following code using plpy, I am able to retrieve data from a table and load it into a pandas dataframe.

CREATE OR REPLACE PROCEDURE public.plpy_proc_clas_full()
LANGUAGE 'plpython3u'
AS $BODY$
  import pandas as pd

  # plpy.execute() returns a PLyResult, which behaves like a list of dicts
  data_lt = plpy.execute('SELECT "key", "value" FROM public."<your-table>" ORDER BY "key"')
  data_df_x = pd.DataFrame.from_records(data_lt)['key']
  data_df_y = pd.DataFrame.from_records(data_lt)['value']
  df = pd.concat([data_df_x, data_df_y], axis=1)
  # note: a PROCEDURE cannot return a value, so the dataframe stays in the body
$BODY$;

But how can I write the pandas dataframe back to a table (for example, after a few data manipulations in Python)?

Stavros Koureas

2 Answers


The documentation shows an example in a rather unexpected section, on trapping errors:

CREATE FUNCTION insert_fraction(numerator int, denominator int) RETURNS text AS $$
from plpy import spiexceptions
try:
    plan = plpy.prepare("INSERT INTO fractions (frac) VALUES ($1 / $2)", ["int", "int"])
    plpy.execute(plan, [numerator, denominator])
except spiexceptions.DivisionByZero:
    return "denominator cannot equal zero"
except spiexceptions.UniqueViolation:
    return "already have that fraction"
except plpy.SPIError as e:
    return "other error, SQLSTATE %s" % e.sqlstate
else:
    return "fraction inserted"
$$ LANGUAGE plpython3u;

That would suggest you need to map the dataframe back to records and bind each row's values to the placeholders ($1, $2) of an INSERT ... VALUES ... statement executed with plpy.execute.
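Concretely, that could look like the minimal sketch below, where df is the dataframe built in the question and target("key" text, "value" numeric) is a hypothetical destination table:

# Sketch: prepare the INSERT once, then bind each dataframe row to the
# $1/$2 placeholders. The table name and column types are assumptions.
plan = plpy.prepare(
    'INSERT INTO target ("key", "value") VALUES ($1, $2)',
    ['text', 'numeric'],
)
for row in df.itertuples(index=False):
    plpy.execute(plan, [row.key, row.value])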

As with all basic DB interactions, you could also dynamically construct the full text of a static INSERT statement with the literal values inlined from the dataframe, and then plpy.execute that. While simple and intuitive, this is slower (the data gets rewritten and reformatted instead of being passed through and processed directly) and less robust (no automatic quoting or sanitization, for example).
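If you do go that route, plpy at least ships quoting helpers (plpy.quote_literal, plpy.quote_nullable, plpy.quote_ident) that soften the sanitization problem. A sketch of the dynamic variant, against the same hypothetical target table as above:

# Sketch: inline every value as an explicitly quoted SQL literal.
# Values are stringified first because the quote helpers expect text;
# PostgreSQL casts the quoted literal back to the column's type.
for row in df.itertuples(index=False):
    sql = 'INSERT INTO target ("key", "value") VALUES (%s, %s)' % (
        plpy.quote_nullable(str(row.key)),
        plpy.quote_nullable(str(row.value)),
    )
    plpy.execute(sql)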


After some time, I read a few articles that used libraries such as psycopg2, or plpy.connect, to open a connection to the destination database.

Personally, I do not see any reason to open a second connection while manipulating data (in and out) within the same runtime and database.

I came up with a solution that does the job, though I am not sure about its performance on huge tables: it iterates over the dataframe and issues one INSERT statement per row, so it is not a bulk load.

CREATE OR REPLACE PROCEDURE public.plpy_proc_test()
LANGUAGE 'plpython3u'
AS $BODY$
  import pandas as pd

  # plpy.execute() returns a PLyResult, which behaves like a list of dicts
  data_lt = plpy.execute('SELECT "key", "value" FROM public."ml_train" ORDER BY "key"')
  data_df_x = pd.DataFrame.from_records(data_lt)['key']
  data_df_y = pd.DataFrame.from_records(data_lt)['value']

  df = pd.concat([data_df_x, data_df_y], axis=1)

  # prepare the INSERT once, outside the loop, then execute it per row
  plan = plpy.prepare('INSERT INTO test (y_hat, y_act) VALUES ($1, $2)', ['numeric', 'numeric'])
  for index, row in df.iterrows():
    plpy.execute(plan, [row[0], row[1]])
$BODY$;
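If the row-by-row loop ever becomes the bottleneck, one way to get a single bulk statement without leaving plpy is to pass the columns as whole arrays and unnest them server-side. A rough sketch, assuming the same test table (I have not benchmarked this on huge tables):

# Sketch: one INSERT for the entire dataframe. PL/Python converts the
# Python lists to PostgreSQL arrays, and unnest() turns the parallel
# arrays back into rows on the server.
plan = plpy.prepare(
    'INSERT INTO test (y_hat, y_act) SELECT * FROM unnest($1, $2)',
    ['numeric[]', 'numeric[]'],
)
plpy.execute(plan, [df['key'].tolist(), df['value'].tolist()])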
Stavros Koureas