0

I have a situation where writing a PL/pgSQL function solution is to slow and cumbersome to write and probably impossible cause I need many python modules. That's why I want to opt for VAEX or DASK.

The plan: dump the SQL table to a file, then apply a func.

Pseudo code:

def olap(ary,tbl_field):
  return set(ary) & set(tbl_field)

then for every row in the FILE loop and find the highest overlap.

pseudo code:

for row in FILE:
   max_olap = select olap(row.fieldX::int[], F.fieldX::int[]) as score
              from FILE as F
              order by score desc limit 1
   res.append(row.id, max_olap)

save2SQLTable(res) #or file first

#or better yet insert during the loop, unless it breaks parallization

i.e. loop within a loop.

I'm talking about ~1mln rows.

olap() function is just for illustration. The real fun is much more complex but still accepts two arrays as arguments.

Which tool is better for this scenario?

How do you dump the SQL table to numpy|panda|? What format is suitable for variable array field?

I want to use the parallelization capabilities?

Once done I have to import the result to SQL table.

Any example on any part of the process.

What I found so far:

  • DASK seems to have some func sql_to_table(). It is a bit confusing. It seems I have to first create DF in memory then save to file.

  • it is a bit hard to understand how .apply() user func works it seems both tools have support for it

sten
  • 7,028
  • 9
  • 41
  • 63
  • "*cause I need many python modules*" - you can write stored functions in pyhton: https://www.postgresql.org/docs/current/plpython.html –  Jun 11 '21 at 05:20
  • `then for every row in the FILE loop and find the highest overlap.` Overlap with *what* ? BTW: in SQL you can join a table with itself. And you don't want to loop in SQL. – wildplasser Jun 11 '21 at 08:32
  • yes it has python, but it its not suitable for development, plus the func will become bottleneck. PG parallelize queries only in special cases definitely not python based – sten Jun 11 '21 at 16:38
  • @wildplasser not a good idea 1mln * 1mln join ~ 10^12 ops with expensive func will take forever .. pg wont parallelize it – sten Jun 11 '21 at 16:41

0 Answers0