I have a situation where a PL/pgSQL function would be too slow and cumbersome to write, and probably impossible anyway because I need many Python modules. That's why I want to opt for Vaex or Dask.
The plan: dump the SQL table to a file, then apply a function to it.
Pseudo-code:

```python
def olap(ary, tbl_field):
    # overlap score: how many distinct values the two arrays share
    return len(set(ary) & set(tbl_field))
```
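(Just as a sketch, a NumPy version of the same idea, assuming both inputs are 1-D integer arrays:)

```python
import numpy as np

def olap_np(ary, tbl_field):
    # number of distinct values the two arrays share
    return len(np.intersect1d(ary, tbl_field))
```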
Then, for every row in the file, loop and find the highest overlap.
Pseudo-code:

```python
for row in FILE:
    max_olap = SELECT olap(row.fieldX::int[], F.fieldX::int[]) AS score
               FROM FILE AS F
               ORDER BY score DESC LIMIT 1
    res.append((row.id, max_olap))

save2SQLTable(res)  # or save to a file first
# or better yet, insert during the loop, unless that breaks parallelization
```

i.e. a loop within a loop. A rough sketch of how I picture this in Python is just below.
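To make it concrete, here is a minimal sketch of how I imagine parallelizing the outer loop with dask.bag; `arrays`, `best_overlap`, and the partition count are placeholders of mine, not anything official:

```python
import dask.bag as db

# toy stand-in for the real data: one int array per row
arrays = [[1, 2, 3], [2, 3, 4], [7, 8, 9], [3, 4, 5]]

def best_overlap(i, arrays):
    # score row i against every other row and keep the best match
    row_set = set(arrays[i])
    best_score, best_j = -1, -1
    for j, other in enumerate(arrays):
        if j == i:
            continue  # skip the self-comparison, which would always win
        score = len(row_set.intersection(other))
        if score > best_score:
            best_score, best_j = score, j
    return i, best_j, best_score

# Dask spreads the iterations of the outer loop across partitions/workers
bag = db.from_sequence(range(len(arrays)), npartitions=4)
results = bag.map(best_overlap, arrays=arrays).compute()
print(results)  # [(0, 1, 2), (1, 0, 2), ...]
```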
I'm talking about ~1 million rows, so a naive all-pairs comparison is on the order of 10^12 olap() calls.
The olap() function above is just for illustration; the real function is much more complex, but it still takes two arrays as arguments.
Which tool is better for this scenario?
How do you dump the SQL table to NumPy/pandas? What file format is suitable for a variable-length array field?
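For the dump itself, I currently picture something like this (connection string and table/column names are made up; it relies on psycopg2 returning int[] columns as Python lists, and on pyarrow supporting list columns in Parquet):

```python
import pandas as pd
from sqlalchemy import create_engine

# hypothetical connection string and table/column names
engine = create_engine("postgresql+psycopg2://user:pass@localhost/mydb")
df = pd.read_sql("SELECT id, fieldX FROM mytable", engine)

# psycopg2 hands int[] columns back as Python lists, and pyarrow can store
# list columns in Parquet, so the variable-length arrays survive the round trip
df.to_parquet("mytable.parquet", engine="pyarrow")
```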
I want to use the parallelization capabilities.
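(With Dask I assume that mostly means starting a local cluster, something like:)

```python
from dask.distributed import Client

# local cluster, one worker process per core by default;
# subsequent .compute() calls run on it automatically
client = Client()
```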
Once done, I have to import the result back into a SQL table.
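For the import step I assume pandas' to_sql() is enough; this reuses `results` from the loop sketch above (the target table name is made up):

```python
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql+psycopg2://user:pass@localhost/mydb")

res_df = pd.DataFrame(results, columns=["id", "best_match_id", "score"])
res_df.to_sql("olap_results", engine, if_exists="replace", index=False)
```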
An example of any part of the process would help.
What I found so far:
Dask seems to have a function for this (dask.dataframe.read_sql_table()), but it is a bit confusing; it looks as if I first have to create a DataFrame in memory and then save it to a file.
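From what I can tell it is lazy, so the whole table never has to sit in memory at once; something like this (URI, table, and index column are placeholders):

```python
import dask.dataframe as dd

# read the table in parallel chunks, partitioned on a numeric index column
ddf = dd.read_sql_table("mytable", "postgresql://user:pass@localhost/mydb",
                        index_col="id", npartitions=16)

# partitions are read and written one by one, not all in memory at once
ddf.to_parquet("mytable_parquet/")
```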
It is also a bit hard to understand how .apply() with a user function works; both tools seem to support it.
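For the record, the .apply() pattern I have seen for Dask looks like this (`ref_set` is just a placeholder for whatever the real function compares against; `meta` declares the output name/dtype so Dask can build the graph lazily):

```python
import dask.dataframe as dd

ddf = dd.read_parquet("mytable_parquet/")
ref_set = {1, 2, 3}  # placeholder reference array

# apply a user function element-wise over the array column
scores = ddf["fieldX"].apply(lambda a: len(ref_set & set(a)),
                             meta=("score", "int64"))
print(scores.head())
```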