-1

I am using Vertica and I have some functions written in Python that use numpy linear algebra capabilities/scipy optimization and are too complicated to write in SQL.

Based on benchmarks which of the options will perform better -

1- Vertica UDF - where the udf of the original python functions will be applied on every record.

2- Using Pyspark and apply the functions directly on every record.

Which is faster for applying python functions on every record - Spark or Vertica UDF ?

Michael
  • 427
  • 1
  • 3
  • 13

1 Answers1

1

The bottleneck is always data travelling from and to the database.

So bring the functionality to where the data is rather than bringing the data to the functionality - always.

Have you actually explored all the functionality you have at your disposal with a DBMS actually complying with the ANSI 99 or even the ANSI 2003 standard of SQL?

You'll actually be surprised how little is left that is too complicated to do in SQL.

I have long since come to the point where I say:

If you can't do it in SQL, do it in C (or C++). If you can't do it in C, it's not worth doing. Search the net for "Real Programmers Don't Use Pascal" to see what I mean.

First, check the DBMS's documentation. Especially Vertica's, which is really detailled.

Then, create SQL functions. And - only as a last resort, code a User Defined Extension.

The best architecture you can build is simple.

The best code is the one you don't have to develop.

Ansering to your comment you posted in the meantime:

  1. Explore the Machine Learning Function library and documentation in Vertica - there is a ton of functionality that you don't have to develop and can just use: https://www.vertica.com/docs/10.0.x/HTML/Content/Authoring/SQLReferenceManual/Functions/MachineLearning/_MLFunctions.htm?tocpath=SQL%20Reference%20Manual%7CSQL%20Functions%7CMachine%20Learning%20Functions%7C_____0

And for whatever might be missing- try User Defined Extensions - in Python, in your case:

https://www.vertica.com/docs/10.0.x/HTML/Content/Authoring/ExtendingVertica/UDx/DevelopingUDxs.htm?tocpath=Extending%20Vertica%7CDeveloping%20User-Defined%20Extensions%20(UDxs)%7C_____0

and:

https://www.vertica.com/docs/10.0.x/HTML/Content/Authoring/ExtendingVertica/Python/IntroductionPythonSDK.htm?tocpath=Extending%20Vertica%7CDeveloping%20User-Defined%20Extensions%20(UDxs)%7CDeveloping%20with%20the%20Python%20SDK%7C_____0

marcothesane
  • 6,192
  • 1
  • 11
  • 21
  • I need to use optimization algorithm (L-BFGS-B) on very simple function, calculate many cross products and multiply matrices - exactly the goal of numpy/scipy. I didn’t find any SQL support for linear algebra / optimization. – Michael Aug 02 '20 at 02:35