
I am using PyHive with the SQLAlchemy DB-API (asynchronous). My Hive table has millions of records; if I execute:

SELECT * FROM table

it loads millions of records into memory. Is there a way to limit the size of the query result to a certain size, say 5 MB?

I want to discard any results beyond 5 MB. Millions of records could amount to 100 MB, but I do not want to load that much data in my Python code.

For example, if:

SELECT * FROM table

and 1,000 rows = 5 MB for table, then SQLAlchemy should return only those 1,000 rows.

For another query, say:

SELECT * FROM table2

where 100,000 rows = 5 MB for table2, SQLAlchemy should return 100,000 rows.

This way I do not run out of memory on my client. The examples are simple, but there could be complex join SQL, and I still want to limit the query result size based on the MB of data.
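To illustrate, here is roughly the client-side workaround I have in mind (using the synchronous PyHive DB-API cursor for simplicity): fetch rows in batches and stop once an approximate byte budget is exceeded. The host/port/database values and the use of sys.getsizeof as a size estimate are only placeholders.

import sys
from pyhive import hive

MAX_BYTES = 5 * 1024 * 1024   # stop after roughly 5 MB of row data
BATCH_SIZE = 1000             # rows per fetchmany() round trip

# Placeholder connection details; adjust to the real HiveServer2 host.
conn = hive.connect(host="localhost", port=10000, database="default")
cursor = conn.cursor()
cursor.execute("SELECT * FROM table")

rows, total_bytes = [], 0
while total_bytes < MAX_BYTES:
    batch = cursor.fetchmany(BATCH_SIZE)
    if not batch:
        break
    for row in batch:
        # sys.getsizeof is only a rough estimate of each value's footprint
        total_bytes += sum(sys.getsizeof(col) for col in row)
        if total_bytes >= MAX_BYTES:
            break
        rows.append(row)

cursor.close()
conn.close()

But this only stops fetching on the client side; the query still runs fully on the server, which is why I am asking whether there is a proper way to do this.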

Is there a way to achieve this with SQLAlchemy, or do I need to modify the Hive and Presto servers to get such behavior?

Mithun Mistry
  • You can put a limit on the number of rows returned in the query: `SELECT * FROM table1 LIMIT 1000`. SQLAlchemy queries have a corresponding `limit()` method. – snakecharmerb May 28 '21 at 08:43
  • @snakecharmerb I know the LIMIT clause, but my question is about adding a limit dynamically based on the query result size. I do not necessarily want to limit it to a certain number of rows each time. If I can load millions of rows within 5 MB, I am OK with loading millions of rows too. I just don't want to load more than X MB into memory. – Mithun Mistry May 28 '21 at 09:33

0 Answers