I am working on a cron job which needs to query Postgres on a daily basis. The table is huge: roughly a trillion records, of which I expect to retrieve about a billion per execution. I couldn't find any documentation on using cursors or pagination with Slick 2.1.0. An easy approach I can think of is to get the count first and loop through the results using drop and take, as sketched below. Is there a better, more efficient way to do this?
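In code, the drop/take approach I'm describing would look something like this (the `Events` table, connection settings, and page size are placeholders for illustration):

```scala
import scala.slick.driver.PostgresDriver.simple._

object PagedExport extends App {
  // Placeholder schema standing in for the real trillion-row table
  class Events(tag: Tag) extends Table[(Long, String)](tag, "events") {
    def id      = column[Long]("id", O.PrimaryKey)
    def payload = column[String]("payload")
    def *       = (id, payload)
  }
  val events = TableQuery[Events]

  val db = Database.forURL("jdbc:postgresql://localhost/mydb",
                           driver = "org.postgresql.Driver")
  val pageSize = 10000

  db.withSession { implicit session =>
    val total = Query(events.length).first   // get the count first
    var offset = 0
    while (offset < total) {
      // A stable sort order is essential, or pages may overlap or skip rows
      val page = events.sortBy(_.id).drop(offset).take(pageSize).list
      // process(page) goes here
      offset += pageSize
    }
  }
}
```

One caveat with this pattern: Postgres still has to scan and discard all the rows before each OFFSET, so later pages get progressively slower; at a billion rows, keyset pagination (WHERE id > lastSeenId ORDER BY id LIMIT n) or a server-side cursor is far cheaper.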
- Are you (1) relying on a huge amount of RAM on the client, or (2) planning to use cursors to fetch less than the whole result set? I'm not sure I understand your question in the specifics. – Richard Huxton Oct 16 '14 at 08:18
- Apologies if I didn't explain it properly. I am definitely looking for an option like (2). – Shashi Oct 16 '14 at 17:30
- I think what Richard Huxton was getting at is that if you use cursors, you won't pull all of those records into memory at once. You will read your billion results into memory one at a time (or perhaps a chunk at a time, depending on the database driver). – AmigoNico Oct 21 '14 at 07:25
- Ah, that helped. You might be interested in this question: http://stackoverflow.com/questions/14369899/scala-slick-lazy-fetch – AmigoNico Oct 21 '14 at 23:29
- I haven't actually tried Slick yet, but I would have thought that it would give you an Iterator that uses a cursor underneath. Ah, this may be helpful: http://stackoverflow.com/questions/19409646/database-iterators-in-conjunction-with-paging (a sketch of the cursor approach follows these comments) – AmigoNico Oct 21 '14 at 23:32
- Thanks @AmigoNico, both those links recommend what I have implemented right now: get a total count and paginate using drop and take. – Shashi Oct 22 '14 at 15:45
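For reference, a minimal sketch of the cursor-based alternative the comments point to, reusing the placeholder `events` table and `db` from the question. The Postgres JDBC driver only streams rows through a cursor when autocommit is off (hence `withTransaction`), the statement is forward-only, and a positive fetch size is set; since Slick 2.1.0 does not appear to expose a fetch-size setting, this drops to the session's underlying JDBC connection:

```scala
import java.sql.ResultSet

db.withTransaction { implicit session =>   // withTransaction turns autocommit off
  val st = session.conn.createStatement(ResultSet.TYPE_FORWARD_ONLY,
                                        ResultSet.CONCUR_READ_ONLY)
  st.setFetchSize(10000)                   // > 0 tells the driver to use a cursor
  try {
    val rs = st.executeQuery("SELECT id, payload FROM events ORDER BY id")
    while (rs.next()) {
      val id = rs.getLong("id")            // only ~fetchSize rows are buffered client-side
      // process one row at a time here
    }
  } finally {
    st.close()
  }
}
```

This keeps client memory flat regardless of the result-set size, which is what option (2) in the comments above was after.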
1 Answer
Map/reduce using Akka and postgresql-async: first count, then distribute offset+limit queries to actors, then map the data as needed, then reduce the result to Elasticsearch or another store?
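Read as steps, that suggestion might look like the following rough sketch with plain Akka actors (pool size, page size, and the hard-coded total are illustrative; the per-page fetch via postgresql-async is left as a comment):

```scala
import akka.actor._
import akka.routing.RoundRobinPool

case class Page(offset: Long, limit: Int)

class PageWorker extends Actor {
  def receive = {
    case Page(offset, limit) =>
      // 1. Fetch this window, e.g. with postgresql-async:
      //      SELECT ... ORDER BY id LIMIT <limit> OFFSET <offset>
      // 2. Map each row to the target shape
      // 3. Reduce/index the mapped batch into Elasticsearch or another store
  }
}

object ExportJob extends App {
  val system  = ActorSystem("export")
  // Worker count is arbitrary; size it to the database's connection budget
  val workers = system.actorOf(RoundRobinPool(8).props(Props[PageWorker]), "workers")

  val total    = 1000000000L   // would come from the initial SELECT count(*)
  val pageSize = 50000
  (0L until total by pageSize.toLong).foreach(off => workers ! Page(off, pageSize))
}
```

The same OFFSET caveat as above applies here: handing workers keyset ranges (WHERE id BETWEEN lo AND hi) instead of offsets avoids rescanning from the start of the table for every page.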
- This answer looks more like a comment. Try to format your thens into steps and give a little more detail. – CaptainBli Oct 16 '14 at 16:34