
I have two tables with millions of rows each. They share a common email address column, but they don't share any other fields.

I have a join operation that works fine.

SELECT r.*, l.* FROM righttable r FULL OUTER JOIN lefttable l ON r.email = l.email;

However, the result set contains millions of rows, which overwhelms my server's memory. How can I run consecutive queries that only pull a limited number of rows from each table at a time and ultimately visit all of the rows in the two tables?

Furthermore, after receiving a result set, our server may make some inserts into one or both of the tables. I'm afraid this may complicate keeping track of the offset in each consecutive query. Maybe it's not a problem; I can't wrap my head around it.

GNG
  • If you wanted an inner join, I'd have a good answer... – Laurenz Albe Jun 17 '20 at 05:50
  • Your previous answer, which I see you deleted, suggested using a `LIKE` clause. I see how this would work. As I imagine you considered after posting it, this would only work if the number of rows having an email beginning with "a" is more or less the same as the number of rows having an email beginning with "b", and so forth for all other letters. But this isn't guaranteed, and even if it coincidentally were the case, a query may still return too many results for our server to handle. – GNG Jun 17 '20 at 05:58
  • No, the reason I deleted the answer is that it doesn't work with outer joins. The exact `LIKE` condition can be varied to get an equal partitioning, that is not a fundamental problem. – Laurenz Albe Jun 17 '20 at 06:11

1 Answer


I don't think you can do this in batches, because the database can't know which rows it has to fabricate to fulfill the "FULL OUTER" part without seeing all of the data. You might be able to get around that, if you know that no one is making changes to the tables while you work, by selecting the left-only tuples, the right-only tuples, and the inner tuples in three separate queries, as sketched below.
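A minimal sketch of that three-way split, using the table and column names from the question and assuming nothing modifies the tables in between:

-- matched rows (the inner part of the full join)
SELECT r.*, l.* FROM righttable r JOIN lefttable l ON r.email = l.email;

-- rows that exist only in righttable
SELECT r.* FROM righttable r LEFT JOIN lefttable l ON r.email = l.email WHERE l.email IS NULL;

-- rows that exist only in lefttable
SELECT l.* FROM lefttable l LEFT JOIN righttable r ON l.email = r.email WHERE r.email IS NULL;

Each of those queries could then be paged through separately, e.g. with keyset pagination on email (WHERE email > :last_seen ORDER BY email LIMIT n, where :last_seen is a placeholder for the last email of the previous batch), but again, that is only safe as long as the tables are not changing underneath you.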

But the join itself should not consume all your memory (assuming you mean RAM, not disk space) on the server, because PostgreSQL should use temp files instead of RAM for the bulk of the storage it needs. There are some known problems with memory usage for huge hash joins, though, so if you hit one you might try `SET enable_hashjoin = off`.
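For example, a sketch of doing that in the session running the query (the setting only affects the current session):

SET enable_hashjoin = off;
SELECT r.*, l.* FROM righttable r FULL OUTER JOIN lefttable l ON r.email = l.email;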

The client might use too much memory, though, as it may try to read the entire result set into client RAM at once. There are ways around this, but they probably do not involve manipulating the JOIN itself: you can use a cursor to read the single result stream in batches, or you can spool the results out to disk using `\copy` and then use something like GNU split on the resulting file. Both approaches are sketched below.
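A minimal psql sketch of both approaches, reusing the query from the question (the cursor name, batch size, and output file name are placeholders):

-- read the join result in batches through a cursor
BEGIN;
DECLARE join_cur CURSOR FOR
    SELECT r.*, l.* FROM righttable r FULL OUTER JOIN lefttable l ON r.email = l.email;
FETCH 10000 FROM join_cur;   -- repeat until FETCH returns no more rows
CLOSE join_cur;
COMMIT;

-- or spool the whole result to disk and split the file afterwards
\copy (SELECT r.*, l.* FROM righttable r FULL OUTER JOIN lefttable l ON r.email = l.email) TO 'join_result.csv' CSV
\! split -l 100000 join_result.csv join_part_

Here `\copy` and `\!` are psql meta-commands rather than SQL, so they have to be run from psql.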

jjanes