0

I am processing flat files from disk and need to ensure that I never process the same file twice. The filename of every processed file is stored in a postgresql DB and at the next iteration I need to determine the unseen files on disk and process them, ie. I need to determine the set difference of the filenames on disk and the filenames in the DB.

Currently my approach is to create a CTE from the filenames on disk and join that to the table of seen filenames. The list of files on disk is large and constantly changing, and processing is slowing down.

This is the current query:

WITH input(filename) AS (VALUES ${filenames.joinToString { "(?)" }})
SELECT input.filename FROM input 
LEFT JOIN my_table pm ON input.filename ILIKE pm.filename
WHERE pm.filename IS NULL

${filenames.joinToString { "(?)" }} expands to something like (?), (?), (?), depending on the number of filename parameters.

What can I do to speed up this process?

One thing that I have to do is add an index on the filename column. What kind of index is the correct choice?

Niel de Wet
  • 7,806
  • 9
  • 63
  • 100
  • I would do the comparison in the host language using the equivalent of `HashSet`. Load the `HashSet` from `my_table`, and check `contains()` as you loop through filenames on disk. – Mike Organek Aug 13 '20 at 10:30

1 Answers1

0

Since you're using ILIKE, I wouldn't put an index on pm.filename, but on LOWER(pm.filename). This should allow you to remove ILIKE in favour of the more performant LIKE. This also means you can just use a simple B-tree index, as it works fine with LIKE. LIKE is useful if you use wildcards, but if you don't, just use normal =-equality.

Finally, there is a good chance that the query optimiser already does a lot with the query, but I suggest you look at the EXPLAIN (ANALYSE) output of this query. I have some suggestions for improvement, but no idea on whether they will help or they will all be boiled down to the same query plan. That's completely up to you!


This takes the result of the first query first list and removes any matches from the result of the second query. The downside is that the returned filenames are lowercased.

SELECT LOWER(filename)
FROM (VALUES ${filenames.joinToString { "(?)" }}) AS input(filename)
EXCEPT ALL (SELECT LOWER(filename) FROM my_table pm)

This query doesn't have this drawback, it just returns all filenames that do not have a match in my_table.

SELECT filename
FROM (VALUES ${filenames.joinToString { "(?)" }}) AS input(filename)
WHERE NOT EXISTS (
  SELECT
  FROM my_table pm
  WHERE LOWER(pm.filename) = LOWER(input.filename)
)

The last query is probably equivalent to this one, but I'll add it for completeness.

SELECT filename
FROM (VALUES ${filenames.joinToString { "(?)" }}) AS input(filename)
WHERE LOWER(filename) NOT IN (
  SELECT LOWER(pm.filename)
  FROM my_table pm
)
Ruben Helsloot
  • 12,582
  • 6
  • 26
  • 49
  • Thanks Ruben. `LOWER(pm.filename) = LOWER(input.filename)` did the trick. Adding the index made little difference. – Niel de Wet Aug 13 '20 at 20:19