I am processing flat files from disk and need to ensure that I never process the same file twice. The filename of every processed file is stored in a postgresql DB and at the next iteration I need to determine the unseen files on disk and process them, ie. I need to determine the set difference of the filenames on disk and the filenames in the DB.
Currently my approach is to create a CTE from the filenames on disk and join that to the table of seen filenames. The list of files on disk is large and constantly changing, and processing is slowing down.
This is the current query:
WITH input(filename) AS (VALUES ${filenames.joinToString { "(?)" }})
SELECT input.filename FROM input
LEFT JOIN my_table pm ON input.filename ILIKE pm.filename
WHERE pm.filename IS NULL
${filenames.joinToString { "(?)" }}
expands to something like (?), (?), (?)
, depending on the number of filename parameters.
What can I do to speed up this process?
One thing that I have to do is add an index on the filename column. What kind of index is the correct choice?