I am storing document binaries (mostly PDF files) in a SQL Server database and am using the Acrobat IFilter and full-text indexing to make the contents of the files searchable.
However, some of these PDFs were scanned with really cheap software that did not do OCR, and are images of documents instead of proper documents with searchable text. I would like to determine which records in the database have no searchable text so they can be OCRed and re-uploaded.
I can get the document IDs that do have at least one full-text entry by using sys.dm_fts_index_keywords_By_Document
. I tried joining the distinct list of IDs with the document table to find the records that don't match, but this turned out to be incredibly slow -- I have about 20,000 documents (some hundreds of pages) and the query ran for over 20 minutes before I canceled it.
Is there a better way to do this?