
I am storing document binaries (mostly PDF files) in a SQL Server database and am using the Acrobat IFilter and full-text indexing to make the contents of the files searchable.

However, some of these PDFs were scanned with really cheap software that did not do OCR, and are images of documents instead of proper documents with searchable text. I would like to determine which records in the database have no searchable text so they can be OCRed and re-uploaded.

I can get the document IDs that do have at least one full-text entry by using sys.dm_fts_index_keywords_By_Document. I tried joining the distinct list of IDs against the document table to find the records that don't match, but this turned out to be incredibly slow -- I have about 20,000 documents (some of them hundreds of pages long) and the query ran for over 20 minutes before I canceled it.
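For reference, the slow attempt was roughly the following anti-join (table and column names here are placeholders for my actual schema):

    select PhysicalFileID
        from PhysicalFile
        where PhysicalFileID not in
            (select distinct document_id
             from sys.dm_fts_index_keywords_by_document(db_id(), object_id('PhysicalFile')))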

Is there a better way to do this?

Paul Abbott
  • I've gone through this and couldn't find a better answer... my record set was nowhere near as large, but it still took some time. Execute and walk off for the day... I suggest doing this as an INSERT statement so all the rows are dumped into a table you can query later. Bonus points to whoever has an answer on this one. – Twelfth Jul 29 '14 at 00:35
  • I am not connected to a suitable database at the moment, but I have often found that I could extract the SQL code for an MS-supplied proc, etc. Maybe if you do this you could determine a useful subset of the full query that would run faster. – Gary Walker Jul 29 '14 at 03:12
  • Just remembered I could RDP to a suitable machine. sys.dm_fts_index_keywords_By_Document is under master, system, table-valued functions, but it can't be exported to a CREATE FUNCTION script, so no help there. – Gary Walker Jul 29 '14 at 03:32

1 Answer


I managed to come up with a solution that only took about 2 minutes to run on a set of 40,000 documents.

1) Create a temp table to store the document_id values from sys.dm_fts_index_keywords_by_document.

2) Populate it by grouping on document_id. Almost all documents will have at least some entries, so choose a keyword-count threshold below which the full-text index clearly contains no meaningful information (I used 30, but most "bad" documents only had 3-5 keywords). In my particular case, the table storing the PDF binaries is PhysicalFile.

3) If needed, join the temp table to whatever other tables have the information you need listed. In my particular case, MasterDocument contains the document title and I also included a few lookup tables.

-- 1) Temp table to hold the IDs of documents with few indexed keywords
create table #PhysicalFileIDs (PhysicalFileID int, KeywordCount int)

-- 2) Collect the documents whose full-text entry count falls below the threshold
insert into #PhysicalFileIDs (PhysicalFileID, KeywordCount)
    select document_id, count(keyword)
    from sys.dm_fts_index_keywords_by_document (db_id(), object_id('PhysicalFile'))
    group by document_id
    having count(keyword) < 30

-- 3) Join back to the tables that hold the document details
select MasterDocument.DocumentID, MasterDocument.Title, ProfileType.ProfileTypeDisplayName, #PhysicalFileIDs.KeywordCount
    from MasterDocument
    inner join #PhysicalFileIDs on MasterDocument.PhysicalFileID = #PhysicalFileIDs.PhysicalFileID
    inner join DocumentType on MasterDocument.DocumentTypeID = DocumentType.DocumentTypeID
    inner join ProfileType on ProfileType.ProfileTypeID = DocumentType.ProfileTypeID

drop table #PhysicalFileIDs
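One caveat: a document with no full-text entries at all may not appear in the DMV output at all, so it would never land in the temp table and the inner join would miss it. If that is a concern, a sketch of a LEFT JOIN variant that counts zero-entry documents too (same hypothetical threshold and table names as above):

    select PhysicalFile.PhysicalFileID, count(k.keyword) as KeywordCount
        from PhysicalFile
        left join sys.dm_fts_index_keywords_by_document (db_id(), object_id('PhysicalFile')) k
            on PhysicalFile.PhysicalFileID = k.document_id
        group by PhysicalFile.PhysicalFileID
        having count(k.keyword) < 30

With the LEFT JOIN, a document absent from the DMV yields count(k.keyword) = 0, which passes the HAVING filter.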