We have an IBM Notes procedure database that uses a separate database to store attachment documents, each with the current copy of a procedure attached. That second database is full-text indexed so the procedures can be searched. Most of the procedures are Word documents, and those don't seem to have a problem, but one particular kind of procedure is stored as a PDF, and that is where the problem is. A search returns only Word documents that contain the search phrase, even though there are many PDFs that contain it too. Is there a setting that needs to be enabled for the search to find the PDFs? These are true PDFs, not TIFs. MJ
2 Answers
Unfortunately, you probably can't use the answer from Torsten: it describes the Apache Tika filters, which Domino only started using from version 10.0 onward. Domino 9.x and earlier all used the Verity Keyview filter libraries. Was there ever a point at which the PDFs were indexing?
One thing I might try in order to troubleshoot this is to enable the notes.ini setting DEBUG_FT_STREAM=2049. You don't need to restart the server. Then rebuild your database's index (load updall -x mydbname). If the PDF is being processed at all, you should see a log line stating one of the following:
"Indexing Attachment Object: 'myattachment.pdf' Size = 65536 using Keyview"
"Indexing Attachment Object: 'myattachment.pdf' Size = 65536 using Brute Force"
If neither of these shows up, then you may need to dig some more. If the "Brute Force" one shows up then, yeah, something from the PDF is being indexed, but who knows what. Brute Force just quickly strips out any ASCII text it can find, so the indexed result can be very inaccurate.
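If the rebuild produces a lot of log output, a small script can pull out the relevant lines. The following is only a minimal sketch: the regular expression is built from the two log lines quoted above, the exact spacing in a real log is an assumption, and the function names and sample file names are hypothetical.

```python
import re

# Pattern derived from the two sample log lines quoted in this answer.
# Assumption: real log lines follow the same wording and quoting.
LINE_RE = re.compile(
    r"Indexing Attachment Object: '(?P<name>[^']+)'\s+"
    r"Size = (?P<size>\d+)\s+using (?P<method>Keyview|Brute Force)"
)


def classify_attachments(log_lines):
    """Map each attachment name to the extraction method named in the log."""
    methods = {}
    for line in log_lines:
        match = LINE_RE.search(line)
        if match:
            methods[match.group("name")] = match.group("method")
    return methods


def pdfs_by_method(log_lines):
    """Group the PDF attachments by the method that was used to index them."""
    result = {"Keyview": [], "Brute Force": []}
    for name, method in classify_attachments(log_lines).items():
        if name.lower().endswith(".pdf"):
            result[method].append(name)
    return result


# Hypothetical sample input; replace with lines read from your own log.
sample = [
    "Indexing Attachment Object: 'procedure-a.pdf' Size = 65536 using Keyview",
    "Indexing Attachment Object: 'procedure-b.pdf' Size = 65536 using Brute Force",
    "Indexing Attachment Object: 'procedure-c.doc' Size = 1024 using Keyview",
    "some unrelated console line",
]
print(pdfs_by_method(sample))
```

A PDF that appears in neither group was not reported as processed at all, which is the "dig some more" case.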

As you can read in this link there are a bunch of attachment types that by default are not indexed:
By default, all file formats that are supported by Tika 1.18 are full-text indexed with the exception of the following ones:
.au, .bqy, .cca, .dbd, .dll, .exe, .gif, .gz, .img, .jar, .jpg, .mov, .mp3, .mpg, .msi, .nsf, .ntf, .p7m, .p7s, .pag, .pdb, .png, .rar, .sys, .tar, .tar, .tif, .wav, .wpl, .z, .zip.
As you can see, PDF is not one of them. BUT there is a notes.ini entry that one can set to add special types to that blacklist or to replace it:
To define your own list of attachment types to allow for full-text indexing, add the following notes.ini setting to a Domino server or Notes client:
FT_USE_MY_ATTACHMENT_WHITE_LIST=1 ...Configure which file types to allow on all databases.
FT_INDEX_FILTER_ATTACHMENT_TYPES=*.format,*.format where format is a file format. Use a comma between formats.
It might be that one of your admins set that ini parameter in a way that excludes PDF files.
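To check a copy of the server's notes.ini for this, something like the sketch below could help. It assumes the semantics of the quoted documentation, namely that with FT_USE_MY_ATTACHMENT_WHITE_LIST=1 only the types listed in FT_INDEX_FILTER_ATTACHMENT_TYPES are indexed. The function names and the sample ini text are hypothetical, and the parsing is deliberately simple.

```python
def parse_ini(text):
    """Collect KEY=value pairs from notes.ini text; other lines are skipped."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if "=" not in line:
            continue  # section headers such as [Notes], blank lines, etc.
        key, _, value = line.partition("=")
        settings[key.strip().upper()] = value.strip()
    return settings


def pdf_allowed(settings):
    """Return None if no custom white list is active, else True or False.

    Assumption: when the white list is active, a type is indexed only if it
    appears in FT_INDEX_FILTER_ATTACHMENT_TYPES as "*.ext".
    """
    if settings.get("FT_USE_MY_ATTACHMENT_WHITE_LIST") != "1":
        return None
    types = settings.get("FT_INDEX_FILTER_ATTACHMENT_TYPES", "")
    allowed = {t.strip().lower() for t in types.split(",") if t.strip()}
    return "*.pdf" in allowed


# Hypothetical notes.ini excerpt for illustration only.
example = """[Notes]
FT_USE_MY_ATTACHMENT_WHITE_LIST=1
FT_INDEX_FILTER_ATTACHMENT_TYPES=*.doc,*.docx
"""
print(pdf_allowed(parse_ini(example)))
```

A result of `None` means the custom list is not in use, so the default behaviour described above would apply.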

-
Got a copy of the Notes.ini for that server from the admins and checked it. I don't see anything that would restrict the indexing. We also tested this locally, and the search still did not find any of the PDFs that it should have. – user560944 Feb 24 '20 at 13:32
-
This server is DAOSed. Does that make a difference? – user560944 Feb 24 '20 at 13:37
-
Additionally, I had my admin delete the index and re-create it. Still no PDFs in searches. – user560944 Feb 24 '20 at 13:38
-
Ummm... The documentation for Tika 1.18 says that it does include PDF (https://tika.apache.org/1.18/formats.html#Portable_Document_Format), and the list above from IBM is the blacklist of types that Tika supports but Notes/Domino does not index. Since PDF is not on the blacklist, it should be indexed, and there should be no need to override it with the notes.ini settings. Am I missing something? – rhsatrhs Feb 25 '20 at 18:15