I have previously indexed rich documents stored in a BLOB column in Oracle using Tika and Solr without problems. Now I'm trying to do the same thing with a PostgreSQL (9.5.1) database and Solr (5.5.0), and I cannot get it to work. I've Googled a lot and found nothing that specifically covers BYTEA columns with Tika and Solr.
I suspect that my data source is misconfigured, but I've tried every data source type with no success.
In the PostgreSQL database, I have a table called "attachment" with a column called "media" of type BYTEA. There are rich documents stored in the column (e.g., Word docs, JPGs, RTFs, etc.).
This is the relevant portion of data-config.xml.
<dataSource name="f1" type="FieldStreamDataSource"/>
<dataSource name="db" type="JdbcDataSource" driver="org.postgresql.Driver"
            url="jdbc:postgresql://<ip_address>/<db_name>"
            user="<username>" password="<password>"/>
<document>
  <entity name="attachment" dataSource="db" query="select * from attachment">
    <entity name="blob" dataSource="f1" processor="TikaEntityProcessor"
            url="media" dataField="attachment.MEDIA" format="text" onError="continue">
      <field column="text" name="body" />
    </entity>
  </entity>
</document>
My solrconfig.xml appears to include all the required libraries, since Solr does not complain about any missing classes.
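For context, library includes of this kind usually look something like the sketch below. This is an illustration rather than a copy of my file: the first three `<lib>` lines follow the stock Solr 5.x example configs, the JDBC driver directory (`/path/to/jdbc/`) is a hypothetical placeholder, and the `/dataimport` handler is shown only as an assumption about how data-config.xml is wired in.

```xml
<!-- Sketch only: paths assume the default Solr 5.x install layout. -->

<!-- DataImportHandler jars (the "extras" jar contains TikaEntityProcessor) -->
<lib dir="${solr.install.dir:../../../..}/dist/" regex="solr-dataimporthandler-.*\.jar" />

<!-- Tika and its parser dependencies, shipped with the extraction contrib -->
<lib dir="${solr.install.dir:../../../..}/contrib/extraction/lib" regex=".*\.jar" />
<lib dir="${solr.install.dir:../../../..}/dist/" regex="solr-cell-\d.*\.jar" />

<!-- PostgreSQL JDBC driver; the directory is a hypothetical placeholder -->
<lib dir="/path/to/jdbc/" regex="postgresql-.*\.jar" />

<!-- Assumed handler registration pointing at data-config.xml -->
<requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
    <str name="config">data-config.xml</str>
  </lst>
</requestHandler>
```

If one of these jars were missing, I would expect a class-loading error rather than the per-row exceptions described further down, which is why I don't think the libraries are the problem.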
The managed-schema file contains this section.
<field name="body" type="text_general" indexed="true" stored="true"/>
<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishMinimalStemFilterFactory"/>
    <filter class="solr.EnglishPossessiveFilterFactory"/>
  </analyzer>
</fieldType>
Solr starts fine. When I run a data import, Solr reports:
Indexing completed. Added/Updated: 226 documents. Deleted 0 documents. (Duration 02s)
Requests: 1(2/s), Fetched: 226(113/s), Skipped: 0, Processed: 226(113/s)
However, none of the text from the rich documents ends up in the index. When I go to the Logging tab, I see these errors (one for each row of the attachment table).
Thursday, March 02, 2017 3:17:45 PM ERROR null EntityProcessorWrapper Exception in entity : blob:java.lang.RuntimeException: unsupported type : class java.lang.String
I have also tried changing data-config.xml so that the f1 dataSource is of type FieldReaderDataSource:
<dataSource name="f1" type="FieldReaderDataSource"/>
However, I still get errors, just different ones. This is the error I see with FieldReaderDataSource (again, one for each row in the attachment table).
Thursday, March 02, 2017 4:09:19 PM ERROR null EntityProcessorWrapper Exception in entity : blob:java.lang.ClassCastException: java.io.StringReader cannot be cast to java.io.InputStream
Any ideas what I'm doing wrong?