Indexing PostgreSQL BYTEA data with Solr & Tika

Question

I have previously indexed rich documents stored in a BLOB column in Oracle using Tika & Solr. Now I'm trying to do the same thing with a PostgreSQL (9.5.1) database and Solr (5.5.0), and I cannot get it to work. I've searched extensively and found nothing specifically about BYTEA columns, Tika & Solr.

I suspect that my data source is misconfigured, but I've tried every data source type without success.

In the PostgreSQL database, I have a table called "attachment" with a column called "media" of type BYTEA. There are rich documents stored in the column (e.g., Word docs, JPGs, RTFs, etc.).

This is the relevant portion of data-config.xml:

<dataSource name="f1" type="FieldStreamDataSource"/>
<dataSource name="db" type="JdbcDataSource" driver="org.postgresql.Driver" 
  url="jdbc:postgresql://<ip_address>/<db_name>" 
  user="<username>" password="<password>"/>
<document>
   <entity name="attachment" dataSource="db" query="select * from attachment">
     <entity name="blob" dataSource="f1" processor="TikaEntityProcessor" url="media" dataField="attachment.MEDIA" format="text" onError="continue">
       <field column="text" name="body" />
     </entity>
   </entity>
</document>

In my solrconfig.xml, the required libraries appear to be included correctly, since Solr does not complain about any missing classes.
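
The lib directives themselves are not shown here. As a rough sketch, in a stock Solr 5.x install they typically look something like the following; the relative paths assume the default directory layout and are not copied from this setup:

```xml
<!-- Assumed default-layout paths; adjust to the actual install. -->
<!-- Tika and its dependencies (used by TikaEntityProcessor) -->
<lib dir="${solr.install.dir:../../../..}/contrib/extraction/lib" regex=".*\.jar" />
<!-- DataImportHandler and its extras jar -->
<lib dir="${solr.install.dir:../../../..}/dist/" regex="solr-dataimporthandler-.*\.jar" />
<!-- The PostgreSQL JDBC driver jar must also be loadable; its location depends on the setup. -->
```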

In the managed-schema file, I have this section.

<field name="body" type="text_general" indexed="true" stored="true"/>

<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishMinimalStemFilterFactory"/>
    <filter class="solr.EnglishPossessiveFilterFactory"/>
  </analyzer>
</fieldType>

Solr starts fine. When I run a data import, Solr reports:

Indexing completed. Added/Updated: 226 documents. Deleted 0 documents. (Duration 02s)
Requests: 1(2/s), Fetched: 226(113/s), Skipped: 0, Processed: 226(113/s)

However, none of the text from the rich documents ends up in the index. When I go to the Logging tab, I see these errors (one for each row of the attachment table).

Thursday, March 02, 2017 3:17:45 PM  ERROR  null  EntityProcessorWrapper  Exception in entity : blob:java.lang.RuntimeException: unsupported type : class java.lang.String

I have also tried changing data-config.xml so that the f1 dataSource is of type FieldReaderDataSource.
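
That change presumably touches only the f1 line. A reconstruction, assuming every other element and attribute stays exactly as in the config above:

```xml
<!-- Reconstructed: only the type attribute differs from the original f1 definition. -->
<dataSource name="f1" type="FieldReaderDataSource"/>
```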

However, I still get errors. This is the error I see with FieldReaderDataSource (one for each row in the attachment table).

Thursday, March 02, 2017 4:09:19 PM  ERROR  null  EntityProcessorWrapper  Exception in entity : blob:java.lang.ClassCastException: java.io.StringReader cannot be cast to java.io.InputStream

Any ideas what I'm doing wrong?

@BhaumikThakkar I never did get it to work. We ended up moving to a different project, so it didn't matter. At some point, someone may start that project again. Maybe a newer version will solve it. - DarkerIvy, Dec 12 '17 at 22:33
