I'm trying to index an HTML document using Apache Solr and the TikaEntityProcessor, with the idea being that I can use XPath to select specific elements from the HTML.

I have followed the advanced example shown at the bottom of the TikaEntityProcessor Solr Wiki page.

When I try to complete a data import command, I receive the following error message(s):

03-Oct-2012 16:39:48 org.apache.solr.handler.dataimport.DataImporter doFullImport
INFO: Starting Full Import
03-Oct-2012 16:39:48 org.apache.solr.core.SolrCore execute
INFO: [htmlTest] webapp=/apache-solr-3.6.1 path=/dataimport params={command=full-import} status=0 QTime=31 
03-Oct-2012 16:39:48 org.apache.solr.handler.dataimport.SimplePropertiesWriter readIndexerProperties
INFO: Read dataimport.properties
03-Oct-2012 16:39:48 org.apache.solr.update.DirectUpdateHandler2 deleteAll
INFO: [htmlTest] REMOVING ALL DOCUMENTS FROM INDEX
03-Oct-2012 16:39:48 org.apache.solr.core.SolrDeletionPolicy onInit
INFO: SolrDeletionPolicy.onInit: commits:num=1
    commit{dir=C:\Program Files\Apache Tomcat\conf\apache-solr-3.5.0\htmlTest\data\index,segFN=segments_1e,version=1349187077567,generation=50,filenames=[_u.fnm, _u.nrm, _u.tis, _u.prx, _u.frq, _u.fdx, _u.fdt, _u.tii, segments_1e]
03-Oct-2012 16:39:48 org.apache.solr.core.SolrDeletionPolicy updateCommits
INFO: newest commit = 1349187077567
03-Oct-2012 16:39:48 org.apache.solr.handler.dataimport.SqlEntityProcessor initQuery
SEVERE: The query failed 'null'
java.lang.NullPointerException
    at java.io.File.<init>(File.java:222)
    at org.apache.solr.handler.dataimport.FileDataSource.getFile(FileDataSource.java:96)
    at org.apache.solr.handler.dataimport.BinFileDataSource.getData(BinFileDataSource.java:53)
    at org.apache.solr.handler.dataimport.BinFileDataSource.getData(BinFileDataSource.java:44)
    at org.apache.solr.handler.dataimport.SqlEntityProcessor.initQuery(SqlEntityProcessor.java:59)
    at org.apache.solr.handler.dataimport.SqlEntityProcessor.nextRow(SqlEntityProcessor.java:73)
    at org.apache.solr.handler.dataimport.EntityProcessorWrapper.pullRow(EntityProcessorWrapper.java:330)
    at org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityProcessorWrapper.java:296)
    at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:683)
    at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:709)
    at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:619)
    at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:327)
    at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:225)
    at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:375)
    at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:445)
    at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:426)
03-Oct-2012 16:39:48 org.apache.solr.common.SolrException log
SEVERE: Exception while processing: tika-test document : SolrInputDocument[{text=text(1.0)={<html>

<meta name="Content-Encoding" content="ISO-8859-1">
<meta name="Content-Type" content="text/html">
<title></title>

<body>
    <h1>This is my first heading</h1>


        This is some content


    <h1>This is my second heading</h1>


        This is some more content


</body></html>}}]:org.apache.solr.handler.dataimport.DataImportHandlerException: java.lang.NullPointerException
    at org.apache.solr.handler.dataimport.SqlEntityProcessor.initQuery(SqlEntityProcessor.java:65)
    at org.apache.solr.handler.dataimport.SqlEntityProcessor.nextRow(SqlEntityProcessor.java:73)
    at org.apache.solr.handler.dataimport.EntityProcessorWrapper.pullRow(EntityProcessorWrapper.java:330)
    at org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityProcessorWrapper.java:296)
    at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:683)
    at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:709)
    at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:619)
    at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:327)
    at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:225)
    at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:375)
    at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:445)
    at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:426)
Caused by: java.lang.NullPointerException
    at java.io.File.<init>(File.java:222)
    at org.apache.solr.handler.dataimport.FileDataSource.getFile(FileDataSource.java:96)
    at org.apache.solr.handler.dataimport.BinFileDataSource.getData(BinFileDataSource.java:53)
    at org.apache.solr.handler.dataimport.BinFileDataSource.getData(BinFileDataSource.java:44)
    at org.apache.solr.handler.dataimport.SqlEntityProcessor.initQuery(SqlEntityProcessor.java:59)
    ... 11 more

03-Oct-2012 16:39:48 org.apache.solr.update.processor.LogUpdateProcessor finish
INFO: {deleteByQuery=*:*} 0 31
03-Oct-2012 16:39:48 org.apache.solr.common.SolrException log
SEVERE: Full Import failed:java.lang.RuntimeException: java.lang.RuntimeException: org.apache.solr.handler.dataimport.DataImportHandlerException: java.lang.NullPointerException
    at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:264)
    at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:375)
    at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:445)
    at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:426)
Caused by: java.lang.RuntimeException: org.apache.solr.handler.dataimport.DataImportHandlerException: java.lang.NullPointerException
    at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:621)
    at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:327)
    at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:225)
    ... 3 more
Caused by: org.apache.solr.handler.dataimport.DataImportHandlerException: java.lang.NullPointerException
    at org.apache.solr.handler.dataimport.SqlEntityProcessor.initQuery(SqlEntityProcessor.java:65)
    at org.apache.solr.handler.dataimport.SqlEntityProcessor.nextRow(SqlEntityProcessor.java:73)
    at org.apache.solr.handler.dataimport.EntityProcessorWrapper.pullRow(EntityProcessorWrapper.java:330)
    at org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityProcessorWrapper.java:296)
    at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:683)
    at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:709)
    at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:619)
    ... 5 more
Caused by: java.lang.NullPointerException
    at java.io.File.<init>(File.java:222)
    at org.apache.solr.handler.dataimport.FileDataSource.getFile(FileDataSource.java:96)
    at org.apache.solr.handler.dataimport.BinFileDataSource.getData(BinFileDataSource.java:53)
    at org.apache.solr.handler.dataimport.BinFileDataSource.getData(BinFileDataSource.java:44)
    at org.apache.solr.handler.dataimport.SqlEntityProcessor.initQuery(SqlEntityProcessor.java:59)
    ... 11 more

03-Oct-2012 16:39:48 org.apache.solr.update.DirectUpdateHandler2 rollback
INFO: start rollback
03-Oct-2012 16:39:48 org.apache.solr.update.DirectUpdateHandler2 rollback
INFO: end_rollback

My data import configuration is:

<dataConfig>
    <dataSource type="BinFileDataSource"/>
    <dataSource type="FieldReaderDataSource" name="fld"/> 
    <document>
        <entity name="tika-test" processor="TikaEntityProcessor"
                url="C:/Program Files/Apache Tomcat/conf/apache-solr-3.5.0/htmlTest/data/html_basic.html" format="html">
                <field column="text"/>
                <entity type="XPathEntityProcessor" forEach="/html" dataField="text">
                    <field xpath="//h1"  column="date" />
                </entity>
        </entity>
    </document>
</dataConfig>

And the HTML document Solr is indexing is:

<html>
<head>
</head>
<body>
    <h1>This is my first heading</h1>
    <div>
        This is some content
    </div>
    <h1>This is my second heading</h1>
    <div>
        This is some more content
    </div>
</body>

Sam Delaney
    Just to add some further information, it is understood that the XPathEntityProcessor defaults to a SqlEntityProcessor as its source. For some reason I don't think it can bind to the TikaEntityProcessor (if that's how it works) – Sam Delaney Oct 05 '12 at 08:47

1 Answer

You seem to be missing a reference to the right data source. The entity needs an attribute called dataSource whose value matches the name attribute on the data source definition itself. You defined a data source named fld but never referenced it.

I recommend doing this explicitly for both data sources and the corresponding entities to avoid confusion.
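
As a rough sketch of what that could look like, here is the configuration from the question with both data sources named and referenced explicitly. The names bin and headings are hypothetical, chosen only for illustration. Two further changes are assumptions on my part rather than something confirmed for this setup: the inner entity uses processor= instead of type= (the stack trace shows SqlEntityProcessor, which is the default DataImportHandler falls back to when no processor attribute is set), and dataField is written in entity.field form. Everything else is copied unchanged from the question.

```xml
<dataConfig>
    <!-- "bin" is a hypothetical name; any name works as long as the entity references it -->
    <dataSource type="BinFileDataSource" name="bin"/>
    <dataSource type="FieldReaderDataSource" name="fld"/>
    <document>
        <!-- Outer entity reads the file from disk through the binary data source -->
        <entity name="tika-test" processor="TikaEntityProcessor" dataSource="bin"
                url="C:/Program Files/Apache Tomcat/conf/apache-solr-3.5.0/htmlTest/data/html_basic.html" format="html">
            <field column="text"/>
            <!-- Inner entity reads the "text" field produced above through "fld".
                 Assumptions: processor= replaces type=, and dataField uses the entity.field form.
                 "headings" is a hypothetical entity name. -->
            <entity name="headings" processor="XPathEntityProcessor" dataSource="fld"
                    forEach="/html" dataField="tika-test.text">
                <field xpath="//h1" column="date"/>
            </entity>
        </entity>
    </document>
</dataConfig>
```

If the import still fails after this, the stack trace should at least name XPathEntityProcessor rather than SqlEntityProcessor, which would confirm that the inner entity is now bound to the intended processor and data source.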

Alexandre Rafalovitch