I have a file-system data source, and I have created a data config for it so that I can run the DIH (DataImportHandler). The data config is:

<?xml version="1.0" encoding="UTF-8"?>
<dataConfig>
    <dataSource type="FileDataSource" />
    <document>
        <entity name="pdf" processor="FileListEntityProcessor" baseDir="/path/to/my/pdf" fileName=".*pdf" newerThan="'NOW-3DAYS'" recursive="true" rootEntity="false" dataSource="pdf">
        </entity>
    </document>
</dataConfig>

When I run the DIH, it reports:
Indexing completed. Added/Updated: 0 documents. Deleted 0 documents.
Requests: 0, Fetched: 35924, Skipped: 0, Processed: 0

Any idea why it didn't process any document?

Alaa

2 Answers

You don't have a root entity in your config; you have only one entity, and it has rootEntity="false", so no documents are created from it.

You will also need to define some "field" lines inside your entity to map the file information to the fields in your schema. This question, "indexing all documents in doc folder in to solr FileListEntityProcessor", does something similar to what you need.
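As a minimal sketch of the two points above (not a tested config): the entity below leaves `rootEntity` at its default, so each listed file becomes one Solr document, and it maps the implicit columns that FileListEntityProcessor emits (`fileAbsolutePath`, `fileSize`, `fileLastModified`) to schema fields. The schema field names `size` and `lastModified` are assumptions for illustration; use whatever your schema defines. Note that this would index file metadata only, not the PDF contents.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<dataConfig>
    <document>
        <!-- rootEntity is not set, so it defaults to true: one document per listed file -->
        <entity name="pdf" processor="FileListEntityProcessor" baseDir="/path/to/my/pdf" fileName=".*pdf" recursive="true" dataSource="null">
            <!-- column = implicit field from FileListEntityProcessor; name = schema field (assumed names) -->
            <field column="fileAbsolutePath" name="id" />
            <field column="fileSize" name="size" />
            <field column="fileLastModified" name="lastModified" />
        </entity>
    </document>
</dataConfig>
```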

Yann
  • Thanks Yann. Rajesh, in the link you provided, is struggling with the same thing and didn't get an answer either :(. Regarding rootEntity, the documentation says: "By default the entities falling under the document are root entities. If it is set to false, the entity directly falling under that entity will be treated as the root entity." Could you please share a dataconfig example for a file-system data source if you have one? – Alaa Jan 28 '15 at 12:40
  • I don't have anything that looks exactly like your case (I load CSV files, which I then process via a script). One more thing I see in your script: you didn't name your data source (in the datasource tag), and in the entity (entity tag) you refer to a "pdf" datasource, which is also the name of the entity, which doesn't seem right? – Yann Jan 28 '15 at 13:14
  • e.g.: `` and `` – Yann Jan 28 '15 at 13:16
  • Hi Alaa - did it help? – Yann Jan 29 '15 at 12:19
  • Unfortunately no. I have already found the answer and answered my own question. Thanks a lot for your attention. – Alaa Jan 29 '15 at 18:51
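The naming point raised in the comments could be sketched roughly as follows. This is only an illustration: the name `pdfSource` is hypothetical, and the rest is the config from the question. The `name` on the `dataSource` tag and the `dataSource` attribute on the entity refer to the same identifier, which is distinct from the entity's own name. As the last comment indicates, this change alone did not resolve the problem.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<dataConfig>
    <!-- "pdfSource" is a hypothetical name chosen for illustration -->
    <dataSource name="pdfSource" type="FileDataSource" />
    <document>
        <!-- the entity keeps its own name ("pdf") and refers to the data source by the data source's name -->
        <entity name="pdf" processor="FileListEntityProcessor" baseDir="/path/to/my/pdf" fileName=".*pdf" newerThan="'NOW-3DAYS'" recursive="true" rootEntity="false" dataSource="pdfSource">
        </entity>
    </document>
</dataConfig>
```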

Thanks, I got it working. Below is the data config that was needed:

<?xml version="1.0" encoding="UTF-8"?> 
<dataConfig> 
    <dataSource type="BinFileDataSource" /> 
    <document> 
        <entity name="pdf" processor="FileListEntityProcessor" baseDir="/path/to/my/pdf" fileName=".*pdf" newerThan="'NOW-3DAYS'" recursive="true" rootEntity="false" dataSource="null"> 
            <field column="fileAbsolutePath" name="id" /> 
            <entity name="documentImport" processor="TikaEntityProcessor" url="${pdf.fileAbsolutePath}" format="text"> 
                <field column="text" name="text"/> 
            </entity> 
        </entity> 
    </document> 
</dataConfig>
Alaa