Hi all. Is it possible to search inside PDF and Word files by passing their paths in the XML documents sent to Solr? The XML file would look something like this:

<doc>
    <field name="id">1</field>
    <field name="name">A</field>
    <field name="sk">Acce</field>
    <field name="level">Beginner</field>
    <field name="do">Tuto</field>
    <field name="open">1</field>
    <field name="type">Ct</field>
    <field name="extensis">cl_ex</field>
    <field name="features">Atos</field>
    <field name="downl"></field>
    <field name="source">Atoms</field>
    <field name="description">Ths.</field>
    <field name="file_path">http://www.abcd.com/files/abcd.pdf</field>

  </doc>

  <doc>
    <field name="id">2</field>
    <field name="name">Ar</field>
    <field name="sk">Acrce</field>
    <field name="level">Beginner</field>
    <field name="do">Tuto1</field>
    <field name="open">11</field>
    <field name="type">C1t</field>
    <field name="extensis">cl_exd</field>
    <field name="features">Atos</field>
    <field name="downl"></field>
    <field name="source">ddddd</field>
    <field name="description">Thsdd.</field>
    <field name="file_path">http://www.abcd.com/files/abcd.pdf</field>

  </doc>

So if I search for the words "solr word" with a Solr query, the search should cover not only the document fields but also the contents of the files referenced by file_path. Any suggestions or assistance would be helpful.

Indudhara Gs
  • Here's an example of using the extracting request handler: http://stackoverflow.com/questions/9558526/indexing-multiple-documents-and-mapping-to-unique-solr-id/9567536#9567536 . You upload the **file** itself to Solr. – Jesvin Jose Nov 16 '13 at 13:24

1 Answer

Not that I know of.

But it is possible via another route. You can use Apache Tika to extract the text from the PDF/Word files and then index that text, which lets you search "within" the documents.

Sample implementation:

pdf -> tika

tika -> text from pdf

text from pdf && filepath -> solr doc

search solr -> returns doc with filepath if search matches contents of file
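
The steps above could be sketched roughly as follows. This is only an illustration, not a tested integration: it assumes the text has already been extracted from the file (for example by Tika), and the field name `file_content` is a hypothetical one that would have to exist in your schema as an indexed text field. The helper `build_solr_add_xml` is also made up for this example; it just builds the XML update message with the Python standard library.

```python
import xml.etree.ElementTree as ET


def build_solr_add_xml(fields, file_text, text_field="file_content"):
    """Build a Solr <add> XML message for one document.

    fields     -- dict of metadata fields from the question (id, name, file_path, ...)
    file_text  -- plain text already extracted from the PDF/Word file (e.g. by Tika)
    text_field -- hypothetical field name; it must be defined in your schema
    """
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    # Copy the metadata fields as-is so they can be returned with search results.
    for name, value in fields.items():
        field = ET.SubElement(doc, "field", name=name)
        field.text = str(value)
    # Store the extracted file text next to the metadata so a query can match it.
    content = ET.SubElement(doc, "field", name=text_field)
    content.text = file_text
    return ET.tostring(add, encoding="unicode")


xml_message = build_solr_add_xml(
    {"id": 1, "name": "A", "file_path": "http://www.abcd.com/files/abcd.pdf"},
    "text extracted from abcd.pdf mentioning solr word",
)
print(xml_message)
```

The resulting message would then be posted to Solr's update handler. A query that matches the extracted text returns the whole document, including `file_path` and the other stored fields.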

Rahul Shardha
  • Do you mean that I have to index each file with a document id using Tika, and that a search will then return only the matching files and their corresponding paths? And that I cannot achieve this via the XML file? In my case I expected the file path (to download the file) to be displayed along with field details such as name, description, and the others. – Indudhara Gs Nov 14 '13 at 14:57
  • You can do what you just described. What I was giving was a sample implementation. You can return as many fields as you want with as many results as you want (given that your docs match the query). – Rahul Shardha Nov 14 '13 at 15:06