
I would like to upload a file (an MS Word document, for instance) to Solr, but I would also like to add my own fields to the upload, such as the userId of the person who uploaded it or a number of tags. The content of the file must be parsed and searchable, and the extra parameters should be added as fields. Therefore I have added the following definition in schema.xml:

<?xml version="1.0" encoding="UTF-8" ?>
<schema name="example" version="1.1">
  <types>
   <fieldType name="string" class="solr.StrField" sortMissingLast="true" omitNorms="true"/>
   <fieldType name="date" class="solr.TrieDateField" precisionStep="0" positionIncrementGap="0"/>
    <!-- A general text field that has reasonable, generic
         cross-language defaults: it tokenizes with StandardTokenizer,
     removes stop words from case-insensitive "stopwords.txt"
     (empty by default), and down cases.  At query time only, it
     also applies synonyms. -->
    <fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
        <!-- in this example, we will only use synonyms at query time
        <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
        -->
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>
 </types>


 <fields>
    <field name="documentId" type="string" indexed="true" stored="true" multiValued="false" required="true"/>
    <field name="text" type="string" indexed="true" stored="false" multiValued="true"/>
    <dynamicField name="metadata_*" type="text_general" indexed="true" stored="true" multiValued="true"/>
 </fields>

 <uniqueKey>documentId</uniqueKey>
 <defaultSearchField>text</defaultSearchField>
 <solrQueryParser defaultOperator="AND"/>

</schema>

The relevant part of my solrconfig.xml now looks like this:

  <requestHandler name="/update/extract" 
                startup="lazy"
                class="solr.extraction.ExtractingRequestHandler">
 <lst name="defaults">
   <str name="fmap.content">text</str>
   <str name="lowernames">true</str>
   <str name="fmap.documentId">documentId</str>
   <!-- also tried with
   <str name="fmap.literal.documentId">documentId</str>
   and
   <str name="literal.documentId">documentId</str>
   -->
   <str name="uprefix">metadata_</str>

   <!-- capture link hrefs but ignore div attributes -->
   <str name="captureAttr">true</str>
   <str name="fmap.a">links</str>
   <str name="fmap.div">ignored_</str>
  </lst>
  </requestHandler>

However, no matter which combination I try with this command:

java -Durl=http://localhost:9090/solr/update/extract?documentId=test -jar post.jar somedoc.pdf

or

java -Durl=http://localhost:9090/solr/update/extract?literal.documentId=test -jar post.jar somedoc.pdf

I keep getting a "missing required field" error for documentId.

Regards Ronald

Ronald

2 Answers


The reason you have 0 docs is probably that you are not specifying documentId (or any other required field, for that matter), so indexing fails on that (check the logs).
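As a rough client-side sanity check, the sketch below lists the fields marked `required="true"` in a schema and reports which of them have no matching `literal.<field>` request parameter. The `SCHEMA` string is a trimmed copy of the schema from the question, and the helper names `required_fields` and `missing_literals` are made up for this illustration. It only inspects the request parameters and says nothing about what Solr does with them internally.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the schema from the question (assumption: only the
# <field> elements matter for this check).
SCHEMA = """<schema name="example" version="1.1">
  <fields>
    <field name="documentId" type="string" indexed="true" stored="true" required="true"/>
    <field name="text" type="string" indexed="true" stored="false" multiValued="true"/>
  </fields>
</schema>"""


def required_fields(schema_xml):
    """Return the names of all <field> elements marked required="true"."""
    root = ET.fromstring(schema_xml)
    return [f.get("name") for f in root.iter("field") if f.get("required") == "true"]


def missing_literals(schema_xml, params):
    """Return required fields that have no literal.<name> request parameter."""
    prefix = "literal."
    supplied = {key[len(prefix):] for key in params if key.startswith(prefix)}
    return [name for name in required_fields(schema_xml) if name not in supplied]


# First command from the question: the parameter lacks the literal. prefix.
print(missing_literals(SCHEMA, {"documentId": "test"}))
# Second command from the question: the parameter carries the prefix.
print(missing_literals(SCHEMA, {"literal.documentId": "test"}))
```

The first call reports `documentId` as missing because a bare `documentId=test` parameter is not a literal; the second call reports nothing missing.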

Just follow the example: http://wiki.apache.org/solr/ExtractingRequestHandler#Getting_Started_with_the_Solr_Example

To add any field to a document indexed with Tika, you have to use the literal parameter. In your case it might be:

&literal.userId=123&literal.documentId=doc1
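A minimal sketch of how such a request could be assembled from a script, assuming the host, port, and handler path from the question. The helper names `build_extract_url` and `post_document` are hypothetical, and posting the file as a raw request body with a generic content type is an assumption about the setup, not something confirmed in this thread.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_extract_url(base_url, literals):
    """Return the extract URL with one literal.<field> parameter per entry."""
    params = [("literal." + name, value) for name, value in literals.items()]
    return base_url + "?" + urlencode(params)


def post_document(url, path, content_type="application/octet-stream"):
    """POST the raw file bytes to the handler (needs a running Solr instance).

    Assumption: the handler accepts the document as a raw request body.
    """
    with open(path, "rb") as handle:
        request = Request(url, data=handle.read(),
                          headers={"Content-Type": content_type})
    with urlopen(request) as response:
        return response.status


url = build_extract_url(
    "http://localhost:9090/solr/update/extract",  # base URL from the question
    {"userId": "123", "documentId": "doc1"},
)
print(url)
```

Building the query string with `urlencode` keeps values containing spaces or ampersands from breaking the URL, which is easy to get wrong when the command is typed by hand in a shell.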

If you have any other questions, please ask (and possibly add some more details: what your command looks like, errors from the log).

Fuxi
  • thx for your remarks, I updated my question with some more details – Ronald Aug 06 '12 at 14:14
  • What about "java -Durl=http://localhost:9090/solr/update/extract -Dparams=literal.documentId=test" Have you tried curl? – Fuxi Aug 06 '12 at 14:32
  • curl "http://localhost:9090/solr/update/extract?literal.documentId=test&commit-true" -F " file=@voortgang.docx" same error. Using with -Dparams='literal.documentId=test' same error. Sigh, it must be something stupid that I have misconfigured, but what, where? – Ronald Aug 06 '12 at 15:09
  • I was really puzzled why this was not working for you; however, I reproduced your settings and that gave me the same problem. I do not know why it works like that, but once I renamed documentId to something else like "id" or "uid", everything started working again. Good luck! – Fuxi Aug 06 '12 at 18:27
  • well thx for the effort, I will rename it, but have my doubts that the other required fields will work now as well. Will keep you posted – Ronald Aug 07 '12 at 06:54
  • Other fields took literal values as they should in my tests – Fuxi Aug 07 '12 at 08:42
  • Not sure what you meant by the last remark, but I fail to add any other field besides id as a field. Even after making them not required they still are not added (of course, since it is apparently skipped by Solr Cell). I do see that the content of the file is searchable, so I might have to resort to doing a dealerNumber/userid search at the database level in combination with a content search, although I would have liked Solr to handle this as well – Ronald Aug 07 '12 at 09:18
  • let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/15001/discussion-between-ronald-and-fuxi) – Ronald Aug 07 '12 at 14:04

I had the same issue, and the problem was the name of my field, "documentId". It turns out there is a problem checking for required fields when the field name ends in "Id" (capital I).

See this other question, which helped me figure it out: Solr - Missing Required Field

I changed my field name to "id" and all is fine now. This really makes no sense and has probably driven a few people completely crazy.

vinnie