
I'm indexing a large HTML page with Solr Cell, using a curl command from a Windows command prompt:

curl http://localhost:8987/solr/myexample/update/extract -d @test.html -H 'Content-type:html'

I have found that text is missing from my fields when I query them (query?q=*:*&q.op=OR&indent=true) in the Solr admin UI. For example, the page has many lorem ipsum <p> tags, but a paragraph near the end of the page containing "Hello world" does not show up in the Solr admin UI.

I found the following on the old wiki.

Large individual fields.

It is possible to store megabytes of text in one record. These fields are clumsy to work with. By default the number of characters stored is clipped.

It does not explain how to prevent the text from being clipped. I'm also not sure clipping is what's causing the issue, because my field is cut off long before it holds megabytes of data.

schema.xml

    <field name="main" type="text_general" indexed="true" stored="true"/>
    <field name="div" type="text_general" indexed="true" stored="true"/>
    <field name="doc_id" type="string" uninvertible="true" indexed="true" stored="true"/>
    <field name="date_pub" type="pdate" uninvertible="true" indexed="true" stored="true"/>
    <field name="p" type="text_general" uninvertible="true" indexed="true" stored="true"/>
    <field name="_text_" type="text_general" indexed="true" stored="true" multiValued="true"/>
    <copyField source="*" dest="_text_"/>

solrconfig.xml

  <requestHandler name="/update/extract"
    class="org.apache.solr.handler.extraction.ExtractingRequestHandler">
    <lst name="defaults">
      <str name="lowernames">true</str>
      <str name="uprefix">ignored_</str>
      <str name="fmap.content">content</str>
      <str name="capture">div</str>
      <str name="fmap.div">div</str>
      <str name="capture">h1</str>
      <str name="fmap.h1">h1</str>
      <str name="capture">h2</str>
      <str name="fmap.h2">h2_t</str>
      <str name="capture">p</str>
      <str name="fmap.p">p</str>
    </lst>
  </requestHandler>

Solr Version: 8.10.1

ImTrying
  • It sounds like you're saying that there is text in the HTML documents that you are indexing, and that you are not finding that text when you search on it. Is that correct? My first guess is that the HTML document you're giving to Solr is somehow malformed or invalid. – Andy Lester Jan 31 '22 at 23:21
  • @AndyLester That's exactly the issue. I'll take a look at my HTML, run it through some validators maybe make a new one. – ImTrying Jan 31 '22 at 23:36
  • @AndyLester no errors using the w3c markup validation. Just a plain old HTML page with 50000 characters. It's weird it'll get most of the text up to a certain point and everything after is not added. Works perfectly on smaller HTML pages. – ImTrying Feb 01 '22 at 00:29
  • Can you post the HTML page somewhere? Like in a GitHub gist? I'm curious to look at the markup. Maybe there's markup that the ExtractingRequestHandler doesn't recognize. Maybe try removing certain types of tags one at a time and reindexing and see what happens. – Andy Lester Feb 01 '22 at 15:38
  • I don't know what else to look at. I suggest you head over to the Solr Slack at https://communityinviter.com/apps/apachesolr/apache-solr and ask the folks over there. – Andy Lester Feb 01 '22 at 19:54
  • @AndyLester I'll take a look. Thanks! – ImTrying Feb 01 '22 at 21:22
  • @AndyLester If you're curious I have posted the solution. – ImTrying Feb 01 '22 at 23:11

1 Answer


Solr Cell doesn't seem to limit the number of characters. The culprit, and don't ask me why, was the curl command I was using:

curl http://localhost:8987/solr/myexample/update/extract -d @test.html -H 'Content-type:html'

Solution: The following command indexes all of the text without truncating it (replace the paths with the locations of your post.jar and HTML file):

java -jar -Dc=myexample -Dauto example\exampledocs\post.jar example\exampledocs\sample.html

Worth noting: these are Windows commands for the Command Prompt.
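
A possible explanation, which this thread does not confirm: curl's -d option strips carriage returns and newlines when it reads a file and defaults to a form-encoded content type, whereas --data-binary sends the file unchanged. In addition, the Windows Command Prompt does not treat single quotes as quoting characters, so 'Content-type:html' reaches curl with the quotes included. As a minimal sketch of posting the raw bytes with an explicit text/html content type, without any shell quoting involved, here is a Python version that uses only the standard library. The function names, the `id_field` default of `id`, and the document ID are assumptions for illustration; set `id_field` to whatever your schema's uniqueKey is.

```python
import urllib.parse
import urllib.request
from pathlib import Path


def build_extract_request(solr_url, core, html_path, doc_id, id_field="id"):
    """Build (but do not send) a POST request for Solr's /update/extract handler.

    id_field is an assumption: it must match the uniqueKey of your schema.
    """
    # Read the file as raw bytes so newlines and all other characters are preserved.
    body = Path(html_path).read_bytes()
    # literal.<field> sets a field value on the extracted document;
    # commit=true makes the document visible to searches right away.
    params = urllib.parse.urlencode({f"literal.{id_field}": doc_id, "commit": "true"})
    url = f"{solr_url}/{core}/update/extract?{params}"
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "text/html"},
        method="POST",
    )


def post_html(solr_url, core, html_path, doc_id, id_field="id"):
    """Send the request and return the HTTP status and the response body."""
    request = build_extract_request(solr_url, core, html_path, doc_id, id_field)
    with urllib.request.urlopen(request) as response:
        return response.status, response.read().decode("utf-8")
```

With the port and core name from the question, a call would look like post_html("http://localhost:8987/solr", "myexample", "test.html", "doc1"), where "doc1" is a placeholder ID. If the full text shows up this way but not with the original curl command, that points at how curl sent the body rather than at a Solr Cell limit.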

ImTrying