
I have a fairly common task: I have a few thousand websites and need to parse as many of them as possible (in an adequate manner, of course).

First, I made a stormcrawlerfight-like configuration using the JSoup parser. Throughput was pretty good and very stable, about 8k fetches per minute.

Then I wanted to add the ability to parse PDF/doc/etc., so I added the Tika parser to handle non-HTML documents. But now I see metrics like this: [screenshot of fetch metrics]

So sometimes there are good minutes, and sometimes it drops to hundreds per minute. When I remove the Tika stream records, everything goes back to normal. The question, in general, is how to find the reason for this behavior, i.e. the bottleneck. Maybe I am missing some setting?

Here is what I see for the crawler topology in the Storm UI: [screenshot of Storm UI stats]

es-injector.flux:

name: "injector"

includes:
- resource: true
  file: "/crawler-default.yaml"
  override: false

- resource: false
  file: "crawler-custom-conf.yaml"
  override: true

- resource: false
  file: "es-conf.yaml"
  override: true

spouts:
  - id: "spout"
    className: "com.digitalpebble.stormcrawler.spout.FileSpout"
    parallelism: 1
    constructorArgs:
      - "."
      - "feeds.txt"
      - true

bolts:
  - id: "status"
    className: "com.digitalpebble.stormcrawler.elasticsearch.persistence.StatusUpdaterBol    t"
    parallelism: 1

streams:
  - from: "spout"
    to: "status"
    grouping:
      type: CUSTOM
      customClass:
        className: "com.digitalpebble.stormcrawler.util.URLStreamGrouping"
        constructorArgs:
          - "byHost"
      streamId: "status"

es-crawler.flux:

name: "crawler"

includes:
- resource: true
  file: "/crawler-default.yaml"
  override: false

- resource: false
  file: "crawler-custom-conf.yaml"
  override: true

- resource: false
  file: "es-conf.yaml"
  override: true

spouts:
  - id: "spout"
    className: "com.digitalpebble.stormcrawler.elasticsearch.persistence.AggregationSpout"
    parallelism: 10

bolts:
  - id: "partitioner"
    className: "com.digitalpebble.stormcrawler.bolt.URLPartitionerBolt"
    parallelism: 1

  - id: "fetcher"
    className: "com.digitalpebble.stormcrawler.bolt.FetcherBolt"
    parallelism: 1

  - id: "sitemap"
    className: "com.digitalpebble.stormcrawler.bolt.SiteMapParserBolt"
    parallelism: 1

  - id: "parse"
    className: "com.digitalpebble.stormcrawler.bolt.JSoupParserBolt"
    parallelism: 5

  - id: "index"
    className: "com.digitalpebble.stormcrawler.elasticsearch.bolt.IndexerBolt"
    parallelism: 1

  - id: "status"
    className: "com.digitalpebble.stormcrawler.elasticsearch.persistence.StatusUpdaterBolt"
    parallelism: 4

  - id: "status_metrics"
    className: "com.digitalpebble.stormcrawler.elasticsearch.metrics.StatusMetricsBolt"
    parallelism: 1

  - id: "redirection_bolt"
    className: "com.digitalpebble.stormcrawler.tika.RedirectionBolt"
    parallelism: 1

  - id: "parser_bolt"
    className: "com.digitalpebble.stormcrawler.tika.ParserBolt"
    parallelism: 1

streams:
  - from: "spout"
    to: "partitioner"
    grouping:
      type: SHUFFLE

  - from: "spout"
    to: "status_metrics"
    grouping:
      type: SHUFFLE     

  - from: "partitioner"
    to: "fetcher"
    grouping:
      type: FIELDS
      args: ["key"]

  - from: "fetcher"
    to: "sitemap"
    grouping:
      type: LOCAL_OR_SHUFFLE

  - from: "sitemap"
    to: "parse"
    grouping:
      type: LOCAL_OR_SHUFFLE

  # This is not needed as long as redirect_bolt is sending html content to index?
  # - from: "parse"
  #   to: "index"
  #   grouping:
  #     type: LOCAL_OR_SHUFFLE

  - from: "fetcher"
    to: "status"
    grouping:
      type: FIELDS
      args: ["url"]
      streamId: "status"

  - from: "sitemap"
    to: "status"
    grouping:
      type: FIELDS
      args: ["url"]
      streamId: "status"

  - from: "parse"
    to: "status"
    grouping:
      type: FIELDS
      args: ["url"]
      streamId: "status"

  - from: "index"
    to: "status"
    grouping:
      type: FIELDS
      args: ["url"]
      streamId: "status"

  - from: "parse"
    to: "redirection_bolt"
    grouping:
      type: LOCAL_OR_SHUFFLE

  - from: "redirection_bolt"
    to: "parser_bolt"
    grouping:
      type: LOCAL_OR_SHUFFLE
      streamId: "tika"

  - from: "redirection_bolt"
    to: "index"
    grouping:
      type: LOCAL_OR_SHUFFLE

  - from: "parser_bolt"
    to: "index"
    grouping:
      type: LOCAL_OR_SHUFFLE

Update: I have found that I am getting OutOfMemory errors in workers.log, even though I have set workers.heap.size to 4 GB; the worker process grows to 10-15 GB.
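For reference, a minimal sketch of the standard Storm settings that cap the worker heap (the values are only illustrative, not the exact configuration used here):

# storm.yaml - illustrative values
worker.heap.memory.mb: 4096        # heap budget per worker JVM, in MB
worker.childopts: "-Xmx4g"         # JVM options passed to each worker process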

Update 2: After I limited memory usage I see no OutOfMemory errors, but performance is very low. [metrics screenshots]

Without Tika I see 15k fetches per minute. With Tika, after a few high bars, it drops to only hundreds per minute.

And I see this in the worker log: https://paste.ubuntu.com/p/WKBTBf8HMV/

CPU usage is very high, but there is nothing of interest in the log.

  • logs like "Incorrect mimetype - passing on : https://www.alliedmotion.com/wp-content/uploads/documents/Allied_Motion_Dimensions-MF0255XF.pdf" are nothing to be worried about, it simply means that the Jsoup parser is getting a non-html document to parse and passes it on to the Tika parser. – Julien Nioche Mar 11 '19 at 08:58
  • "Could not find unacked tuple for ..." -> not a massive issue, see https://github.com/DigitalPebble/storm-crawler/issues/689 – Julien Nioche Mar 11 '19 at 09:00
  • the Storm UI suggests there is no obvious bottleneck, judging by the logs the Fetcher doesn't have much work to do. Maybe look at the metrics for the spouts, see how long the queries are taking? It could be that with the increased load on the CPU from the Tika parsing, your machine is maxed on CPU and ES is struggling to return the results in a timely fashion – Julien Nioche Mar 11 '19 at 09:03
  • @Julien, the problem is that there are millions of URLs waiting. When I disable Tika and restart the crawler, I get full CPU load but 15k requests per minute; with Tika I get sometimes low CPU, sometimes high CPU load, but 1/10 of the non-Tika speed. I would expect Tika to parse only pdf/doc documents. So is there such a big impact on speed just because of Tika? – elgato Mar 11 '19 at 18:10
  • it would appear so, but I am surprised that it is not reflected in the capacity metrics. Maybe turn the log level to DEBUG and see if something interesting can be found there – Julien Nioche Mar 11 '19 at 21:14

2 Answers


As you can see in the stats on the UI, the Tika parser bolt is the bottleneck: it has a capacity of 1.6 (a value > 1 means that it can't process the inputs fast enough). This should improve if you give it the same parallelism as the JSOUP parser i.e. 4 or more.
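For instance, in es-crawler.flux above, the Tika bolt's definition could be changed along these lines (a sketch; the exact figure is something to tune against your hardware):

  - id: "parser_bolt"
    className: "com.digitalpebble.stormcrawler.tika.ParserBolt"
    parallelism: 5   # was 1; match the JSoupParserBolt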

– Julien Nioche
  • Thanks. I have noticed that I get out of memory errors in worker.log... the Java process grows too large while running... – elgato Mar 09 '19 at 15:48

Late reply but might be useful to others.

What happens when using Tika in open crawls is that the Tika parser gets everything the JSoupParserBolt didn't handle: zips, images, videos, etc. These URLs tend to be very heavy and slow to process, so the incoming tuples quickly back up in the internal queue until the memory explodes.

I have just committed Set mimetype whitelist for Tika Parser #712, which allows you to define a set of regular expressions that will be tried against the content type of the document. If there is a match, the document is processed; if not, the tuple is sent to the STATUS stream as an error.

You can configure the whitelist like so:

parser.mimetype.whitelist:
  - application/.+word.*
  - application/.+excel.*
  - application/.+powerpoint.*
  - application/.*pdf.*
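Assuming the setup shown in the question, these lines would typically go into crawler-custom-conf.yaml (already included with override: true by both flux files), so the whitelist takes effect the next time the crawler topology is deployed.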

This should make your topology a lot faster and more stable. Let me know how it goes.

– Julien Nioche