
I have been struggling with this issue for about three months. The crawler seems to fetch pages every 10 minutes and to do nothing in between, and the overall throughput is very low. I am crawling 300 domains in parallel, which with a crawl delay of 10 seconds should give roughly 30 pages/second; currently it is about 2 pages/second.
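For reference, the expected throughput works out as follows (assuming each of the 300 domains is fetched once per 10-second delay):

300 domains / 10 s delay per domain = 30 pages/second (expected)
observed: ~2 pages/second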

The topology runs on a PC with

  • 8GB Memory
  • Standard hard drive (no SSD)
  • Core Duo CPU
  • Ubuntu 16.04

Elasticsearch is installed on another machine with the same specs.

Here you can see the metrics from the Grafana dashboard:

[Grafana dashboard screenshot]

They are also reflected in the process latency seen in the Storm UI:

[Storm UI screenshot]

My current topology setup for StormCrawler is:

spouts:
  - id: "spout"
    className: "com.digitalpebble.stormcrawler.elasticsearch.persistence.AggregationSpout"
    parallelism: 25

bolts:
  - id: "partitioner"
    className: "com.digitalpebble.stormcrawler.bolt.URLPartitionerBolt"
    parallelism: 1
  - id: "fetcher"
    className: "com.digitalpebble.stormcrawler.bolt.FetcherBolt"
    parallelism: 6
  - id: "sitemap"
    className: "com.digitalpebble.stormcrawler.bolt.SiteMapParserBolt"
    parallelism: 1
  - id: "parse"
    className: "com.digitalpebble.stormcrawler.bolt.JSoupParserBolt"
    parallelism: 1
  - id: "index"
    className: "de.hpi.bpStormcrawler.BPIndexerBolt"
    parallelism: 1
  - id: "status"
    className: "com.digitalpebble.stormcrawler.elasticsearch.persistence.StatusUpdaterBolt"
    parallelism: 4
  - id: "status_metrics"
    className: "com.digitalpebble.stormcrawler.elasticsearch.metrics.StatusMetricsBolt"
    parallelism: 1

with the following configuration (only the most relevant part):

config:
  topology.workers: 1
  topology.message.timeout.secs: 300
  topology.max.spout.pending: 100
  topology.debug: false

  fetcher.threads.number: 50

  worker.heap.memory.mb: 2049
  partition.url.mode: byDomain

  fetcher.server.delay: 10.0

and here the Storm configuration (also just the relevant part):

nimbus.childopts: "-Xmx1024m -Djava.net.preferIPv4Stack=true"

ui.childopts: "-Xmx768m -Djava.net.preferIPv4Stack=true"

supervisor.childopts: "-Djava.net.preferIPv4Stack=true"

worker.childopts: "-Xmx1500m -Djava.net.preferIPv4Stack=true"

Do you have any idea what the problem could be? Or is it simply a hardware limitation?

What I have already tried

  • Increased and decreased fetcher.server.delay, which didn't change anything
  • Decreased and increased the number of fetcher threads
  • Played around with the parallelism settings
  • Checked whether network bandwidth is the bottleneck: with an average page size of 0.5 MB, the target rate works out to roughly 15 MB/s, i.e. about 120 Mbit/s of the available 400 Mbit/s, so bandwidth should not be the problem either (see the worked calculation after this list)
  • Increased the number of workers
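The bandwidth estimate referenced above, spelled out (assuming the target rate of 30 pages/second and an average page size of 0.5 MB):

30 pages/s × 0.5 MB/page = 15 MB/s ≈ 120 Mbit/s
available bandwidth: 400 Mbit/s

So the network should have plenty of headroom.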

Do you have any other ideas I should check, or reasons that could explain the slow fetching? Maybe it is just the slow hardware, or is the bottleneck Elasticsearch?

Thank you very much in advance

EDIT:

I changed the topology to two workers and now see a recurring error:

2018-07-03 17:18:46.326 c.d.s.e.p.AggregationSpout Thread-33-spout-executor[26 26] [INFO] [spout #12]  Populating buffer with nextFetchDate <= 2018-06-21T17:52:42+02:00
2018-07-03 17:18:46.327 c.d.s.e.p.AggregationSpout I/O dispatcher 26 [ERROR] Exception with ES query
java.io.IOException: Unable to parse response body for Response{requestLine=POST /status/status/_search?typed_keys=true&ignore_unavailable=false&expand_wildcards=open&allow_no_indices=true&preference=_shards%3A12&search_type=query_then_fetch&batched_reduce_size=512 HTTP/1.1, host=http://ts5565.byod.hpi.de:9200, response=HTTP/1.1 200 OK}
 at org.elasticsearch.client.RestHighLevelClient$1.onSuccess(RestHighLevelClient.java:548) [stormjar.jar:?]
 at org.elasticsearch.client.RestClient$FailureTrackingResponseListener.onSuccess(RestClient.java:600) [stormjar.jar:?]
 at org.elasticsearch.client.RestClient$1.completed(RestClient.java:355) [stormjar.jar:?]
 at org.elasticsearch.client.RestClient$1.completed(RestClient.java:346) [stormjar.jar:?]
 at org.apache.http.concurrent.BasicFuture.completed(BasicFuture.java:119) [stormjar.jar:?]
 at org.apache.http.impl.nio.client.DefaultClientExchangeHandlerImpl.responseCompleted(DefaultClientExchangeHandlerImpl.java:177) [stormjar.jar:?]
 at org.apache.http.nio.protocol.HttpAsyncRequestExecutor.processResponse(HttpAsyncRequestExecutor.java:436) [stormjar.jar:?]
 at org.apache.http.nio.protocol.HttpAsyncRequestExecutor.inputReady(HttpAsyncRequestExecutor.java:326) [stormjar.jar:?]
 at org.apache.http.impl.nio.DefaultNHttpClientConnection.consumeInput(DefaultNHttpClientConnection.java:265) [stormjar.jar:?]
 at org.apache.http.impl.nio.client.InternalIODispatch.onInputReady(InternalIODispatch.java:81) [stormjar.jar:?]
 at org.apache.http.impl.nio.client.InternalIODispatch.onInputReady(InternalIODispatch.java:39) [stormjar.jar:?]
 at org.apache.http.impl.nio.reactor.AbstractIODispatch.inputReady(AbstractIODispatch.java:114) [stormjar.jar:?]
 at org.apache.http.impl.nio.reactor.BaseIOReactor.readable(BaseIOReactor.java:162) [stormjar.jar:?]
 at org.apache.http.impl.nio.reactor.AbstractIOReactor.processEvent(AbstractIOReactor.java:337) [stormjar.jar:?]
 at org.apache.http.impl.nio.reactor.AbstractIOReactor.processEvents(AbstractIOReactor.java:315) [stormjar.jar:?]
 at org.apache.http.impl.nio.reactor.AbstractIOReactor.execute(AbstractIOReactor.java:276) [stormjar.jar:?]
 at org.apache.http.impl.nio.reactor.BaseIOReactor.execute(BaseIOReactor.java:104) [stormjar.jar:?]
 at org.apache.http.impl.nio.reactor.AbstractMultiworkerIOReactor$Worker.run(AbstractMultiworkerIOReactor.java:588) [stormjar.jar:?]
 at java.lang.Thread.run(Thread.java:748) [?:1.8.0_171]
Caused by: java.lang.NullPointerException

The crawling process now seems much more balanced, but it still does not fetch many links.

[Metrics screenshot]

Also, after running the topology for some weeks, the latency went up a lot:

[Latency graph screenshot]

2 Answers


Sorry for the late reply, just back from holidays.

Judging by the graph, the worker gets restarted, which makes me think that something is blocking or crashing the topology. After a period where nothing happens, the worker is restarted, processes some URLs, and the issue happens again.

Have you checked the logs for error messages? Is there a memory dump in the logs? Can you isolate the URL(s) which cause the issue?

Julien Nioche
  • I think this is not URL-dependent but a general issue, as it was always the same for several different URLs. I have updated the question with the current status: it now runs more smoothly after some configuration changes, but still does not fetch a lot of URLs. With the given numbers this would be around 2 pages per second, which is way too small. – Jonas Pohlmann Jul 03 '18 at 15:29
  • It won't be able to fetch many URLs if the spouts can't retrieve results, as the error in the logs suggests. You could set the logs to DEBUG level, see what the ES queries generated by the spouts look like, and check whether you can reproduce the problem by running them directly against ES. – Julien Nioche Jul 04 '18 at 08:30

This happens when you do not ack() or fail() all tuples in all your (custom) bolts. Also, when emitting from an intermediate bolt, make sure to anchor the new tuple correctly to the previous tuple. See this SO answer for an explanation.
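For illustration, here is a minimal sketch of a custom bolt that anchors and acks/fails on every code path (hypothetical class and field names, written against the Storm 1.x BaseRichBolt API; not actual StormCrawler code):

import java.util.Map;

import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

// Hypothetical custom bolt showing correct anchoring and ack/fail handling.
public class MyCustomBolt extends BaseRichBolt {

    private OutputCollector collector;

    @Override
    public void prepare(Map stormConf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
    }

    @Override
    public void execute(Tuple input) {
        try {
            String url = input.getStringByField("url");
            // ... custom processing of the tuple goes here ...

            // Anchor the emitted tuple to the input tuple so that a failure
            // further downstream is traced back to the originating spout tuple.
            collector.emit(input, new Values(url));

            // Acknowledge the input tuple once it has been processed ...
            collector.ack(input);
        } catch (Exception e) {
            // ... or fail it explicitly. A tuple that is neither acked nor
            // failed only times out after topology.message.timeout.secs and
            // is then replayed.
            collector.fail(input);
        }
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("url"));
    }
}

The important parts are that collector.emit() passes the input tuple as the anchor and that every execution path ends in either ack() or fail(); otherwise tuples are only replayed after the message timeout expires, which matches the "nothing happens for minutes" pattern.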

I had the same issue, also with an ES setup. I first thought it was about the nextFetchDate logic of the ES spout, but after disabling my custom bolts the issue was fixed. So I found that I had missed acking a tuple somewhere. This resulted in unacked tuples getting lost in the topology, which were then only re-emitted after 10 minutes / 600 seconds.

However, I could not find where that 10-minute replay timeout is defined, as the Storm replay mechanism was set to 5 minutes via topology.message.timeout.secs: 300.

nrnrn