
I have a very simple topology that spouts from an ES index (AggregationSpout), fetches the pages (FetcherBolt) and uses StatusUpdaterBolt to update the ES status to "FETCHED".

However, I noticed warnings like this one in the log files:

[WARN] Could not find unacked tuple for 357dc2fcb59c6457884a8f7a83794c4cf77f490a3acfd849a792a35153ed4665

The corresponding debug info looks like: ...

2017-12-06 12:44:53.572 o.e.t.T.tracer elasticsearch[client][transport_client_boss][T#2] [TRACE] [214][indices:data/write/bulk] received response from [{ESPatentNode-1}{S4C2h8WjRuu6MpM25oM-3w}{Fvjny3VaQl2w45hPXZ5A9g}{127.0.0.1}{127.0.0.1:9300}]
2017-12-06 12:44:53.572 c.d.s.e.p.StatusUpdaterBolt elasticsearch[client][listener][T#1] [DEBUG] afterBulk [105] with 47 responses
2017-12-06 12:44:53.572 c.d.s.e.p.StatusUpdaterBolt elasticsearch[client][listener][T#1] [DEBUG] Acked 1 tuple(s) for ID 5967f802c84e3e9c6ac22a3184e0665b850779cba9050fa4ec910a41f9f90655
2017-12-06 12:44:53.573 c.d.s.e.p.StatusUpdaterBolt elasticsearch[client][listener][T#1] [DEBUG] Acked 2 tuple(s) for ID 357dc2fcb59c6457884a8f7a83794c4cf77f490a3acfd849a792a35153ed4665
2017-12-06 12:44:53.573 c.d.s.e.p.StatusUpdaterBolt elasticsearch[client][listener][T#1] [DEBUG] Acked 1 tuple(s) for ID 092e59cd1ebb004884babfaf1d6ca4b7505b3dcb1b3cb3a52b9072d647fb7a93
2017-12-06 12:44:53.573 c.d.s.e.p.StatusUpdaterBolt elasticsearch[client][listener][T#1] [WARN] Could not find unacked tuple for 357dc2fcb59c6457884a8f7a83794c4cf77f490a3acfd849a792a35153ed4665

What I would like to understand is:

  1. why several tuples can be attached to a single ID
  2. how the same "waitAck" cache element can be encountered twice while looping through "response" in the afterBulk method of StatusUpdaterBolt

Thanks in advance for your help!

Julien Nioche
EJO

1 Answer


These warnings are pretty normal; see the explanation below.

  1. Tuples will have the same ID if they have the same URL. With the log at debug level, you should see the mappings => 'Sent to ES buffer {} with ID {}'

  2. Because the status is FETCHED, the tuples are sent to ES every time (unlike DISCOVERED), so the same URL can be sent more than once. In the pseudo-ack method we then store both tuples as the value under that ID in the cache. When processing the responses from ES, we get two different results for the same ID: the first one acks both tuples, and the second does nothing except trigger the message you saw.
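The two points above can be reproduced with a minimal Java sketch. This is not the actual StatusUpdaterBolt code: the class and method names (WaitAckSketch, idFor, sendToBuffer) are hypothetical, plain strings stand in for Storm tuples, and deriving the ID as a SHA-256 hex digest of the URL is an assumption based on the 64-character IDs in the log above.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class WaitAckSketch {
    // Hypothetical stand-in for the "waitAck" cache: ID -> tuples awaiting an ack.
    final Map<String, List<String>> waitAck = new HashMap<>();
    // Collected log lines, so the behaviour can be inspected.
    final List<String> log = new ArrayList<>();

    // Assumption: the ID is a SHA-256 hex digest of the URL.
    static String idFor(String url) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            StringBuilder hex = new StringBuilder();
            for (byte b : md.digest(url.getBytes(StandardCharsets.UTF_8))) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    // Pseudo-ack: remember the tuple under its ID until ES answers.
    void sendToBuffer(String tuple, String url) {
        waitAck.computeIfAbsent(idFor(url), k -> new ArrayList<>()).add(tuple);
    }

    // One response per document sent, so a duplicated URL yields two responses.
    void afterBulk(List<String> responseIds) {
        for (String id : responseIds) {
            List<String> tuples = waitAck.remove(id);
            if (tuples == null) {
                log.add("WARN Could not find unacked tuple for " + id);
            } else {
                log.add("DEBUG Acked " + tuples.size() + " tuple(s) for ID " + id);
            }
        }
    }

    public static void main(String[] args) {
        WaitAckSketch sketch = new WaitAckSketch();
        String url = "http://example.com/page"; // illustrative URL
        sketch.sendToBuffer("tuple-1", url);
        sketch.sendToBuffer("tuple-2", url);    // same URL, same ID
        String id = idFor(url);
        sketch.afterBulk(Arrays.asList(id, id)); // ES returns two results
        sketch.log.forEach(System.out::println);
    }
}
```

Running main logs one "Acked 2 tuple(s)" line followed by one "Could not find unacked tuple" line for the same ID, which matches the pattern in the question's log.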

The question is why you would get the same URL more than once if all you do is fetch. That's probably worth investigating.

Thanks!

Julien Nioche
  • Hi Julien, many thanks for the explanations and the very quick answer! The set of URLs I'm fetching comes from a previous recursive crawl. In that case, would that explain potential duplicates in the ES index, or should duplicates be filtered out by default so they don't pollute the index? – EJO Dec 06 '17 at 13:32
  • No problem, please mark the answer as accepted or useful. There should not be any duplicates in the status index if the URL is used as key; also if you are sharding then make sure URLs from the same host or domain end up in the same shard. How do you inject the URLs in the first place? – Julien Nioche Dec 06 '17 at 14:13
  • I'm using a single shard for now. I used your ESSeedInjector in the first place (URL as key), but when running my recursive crawl I did not have the URL as a key. That would explain my original problem. I will rerun the whole thing and let you know. Thanks again for the support! – EJO Dec 07 '17 at 11:36