Questions tagged [stormcrawler]

StormCrawler is an open source project providing a collection of resources for building low-latency, scalable web crawlers based on Apache Storm.

214 questions
5
votes
2 answers

Nutch vs Heritrix vs Stormcrawler vs MegaIndex vs Mixnode

We need to crawl a large number (~1.5 billion) of web pages every two weeks. Speed, hence cost, is a huge factor for us as our initial attempts have ended up costing us over $20k. Is there any data on which crawler performs the best in a distributed…
Anakin
  • 107
  • 1
  • 5
3
votes
0 answers

KryoException: Buffer underflow error in Apache Storm and Storm-Crawler

I have been encountering a recurring issue during the deployment of a new version of my topology in Storm-Crawler, and I am seeking assistance in understanding and resolving the problem. Error: Upon deployment, I consistently encounter the following…
2
votes
1 answer

Storm Crawler with Java 11

Trying to update the Java version from Java 8 to Java 11 to compile and run StormCrawler. My question: is Storm Crawler supported on Java 11? When I updated the Java version in my POM and built the project, the build was successful and the…
Vinay S
  • 21
  • 2
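For reference, a minimal sketch of the POM change involved, using the stock maven-compiler-plugin `release` property (nothing here is StormCrawler-specific; whether the project's dependencies themselves run on Java 11 is a separate question):

```xml
<!-- Target Java 11 via the compiler plugin's release flag -->
<properties>
  <maven.compiler.release>11</maven.compiler.release>
</properties>
```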
2
votes
2 answers

Tika Parser slowing down StormCrawler

I have a pretty common task: a few thousand websites, of which as many as possible need to be parsed (in an adequate manner, of course). First, I made a stormcrawlerfight-like configuration using the JSoup parser. Throughput was pretty good, very…
elgato
  • 321
  • 3
  • 13
2
votes
1 answer

How to enable the Selenium plugin in Storm Crawler

How can we configure and enable the Selenium plugin in Storm Crawler, for example in the archetype project? There is code for using Selenium in Storm Crawler, but I don't know how to use it.
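A sketch of the configuration switch involved, assuming StormCrawler's Selenium module is on the classpath and a WebDriver server is running; the key names follow the project's protocol-implementation convention and should be verified against the version in use:

```yaml
# Swap the default HTTP protocol for the Selenium-based one (verify
# the class name against your StormCrawler version)
http.protocol.implementation: "com.digitalpebble.stormcrawler.protocol.selenium.RemoteDriverProtocol"
https.protocol.implementation: "com.digitalpebble.stormcrawler.protocol.selenium.RemoteDriverProtocol"
# address of a running Selenium/WebDriver endpoint, e.g. headless Chrome
selenium.addresses: "http://localhost:4444/wd/hub"
```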
2
votes
2 answers

Stormcrawler slow with high latency crawling 300 domains

I have been struggling with this issue for about 3 months. The crawler seems to fetch pages every 10 minutes but does nothing in between, with very slow overall throughput. I am crawling 300 domains in parallel, which should make…
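Throughput in this situation is usually governed by the politeness settings, since each host gets its own fetch queue. An illustrative tuning sketch (key names from the default crawler-conf.yaml; verify them against your StormCrawler version):

```yaml
# Politeness / throughput knobs for many parallel domains
fetcher.threads.number: 50        # total fetch threads across the topology
fetcher.server.delay: 1.0         # seconds between requests to the same queue
fetcher.threads.per.queue: 1      # parallelism within one host/domain queue
partition.url.mode: "byHost"      # one queue per host
```

With 300 domains and one thread per queue, the per-queue delay effectively caps the crawl rate, so long idle gaps often point at the spout not keeping the queues topped up rather than at the fetcher itself.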
2
votes
1 answer

StormCrawler's archetype topology does not fetch outlinks

From my understanding the basic example should be able to crawl and fetch pages. I followed the example on http://stormcrawler.net/getting-started/ but the crawler seems to only fetch a few pages and then does nothing more. I wanted to crawl…
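For outlinks to be followed, the parser has to emit them and the URL filters have to let them through. A minimal sketch of the relevant setting (name as in the default configuration; the depth limit lives separately in urlfilters.json):

```yaml
# Emit links discovered during parsing so they are queued for fetching
parser.emitOutlinks: true
```

If this is already set, the usual suspects are a restrictive max-depth URL filter or the default filters discarding external hosts.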
2
votes
1 answer

X509 Certificate Exception while crawling some urls with StormCrawler

I have been using StormCrawler to crawl websites. I set the default https protocol in StormCrawler. However, when I crawl some websites I receive the exception below: Caused by:…
isspek
  • 133
  • 1
  • 11
2
votes
1 answer

StormCrawler maven packaging error

I am trying to set up and run Storm Crawler and follow http://digitalpebble.blogspot.co.uk/2017/04/crawl-dynamic-content-with-selenium-and.html blog post. The set of resources and configuration for StormCrawler are on my computer in…
Deividas Duda
  • 123
  • 1
  • 8
2
votes
1 answer

StatusUpdaterBolt: Could not find unacked tuple for ID

I have a very simple topology that spouts from an ES index (AggregationSpout), fetches the pages (FetcherBolt) and uses StatusUpdaterBolt to update the ES status to "FETCHED". However, I noticed such warnings in the log files: [WARN] Could not…
EJO
  • 43
  • 4
2
votes
1 answer

Storm-crawler crawl and indexing

I've worked with Nutch 1.x for crawling websites, using Elasticsearch to index the data. I came across Storm-crawler recently and like it, especially its streaming nature. Do I have to init and create the mappings for my ES server that…
user3125823
  • 1,846
  • 2
  • 18
  • 46
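As with Nutch, the Elasticsearch mappings are normally created up front rather than inferred at index time; recent versions of the ES archetype ship an init script for this (script name as in recent releases, worth double-checking):

```
# From the generated crawler project, after reviewing the index
# names and mappings it creates:
sh ES_IndexInit.sh
```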
2
votes
1 answer

Storm Crawler- Crawling the websites which require authentication

I would like to crawl websites on an intranet which require authorization (I already have credentials) with Storm Crawler. Is it possible to do that by simply modifying the crawler configuration, or should I alter classes in the source code? If so,…
isspek
  • 133
  • 1
  • 11
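Whether credentials can be passed purely via configuration depends on the protocol implementation in the StormCrawler version used, so that part needs checking against its docs. The Basic auth header value itself, though, is just a precomputed string; a self-contained sketch (class and method names here are hypothetical, not part of StormCrawler):

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class BasicAuthHeader {

    // Builds the value of an HTTP "Authorization" header for Basic auth:
    // "Basic " followed by base64("user:password")
    static String basicAuth(String user, String password) {
        String token = Base64.getEncoder()
                .encodeToString((user + ":" + password).getBytes(StandardCharsets.UTF_8));
        return "Basic " + token;
    }

    public static void main(String[] args) {
        // prints: Basic YWxpY2U6c2VjcmV0
        System.out.println(basicAuth("alice", "secret"));
    }
}
```

A header like this could then be injected wherever the chosen protocol implementation allows custom request headers.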
2
votes
1 answer

Crawling using Storm Crawler

We are trying to implement Storm Crawler to crawl data. We have been able to find sub-links from a URL, but we want to get the contents of those sub-links. I have not been able to find many resources explaining how to do this. Any useful…
Ravi Ranjan
  • 353
  • 1
  • 6
  • 22
1
vote
1 answer

What is the meaning of bucket in StormCrawler spouts?

What is the meaning of a bucket in the StormCrawler project? I have seen buckets in different spouts of the project, for example in the Solr- and SQL-based spouts.
aeranginkaman
  • 279
  • 1
  • 3
  • 11
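For context, a bucket in the status spouts groups URLs that share a partition key (typically host, domain or IP), so the spout can cap how many URLs per group enter the topology at once and keep the crawl polite. A sketch of the related settings in the Elasticsearch module (key names from memory of the ES defaults; verify for the version in use):

```yaml
es.status.bucket.field: "key"       # metadata field URLs are grouped by
es.status.max.urls.per.bucket: 2    # URLs taken per bucket per spout query
```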
1
vote
1 answer

Stormcrawler not retrieving all text content from web page

I'm attempting to use Stormcrawler to crawl a set of pages on our website, and while it is able to retrieve and index some of the page's text, it's not capturing a large amount of other text on the page. I've installed Zookeeper, Apache Storm, and…
Dennis
  • 111
  • 6
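One common cause of partial text is the extraction being restricted to particular elements: recent StormCrawler versions include a TextExtractor whose include/exclude patterns decide what lands in the text field. An illustrative sketch (key names as in recent releases, and the element names below are examples to adapt, not defaults to copy):

```yaml
# Restrict extraction to the main content area; with no matching
# include pattern, some versions fall back to the whole document
textextractor.include.pattern:
  - ARTICLE
  - MAIN
textextractor.exclude.tags:
  - STYLE
  - SCRIPT
```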