Questions tagged [stormcrawler]

StormCrawler is an open source project providing a collection of resources for building low-latency, scalable web crawlers based on Apache Storm.

214 questions
5
votes
2 answers

Nutch vs Heritrix vs Stormcrawler vs MegaIndex vs Mixnode

We need to crawl a large number (~1.5 billion) of web pages every two weeks. Speed, hence cost, is a huge factor for us as our initial attempts have ended up costing us over $20k. Is there any data on which crawler performs the best in a distributed…
Anakin
  • 107
  • 1
  • 5
3
votes
0 answers

KryoException: Buffer underflow error in Apache Storm and Storm-Crawler

I have been encountering a recurring issue during the deployment of a new version of my topology in Storm-Crawler, and I am seeking assistance in understanding and resolving the problem. Error: Upon deployment, I consistently encounter the following…
2
votes
1 answer

Storm Crawler with Java 11

Trying to update the Java version from Java 8 to Java 11 to compile and run StormCrawler. My question: is Storm Crawler supported on Java 11? When I updated the Java version in my POM and built the project, the build was successful and the…
Vinay S
  • 21
  • 2
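For reference, a minimal sketch of the POM change involved, using the stock maven-compiler-plugin `release` property (nothing here is StormCrawler-specific; whether the project's dependencies themselves run on Java 11 is a separate question):

```xml
<!-- Target Java 11 via the compiler plugin's release flag -->
<properties>
  <maven.compiler.release>11</maven.compiler.release>
</properties>
```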
2
votes
2 answers

Tika Parser slowing down StormCrawler

I have a pretty common task: a few thousand websites, of which as many as possible need to be parsed (in an adequate manner, of course). First, I made a stormcrawlerfight-like configuration using the JSoup parser. Throughput was pretty good, very…
elgato
  • 321
  • 3
  • 13
2
votes
1 answer

How to enable the Selenium plugin in Storm Crawler

How can we configure and enable the Selenium plugin in Storm Crawler, for example in the archetype project? There is code for using Selenium in Storm Crawler, but I don't know how to use it.
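A sketch of the configuration switch involved, assuming StormCrawler's Selenium module is on the classpath and a WebDriver server is running; the key names follow the project's protocol-implementation convention and should be verified against the version in use:

```yaml
# Swap the default HTTP protocol for the Selenium-based one (verify
# the class name against your StormCrawler version)
http.protocol.implementation: "com.digitalpebble.stormcrawler.protocol.selenium.RemoteDriverProtocol"
https.protocol.implementation: "com.digitalpebble.stormcrawler.protocol.selenium.RemoteDriverProtocol"
# address of a running Selenium/WebDriver endpoint, e.g. headless Chrome
selenium.addresses: "http://localhost:4444/wd/hub"
```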
2
votes
2 answers

Stormcrawler slow with high latency crawling 300 domains

I have been struggling with this issue for about 3 months. The crawler seems to fetch pages every 10 minutes but does nothing in between, with very slow overall throughput. I am crawling 300 domains in parallel, which should make…
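Throughput in this situation is usually governed by the politeness settings, since each host gets its own fetch queue. An illustrative tuning sketch (key names from the default crawler-conf.yaml; verify them against your StormCrawler version):

```yaml
# Politeness / throughput knobs for many parallel domains
fetcher.threads.number: 50        # total fetch threads across the topology
fetcher.server.delay: 1.0         # seconds between requests to the same queue
fetcher.threads.per.queue: 1      # parallelism within one host/domain queue
partition.url.mode: "byHost"      # one queue per host
```

With 300 domains and one thread per queue, the per-queue delay effectively caps the crawl rate, so long idle gaps often point at the spout not keeping the queues topped up rather than at the fetcher itself.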
2
votes
1 answer

StormCrawler's archetype topology does not fetch outlinks

From my understanding the basic example should be able to crawl and fetch pages. I followed the example on http://stormcrawler.net/getting-started/ but the crawler seems to only fetch a few pages and then does nothing more. I wanted to crawl…
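For outlinks to be followed, the parser has to emit them and the URL filters have to let them through. A minimal sketch of the relevant setting (name as in the default configuration; the depth limit lives separately in urlfilters.json):

```yaml
# Emit links discovered during parsing so they are queued for fetching
parser.emitOutlinks: true
```

If this is already set, the usual suspects are a restrictive max-depth URL filter or the default filters discarding external hosts.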
2
votes
1 answer

X509 Certificate Exception while crawling some urls with StormCrawler

I have been using StormCrawler to crawl websites. I set the default https protocol in StormCrawler. However, when I crawl some websites I receive the exception below: Caused by:…
isspek
  • 133
  • 1
  • 11
2
votes
1 answer

StormCrawler maven packaging error

I am trying to set up and run Storm Crawler and follow http://digitalpebble.blogspot.co.uk/2017/04/crawl-dynamic-content-with-selenium-and.html blog post. The set of resources and configuration for StormCrawler are on my computer in…
Deividas Duda
  • 123
  • 1
  • 8
2
votes
1 answer

StatusUpdaterBolt: Could not find unacked tuple for ID

I have a very simple topology that spouts from an ES index (AggregationSpout), fetches the pages (FetcherBolt) and uses StatusUpdaterBolt to update the ES status to "FETCHED". However, I noticed such warnings in the log files: [WARN] Could not…
EJO
  • 43
  • 4
2
votes
1 answer

Storm-crawler crawl and indexing

I've worked with Nutch 1.x for crawling websites, using Elasticsearch to index the data. I came across Storm-crawler recently and like it, especially its streaming nature. Do I have to init and create the mappings for my ES server that…
user3125823
  • 1,846
  • 2
  • 18
  • 46
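As with Nutch, the Elasticsearch mappings are normally created up front rather than inferred at index time; recent versions of the ES archetype ship an init script for this (script name as in recent releases, worth double-checking):

```
# From the generated crawler project, after reviewing the index
# names and mappings it creates:
sh ES_IndexInit.sh
```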
2
votes
1 answer

Storm Crawler- Crawling the websites which require authentication

I would like to crawl websites on an intranet which require authorization (I already have credentials) with Storm Crawler. Is it possible to do that by simply modifying the crawler configuration, or should I alter classes in the source code? If so,…
isspek
  • 133
  • 1
  • 11
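Whether credentials can be passed purely via configuration depends on the protocol implementation in the StormCrawler version used, so that part needs checking against its docs. The Basic auth header value itself, though, is just a precomputed string; a self-contained sketch (class and method names here are hypothetical, not part of StormCrawler):

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class BasicAuthHeader {

    // Builds the value of an HTTP "Authorization" header for Basic auth:
    // "Basic " followed by base64("user:password")
    static String basicAuth(String user, String password) {
        String token = Base64.getEncoder()
                .encodeToString((user + ":" + password).getBytes(StandardCharsets.UTF_8));
        return "Basic " + token;
    }

    public static void main(String[] args) {
        // prints: Basic YWxpY2U6c2VjcmV0
        System.out.println(basicAuth("alice", "secret"));
    }
}
```

A header like this could then be injected wherever the chosen protocol implementation allows custom request headers.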
2
votes
1 answer

Crawling using Storm Crawler

We are trying to implement Storm Crawler to crawl data. We have been able to find sub-links from a URL, but we want to get the contents of those sub-links. I have not been able to find many resources explaining how to do this. Any useful…
Ravi Ranjan
  • 353
  • 1
  • 6
  • 22
1
vote
1 answer

What is the meaning of bucket in StormCrawler spouts?

What is the meaning of a bucket in the StormCrawler project? I have seen buckets in different spouts of the project, for example in the Solr- and SQL-based spouts.
aeranginkaman
  • 279
  • 1
  • 3
  • 11
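For context, a bucket in the status spouts groups URLs that share a partition key (typically host, domain or IP), so the spout can cap how many URLs per group enter the topology at once and keep the crawl polite. A sketch of the related settings in the Elasticsearch module (key names from memory of the ES defaults; verify for the version in use):

```yaml
es.status.bucket.field: "key"       # metadata field URLs are grouped by
es.status.max.urls.per.bucket: 2    # URLs taken per bucket per spout query
```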
1
vote
1 answer

Stormcrawler not retrieving all text content from web page

I'm attempting to use Stormcrawler to crawl a set of pages on our website, and while it is able to retrieve and index some of the page's text, it's not capturing a large amount of other text on the page. I've installed Zookeeper, Apache Storm, and…
Dennis
  • 111
  • 6
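One common cause of partial text is the extraction being restricted to particular elements: recent StormCrawler versions include a TextExtractor whose include/exclude patterns decide what lands in the text field. An illustrative sketch (key names as in recent releases, and the element names below are examples to adapt, not defaults to copy):

```yaml
# Restrict extraction to the main content area; with no matching
# include pattern, some versions fall back to the whole document
textextractor.include.pattern:
  - ARTICLE
  - MAIN
textextractor.exclude.tags:
  - STYLE
  - SCRIPT
```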