The tutorial here describes how to set up Stormcrawler to run with phantomJS, but phantomJS doesn't seem capable of sourcing and executing outlinking javascript pages (e.g., javascript code that's linked to outside of the immediate page's context). Chromedriver appears to be able to handle this case, however. How can I set up Stormcrawler to run with chromedriver instead of phantomJS?
Asked
Active
Viewed 159 times
0
-
Smart way to earn points: to ask a question coming instantly with given by yourself a very long answer :) But it's not a question. – Prophet Apr 29 '21 at 15:49
-
@Prophet, I don't understand, how is this not a question? – Dennis Apr 29 '21 at 16:25
-
Question is when you asking a question. But you published it with the answer. So you are not asking, just trying to show others what you know. – Prophet Apr 29 '21 at 16:37
-
1This was a question that spawned from a previous question I asked a number of days ago (https://stackoverflow.com/questions/67129360/stormcrawler-not-retrieving-all-text-content-from-web-page), which I spent a good amount of time working with the developer of stormcrawler to figure out, and I imagine others will have this same question. So, I figured I would post my experience in detail here as a reference for others. – Dennis Apr 29 '21 at 16:45
-
@Prophet https://stackoverflow.com/help/self-answer – Julien Nioche Apr 29 '21 at 18:21
1 Answers
1
The basic set of steps you need to follow are:
- Install latest versions of Chrome and Chromedriver (below based on the tutorial here):
# Install Google Chrome wget https://dl.google.com/linux/direct/google-chrome-stable_current_amd64.deb sudo apt install ./google-chrome-stable_current_amd64.deb # Install Chromedriver PLATFORM=linux64 # Adjust as necessary depending on your system VERSION=$(curl http://chromedriver.storage.googleapis.com/LATEST_RELEASE) curl -O http://chromedriver.storage.googleapis.com/$VERSION/chromedriver_$PLATFORM.zip unzip chromedriver_linux64.zip # Move executable into your path cp chromedriver /usr/bin/
- Specify the following selenium settings in your crawler configuration file (based on snippet from @JulienNioche here), including the address and port at which chromedriver will be running:
http.protocol.implementation: "com.digitalpebble.stormcrawler.protocol.selenium.RemoteDriverProtocol" https.protocol.implementation: "com.digitalpebble.stormcrawler.protocol.selenium.RemoteDriverProtocol" selenium.addresses: "http://localhost:9515" selenium.setScriptTimeout: 10000 selenium.pageLoadTimeout: 1000 selenium.implicitlyWait: 1000 selenium.capabilities: goog:chromeOptions: args: - "--no-sandbox" - "--disable-dev-shm-usage" - "--headless" - "--disable-gpu"
- Rebuild your Stormcrawler maven package:
mvn clean install; mvn clean package
- Only necessary if you modified any of your source configuration files, but doesn't hurt to rebuild again anyway
- Start chromedriver in background (defaults to port 9515):
chromedriver --headless &
- [Optional if connecting to Elasticsearch] Set up your ES indices, if not already done so
- Start your topology (first in local mode as shown here to test your setup; if it doesn't crash, then you should be good to go in remote mode):
storm jar target/stormcrawler-1.0-SNAPSHOT.jar org.apache.storm.flux.Flux --local es-crawler.flux --sleep 600000
If things still don't work for you after following these steps, there may be an issue in one of your configuration files or a version incompatibility issue between one or more of the tools. In any case, I've provided a set of example configurations below that worked for me (as of the time of writing) which I hope may be of help in getting things working.
Example Configurations (for stormcrawler-elasticsearch setup with chromedriver)
Versions used at time of writing this answer:
- Stormcrawler 1.17
- Storm 1.2.3
- Selenium 4.0.0-alpha-6 (no need to install this separately; will be downloaded and installed during maven build of stormcrawler 1.17)
- Chromedriver 90.0.4430.24
- Google Chrome 90.0.4430.93
crawler-conf.yaml
config:
topology.workers: 3
topology.message.timeout.secs: 3000
topology.max.spout.pending: 100
topology.debug: true
fetcher.threads.number: 100
# override the JVM parameters for the workers
topology.worker.childopts: "-Xmx2g -Djava.net.preferIPv4Stack=true"
# mandatory when using Flux
topology.kryo.register:
- com.digitalpebble.stormcrawler.Metadata
# lists the metadata to persist to storage
# these are not transfered to the outlinks
metadata.persist:
- _redirTo
- error.cause
- error.source
- isSitemap
- isFeed
http.agent.name: "Anonymous Coward"
http.agent.version: "1.0"
http.agent.description: "built with StormCrawler 1.17"
http.agent.url: "http://someorganization.com/"
http.agent.email: "someone@someorganization.com"
# The maximum number of bytes for returned HTTP response bodies.
# The fetched page will be trimmed to 65KB in this case
# Set -1 to disable the limit.
http.content.limit: -1 # default 65536
parsefilters.config.file: "parsefilters.json"
urlfilters.config.file: "urlfilters.json"
# revisit a page daily (value in minutes)
# set it to -1 to never refetch a page
fetchInterval.default: 1440
# revisit a page with a fetch error after 2 hours (value in minutes)
# set it to -1 to never refetch a page
fetchInterval.fetch.error: 120
# never revisit a page with an error (or set a value in minutes)
fetchInterval.error: -1
# configuration for the classes extending AbstractIndexerBolt
# indexer.md.filter: "someKey=aValue"
indexer.url.fieldname: "url"
indexer.text.fieldname: "content"
indexer.canonical.name: "canonical"
indexer.md.mapping:
- parse.title=title
- parse.keywords=keywords
- parse.description=description
- domain=domain
# Metrics consumers:
topology.metrics.consumer.register:
- class: "org.apache.storm.metric.LoggingMetricsConsumer"
parallelism.hint: 1
http.protocol.implementation: "com.digitalpebble.stormcrawler.protocol.selenium.RemoteDriverProtocol"
https.protocol.implementation: "com.digitalpebble.stormcrawler.protocol.selenium.RemoteDriverProtocol"
selenium.addresses: "http://localhost:9515"
selenium.setScriptTimeout: 10000
selenium.pageLoadTimeout: 1000
selenium.implicitlyWait: 1000
selenium.capabilities:
goog:chromeOptions:
args:
- "--nosandbox"
- "--disable-dev-shm-usage"
- "--headless"
- "--disable-gpu"
es-conf.yaml
config:
# ES indexer bolt
es.indexer.addresses: "localhost"
es.indexer.index.name: "content"
# es.indexer.pipeline: "_PIPELINE_"
es.indexer.create: false
es.indexer.bulkActions: 100
es.indexer.flushInterval: "2s"
es.indexer.concurrentRequests: 1
# ES metricsConsumer
es.metrics.addresses: "http://localhost:9200"
es.metrics.index.name: "metrics"
# ES spout and persistence bolt
es.status.addresses: "http://localhost:9200"
es.status.index.name: "status"
es.status.routing: true
es.status.routing.fieldname: "key"
es.status.bulkActions: 500
es.status.flushInterval: "5s"
es.status.concurrentRequests: 1
# spout config #
# time in secs for which the URLs will be considered for fetching after a ack of fail
spout.ttl.purgatory: 30
# Min time (in msecs) to allow between 2 successive queries to ES
spout.min.delay.queries: 2000
# Delay since previous query date (in secs) after which the nextFetchDate value will be reset to the current time
spout.reset.fetchdate.after: 120
es.status.max.buckets: 50
es.status.max.urls.per.bucket: 2
# field to group the URLs into buckets
es.status.bucket.field: "key"
# fields to sort the URLs within a bucket
es.status.bucket.sort.field:
- "nextFetchDate"
- "url"
# field to sort the buckets
es.status.global.sort.field: "nextFetchDate"
# CollapsingSpout : limits the deep paging by resetting the start offset for the ES query
es.status.max.start.offset: 500
# AggregationSpout : sampling improves the performance on large crawls
es.status.sample: false
# max allowed duration of a query in sec
es.status.query.timeout: -1
# AggregationSpout (expert): adds this value in mins to the latest date returned in the results and
# use it as nextFetchDate
es.status.recentDate.increase: -1
es.status.recentDate.min.gap: -1
topology.metrics.consumer.register:
- class: "com.digitalpebble.stormcrawler.elasticsearch.metrics.MetricsConsumer"
parallelism.hint: 1
es-crawler.flux
name: "crawler"
includes:
- resource: true
file: "/crawler-default.yaml"
override: false
- resource: false
file: "crawler-conf.yaml"
override: true
- resource: false
file: "es-conf.yaml"
override: true
spouts:
- id: "spout"
className: "com.digitalpebble.stormcrawler.elasticsearch.persistence.AggregationSpout"
parallelism: 10
- id: "filespout"
className: "com.digitalpebble.stormcrawler.spout.FileSpout"
parallelism: 1
constructorArgs:
- "."
- "seeds.txt"
- true
bolts:
- id: "filter"
className: "com.digitalpebble.stormcrawler.bolt.URLFilterBolt"
parallelism: 3
- id: "partitioner"
className: "com.digitalpebble.stormcrawler.bolt.URLPartitionerBolt"
parallelism: 3
- id: "fetcher"
className: "com.digitalpebble.stormcrawler.bolt.FetcherBolt"
parallelism: 3
- id: "sitemap"
className: "com.digitalpebble.stormcrawler.bolt.SiteMapParserBolt"
parallelism: 3
- id: "parse"
className: "com.digitalpebble.stormcrawler.bolt.JSoupParserBolt"
parallelism: 12
- id: "index"
className: "com.digitalpebble.stormcrawler.elasticsearch.bolt.IndexerBolt"
parallelism: 3
- id: "status"
className: "com.digitalpebble.stormcrawler.elasticsearch.persistence.StatusUpdaterBolt"
parallelism: 3
- id: "status_metrics"
className: "com.digitalpebble.stormcrawler.elasticsearch.metrics.StatusMetricsBolt"
parallelism: 3
streams:
- from: "spout"
to: "partitioner"
grouping:
type: SHUFFLE
- from: "spout"
to: "status_metrics"
grouping:
type: SHUFFLE
- from: "partitioner"
to: "fetcher"
grouping:
type: FIELDS
args: ["key"]
- from: "fetcher"
to: "sitemap"
grouping:
type: LOCAL_OR_SHUFFLE
- from: "sitemap"
to: "parse"
grouping:
type: LOCAL_OR_SHUFFLE
- from: "parse"
to: "index"
grouping:
type: LOCAL_OR_SHUFFLE
- from: "fetcher"
to: "status"
grouping:
type: FIELDS
args: ["url"]
streamId: "status"
- from: "sitemap"
to: "status"
grouping:
type: FIELDS
args: ["url"]
streamId: "status"
- from: "parse"
to: "status"
grouping:
type: FIELDS
args: ["url"]
streamId: "status"
- from: "index"
to: "status"
grouping:
type: FIELDS
args: ["url"]
streamId: "status"
- from: "filespout"
to: "filter"
grouping:
type: FIELDS
args: ["url"]
streamId: "status"
- from: "filter"
to: "status"
grouping:
streamId: "status"
type: CUSTOM
customClass:
className: "com.digitalpebble.stormcrawler.util.URLStreamGrouping"
constructorArgs:
- "byDomain"
parsefilters.json
{
"com.digitalpebble.stormcrawler.parse.ParseFilters": [
{
"class": "com.digitalpebble.stormcrawler.parse.filter.XPathFilter",
"name": "XPathFilter",
"params": {
"canonical": "//*[@rel=\"canonical\"]/@href",
"parse.description": [
"//*[@name=\"description\"]/@content",
"//*[@name=\"Description\"]/@content"
],
"parse.title": [
"//TITLE",
"//META[@name=\"title\"]/@content"
],
"parse.keywords": "//META[@name=\"keywords\"]/@content"
}
},
{
"class": "com.digitalpebble.stormcrawler.parse.filter.LinkParseFilter",
"name": "LinkParseFilter",
"params": {
"pattern": "//FRAME/@src"
}
},
{
"class": "com.digitalpebble.stormcrawler.parse.filter.DomainParseFilter",
"name": "DomainParseFilter",
"params": {
"key": "domain",
"byHost": false
}
},
{
"class": "com.digitalpebble.stormcrawler.parse.filter.CommaSeparatedToMultivaluedMetadata",
"name": "CommaSeparatedToMultivaluedMetadata",
"params": {
"keys": ["parse.keywords"]
}
}
]
}
pom.xml
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>org.rcsb.crawler</groupId>
<artifactId>stormcrawler</artifactId>
<version>1.0-SNAPSHOT</version>
<packaging>jar</packaging>
<name>stormcrawler</name>
<properties>
<project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
<stormcrawler.version>1.17</stormcrawler.version>
</properties>
<build>
<plugins>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-compiler-plugin</artifactId>
<version>3.2</version>
<configuration>
<source>1.8</source>
<target>1.8</target>
</configuration>
</plugin>
<plugin>
<groupId>org.codehaus.mojo</groupId>
<artifactId>exec-maven-plugin</artifactId>
<version>1.3.2</version>
<executions>
<execution>
<goals>
<goal>exec</goal>
</goals>
</execution>
</executions>
<configuration>
<executable>java</executable>
<includeProjectDependencies>true</includeProjectDependencies>
<includePluginDependencies>false</includePluginDependencies>
<classpathScope>compile</classpathScope>
</configuration>
</plugin>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-shade-plugin</artifactId>
<version>1.3.3</version>
<executions>
<execution>
<phase>package</phase>
<goals>
<goal>shade</goal>
</goals>
<configuration>
<createDependencyReducedPom>false</createDependencyReducedPom>
<transformers>
<transformer
implementation="org.apache.maven.plugins.shade.resource.ServicesResourceTransformer" />
<transformer
implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
<mainClass>org.apache.storm.flux.Flux</mainClass>
<manifestEntries>
<Change></Change>
<Build-Date></Build-Date>
</manifestEntries>
</transformer>
</transformers>
<!-- The filters below are necessary if you want to include the Tika
module -->
<filters>
<filter>
<artifact>*:*</artifact>
<excludes>
<exclude>META-INF/*.SF</exclude>
<exclude>META-INF/*.DSA</exclude>
<exclude>META-INF/*.RSA</exclude>
</excludes>
</filter>
<filter>
<!-- https://issues.apache.org/jira/browse/STORM-2428 -->
<artifact>org.apache.storm:flux-core</artifact>
<excludes>
<exclude>org/apache/commons/**</exclude>
<exclude>org/apache/http/**</exclude>
<exclude>org/yaml/**</exclude>
</excludes>
</filter>
</filters>
</configuration>
</execution>
</executions>
</plugin>
</plugins>
</build>
<dependencies>
<dependency>
<groupId>com.digitalpebble.stormcrawler</groupId>
<artifactId>storm-crawler-core</artifactId>
<version>${stormcrawler.version}</version>
</dependency>
<dependency>
<groupId>com.digitalpebble.stormcrawler</groupId>
<artifactId>storm-crawler-elasticsearch</artifactId>
<version>${stormcrawler.version}</version>
</dependency>
<dependency>
<groupId>org.apache.storm</groupId>
<artifactId>storm-core</artifactId>
<version>1.2.3</version>
<scope>provided</scope>
</dependency>
<dependency>
<groupId>org.apache.storm</groupId>
<artifactId>flux-core</artifactId>
<version>1.2.3</version>
</dependency>
</dependencies>
</project>

Dennis
- 111
- 6