
The tutorial here describes how to set up StormCrawler to run with PhantomJS, but PhantomJS doesn't seem capable of sourcing and executing JavaScript that lives outside the immediate page's context (e.g., script files linked from the page rather than embedded in it). Chromedriver appears to handle this case, however. How can I set up StormCrawler to run with chromedriver instead of PhantomJS?

  • A smart way to earn points: ask a question and instantly supply a very long answer yourself :) But it's not a question. – Prophet Apr 29 '21 at 15:49
  • @Prophet, I don't understand, how is this not a question? – Dennis Apr 29 '21 at 16:25
  • A question is when you are asking something. But you published it together with the answer, so you're not asking, just showing others what you know. – Prophet Apr 29 '21 at 16:37
  • This was a question that spawned from a previous question I asked a number of days ago (https://stackoverflow.com/questions/67129360/stormcrawler-not-retrieving-all-text-content-from-web-page), which I spent a good amount of time working through with the developer of StormCrawler, and I imagine others will have this same question. So, I figured I would post my experience in detail here as a reference for others. – Dennis Apr 29 '21 at 16:45
  • @Prophet https://stackoverflow.com/help/self-answer – Julien Nioche Apr 29 '21 at 18:21

1 Answer


The basic steps you need to follow are:

  1. Install the latest versions of Chrome and chromedriver (the commands below are based on the tutorial here):
    # Install Google Chrome
    wget https://dl.google.com/linux/direct/google-chrome-stable_current_amd64.deb
    sudo apt install ./google-chrome-stable_current_amd64.deb
    
    # Install Chromedriver
    PLATFORM=linux64  # Adjust as necessary depending on your system
    VERSION=$(curl http://chromedriver.storage.googleapis.com/LATEST_RELEASE)
    curl -O http://chromedriver.storage.googleapis.com/$VERSION/chromedriver_$PLATFORM.zip
    unzip chromedriver_$PLATFORM.zip
    
    # Move the executable onto your PATH
    sudo cp chromedriver /usr/bin/
    
  2. Specify the following Selenium settings in your crawler configuration file (based on the snippet from @JulienNioche here), including the address and port at which chromedriver will be listening:
    http.protocol.implementation: "com.digitalpebble.stormcrawler.protocol.selenium.RemoteDriverProtocol"
    https.protocol.implementation: "com.digitalpebble.stormcrawler.protocol.selenium.RemoteDriverProtocol"
    selenium.addresses: "http://localhost:9515"
    selenium.setScriptTimeout: 10000
    selenium.pageLoadTimeout: 1000
    selenium.implicitlyWait: 1000
    selenium.capabilities:
      goog:chromeOptions:
        args:
        - "--no-sandbox"
        - "--disable-dev-shm-usage"
        - "--headless"
        - "--disable-gpu"
    
  3. Rebuild your StormCrawler Maven package: mvn clean package
    • Only necessary if you modified any of your source or configuration files, but it doesn't hurt to rebuild anyway
  4. Start chromedriver in the background (it listens on port 9515 by default): chromedriver & (the --headless flag belongs to Chrome itself and is already passed via goog:chromeOptions above, so chromedriver doesn't need it; a quick way to check that it is up is shown just after this list)
  5. [Only if connecting to Elasticsearch] Set up your ES indices, if you haven't already done so (see the note after the es-conf.yaml example below)
  6. Start your topology (first in local mode, as shown here, to test your setup; if it doesn't crash, you should be good to go in remote mode; a remote submission command is sketched at the end of this answer):
    storm jar target/stormcrawler-1.0-SNAPSHOT.jar org.apache.storm.flux.Flux --local es-crawler.flux --sleep 600000
    
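As a quick sanity check for step 4, you can hit chromedriver's status endpoint; this is part of the standard WebDriver HTTP API, so it should work with any recent chromedriver:

    # Should return JSON containing "ready": true if chromedriver is listening
    curl http://localhost:9515/status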

If things still don't work after following these steps, there may be an issue in one of your configuration files or a version incompatibility between one or more of the tools. A useful first step is to take StormCrawler out of the equation and talk to chromedriver directly, as sketched below. Beyond that, I've provided a set of example configurations that worked for me (as of the time of writing), which I hope will help in getting things working.
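
As a concrete example, the snippet below (independent of StormCrawler) asks chromedriver to start a headless Chrome session with the same goog:chromeOptions arguments as in the configuration above, using the standard W3C WebDriver wire protocol. If this fails, the problem lies between chromedriver and Chrome (typically a version mismatch or a sandboxing issue) rather than in your topology:

    # Create a throwaway headless Chrome session via chromedriver's HTTP API;
    # a successful response contains a sessionId
    curl -s -X POST http://localhost:9515/session \
      -H "Content-Type: application/json" \
      -d '{"capabilities": {"alwaysMatch": {"browserName": "chrome",
            "goog:chromeOptions": {"args": ["--headless", "--no-sandbox",
              "--disable-dev-shm-usage", "--disable-gpu"]}}}}'

You can clean up afterwards with curl -X DELETE http://localhost:9515/session/<sessionId>, substituting the returned sessionId.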


Example Configurations (for stormcrawler-elasticsearch setup with chromedriver)

Versions used at time of writing this answer:

  • Stormcrawler 1.17
  • Storm 1.2.3
  • Selenium 4.0.0-alpha-6 (no need to install this separately; it is pulled in as a dependency during the Maven build of StormCrawler 1.17)
  • Chromedriver 90.0.4430.24
  • Google Chrome 90.0.4430.93
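
To confirm what you actually have installed, a quick check (this assumes the Storm binaries are on your PATH):

    google-chrome --version   # e.g. Google Chrome 90.0.4430.93
    chromedriver --version    # e.g. ChromeDriver 90.0.4430.24
    storm version             # reports the Storm release, e.g. 1.2.3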

crawler-conf.yaml

config:
  topology.workers: 3
  topology.message.timeout.secs: 3000
  topology.max.spout.pending: 100
  topology.debug: true

  fetcher.threads.number: 100

  # override the JVM parameters for the workers
  topology.worker.childopts: "-Xmx2g -Djava.net.preferIPv4Stack=true"

  # mandatory when using Flux
  topology.kryo.register:
    - com.digitalpebble.stormcrawler.Metadata

  # lists the metadata to persist to storage
  # these are not transferred to the outlinks
  metadata.persist:
   - _redirTo
   - error.cause
   - error.source
   - isSitemap
   - isFeed

  http.agent.name: "Anonymous Coward"
  http.agent.version: "1.0"
  http.agent.description: "built with StormCrawler 1.17"
  http.agent.url: "http://someorganization.com/"
  http.agent.email: "someone@someorganization.com"

  # The maximum number of bytes for returned HTTP response bodies.
  # Set -1 to disable the limit, as done here; the default of 65536
  # would trim fetched pages to 64KB, which is too small for many pages.
  http.content.limit: -1 # default 65536

  parsefilters.config.file: "parsefilters.json"
  urlfilters.config.file: "urlfilters.json"

  # revisit a page daily (value in minutes)
  # set it to -1 to never refetch a page
  fetchInterval.default: 1440

  # revisit a page with a fetch error after 2 hours (value in minutes)
  # set it to -1 to never refetch a page
  fetchInterval.fetch.error: 120

  # never revisit a page with an error (or set a value in minutes)
  fetchInterval.error: -1

  # configuration for the classes extending AbstractIndexerBolt
  # indexer.md.filter: "someKey=aValue"
  indexer.url.fieldname: "url"
  indexer.text.fieldname: "content"
  indexer.canonical.name: "canonical"
  indexer.md.mapping:
  - parse.title=title
  - parse.keywords=keywords
  - parse.description=description
  - domain=domain

  # Metrics consumers:
  topology.metrics.consumer.register:
     - class: "org.apache.storm.metric.LoggingMetricsConsumer"
       parallelism.hint: 1

  http.protocol.implementation: "com.digitalpebble.stormcrawler.protocol.selenium.RemoteDriverProtocol"
  https.protocol.implementation: "com.digitalpebble.stormcrawler.protocol.selenium.RemoteDriverProtocol"
  selenium.addresses: "http://localhost:9515"
  selenium.setScriptTimeout: 10000
  selenium.pageLoadTimeout: 1000
  selenium.implicitlyWait: 1000
  selenium.capabilities:
    goog:chromeOptions:
      args:
      - "--nosandbox"
      - "--disable-dev-shm-usage"
      - "--headless"
      - "--disable-gpu"

es-conf.yaml

config:
  # ES indexer bolt
  es.indexer.addresses: "localhost"
  es.indexer.index.name: "content"
  # es.indexer.pipeline: "_PIPELINE_"
  es.indexer.create: false
  es.indexer.bulkActions: 100
  es.indexer.flushInterval: "2s"
  es.indexer.concurrentRequests: 1

  # ES metricsConsumer
  es.metrics.addresses: "http://localhost:9200"
  es.metrics.index.name: "metrics"

  # ES spout and persistence bolt
  es.status.addresses: "http://localhost:9200"
  es.status.index.name: "status"
  es.status.routing: true
  es.status.routing.fieldname: "key"
  es.status.bulkActions: 500
  es.status.flushInterval: "5s"
  es.status.concurrentRequests: 1

  # spout config #

  # time in secs for which the URLs will be considered for fetching after an ack or a fail
  spout.ttl.purgatory: 30

  # Min time (in msecs) to allow between 2 successive queries to ES
  spout.min.delay.queries: 2000

  # Delay since previous query date (in secs) after which the nextFetchDate value will be reset to the current time
  spout.reset.fetchdate.after: 120

  es.status.max.buckets: 50
  es.status.max.urls.per.bucket: 2
  # field to group the URLs into buckets
  es.status.bucket.field: "key"
  # fields to sort the URLs within a bucket
  es.status.bucket.sort.field:
   - "nextFetchDate"
   - "url"
  # field to sort the buckets
  es.status.global.sort.field: "nextFetchDate"

  # CollapsingSpout : limits the deep paging by resetting the start offset for the ES query
  es.status.max.start.offset: 500

  # AggregationSpout : sampling improves the performance on large crawls
  es.status.sample: false

  # max allowed duration of a query in sec
  es.status.query.timeout: -1

  # AggregationSpout (expert): adds this value in mins to the latest date returned in the
  # results and uses it as the nextFetchDate
  es.status.recentDate.increase: -1
  es.status.recentDate.min.gap: -1

  topology.metrics.consumer.register:
       - class: "com.digitalpebble.stormcrawler.elasticsearch.metrics.MetricsConsumer"
         parallelism.hint: 1

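A note on step 5: StormCrawler's Elasticsearch module ships an ES_IndexInit.sh script (look in external/elasticsearch of the source distribution) that creates the status, metrics and content indices with the expected mappings. Whichever way you set the indices up, a quick way to verify they exist before starting the topology:

    # Lists all indices; you should see entries whose names match
    # es.status.index.name, es.metrics.index.name and es.indexer.index.name above
    curl "http://localhost:9200/_cat/indices?v"
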
es-crawler.flux

name: "crawler"

includes:
    - resource: true
      file: "/crawler-default.yaml"
      override: false

    - resource: false
      file: "crawler-conf.yaml"
      override: true

    - resource: false
      file: "es-conf.yaml"
      override: true

spouts:
  - id: "spout"
    className: "com.digitalpebble.stormcrawler.elasticsearch.persistence.AggregationSpout"
    parallelism: 10

  - id: "filespout"
    className: "com.digitalpebble.stormcrawler.spout.FileSpout"
    parallelism: 1
    constructorArgs:
      - "."
      - "seeds.txt"
      - true

bolts:
  - id: "filter"
    className: "com.digitalpebble.stormcrawler.bolt.URLFilterBolt"
    parallelism: 3
  - id: "partitioner"
    className: "com.digitalpebble.stormcrawler.bolt.URLPartitionerBolt"
    parallelism: 3
  - id: "fetcher"
    className: "com.digitalpebble.stormcrawler.bolt.FetcherBolt"
    parallelism: 3
  - id: "sitemap"
    className: "com.digitalpebble.stormcrawler.bolt.SiteMapParserBolt"
    parallelism: 3
  - id: "parse"
    className: "com.digitalpebble.stormcrawler.bolt.JSoupParserBolt"
    parallelism: 12
  - id: "index"
    className: "com.digitalpebble.stormcrawler.elasticsearch.bolt.IndexerBolt"
    parallelism: 3
  - id: "status"
    className: "com.digitalpebble.stormcrawler.elasticsearch.persistence.StatusUpdaterBolt"
    parallelism: 3
  - id: "status_metrics"
    className: "com.digitalpebble.stormcrawler.elasticsearch.metrics.StatusMetricsBolt"
    parallelism: 3

streams:
  - from: "spout"
    to: "partitioner"
    grouping:
      type: SHUFFLE

  - from: "spout"
    to: "status_metrics"
    grouping:
      type: SHUFFLE

  - from: "partitioner"
    to: "fetcher"
    grouping:
      type: FIELDS
      args: ["key"]

  - from: "fetcher"
    to: "sitemap"
    grouping:
      type: LOCAL_OR_SHUFFLE

  - from: "sitemap"
    to: "parse"
    grouping:
      type: LOCAL_OR_SHUFFLE

  - from: "parse"
    to: "index"
    grouping:
      type: LOCAL_OR_SHUFFLE

  - from: "fetcher"
    to: "status"
    grouping:
      type: FIELDS
      args: ["url"]
      streamId: "status"

  - from: "sitemap"
    to: "status"
    grouping:
      type: FIELDS
      args: ["url"]
      streamId: "status"

  - from: "parse"
    to: "status"
    grouping:
      type: FIELDS
      args: ["url"]
      streamId: "status"

  - from: "index"
    to: "status"
    grouping:
      type: FIELDS
      args: ["url"]
      streamId: "status"

  - from: "filespout"
    to: "filter"
    grouping:
      type: FIELDS
      args: ["url"]
      streamId: "status"

  - from: "filter"
    to: "status"
    grouping:
      streamId: "status"
      type: CUSTOM
      customClass:
        className: "com.digitalpebble.stormcrawler.util.URLStreamGrouping"
        constructorArgs:
          - "byDomain"

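Note that the FileSpout in the flux file above reads seed URLs from seeds.txt in the directory you launch the topology from, one URL per line. A minimal example (example.com is just a placeholder):

    # Create a seeds file next to where you launch the topology
    echo "https://www.example.com/" > seeds.txt
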
parsefilters.json

{
  "com.digitalpebble.stormcrawler.parse.ParseFilters": [
    {
      "class": "com.digitalpebble.stormcrawler.parse.filter.XPathFilter",
      "name": "XPathFilter",
      "params": {
        "canonical": "//*[@rel=\"canonical\"]/@href",
        "parse.description": [
            "//*[@name=\"description\"]/@content",
            "//*[@name=\"Description\"]/@content"
         ],
        "parse.title": [
            "//TITLE",
            "//META[@name=\"title\"]/@content"
         ],
         "parse.keywords": "//META[@name=\"keywords\"]/@content"
      }
    },
    {
      "class": "com.digitalpebble.stormcrawler.parse.filter.LinkParseFilter",
      "name": "LinkParseFilter",
      "params": {
         "pattern": "//FRAME/@src"
       }
    },
    {
      "class": "com.digitalpebble.stormcrawler.parse.filter.DomainParseFilter",
      "name": "DomainParseFilter",
      "params": {
        "key": "domain",
        "byHost": false
       }
    },
    {
      "class": "com.digitalpebble.stormcrawler.parse.filter.CommaSeparatedToMultivaluedMetadata",
      "name": "CommaSeparatedToMultivaluedMetadata",
      "params": {
        "keys": ["parse.keywords"]
       }
    }
  ]
}

pom.xml

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">

    <modelVersion>4.0.0</modelVersion>
    <groupId>org.rcsb.crawler</groupId>
    <artifactId>stormcrawler</artifactId>
    <version>1.0-SNAPSHOT</version>
    <packaging>jar</packaging>

    <name>stormcrawler</name>

    <properties>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
        <stormcrawler.version>1.17</stormcrawler.version>
    </properties>

    <build>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-compiler-plugin</artifactId>
                <version>3.2</version>
                <configuration>
                    <source>1.8</source>
                    <target>1.8</target>
                </configuration>
            </plugin>
            <plugin>
                <groupId>org.codehaus.mojo</groupId>
                <artifactId>exec-maven-plugin</artifactId>
                <version>1.3.2</version>
                <executions>
                    <execution>
                        <goals>
                            <goal>exec</goal>
                        </goals>
                    </execution>
                </executions>
                <configuration>
                    <executable>java</executable>
                    <includeProjectDependencies>true</includeProjectDependencies>
                    <includePluginDependencies>false</includePluginDependencies>
                    <classpathScope>compile</classpathScope>
                </configuration>
            </plugin>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-shade-plugin</artifactId>
                <version>1.3.3</version>
                <executions>
                    <execution>
                        <phase>package</phase>
                        <goals>
                            <goal>shade</goal>
                        </goals>
                        <configuration>
                            <createDependencyReducedPom>false</createDependencyReducedPom>
                            <transformers>
                                <transformer
                                    implementation="org.apache.maven.plugins.shade.resource.ServicesResourceTransformer" />
                                <transformer
                                    implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
                                    <mainClass>org.apache.storm.flux.Flux</mainClass>
                                    <manifestEntries>
                                        <Change></Change>
                                        <Build-Date></Build-Date>
                                    </manifestEntries>
                                </transformer>
                            </transformers>
                            <!-- The filters below are necessary if you want to include the Tika
                                module -->
                            <filters>
                                <filter>
                                    <artifact>*:*</artifact>
                                    <excludes>
                                        <exclude>META-INF/*.SF</exclude>
                                        <exclude>META-INF/*.DSA</exclude>
                                        <exclude>META-INF/*.RSA</exclude>
                                    </excludes>
                                </filter>
                                <filter>
                                    <!-- https://issues.apache.org/jira/browse/STORM-2428 -->
                                    <artifact>org.apache.storm:flux-core</artifact>
                                    <excludes>
                                        <exclude>org/apache/commons/**</exclude>
                                        <exclude>org/apache/http/**</exclude>
                                        <exclude>org/yaml/**</exclude>
                                    </excludes>
                                </filter>
                            </filters>
                        </configuration>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>

    <dependencies>
        <dependency>
            <groupId>com.digitalpebble.stormcrawler</groupId>
            <artifactId>storm-crawler-core</artifactId>
            <version>${stormcrawler.version}</version>
        </dependency>
        <dependency>
            <groupId>com.digitalpebble.stormcrawler</groupId>
            <artifactId>storm-crawler-elasticsearch</artifactId>
            <version>${stormcrawler.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.storm</groupId>
            <artifactId>storm-core</artifactId>
            <version>1.2.3</version>
            <scope>provided</scope>
        </dependency>
        <dependency>
            <groupId>org.apache.storm</groupId>
            <artifactId>flux-core</artifactId>
            <version>1.2.3</version>
        </dependency>
    </dependencies>
</project>
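
Once everything behaves in local mode, the same jar can be submitted to a running Storm cluster. A sketch, assuming a cluster is already set up and Nimbus is reachable:

    # Submit the topology to the cluster instead of running it locally
    storm jar target/stormcrawler-1.0-SNAPSHOT.jar org.apache.storm.flux.Flux --remote es-crawler.flux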