
I'm attempting to use StormCrawler to crawl a set of pages on our website. While it is able to retrieve and index some of the text on these pages, it is not capturing a large amount of the other text on them.

I've installed Zookeeper, Apache Storm, and Stormcrawler using the Ansible playbooks provided here (thank you a million for those!) on a server running Ubuntu 18.04, along with Elasticsearch and Kibana. For the most part, I'm using the configuration defaults, but have made the following changes:

  • For the Elastic index mappings, I've enabled _source: true and turned on indexing and storing for all properties (content, host, title, url); a sketch of the mapping I mean follows this list
  • In the crawler-conf.yaml configuration, I've commented out all textextractor.include.pattern and textextractor.exclude.tags settings, so that the whole page is captured
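
For reference, this is roughly the mapping change I made on the "content" index, expressed as a quick Python sketch (the requests call and the text/keyword field types here are my own assumptions, so adjust to your schema; only _source and "store" matter for this question):

import requests

# Recreate the "content" index with _source enabled and every field stored.
mapping = {
    "mappings": {
        "_source": {"enabled": True},
        "properties": {
            "content": {"type": "text", "store": True},
            "host": {"type": "keyword", "store": True},
            "title": {"type": "text", "store": True},
            "url": {"type": "keyword", "store": True},
        },
    }
}

resp = requests.put("http://localhost:9200/content", json=mapping)
print(resp.json())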

After re-creating fresh ES indices, running mvn clean package, and then starting the crawler topology, StormCrawler begins crawling and content starts appearing in Elasticsearch. However, for many pages the content that is retrieved and indexed is only a subset of the text on the page, and it usually excludes the main page text we are interested in.
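
To see exactly what ends up in the index for a given page, I've been checking the content field directly (a quick sketch with the Python requests library; the index name "content", the url/content field names, and the assumption that url is a keyword field all come from my configuration below, and the URL itself is a placeholder):

import requests

# Fetch the indexed document for one URL and report how much text was captured.
page_url = "https://www.example.com/some/page"  # placeholder

query = {
    "query": {"term": {"url": page_url}},
    "_source": ["url", "title", "content"],
}
resp = requests.get("http://localhost:9200/content/_search", json=query)

for hit in resp.json()["hits"]["hits"]:
    src = hit["_source"]
    print(src["url"], "->", len(src.get("content", "")), "characters of indexed text")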

For example, the text under the following DOM path is not returned/indexed:

<html> <body> <div#maincontentcontainer.container> <div#docs-container> <div> <div.row> <div.col-lg-9.col-md-8.col-sm-12.content-item> <div> <div> <p> (text)

While the text in this path is returned:

<html> <body> <div> <div.container> <div.row> <p> (text)

Are there any additional configuration changes that need to be made beyond commenting out all of the specific tag include and exclude patterns? My understanding of the documentation is that, with no include patterns or excluded tags set, the whole page should be indexed.

I would greatly appreciate any help. Thank you for the excellent software.

Below are my configuration files:

crawler-conf.yaml

config:
  topology.workers: 3
  topology.message.timeout.secs: 1000
  topology.max.spout.pending: 100
  topology.debug: false

  fetcher.threads.number: 100

  # override the JVM parameters for the workers
  topology.worker.childopts: "-Xmx2g -Djava.net.preferIPv4Stack=true"

  # mandatory when using Flux
  topology.kryo.register:
    - com.digitalpebble.stormcrawler.Metadata

  # metadata to transfer to the outlinks
  # metadata.transfer:
  # - customMetadataName

  # lists the metadata to persist to storage
  metadata.persist:
   - _redirTo
   - error.cause
   - error.source
   - isSitemap
   - isFeed
   
  http.agent.name: "My crawler"
  http.agent.version: "1.0"
  http.agent.description: ""
  http.agent.url: ""
  http.agent.email: ""

  # The maximum number of bytes for returned HTTP response bodies.
  http.content.limit: -1

  # FetcherBolt queue dump => comment out to activate
  # fetcherbolt.queue.debug.filepath: "/tmp/fetcher-dump-{port}"

  parsefilters.config.file: "parsefilters.json"
  urlfilters.config.file: "urlfilters.json"

  # revisit a page daily (value in minutes)
  fetchInterval.default: 1440

  # revisit a page with a fetch error after 2 hours (value in minutes)
  fetchInterval.fetch.error: 120

  # never revisit a page with an error (or set a value in minutes)
  fetchInterval.error: -1

  # text extraction for JSoupParserBolt
  # textextractor.include.pattern:
  #  - DIV[id="maincontent"]
  #  - DIV[itemprop="articleBody"]
  #  - ARTICLE

  # textextractor.exclude.tags:
  #  - STYLE
  #  - SCRIPT

  # configuration for the classes extending AbstractIndexerBolt
  # indexer.md.filter: "someKey=aValue"
  indexer.url.fieldname: "url"
  indexer.text.fieldname: "content"
  indexer.canonical.name: "canonical"
  indexer.md.mapping:
  - parse.title=title
  - parse.keywords=keywords
  - parse.description=description
  - domain=domain

  # Metrics consumers:
  topology.metrics.consumer.register:
     - class: "org.apache.storm.metric.LoggingMetricsConsumer"
       parallelism.hint: 1

  http.protocol.implementation: "com.digitalpebble.stormcrawler.protocol.selenium.RemoteDriverProtocol"
  https.protocol.implementation: "com.digitalpebble.stormcrawler.protocol.selenium.RemoteDriverProtocol"
  selenium.addresses: "http://localhost:9515"

es-conf.yaml

config:
  # ES indexer bolt
  es.indexer.addresses: "localhost"
  es.indexer.index.name: "content"
  # es.indexer.pipeline: "_PIPELINE_"
  es.indexer.create: false
  es.indexer.bulkActions: 100
  es.indexer.flushInterval: "2s"
  es.indexer.concurrentRequests: 1
  
  # ES metricsConsumer
  es.metrics.addresses: "http://localhost:9200"
  es.metrics.index.name: "metrics"
  
  # ES spout and persistence bolt
  es.status.addresses: "http://localhost:9200"
  es.status.index.name: "status"
  es.status.routing: true
  es.status.routing.fieldname: "key"
  es.status.bulkActions: 500
  es.status.flushInterval: "5s"
  es.status.concurrentRequests: 1
  
  # spout config #
    
  # positive or negative filters parsable by the Lucene Query Parser
  # es.status.filterQuery: 
  #  - "-(key:stormcrawler.net)"
  #  - "-(key:digitalpebble.com)"

  # time in secs for which the URLs will be considered for fetching after an ack or a fail
  spout.ttl.purgatory: 30
  
  # Min time (in msecs) to allow between 2 successive queries to ES
  spout.min.delay.queries: 2000

  # Delay since previous query date (in secs) after which the nextFetchDate value will be reset to the current time
  spout.reset.fetchdate.after: 120

  es.status.max.buckets: 50
  es.status.max.urls.per.bucket: 2
  # field to group the URLs into buckets
  es.status.bucket.field: "key"
  # fields to sort the URLs within a bucket
  es.status.bucket.sort.field: 
   - "nextFetchDate"
   - "url"
  # field to sort the buckets
  es.status.global.sort.field: "nextFetchDate"

  # CollapsingSpout : limits the deep paging by resetting the start offset for the ES query 
  es.status.max.start.offset: 500
  
  # AggregationSpout : sampling improves the performance on large crawls
  es.status.sample: false

  # max allowed duration of a query in sec 
  es.status.query.timeout: -1

  # AggregationSpout (expert): adds this value in mins to the latest date returned in the results and
  # use it as nextFetchDate
  es.status.recentDate.increase: -1
  es.status.recentDate.min.gap: -1

  topology.metrics.consumer.register:
       - class: "com.digitalpebble.stormcrawler.elasticsearch.metrics.MetricsConsumer"
         parallelism.hint: 1
         #whitelist:
         #  - "fetcher_counter"
         #  - "fetcher_average.bytes_fetched"
         #blacklist:
         #  - "__receive.*"

es-crawler.flux

name: "crawler"

includes:
    - resource: true
      file: "/crawler-default.yaml"
      override: false

    - resource: false
      file: "crawler-conf.yaml"
      override: true

    - resource: false
      file: "es-conf.yaml"
      override: true

spouts:
  - id: "spout"
    className: "com.digitalpebble.stormcrawler.elasticsearch.persistence.AggregationSpout"
    parallelism: 10

  - id: "filespout"
    className: "com.digitalpebble.stormcrawler.spout.FileSpout"
    parallelism: 1
    constructorArgs:
      - "."
      - "seeds.txt"
      - true

bolts:
  - id: "filter"
    className: "com.digitalpebble.stormcrawler.bolt.URLFilterBolt"
    parallelism: 3
  - id: "partitioner"
    className: "com.digitalpebble.stormcrawler.bolt.URLPartitionerBolt"
    parallelism: 3
  - id: "fetcher"
    className: "com.digitalpebble.stormcrawler.bolt.FetcherBolt"
    parallelism: 3
  - id: "sitemap"
    className: "com.digitalpebble.stormcrawler.bolt.SiteMapParserBolt"
    parallelism: 3
  - id: "parse"
    className: "com.digitalpebble.stormcrawler.bolt.JSoupParserBolt"
    parallelism: 12
  - id: "index"
    className: "com.digitalpebble.stormcrawler.elasticsearch.bolt.IndexerBolt"
    parallelism: 3
  - id: "status"
    className: "com.digitalpebble.stormcrawler.elasticsearch.persistence.StatusUpdaterBolt"
    parallelism: 3
  - id: "status_metrics"
    className: "com.digitalpebble.stormcrawler.elasticsearch.metrics.StatusMetricsBolt"
    parallelism: 3

streams:
  - from: "spout"
    to: "partitioner"
    grouping:
      type: SHUFFLE
      
  - from: "spout"
    to: "status_metrics"
    grouping:
      type: SHUFFLE     

  - from: "partitioner"
    to: "fetcher"
    grouping:
      type: FIELDS
      args: ["key"]

  - from: "fetcher"
    to: "sitemap"
    grouping:
      type: LOCAL_OR_SHUFFLE

  - from: "sitemap"
    to: "parse"
    grouping:
      type: LOCAL_OR_SHUFFLE

  - from: "parse"
    to: "index"
    grouping:
      type: LOCAL_OR_SHUFFLE

  - from: "fetcher"
    to: "status"
    grouping:
      type: FIELDS
      args: ["url"]
      streamId: "status"

  - from: "sitemap"
    to: "status"
    grouping:
      type: FIELDS
      args: ["url"]
      streamId: "status"

  - from: "parse"
    to: "status"
    grouping:
      type: FIELDS
      args: ["url"]
      streamId: "status"

  - from: "index"
    to: "status"
    grouping:
      type: FIELDS
      args: ["url"]
      streamId: "status"

  - from: "filespout"
    to: "filter"
    grouping:
      type: FIELDS
      args: ["url"]
      streamId: "status"

  - from: "filter"
    to: "status"
    grouping:
      streamId: "status"
      type: CUSTOM
      customClass:
        className: "com.digitalpebble.stormcrawler.util.URLStreamGrouping"
        constructorArgs:
          - "byDomain"

parsefilters.json

{
  "com.digitalpebble.stormcrawler.parse.ParseFilters": [
    {
      "class": "com.digitalpebble.stormcrawler.parse.filter.XPathFilter",
      "name": "XPathFilter",
      "params": {
        "canonical": "//*[@rel=\"canonical\"]/@href",
        "parse.description": [
            "//*[@name=\"description\"]/@content",
            "//*[@name=\"Description\"]/@content"
         ],
        "parse.title": [
            "//TITLE",
            "//META[@name=\"title\"]/@content"
         ],
         "parse.keywords": "//META[@name=\"keywords\"]/@content"
      }
    },
    {
      "class": "com.digitalpebble.stormcrawler.parse.filter.LinkParseFilter",
      "name": "LinkParseFilter",
      "params": {
         "pattern": "//FRAME/@src"
       }
    },
    {
      "class": "com.digitalpebble.stormcrawler.parse.filter.DomainParseFilter",
      "name": "DomainParseFilter",
      "params": {
        "key": "domain",
        "byHost": false
       }
    },
    {
      "class": "com.digitalpebble.stormcrawler.parse.filter.CommaSeparatedToMultivaluedMetadata",
      "name": "CommaSeparatedToMultivaluedMetadata",
      "params": {
        "keys": ["parse.keywords"]
       }
    }
  ]
}

Attempting to use Chromedriver

I installed the latest versions of Chromedriver and Google Chrome for Ubuntu.

First I start chromedriver in headless mode at localhost:9515 as the stormcrawler user (via a separate Python shell, as shown below), and then I restart the stormcrawler topology (also as the stormcrawler user), but I end up with a stack of errors related to Chrome. The odd thing, however, is that I can confirm chromedriver is running fine within the Python shell directly, and that both the driver and the browser are actively running (via ps -ef). The same stack of errors also occurs when I simply start chromedriver from the command line (i.e., chromedriver --headless &).

Starting chromedriver in headless mode (in python3 shell)

from selenium import webdriver
options = webdriver.ChromeOptions()
options.add_argument('--no-sandbox')
options.add_argument('--headless')
options.add_argument('--window-size=1200x600')
options.add_argument('--disable-dev-shm-usage')
options.add_argument('--disable-setuid-sandbox')
options.add_argument('--disable-extensions')
options.add_argument('--disable-infobars')
options.add_argument('--remote-debugging-port=9222')
options.add_argument('--user-data-dir=/home/stormcrawler/cache/google/chrome')
options.add_argument('--disable-gpu')
options.add_argument('--profile-directory=Default')
options.binary_location = '/usr/bin/google-chrome'
driver = webdriver.Chrome(chrome_options=options, port=9515, executable_path=r'/usr/bin/chromedriver')
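
Since StormCrawler's RemoteDriverProtocol connects to the configured selenium.addresses as a remote WebDriver client rather than launching the browser itself, I also tried reproducing that path from Python against the already-running chromedriver (a rough sketch; example.com is just a placeholder, and depending on your Selenium version you may need desired_capabilities=options.to_capabilities() instead of options=options):

from selenium import webdriver

# Connect to the chromedriver already listening on localhost:9515 as a remote
# WebDriver client, passing the same Chrome options used above.
options = webdriver.ChromeOptions()
options.add_argument('--no-sandbox')
options.add_argument('--headless')
options.add_argument('--disable-dev-shm-usage')
options.binary_location = '/usr/bin/google-chrome'

remote = webdriver.Remote(command_executor='http://localhost:9515', options=options)
remote.get('https://www.example.com')  # placeholder URL
print(remote.title)
remote.quit()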

Stack trace from starting the stormcrawler topology

Run command: storm jar target/stormcrawler-1.0-SNAPSHOT.jar org.apache.storm.flux.Flux --local es-crawler.flux --sleep 60000

9486 [Thread-26-fetcher-executor[3 3]] ERROR o.a.s.util - Async loop died!
java.lang.RuntimeException: org.openqa.selenium.WebDriverException: unknown error: Chrome failed to start: exited abnormally.
  (unknown error: DevToolsActivePort file doesn't exist)
  (The process started from chrome location /usr/bin/google-chrome is no longer running, so ChromeDriver is assuming that Chrome has crashed.)
Build info: version: '4.0.0-alpha-6', revision: '5f43a29cfc'
System info: host: 'stormcrawler-dev', ip: '127.0.0.1', os.name: 'Linux', os.arch: 'amd64', os.version: '4.15.0-33-generic', java.version: '1.8.0_282'
Driver info: driver.version: RemoteWebDriver
remote stacktrace: #0 0x55d590b21e89 <unknown>

    at com.digitalpebble.stormcrawler.protocol.selenium.RemoteDriverProtocol.configure(RemoteDriverProtocol.java:101) ~[stormcrawler-1.0-SNAPSHOT.jar:?]
    at com.digitalpebble.stormcrawler.protocol.ProtocolFactory.<init>(ProtocolFactory.java:69) ~[stormcrawler-1.0-SNAPSHOT.jar:?]
    at com.digitalpebble.stormcrawler.bolt.FetcherBolt.prepare(FetcherBolt.java:818) ~[stormcrawler-1.0-SNAPSHOT.jar:?]
    at org.apache.storm.daemon.executor$fn__10180$fn__10193.invoke(executor.clj:803) ~[storm-core-1.2.3.jar:1.2.3]
    at org.apache.storm.util$async_loop$fn__624.invoke(util.clj:482) [storm-core-1.2.3.jar:1.2.3]
    at clojure.lang.AFn.run(AFn.java:22) [clojure-1.7.0.jar:?]
    at java.lang.Thread.run(Thread.java:748) [?:1.8.0_282]
Caused by: org.openqa.selenium.WebDriverException: unknown error: Chrome failed to start: exited abnormally.
  (unknown error: DevToolsActivePort file doesn't exist)
  (The process started from chrome location /usr/bin/google-chrome is no longer running, so ChromeDriver is assuming that Chrome has crashed.)
...

Confirming that chromedriver and chrome are both running and reachable

~/stormcrawler$ ps -ef | grep -i 'driver'
stormcr+ 18862 18857  0 14:28 pts/0    00:00:00 /usr/bin/chromedriver --port=9515
stormcr+ 18868 18862  0 14:28 pts/0    00:00:00 /usr/bin/google-chrome --disable-background-networking --disable-client-side-phishing-detection --disable-default-apps --disable-dev-shm-usage --disable-extensions --disable-gpu --disable-hang-monitor --disable-infobars --disable-popup-blocking --disable-prompt-on-repost --disable-setuid-sandbox --disable-sync --enable-automation --enable-blink-features=ShadowDOMV0 --enable-logging --headless --log-level=0 --no-first-run --no-sandbox --no-service-autorun --password-store=basic --profile-directory=Default --remote-debugging-port=9222 --test-type=webdriver --use-mock-keychain --user-data-dir=/home/stormcrawler/cache/google/chrome --window-size=1200x600
stormcr+ 18899 18877  0 14:28 pts/0    00:00:00 /opt/google/chrome/chrome --type=renderer --no-sandbox --disable-dev-shm-usage --enable-automation --enable-logging --log-level=0 --remote-debugging-port=9222 --test-type=webdriver --allow-pre-commit-input --ozone-platform=headless --field-trial-handle=17069524199442920904,10206176048672570859,131072 --disable-gpu-compositing --enable-blink-features=ShadowDOMV0 --lang=en-US --headless --enable-crash-reporter --lang=en-US --num-raster-threads=1 --renderer-client-id=4 --shared-files=v8_context_snapshot_data:100

~/stormcrawler$ sudo netstat -lp
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name    
tcp        0      0 localhost:9222          0.0.0.0:*               LISTEN      18026/google-chrome 
tcp        0      0 localhost:9515          0.0.0.0:*               LISTEN      18020/chromedriver  
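
I also checked that chromedriver actually answers WebDriver requests on that port, not just that it is listening, by hitting its standard /status endpoint (quick Python sketch):

import requests

# GET /status is part of the WebDriver protocol; "ready": true in the response
# means chromedriver is willing to start new sessions.
resp = requests.get("http://localhost:9515/status")
print(resp.status_code, resp.json())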
Dennis
  • Hi Dennis, is the URL of the page you are fetching publicly available? I'd like to see if I can reproduce the issue. For the Elastic index mappings, if _source is set to true you don't need to explicitly set the fields to stored and indexed. – Julien Nioche Apr 18 '21 at 07:12
  • I forgot 2 important questions: which version of SC do you use? Could you share your parsefilter.json file as well please? – Julien Nioche Apr 19 '21 at 16:16
  • @JulienNioche Of course, I've added the parsefilters.json file to my post. I'm using Storm 1.2.3, Zookeeper 3.5.6, and Storm-crawler-elasticsearch-archetype 1.17. – Dennis Apr 19 '21 at 19:35
  • 1
    @JulienNioche, Whoops, my apologies. This appears to just be a javascript issue. I'll try using the Selenium component in Stormcrawler... – Dennis Apr 20 '21 at 20:14
  • Hi @JulienNioche, I gave selenium and phantomJS a try, but it seems like phantomJS is not able to source outlinking javascript pages, which are critical for generating the content we wish to capture. However, in testing out Chromedriver with python, that seemed capable of doing so. Do you have any examples of incorporating headless Chromedriver instead of PhantomJS in Stormcrawler? (I can send you the URL for the web page again directly--would just prefer not to share it here to avoid it incidentally being used as a test case by others). – Dennis Apr 23 '21 at 16:49
  • 1
    https://github.com/DigitalPebble/storm-crawler/blob/master/core/src/main/java/com/digitalpebble/stormcrawler/protocol/selenium/RemoteDriverProtocol.java takes an address and port - this should work with Chromedriver – Julien Nioche Apr 23 '21 at 17:29
  • @JulienNioche I see, so I can connect any driver so long as I provide the address and port it's running on. Thanks, I'll give that a try. For reference for others, I found the following article helpful for setting up Chromedriver in headless mode: https://intoli.com/blog/running-selenium-with-headless-chrome/ – Dennis Apr 23 '21 at 18:23
  • let us know how it goes – Julien Nioche Apr 25 '21 at 08:39
  • @JulienNioche Unfortunately, still running into some trouble. I've added details of the chromedriver experience as a separate section in my original post above. It looks like a Chrome issue, but I can confirm the installation works when I run it via python or the command line, so it seems to only happen when connecting to it via stormcrawler. – Dennis Apr 26 '21 at 21:17

1 Answer


IIRC you need to set some additional config to work with ChromeDriver.

Alternatively (I haven't tried it yet), https://hub.docker.com/r/browserless/chrome would be a nice way of handling Chrome in a Docker container.

Julien Nioche
  • Thank you, I was eventually able to get it working using the additional configuration settings you shared. However, this ended up leading me to discover a separate but related bug that I should point out to you--there's an issue with the default setting for `selenium.pageLoadTimeout` being `-1`, which results in the following error: `java.lang.RuntimeException: org.openqa.selenium.InvalidArgumentException: invalid argument: value must be a non-negative integer`. I needed to explicitly set this parameter to >0 in my `crawler-conf.yaml` file in order to get the topology working. – Dennis Apr 28 '21 at 23:18
  • 1
    Also, since my original question ended up turning into a different question, I'll plan to re-post the chromedriver setup as a separate question and can answer it myself with details of my final setup. – Dennis Apr 28 '21 at 23:27
  • Glad you got it to work. If you think selenium.pageLoadTimeout is a bug, please open an issue on Github. Thanks – Julien Nioche Apr 29 '21 at 07:30
  • 1
    Sure thing, done. And just posted the Q&A for the chromedriver setup here: https://stackoverflow.com/questions/67320758/how-do-you-set-up-stormcrawler-to-run-with-chromedriver-instead-of-phantomjs/67320759#67320759 – Dennis Apr 29 '21 at 15:46