
I have a few URLs that I need to scrape with StormCrawler. Following the steps in https://medium.com/analytics-vidhya/web-scraping-and-indexing-with-stormcrawler-and-elasticsearch-a105cb9c02ca, I was able to crawl the pages and load the content into my Elasticsearch instance.

In that blog post, the author uses the Flux command to submit the injection topology that writes the seed URLs into ES:

spouts:
  - id: "spout"
    className: "com.digitalpebble.stormcrawler.spout.FileSpout"
    parallelism: 1
    constructorArgs:
      - "stormcrawlertest-master/"
      - "seeds.txt"
      - true

streams:
  - from: "spout"
    to: "status"
    grouping:
      type: CUSTOM
      customClass:
        className: "com.digitalpebble.stormcrawler.util.URLStreamGrouping"
        constructorArgs:
          - "byHost"
      streamId: "status"
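
For reference, this Flux file is submitted with the storm command, roughly like this (the jar and Flux file names here are taken from my project and the blog, so treat them as placeholders):

storm jar target/stormcrawlertest-1.0-SNAPSHOT.jar org.apache.storm.flux.Flux --local es-injector.flux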

That injects the seed URLs into ES. I then tried to reproduce the same topology in plain Java instead of Flux and created a main class:

import com.digitalpebble.stormcrawler.ConfigurableTopology;
import com.digitalpebble.stormcrawler.Constants;
// StatusUpdaterBolt from the Elasticsearch module
import com.digitalpebble.stormcrawler.elasticsearch.persistence.StatusUpdaterBolt;
import com.digitalpebble.stormcrawler.spout.FileSpout;
import com.digitalpebble.stormcrawler.util.URLStreamGrouping;
import org.apache.storm.topology.TopologyBuilder;

public class InjectorTopology extends ConfigurableTopology {

    public static void main(String[] args) {
        String[] argsa = new String[] { "-conf", "/crawler-conf.yaml", "-conf", "/es-conf.yaml", "-local" };
        ConfigurableTopology.start(new InjectorTopology(), argsa);
    }

    @Override
    protected int run(String[] args) {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("spout", new FileSpout("stormcrawlertest-master/", "seeds.txt", true), 1);
        builder.setBolt("status", new StatusUpdaterBolt(), 1)
                .customGrouping("spout", new URLStreamGrouping(Constants.PARTITION_MODE_HOST));
        return submit("ESInjectorInstance", conf, builder);
    }
}
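
For reference, the seeds.txt that the FileSpout reads is just a plain list of start URLs, one per line, e.g. (the single site I am testing with):

https://www.getcosi.com/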

I then clean and package with Maven (mvn clean package) and submit the topology with

python storm.py jar target/stormcrawlertest-1.0-SNAPSHOT.jar com.my.sitescraper.main.SiteScraper

but this does not inject any URLs into ES.
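
One way to check whether anything was actually injected (assuming ES runs on localhost:9200 and the default status index name from the archetype) is to count the documents in the status index:

curl "http://localhost:9200/status/_count?pretty"

In my case nothing shows up there.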

What am I missing?

  • Could you add your full error output? Do you run locally or on a cluster? – moosehead42 Jun 25 '21 at 06:10
  • Did you look at the tutorials? https://www.youtube.com/watch?v=8kpJLPdhvLw – Julien Nioche Jun 25 '21 at 06:30
  • thank you for the reply. I am running it locally; as per the blog it works fine and the site https://www.getcosi.com/ is crawled, but only the first page, it does not follow all the links. My POC is to make those Flux streams work in Java so I can debug, but the ES status index is not updated when I call my custom Java class. Yes, I have gone through the YouTube video; Flux is used there as well. How can we run this as a service, for example when a new URL I need to crawl is added to seeds.txt, start the crawl at that point? And how can we run it without the storm command? – operation_java Jun 25 '21 at 08:02
  • followed the README and watched the YouTube video. Getting this error: Exception in thread "main" org.apache.storm.utils.NimbusLeaderNotFoundException: Could not find leader nimbus from seed hosts [localhost]. Did you specify a valid list of nimbus hosts for config nimbus.seeds? at org.apache.storm.utils.NimbusClient.getConfiguredClientAs(NimbusClient.java:120) at org.apache.storm.utils.NimbusClient.getCo… – operation_java Jun 25 '21 at 18:31
  • my steps: OS Windows 10, Java 1.8.0_201, Python 3.6.5, Storm apache-storm-2.2.0 (unzipped), Apache Maven 3.6.3, Elasticsearch (local) 7.12.0. Ran C:\crawler>mvn archetype:generate -DarchetypeGroupId=com.digitalpebble.stormcrawler -DarchetypeArtifactId=storm-crawler-elasticsearch-archetype -DarchetypeVersion=2.1 and followed the README. There are no updates in Elastic. What am I doing wrong? Do we need ZooKeeper, Nimbus and a Supervisor running first to run the application on a local Windows system? – operation_java Jun 27 '21 at 07:18
  • getting this ERROR o.a.s.u.Utils - Halting process: Worker died java.lang.RuntimeException: Halting process: Worker died at org.apache.storm.utils.Utils.exitProcess(Utils.java:514) [storm-client-2.2.0.jar:2.2.0] at org.apache.storm.utils.Utils$3.run(Utils.java:837) [storm-client-2.2.0.jar:2.2.0] at org.apache.storm.executor.error.ReportErrorAndDie.uncaughtException(ReportErrorAndDie.java:41) [storm-client-2.2.0.jar:2.2.0] at java.lang.Thread.dispatchUncaughtException(Thread.java:1959) [?:1.8.0_201] – operation_java Jun 27 '21 at 07:19

0 Answers