18

An interesting question was asked of me in an interview about web mining: is it possible to crawl websites using Apache Spark?

I guessed that it was possible, because Spark supports distributed processing. After the interview I searched for this, but couldn't find any satisfying answer. Is it possible with Spark?

Thamme Gowda
  • 11,249
  • 5
  • 50
  • 57
New Man
  • 219
  • 1
  • 2
  • 6
  • Try Nutch. This seems like a bad idea, by the way. Spark is a compute engine. Something like Akka, or LXD if you need containers, is better if you have to distribute at all. Python is a terribly slow but very well thought-out language (a paradox); perhaps you are coming from there. I am achieving 1,000,000 pages per day per source with a single node running my Goat Grazer packages on GitHub. Spark is well built for computation but not for networking. https://github.com/asevans48. I plan API support and distribution, while being generally heavier than Scrapy. – Andrew Scott Evans Nov 03 '16 at 16:32

5 Answers

12

Spark adds essentially no value to this task.

Sure, you can do distributed crawling, but good crawling tools already support this out of the box. The data structures provided by Spark, such as RDDs, are pretty much useless here, and just to launch crawl jobs you could use YARN, Mesos, etc. directly with less overhead.

Sure, you could do this on Spark. Just as you could write a word processor on Spark, since it is Turing complete... but it doesn't get any easier.

Has QUIT--Anony-Mousse
  • 76,138
  • 12
  • 138
  • 194
  • Maybe. Suppose we need to collect data from a huge number of web pages; we could do it in a distributed environment with Spark in less time... Is it useful in that way? – New Man Apr 30 '15 at 15:25
  • 1
    So what? You need a large storage system and some nodes, but you do not need Spark for that. Use HDFS+YARN, or whatever low-level stack you want. **People have been doing large crawls since long before Spark**. – Has QUIT--Anony-Mousse Apr 30 '15 at 15:27
  • So your suggestion is to use something other than Spark. But if it is possible, then why not? It's the latest, right? – New Man Apr 30 '15 at 15:49
  • It's changing. By next year, it will be all different. Use something that is reliable for a major project. Have you checked whether Nutch can simply run on Hadoop itself? – Has QUIT--Anony-Mousse Apr 30 '15 at 16:10
  • No, I am just starting with a sample on Spark, standalone first. I am comfortable with Jsoup, but not with anything else. – New Man Apr 30 '15 at 16:20
  • 1
    Standalone makes even less sense. Keep your program lean and simple, instead of stacking layer on top of layer until you cannot debug it anymore. – Has QUIT--Anony-Mousse Apr 30 '15 at 16:26
7

How about this way:

Your application would get a set of website URLs as input for your crawler. If you were implementing this as a normal (non-Spark) app, you might do it as follows (a threaded sketch follows the list below):

  1. split all the web pages to be crawled into a list of separate sub-sites, each small enough to fit well in a single thread: for example, if you have to crawl www.example.com/news from 20150301 to 20150401, the split result could be: [www.example.com/news/20150301, www.example.com/news/20150302, ..., www.example.com/news/20150401]
  2. assign each base URL (e.g. www.example.com/news/20150401) to a single thread; the threads are where the real data fetching happens
  3. save the result of each thread to the file system.
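
A minimal non-Spark sketch of those three steps (the example.com date URLs, the eight-thread pool, and the output file naming are just placeholders for illustration):

import java.nio.file.{Files, Paths}
import java.util.concurrent.Executors

import scala.concurrent.duration.Duration
import scala.concurrent.{Await, ExecutionContext, Future}

object ThreadedCrawl {
  def main(args: Array[String]): Unit = {
    // Step 1: split the crawl into per-day base URLs (the example range from above)
    val baseUrls = (1 to 31).map(day => f"http://www.example.com/news/201503$day%02d")

    val pool = Executors.newFixedThreadPool(8)
    implicit val ec: ExecutionContext = ExecutionContext.fromExecutor(pool)

    // Step 2: each base URL is fetched by its own task on the thread pool
    val jobs = baseUrls.map { url =>
      Future {
        val content = scala.io.Source.fromURL(url, "UTF-8").mkString
        // Step 3: save each thread's result to the file system
        val out = Paths.get("crawl-" + url.replaceAll("[^A-Za-z0-9]", "_") + ".html")
        Files.write(out, content.getBytes("UTF-8"))
      }
    }

    Await.ready(Future.sequence(jobs), Duration.Inf)
    pool.shutdown()
  }
}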

When the application becomes a Spark one, the same procedure happens, but encapsulated in Spark notions: we can write a custom CrawlRDD that does the same stuff:

  1. Split sites: def getPartitions: Array[Partition] is a good place to do the split task.
  2. Threads to crawl each split: def compute(part: Partition, context: TaskContext): Iterator[X] is spread across all the executors of your application and runs in parallel.
  3. Save the RDD into HDFS.

The final program looks like:

import org.apache.spark.{Partition, SparkConf, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD

import scala.collection.mutable.ArrayBuffer

// X is a placeholder for whatever type holds one crawled page
case class X(url: String, content: String)

class CrawlPartition(rddId: Int, idx: Int, val baseURL: String) extends Partition {
  override def index: Int = idx
}

class CrawlRDD(baseURL: String, sc: SparkContext) extends RDD[X](sc, Nil) {

  override protected def getPartitions: Array[Partition] = {
    val partitions = new ArrayBuffer[CrawlPartition]
    //split baseURL into subsets and populate the partitions
    partitions.toArray
  }

  override def compute(part: Partition, context: TaskContext): Iterator[X] = {
    val p = part.asInstanceOf[CrawlPartition]
    val baseUrl = p.baseURL

    new Iterator[X] {
      var nextURL: String = _

      override def hasNext: Boolean = {
        //logic to find the next URL under baseUrl, fill in nextURL and return true,
        //else return false
        false // stub
      }

      override def next(): X = {
        //logic to crawl the web page at nextURL and return its content as an X
        X(nextURL, "") // stub
      }
    }
  }
}

object Crawl {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setAppName("Crawler")
    val sc = new SparkContext(sparkConf)
    val crdd = new CrawlRDD("baseURL", sc)
    crdd.saveAsTextFile("hdfs://path_here")
    sc.stop()
  }
}
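
As a rough sketch of the split in step 1, the getPartitions stub above could be filled in along these lines for the date-range example (www.example.com/news from 20150301 to 20150401); this assumes Java 8's java.time is available and is only an illustration, not the answer's actual implementation:

  // One possible body for CrawlRDD.getPartitions, using the date-range example above
  override protected def getPartitions: Array[Partition] = {
    val partitions = new ArrayBuffer[CrawlPartition]
    var day = java.time.LocalDate.of(2015, 3, 1)
    val last = java.time.LocalDate.of(2015, 4, 1)
    var idx = 0
    while (!day.isAfter(last)) {
      // BASIC_ISO_DATE renders dates as e.g. "20150301"
      val suffix = day.format(java.time.format.DateTimeFormatter.BASIC_ISO_DATE)
      partitions += new CrawlPartition(id, idx, s"$baseURL/$suffix")
      day = day.plusDays(1)
      idx += 1
    }
    partitions.toArray
  }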
yjshen
  • 6,583
  • 3
  • 31
  • 40
  • I have no prior experience in Spark; I would like to start learning Spark, especially using Java. The way you explained it is correct. There may be a lot of web pages, and I need to split the URLs or build the page URLs. Can you suggest a useful resource for starting Spark in Java? Then I can understand your answer better. Thanks for your reply – New Man Apr 30 '15 at 10:24
  • @NewMan, [Spark Doc](http://spark.apache.org/docs/latest/index.html) is a good place to start – yjshen Apr 30 '15 at 10:30
  • @NewMan, does the answer above work? If yes, please consider accepting it :) – yjshen May 05 '15 at 16:51
  • 2
    @Yijie Shen, I have started Spark from the word-count example given in the Spark examples. I have installed Scala 2.10.4, Hadoop 2.7.0 (http://localhost:50070/dfshealth.html#tab-overview), and Spark 1.3.1 (localhost:8080). How do I link these things together in standalone mode? – New Man May 07 '15 at 04:48
7

YES.

Check out the open source project: Sparkler (spark - crawler) https://github.com/USCDataScience/sparkler

Check out Sparkler Internals for a flow/pipeline diagram. (Apologies, it is an SVG image, so I couldn't post it here.)

This project wasn't available when the question was posted; however, as of December 2016 it is a very active project!

Is it possible to crawl the Websites using Apache Spark?

The following pieces may help you understand why someone would ask such a question and also help you to answer it.

  • The creators of the Spark framework wrote in the seminal paper [1] that RDDs would be less suitable for applications that make asynchronous fine-grained updates to shared state, such as a storage system for a web application or an incremental web crawler
  • RDDs are key components in Spark. However, you can create traditional map-reduce applications (with little or no abuse of RDDs)
  • There is a widely popular distributed web crawler called Nutch [2]. Nutch is built with Hadoop Map-Reduce (in fact, Hadoop Map-Reduce was extracted out of the Nutch codebase)
  • If you can do some task in Hadoop Map-Reduce, you can also do it with Apache Spark (see the sketch after the references below).

[1] http://dl.acm.org/citation.cfm?id=2228301
[2] http://nutch.apache.org/
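
As a hedged illustration of that last bullet (a sketch, not Sparkler's actual implementation), a crawl written in plain map-reduce style on Spark could look like this; the seed URLs and output path are placeholders:

import org.apache.spark.{SparkConf, SparkContext}

object MapReduceStyleCrawl {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("MapReduceStyleCrawl"))

    // Seed URLs would normally come from a file or a previous crawl round
    val seeds = sc.parallelize(Seq(
      "http://example.com/a",
      "http://example.com/b"))

    // "Map" phase: fetch each page on the executors; output (url, html) pairs
    val pages = seeds.map { url =>
      val html = scala.io.Source.fromURL(url, "UTF-8").mkString
      (url, html)
    }

    pages.saveAsTextFile("hdfs:///crawl/round-0")
    sc.stop()
  }
}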


PS: I am a co-creator of Sparkler and a committer and PMC member for Apache Nutch.


When I designed Sparkler, I created an RDD which is a proxy to Solr/Lucene-based indexed storage. It enabled our crawler-database RDD to make asynchronous fine-grained updates to shared state, which is otherwise not possible natively in Spark.

Thamme Gowda
  • 11,249
  • 5
  • 50
  • 57
  • your project looks good and I would like to give it a try. Is it still maintained in 2021? – Francesco Mantovani Nov 27 '21 at 23:42
  • @FrancescoMantovani Yes, maintained by my friends at USC DataScience https://github.com/USCDataScience/ – Thamme Gowda Nov 29 '21 at 23:14
  • Oh, what a pleasure to meet you then, @Thamme. They might be busy, because I opened an issue but got no reply: https://github.com/USCDataScience/sparkler/issues/238 I'm willing to cooperate on debugging; I have already tested twice on two different environments and the error is the same. I'm following the documentation exactly. – Francesco Mantovani Nov 30 '21 at 09:59
  • @FrancescoMantovani It's an open-source project. Pull requests are welcome! – Thamme Gowda Nov 30 '21 at 21:18
2

There is a project called SpookyStuff, which is a

Scalable query engine for web scraping/data mashup/acceptance QA, powered by Apache Spark

Hope it helps!

Aito
  • 6,812
  • 3
  • 30
  • 41
1

I think the accepted answer is incorrect in one fundamental way: real-life, large-scale web extraction is a pull process.

This is because requesting HTTP content is often a far less laborious task than building the response. I have built a small program which is able to crawl 16 million pages a day with four CPU cores and 3 GB of RAM, and it was not even particularly well optimized. For the server on the other end, a similar load (~200 requests per second) is not trivial and usually requires many layers of optimization.

Real websites can, for example, break their cache system if you crawl them too fast (instead of holding the most popular pages, the cache gets flooded with the long-tail content of the crawl). So in that sense, a good web scraper always respects robots.txt etc.

The real benefit of a distributed crawler doesn't come from splitting the workload of one domain, but from spreading the workload of many domains across a single distributed process, so that the one process can confidently track how many requests the system puts through.

Of course, in some cases you may want to be the bad boy and break the rules; however, in my experience such products don't stay alive long, since website owners like to protect their assets from things that look like DoS attacks.

Golang is very good for building web scrapers, since it has channels as a native data type and they support pull queues very well. Because the HTTP protocol, and scraping in general, is slow, you can include the extraction pipelines as part of the process, which lowers the amount of data that has to be stored in the data warehouse system. You can crawl one TB for less than $1 worth of resources, and do it fast, when using Golang and Google Cloud (probably doable with AWS and Azure as well).

Spark gives you no additional value here. Using wget as a client is clever, since it automatically respects robots.txt properly: a parallel, domain-specific pull queue feeding wget is the way to go if you are working professionally.
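
The answer recommends Golang for this; as a rough analogue in this thread's language (Scala), a per-domain pull queue feeding wget might look like the sketch below. The domains, seed URLs, output directories, and the one-second politeness delay are all placeholders, not anything the answer prescribes:

import java.util.concurrent.{Executors, LinkedBlockingQueue}

object DomainPullQueues {
  def main(args: Array[String]): Unit = {
    // One pull queue per domain, so the request rate can be tracked and throttled per site
    val queues = Map(
      "example.com" -> new LinkedBlockingQueue[String](),
      "example.org" -> new LinkedBlockingQueue[String]())
    queues("example.com").put("http://example.com/")
    queues("example.org").put("http://example.org/")

    val pool = Executors.newFixedThreadPool(queues.size)
    queues.foreach { case (domain, queue) =>
      pool.submit(new Runnable {
        def run(): Unit = {
          // Each worker pulls URLs only for its own domain and hands them to wget
          var url = queue.poll()
          while (url != null) {
            new ProcessBuilder("wget", "--quiet", "-P", s"out/$domain", url)
              .inheritIO().start().waitFor()
            Thread.sleep(1000) // crude politeness delay per domain
            url = queue.poll()
          }
        }
      })
    }
    pool.shutdown()
  }
}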

Ahti Ahde
  • 1,078
  • 10
  • 12