
I've spent hours going through YouTube videos and tutorials trying to understand how to run a word count program for Spark, in Scala, and then turn it into a jar file. I'm getting utterly confused now.

I got Hello World running, and I've learned about going to the project libraries to add org.apache.spark:spark-core, but now I'm getting

Error: Could not find or load main class WordCount

Furthermore, I'm utterly bewildered as to why these two tutorials, which I thought were teaching the same thing, seem to differ so much: tutorial1 tutorial2

The second one seems to be twice as long as the first and it throws in things that the first didn't mention. Should I be relying on either of these to help me get a simple word count program and jar up and running?

P.S. My code currently looks like this (I copied it from somewhere):

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark._

object WordCount {
  def main(args: Array[String]) {

    val sc = new SparkContext( "local", "Word Count", "/usr/local/spark", Nil, Map(), Map())
    val input = sc.textFile("../Data/input.txt")
    val count = input.flatMap(line ⇒ line.split(" "))
      .map(word ⇒ (word, 1))
      .reduceByKey(_ + _)
    count.saveAsTextFile("outfile")
    System.out.println("OK");
  }
}
Reddspark
  • Your first link is a PDF out of your computer... We can't access that – OneCricketeer Sep 02 '17 at 05:44
  • @cricket_007 It's here: "Setting up spark 2.0 with intellij community edition.pdf" https://www.ibm.com/developerworks/community/files/app#/file/b41505ac-141b-45a2-84cd-1b6a8d5ae653 – Dmytro Mitin Sep 02 '17 at 09:52

4 Answers


In IntelliJ IDEA, do File -> New -> Project -> Scala -> SBT -> (select a location and name for the project) -> Finish.

Write in build.sbt

scalaVersion := "2.11.11"
libraryDependencies += "org.apache.spark" % "spark-core_2.11" % "2.2.0"
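
As a side note, sbt's %% operator appends the Scala binary version to the artifact name for you, so with scalaVersion set to 2.11.x the same dependency could equivalently be written as

libraryDependencies += "org.apache.spark" %% "spark-core" % "2.2.0"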

Do sbt update on the command line (from within your main project folder) or press the refresh button in the SBT tool window inside IntelliJ IDEA.

Write your code in src/main/scala/WordCount.scala

import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]) {
    val conf = new SparkConf()
      .setMaster("local")
      .setAppName("Word Count")
      .setSparkHome("src/main/resources")
    val sc = new SparkContext(conf)
    val input = sc.textFile("src/main/resources/input.txt")
    val count = input.flatMap(line ⇒ line.split(" "))
      .map(word ⇒ (word, 1))
      .reduceByKey(_ + _)
    count.saveAsTextFile("src/main/resources/outfile")
    println("OK")
  }
}

Put your input file at src/main/resources/input.txt

Run your code: Ctrl+Shift+F10 or sbt run

A new subfolder outfile containing several files should appear in src/main/resources.

Console output:

Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
17/09/02 14:57:08 INFO SparkContext: Running Spark version 2.2.0
17/09/02 14:57:09 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/09/02 14:57:09 WARN Utils: Your hostname, dmitin-HP-Pavilion-Notebook resolves to a loopback address: 127.0.1.1; using 192.168.1.104 instead (on interface wlan0)
17/09/02 14:57:09 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
17/09/02 14:57:09 INFO SparkContext: Submitted application: Word Count
17/09/02 14:57:09 INFO SecurityManager: Changing view acls to: dmitin
17/09/02 14:57:09 INFO SecurityManager: Changing modify acls to: dmitin
17/09/02 14:57:09 INFO SecurityManager: Changing view acls groups to: 
17/09/02 14:57:09 INFO SecurityManager: Changing modify acls groups to: 
17/09/02 14:57:09 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(dmitin); groups with view permissions: Set(); users  with modify permissions: Set(dmitin); groups with modify permissions: Set()
17/09/02 14:57:10 INFO Utils: Successfully started service 'sparkDriver' on port 38186.
17/09/02 14:57:10 INFO SparkEnv: Registering MapOutputTracker
17/09/02 14:57:10 INFO SparkEnv: Registering BlockManagerMaster
17/09/02 14:57:10 INFO BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
17/09/02 14:57:10 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
17/09/02 14:57:10 INFO DiskBlockManager: Created local directory at /tmp/blockmgr-d90a4735-6a2b-42b2-85ea-55b0ed9b1dfd
17/09/02 14:57:10 INFO MemoryStore: MemoryStore started with capacity 1950.3 MB
17/09/02 14:57:10 INFO SparkEnv: Registering OutputCommitCoordinator
17/09/02 14:57:10 INFO Utils: Successfully started service 'SparkUI' on port 4040.
17/09/02 14:57:11 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at http://192.168.1.104:4040
17/09/02 14:57:11 INFO Executor: Starting executor ID driver on host localhost
17/09/02 14:57:11 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 46432.
17/09/02 14:57:11 INFO NettyBlockTransferService: Server created on 192.168.1.104:46432
17/09/02 14:57:11 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
17/09/02 14:57:11 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, 192.168.1.104, 46432, None)
17/09/02 14:57:11 INFO BlockManagerMasterEndpoint: Registering block manager 192.168.1.104:46432 with 1950.3 MB RAM, BlockManagerId(driver, 192.168.1.104, 46432, None)
17/09/02 14:57:11 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, 192.168.1.104, 46432, None)
17/09/02 14:57:11 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, 192.168.1.104, 46432, None)
17/09/02 14:57:12 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 214.5 KB, free 1950.1 MB)
17/09/02 14:57:12 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 20.4 KB, free 1950.1 MB)
17/09/02 14:57:12 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on 192.168.1.104:46432 (size: 20.4 KB, free: 1950.3 MB)
17/09/02 14:57:12 INFO SparkContext: Created broadcast 0 from textFile at WordCount.scala:16
17/09/02 14:57:12 INFO FileInputFormat: Total input paths to process : 1
17/09/02 14:57:12 INFO SparkContext: Starting job: saveAsTextFile at WordCount.scala:20
17/09/02 14:57:12 INFO DAGScheduler: Registering RDD 3 (map at WordCount.scala:18)
17/09/02 14:57:12 INFO DAGScheduler: Got job 0 (saveAsTextFile at WordCount.scala:20) with 1 output partitions
17/09/02 14:57:12 INFO DAGScheduler: Final stage: ResultStage 1 (saveAsTextFile at WordCount.scala:20)
17/09/02 14:57:12 INFO DAGScheduler: Parents of final stage: List(ShuffleMapStage 0)
17/09/02 14:57:12 INFO DAGScheduler: Missing parents: List(ShuffleMapStage 0)
17/09/02 14:57:12 INFO DAGScheduler: Submitting ShuffleMapStage 0 (MapPartitionsRDD[3] at map at WordCount.scala:18), which has no missing parents
17/09/02 14:57:13 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 4.7 KB, free 1950.1 MB)
17/09/02 14:57:13 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 2.7 KB, free 1950.1 MB)
17/09/02 14:57:13 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on 192.168.1.104:46432 (size: 2.7 KB, free: 1950.3 MB)
17/09/02 14:57:13 INFO SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:1006
17/09/02 14:57:13 INFO DAGScheduler: Submitting 1 missing tasks from ShuffleMapStage 0 (MapPartitionsRDD[3] at map at WordCount.scala:18) (first 15 tasks are for partitions Vector(0))
17/09/02 14:57:13 INFO TaskSchedulerImpl: Adding task set 0.0 with 1 tasks
17/09/02 14:57:13 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, localhost, executor driver, partition 0, PROCESS_LOCAL, 4873 bytes)
17/09/02 14:57:13 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
17/09/02 14:57:13 INFO HadoopRDD: Input split: file:/home/dmitin/Projects/sparkdemo/src/main/resources/input.txt:0+11
17/09/02 14:57:13 INFO Executor: Finished task 0.0 in stage 0.0 (TID 0). 1154 bytes result sent to driver
17/09/02 14:57:13 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 289 ms on localhost (executor driver) (1/1)
17/09/02 14:57:13 INFO DAGScheduler: ShuffleMapStage 0 (map at WordCount.scala:18) finished in 0,321 s
17/09/02 14:57:13 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool 
17/09/02 14:57:13 INFO DAGScheduler: looking for newly runnable stages
17/09/02 14:57:13 INFO DAGScheduler: running: Set()
17/09/02 14:57:13 INFO DAGScheduler: waiting: Set(ResultStage 1)
17/09/02 14:57:13 INFO DAGScheduler: failed: Set()
17/09/02 14:57:13 INFO DAGScheduler: Submitting ResultStage 1 (MapPartitionsRDD[5] at saveAsTextFile at WordCount.scala:20), which has no missing parents
17/09/02 14:57:13 INFO MemoryStore: Block broadcast_2 stored as values in memory (estimated size 65.3 KB, free 1950.0 MB)
17/09/02 14:57:13 INFO MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 23.3 KB, free 1950.0 MB)
17/09/02 14:57:13 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on 192.168.1.104:46432 (size: 23.3 KB, free: 1950.3 MB)
17/09/02 14:57:13 INFO SparkContext: Created broadcast 2 from broadcast at DAGScheduler.scala:1006
17/09/02 14:57:13 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 1 (MapPartitionsRDD[5] at saveAsTextFile at WordCount.scala:20) (first 15 tasks are for partitions Vector(0))
17/09/02 14:57:13 INFO TaskSchedulerImpl: Adding task set 1.0 with 1 tasks
17/09/02 14:57:13 INFO TaskSetManager: Starting task 0.0 in stage 1.0 (TID 1, localhost, executor driver, partition 0, ANY, 4621 bytes)
17/09/02 14:57:13 INFO Executor: Running task 0.0 in stage 1.0 (TID 1)
17/09/02 14:57:13 INFO ShuffleBlockFetcherIterator: Getting 1 non-empty blocks out of 1 blocks
17/09/02 14:57:13 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 10 ms
17/09/02 14:57:13 INFO FileOutputCommitter: Saved output of task 'attempt_20170902145712_0001_m_000000_1' to file:/home/dmitin/Projects/sparkdemo/src/main/resources/outfile/_temporary/0/task_20170902145712_0001_m_000000
17/09/02 14:57:13 INFO SparkHadoopMapRedUtil: attempt_20170902145712_0001_m_000000_1: Committed
17/09/02 14:57:13 INFO Executor: Finished task 0.0 in stage 1.0 (TID 1). 1224 bytes result sent to driver
17/09/02 14:57:13 INFO TaskSetManager: Finished task 0.0 in stage 1.0 (TID 1) in 221 ms on localhost (executor driver) (1/1)
17/09/02 14:57:13 INFO TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks have all completed, from pool 
17/09/02 14:57:13 INFO DAGScheduler: ResultStage 1 (saveAsTextFile at WordCount.scala:20) finished in 0,223 s
17/09/02 14:57:13 INFO DAGScheduler: Job 0 finished: saveAsTextFile at WordCount.scala:20, took 1,222133 s
OK
17/09/02 14:57:13 INFO SparkContext: Invoking stop() from shutdown hook
17/09/02 14:57:13 INFO SparkUI: Stopped Spark web UI at http://192.168.1.104:4040
17/09/02 14:57:13 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
17/09/02 14:57:13 INFO MemoryStore: MemoryStore cleared
17/09/02 14:57:13 INFO BlockManager: BlockManager stopped
17/09/02 14:57:13 INFO BlockManagerMaster: BlockManagerMaster stopped
17/09/02 14:57:13 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
17/09/02 14:57:13 INFO SparkContext: Successfully stopped SparkContext
17/09/02 14:57:13 INFO ShutdownHookManager: Shutdown hook called
17/09/02 14:57:13 INFO ShutdownHookManager: Deleting directory /tmp/spark-663047b2-415a-45b5-bcad-20bd18270baa

Process finished with exit code 0
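
Since the question also asks about turning this into a jar: with the build.sbt above, running sbt package from the project folder should produce a jar under target/scala-2.11/ (the exact file name depends on the name and version set in build.sbt). That jar could then be run with spark-submit --class WordCount <path-to-jar> against an installed Spark distribution, although for a local experiment like this one, sbt run or the IDE run configuration is all you need.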
Dmytro Mitin
  • Thanks - this is good. I'm currently getting an error when I hit run though "Error:(1, 12) object apache is not a member of package org import org.apache.spark.{SparkConf, SparkContext}" which I am investigating. – Reddspark Sep 02 '17 at 15:33
  • @user1761806 Looks like an issue with resolving dependencies. Try `sbt clean` and then `sbt update`. Or try to re-import the project in IntelliJ. – Dmytro Mitin Sep 02 '17 at 15:38
  • Actually I worked it out: the sbt update should be done in your main project folder, not just from any location as I was doing. – Reddspark Sep 02 '17 at 15:54
  • @user1761806 You're right. Sorry I didn't mention this. It's hard to guess all possible difficulties with launching something. – Dmytro Mitin Sep 02 '17 at 18:08

You can always make WordCount extend App, and this should work. I believe it's about the way you have structured your project.

Read more about the App trait here:

http://www.scala-lang.org/api/2.12.1/scala/App.html
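
For reference, here is a minimal sketch of what the extends App variant could look like, reusing the build.sbt and file paths from the answer above. (Worth noting: Spark's documentation recommends defining a main method rather than extending scala.App, because App's delayed initialization can behave unexpectedly on a cluster, although it is usually fine for a local experiment.)

import org.apache.spark.{SparkConf, SparkContext}

object WordCount extends App {
  // with the App trait, the object body itself is the program entry point
  val conf = new SparkConf().setMaster("local").setAppName("Word Count")
  val sc = new SparkContext(conf)
  val input = sc.textFile("src/main/resources/input.txt")
  val counts = input.flatMap(line => line.split(" "))
    .map(word => (word, 1))
    .reduceByKey(_ + _)
  counts.saveAsTextFile("src/main/resources/outfile")
  println("OK")
}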

In any case, please make sure that your directory layout looks like this.

./build.sbt
./src
./src/main
./src/main/scala
./src/main/scala/WordCount.scala

Sam Upra
  • Yeah I saw the extends App method mentioned, but assumed the more Java-style way is used by most devs? (P.S. I'm new to Scala and haven't done Java since my uni days in a bygone era) – Reddspark Sep 02 '17 at 10:43
  • I believe that if you have the directory layout in the way I mentioned, you won't even need extends App :) – Sam Upra Sep 02 '17 at 13:37
  • Extending App has nothing to do with the directory layout. It is how you define the executable class. https://stackoverflow.com/questions/11667630/difference-between-using-app-trait-and-main-method-in-scala – OneCricketeer Sep 02 '17 at 13:54

Check the sample code I've written, shown below; it wires the word count into a Scalatra route:

package com.spark.app

import org.scalatra._
import org.apache.spark.{ SparkContext, SparkConf }

class MySparkAppServlet extends MySparkAppStack {

  // hitting GET /wc runs the word count job
  get("/wc") {
    val inputFile = "/home/limitless/Documents/projects/test/my-spark-app/README.md"
    val outputFile = "/home/limitless/Documents/projects/test/my-spark-app/README.txt"
    val conf = new SparkConf().setAppName("wordCount").setMaster("local[*]")
    // note: this creates a new SparkContext on every request; in a real app you
    // would create it once and reuse it, since only one SparkContext can be
    // active per JVM
    val sc = new SparkContext(conf)
    val input = sc.textFile(inputFile)
    val words = input.flatMap(line => line.split(" "))
    val counts = words.map(word => (word, 1)).reduceByKey { case (x, y) => x + y }
    counts.saveAsTextFile(outputFile)
  }

}
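
Once the Scalatra app is running, a GET request to /wc triggers the job and writes the counts to the configured output path. Note that this assumes an existing Scalatra project (MySparkAppStack presumably comes from the Scalatra project template), so the sbt-only answers above are closer to what the question asks for.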
KARTHIKEYAN.A
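
This shows the same word count written two ways: first against the plain SparkContext API, then with a SparkSession, the entry point introduced in Spark 2.0. Either half works on its own.
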
package com.application.spark

import org.apache.spark.sql.SparkSession
import org.apache.spark.{SparkConf, SparkContext}

object WordCountDemo {
  def main(args: Array[String]): Unit = {
    val inputFile = "/Users/arupprasad/Desktop/MyFirstApplicationWithSpark1/src/main/resources/WordCountFile.txt"
    val outputFile = "/Users/arupprasad/Desktop/MyFirstApplicationWithSpark1/src/main/resources/WordCountOutput"

    // Approach 1: the classic SparkContext API
    val conf = new SparkConf().setAppName("wordCount").setMaster("local[*]")
    val sc = new SparkContext(conf)
    val input = sc.textFile(inputFile)
    val words = input.flatMap(line => line.split(" "))
    val counts = words.map(word => (word, 1)).reduceByKey { case (x, y) => x + y }
    counts.saveAsTextFile(outputFile)
    sc.stop()

    // Approach 2: the SparkSession entry point (Spark 2.x)
    val sparkSession = SparkSession.builder.master("local").appName("Scala Spark Example").getOrCreate()
    val sparkContext = sparkSession.sparkContext
    val input2 = sparkContext.textFile(inputFile)
    val words2 = input2.flatMap(line => line.split(" "))
    val counts2 = words2.map(word => (word, 1)).reduceByKey { case (x, y) => x + y }
    // saveAsTextFile fails if the target directory already exists, so write to a separate path
    counts2.saveAsTextFile(outputFile + "2")
    sparkSession.stop()
  }
}
Tomerikoo
    While this code may answer the question, providing additional context regarding why and/or how this code answers the question improves its long-term value. – C. Peck May 12 '21 at 06:45