I am running Spark 2.2.1 locally on my Mac and trying to replicate the Naive Bayes classification example from the documentation: https://spark.apache.org/docs/2.2.1/ml-classification-regression.html#naive-bayes

However, I am unable to load the sample data:

import org.apache.spark.ml.classification.NaiveBayes
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator

// Load the data stored in LIBSVM format as a DataFrame.
val data = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")

The code above throws this error:

org.apache.spark.sql.AnalysisException: Path does not exist: file:/Users/my_user_name/data/mllib/sample_libsvm_data.txt;
  at org.apache.spark.sql.execution.datasources.DataSource$.org$apache$spark$sql$execution$datasources$DataSource$$checkAndGlobPathIfNecessary(DataSource.scala:626)
  at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:350)
  at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:350)
  at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
  at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
  at scala.collection.immutable.List.foreach(List.scala:381)
  at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
  at scala.collection.immutable.List.flatMap(List.scala:344)
  at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:349)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:156)
  ... 50 elided

How do I load this data so that I can continue with the analysis?

Regressor
  • You can refer to this: https://stackoverflow.com/questions/27299923/how-to-load-local-file-in-sc-textfile-instead-of-hdfs. Just prefix the absolute file path with "file:///" to read it from the local filesystem. – Sc0rpion Dec 26 '18 at 17:50
  • I don't actually know where the file is. The link in my question does not show the location of the file. – Regressor Dec 26 '18 at 18:30
  • https://github.com/apache/spark/blob/master/data/mllib/sample_libsvm_data.txt – Sc0rpion Dec 26 '18 at 18:44

1 Answer

You can load it into an RDD first...

val textFile = sc.textFile("data/mllib/sample_libsvm_data.txt")
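
Note that a relative path like the one above is resolved against the directory the shell was started from, which is why the error in the question points at file:/Users/my_user_name/data/... One way around it is to build an absolute file URI. The sketch below assumes the sample file sits under data/mllib inside a Spark installation directory (as it does in the Spark source tree); sampleDataUri is a hypothetical helper, and reading the directory from a SPARK_HOME environment variable is an assumption about your setup.

```scala
import java.nio.file.Paths

// Hypothetical helper (not part of Spark): turn a Spark installation
// directory plus a relative path into an absolute file: URI string.
def sampleDataUri(
    sparkHome: String,
    relative: String = "data/mllib/sample_libsvm_data.txt"): String =
  Paths.get(sparkHome).resolve(relative).toAbsolutePath.normalize.toUri.toString

// Assumed usage in spark-shell, if SPARK_HOME is set in your environment:
// val data = spark.read.format("libsvm").load(sampleDataUri(sys.env("SPARK_HOME")))
```

If the file lives somewhere else (for example, downloaded from the GitHub link in the comments), pass that directory and file name instead.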

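Then convert the RDD to a DataFrame. Note that toDF takes column names, not a schema, and an RDD of strings becomes a single string column, so each line still has to be parsed into a label and a feature vector afterwards (in spark-shell, spark.implicits._ is already imported)...

val df = textFile.toDF("value")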
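
Since each LIBSVM line has the form "label index:value index:value ...", the parsing step could look roughly like this. It is a minimal sketch in plain Scala: parseLibsvmLine is a hypothetical helper, not a Spark API, and it assumes well-formed, non-empty lines.

```scala
// Hypothetical helper: parse one LIBSVM line into
// (label, zero-based feature indices, feature values).
def parseLibsvmLine(line: String): (Double, Array[Int], Array[Double]) = {
  val tokens = line.trim.split("\\s+")
  val label = tokens.head.toDouble
  val (indices, values) = tokens.tail.map { token =>
    val Array(index, value) = token.split(":")
    // LIBSVM feature indices are 1-based; shift them to 0-based.
    (index.toInt - 1, value.toDouble)
  }.unzip
  (label, indices, values)
}
```

Applied with textFile.map(parseLibsvmLine), the resulting tuples could then be turned into org.apache.spark.ml.linalg.Vectors.sparse values (which also need the total feature count) before calling toDF("label", "features"). In practice, spark.read.format("libsvm") does all of this for you once the path is correct.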

uh_big_mike_boi