14

I am trying to read data from a URL using Spark on the Databricks Community Edition platform. I tried spark.read.csv and SparkFiles, but I am still missing some simple point.

url = "https://raw.githubusercontent.com/thomaspernet/data_csv_r/master/data/adult.csv"
from pyspark import SparkFiles
spark.sparkContext.addFile(url)
# sc.addFile(url)
# sqlContext = SQLContext(sc)
# df = sqlContext.read.csv(SparkFiles.get("adult.csv"), header=True, inferSchema= True) 

df = spark.read.csv(SparkFiles.get("adult.csv"), header=True, inferSchema=True)

I got a path-related error:

Path does not exist: dbfs:/local_disk0/spark-9f23ed57-133e-41d5-91b2-12555d641961/userFiles-d252b3ba-499c-42c9-be48-96358357fb75/adult.csv;'

I also tried another way:

val content = scala.io.Source.fromURL("https://raw.githubusercontent.com/thomaspernet/data_csv_r/master/data/adult.csv").mkString

// val list = content.split("\n").filter(_ != "")
val rdd = sc.parallelize(content)
val df = rdd.toDF

SyntaxError: invalid syntax
  File "<command-332010883169993>", line 16
    val content = scala.io.Source.fromURL("https://raw.githubusercontent.com/thomaspernet/data_csv_r/master/data/adult.csv").mkString
              ^
SyntaxError: invalid syntax

Should the data be loaded directly into a Databricks folder, or should I be able to load it directly from the URL using spark.read? Any suggestions?

arya

2 Answers

20

Try this.

url = "https://raw.githubusercontent.com/thomaspernet/data_csv_r/master/data/adult.csv"
from pyspark import SparkFiles
spark.sparkContext.addFile(url)

df = spark.read.csv("file://" + SparkFiles.get("adult.csv"), header=True, inferSchema=True)

Fetching just a few columns from your CSV URL:

df.select("age","workclass","fnlwgt","education").show(10)
+---+----------------+------+---------+
|age|       workclass|fnlwgt|education|
+---+----------------+------+---------+
| 39|       State-gov| 77516|Bachelors|
| 50|Self-emp-not-inc| 83311|Bachelors|
| 38|         Private|215646|  HS-grad|
| 53|         Private|234721|     11th|
| 28|         Private|338409|Bachelors|
| 37|         Private|284582|  Masters|
| 49|         Private|160187|      9th|
| 52|Self-emp-not-inc|209642|  HS-grad|
| 31|         Private| 45781|  Masters|
| 42|         Private|159449|Bachelors|
+---+----------------+------+---------+

SparkFiles.get returns the absolute path of a file that is local to your driver (or workers). Without the file:// scheme, Spark resolved that path against dbfs:/, which is why it could not find the file.
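To see why the file:// prefix matters, you can print the path SparkFiles hands back (a minimal sketch; the spark-.../userFiles-... temp directory will differ on every cluster):

from pyspark import SparkFiles

url = "https://raw.githubusercontent.com/thomaspernet/data_csv_r/master/data/adult.csv"
spark.sparkContext.addFile(url)

local_path = SparkFiles.get("adult.csv")
print(local_path)  # a bare local path, e.g. /local_disk0/spark-.../userFiles-.../adult.csv

# Without a scheme, Databricks resolves this path against dbfs:/ (hence the
# "Path does not exist: dbfs:/..." error); file:// forces the driver's local filesystem.
df = spark.read.csv("file://" + local_path, header=True, inferSchema=True)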

vikrant rana
3

The answer above works, but it can be error prone: sometimes SparkFiles.get will return null.

Option 1 below is the more reliable way of getting a file from any URL or public S3 location.


Option 1: Use the IOUtils library

IOUtils.toString will do the trick (see the Apache Commons IO docs). The commons-io jar is already present in any Spark cluster, whether it is Databricks or any other Spark installation.

Below is the Scala way of doing this. I have taken a raw GitHub CSV file for this example; change it based on your requirements.

import org.apache.commons.io.IOUtils // already on the classpath of any Spark cluster
import java.net.URL
import spark.implicits._ // needed for .toDS() (spark is the notebook's SparkSession)

val urlfile = new URL("https://raw.githubusercontent.com/lrjoshi/webpage/master/public/post/c159s.csv")

// download the file as a single string, split it into lines, and build a Dataset[String]
val testcsvgit = IOUtils.toString(urlfile, "UTF-8").lines.toList.toDS()

// the CSV reader can parse a Dataset[String] of rows directly
val testcsv = spark.read
  .option("header", true)
  .option("inferSchema", true)
  .csv(testcsvgit)

testcsv.show()

Result:

+-----------+------+----+----+---+-----+
|Experiment |Virus |Cell| MOI|hpi|Titer|
+-----------+------+----+----+---+-----+
|      EXP I| C159S|OFTu| 0.1|  0| 4.75|
|      EXP I| C159S|OFTu| 0.1|  6| 2.75|
|      EXP I| C159S|OFTu| 0.1| 12| 2.75|
|      EXP I| C159S|OFTu| 0.1| 24|  5.0|
|      EXP I| C159S|OFTu| 0.1| 48|  5.5|
|      EXP I| C159S|OFTu| 0.1| 72|  7.0|
|      EXP I| C159S| STU| 0.1|  0| 4.75|
|      EXP I| C159S| STU| 0.1|  6| 3.75|
|      EXP I| C159S| STU| 0.1| 12|  4.0|
|      EXP I| C159S| STU| 0.1| 24| 3.75|
|      EXP I| C159S| STU| 0.1| 48| 3.25|
|      EXP I| C159S| STU| 0.1| 72| 3.25|
|      EXP I| C159S|OFTu|10.0|  0|  6.5|
|      EXP I| C159S|OFTu|10.0|  6| 4.75|
|      EXP I| C159S|OFTu|10.0| 12| 4.75|
|      EXP I| C159S|OFTu|10.0| 24| 6.25|
|      EXP I| C159S|OFTu|10.0| 48|  6.5|
|      EXP I| C159S|OFTu|10.0| 72|  7.0|
|      EXP I| C159S| STU|10.0|  0|  7.0|
|      EXP I| C159S| STU|10.0|  6| 4.75|
+-----------+------+----+----+---+-----+
only showing top 20 rows
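
If you are working in a Python notebook, a rough PySpark equivalent of Option 1 (my sketch, not part of the original answer) downloads the CSV on the driver with urllib and lets spark.read.csv parse an RDD of lines:

import urllib.request

url = "https://raw.githubusercontent.com/lrjoshi/webpage/master/public/post/c159s.csv"

# download the whole file on the driver and split it into CSV rows
content = urllib.request.urlopen(url).read().decode("utf-8")
rdd = spark.sparkContext.parallelize(content.splitlines())

# spark.read.csv also accepts an RDD of strings storing CSV rows
df = (spark.read
      .option("header", True)
      .option("inferSchema", True)
      .csv(rdd))
df.show()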

Option 2: Use SparkFiles (Scala)

import org.apache.spark.SparkFiles

val urlfile = "https://raw.githubusercontent.com/lrjoshi/webpage/master/public/post/c159s.csv"
spark.sparkContext.addFile(urlfile)

val df = spark.read
  .option("inferSchema", true)
  .option("header", true)
  .csv("file://" + SparkFiles.get("c159s.csv"))

df.show()

Result: identical to the Option 1 output above.

If you are using Windows, use file:/// (three slashes) instead of file:// in the .csv path.

Option 3: Use the HttpClient library

import org.apache.http.client.methods.HttpGet
import org.apache.http.impl.client.HttpClients
import org.apache.http.util.EntityUtils
import org.apache.spark.sql.{DataFrame, SparkSession}

object ReadFromUrl {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("ReadFromUrl")
      .master("local[*]")
      .getOrCreate()

    // fetch the CSV over HTTP
    val url = "https://people.sc.fsu.edu/~jburkardt/data/csv/airtravel.csv"
    val httpClient = HttpClients.createDefault()
    val httpGet = new HttpGet(url)
    val response = httpClient.execute(httpGet)
    val data = EntityUtils.toString(response.getEntity)
    response.close()
    httpClient.close()

    // spill the response body to a local temp file so spark.read can load it
    val tempFile = java.io.File.createTempFile("airtravel", ".csv")
    tempFile.deleteOnExit()
    val writer = new java.io.PrintWriter(tempFile)
    writer.write(data)
    writer.close()

    val df: DataFrame = spark.read
      .format("csv")
      .option("header", "true")
      .option("inferSchema", "true")
      .load(tempFile.getAbsolutePath)

    df.show()

    spark.stop()
  }
}

Result:

+-----+-------+-------+-------+
|Month| "1958"| "1959"| "1960"|
+-----+-------+-------+-------+
|  JAN|  340.0|  360.0|  417.0|
|  FEB|  318.0|  342.0|  391.0|
|  MAR|  362.0|  406.0|  419.0|
|  APR|  348.0|  396.0|  461.0|
|  MAY|  363.0|  420.0|  472.0|
|  JUN|  435.0|  472.0|  535.0|
|  JUL|  491.0|  548.0|  622.0|
|  AUG|  505.0|  559.0|  606.0|
|  SEP|  404.0|  463.0|  508.0|
|  OCT|  359.0|  407.0|  461.0|
|  NOV|  310.0|  362.0|  390.0|
|  DEC|  337.0|  405.0|  432.0|
+-----+-------+-------+-------+
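
For completeness, a rough PySpark counterpart of Option 3 (a sketch using only the Python standard library; urllib.request and tempfile are my stand-ins for HttpClient, not part of the original answer):

import tempfile
import urllib.request

url = "https://people.sc.fsu.edu/~jburkardt/data/csv/airtravel.csv"

# download the response body and spill it to a temp file on the driver
data = urllib.request.urlopen(url).read().decode("utf-8")
with tempfile.NamedTemporaryFile(mode="w", suffix=".csv", delete=False) as f:
    f.write(data)
    tmp_path = f.name

# file:// keeps Spark on the driver's local filesystem (see Option 2)
df = (spark.read
      .format("csv")
      .option("header", True)
      .option("inferSchema", True)
      .load("file://" + tmp_path))
df.show()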
Ram Ghadiyaram