I need to pass a very big input file to Scastie. That is, how can Scastie, which is an online code editor, read a file that is on my local machine? For example:
val lines = sc.textFile("....mdb/u.data")
Someone asked this on the team's Gitter channel.
The Scastie team member first asked how big the file was, then recommended putting it in a Gist on GitHub and using the raw URL to read it in.
This works only for small files. The file size limits for Gists are explained in their Developer Guide:
"If you need the full contents of the file, you can make a GET request to the URL specified by raw_url. Be aware that for files larger than ten megabytes, you'll need to clone the gist via the URL provided by git_pull_url."
So 10 MB is your limit. Also note that you can't use a SparkContext (denoted by sc in your question) without identifying the library to the online environment.
To do that, you'll have to add the SBT dependencies:
Go to Build Settings on the left part of the interface.
Set Scala Version to a version compatible with the Spark version we'll choose, in our case 2.11.12.
In Extra Sbt Configuration, place the following dependencies:
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "2.4.3",
  "org.apache.spark" %% "spark-sql" % "2.4.3"
)
You won't be able to read URL content directly using sc.textFile; that is only for reading local/HDFS text files. You'll have to get the content first, wrangle it into shape, and get a DataFrame out of it.
The answer shown here describes how to access a web URL using Source from the Scala Standard Library.
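As a rough illustration of the "get the content first, wrangle it into shape" step, here is a minimal sketch using scala.io.Source. It is not the linked answer's code. The Rating case class, the GistReader object, and the assumption that each row holds four tab-separated fields (user id, item id, rating, timestamp) are all hypothetical; adjust them to whatever your file actually contains.

```scala
import scala.io.Source
import scala.util.Try

// Hypothetical record type: assumes tab-separated rows of
// userId, itemId, rating, timestamp. Change it to match your real file.
final case class Rating(userId: Int, itemId: Int, rating: Double, timestamp: Long)

object GistReader {
  // Download the whole response body as a list of lines.
  // The underlying source is closed even if reading fails.
  def fetchLines(url: String): List[String] = {
    val src = Source.fromURL(url, "UTF-8")
    try src.getLines().toList
    finally src.close()
  }

  // Parse a single line; malformed lines yield None instead of throwing.
  def parseLine(line: String): Option[Rating] =
    line.trim.split("\t") match {
      case Array(u, i, r, t) =>
        Try(Rating(u.toInt, i.toInt, r.toDouble, t.toLong)).toOption
      case _ => None
    }

  // Keep only the lines that parse cleanly.
  def parseAll(lines: Seq[String]): Seq[Rating] =
    lines.flatMap(l => parseLine(l).toList)
}
```

You would call GistReader.parseAll(GistReader.fetchLines(rawUrl)), where rawUrl is the raw URL of your own Gist. The resulting Seq can then be handed to Spark with sc.parallelize, and, after import spark.implicits._ on a SparkSession named spark, converted with toDF() to get a DataFrame.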
At the request of the OP, here's an implementation on Scastie.