
I need to pass a very large input file to Scastie. How can Scastie, which is an online code editor, read a file that lives on my local machine? For example:

val lines = sc.textFile("....mdb/u.data")
Mario Galic
Carolyn Cordeiro
  • Maybe you should set up an appropriate development environment in your machine rather than using a site which was designed just for sharing small snippets. – Luis Miguel Mejía Suárez Sep 01 '20 at 12:14
  • What you said makes sense, but my machine is already occupied, and I don't want to spend money on cloud services. Both GCP and AWS charge heavily to use Scala and Spark. For me, learning the Spark concepts matters more than overspending. – Carolyn Cordeiro Sep 01 '20 at 22:58

1 Answer


Someone asked this on the Scastie team's Gitter channel.

The Scastie team member first asked how big the file was, then recommended putting it in a Gist on GitHub and using the raw URL to read it in.

This works only for small files. The size limits for Gist files are explained in GitHub's Developer Guide:

If you need the full contents of the file, you can make a GET request to the URL specified by raw_url. Be aware that for files larger than ten megabytes, you'll need to clone the gist via the URL provided by git_pull_url.

So 10 MB is your limit. Also note that you can't use a SparkContext (denoted by sc in your question) without first making the library available to the online environment.

To do that, you'll have to add the SBT dependency.

  • Navigate to Build Settings on the left part of the interface.
  • Set the Scala Version to one compatible with the Spark release we'll use; in our case, 2.11.12.
  • Under Extra Sbt Configuration place the following dependencies:
    libraryDependencies ++= Seq(
       "org.apache.spark" %% "spark-core" % "2.4.3",
       "org.apache.spark" %% "spark-sql" % "2.4.3"
    )

You won't be able to read URL content directly with sc.textFile, which is only for reading local/HDFS text files. You'll have to fetch the content first, wrangle it into shape, and then build a DataFrame from it.
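As a rough illustration of the "wrangle it into shape and get a DataFrame" step, here is a minimal sketch that assumes the lines are already in memory as a Seq[String]. The Rating case class, the four tab-separated columns, and the sample values are all made up for illustration; the real layout of u.data may differ. It needs the Spark dependencies from the Build Settings above.

```scala
import org.apache.spark.sql.SparkSession
import scala.util.Try

object LinesToDataFrame {
  // Hypothetical record shape: the actual columns of u.data are an assumption.
  case class Rating(userId: Int, itemId: Int, rating: Int, timestamp: Long)

  // Parse one tab-separated line; malformed lines yield None instead of throwing.
  def parse(line: String): Option[Rating] = line.split("\t") match {
    case Array(u, i, r, t) =>
      Try(Rating(u.trim.toInt, i.trim.toInt, r.trim.toInt, t.trim.toLong)).toOption
    case _ => None
  }

  def main(args: Array[String]): Unit = {
    // Local mode: everything runs inside the single JVM, no cluster needed.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("scastie-sketch")
      .getOrCreate()
    import spark.implicits._

    // Made-up sample lines standing in for the fetched file content.
    val lines = Seq("1\t10\t5\t1000", "2\t20\t3\t2000", "not a record")

    // Drop lines that fail to parse, then convert the rest to a DataFrame.
    val df = lines.flatMap(parse).toDF()
    df.show()

    spark.stop()
  }
}
```

If you prefer an RDD, as in the question, spark.sparkContext.parallelize(lines) gives you one from the same in-memory lines.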

The answer shown here describes how to access a web URL using Source from the Scala Standard Library.

At the request of the OP, here's an implementation on Scastie.
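For reference, a minimal sketch of reading lines through scala.io.Source.fromURL might look like the following. This is not the code from the linked answer; the readLines helper is a hypothetical name, and a temporary local file exposed as a file: URL stands in for the Gist raw URL so that the example runs without network access.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.Files
import scala.io.Source

object ReadLinesFromUrl {
  // Read all lines from a URL, closing the underlying stream even on failure.
  def readLines(url: String): List[String] = {
    val source = Source.fromURL(url, "UTF-8")
    try source.getLines().toList
    finally source.close()
  }

  def main(args: Array[String]): Unit = {
    // Stand-in for a Gist raw URL (assumption): a temp file served via a file: URL.
    val tmp = Files.createTempFile("sample", ".data")
    Files.write(tmp, "a\tb\nc\td\n".getBytes(StandardCharsets.UTF_8))

    val lines = readLines(tmp.toUri.toURL.toString)
    lines.foreach(println)

    Files.delete(tmp)
  }
}
```

In Scastie you would pass the raw URL of your Gist to readLines instead of the temporary file's URL. Because the whole file is pulled into memory on one JVM, this only suits small inputs.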

kfkhalili