Questions tagged [apache-spark-2.0]
464 questions
Use for questions specific to Apache Spark 2.0. For general questions related to Apache Spark, use the tag [apache-spark].
8 votes, 1 answer
Scala case class ignoring import in the Spark shell
I hope there is an obvious answer to this question!
I've just upgraded to Spark v2.0 and have an odd problem with the spark-shell (Scala 2.11 build).
If I enter the following minimal Scala,
import java.sql.Timestamp
case class Crime(caseNumber:…

Max van Daalen
- 317
- 2
- 14
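For the case class question above, a hedged sketch of the shape of the problem and one commonly suggested workaround: refer to the type by its fully qualified name instead of relying on the shell-level import (pasting the import and the class together with :paste is another). The field names and values are illustrative.

// Spark 2.0 shell: the top-level import may not be visible to a top-level case class,
// so the Timestamp type is qualified explicitly here.
case class Crime(caseNumber: String, date: java.sql.Timestamp)

// Quick check that the class is usable from the shell:
val c = Crime("HX123", java.sql.Timestamp.valueOf("2016-01-01 00:00:00"))
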
7 votes, 2 answers
How can I join a Spark live stream with all the data collected by another stream over its entire life cycle?
I have two Spark streams. The first carries data related to products: their price to the supplier, the currency, their description, and the supplier id. These records are enriched with a category, inferred from analysis of the description and the price…

Claudio D'Alicandro
- 73
- 5
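One common pattern for the question above, sketched under heavy assumptions (DStreams keyed by a supplier id, hypothetical socket sources, illustrative class and field names): accumulate the second stream with updateStateByKey so that every product micro-batch can be joined against everything collected so far.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

case class SupplierInfo(name: String)

val conf = new SparkConf().setAppName("stream-join-sketch").setMaster("local[2]")
val ssc  = new StreamingContext(conf, Seconds(10))
ssc.checkpoint("/tmp/checkpoint")   // required by updateStateByKey

// Hypothetical sources; replace with the real Kafka/receiver streams.
val products = ssc.socketTextStream("localhost", 9001)
  .map(line => (line.split(",")(0), line))            // (supplierId, productRecord)
val suppliers = ssc.socketTextStream("localhost", 9002)
  .map(line => (line.split(",")(0), SupplierInfo(line.split(",")(1))))

// Keep the latest supplier record ever seen for each key.
val supplierState = suppliers.updateStateByKey[SupplierInfo] {
  (newValues: Seq[SupplierInfo], current: Option[SupplierInfo]) =>
    newValues.lastOption.orElse(current)
}

// Join each product micro-batch against the full accumulated supplier state.
products.join(supplierState).print()

ssc.start()
ssc.awaitTermination()
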
7 votes, 2 answers
How does Spark 2.0 handle column nullability?
In the recently released The Data Engineer's Guide to Apache Spark, the authors stated (page 74):
"...when you define a schema where all columns are declared to not
have null values - Spark will not enforce that and will happily let
null values…

Wes
- 648
- 7
- 14
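A minimal sketch for experimenting with the quoted claim: declare the columns as non-nullable, hand Spark a row that violates the declaration, and inspect what comes back. All names and values are illustrative, and the comments only describe what the book's quote leads one to expect.

import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

val spark = SparkSession.builder().appName("nullability-demo").master("local[*]").getOrCreate()

val schema = StructType(Seq(
  StructField("id",   IntegerType, nullable = false),
  StructField("name", StringType,  nullable = false)
))

val rows = spark.sparkContext.parallelize(Seq(Row(1, "a"), Row(2, null)))
val df   = spark.createDataFrame(rows, schema)

df.printSchema()   // still reports nullable = false for both columns
df.show()          // per the quote, the null is accepted rather than rejected
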
7 votes, 1 answer
jsontostructs to Row in Spark Structured Streaming
I'm using Spark 2.2 and I'm trying to read the JSON messages from Kafka, transform them into a DataFrame, and have them as a Row:
spark
  .readStream()
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  …

Martin Brisiak
- 3,872
- 12
- 37
- 51
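For the question above, a hedged Scala sketch (the excerpt is Java, but the same API exists in both) of the usual from_json route in Spark 2.2: parse the Kafka value with an explicit schema and flatten the resulting struct into columns. The topic name and JSON fields are assumptions.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types.{DoubleType, StringType, StructType}

val spark = SparkSession.builder().appName("kafka-json-sketch").getOrCreate()

// Hypothetical message schema.
val schema = new StructType()
  .add("id", StringType)
  .add("amount", DoubleType)

val parsed = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "events")                        // hypothetical topic
  .load()
  .select(from_json(col("value").cast("string"), schema).as("data"))
  .select("data.*")                                      // one column per JSON field

parsed.writeStream.format("console").start().awaitTermination()
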
7 votes, 0 answers
Spark Streaming - Stopped worker throws FileNotFoundException
I am running a Spark Streaming application on a cluster composed of three nodes, each one with a worker and three executors (so a total of 9 executors). I am using Spark standalone mode (version 2.1.1).
The application is run with a spark-submit…

Davide Mandrini
- 156
- 7
7 votes, 1 answer
Kryo Serialization for Spark 2.x Dataset
Is Kryo serialization still required when working with the Dataset API?
Because Datasets use Encoders for serialization and deserialization:
Does Kryo serialization even work for Datasets? (Provided the right config is passed to Spark, and…

y.mazari
- 374
- 2
- 8
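A small sketch contrasting the two serialization paths the question asks about: the default product encoder that Datasets of case classes get for free, and an explicit Kryo encoder requested via Encoders.kryo. The class and fields are illustrative.

import org.apache.spark.sql.{Encoders, SparkSession}

case class Event(id: Long, name: String)

val spark = SparkSession.builder().appName("encoder-demo").master("local[*]").getOrCreate()
import spark.implicits._

// Default: the derived ExpressionEncoder stores each field as its own column,
// so Kryo is not involved in the Dataset's own serialization.
val ds1 = Seq(Event(1L, "a"), Event(2L, "b")).toDS()
ds1.printSchema()   // id: bigint, name: string

// Opt-in: an explicit Kryo encoder also works, but the whole object becomes
// a single binary column.
val ds2 = spark.createDataset(Seq(Event(1L, "a"), Event(2L, "b")))(Encoders.kryo[Event])
ds2.printSchema()   // value: binary
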
7 votes, 3 answers
How to use a Dataset to groupBy
I have a requirement to use an RDD to do so:
val test = Seq(("New York", "Jack"),
  ("Los Angeles", "Tom"),
  ("Chicago", "David"),
  ("Houston", "John"),
  ("Detroit", "Michael"),
  ("Chicago", "Andrew"),
  ("Detroit", "Peter"),
  …

monkeysjourney
- 83
- 1
- 1
- 5
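Two hedged ways to get one list of names per city with the DataFrame/Dataset API, using the sample data from the question above; the column names are made up.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.collect_list

val spark = SparkSession.builder().appName("groupby-demo").master("local[*]").getOrCreate()
import spark.implicits._

val test = Seq(
  ("New York", "Jack"), ("Los Angeles", "Tom"), ("Chicago", "David"),
  ("Houston", "John"), ("Detroit", "Michael"), ("Chicago", "Andrew"), ("Detroit", "Peter")
).toDF("city", "name")

// Untyped aggregation: one row per city with the collected names.
test.groupBy("city").agg(collect_list("name").as("names")).show(false)

// Typed alternative: groupByKey + mapGroups on a Dataset.
test.as[(String, String)]
  .groupByKey(_._1)
  .mapGroups((city, rows) => (city, rows.map(_._2).toList))
  .show(false)
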
7 votes, 1 answer
Read parquet into spark dataset ignoring missing fields
Let's assume I create a parquet file as follows:
case class A(i: Int, j: Double, s: String)
val l1 = List(A(1, 2.0, "s1"), A(2, 3.0, "S2"))
val ds = spark.createDataset(l1)
ds.write.parquet("/tmp/test.parquet")
Is it possible to read it into a Dataset of…

indraneel
- 405
- 4
- 10
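A hedged sketch of the two directions the question above might go, reusing the file written in the excerpt; class B (fewer fields) and class C (an extra field) are illustrative.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.lit

case class B(i: Int, j: Double)                                   // subset of A
case class C(i: Int, j: Double, s: String, extra: Option[Long])   // superset of A

val spark = SparkSession.builder().appName("parquet-demo").master("local[*]").getOrCreate()
import spark.implicits._

val df = spark.read.parquet("/tmp/test.parquet")

// Reading into a smaller case class works directly; the extra column is ignored.
val dsB = df.as[B]

// A field absent from the file has to be added explicitly before .as[...].
val dsC = df.withColumn("extra", lit(null).cast("long")).as[C]
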
7 votes, 3 answers
Why does SparkSQL require two literal escape backslashes in the SQL query?
When I run the Scala code below from the Spark 2.0 REPL (spark-shell), it runs as I intended, splitting the string with a simple regular expression.
import org.apache.spark.sql.SparkSession
// Create session
val sparkSession =…

Bjørn Nielsen
- 73
- 1
- 5
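A hedged sketch of the escaping difference the question above is about, assuming the default SQL string-literal parsing in Spark 2.x: the SQL parser consumes one level of backslash escapes on top of the level Scala consumes, so the regex backslashes have to be doubled again. Data and column names are illustrative.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.split

val spark = SparkSession.builder().appName("escape-demo").master("local[*]").getOrCreate()
import spark.implicits._

val df = Seq("a|b|c").toDF("line")

// DataFrame API: the Scala literal "\\|" reaches the regex engine as \|
df.select(split($"line", "\\|")).show(false)

// SQL string: the SQL parser also unescapes backslashes, so the Scala source
// needs "\\\\|" for the same regex \| to survive both levels.
df.selectExpr("split(line, '\\\\|')").show(false)
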
7 votes, 1 answer
Dynamic Allocation for Spark Streaming
I have a Spark Streaming job running on our cluster alongside other jobs (Spark Core jobs). I want to use Dynamic Resource Allocation for these jobs, including Spark Streaming. According to the JIRA issue below, Dynamic Allocation is not supported for Spark…

Akhila Lankala
- 193
- 1
- 11
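For reference, a hedged sketch of the standard dynamic-allocation settings (these are the general flags; whether they are honoured for a long-running streaming job is exactly what the linked JIRA discusses). The values are illustrative, not recommendations.

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("dyn-alloc-sketch")
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.shuffle.service.enabled", "true")       // external shuffle service is required
  .set("spark.dynamicAllocation.minExecutors", "2")
  .set("spark.dynamicAllocation.maxExecutors", "10")
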
7 votes, 0 answers
Apache Spark | java.lang.AssertionError: assertion failed
I am using Apache Spark 2.0.2 and facing the following issue while using a cartesian product in the Spark Streaming module.
I am using Snappy as the compression codec but face the same issue with the default one (LZ4); I am also using Kryo for…

Sameer
- 91
- 1
- 7
7 votes, 1 answer
Spark 2.0 memory fraction
I am working with Spark 2.0; the job starts by sorting the input data and storing its output on HDFS.
I was getting out-of-memory errors; the solution was to increase the value of "spark.shuffle.memoryFraction" from 0.2 to 0.8, and this solved the…

syl
- 419
- 2
- 5
- 17
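A short hedged sketch of the Spark 2.x angle on the question above: spark.shuffle.memoryFraction belongs to the legacy memory manager, while the unified manager introduced in Spark 1.6 is tuned through spark.memory.fraction and spark.memory.storageFraction. The values below are illustrative, not recommendations.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("memory-config-sketch")
  .config("spark.memory.fraction", "0.6")            // unified execution + storage pool
  .config("spark.memory.storageFraction", "0.5")     // share of that pool reserved for storage
  .getOrCreate()
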
7 votes, 3 answers
GroupByKey with datasets in Spark 2.0 using Java
I have a dataset containing data like the following:
|c1| c2|
---------
| 1 | a |
| 1 | b |
| 1 | c |
| 2 | a |
| 2 | b |
...
Now, I want to get the data grouped like the following (col1: String Key, col2: List):
| c1| c2 |
-----------
| 1…

Andreas
- 130
- 2
- 7
7 votes, 3 answers
How to build Spark from the sources on the Download Spark page?
I tried to install and build Spark 2.0.0 on an Ubuntu 16.04 VM as follows:
Install Java
sudo apt-add-repository ppa:webupd8team/java
sudo apt-get update
sudo apt-get install oracle-java8-installer
Install Scala
Go to their…

Michael Westen
- 169
- 2
- 10
6 votes, 1 answer
Merging schemas when reading parquet files fails because of incompatible data types int and bigint
When trying to load parquet files with schema merging:
df = spark.read.option("mergeSchema", "true").parquet('some_path/partition_date')
df.show()
I'm getting the following exception:
Py4JJavaError: An error occurred while calling…

saloua
- 2,433
- 4
- 27
- 37
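A hedged Scala sketch (the excerpt is PySpark) that reproduces the same kind of conflict and shows one workaround: the two partitions disagree on the column type (Int vs Long), mergeSchema refuses to reconcile them, and aligning the types manually before the union avoids the error. Paths and column names are illustrative.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("merge-demo").master("local[*]").getOrCreate()
import spark.implicits._

// Two partitions whose "amount" column disagrees: Int in one, Long in the other.
Seq((1, 10)).toDF("id", "amount")
  .write.parquet("/tmp/merge_demo/partition_date=2020-01-01")
Seq((2, 20L)).toDF("id", "amount")
  .write.parquet("/tmp/merge_demo/partition_date=2020-01-02")

// This read fails with the int/bigint incompatibility described above:
// spark.read.option("mergeSchema", "true").parquet("/tmp/merge_demo").show()

// One workaround: load each partition, align the types, then union.
val a = spark.read.parquet("/tmp/merge_demo/partition_date=2020-01-01")
  .withColumn("amount", $"amount".cast("long"))
val b = spark.read.parquet("/tmp/merge_demo/partition_date=2020-01-02")
a.union(b).show()
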