Questions tagged [apache-spark-2.0]

Use for questions specific to Apache Spark 2.0. For general questions related to Apache Spark, use the tag [apache-spark].

464 questions
8
votes
1 answer

Scala case class ignoring import in the Spark shell

I hope there is an obvious answer to this question! I've just upgraded to Spark v2.0 and have an odd problem with the spark-shell (Scala 2.11 build). If I enter the following minimal Scala:

import java.sql.Timestamp
case class Crime(caseNumber:…
Max van Daalen
  • 317
  • 2
  • 14
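A hedged sketch of the two workarounds usually suggested for this shell quirk, assuming the issue is that the 2.0 REPL compiles each line in its own wrapper object so the import is not in scope where the case class is compiled; the `reported` field is hypothetical since the original definition is truncated:

```scala
// Workaround 1: enter the import and the case class in one :paste block,
// so both compile inside the same REPL wrapper.
import java.sql.Timestamp
case class Crime(caseNumber: String, reported: Timestamp)

// Workaround 2: skip the import entirely and fully qualify the type.
case class Crime2(caseNumber: String, reported: java.sql.Timestamp)
```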
7
votes
2 answers

How can I join a spark live stream with all the data collected by another stream during its entire life cycle?

I have two Spark streams; the first carries data related to products: their price to the supplier, the currency, their description, and the supplier id. This data is enriched with a category, inferred by analyzing the description and the price…
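One classic way to do this with DStreams is to fold the first stream into state with updateStateByKey and join each micro-batch of the second stream against the full accumulated state. A minimal sketch, with hypothetical hosts, ports, and record formats:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("stream-enrichment")
val ssc = new StreamingContext(conf, Seconds(10))
ssc.checkpoint("/chk") // required by updateStateByKey

// Stream 1: (supplierId, productInfo) pairs; real parsing elided.
val products = ssc.socketTextStream("host1", 9999)
  .map { line => val f = line.split(","); (f(0), f(1)) }

// Accumulate everything seen so far, keeping the latest value per key.
val productState = products.updateStateByKey[String] {
  (newValues: Seq[String], old: Option[String]) =>
    newValues.lastOption.orElse(old)
}

// Stream 2: (supplierId, event) pairs.
val events = ssc.socketTextStream("host2", 9999)
  .map { line => val f = line.split(","); (f(0), f(1)) }

// Each batch of events is joined against the whole state collected so far.
val enriched = events.join(productState)
enriched.print()

ssc.start()
ssc.awaitTermination()
```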
7
votes
2 answers

How does Spark 2.0 handle column nullability?

In the recently released The Data Engineer's Guide to Apache Spark, the authors stated (page 74): "...when you define a schema where all columns are declared to not have null values - Spark will not enforce that and will happily let null values…
Wes
  • 648
  • 7
  • 14
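A minimal sketch reproducing the book's point: the schema below declares both columns non-nullable, yet Spark accepts a row containing a null anyway (column names are illustrative):

```scala
import org.apache.spark.sql.types._
import org.apache.spark.sql.{Row, SparkSession}

val spark = SparkSession.builder().appName("nullability").getOrCreate()

val schema = StructType(Seq(
  StructField("id", IntegerType, nullable = false),
  StructField("name", StringType, nullable = false)))

val rows = spark.sparkContext.parallelize(Seq(Row(1, "a"), Row(2, null)))
val df = spark.createDataFrame(rows, schema)

df.show()        // the null row comes through; nullability is not enforced
df.printSchema() // the schema still claims name is non-nullable
```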
7
votes
1 answer

jsontostructs to Row in spark structured streaming

I'm using Spark 2.2 and I'm trying to read JSON messages from Kafka, transform them to a DataFrame, and have them as a Row:

spark
  .readStream()
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  …
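The usual Spark 2.2 pattern, sketched with a hypothetical topic and message schema: cast the Kafka value to a string, parse it with from_json (which shows up as jsontostructs in the plan), then flatten the struct into top-level Row columns with select("data.*"):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types._

val spark = SparkSession.builder().appName("kafka-json").getOrCreate()

val schema = StructType(Seq(          // hypothetical message schema
  StructField("id", LongType),
  StructField("name", StringType)))

val parsed = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "events")      // topic name is illustrative
  .load()
  .select(from_json(col("value").cast("string"), schema).as("data"))
  .select("data.*")                   // each JSON field becomes a Row column

parsed.writeStream.format("console").start()
```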
7
votes
0 answers

Spark Streaming - Stopped worker throws FileNotFoundException

I am running a Spark Streaming application on a cluster composed of three nodes, each with a worker and three executors (nine executors in total). I am using Spark standalone mode (version 2.1.1). The application is run with a spark-submit…
7
votes
1 answer

Kryo Serialization for Spark 2.x Dataset

Is Kryo serialization still required when working with the Dataset API? Because Datasets use Encoders for serialization and deserialization: Does Kryo serialization even work for Datasets? (Provided the right config is passed to Spark, and…
y.mazari
  • 374
  • 2
  • 8
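A small sketch of the distinction, assuming a stock 2.x setup: Datasets of case classes go through built-in ExpressionEncoders regardless of spark.serializer, and Kryo only enters the picture if you explicitly ask for a Kryo-backed encoder:

```scala
import org.apache.spark.sql.{Encoders, SparkSession}

case class Point(x: Double, y: Double)
class Legacy(val payload: String) extends Serializable // no built-in encoder

val spark = SparkSession.builder()
  .appName("encoders-vs-kryo")
  // This setting affects RDD-level serialization, not Dataset encoding.
  .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .getOrCreate()
import spark.implicits._

val ds = Seq(Point(1.0, 2.0), Point(3.0, 4.0)).toDS() // ExpressionEncoder, no Kryo

// Opting in to Kryo for a type that has no built-in encoder:
implicit val legacyEnc = Encoders.kryo[Legacy]
val legacyDs = spark.createDataset(Seq(new Legacy("a"), new Legacy("b")))
```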
7
votes
3 answers

How to use dataset to groupby

I have a request to use an RDD to do this:

val test = Seq(("New York", "Jack"), ("Los Angeles", "Tom"), ("Chicago", "David"),
  ("Houston", "John"), ("Detroit", "Michael"), ("Chicago", "Andrew"),
  ("Detroit", "Peter"), …
monkeysjourney
  • 83
  • 1
  • 1
  • 5
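A sketch of the typed Dataset route, reusing the question's data: groupByKey on the city, then mapGroups to collect the names per group:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("ds-groupby").getOrCreate()
import spark.implicits._

val test = Seq(("New York", "Jack"), ("Los Angeles", "Tom"),
  ("Chicago", "David"), ("Houston", "John"), ("Detroit", "Michael"),
  ("Chicago", "Andrew"), ("Detroit", "Peter")).toDS()

// groupByKey + mapGroups: the typed counterpart of rdd.groupBy.
val grouped = test
  .groupByKey(_._1)
  .mapGroups { (city, rows) => (city, rows.map(_._2).toSeq) }

grouped.show(truncate = false) // e.g. (Chicago, [David, Andrew])
```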
7
votes
1 answer

Read parquet into spark dataset ignoring missing fields

Let's assume I create a parquet file as follows:

case class A(i: Int, j: Double, s: String)
var l1 = List(A(1, 2.0, "s1"), A(2, 3.0, "S2"))
val ds = spark.createDataset(l1)
ds.write.parquet("/tmp/test.parquet")

Is it possible to read it into a Dataset of…
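One workaround, sketched under the assumption that the target case class has a field the file lacks (the extra field k is hypothetical): add the missing column as a typed null before converting with .as[]:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.lit

case class A(i: Int, j: Double, s: String)
case class B(i: Int, j: Double, s: String, k: Option[Int]) // k is not in the file

val spark = SparkSession.builder().appName("missing-fields").getOrCreate()
import spark.implicits._

spark.createDataset(List(A(1, 2.0, "s1"), A(2, 3.0, "S2")))
  .write.mode("overwrite").parquet("/tmp/test.parquet")

// .as[B] alone fails because column k does not exist in the file;
// adding it as a typed null first makes the conversion succeed.
val dsB = spark.read.parquet("/tmp/test.parquet")
  .withColumn("k", lit(null).cast("int"))
  .as[B]
dsB.show()
```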
7
votes
3 answers

Why does SparkSQL require two literal escape backslashes in the SQL query?

When I run the Scala code below from the Spark 2.0 REPL (spark-shell), it runs as I intended, splitting the string with a simple regular expression.

import org.apache.spark.sql.SparkSession

// Create session
val sparkSession =…
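The short version, illustrated with a throwaway DataFrame: the DataFrame API passes your Scala string straight to the regex engine, while a SQL query string is unescaped once by the SQL parser first, so the backslash has to be doubled again:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, split}

val spark = SparkSession.builder().appName("escape-demo").getOrCreate()
import spark.implicits._

val df = Seq("a.b.c").toDF("s")
df.createOrReplaceTempView("t")

// Scala eats one backslash: the regex engine receives \. and matches a dot.
df.select(split(col("s"), "\\.")).show(false)

// The SQL parser eats another level: four backslashes in Scala source
// become \\. in the SQL literal, and \. by the time the regex runs.
spark.sql("SELECT split(s, '\\\\.') FROM t").show(false)
```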
7
votes
1 answer

Dynamic Allocation for Spark Streaming

I have a Spark Streaming job running on our cluster alongside other jobs (Spark Core jobs). I want to use Dynamic Resource Allocation for these jobs, including Spark Streaming. According to the JIRA issue below, Dynamic Allocation is not supported for Spark…
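For reference, a hedged configuration sketch: the core spark.dynamicAllocation.* switches target batch jobs, while SPARK-12133 added a separate streaming-specific variant, and Spark refuses to run with both enabled at once; the executor bounds below are illustrative:

```scala
import org.apache.spark.SparkConf

// Batch jobs would use spark.dynamicAllocation.enabled (plus the external
// shuffle service); the streaming-specific variant below comes from
// SPARK-12133 and must not be combined with the core one.
val conf = new SparkConf()
  .setAppName("streaming-with-dra")
  .set("spark.streaming.dynamicAllocation.enabled", "true")
  .set("spark.streaming.dynamicAllocation.minExecutors", "2")  // illustrative bounds
  .set("spark.streaming.dynamicAllocation.maxExecutors", "10")
```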
7
votes
0 answers

Apache Spark | java.lang.AssertionError: assertion failed

I am using Apache Spark 2.0.2 and facing the following issue while using a cartesian product in the Spark Streaming module. I am using snappy as the compression codec, but I face the same issue with the default one (LZ4); I am also using Kryo for…
Sameer
  • 91
  • 1
  • 7
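For context, the setup the question describes, reconstructed as a configuration sketch (this reproduces the environment, it is not a fix):

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("cartesian-streaming-job")
  .set("spark.io.compression.codec", "snappy") // same failure with lz4, the default
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
```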
7
votes
1 answer

Spark 2.0 memory fraction

I am working with Spark 2.0; the job starts by sorting the input data and storing its output on HDFS. I was getting out-of-memory errors; the solution was to increase the value of "spark.shuffle.memoryFraction" from 0.2 to 0.8, and this solved the…
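Worth noting: spark.shuffle.memoryFraction belongs to the legacy (pre-1.6) memory manager, so on Spark 2.0 it only takes effect in legacy mode. A sketch of both sets of knobs:

```scala
import org.apache.spark.SparkConf

// Unified memory manager (the Spark 2.0 default):
val conf = new SparkConf()
  .set("spark.memory.fraction", "0.8")        // execution + storage pool
  .set("spark.memory.storageFraction", "0.5") // share shielded for caching

// Legacy equivalent of what the question changed:
//   .set("spark.memory.useLegacyMode", "true")
//   .set("spark.shuffle.memoryFraction", "0.8")
```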
7
votes
3 answers

GroupByKey with datasets in Spark 2.0 using Java

I have a dataset containing data like the following:

| c1 | c2 |
|----|----|
| 1  | a  |
| 1  | b  |
| 1  | c  |
| 2  | a  |
| 2  | b  |
...

Now, I want to get the data grouped like the following (col1: String Key, col2: List):

| c1 | c2 |
|----|----|
| 1…
Andreas
  • 130
  • 2
  • 7
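Sketched in Scala (the Java Dataset API mirrors it): the untyped route sidesteps mapGroups entirely and builds the list column with collect_list:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.collect_list

val spark = SparkSession.builder().appName("collect-list").getOrCreate()
import spark.implicits._

val df = Seq((1, "a"), (1, "b"), (1, "c"), (2, "a"), (2, "b")).toDF("c1", "c2")

df.groupBy("c1")
  .agg(collect_list("c2").as("c2"))
  .show()
// +---+---------+
// | c1|       c2|
// +---+---------+
// |  1|[a, b, c]|
// |  2|   [a, b]|
// +---+---------+
```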
7
votes
3 answers

How to build Spark from the sources from the Download Spark page?

I tried to install and build Spark 2.0.0 on an Ubuntu 16.04 VM as follows:

Install Java:

sudo apt-add-repository ppa:webupd8team/java
sudo apt-get update
sudo apt-get install oracle-java8-installer

Install Scala: Go to their…
Michael Westen
  • 169
  • 2
  • 10
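For reference, the build commands documented for a Spark 2.0 source tree (the profile names are the usual examples; adjust them to your Hadoop version):

```bash
# Plain build from the unpacked source directory:
./build/mvn -DskipTests clean package

# Or produce a runnable, distributable tarball:
./dev/make-distribution.sh --name custom-spark --tgz -Phadoop-2.7 -Pyarn
```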
6
votes
1 answer

Merging schemas when reading parquet files fails because of incompatible data types int and bigint

When trying to load parquet files with schema merging:

df = spark.read.option("mergeSchema", "true").parquet('some_path/partition_date')
df.show()

I'm getting the following exception: Py4JJavaError: An error occurred while calling…
saloua
  • 2,433
  • 4
  • 27
  • 37
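mergeSchema can only reconcile compatible schemas, and int vs bigint on the same column is rejected. One common workaround, sketched in Scala with hypothetical partition paths and a hypothetical conflicting column id: load the offending partitions separately, widen int to long, then union:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().appName("schema-fix").getOrCreate()

// Partition paths and the column name are illustrative.
val oldPart = spark.read.parquet("some_path/partition_date=2020-01-01")
  .withColumn("id", col("id").cast("bigint")) // int -> bigint
val newPart = spark.read.parquet("some_path/partition_date=2020-01-02")

// Align column order, then union (union matches columns by position).
val merged = oldPart.select(newPart.columns.map(col): _*).union(newPart)
merged.show()
```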