Questions tagged [apache-spark-2.0]

Use for questions specific to Apache Spark 2.0. For general questions related to Apache Spark, use the tag [apache-spark].

464 questions
8
votes
1 answer

Scala case class ignoring import in the Spark shell

I hope there is an obvious answer to this question! I've just upgraded to Spark v2.0 and have an odd problem with the spark-shell (Scala 2.11 build). If I enter the following minimal Scala:

import java.sql.Timestamp
case class Crime(caseNumber:…
Max van Daalen
  • 317
  • 2
  • 14
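A hedged sketch of the two workarounds usually suggested for this shell quirk, assuming the issue is that the 2.0 REPL compiles each line in its own wrapper object so the import is not in scope where the case class is compiled; the `reported` field is hypothetical since the original definition is truncated:

```scala
// Workaround 1: enter the import and the case class in one :paste block,
// so both compile inside the same REPL wrapper.
import java.sql.Timestamp
case class Crime(caseNumber: String, reported: Timestamp)

// Workaround 2: skip the import entirely and fully qualify the type.
case class Crime2(caseNumber: String, reported: java.sql.Timestamp)
```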
7
votes
2 answers

How can I join a spark live stream with all the data collected by another stream during its entire life cycle?

I have two Spark streams; the first carries data related to products: their price to the supplier, the currency, their description, and the supplier id. This data is enriched with a category, inferred by analyzing the description and the price…
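One classic way to do this with DStreams is to fold the first stream into state with updateStateByKey and join each micro-batch of the second stream against the full accumulated state. A minimal sketch, with hypothetical hosts, ports, and record formats:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("stream-enrichment")
val ssc = new StreamingContext(conf, Seconds(10))
ssc.checkpoint("/chk") // required by updateStateByKey

// Stream 1: (supplierId, productInfo) pairs; real parsing elided.
val products = ssc.socketTextStream("host1", 9999)
  .map { line => val f = line.split(","); (f(0), f(1)) }

// Accumulate everything seen so far, keeping the latest value per key.
val productState = products.updateStateByKey[String] {
  (newValues: Seq[String], old: Option[String]) =>
    newValues.lastOption.orElse(old)
}

// Stream 2: (supplierId, event) pairs.
val events = ssc.socketTextStream("host2", 9999)
  .map { line => val f = line.split(","); (f(0), f(1)) }

// Each batch of events is joined against the whole state collected so far.
val enriched = events.join(productState)
enriched.print()

ssc.start()
ssc.awaitTermination()
```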
7
votes
2 answers

How does Spark 2.0 handle column nullability?

In the recently released The Data Engineer's Guide to Apache Spark, the authors stated (page 74): "...when you define a schema where all columns are declared to not have null values - Spark will not enforce that and will happily let null values…
Wes
  • 648
  • 7
  • 14
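A minimal sketch reproducing the book's point: the schema below declares both columns non-nullable, yet Spark accepts a row containing a null anyway (column names are illustrative):

```scala
import org.apache.spark.sql.types._
import org.apache.spark.sql.{Row, SparkSession}

val spark = SparkSession.builder().appName("nullability").getOrCreate()

val schema = StructType(Seq(
  StructField("id", IntegerType, nullable = false),
  StructField("name", StringType, nullable = false)))

val rows = spark.sparkContext.parallelize(Seq(Row(1, "a"), Row(2, null)))
val df = spark.createDataFrame(rows, schema)

df.show()        // the null row comes through; nullability is not enforced
df.printSchema() // the schema still claims name is non-nullable
```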
7
votes
1 answer

jsontostructs to Row in spark structured streaming

I'm using Spark 2.2 and I'm trying to read JSON messages from Kafka, transform them to a DataFrame, and have them as a Row:

spark
  .readStream()
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  …
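The usual Spark 2.2 pattern, sketched with a hypothetical topic and message schema: cast the Kafka value to a string, parse it with from_json (which shows up as jsontostructs in the plan), then flatten the struct into top-level Row columns with select("data.*"):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types._

val spark = SparkSession.builder().appName("kafka-json").getOrCreate()

val schema = StructType(Seq(          // hypothetical message schema
  StructField("id", LongType),
  StructField("name", StringType)))

val parsed = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "events")      // topic name is illustrative
  .load()
  .select(from_json(col("value").cast("string"), schema).as("data"))
  .select("data.*")                   // each JSON field becomes a Row column

parsed.writeStream.format("console").start()
```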
7
votes
0 answers

Spark Streaming - Stopped worker throws FileNotFoundException

I am running a Spark Streaming application on a cluster composed of three nodes, each with a worker and three executors (nine executors in total). I am using Spark standalone mode (version 2.1.1). The application is run with a spark-submit…
7
votes
1 answer

Kryo Serialization for Spark 2.x Dataset

Is Kryo serialization still required when working with the Dataset API? Because Datasets use Encoders for serialization and deserialization: Does Kryo serialization even work for Datasets? (Provided the right config is passed to Spark, and…
y.mazari
  • 374
  • 2
  • 8
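A small sketch of the distinction, assuming a stock 2.x setup: Datasets of case classes go through built-in ExpressionEncoders regardless of spark.serializer, and Kryo only enters the picture if you explicitly ask for a Kryo-backed encoder:

```scala
import org.apache.spark.sql.{Encoders, SparkSession}

case class Point(x: Double, y: Double)
class Legacy(val payload: String) extends Serializable // no built-in encoder

val spark = SparkSession.builder()
  .appName("encoders-vs-kryo")
  // This setting affects RDD-level serialization, not Dataset encoding.
  .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .getOrCreate()
import spark.implicits._

val ds = Seq(Point(1.0, 2.0), Point(3.0, 4.0)).toDS() // ExpressionEncoder, no Kryo

// Opting in to Kryo for a type that has no built-in encoder:
implicit val legacyEnc = Encoders.kryo[Legacy]
val legacyDs = spark.createDataset(Seq(new Legacy("a"), new Legacy("b")))
```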
7
votes
3 answers

How to use dataset to groupby

I have a request to use an RDD to do this:

val test = Seq(("New York", "Jack"), ("Los Angeles", "Tom"), ("Chicago", "David"),
  ("Houston", "John"), ("Detroit", "Michael"), ("Chicago", "Andrew"),
  ("Detroit", "Peter"), …
monkeysjourney
  • 83
  • 1
  • 1
  • 5
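A sketch of the typed Dataset route, reusing the question's data: groupByKey on the city, then mapGroups to collect the names per group:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("ds-groupby").getOrCreate()
import spark.implicits._

val test = Seq(("New York", "Jack"), ("Los Angeles", "Tom"),
  ("Chicago", "David"), ("Houston", "John"), ("Detroit", "Michael"),
  ("Chicago", "Andrew"), ("Detroit", "Peter")).toDS()

// groupByKey + mapGroups: the typed counterpart of rdd.groupBy.
val grouped = test
  .groupByKey(_._1)
  .mapGroups { (city, rows) => (city, rows.map(_._2).toSeq) }

grouped.show(truncate = false) // e.g. (Chicago, [David, Andrew])
```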
7
votes
1 answer

Read parquet into spark dataset ignoring missing fields

Let's assume I create a parquet file as follows:

case class A(i: Int, j: Double, s: String)
var l1 = List(A(1, 2.0, "s1"), A(2, 3.0, "S2"))
val ds = spark.createDataset(l1)
ds.write.parquet("/tmp/test.parquet")

Is it possible to read it into a Dataset of…
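One workaround, sketched under the assumption that the target case class has a field the file lacks (the extra field k is hypothetical): add the missing column as a typed null before converting with .as[]:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.lit

case class A(i: Int, j: Double, s: String)
case class B(i: Int, j: Double, s: String, k: Option[Int]) // k is not in the file

val spark = SparkSession.builder().appName("missing-fields").getOrCreate()
import spark.implicits._

spark.createDataset(List(A(1, 2.0, "s1"), A(2, 3.0, "S2")))
  .write.mode("overwrite").parquet("/tmp/test.parquet")

// .as[B] alone fails because column k does not exist in the file;
// adding it as a typed null first makes the conversion succeed.
val dsB = spark.read.parquet("/tmp/test.parquet")
  .withColumn("k", lit(null).cast("int"))
  .as[B]
dsB.show()
```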
7
votes
3 answers

Why does SparkSQL require two literal escape backslashes in the SQL query?

When I run the Scala code below from the Spark 2.0 REPL (spark-shell), it runs as I intended, splitting the string with a simple regular expression.

import org.apache.spark.sql.SparkSession

// Create session
val sparkSession =…
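The short version, illustrated with a throwaway DataFrame: the DataFrame API passes your Scala string straight to the regex engine, while a SQL query string is unescaped once by the SQL parser first, so the backslash has to be doubled again:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, split}

val spark = SparkSession.builder().appName("escape-demo").getOrCreate()
import spark.implicits._

val df = Seq("a.b.c").toDF("s")
df.createOrReplaceTempView("t")

// Scala eats one backslash: the regex engine receives \. and matches a dot.
df.select(split(col("s"), "\\.")).show(false)

// The SQL parser eats another level: four backslashes in Scala source
// become \\. in the SQL literal, and \. by the time the regex runs.
spark.sql("SELECT split(s, '\\\\.') FROM t").show(false)
```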
7
votes
1 answer

Dynamic Allocation for Spark Streaming

I have a Spark Streaming job running on our cluster alongside other jobs (Spark Core jobs). I want to use Dynamic Resource Allocation for these jobs, including Spark Streaming. According to the JIRA issue below, Dynamic Allocation is not supported for Spark…
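For reference, a hedged configuration sketch: the core spark.dynamicAllocation.* switches target batch jobs, while SPARK-12133 added a separate streaming-specific variant, and Spark refuses to run with both enabled at once; the executor bounds below are illustrative:

```scala
import org.apache.spark.SparkConf

// Batch jobs would use spark.dynamicAllocation.enabled (plus the external
// shuffle service); the streaming-specific variant below comes from
// SPARK-12133 and must not be combined with the core one.
val conf = new SparkConf()
  .setAppName("streaming-with-dra")
  .set("spark.streaming.dynamicAllocation.enabled", "true")
  .set("spark.streaming.dynamicAllocation.minExecutors", "2")  // illustrative bounds
  .set("spark.streaming.dynamicAllocation.maxExecutors", "10")
```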
7
votes
0 answers

Apache Spark | java.lang.AssertionError: assertion failed

I am using Apache Spark 2.0.2 and facing the following issue while using a cartesian product in the Spark Streaming module. I am using snappy as the compression codec, but I face the same issue with the default one (LZ4); I am also using Kryo for…
Sameer
  • 91
  • 1
  • 7
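For context, the setup the question describes, reconstructed as a configuration sketch (this reproduces the environment, it is not a fix):

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("cartesian-streaming-job")
  .set("spark.io.compression.codec", "snappy") // same failure with lz4, the default
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
```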
7
votes
1 answer

Spark 2.0 memory fraction

I am working with Spark 2.0; the job starts by sorting the input data and storing its output on HDFS. I was getting out-of-memory errors; the solution was to increase the value of "spark.shuffle.memoryFraction" from 0.2 to 0.8, and this solved the…
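Worth noting: spark.shuffle.memoryFraction belongs to the legacy (pre-1.6) memory manager, so on Spark 2.0 it only takes effect in legacy mode. A sketch of both sets of knobs:

```scala
import org.apache.spark.SparkConf

// Unified memory manager (the Spark 2.0 default):
val conf = new SparkConf()
  .set("spark.memory.fraction", "0.8")        // execution + storage pool
  .set("spark.memory.storageFraction", "0.5") // share shielded for caching

// Legacy equivalent of what the question changed:
//   .set("spark.memory.useLegacyMode", "true")
//   .set("spark.shuffle.memoryFraction", "0.8")
```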
7
votes
3 answers

GroupByKey with datasets in Spark 2.0 using Java

I have a dataset containing data like the following:

| c1 | c2 |
|----|----|
| 1  | a  |
| 1  | b  |
| 1  | c  |
| 2  | a  |
| 2  | b  |
...

Now, I want to get the data grouped like the following (col1: String Key, col2: List):

| c1 | c2 |
|----|----|
| 1…
Andreas
  • 130
  • 2
  • 7
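Sketched in Scala (the Java Dataset API mirrors it): the untyped route sidesteps mapGroups entirely and builds the list column with collect_list:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.collect_list

val spark = SparkSession.builder().appName("collect-list").getOrCreate()
import spark.implicits._

val df = Seq((1, "a"), (1, "b"), (1, "c"), (2, "a"), (2, "b")).toDF("c1", "c2")

df.groupBy("c1")
  .agg(collect_list("c2").as("c2"))
  .show()
// +---+---------+
// | c1|       c2|
// +---+---------+
// |  1|[a, b, c]|
// |  2|   [a, b]|
// +---+---------+
```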
7
votes
3 answers

How to build Spark from the sources from the Download Spark page?

I tried to install and build Spark 2.0.0 on an Ubuntu 16.04 VM as follows:

Install Java:

sudo apt-add-repository ppa:webupd8team/java
sudo apt-get update
sudo apt-get install oracle-java8-installer

Install Scala: Go to their…
Michael Westen
  • 169
  • 2
  • 10
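For reference, the build commands documented for a Spark 2.0 source tree (the profile names are the usual examples; adjust them to your Hadoop version):

```bash
# Plain build from the unpacked source directory:
./build/mvn -DskipTests clean package

# Or produce a runnable, distributable tarball:
./dev/make-distribution.sh --name custom-spark --tgz -Phadoop-2.7 -Pyarn
```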
6
votes
1 answer

Merging schemas when reading parquet files fails because of incompatible data types int and bigint

When trying to load parquet files with schema merging:

df = spark.read.option("mergeSchema", "true").parquet('some_path/partition_date')
df.show()

I'm getting the following exception: Py4JJavaError: An error occurred while calling…
saloua
  • 2,433
  • 4
  • 27
  • 37
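mergeSchema can only reconcile compatible schemas, and int vs bigint on the same column is rejected. One common workaround, sketched in Scala with hypothetical partition paths and a hypothetical conflicting column id: load the offending partitions separately, widen int to long, then union:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().appName("schema-fix").getOrCreate()

// Partition paths and the column name are illustrative.
val oldPart = spark.read.parquet("some_path/partition_date=2020-01-01")
  .withColumn("id", col("id").cast("bigint")) // int -> bigint
val newPart = spark.read.parquet("some_path/partition_date=2020-01-02")

// Align column order, then union (union matches columns by position).
val merged = oldPart.select(newPart.columns.map(col): _*).union(newPart)
merged.show()
```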