Questions tagged [apache-spark-dataset]

A Spark Dataset is a strongly typed collection of objects mapped to a relational schema. It supports optimizations similar to those of Spark DataFrames while providing a type-safe programming interface.
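As a minimal illustration of the typed API (a sketch assuming a local SparkSession; the `Person` case class and its fields are hypothetical):

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

val spark = SparkSession.builder()
  .appName("dataset-demo")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

case class Person(name: String, age: Long)

// toDS() turns a local Seq of case-class instances into a typed Dataset
val ds: Dataset[Person] = Seq(Person("Ann", 30), Person("Bo", 25)).toDS()

// Lambda-based operations are checked at compile time against Person's fields
ds.filter(_.age > 26).show()
```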

950 questions
22 votes, 2 answers

S3 SlowDown error in Spark on EMR

I am getting this error when writing a Parquet file; it has started happening recently: com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.model.AmazonS3Exception: Please reduce your request rate. (Service: Amazon S3; Status Code: 503;…
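A commonly cited mitigation is to raise the EMRFS retry limit and reduce the number of simultaneous writes against one S3 prefix. This is a hedged sketch: the property name `fs.s3.maxRetries` is EMRFS-specific, and the partition count shown is arbitrary — verify both against your EMR release:

```scala
// Assumes an existing SparkSession `spark` and DataFrame `df`.
// Raise EMRFS retries so throttled (503) requests are retried with backoff.
spark.sparkContext.hadoopConfiguration.set("fs.s3.maxRetries", "20")

// Fewer output partitions means fewer concurrent PUTs against one S3 prefix
df.repartition(64).write.parquet("s3://bucket/path/")
```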
22 votes, 2 answers

How to get keys and values from MapType column in SparkSQL DataFrame

I have data in a parquet file which has 2 fields: object_id: String and alpha: Map<>. It is read into a data frame in sparkSQL and the schema looks like this: scala> alphaDF.printSchema() root |-- object_id: string (nullable = true) |-- ALPHA: map…
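For a `MapType` column like `ALPHA`, `explode` turns each entry into its own row, and (in Spark 2.3+) `map_keys`/`map_values` expose the keys and values as arrays — a sketch against the schema shown in the question:

```scala
import org.apache.spark.sql.functions.{explode, map_keys, map_values}

// explode yields one row per map entry, with `key` and `value` columns
alphaDF.select($"object_id", explode($"ALPHA")).show()

// Spark 2.3+ alternative: extract keys and values as array columns
alphaDF.select(map_keys($"ALPHA"), map_values($"ALPHA")).show()
```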
22 votes, 2 answers

Create DataFrame with null values for a few columns

I am trying to create a DataFrame from an RDD. First I create an RDD using the code below: val account = sc.parallelize(Seq( (1, null, 2,"F"), (2, 2, 4, "F"), …
Avijit
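A bare `null` inside an `Int` tuple slot defeats Scala's type inference; wrapping nullable values in `Option` lets Spark infer a nullable column instead. A sketch (column names are assumptions, and it presumes `sc` plus `import spark.implicits._`):

```scala
import spark.implicits._

// Option[Int] encodes as a nullable integer column
val account = sc.parallelize(Seq(
  (1, None: Option[Int], 2, "F"),
  (2, Some(2), 4, "F")
)).toDF("id", "balance", "count", "flag")

account.printSchema()  // balance: integer (nullable = true)
</imports>
```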
20 votes, 1 answer

How to create a Spark Dataset from an RDD

I have an RDD[LabeledPoint] intended for use within a machine learning pipeline. How do we convert that RDD to a Dataset? Note that the newer spark.ml APIs require inputs in the Dataset format.
WestCoastProjects
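One approach, sketched here under the assumption that this is `org.apache.spark.ml.feature.LabeledPoint` (a case class whose `Vector` field carries a registered Spark SQL UDT), is to use the implicit product encoder:

```scala
import spark.implicits._

// createDataset (or equivalently rdd.toDS()) derives an Encoder for the
// LabeledPoint case class from spark.implicits._
val ds = spark.createDataset(rdd)
ds.printSchema()
```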
19 votes, 3 answers

How to name aggregate columns?

I'm using Spark in Scala and my aggregated columns are anonymous. Is there a convenient way to rename multiple columns from a dataset? I thought about imposing a schema with as but the key column is a struct (due to the groupBy operation), and I…
Emre
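Aggregate columns can be named inline with `.as(...)` (an alias on the aggregate expression), avoiding the auto-generated `sum(amount)`-style names — a sketch with assumed column names:

```scala
import org.apache.spark.sql.functions.{avg, sum}

// .as(...) names each aggregate instead of the generated "sum(amount)" etc.
val summary = df.groupBy($"key")
  .agg(sum($"amount").as("total"), avg($"amount").as("mean"))
```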
18 votes, 3 answers

Why does the error "Unable to find encoder for type stored in a Dataset" occur when encoding JSON using case classes?

I've written a Spark job: object SimpleApp { def main(args: Array[String]) { val conf = new SparkConf().setAppName("Simple Application").setMaster("local") val sc = new SparkContext(conf) val ctx = new…
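The usual causes are a case class defined inside the method that uses it, or a missing `import spark.implicits._`. A sketch of the corrected shape (file path and `Record` fields are hypothetical):

```scala
import org.apache.spark.sql.SparkSession

// The case class must live at top level (or in a companion/object), not
// inside main(), or Spark cannot derive its Encoder
case class Record(id: Long, name: String)

object SimpleApp {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("Simple Application")
      .master("local")
      .getOrCreate()
    import spark.implicits._  // brings the implicit Encoders into scope

    val ds = spark.read.json("/tmp/records.json").as[Record]
    ds.show()
  }
}
```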
17 votes, 1 answer

How to read ".gz" compressed file using spark DF or DS?

I have a compressed file in .gz format. Is it possible to read the file directly using a Spark DataFrame/Dataset? Details: the file is a CSV, tab-delimited.
prady
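Spark decompresses gzip transparently based on the file extension, so the usual CSV reader works unchanged — a sketch with an assumed path; note that gzip is not splittable, so each .gz file is read as a single partition:

```scala
// Gzip is handled automatically by the extension; "sep" sets the tab delimiter
val df = spark.read
  .option("sep", "\t")
  .csv("s3://bucket/data.csv.gz")
```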
17 votes, 2 answers

How to convert DataFrame to Dataset in Apache Spark in Java?

I can convert a DataFrame to a Dataset in Scala very easily: case class Person(name:String, age:Long) val df = ctx.read.json("/tmp/persons.json") val ds = df.as[Person] ds.printSchema but in the Java version I don't know how to convert a DataFrame to a Dataset.…
Milad Khajavi
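Since this entry is Java-specific, here is a Java sketch: the counterpart of Scala's `df.as[Person]` is `df.as(Encoders.bean(Person.class))`, which requires a serializable bean class with getters and setters (the `Person` fields below mirror the question's case class):

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class ToDataset {
    // Encoders.bean expects a public bean: no-arg constructor, getters/setters
    public static class Person implements java.io.Serializable {
        private String name;
        private long age;
        public String getName() { return name; }
        public void setName(String name) { this.name = name; }
        public long getAge() { return age; }
        public void setAge(long age) { this.age = age; }
    }

    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
            .appName("java-ds").master("local").getOrCreate();
        Dataset<Row> df = spark.read().json("/tmp/persons.json");
        // Java equivalent of the Scala df.as[Person]
        Dataset<Person> ds = df.as(Encoders.bean(Person.class));
        ds.printSchema();
    }
}
```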
16 votes, 2 answers

Partition data for efficient joining for Spark dataframe/dataset

I need to join many DataFrames together based on some shared key columns. For a key-value RDD, one can specify a partitioner so that data points with same key are shuffled to same executor so joining is more efficient (if one has shuffle related…
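For the DataFrame/Dataset API, pre-partitioning both sides on the join key lines matching rows up on the same executors; for joins repeated across jobs, bucketing persists that layout. A sketch (names `left`, `right`, `key`, and the bucket count are assumptions):

```scala
// Repartition both inputs on the join key before joining
val leftP  = left.repartition($"key")
val rightP = right.repartition($"key")
val joined = leftP.join(rightP, Seq("key"))

// For repeated joins, bucketing writes the partitioned layout to a table
left.write.bucketBy(32, "key").sortBy("key").saveAsTable("left_bucketed")
```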
16 votes, 1 answer

Spark simpler value_counts

Something similar to Spark - Group by Key then Count by Value would allow me to emulate the functionality of Pandas' df.series.value_counts() in Spark: The resulting object will be in descending order so that the first element is the most…
Georg Heiler
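The pandas `value_counts()` semantics (count per distinct value, most frequent first) map directly onto a group-count-sort — a sketch with an assumed column name:

```scala
// Count each distinct value of `series`, most frequent first
df.groupBy($"series")
  .count()
  .orderBy($"count".desc)
  .show()
```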
15 votes, 1 answer

Why is predicate pushdown not used in typed Dataset API (vs untyped DataFrame API)?

I always thought that the Dataset and DataFrame APIs are the same, and that the only difference is that the Dataset API gives you compile-time safety. Right? So I have a very simple case: case class Player (playerID: String, birthYear: Int) val playersDs:…
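The crux is that a typed filter takes an arbitrary Scala lambda, which Catalyst cannot inspect, while a `Column` expression stays within the optimizer's reach — sketched against the question's `playersDs`:

```scala
// Opaque lambda: every row is deserialized, predicate is NOT pushed down
playersDs.filter(_.birthYear > 1990)

// Column expression: Catalyst can push the predicate to the Parquet reader
playersDs.filter($"birthYear" > 1990)
```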
15 votes, 2 answers

Spark Dataset select with typedcolumn

Looking at the select() function on the spark DataSet there are various generated function signatures: (c1: TypedColumn[MyClass, U1],c2: TypedColumn[MyClass, U2] ....) This seems to hint that I should be able to reference the members of MyClass…
Jeremy
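A `TypedColumn` is obtained by calling `.as[T]` on an ordinary column (the encoders come from `import spark.implicits._`); selecting several of them yields a Dataset of tuples — a sketch with assumed field names:

```scala
import spark.implicits._

// .as[T] turns each Column into a TypedColumn[_, T]
val pairs: Dataset[(String, Long)] =
  ds.select($"name".as[String], $"age".as[Long])
```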
14 votes, 5 answers

How to lower the case of column names of a data frame but not its values?

How can I lowercase the column names of a data frame, but not its values, using raw Spark SQL and DataFrame methods? Input data frame (imagine I have hundreds of these columns in uppercase): NAME | COUNTRY | SRC | CITY |…
user1870400
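Passing a full list of new names to `toDF` rewrites only the column names, leaving the data untouched — a one-line sketch:

```scala
// Rename every column to its lowercase form; values are unaffected
val lowered = df.toDF(df.columns.map(_.toLowerCase): _*)
```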
14 votes, 2 answers

Mapping Spark DataSet row values into new hash column

Given the following DataSet values as inputData: column0 column1 column2 column3 A 88 text 99 Z 12 test 200 T 120 foo 12 In Spark, what is an efficient way to compute a new hash column, and append it to a…
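One common pattern is to string-concatenate every column and hash the result with `sha2` — a sketch in which the `"||"` separator and the output column name are assumptions:

```scala
import org.apache.spark.sql.functions.{col, concat_ws, sha2}

// concat_ws stringifies and joins all columns; sha2(_, 256) then yields a
// stable hex digest per row, appended as a new column
val withHash = ds.withColumn(
  "row_hash",
  sha2(concat_ws("||", ds.columns.map(col): _*), 256)
)
```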
14 votes, 2 answers

Spark structured streaming - join static dataset with streaming dataset

I'm using Spark structured streaming to process records read from Kafka. Here's what I'm trying to achieve: (a) Each record is a Tuple2 of type (Timestamp, DeviceId). (b) I've created a static Dataset[DeviceId] which contains the set of all valid…
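Structured Streaming supports stream-static inner joins: each micro-batch of the stream is joined against the static Dataset. A sketch, assuming the streaming DataFrame has a `deviceId` column and `DeviceId`'s single field is also named `deviceId`:

```scala
// Flatten the static Dataset[DeviceId] to a one-column DataFrame; the column
// name comes from the case-class field
val validDf = validIds.toDF()

// Inner join keeps only records whose deviceId appears in the static set
val filtered = streamingDf.join(validDf, Seq("deviceId"), "inner")
```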