Questions tagged [apache-spark-dataset]

A Spark Dataset is a strongly typed collection of objects mapped to a relational schema. It supports optimizations similar to those of Spark DataFrames while providing a type-safe programming interface.
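As a minimal illustration of the typed API (a sketch assuming a local SparkSession; the `Person` case class and its fields are hypothetical):

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

val spark = SparkSession.builder()
  .appName("dataset-demo")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

case class Person(name: String, age: Long)

// toDS() turns a local Seq of case-class instances into a typed Dataset
val ds: Dataset[Person] = Seq(Person("Ann", 30), Person("Bo", 25)).toDS()

// Lambda-based operations are checked at compile time against Person's fields
ds.filter(_.age > 26).show()
```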

950 questions
22 votes, 2 answers

S3 SlowDown error in Spark on EMR

I am getting this error when writing a Parquet file; it has started happening recently: com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.model.AmazonS3Exception: Please reduce your request rate. (Service: Amazon S3; Status Code: 503;…
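A commonly cited mitigation is to raise the EMRFS retry limit and reduce the number of simultaneous writes against one S3 prefix. This is a hedged sketch: the property name `fs.s3.maxRetries` is EMRFS-specific, and the partition count shown is arbitrary — verify both against your EMR release:

```scala
// Assumes an existing SparkSession `spark` and DataFrame `df`.
// Raise EMRFS retries so throttled (503) requests are retried with backoff.
spark.sparkContext.hadoopConfiguration.set("fs.s3.maxRetries", "20")

// Fewer output partitions means fewer concurrent PUTs against one S3 prefix
df.repartition(64).write.parquet("s3://bucket/path/")
```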
22 votes, 2 answers

How to get keys and values from MapType column in SparkSQL DataFrame

I have data in a parquet file which has 2 fields: object_id: String and alpha: Map<>. It is read into a data frame in sparkSQL and the schema looks like this: scala> alphaDF.printSchema() root |-- object_id: string (nullable = true) |-- ALPHA: map…
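For a `MapType` column like `ALPHA`, `explode` turns each entry into its own row, and (in Spark 2.3+) `map_keys`/`map_values` expose the keys and values as arrays — a sketch against the schema shown in the question:

```scala
import org.apache.spark.sql.functions.{explode, map_keys, map_values}

// explode yields one row per map entry, with `key` and `value` columns
alphaDF.select($"object_id", explode($"ALPHA")).show()

// Spark 2.3+ alternative: extract keys and values as array columns
alphaDF.select(map_keys($"ALPHA"), map_values($"ALPHA")).show()
```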
22 votes, 2 answers

Create DataFrame with null values for a few columns

I am trying to create a DataFrame from an RDD. First I create an RDD using the code below: val account = sc.parallelize(Seq( (1, null, 2,"F"), (2, 2, 4, "F"), …
Avijit
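A bare `null` inside an `Int` tuple slot defeats Scala's type inference; wrapping nullable values in `Option` lets Spark infer a nullable column instead. A sketch (column names are assumptions, and it presumes `sc` plus `import spark.implicits._`):

```scala
import spark.implicits._

// Option[Int] encodes as a nullable integer column
val account = sc.parallelize(Seq(
  (1, None: Option[Int], 2, "F"),
  (2, Some(2), 4, "F")
)).toDF("id", "balance", "count", "flag")

account.printSchema()  // balance: integer (nullable = true)
</imports>
```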
20 votes, 1 answer

How to create a Spark Dataset from an RDD

I have an RDD[LabeledPoint] intended for use within a machine learning pipeline. How do we convert that RDD to a Dataset? Note that the newer spark.ml APIs require inputs in the Dataset format.
WestCoastProjects
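One approach, sketched here under the assumption that this is `org.apache.spark.ml.feature.LabeledPoint` (a case class whose `Vector` field carries a registered Spark SQL UDT), is to use the implicit product encoder:

```scala
import spark.implicits._

// createDataset (or equivalently rdd.toDS()) derives an Encoder for the
// LabeledPoint case class from spark.implicits._
val ds = spark.createDataset(rdd)
ds.printSchema()
```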
19 votes, 3 answers

How to name aggregate columns?

I'm using Spark in Scala and my aggregated columns are anonymous. Is there a convenient way to rename multiple columns from a dataset? I thought about imposing a schema with as but the key column is a struct (due to the groupBy operation), and I…
Emre
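Aggregate columns can be named inline with `.as(...)` (an alias on the aggregate expression), avoiding the auto-generated `sum(amount)`-style names — a sketch with assumed column names:

```scala
import org.apache.spark.sql.functions.{avg, sum}

// .as(...) names each aggregate instead of the generated "sum(amount)" etc.
val summary = df.groupBy($"key")
  .agg(sum($"amount").as("total"), avg($"amount").as("mean"))
```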
18 votes, 3 answers

Why does the error "Unable to find encoder for type stored in a Dataset" occur when encoding JSON using case classes?

I've written a Spark job: object SimpleApp { def main(args: Array[String]) { val conf = new SparkConf().setAppName("Simple Application").setMaster("local") val sc = new SparkContext(conf) val ctx = new…
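The usual causes are a case class defined inside the method that uses it, or a missing `import spark.implicits._`. A sketch of the corrected shape (file path and `Record` fields are hypothetical):

```scala
import org.apache.spark.sql.SparkSession

// The case class must live at top level (or in a companion/object), not
// inside main(), or Spark cannot derive its Encoder
case class Record(id: Long, name: String)

object SimpleApp {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("Simple Application")
      .master("local")
      .getOrCreate()
    import spark.implicits._  // brings the implicit Encoders into scope

    val ds = spark.read.json("/tmp/records.json").as[Record]
    ds.show()
  }
}
```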
17 votes, 1 answer

How to read ".gz" compressed file using spark DF or DS?

I have a compressed file in .gz format. Is it possible to read the file directly using a Spark DataFrame/Dataset? Details: the file is a CSV, tab-delimited.
prady
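Spark decompresses gzip transparently based on the file extension, so the usual CSV reader works unchanged — a sketch with an assumed path; note that gzip is not splittable, so each .gz file is read as a single partition:

```scala
// Gzip is handled automatically by the extension; "sep" sets the tab delimiter
val df = spark.read
  .option("sep", "\t")
  .csv("s3://bucket/data.csv.gz")
```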
17 votes, 2 answers

How to convert DataFrame to Dataset in Apache Spark in Java?

I can convert a DataFrame to a Dataset in Scala very easily: case class Person(name:String, age:Long) val df = ctx.read.json("/tmp/persons.json") val ds = df.as[Person] ds.printSchema but in the Java version I don't know how to convert a DataFrame to a Dataset.…
Milad Khajavi
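Since this entry is Java-specific, here is a Java sketch: the counterpart of Scala's `df.as[Person]` is `df.as(Encoders.bean(Person.class))`, which requires a serializable bean class with getters and setters (the `Person` fields below mirror the question's case class):

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class ToDataset {
    // Encoders.bean expects a public bean: no-arg constructor, getters/setters
    public static class Person implements java.io.Serializable {
        private String name;
        private long age;
        public String getName() { return name; }
        public void setName(String name) { this.name = name; }
        public long getAge() { return age; }
        public void setAge(long age) { this.age = age; }
    }

    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
            .appName("java-ds").master("local").getOrCreate();
        Dataset<Row> df = spark.read().json("/tmp/persons.json");
        // Java equivalent of the Scala df.as[Person]
        Dataset<Person> ds = df.as(Encoders.bean(Person.class));
        ds.printSchema();
    }
}
```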
16 votes, 2 answers

Partition data for efficient joining for Spark dataframe/dataset

I need to join many DataFrames together based on some shared key columns. For a key-value RDD, one can specify a partitioner so that data points with same key are shuffled to same executor so joining is more efficient (if one has shuffle related…
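For the DataFrame/Dataset API, pre-partitioning both sides on the join key lines matching rows up on the same executors; for joins repeated across jobs, bucketing persists that layout. A sketch (names `left`, `right`, `key`, and the bucket count are assumptions):

```scala
// Repartition both inputs on the join key before joining
val leftP  = left.repartition($"key")
val rightP = right.repartition($"key")
val joined = leftP.join(rightP, Seq("key"))

// For repeated joins, bucketing writes the partitioned layout to a table
left.write.bucketBy(32, "key").sortBy("key").saveAsTable("left_bucketed")
```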
16 votes, 1 answer

Spark simpler value_counts

Something similar to Spark - Group by Key then Count by Value would allow me to emulate the functionality of Pandas' df.series.value_counts() in Spark: The resulting object will be in descending order so that the first element is the most…
Georg Heiler
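The pandas `value_counts()` semantics (count per distinct value, most frequent first) map directly onto a group-count-sort — a sketch with an assumed column name:

```scala
// Count each distinct value of `series`, most frequent first
df.groupBy($"series")
  .count()
  .orderBy($"count".desc)
  .show()
```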
15 votes, 1 answer

Why is predicate pushdown not used in typed Dataset API (vs untyped DataFrame API)?

I always thought that the Dataset and DataFrame APIs are the same, and that the only difference is that the Dataset API gives you compile-time safety. Right? So I have a very simple case: case class Player (playerID: String, birthYear: Int) val playersDs:…
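The crux is that a typed filter takes an arbitrary Scala lambda, which Catalyst cannot inspect, while a `Column` expression stays within the optimizer's reach — sketched against the question's `playersDs`:

```scala
// Opaque lambda: every row is deserialized, predicate is NOT pushed down
playersDs.filter(_.birthYear > 1990)

// Column expression: Catalyst can push the predicate to the Parquet reader
playersDs.filter($"birthYear" > 1990)
```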
15 votes, 2 answers

Spark Dataset select with typedcolumn

Looking at the select() function on the spark DataSet there are various generated function signatures: (c1: TypedColumn[MyClass, U1],c2: TypedColumn[MyClass, U2] ....) This seems to hint that I should be able to reference the members of MyClass…
Jeremy
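A `TypedColumn` is obtained by calling `.as[T]` on an ordinary column (the encoders come from `import spark.implicits._`); selecting several of them yields a Dataset of tuples — a sketch with assumed field names:

```scala
import spark.implicits._

// .as[T] turns each Column into a TypedColumn[_, T]
val pairs: Dataset[(String, Long)] =
  ds.select($"name".as[String], $"age".as[Long])
```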
14 votes, 5 answers

How to lower the case of column names of a data frame but not its values?

How can I lowercase the column names of a data frame, but not its values, using raw Spark SQL and DataFrame methods? Input data frame (imagine I have hundreds of these columns in uppercase): NAME | COUNTRY | SRC | CITY |…
user1870400
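Passing a full list of new names to `toDF` rewrites only the column names, leaving the data untouched — a one-line sketch:

```scala
// Rename every column to its lowercase form; values are unaffected
val lowered = df.toDF(df.columns.map(_.toLowerCase): _*)
```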
14 votes, 2 answers

Mapping Spark DataSet row values into new hash column

Given the following DataSet values as inputData: column0 column1 column2 column3 A 88 text 99 Z 12 test 200 T 120 foo 12 In Spark, what is an efficient way to compute a new hash column, and append it to a…
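One common pattern is to string-concatenate every column and hash the result with `sha2` — a sketch in which the `"||"` separator and the output column name are assumptions:

```scala
import org.apache.spark.sql.functions.{col, concat_ws, sha2}

// concat_ws stringifies and joins all columns; sha2(_, 256) then yields a
// stable hex digest per row, appended as a new column
val withHash = ds.withColumn(
  "row_hash",
  sha2(concat_ws("||", ds.columns.map(col): _*), 256)
)
```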
14 votes, 2 answers

Spark structured streaming - join static dataset with streaming dataset

I'm using Spark structured streaming to process records read from Kafka. Here's what I'm trying to achieve: (a) Each record is a Tuple2 of type (Timestamp, DeviceId). (b) I've created a static Dataset[DeviceId] which contains the set of all valid…
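Structured Streaming supports stream-static inner joins: each micro-batch of the stream is joined against the static Dataset. A sketch, assuming the streaming DataFrame has a `deviceId` column and `DeviceId`'s single field is also named `deviceId`:

```scala
// Flatten the static Dataset[DeviceId] to a one-column DataFrame; the column
// name comes from the case-class field
val validDf = validIds.toDF()

// Inner join keeps only records whose deviceId appears in the static set
val filtered = streamingDf.join(validDf, Seq("deviceId"), "inner")
```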