Questions tagged [apache-spark-dataset]

Spark Dataset is a strongly typed collection of objects mapped to a relational schema. It supports optimizations similar to those of Spark DataFrames while providing a type-safe programming interface.


950 questions
14 votes · 2 answers

Encode an ADT / sealed trait hierarchy into a Spark Dataset column

If I want to store an Algebraic Data Type (ADT) (i.e. a Scala sealed trait hierarchy) within a Spark Dataset column, what is the best encoding strategy? For example, if I have an ADT where the leaf types store different kinds of data: sealed trait…
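The accepted answer is not shown on this page, but a minimal sketch of one common workaround is an opaque Kryo binary encoder; the Shape hierarchy below is a hypothetical stand-in for the truncated trait:

```scala
import org.apache.spark.sql.{Encoder, Encoders, SparkSession}

sealed trait Shape // hypothetical ADT standing in for the one in the question
case class Circle(radius: Double) extends Shape
case class Square(side: Double) extends Shape

object AdtEncoding {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("adt").getOrCreate()
    import spark.implicits._

    // Spark has no built-in Encoder for a sealed trait, so a Kryo-based
    // binary encoder is one way out; the column becomes opaque bytes,
    // which means Catalyst cannot inspect or optimize its contents.
    implicit val shapeEncoder: Encoder[Shape] = Encoders.kryo[Shape]

    val shapes = spark.createDataset(Seq[Shape](Circle(1.0), Square(2.0)))
    shapes.map { case Circle(r) => r * r; case Square(s) => s * s }.show()
    spark.stop()
  }
}
```

The trade-off: the typed API still works, but the column is no longer a relational struct, so SQL expressions cannot reach inside it.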
13 votes · 0 answers

Spark - RelationalGroupedDataset vs. KeyValueGroupedDataset? When should I use each of them?

When grouping a Dataset in Spark, there are two methods: groupBy and groupByKey[K]. groupBy returns a RelationalGroupedDataset, while groupByKey[K] returns a KeyValueGroupedDataset. What are the differences between them? Under what circumstances…
CyberPlayerOne
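As a hedged illustration of the distinction (not an answer from the page, which shows none): groupBy stays in the untyped relational world, while groupByKey keeps the element type.

```scala
import org.apache.spark.sql.SparkSession

object GroupingFlavors {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("grouping").getOrCreate()
    import spark.implicits._

    val ds = Seq(("a", 1), ("a", 2), ("b", 3)).toDS()

    // groupBy -> RelationalGroupedDataset: aggregate with Column
    // expressions; the result degrades to an untyped DataFrame.
    val relational = ds.groupBy($"_1").sum("_2")

    // groupByKey -> KeyValueGroupedDataset[K, T]: aggregate with plain
    // Scala functions; the result stays a typed Dataset, at the cost of
    // weaker Catalyst optimization for the opaque lambda.
    val typed = ds.groupByKey(_._1)
      .mapGroups((key, rows) => (key, rows.map(_._2).sum))

    relational.show()
    typed.show()
    spark.stop()
  }
}
```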
12 votes · 3 answers

java.lang.UnsupportedOperationException: Error in Spark when writing

When I try to write the dataset into Parquet files, I get the error below: 18/11/05 06:25:43 ERROR FileFormatWriter: Aborting job null. org.apache.spark.SparkException: Job aborted due to stage failure: Task 84 in stage 1.0 failed 4 times, most recent…
John Humanyun
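The stack trace is truncated, so only a hedged note is possible: this exception often points at a field type the Parquet writer cannot handle rather than at the write itself, so inspecting the resolved schema before a minimal write (output path below is hypothetical) is a reasonable first step:

```scala
import org.apache.spark.sql.SparkSession

object ParquetWriteCheck {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("write").getOrCreate()
    import spark.implicits._

    val ds = Seq(("a", 1), ("b", 2)).toDS()

    // Print the resolved schema first; an unsupported field type usually
    // shows up here before the write fails inside FileFormatWriter.
    ds.printSchema()
    ds.write.mode("overwrite").parquet("/tmp/out") // hypothetical output path
    spark.stop()
  }
}
```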
12 votes · 1 answer

Spark Java: Creating a new Dataset with a given schema

I have this code that works well in Scala:
val schema = StructType(Array(
  StructField("field1", StringType, true),
  StructField("field2", TimestampType, true),
  StructField("field3", DoubleType, true),
  …
Nakeuh
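A sketch of the equivalent construction: the createDataFrame(java.util.List[Row], StructType) overload used below exists in both the Scala and Java APIs (from Java one would build rows with RowFactory.create), so it translates directly; the sample row values are invented.

```scala
import java.sql.Timestamp
import java.util.Arrays

import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types._

object DatasetWithSchema {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("schema").getOrCreate()

    // Schema copied from the question's Scala snippet.
    val schema = StructType(Array(
      StructField("field1", StringType, nullable = true),
      StructField("field2", TimestampType, nullable = true),
      StructField("field3", DoubleType, nullable = true)))

    // Invented sample data matching the schema above.
    val rows = Arrays.asList(
      Row("a", Timestamp.valueOf("2020-01-01 00:00:00"), 1.0),
      Row("b", Timestamp.valueOf("2020-01-02 00:00:00"), 2.0))

    val df = spark.createDataFrame(rows, schema)
    df.show()
    spark.stop()
  }
}
```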
12 votes · 4 answers

How to convert a Spark Dataset of Rows into strings?

I have written code to access a Hive table using Spark SQL. Here is the code:
SparkSession spark = SparkSession
  .builder()
  .appName("Java Spark Hive Example")
  .master("local[*]")
  .config("hive.metastore.uris",…
Jaffer Wilson
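Two hedged options, sketched in Scala (the Java calls are the same): join each Row's values with mkString, or let Spark render every row as a JSON string.

```scala
import org.apache.spark.sql.SparkSession

object RowsToStrings {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("rows").getOrCreate()
    import spark.implicits._

    val df = Seq(("alice", 30), ("bob", 25)).toDF("name", "age") // toy data

    // Option 1: join each Row's values into one delimited string.
    val csvLike: Array[String] = df.collect().map(_.mkString(","))

    // Option 2: one JSON document per row; stays distributed until collect.
    val asJson: Array[String] = df.toJSON.collect()

    csvLike.foreach(println)
    asJson.foreach(println)
    spark.stop()
  }
}
```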
12 votes · 1 answer

Differences between Spark's Row and InternalRow types

Currently Spark has two implementations for Row:
import org.apache.spark.sql.Row
import org.apache.spark.sql.catalyst.InternalRow
What is the need to have both of them? Do they represent the same encoded entities, but one is used internally (internal…
marios
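A small sketch of where each type surfaces may help frame the question: Row is the public, decoded representation, while InternalRow is Catalyst's internal binary one, reachable only through developer APIs such as queryExecution.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SparkSession}

object RowVsInternalRow {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("rows").getOrCreate()
    import spark.implicits._

    val df = Seq((1, "a"), (2, "b")).toDF("id", "name")

    // Public API: each element is decoded into an external Row object.
    val external: RDD[Row] = df.rdd

    // Developer API: Catalyst's internal representation (UnsafeRow binary
    // format), kept undecoded to avoid serialization costs.
    val internal = df.queryExecution.toRdd // RDD[InternalRow]

    println(external.first())
    println(internal.first().getClass)
    spark.stop()
  }
}
```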
12 votes · 1 answer

Apache Spark 2.0: java.lang.UnsupportedOperationException: No Encoder found for java.time.LocalDate

I am using Apache Spark 2.0 and creating a case class for the schema of my Dataset. When I try to define a custom encoder following How to store custom objects in Dataset?, I get the following exception for java.time.LocalDate:…
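A hedged sketch of the usual Spark 2.x workarounds: either register an opaque Kryo encoder for java.time.LocalDate, or model the field as java.sql.Date, which Spark 2.x encodes natively (built-in java.time support only arrived in later releases).

```scala
import java.time.LocalDate

import org.apache.spark.sql.{Encoder, Encoders, SparkSession}

object LocalDateEncoding {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("dates").getOrCreate()
    import spark.implicits._

    // Workaround 1: an opaque Kryo encoder; the column is stored as bytes.
    implicit val localDateEncoder: Encoder[LocalDate] = Encoders.kryo[LocalDate]
    val ds = spark.createDataset(Seq(LocalDate.of(2016, 1, 1)))
    ds.show()

    // Workaround 2: convert to java.sql.Date, which Spark 2.x encodes
    // natively as a proper DateType column.
    val dates = Seq(java.sql.Date.valueOf("2016-01-01")).toDS()
    dates.printSchema()
    spark.stop()
  }
}
```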
11 votes · 2 answers

Why do columns change to nullable in Apache Spark SQL?

Why is nullable = true used after some functions are executed, even though there are no NaN values in the DataFrame?
val myDf = Seq((2,"A"),(2,"B"),(1,"C"))
  .toDF("foo","bar")
  .withColumn("foo",…
Georg Heiler
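A hedged illustration of the behavior being asked about: nullability is a schema-level upper bound, and many expressions (a UDF here) are assumed nullable by the analyzer regardless of the actual data.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf

object NullableFlip {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("nullable").getOrCreate()
    import spark.implicits._

    val myDf = Seq((2, "A"), (2, "B"), (1, "C")).toDF("foo", "bar")
    myDf.printSchema() // foo: integer (nullable = false)

    // The analyzer cannot prove the UDF never returns null, so the
    // derived column is marked nullable = true even though no null
    // value is ever produced at runtime.
    val addOne = udf((i: Int) => i + 1)
    myDf.withColumn("foo", addOne($"foo")).printSchema() // foo: nullable = true
    spark.stop()
  }
}
```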
11 votes · 2 answers

Create Spark Dataset from a CSV file

I would like to create a Spark Dataset from a simple CSV file. Here are the contents of the CSV file:
name,state,number_of_people,coolness_index
trenton,nj,"10","4.5"
bedford,ny,"20","3.3"
patterson,nj,"30","2.2"
camden,nj,"40","8.8"
Here is the…
Powers
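A sketch using the question's own header names (the file path is hypothetical): read with the DataFrame reader, then move to a typed Dataset with as[...].

```scala
import org.apache.spark.sql.SparkSession

// Case class mirroring the CSV header; kept top-level so Spark can
// derive an encoder for it.
case class City(name: String, state: String, number_of_people: Long, coolness_index: Double)

object CsvToDataset {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("csv").getOrCreate()
    import spark.implicits._

    val ds = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("/tmp/cities.csv") // hypothetical path to the file quoted above
      .as[City]

    ds.show()
    spark.stop()
  }
}
```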
10 votes · 2 answers

Spark StringIndexer.fit is very slow on large records

I have large data records formatted as the following sample:
// +---+------+------+
// |cid|itemId|bought|
// +---+------+------+
// |abc|   123|  true|
// |abc|   345|  true|
// |abc|   567|  true|
// |def|   123|  true|
// |def|   345|  true|
// …
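For context, a minimal StringIndexer invocation with the sample's column names: fit() performs a full distinct-value count over the input column to build the frequency-ordered label index, which is why it behaves like a complete aggregation job on large inputs; caching the input at least avoids recomputing it.

```scala
import org.apache.spark.ml.feature.StringIndexer
import org.apache.spark.sql.SparkSession

object IndexerExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("indexer").getOrCreate()
    import spark.implicits._

    val df = Seq(("abc", 123, true), ("abc", 345, true), ("def", 123, true))
      .toDF("cid", "itemId", "bought")
      .cache() // avoid recomputing the input during fit and transform

    // fit() counts every distinct itemId to order the labels by frequency.
    val indexer = new StringIndexer().setInputCol("itemId").setOutputCol("itemIndex")
    val model = indexer.fit(df)
    model.transform(df).show()
    spark.stop()
  }
}
```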
10 votes · 1 answer

Scala Spark: how to use a Dataset for a case class when the schema has snake_case?

I have the following case class:
case class User(userId: String)
and the following schema:
+--------------------+------------------+
|            col_name|         data_type|
+--------------------+------------------+
|             user_id|                 …
Gal
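A hedged sketch of the usual fix: the encoder matches columns by name, so the snake_case column has to be renamed (or aliased in a select) to the case-class field name before calling as[User].

```scala
import org.apache.spark.sql.SparkSession

case class User(userId: String)

object SnakeCaseMapping {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("snake").getOrCreate()
    import spark.implicits._

    val df = Seq("u1", "u2").toDF("user_id") // stands in for the snake_case table

    // Alias snake_case columns to the camelCase field names the encoder
    // expects, then switch to the typed view.
    val users = df.select($"user_id".as("userId")).as[User]
    users.show()
    spark.stop()
  }
}
```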
10 votes · 2 answers

How to drop malformed rows while reading a CSV with a schema in Spark?

I am using a Spark Dataset to load a CSV file, and I prefer to designate the schema explicitly. But I find there are a few rows not compliant with my schema: a column should be double, but some rows hold non-numeric values. Is it possible to filter all rows…
Zhe Hou
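One hedged answer, using a real reader option: mode=DROPMALFORMED makes the CSV reader silently discard rows that do not fit the declared schema (the path and schema below are stand-ins).

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

object DropMalformed {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("csv").getOrCreate()

    val schema = StructType(Seq( // stand-in schema
      StructField("id", StringType, nullable = true),
      StructField("price", DoubleType, nullable = true)))

    // DROPMALFORMED drops any row the parser cannot fit to the schema,
    // e.g. a non-numeric value in the double column.
    val df = spark.read
      .schema(schema)
      .option("header", "true")
      .option("mode", "DROPMALFORMED")
      .csv("/tmp/input.csv") // hypothetical path

    df.show()
    spark.stop()
  }
}
```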
10 votes · 1 answer

Spark 2 Dataset Null value exception

Getting this null error in Spark Dataset.filter. Input CSV:
name,age,stat
abc,22,m
xyz,,s
Working code:
case class Person(name: String, age: Long, stat: String)
val peopleDS = spark.read.option("inferSchema","true")
  .option("header",…
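The usual resolution, sketched with the question's own case class: a primitive Long field cannot hold null, so the empty age cell makes the encoder throw; declaring the field as Option[Long] lets it decode as None and the filter run (the file path is hypothetical).

```scala
import org.apache.spark.sql.SparkSession

// age as Option[Long]: a primitive Long cannot represent the missing
// value in "xyz,,s", but Option decodes it as None.
case class Person(name: String, age: Option[Long], stat: String)

object NullSafePerson {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("null").getOrCreate()
    import spark.implicits._

    val peopleDS = spark.read
      .option("inferSchema", "true")
      .option("header", "true")
      .csv("/tmp/people.csv") // hypothetical path to the CSV quoted above
      .as[Person]

    peopleDS.filter(_.age.exists(_ > 18)).show()
    spark.stop()
  }
}
```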
10 votes · 1 answer

Spark 2.0 implicit encoder, deal with missing column when type is Option[Seq[String]] (scala)

I'm having some trouble encoding data when some columns of type Option[Seq[String]] are missing from our data source. Ideally I would like the missing column data to be filled with None. Scenario: We have some Parquet files that we are…
pigate
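A hedged approach (the column name "tags" and the Record case class are hypothetical): before switching to the typed view, add any missing column as a typed null so the encoder can decode it as None.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.lit
import org.apache.spark.sql.types.{ArrayType, StringType}

case class Record(id: String, tags: Option[Seq[String]]) // hypothetical shape

object MissingOptionColumn {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("opt").getOrCreate()
    import spark.implicits._

    // Stands in for a Parquet file written without the "tags" column.
    val raw: DataFrame = Seq("a", "b").toDF("id")

    // If the column is absent, add it as a typed null; the encoder then
    // reads it back as None rather than failing analysis.
    val patched =
      if (raw.columns.contains("tags")) raw
      else raw.withColumn("tags", lit(null).cast(ArrayType(StringType)))

    patched.as[Record].show()
    spark.stop()
  }
}
```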
10 votes · 3 answers

Convert a Scala list to a DataFrame or Dataset

I am new to Scala. I am trying to convert a Scala list (which holds the results of some calculation on a source DataFrame) to a DataFrame or Dataset. I am not finding any direct method to do that. However, I have tried the following process…
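The direct methods the asker is looking for do exist once spark.implicits are in scope; a minimal sketch (the list contents stand in for the calculated values):

```scala
import org.apache.spark.sql.SparkSession

object ListToDataset {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("list").getOrCreate()
    import spark.implicits._ // brings toDF/toDS into scope for local collections

    val results = List(1.5, 2.5, 3.5) // stands in for the calculated values

    val df = results.toDF("value")        // untyped DataFrame with one column
    val ds = spark.createDataset(results) // typed Dataset[Double]

    df.show()
    ds.show()
    spark.stop()
  }
}
```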