Questions tagged [apache-spark-dataset]

Spark Dataset is a strongly typed collection of objects mapped to a relational schema. It supports optimizations similar to those of Spark DataFrames while providing a type-safe programming interface.


950 questions
14 votes · 2 answers

Encode an ADT / sealed trait hierarchy into a Spark Dataset column

If I want to store an Algebraic Data Type (ADT) (i.e. a Scala sealed trait hierarchy) within a Spark Dataset column, what is the best encoding strategy? For example, if I have an ADT where the leaf types store different kinds of data: sealed trait…
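The accepted answer is not shown on this page, but a minimal sketch of one common workaround is an opaque Kryo binary encoder; the Shape hierarchy below is a hypothetical stand-in for the truncated trait:

```scala
import org.apache.spark.sql.{Encoder, Encoders, SparkSession}

sealed trait Shape // hypothetical ADT standing in for the one in the question
case class Circle(radius: Double) extends Shape
case class Square(side: Double) extends Shape

object AdtEncoding {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("adt").getOrCreate()
    import spark.implicits._

    // Spark has no built-in Encoder for a sealed trait, so a Kryo-based
    // binary encoder is one way out; the column becomes opaque bytes,
    // which means Catalyst cannot inspect or optimize its contents.
    implicit val shapeEncoder: Encoder[Shape] = Encoders.kryo[Shape]

    val shapes = spark.createDataset(Seq[Shape](Circle(1.0), Square(2.0)))
    shapes.map { case Circle(r) => r * r; case Square(s) => s * s }.show()
    spark.stop()
  }
}
```

The trade-off: the typed API still works, but the column is no longer a relational struct, so SQL expressions cannot reach inside it.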
13 votes · 0 answers

Spark - RelationalGroupedDataset vs. KeyValueGroupedDataset? When should I use each of them?

When grouping a Dataset in Spark, there are two methods: groupBy and groupByKey[K]. groupBy returns a RelationalGroupedDataset, while groupByKey[K] returns a KeyValueGroupedDataset. What are the differences between them? Under what circumstances…
CyberPlayerOne
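As a hedged illustration of the distinction (not an answer from the page, which shows none): groupBy stays in the untyped relational world, while groupByKey keeps the element type.

```scala
import org.apache.spark.sql.SparkSession

object GroupingFlavors {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("grouping").getOrCreate()
    import spark.implicits._

    val ds = Seq(("a", 1), ("a", 2), ("b", 3)).toDS()

    // groupBy -> RelationalGroupedDataset: aggregate with Column
    // expressions; the result degrades to an untyped DataFrame.
    val relational = ds.groupBy($"_1").sum("_2")

    // groupByKey -> KeyValueGroupedDataset[K, T]: aggregate with plain
    // Scala functions; the result stays a typed Dataset, at the cost of
    // weaker Catalyst optimization for the opaque lambda.
    val typed = ds.groupByKey(_._1)
      .mapGroups((key, rows) => (key, rows.map(_._2).sum))

    relational.show()
    typed.show()
    spark.stop()
  }
}
```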
12 votes · 3 answers

java.lang.UnsupportedOperationException: Error in Spark when writing

When I try to write the dataset into Parquet files, I get the error below: 18/11/05 06:25:43 ERROR FileFormatWriter: Aborting job null. org.apache.spark.SparkException: Job aborted due to stage failure: Task 84 in stage 1.0 failed 4 times, most recent…
John Humanyun
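The stack trace is truncated, so only a hedged note is possible: this exception often points at a field type the Parquet writer cannot handle rather than at the write itself, so inspecting the resolved schema before a minimal write (output path below is hypothetical) is a reasonable first step:

```scala
import org.apache.spark.sql.SparkSession

object ParquetWriteCheck {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("write").getOrCreate()
    import spark.implicits._

    val ds = Seq(("a", 1), ("b", 2)).toDS()

    // Print the resolved schema first; an unsupported field type usually
    // shows up here before the write fails inside FileFormatWriter.
    ds.printSchema()
    ds.write.mode("overwrite").parquet("/tmp/out") // hypothetical output path
    spark.stop()
  }
}
```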
12 votes · 1 answer

Spark Java: Creating a new Dataset with a given schema

I have this code that works well in Scala:
val schema = StructType(Array(
  StructField("field1", StringType, true),
  StructField("field2", TimestampType, true),
  StructField("field3", DoubleType, true),
  …
Nakeuh
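A sketch of the equivalent construction: the createDataFrame(java.util.List[Row], StructType) overload used below exists in both the Scala and Java APIs (from Java one would build rows with RowFactory.create), so it translates directly; the sample row values are invented.

```scala
import java.sql.Timestamp
import java.util.Arrays

import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types._

object DatasetWithSchema {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("schema").getOrCreate()

    // Schema copied from the question's Scala snippet.
    val schema = StructType(Array(
      StructField("field1", StringType, nullable = true),
      StructField("field2", TimestampType, nullable = true),
      StructField("field3", DoubleType, nullable = true)))

    // Invented sample data matching the schema above.
    val rows = Arrays.asList(
      Row("a", Timestamp.valueOf("2020-01-01 00:00:00"), 1.0),
      Row("b", Timestamp.valueOf("2020-01-02 00:00:00"), 2.0))

    val df = spark.createDataFrame(rows, schema)
    df.show()
    spark.stop()
  }
}
```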
12 votes · 4 answers

How to convert a Spark Dataset of Rows into strings?

I have written code to access a Hive table using Spark SQL. Here is the code:
SparkSession spark = SparkSession
  .builder()
  .appName("Java Spark Hive Example")
  .master("local[*]")
  .config("hive.metastore.uris",…
Jaffer Wilson
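Two hedged options, sketched in Scala (the Java calls are the same): join each Row's values with mkString, or let Spark render every row as a JSON string.

```scala
import org.apache.spark.sql.SparkSession

object RowsToStrings {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("rows").getOrCreate()
    import spark.implicits._

    val df = Seq(("alice", 30), ("bob", 25)).toDF("name", "age") // toy data

    // Option 1: join each Row's values into one delimited string.
    val csvLike: Array[String] = df.collect().map(_.mkString(","))

    // Option 2: one JSON document per row; stays distributed until collect.
    val asJson: Array[String] = df.toJSON.collect()

    csvLike.foreach(println)
    asJson.foreach(println)
    spark.stop()
  }
}
```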
12 votes · 1 answer

Differences between Spark's Row and InternalRow types

Currently Spark has two implementations for Row:
import org.apache.spark.sql.Row
import org.apache.spark.sql.catalyst.InternalRow
What is the need to have both of them? Do they represent the same encoded entities, but one is used internally (internal…
marios
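A small sketch of where each type surfaces may help frame the question: Row is the public, decoded representation, while InternalRow is Catalyst's internal binary one, reachable only through developer APIs such as queryExecution.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SparkSession}

object RowVsInternalRow {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("rows").getOrCreate()
    import spark.implicits._

    val df = Seq((1, "a"), (2, "b")).toDF("id", "name")

    // Public API: each element is decoded into an external Row object.
    val external: RDD[Row] = df.rdd

    // Developer API: Catalyst's internal representation (UnsafeRow binary
    // format), kept undecoded to avoid serialization costs.
    val internal = df.queryExecution.toRdd // RDD[InternalRow]

    println(external.first())
    println(internal.first().getClass)
    spark.stop()
  }
}
```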
12 votes · 1 answer

Apache Spark 2.0: java.lang.UnsupportedOperationException: No Encoder found for java.time.LocalDate

I am using Apache Spark 2.0 and creating a case class for the schema of my Dataset. When I try to define a custom encoder following How to store custom objects in Dataset?, I get the following exception for java.time.LocalDate:…
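A hedged sketch of the usual Spark 2.x workarounds: either register an opaque Kryo encoder for java.time.LocalDate, or model the field as java.sql.Date, which Spark 2.x encodes natively (built-in java.time support only arrived in later releases).

```scala
import java.time.LocalDate

import org.apache.spark.sql.{Encoder, Encoders, SparkSession}

object LocalDateEncoding {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("dates").getOrCreate()
    import spark.implicits._

    // Workaround 1: an opaque Kryo encoder; the column is stored as bytes.
    implicit val localDateEncoder: Encoder[LocalDate] = Encoders.kryo[LocalDate]
    val ds = spark.createDataset(Seq(LocalDate.of(2016, 1, 1)))
    ds.show()

    // Workaround 2: convert to java.sql.Date, which Spark 2.x encodes
    // natively as a proper DateType column.
    val dates = Seq(java.sql.Date.valueOf("2016-01-01")).toDS()
    dates.printSchema()
    spark.stop()
  }
}
```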
11 votes · 2 answers

Why do columns change to nullable in Apache Spark SQL?

Why is nullable = true used after some functions are executed, even though there are no NaN values in the DataFrame?
val myDf = Seq((2,"A"),(2,"B"),(1,"C"))
  .toDF("foo","bar")
  .withColumn("foo",…
Georg Heiler
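A hedged illustration of the behavior being asked about: nullability is a schema-level upper bound, and many expressions (a UDF here) are assumed nullable by the analyzer regardless of the actual data.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf

object NullableFlip {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("nullable").getOrCreate()
    import spark.implicits._

    val myDf = Seq((2, "A"), (2, "B"), (1, "C")).toDF("foo", "bar")
    myDf.printSchema() // foo: integer (nullable = false)

    // The analyzer cannot prove the UDF never returns null, so the
    // derived column is marked nullable = true even though no null
    // value is ever produced at runtime.
    val addOne = udf((i: Int) => i + 1)
    myDf.withColumn("foo", addOne($"foo")).printSchema() // foo: nullable = true
    spark.stop()
  }
}
```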
11 votes · 2 answers

Create Spark Dataset from a CSV file

I would like to create a Spark Dataset from a simple CSV file. Here are the contents of the CSV file:
name,state,number_of_people,coolness_index
trenton,nj,"10","4.5"
bedford,ny,"20","3.3"
patterson,nj,"30","2.2"
camden,nj,"40","8.8"
Here is the…
Powers
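A sketch using the question's own header names (the file path is hypothetical): read with the DataFrame reader, then move to a typed Dataset with as[...].

```scala
import org.apache.spark.sql.SparkSession

// Case class mirroring the CSV header; kept top-level so Spark can
// derive an encoder for it.
case class City(name: String, state: String, number_of_people: Long, coolness_index: Double)

object CsvToDataset {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("csv").getOrCreate()
    import spark.implicits._

    val ds = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("/tmp/cities.csv") // hypothetical path to the file quoted above
      .as[City]

    ds.show()
    spark.stop()
  }
}
```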
10 votes · 2 answers

Spark StringIndexer.fit is very slow on large records

I have large data records formatted as the following sample:
// +---+------+------+
// |cid|itemId|bought|
// +---+------+------+
// |abc|   123|  true|
// |abc|   345|  true|
// |abc|   567|  true|
// |def|   123|  true|
// |def|   345|  true|
// …
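For context, a minimal StringIndexer invocation with the sample's column names: fit() performs a full distinct-value count over the input column to build the frequency-ordered label index, which is why it behaves like a complete aggregation job on large inputs; caching the input at least avoids recomputing it.

```scala
import org.apache.spark.ml.feature.StringIndexer
import org.apache.spark.sql.SparkSession

object IndexerExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("indexer").getOrCreate()
    import spark.implicits._

    val df = Seq(("abc", 123, true), ("abc", 345, true), ("def", 123, true))
      .toDF("cid", "itemId", "bought")
      .cache() // avoid recomputing the input during fit and transform

    // fit() counts every distinct itemId to order the labels by frequency.
    val indexer = new StringIndexer().setInputCol("itemId").setOutputCol("itemIndex")
    val model = indexer.fit(df)
    model.transform(df).show()
    spark.stop()
  }
}
```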
10 votes · 1 answer

Scala Spark: how to use a Dataset for a case class when the schema has snake_case?

I have the following case class:
case class User(userId: String)
and the following schema:
+--------------------+------------------+
|            col_name|         data_type|
+--------------------+------------------+
|             user_id|                 …
Gal
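A hedged sketch of the usual fix: the encoder matches columns by name, so the snake_case column has to be renamed (or aliased in a select) to the case-class field name before calling as[User].

```scala
import org.apache.spark.sql.SparkSession

case class User(userId: String)

object SnakeCaseMapping {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("snake").getOrCreate()
    import spark.implicits._

    val df = Seq("u1", "u2").toDF("user_id") // stands in for the snake_case table

    // Alias snake_case columns to the camelCase field names the encoder
    // expects, then switch to the typed view.
    val users = df.select($"user_id".as("userId")).as[User]
    users.show()
    spark.stop()
  }
}
```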
10 votes · 2 answers

How to drop malformed rows while reading a CSV with a schema in Spark?

I am using a Spark Dataset to load a CSV file, and I prefer to designate the schema explicitly. But I find there are a few rows not compliant with my schema: a column should be double, but some rows hold non-numeric values. Is it possible to filter all rows…
Zhe Hou
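One hedged answer, using a real reader option: mode=DROPMALFORMED makes the CSV reader silently discard rows that do not fit the declared schema (the path and schema below are stand-ins).

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

object DropMalformed {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("csv").getOrCreate()

    val schema = StructType(Seq( // stand-in schema
      StructField("id", StringType, nullable = true),
      StructField("price", DoubleType, nullable = true)))

    // DROPMALFORMED drops any row the parser cannot fit to the schema,
    // e.g. a non-numeric value in the double column.
    val df = spark.read
      .schema(schema)
      .option("header", "true")
      .option("mode", "DROPMALFORMED")
      .csv("/tmp/input.csv") // hypothetical path

    df.show()
    spark.stop()
  }
}
```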
10 votes · 1 answer

Spark 2 Dataset Null value exception

Getting this null error in Spark Dataset.filter. Input CSV:
name,age,stat
abc,22,m
xyz,,s
Working code:
case class Person(name: String, age: Long, stat: String)
val peopleDS = spark.read.option("inferSchema","true")
  .option("header",…
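The usual resolution, sketched with the question's own case class: a primitive Long field cannot hold null, so the empty age cell makes the encoder throw; declaring the field as Option[Long] lets it decode as None and the filter run (the file path is hypothetical).

```scala
import org.apache.spark.sql.SparkSession

// age as Option[Long]: a primitive Long cannot represent the missing
// value in "xyz,,s", but Option decodes it as None.
case class Person(name: String, age: Option[Long], stat: String)

object NullSafePerson {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("null").getOrCreate()
    import spark.implicits._

    val peopleDS = spark.read
      .option("inferSchema", "true")
      .option("header", "true")
      .csv("/tmp/people.csv") // hypothetical path to the CSV quoted above
      .as[Person]

    peopleDS.filter(_.age.exists(_ > 18)).show()
    spark.stop()
  }
}
```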
10 votes · 1 answer

Spark 2.0 implicit encoder, deal with missing column when type is Option[Seq[String]] (scala)

I'm having some trouble encoding data when some columns of type Option[Seq[String]] are missing from our data source. Ideally I would like the missing column data to be filled with None. Scenario: We have some Parquet files that we are…
pigate
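A hedged approach (the column name "tags" and the Record case class are hypothetical): before switching to the typed view, add any missing column as a typed null so the encoder can decode it as None.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.lit
import org.apache.spark.sql.types.{ArrayType, StringType}

case class Record(id: String, tags: Option[Seq[String]]) // hypothetical shape

object MissingOptionColumn {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("opt").getOrCreate()
    import spark.implicits._

    // Stands in for a Parquet file written without the "tags" column.
    val raw: DataFrame = Seq("a", "b").toDF("id")

    // If the column is absent, add it as a typed null; the encoder then
    // reads it back as None rather than failing analysis.
    val patched =
      if (raw.columns.contains("tags")) raw
      else raw.withColumn("tags", lit(null).cast(ArrayType(StringType)))

    patched.as[Record].show()
    spark.stop()
  }
}
```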
10 votes · 3 answers

Convert a Scala list to a DataFrame or Dataset

I am new to Scala. I am trying to convert a Scala list (which holds the results of some calculation on a source DataFrame) to a DataFrame or Dataset. I am not finding any direct method to do that. However, I have tried the following process…
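The direct methods the asker is looking for do exist once spark.implicits are in scope; a minimal sketch (the list contents stand in for the calculated values):

```scala
import org.apache.spark.sql.SparkSession

object ListToDataset {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("list").getOrCreate()
    import spark.implicits._ // brings toDF/toDS into scope for local collections

    val results = List(1.5, 2.5, 3.5) // stands in for the calculated values

    val df = results.toDF("value")        // untyped DataFrame with one column
    val ds = spark.createDataset(results) // typed Dataset[Double]

    df.show()
    ds.show()
    spark.stop()
  }
}
```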