Questions tagged [apache-spark-dataset]

Spark Dataset is a strongly typed collection of objects mapped to a relational schema. It supports optimizations similar to those of Spark DataFrames while also providing a type-safe programming interface.

950 questions
-1
votes
1 answer

Calling Spark SQL inside a map function

In my code I have a requirement where I need to call Spark SQL for each row of a dataset. Spark SQL requires a SparkSession inside the map function, which cannot be passed as a broadcast variable. So, is there any way to call Spark…
A Learner
  • 157
  • 1
  • 5
  • 16
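
A SparkSession only lives on the driver, so it cannot be used (or broadcast) inside map; the usual workaround is to express the per-row query as a join. A minimal sketch, assuming hypothetical events/users tables standing in for the dataset and the table the per-row SQL would query:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("per-row-lookup").getOrCreate()
import spark.implicits._

// Hypothetical data: the rows of the dataset, and the table the per-row SQL would query.
val events = Seq(("u1", 10), ("u2", 20)).toDF("userId", "amount")
val users  = Seq(("u1", "Alice"), ("u2", "Bob")).toDF("userId", "name")

// SparkSession is not serializable and is unavailable on executors, so instead of
// running SQL inside map, the lookup is expressed as a join that Spark can plan itself.
val enriched = events.join(users, Seq("userId"), "left")
enriched.show()
```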
-1
votes
1 answer

DataFrame and DataSet - converting values to pair

Sample input (black coloured text) and output (red coloured text). I have a DataFrame (the one in black); how can I transform it into the one in red, as (column number, value)? [Image is attached] val df = spark.read.format("csv").option("inferSchema",…
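
One way to read this question is turning every cell into a (column number, value) pair; a sketch of that transformation using posexplode over an array of all columns (the input DataFrame here is a hypothetical stand-in for the CSV):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// Hypothetical stand-in for the CSV read shown in the question.
val df = Seq((1, "a", 3.0), (2, "b", 4.0)).toDF("c1", "c2", "c3")

// Cast every column to string, pack them into an array, and posexplode the array:
// each value comes back with its position, i.e. a (column number, value) pair per cell.
val asStrings = df.columns.map(c => col(c).cast("string"))
val pairs = df.select(posexplode(array(asStrings: _*)).as(Seq("column_number", "value")))
pairs.show()
```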
-1
votes
1 answer

convert integer into date to count number of days

I need to convert an Integer to date (yyyy-MM-dd) format to calculate the number of days. registryDate 20130826 20130829 20130816 20130925 20130930 20130926 Desired output: registryDate TodaysDate DaysInBetween …
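
Since the integers are in yyyyMMdd form, the usual approach is to cast to string, parse with to_date, and use datediff; a small sketch using the column names from the question (TodaysDate and DaysInBetween are taken from the desired output):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

val df = Seq(20130826, 20130829, 20130816).toDF("registryDate")

// Parse the integer as a yyyyMMdd date, then count the days up to today.
val withDays = df
  .withColumn("registryDate", to_date(col("registryDate").cast("string"), "yyyyMMdd"))
  .withColumn("TodaysDate", current_date())
  .withColumn("DaysInBetween", datediff(col("TodaysDate"), col("registryDate")))

withDays.show()
```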
-1
votes
1 answer

How to create dataset from stored (variable or parameter) Seq

I have a function like: def createDataset[T](seq:Seq[T]): Dataset[T] = { import spark.implicits._ seq.toDS() } This does not compile; it cannot find the toDS function. It also doesn't work this way: def createDataset[T](t:T): Dataset[T]…
Pau Trepat
  • 697
  • 1
  • 6
  • 24
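
toDS() needs an implicit Encoder[T], and with an unconstrained type parameter the compiler has nothing to summon; the usual fix is a context bound so the caller supplies the encoder. A sketch, assuming an existing SparkSession named spark:

```scala
import org.apache.spark.sql.{Dataset, Encoder, SparkSession}

val spark = SparkSession.builder().master("local[*]").getOrCreate()

// The Encoder[T] context bound makes the caller provide the encoder that toDS()
// would otherwise fail to find for a generic T.
def createDataset[T : Encoder](seq: Seq[T]): Dataset[T] =
  spark.createDataset(seq)

import spark.implicits._          // encoders for common types (Int, String, case classes, ...)
val ds = createDataset(Seq(1, 2, 3))
ds.show()
```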
-1
votes
3 answers

Avoid specifying schema twice (Spark/Scala)

I need to iterate over a data frame in a specific order and apply some complex logic to calculate a new column. My strong preference is to do it in a generic way so I do not have to list all columns of a row and do df.as[my_record] or case Row(...) =>…
Dr Y Wit
  • 2,000
  • 9
  • 16
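
One generic way to avoid writing the schema twice is to stay with untyped Rows and reuse df.schema, extending it only with the new column; a sketch with a hypothetical input and a stand-in for the complex logic:

```scala
import org.apache.spark.sql.types.IntegerType
import org.apache.spark.sql.{Row, SparkSession}

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

val df = Seq(("a", 1), ("b", 2)).toDF("key", "value")   // hypothetical input

// Reuse the existing schema and only append the derived column, so the
// column list is never spelled out by hand.
val newSchema = df.schema.add("derived", IntegerType)
val newRows = df.rdd.map { row =>
  val derived = row.getAs[Int]("value") * 10            // stand-in for the complex logic
  Row.fromSeq(row.toSeq :+ derived)
}
val result = spark.createDataFrame(newRows, newSchema)
result.show()
```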
-1
votes
1 answer

Spark Dataset - Average function

I'm using Spark with Scala and trying to find the best way to group a Dataset by key and get the average and sum together. For example, I have a Dataset[Player], and Player consists of: playerId, yearSignup, level, points. I want to group this dataset…
Ben Haim Shani
  • 265
  • 4
  • 15
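
Both aggregates can be computed in one pass with groupBy/agg; a sketch using the Player fields listed in the question (grouping by yearSignup is an assumption about the intended key):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

case class Player(playerId: String, yearSignup: Int, level: Int, points: Long)

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

val players = Seq(
  Player("p1", 2016, 3, 100L),
  Player("p2", 2016, 5, 250L),
  Player("p3", 2017, 2, 80L)
).toDS()

// One pass over the data computes both the average and the sum per group.
val stats = players
  .groupBy($"yearSignup")
  .agg(avg($"level").as("avgLevel"), sum($"points").as("totalPoints"))

stats.show()
```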
-1
votes
1 answer

Append fields to JSON dataset Java-Spark

I'm using Java-Spark to load JSON into a Dataset as follows: Dataset<Row> df = spark.read().json(jsonFile); Let's say that my JSON looks like: { "field1": { "key1":"value1" } } Now I want to add a new field to make my JSON to…
Ya Ko
  • 509
  • 2
  • 4
  • 19
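
Adding a field is typically done with withColumn rather than editing the JSON text; the question uses the Java API, but here is a Scala sketch of the same idea (field2/key2 are hypothetical names):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// Stand-in for spark.read().json(jsonFile) from the question.
val df = spark.read.json(Seq("""{"field1":{"key1":"value1"}}""").toDS())

// withColumn appends a new top-level field; struct()/lit() build nested objects.
val withField = df.withColumn("field2", struct(lit("value2").as("key2")))
withField.toJSON.show(truncate = false)
```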
-1
votes
1 answer

Converting from org.apache.spark.sql.Dataset to CoordinateMatrix

I have a Spark SQL dataset whose schema is defined as follows: User_id | Item_id | Bought_Status. I would like to convert this to a sparse matrix to apply recommender system algorithms. This is a very large RDD dataset, so I…
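
CoordinateMatrix is built from an RDD[MatrixEntry], so each row of the dataset has to be mapped to one entry; a sketch that assumes the ids are already numeric (string ids would first need to be indexed, e.g. with StringIndexer):

```scala
import org.apache.spark.mllib.linalg.distributed.{CoordinateMatrix, MatrixEntry}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// Hypothetical stand-in for the User_id | Item_id | Bought_Status dataset.
val ratings = Seq((1L, 10L, 1.0), (2L, 11L, 0.0)).toDF("User_id", "Item_id", "Bought_Status")

// Each row becomes one sparse (row, column, value) entry of the matrix.
val entries = ratings.rdd.map { row =>
  MatrixEntry(row.getAs[Long]("User_id"), row.getAs[Long]("Item_id"), row.getAs[Double]("Bought_Status"))
}
val matrix = new CoordinateMatrix(entries)
println(s"${matrix.numRows()} x ${matrix.numCols()}")
```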
-1
votes
1 answer

Apache Spark Java - how to iterate through row dataset and remove null fields

I'm trying to build a Spark application which reads data from a Hive table, and the output will be written as JSON. In the code below, I have to iterate through the row dataset and remove the null fields before output. I'm expecting my output like, please…
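
Spark's JSON writer omits null-valued fields from each record by default, so no per-row iteration is needed; a small Scala sketch of that behaviour (the question is in Java, and the input data and output path here are hypothetical):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// Hypothetical rows standing in for the Hive table; the second record has a null score.
val df = Seq(("a", Some(1)), ("b", None: Option[Int])).toDF("name", "score")

// The JSON datasource drops null fields per record by default, so the null
// simply does not appear in the written output.
df.write.mode("overwrite").json("/tmp/output_json")   // hypothetical output path
```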
-1
votes
1 answer

Getting NullPointerException while performing operations on a DataFrame in Spark

I am using the following code to create a DataFrame from an RDD. I am able to perform operations on the RDD, and the RDD is not empty. I tried the following two approaches, and with both I get the same exception. Approach 1: Build dataset using…
-1
votes
1 answer

How to merge Spark rows

Hi, I have a Dataset of Track.class and I want to merge all tracks that fall within the same interval of time, for example 5 min, i.e. any track that starts within 5 min after another track ends belongs to the same track. It looks like a fusion task. My input…
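
Merging rows by a time gap is essentially sessionization: flag a new group whenever the gap to the previous end exceeds 5 minutes, take a running sum of those flags as a group id, then aggregate. A sketch with hypothetical (id, start, end) tracks in epoch seconds:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// Hypothetical Track rows as (id, start, end) in epoch seconds.
val tracks = Seq(
  ("t1", 0L, 100L),
  ("t2", 200L, 300L),    // starts 100 s after t1 ends -> merged with t1
  ("t3", 1000L, 1100L)   // starts more than 5 min after t2 ends -> new track
).toDF("id", "start", "end")

val w = Window.orderBy($"start")

// A new group begins whenever the gap since the previous track's end exceeds 300 s;
// the running sum of those flags gives a group id to aggregate on.
val merged = tracks
  .withColumn("prevEnd", lag($"end", 1).over(w))
  .withColumn("newGroup", when($"prevEnd".isNull || $"start" - $"prevEnd" > 300, 1).otherwise(0))
  .withColumn("groupId", sum($"newGroup").over(w))
  .groupBy($"groupId")
  .agg(min($"start").as("start"), max($"end").as("end"))

merged.show()
```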
-1
votes
1 answer

Join spark dataset with complex condition

Consider a bean as follows: class Bean { String id; String joinColumn; } I have two datasets of this Bean and need to join them on joinColumn, but the join condition is not an equality. I need to have logic that compares joinColumn for…
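
joinWith (and join) accept an arbitrary Column expression, so the condition does not have to be an equality; a sketch mirroring the Bean from the question, with a hypothetical "starts with" comparison standing in for the real non-equi condition:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

case class Bean(id: String, joinColumn: String)   // mirrors the bean in the question

val left  = Seq(Bean("1", "abc"), Bean("2", "xy")).toDS()
val right = Seq(Bean("a", "abcdef"), Bean("b", "zzz")).toDS()

// Any boolean Column works as the join condition; here the right joinColumn
// must start with the left one (a hypothetical stand-in for the real comparison).
val joined = left.joinWith(right, right("joinColumn").startsWith(left("joinColumn")), "inner")
joined.show(truncate = false)
```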
-1
votes
1 answer

How to remove/filter element from WrappedArray column

I'm facing an issue with manipulating a WrappedArray column. I want to remove/filter elements from the WrappedArray column in a Spark dataset. The WrappedArray contains objects; for example, I have a dataset containing the following…
Alex
  • 57
  • 1
  • 5
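
From Spark 2.4 on, elements of an array column can be dropped with the built-in higher-order filter function, without a UDF; a sketch with a hypothetical Item struct inside the array and a hypothetical predicate:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

case class Item(name: String, qty: Int)

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// Hypothetical dataset: each row carries an array (WrappedArray) of Item structs.
val df = Seq(
  (1, Seq(Item("a", 10), Item("b", 0))),
  (2, Seq(Item("c", 5)))
).toDF("id", "items")

// The higher-order filter (Spark 2.4+) removes elements in place; here items with qty = 0.
val cleaned = df.withColumn("items", expr("filter(items, x -> x.qty > 0)"))
cleaned.show(truncate = false)
```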
-1
votes
1 answer

Do I have to explicitly use Dataframe's methods to take advantage of Dataset's optimization?

To take advantage of Dataset's optimizations, do I have to explicitly use DataFrame's methods (e.g. df.select(col("name"), col("age")), etc.), or would calling any of Dataset's methods - even RDD-like methods (e.g. filter, map, etc.) - also allow for…
Glide
  • 20,235
  • 26
  • 86
  • 135
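
The difference is visible in the query plans: column-expression arguments are Catalyst expressions the optimizer can reason about, while lambdas passed to the typed/RDD-like methods are opaque functions. A small sketch comparing the two (Person is a hypothetical case class):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

case class Person(name: String, age: Int)

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

val ds = Seq(Person("Ann", 30), Person("Bob", 17)).toDS()

// Column-based predicate: Catalyst sees the expression and can optimize it
// (predicate pushdown, column pruning, ...).
ds.filter(col("age") > 21).explain()

// Lambda-based predicate: Catalyst only sees an opaque Scala function, so it must
// deserialize each Person and invoke the closure, losing those optimizations.
ds.filter(_.age > 21).explain()
```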
-1
votes
1 answer

Apache Spark Performance Issue

We thought of using Apache Spark to match records faster, but we are finding it far less efficient than SQL matching using a select statement. Using JavaSparkContext javaSparkContext = new JavaSparkContext(new…
Nischay
  • 168
  • 2
  • 14