Questions tagged [apache-spark-dataset]

Spark Dataset is a strongly typed collection of objects mapped to a relational schema. It supports optimizations similar to those of Spark DataFrames while also providing a type-safe programming interface.

950 questions
-1
votes
1 answer

Calling Spark SQL inside a map function

In my code I have a requirement where I need to call Spark SQL for each row of a dataset. Spark SQL requires a SparkSession inside the map function, which cannot be passed as a broadcast variable. So, is there any way to call Spark…
A Learner
  • 157
  • 1
  • 5
  • 16
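
A SparkSession only lives on the driver, so it cannot be used (or broadcast) inside map; the usual workaround is to express the per-row query as a join. A minimal sketch, assuming hypothetical events/users tables standing in for the dataset and the table the per-row SQL would query:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("per-row-lookup").getOrCreate()
import spark.implicits._

// Hypothetical data: the rows of the dataset, and the table the per-row SQL would query.
val events = Seq(("u1", 10), ("u2", 20)).toDF("userId", "amount")
val users  = Seq(("u1", "Alice"), ("u2", "Bob")).toDF("userId", "name")

// SparkSession is not serializable and is unavailable on executors, so instead of
// running SQL inside map, the lookup is expressed as a join that Spark can plan itself.
val enriched = events.join(users, Seq("userId"), "left")
enriched.show()
```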
-1
votes
1 answer

DataFrame and DataSet - converting values to pair

Sample input (black coloured text) and output (red coloured text). I have a DataFrame (the one in black); how can I transform it into the one in red, as (column number, value)? [Image is attached] val df = spark.read.format("csv").option("inferSchema",…
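
One way to read this question is turning every cell into a (column number, value) pair; a sketch of that transformation using posexplode over an array of all columns (the input DataFrame here is a hypothetical stand-in for the CSV):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// Hypothetical stand-in for the CSV read shown in the question.
val df = Seq((1, "a", 3.0), (2, "b", 4.0)).toDF("c1", "c2", "c3")

// Cast every column to string, pack them into an array, and posexplode the array:
// each value comes back with its position, i.e. a (column number, value) pair per cell.
val asStrings = df.columns.map(c => col(c).cast("string"))
val pairs = df.select(posexplode(array(asStrings: _*)).as(Seq("column_number", "value")))
pairs.show()
```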
-1
votes
1 answer

convert integer into date to count number of days

I need to convert an Integer to date (yyyy-MM-dd) format to calculate the number of days. registryDate 20130826 20130829 20130816 20130925 20130930 20130926 Desired output: registryDate TodaysDate DaysInBetween …
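
Since the integers are in yyyyMMdd form, the usual approach is to cast to string, parse with to_date, and use datediff; a small sketch using the column names from the question (TodaysDate and DaysInBetween are taken from the desired output):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

val df = Seq(20130826, 20130829, 20130816).toDF("registryDate")

// Parse the integer as a yyyyMMdd date, then count the days up to today.
val withDays = df
  .withColumn("registryDate", to_date(col("registryDate").cast("string"), "yyyyMMdd"))
  .withColumn("TodaysDate", current_date())
  .withColumn("DaysInBetween", datediff(col("TodaysDate"), col("registryDate")))

withDays.show()
```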
-1
votes
1 answer

How to create dataset from stored (variable or parameter) Seq

I have a function like: def createDataset[T](seq:Seq[T]): Dataset[T] = { import spark.implicits._ seq.toDS() } This does not compile; it cannot find the toDS function. It also doesn't work this way: def createDataset[T](t:T): Dataset[T]…
Pau Trepat
  • 697
  • 1
  • 6
  • 24
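
toDS() needs an implicit Encoder[T], and with an unconstrained type parameter the compiler has nothing to summon; the usual fix is a context bound so the caller supplies the encoder. A sketch, assuming an existing SparkSession named spark:

```scala
import org.apache.spark.sql.{Dataset, Encoder, SparkSession}

val spark = SparkSession.builder().master("local[*]").getOrCreate()

// The Encoder[T] context bound makes the caller provide the encoder that toDS()
// would otherwise fail to find for a generic T.
def createDataset[T : Encoder](seq: Seq[T]): Dataset[T] =
  spark.createDataset(seq)

import spark.implicits._          // encoders for common types (Int, String, case classes, ...)
val ds = createDataset(Seq(1, 2, 3))
ds.show()
```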
-1
votes
3 answers

Avoid specifying schema twice (Spark/Scala)

I need to iterate over a data frame in a specific order and apply some complex logic to calculate a new column. My strong preference is to do it in a generic way so I do not have to list all columns of a row and do df.as[my_record] or case Row(...) =>…
Dr Y Wit
  • 2,000
  • 9
  • 16
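
One generic way to avoid writing the schema twice is to stay with untyped Rows and reuse df.schema, extending it only with the new column; a sketch with a hypothetical input and a stand-in for the complex logic:

```scala
import org.apache.spark.sql.types.IntegerType
import org.apache.spark.sql.{Row, SparkSession}

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

val df = Seq(("a", 1), ("b", 2)).toDF("key", "value")   // hypothetical input

// Reuse the existing schema and only append the derived column, so the
// column list is never spelled out by hand.
val newSchema = df.schema.add("derived", IntegerType)
val newRows = df.rdd.map { row =>
  val derived = row.getAs[Int]("value") * 10            // stand-in for the complex logic
  Row.fromSeq(row.toSeq :+ derived)
}
val result = spark.createDataFrame(newRows, newSchema)
result.show()
```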
-1
votes
1 answer

Spark Dataset - Average function

I'm using Spark with Scala and trying to find the best way to group a Dataset by key and get the average and sum together. For example, I have a Dataset[Player], and Player consists of: playerId, yearSignup, level, points. I want to group this dataset…
Ben Haim Shani
  • 265
  • 4
  • 15
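
Both aggregates can be computed in one pass with groupBy/agg; a sketch using the Player fields listed in the question (grouping by yearSignup is an assumption about the intended key):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

case class Player(playerId: String, yearSignup: Int, level: Int, points: Long)

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

val players = Seq(
  Player("p1", 2016, 3, 100L),
  Player("p2", 2016, 5, 250L),
  Player("p3", 2017, 2, 80L)
).toDS()

// One pass over the data computes both the average and the sum per group.
val stats = players
  .groupBy($"yearSignup")
  .agg(avg($"level").as("avgLevel"), sum($"points").as("totalPoints"))

stats.show()
```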
-1
votes
1 answer

Append fields to JSON dataset Java-Spark

I'm using Java-Spark to load JSON into a Dataset as follows: Dataset<Row> df = spark.read().json(jsonFile); Let's say that my JSON looks like: { "field1": { "key1":"value1" } } Now I want to add a new field to make my JSON to…
Ya Ko
  • 509
  • 2
  • 4
  • 19
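
Adding a field is typically done with withColumn rather than editing the JSON text; the question uses the Java API, but here is a Scala sketch of the same idea (field2/key2 are hypothetical names):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// Stand-in for spark.read().json(jsonFile) from the question.
val df = spark.read.json(Seq("""{"field1":{"key1":"value1"}}""").toDS())

// withColumn appends a new top-level field; struct()/lit() build nested objects.
val withField = df.withColumn("field2", struct(lit("value2").as("key2")))
withField.toJSON.show(truncate = false)
```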
-1
votes
1 answer

Converting from org.apache.spark.sql.Dataset to CoordinateMatrix

I have a Spark SQL dataset whose schema is defined as follows: User_id | Item_id | Bought_Status. I would like to convert this to a sparse matrix to apply recommender system algorithms. This is a very large RDD dataset, so I…
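
CoordinateMatrix is built from an RDD[MatrixEntry], so each row of the dataset has to be mapped to one entry; a sketch that assumes the ids are already numeric (string ids would first need to be indexed, e.g. with StringIndexer):

```scala
import org.apache.spark.mllib.linalg.distributed.{CoordinateMatrix, MatrixEntry}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// Hypothetical stand-in for the User_id | Item_id | Bought_Status dataset.
val ratings = Seq((1L, 10L, 1.0), (2L, 11L, 0.0)).toDF("User_id", "Item_id", "Bought_Status")

// Each row becomes one sparse (row, column, value) entry of the matrix.
val entries = ratings.rdd.map { row =>
  MatrixEntry(row.getAs[Long]("User_id"), row.getAs[Long]("Item_id"), row.getAs[Double]("Bought_Status"))
}
val matrix = new CoordinateMatrix(entries)
println(s"${matrix.numRows()} x ${matrix.numCols()}")
```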
-1
votes
1 answer

Apache Spark Java - how to iterate through row dataset and remove null fields

I'm trying to build a Spark application which reads data from a Hive table, and the output will be written as JSON. In the code below, I have to iterate through the row dataset and remove the null fields before output. I'm expecting my output like, please…
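
Spark's JSON writer omits null-valued fields from each record by default, so no per-row iteration is needed; a small Scala sketch of that behaviour (the question is in Java, and the input data and output path here are hypothetical):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// Hypothetical rows standing in for the Hive table; the second record has a null score.
val df = Seq(("a", Some(1)), ("b", None: Option[Int])).toDF("name", "score")

// The JSON datasource drops null fields per record by default, so the null
// simply does not appear in the written output.
df.write.mode("overwrite").json("/tmp/output_json")   // hypothetical output path
```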
-1
votes
1 answer

Getting NullPointerException while performing operations on a DataFrame in Spark

I am using the following code to create a DataFrame from an RDD. I am able to perform operations on the RDD, and the RDD is not empty. I tried the following two approaches, and with both I get the same exception. Approach 1: Build dataset using…
-1
votes
1 answer

How to merge Spark rows

Hi, I have a Dataset of Track.class and I want to merge all tracks that fall within the same interval of time, for example 5 min, i.e. any track that starts within 5 min after another track ends belongs to the same track. It looks like a fusion task. My input…
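
Merging rows by a time gap is essentially sessionization: flag a new group whenever the gap to the previous end exceeds 5 minutes, take a running sum of those flags as a group id, then aggregate. A sketch with hypothetical (id, start, end) tracks in epoch seconds:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// Hypothetical Track rows as (id, start, end) in epoch seconds.
val tracks = Seq(
  ("t1", 0L, 100L),
  ("t2", 200L, 300L),    // starts 100 s after t1 ends -> merged with t1
  ("t3", 1000L, 1100L)   // starts more than 5 min after t2 ends -> new track
).toDF("id", "start", "end")

val w = Window.orderBy($"start")

// A new group begins whenever the gap since the previous track's end exceeds 300 s;
// the running sum of those flags gives a group id to aggregate on.
val merged = tracks
  .withColumn("prevEnd", lag($"end", 1).over(w))
  .withColumn("newGroup", when($"prevEnd".isNull || $"start" - $"prevEnd" > 300, 1).otherwise(0))
  .withColumn("groupId", sum($"newGroup").over(w))
  .groupBy($"groupId")
  .agg(min($"start").as("start"), max($"end").as("end"))

merged.show()
```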
-1
votes
1 answer

Join spark dataset with complex condition

Consider a bean as follows: class Bean { String id; String joinColumn; } I have two datasets of this Bean and need to join them on joinColumn, but the join condition is not an equality. I need to have logic that compares joinColumn for…
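
joinWith (and join) accept an arbitrary Column expression, so the condition does not have to be an equality; a sketch mirroring the Bean from the question, with a hypothetical "starts with" comparison standing in for the real non-equi condition:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

case class Bean(id: String, joinColumn: String)   // mirrors the bean in the question

val left  = Seq(Bean("1", "abc"), Bean("2", "xy")).toDS()
val right = Seq(Bean("a", "abcdef"), Bean("b", "zzz")).toDS()

// Any boolean Column works as the join condition; here the right joinColumn
// must start with the left one (a hypothetical stand-in for the real comparison).
val joined = left.joinWith(right, right("joinColumn").startsWith(left("joinColumn")), "inner")
joined.show(truncate = false)
```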
-1
votes
1 answer

How to remove/filter element from WrappedArray column

I'm facing an issue with manipulating a WrappedArray column. I want to remove/filter elements from the WrappedArray column in a Spark dataset. The WrappedArray contains objects; for example, I have a dataset containing the following…
Alex
  • 57
  • 1
  • 5
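
From Spark 2.4 on, elements of an array column can be dropped with the built-in higher-order filter function, without a UDF; a sketch with a hypothetical Item struct inside the array and a hypothetical predicate:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

case class Item(name: String, qty: Int)

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// Hypothetical dataset: each row carries an array (WrappedArray) of Item structs.
val df = Seq(
  (1, Seq(Item("a", 10), Item("b", 0))),
  (2, Seq(Item("c", 5)))
).toDF("id", "items")

// The higher-order filter (Spark 2.4+) removes elements in place; here items with qty = 0.
val cleaned = df.withColumn("items", expr("filter(items, x -> x.qty > 0)"))
cleaned.show(truncate = false)
```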
-1
votes
1 answer

Do I have to explicitly use Dataframe's methods to take advantage of Dataset's optimization?

To take advantage of Dataset's optimizations, do I have to explicitly use DataFrame's methods (e.g. df.select(col("name"), col("age")), etc.), or would calling any of Dataset's methods - even RDD-like methods (e.g. filter, map, etc.) - also allow for…
Glide
  • 20,235
  • 26
  • 86
  • 135
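
The difference is visible in the query plans: column-expression arguments are Catalyst expressions the optimizer can reason about, while lambdas passed to the typed/RDD-like methods are opaque functions. A small sketch comparing the two (Person is a hypothetical case class):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

case class Person(name: String, age: Int)

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

val ds = Seq(Person("Ann", 30), Person("Bob", 17)).toDS()

// Column-based predicate: Catalyst sees the expression and can optimize it
// (predicate pushdown, column pruning, ...).
ds.filter(col("age") > 21).explain()

// Lambda-based predicate: Catalyst only sees an opaque Scala function, so it must
// deserialize each Person and invoke the closure, losing those optimizations.
ds.filter(_.age > 21).explain()
```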
-1
votes
1 answer

Apache Spark Performance Issue

We thought of using Apache Spark to match records faster, but we are finding it far less efficient than SQL matching using a select statement. Using JavaSparkContext javaSparkContext = new JavaSparkContext(new…
Nischay
  • 168
  • 2
  • 14