Questions tagged [scala-spark]
49 questions
0
votes
0 answers
AWS Glue Scala Spark job failing - org.apache.spark.util.collection.CompactBuffer[] not registered in Kryo
The code segment below is failing, according to the Spark UI history server:
segmentIdToTripIdsRDD.join(segmentIdToRSMSegmentRDD)
.map(tuple => {
val tripIds: Iterable[String] = tuple._2._1._1
…

Aki008
- 405
- 2
- 6
- 19
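A sketch of the two usual ways out of this Kryo error, assuming the Glue job builds its own SparkConf. CompactBuffer is private[spark], so it has to be registered by name:

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .registerKryoClasses(Array(
    Class.forName("org.apache.spark.util.collection.CompactBuffer"),
    // "CompactBuffer[]" in the error message is the array class
    Class.forName("[Lorg.apache.spark.util.collection.CompactBuffer;")
  ))
// Alternatively, stop requiring registration altogether:
// conf.set("spark.kryo.registrationRequired", "false")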
0
votes
1 answer
Spark: extract values from a JSON struct
I have a Spark DataFrame column (custHeader) in the format below, and I want to extract the value of the key phone into a separate column. I am trying to use the from_json function, but it gives me a null value.
valArr:array
element:struct
…

marc
- 319
- 1
- 5
- 20
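from_json returns null whenever the supplied schema does not match the JSON, so the schema has to mirror the valArr array of structs. A minimal sketch, assuming a hypothetical key/value layout for the struct elements:

import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

val headerSchema = ArrayType(StructType(Seq(
  StructField("key", StringType),
  StructField("value", StringType)
)))

val parsed = df
  .withColumn("valArr", from_json(col("custHeader"), headerSchema))
  // keep only the element whose key is "phone", then take its value
  .withColumn("phone",
    element_at(expr("filter(valArr, x -> x.key = 'phone')"), 1)("value"))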
0
votes
1 answer
Spark broadcasts the right dataset of a left join, which causes org.apache.spark.sql.execution.OutOfMemorySparkException
Spark broadcasts the right dataset of a left join, which causes org.apache.spark.sql.execution.OutOfMemorySparkException: Size of broadcasted table far exceeds estimates and exceeds limit of spark.driver.maxResultSize, even though I used settings to disable…

alsetr
- 13
- 3
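For reference, the knobs this usually comes down to, sketched for a left outer join of two hypothetical datasets left and right. Note that on Spark 3.2+ adaptive execution has its own threshold that must be disabled separately:

// size-based broadcasting off for both the planner and AQE
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")
spark.conf.set("spark.sql.adaptive.autoBroadcastJoinThreshold", "-1")

// or force a shuffle-based strategy for this one join
val joined = left.hint("merge").join(right, Seq("key"), "left")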
0
votes
0 answers
Spark ColumnarBatches and storing them in an InMemoryRelation for fast queries in Spark Scala
I have been trying to implement an InMemoryRelation based on Spark ColumnarBatches; so far I have not been able to store the vectorised ColumnarBatch in the relation. Is there a way to achieve this without going through an intermediary representation…
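InMemoryRelation is internal API (org.apache.spark.sql.execution.columnar), with no supported way to feed it user-built ColumnarBatches; the supported route into Spark's columnar cache is persisting a DataFrame. A sketch of that route, which at least makes the relation visible in the plan:

import org.apache.spark.storage.StorageLevel

val cached = df.persist(StorageLevel.MEMORY_ONLY)
cached.count() // materialise the cache
// the optimized plan now contains an InMemoryRelation node
println(cached.queryExecution.optimizedPlan)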
0
votes
3 answers
Convert Vector[String] to DataFrame in Scala Spark
I have this Vector[String]:
user_uid,score,value
255938,34096,8
259117,34599,10
253664,28891,7
How can I convert it to a DataFrame?
I already tried this:
val dataInVectorRow = dataInVectorString
.map(_.split("\\s+"))
.map(x =>…

AT181903
- 11
- 4
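The rows shown are comma-separated, so splitting on "\\s+" yields one un-split string per line. A runnable sketch, assuming a SparkSession named spark:

import spark.implicits._

val dataInVectorString = Vector(
  "user_uid,score,value",
  "255938,34096,8",
  "259117,34599,10",
  "253664,28891,7")

val header = dataInVectorString.head.split(",")
val df = dataInVectorString.tail
  .map(_.split(","))                                  // split on commas
  .map { case Array(uid, score, value) => (uid, score, value) }
  .toDF(header: _*)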
0
votes
1 answer
Spark Scala: exploding a struct array throws an "ambiguous reference to fields" error
Currently I'm working on exploding a struct array in which pairs of keys are the same:
{
  "A": [{
    "AA": {
      "AB": "21",
      "AC": "R",
      "AD": "20222832522117601",
      "AE": "2",
      "AF": {
        …

instancedeveloper
- 11
- 3
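"Ambiguous reference to fields" usually means the struct carries field names that differ only in case, which collide under Spark's default case-insensitive resolution. One common fix, sketched against the fields visible above:

import org.apache.spark.sql.functions._

spark.conf.set("spark.sql.caseSensitive", "true") // fields now resolve case-sensitively

val exploded = df
  .select(explode(col("A")).as("a"))
  .select(col("a.AA.AB"), col("a.AA.AC"), col("a.AA.AD"))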
0
votes
0 answers
Get Error Records from deequ VerificationSuite
When we run a deequ VerificationSuite, can we see the failing input records for each rule when a rule reports an error? For example: if rule1 failed for 10 records out of a total of 100 records, I only see a summary which says this…

PythonDeveloper
- 289
- 1
- 4
- 24
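The summary deequ returns here is aggregate-only; a common workaround is to re-apply the rule's predicate to the input to recover the offending rows. A sketch with a hypothetical completeness rule on a column phone:

import com.amazon.deequ.{VerificationResult, VerificationSuite}
import com.amazon.deequ.checks.{Check, CheckLevel}
import org.apache.spark.sql.functions.col

val result = VerificationSuite()
  .onData(inputDf)
  .addCheck(Check(CheckLevel.Error, "rule1").isComplete("phone"))
  .run()

// aggregate summary only: rule name, status, constraint message
VerificationResult.checkResultsAsDataFrame(spark, result).show(false)

// the failing rows themselves: negate the rule's predicate by hand
val badRecords = inputDf.filter(col("phone").isNull)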
0
votes
1 answer
Save a DataFrame with a records limit, but also make sure the same value is not spread across multiple files
Suppose I have this DataFrame:

id | value
---+------
A  | 1
A  | 2
A  | 3
B  | 1
B  | 2
C  | 1
D  | 1
D  | 2

and so on. Basically, I want to make sure that even with a records limit, any given id can only appear in one single file (suppose the number of entries with that…

ForkPork
- 37
- 4
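One sketch of the usual trade-off: hash-partitioning on id pins every row of a given id to one task, and hence one output file, but it forgoes maxRecordsPerFile, which would split an id across files. The partition count and path are arbitrary assumptions:

import org.apache.spark.sql.functions.col

df.repartition(200, col("id"))
  .write
  .mode("overwrite")
  .parquet("/tmp/output") // hypothetical path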
0
votes
1 answer
Why is the behavior different when mixed case is used vs. the same case in Spark 3.2?
I am running a simple query in Spark 3.2:
val df1 = sc.parallelize(List((1,2,3,4,5),(1,2,3,4,5))).toDF("id","col2","col3","col4", "col5")
val op_cols_same_case = List("id","col2","col3","col4", "col5", "id")
val df2 =…

ASR
- 53
- 6
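The knob that governs this is spark.sql.caseSensitive (false by default), under which "id" and "ID" resolve to the same column of df1. A sketch to observe the difference:

import org.apache.spark.sql.functions.col

println(spark.conf.get("spark.sql.caseSensitive")) // false by default

val op_cols_mixed_case = List("id", "col2", "col3", "col4", "col5", "ID")
// under case-insensitive resolution, "id" and "ID" hit the same column
val df2 = df1.select(op_cols_mixed_case.map(col): _*)
df2.printSchema()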
0
votes
1 answer
Issues running Graph queries after upgrading Spark 2.4.3 to 3.1.3
We are upgrading our Scala Spark stack:
Spark from 2.4.3 to 3.1.3
scalaVersion from 2.11.8 to 2.12.10
spark-cassandra-connector from 2.4.2 to 3.1.0
Cassandra version 3.2 and all the subsequent dependencies.
We are facing the following issues:
[error]…

user21166408
- 1
- 1
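The error itself is elided above, but upgrades like this most often break on mixed Scala binary versions. For reference, a build.sbt pinning the version set named in the question (artifact names assumed to be the standard ones):

scalaVersion := "2.12.10"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "3.1.3" % "provided",
  "org.apache.spark" %% "spark-sql"  % "3.1.3" % "provided",
  "com.datastax.spark" %% "spark-cassandra-connector" % "3.1.0"
)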
0
votes
1 answer
How to create a Scala trait which stores data from other columns in a dataset, and then create a new dataset with a column storing the trait?
I am new to Scala and am currently studying datasets for Scala and Spark. Based on my input dataset below, I am trying to create a new dataset (see below). In the new dataset, I aim to have a new column which contains a Scala trait…

AIBball
- 101
- 1
- 1
- 5
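Datasets need an Encoder for every column type, and Spark ships none for an arbitrary trait; a generic Kryo encoder is one way to store trait-typed values (as opaque binary). A sketch with a hypothetical trait:

import org.apache.spark.sql.{Dataset, Encoder, Encoders}

trait Animal { def sound: String }
case class Dog(name: String) extends Animal { val sound = "woof" }

// binary (Kryo) encoding: the column is stored as a single binary blob
implicit val animalEnc: Encoder[Animal] = Encoders.kryo[Animal]

val ds: Dataset[Animal] = spark.createDataset(Seq[Animal](Dog("Rex")))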
0
votes
0 answers
How to integrate IntelliJ and Databricks, as when using JDWP with a regular Spark cluster?
I have been looking online for a while but have found nothing, hence this question. I would like to be able to debug my Apache Spark code (written in Scala) remotely on Databricks, similar to the way it can be done on regular Spark clusters using the…

MrMuppet
- 547
- 1
- 4
- 12
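For context, the regular-cluster technique the question refers to: hand the driver JVM a JDWP agent at submit time, then attach IntelliJ's Remote JVM Debug run configuration to that port. Whether a Databricks cluster exposes such a port is exactly the open part of the question.

// passed via spark-submit, since the driver JVM must see the flag at startup:
//   --conf "spark.driver.extraJavaOptions=-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=5005"
// then attach IntelliJ's Remote JVM Debug to host:5005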
0
votes
0 answers
Spark - map a UDF over windows in a Spark DataFrame
Problem statement:
Have to group InputDf on multiple columns (accountGuid, appID, deviceGuid, deviceMake) and order each group by time.
Need to check whether the test Df occurs in the exact sequence in each window.
If it exists, create a new…

sujoy majumder
- 1
- 2
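A sketch of the shape of this, with hypothetical column names time and event, assuming testDf is small enough to collect: collapse each group into its time-ordered event list, then test for the sequence as a contiguous run.

import org.apache.spark.sql.functions._
import spark.implicits._

val grouped = inputDf
  .groupBy("accountGuid", "appID", "deviceGuid", "deviceMake")
  .agg(sort_array(collect_list(struct($"time", $"event"))).as("seq")) // ordered by time

val testSeq = testDf.orderBy("time").select($"event".as[String]).collect().toSeq

val hasRun = udf((events: Seq[String]) => events.containsSlice(testSeq))
val result = grouped.withColumn("matched",
  hasRun(expr("transform(seq, x -> x.event)"))) // drop the time field again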
0
votes
0 answers
How to use Google Session Token in Spark to connect to Google Cloud Storage bucket
I want to read data from a Google Storage bucket using a Google session token in a Spark application.
Here, instead of json.keyfile, I want to use the Google session key in the Spark conf options.
I tried with the json.key file, but actually I am looking for Google…

Sachin Patil
- 1
- 3
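One possibility, under a loudly flagged assumption: the Hadoop GCS connector (2.x line) accepts a custom AccessTokenProvider in place of a keyfile, so a class serving the session token can be plugged in. com.example.MyTokenProvider is hypothetical and would implement com.google.cloud.hadoop.util.AccessTokenProvider:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .config("spark.hadoop.fs.gs.impl",
    "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem")
  // assumption: connector 2.x key for pluggable token providers
  .config("spark.hadoop.fs.gs.auth.access.token.provider.impl",
    "com.example.MyTokenProvider")
  .getOrCreate()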
0
votes
1 answer
Add a tag to the list in the DataFrame based on the threshold given for the values in the list in Scala Spark
I have a DataFrame that has a column "grades" containing a list of Grade objects with 2 fields: name (String) and value (Double). I would like to add the word PASS to the list of tags if there is a Grade on the list with the name HOME and a…

xard4sTR
- 25
- 6
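A sketch using Spark 3.x higher-order functions, with a hypothetical threshold of 3.0 and assuming the DataFrame already carries a tags array column:

import org.apache.spark.sql.functions._

val threshold = 3.0
val tagged = df.withColumn("tags",
  when(
    exists(col("grades"), g =>
      g.getField("name") === "HOME" && g.getField("value") >= threshold),
    array_union(col("tags"), array(lit("PASS"))) // appends PASS once
  ).otherwise(col("tags")))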