Questions tagged [scala-spark]
49 questions
0
votes
1 answer
Add a tag to a list column in one DataFrame based on data from a second DataFrame
I have two DataFrames - the first with the columns model, cnd, age, tags (a repeatable field: a String list/array), min, and max, and the second with a main_model column.
I would like to add the MAIN tag to the first DataFrame to the…

xard4sTR
- 25
- 6
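A minimal sketch of one way to approach the question above, assuming the MAIN tag should be appended whenever model matches a main_model; the data values below are illustrative, not from the post:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object AddMainTag {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("tags").getOrCreate()
    import spark.implicits._

    // First DataFrame: model plus the tags array (other columns omitted for brevity)
    val df1 = Seq(("m1", Seq("fast")), ("m2", Seq("slow"))).toDF("model", "tags")
    // Second DataFrame: the list of "main" models
    val df2 = Seq("m1").toDF("main_model")

    // Left join on model; when a match exists, append "MAIN" to the tags array
    val result = df1
      .join(df2, df1("model") === df2("main_model"), "left")
      .withColumn("tags",
        when(col("main_model").isNotNull, array_union(col("tags"), array(lit("MAIN"))))
          .otherwise(col("tags")))
      .drop("main_model")

    result.show(false)
  }
}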
0
votes
1 answer
Convert a Spark Scala Dataset of one type to another
I have a Dataset of the following case class type:
case class AddressRawData(
  addressId: String,
  customerId: String,
  address: String
)
I want to…

Nikhil Padole
- 97
- 1
- 13
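A sketch of the usual pattern: map each element of the typed Dataset into the target case class. AddressData below is a hypothetical target type, since the question text is truncated:

import org.apache.spark.sql.SparkSession

case class AddressRawData(addressId: String, customerId: String, address: String)
// Hypothetical target type; the real one is cut off in the question
case class AddressData(addressId: String, customerId: String, street: String)

object ConvertDataset {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("convert").getOrCreate()
    import spark.implicits._

    val raw = Seq(AddressRawData("a1", "c1", "1 Main St")).toDS()

    // map on a typed Dataset yields a Dataset of the new type,
    // letting you transform fields along the way
    val converted = raw.map(r => AddressData(r.addressId, r.customerId, r.address))
    converted.show()
  }
}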
0
votes
1 answer
Inequality test of two columns from the same DataFrame in PySpark
In Scala Spark we can filter rows where column A's value is not equal to column B's in the same DataFrame as
df.filter(col("A")=!=col("B"))
How can we do the same in PySpark?
I have tried different options like
df.filter(~(df["A"] == df["B"])) and the != operator but…
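For reference, the PySpark equivalent is typically df.filter(col("A") != col("B")), since Python's != is overloaded on Column; note that, as in Scala, rows where either side is NULL are dropped. A runnable Scala sketch of the original filter:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

object InequalityFilter {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("neq").getOrCreate()
    import spark.implicits._

    val df = Seq((1, 1), (1, 2), (2, 3)).toDF("A", "B")

    // =!= is Column's "not equal" operator; rows where A == B are dropped.
    // Rows where either side is NULL are also dropped, because NULL
    // comparisons evaluate to NULL, which filter treats as false.
    df.filter(col("A") =!= col("B")).show()
  }
}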
0
votes
0 answers
Spark streaming dropping left-join events when right-side data is empty
I have two streams, 'left' stream and 'right' stream. I would like to do a leftOuter join on the streams. I would like to collect the events on 'left' stream that couldn't join with 'right' stream.
The watermark delay on both the streams is…
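A sketch of the relevant API shape, using the toy rate source in place of the question's streams (an assumption, since the sources aren't shown): a stream-stream leftOuter join needs watermarks on both sides plus a time-bound condition, and unmatched left rows are emitted only after the watermark closes the window.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.expr

object StreamLeftJoin {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("join").getOrCreate()

    // Two toy streams from the rate source; in the question these would be real sources
    val left = spark.readStream.format("rate").option("rowsPerSecond", 5).load()
      .withColumnRenamed("value", "leftId").withColumnRenamed("timestamp", "leftTime")
      .withWatermark("leftTime", "10 seconds")
    val right = spark.readStream.format("rate").option("rowsPerSecond", 1).load()
      .withColumnRenamed("value", "rightId").withColumnRenamed("timestamp", "rightTime")
      .withWatermark("rightTime", "10 seconds")

    // Unmatched left rows appear only once the watermark guarantees no
    // matching right row can still arrive within the join's time bound
    val joined = left.join(
      right,
      expr("leftId = rightId AND rightTime BETWEEN leftTime - INTERVAL 5 SECONDS AND leftTime + INTERVAL 5 SECONDS"),
      "leftOuter")

    joined.writeStream.format("console").start().awaitTermination()
  }
}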
0
votes
0 answers
DataFrame values changing after adding columns using withColumn
I have created a DataFrame by reading data from DB2, and it looks like this:
df1.show()
Table_Name | Source_count | Target_Count
-----------|--------------|--------------
Test_tab   | 12750        | 12750
After that, I have added 4…

phani437
- 1
- 1
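The post is truncated, so the actual cause may differ, but one common reason for drifting values is that DataFrames are lazily re-evaluated and each action re-runs the JDBC read. A sketch of the caching mitigation under that assumption (connection details are hypothetical):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

object StableCounts {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("stable").getOrCreate()

    val df1 = spark.read
      .format("jdbc")
      .option("url", "jdbc:db2://host:50000/dbname") // hypothetical connection details
      .option("dbtable", "SCHEMA.COUNTS")            // hypothetical table
      .load()
      .cache() // materialize once so later actions see stable values

    df1.count() // force the cache to populate before deriving columns

    val df2 = df1.withColumn("diff", col("Source_count") - col("Target_Count"))
    df2.show()
  }
}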
0
votes
0 answers
PySpark equivalent of Scala Spark code
I have the following code in Scala:
val checkedValues = inputDf.rdd.map(row => {
  val size = row.length
  val items = for (i <- 0 until size) yield {
    val fieldName = row.schema.fieldNames(i)
    val sourceField = sourceFields(fieldName)…

Tarique
- 463
- 3
- 16
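The snippet is cut off, but the Scala pattern it shows (iterating a Row's fields by name) looks roughly like this self-contained version; sourceFields and the tuple being built are hypothetical stand-ins:

import org.apache.spark.sql.SparkSession

object RowFieldCheck {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("check").getOrCreate()
    import spark.implicits._

    val inputDf = Seq(("a", 1), ("b", 2)).toDF("name", "value")

    // Hypothetical per-field metadata, standing in for the question's sourceFields
    val sourceFields = Map("name" -> "string", "value" -> "int")

    val checkedValues = inputDf.rdd.map { row =>
      val size = row.length
      // Pair each field name with its metadata and value
      val items = for (i <- 0 until size) yield {
        val fieldName = row.schema.fieldNames(i)
        val sourceField = sourceFields(fieldName)
        (fieldName, sourceField, row.get(i))
      }
      items
    }

    checkedValues.collect().foreach(println)
  }
}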
0
votes
1 answer
Why do we use val for accumulators and not var in Scala?
Why do we use val instead of var for accumulators? If it's like a counter that multiple executor nodes share and update/change, doesn't that mean reassigning a val?
val accum = sc.longAccumulator("New Accumulator")

user23062
- 45
- 1
- 4
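A sketch illustrating the distinction: the val holds a stable reference to the accumulator object, and add() mutates the object's internal state; the reference itself is never reassigned, so var is unnecessary.

import org.apache.spark.sql.SparkSession

object AccumulatorVal {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("accum").getOrCreate()
    val sc = spark.sparkContext

    // val: the reference never changes; only the accumulator's internal
    // count does, via add(). No reassignment happens anywhere.
    val accum = sc.longAccumulator("New Accumulator")

    sc.parallelize(1 to 100).foreach(x => accum.add(x))

    println(accum.value) // 5050, accumulated across executor tasks
  }
}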
0
votes
2 answers
Coalesce dynamic column list from two datasets
I am trying to translate a PySpark job that dynamically coalesces the columns from two datasets with additional filters/conditions.
conditions_ = [when(df1[c]!=df2[c], lit(c)).otherwise("") for c in df1.columns if c not in…

sinsom
- 19
- 3
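A sketch of the Scala translation of that comprehension; the excluded-column list is a hypothetical placeholder, since the original condition is truncated:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{lit, when}

object DynamicConditions {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("dyn").getOrCreate()
    import spark.implicits._

    val df1 = Seq((1, "a", "x")).toDF("id", "c1", "c2")
    val df2 = Seq((1, "a", "y")).toDF("id", "c1", "c2")

    // Hypothetical stand-in for the truncated exclusion list
    val excluded = Set("id")

    // Scala equivalent of the PySpark list comprehension: one when/otherwise
    // expression per shared column, emitting the column name on a mismatch
    val conditions = df1.columns
      .filterNot(excluded.contains)
      .map(c => when(df1(c) =!= df2(c), lit(c)).otherwise(lit("")))
      .toSeq

    conditions.foreach(println)
  }
}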
0
votes
2 answers
How to use the when().otherwise function in Spark with multiple conditions
This is my first post so let me know if I need to give more details.
I am trying to create a boolean column, "immediate", that shows true when at least one of the columns has some data in it. If all are null, then the column should be false. I am…

jackdotdi
- 24
- 3
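One way to express "true when at least one of the columns is non-null" is to OR the isNotNull checks together inside a single when(); the column names below are illustrative, not from the post:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, lit, when}

object ImmediateFlag {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("flag").getOrCreate()
    import spark.implicits._

    val df = Seq(
      (Option("x"), Option.empty[String]),
      (Option.empty[String], Option.empty[String])
    ).toDF("phone", "email")

    // Hypothetical column list; OR the null checks so one non-null value wins
    val cols = Seq("phone", "email")
    val anyPresent = cols.map(col(_).isNotNull).reduce(_ || _)

    val result = df.withColumn("immediate",
      when(anyPresent, lit(true)).otherwise(lit(false)))

    result.show()
  }
}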
0
votes
0 answers
Scala Spark - Difference between two DataFrames - Identify inserts, updates and deletes
I am trying to translate the code below from PySpark to Scala.
I am able to successfully create the DataFrames from the input data.
from pyspark.sql.functions import col, array, when, array_remove, lit, size, coalesce
from pyspark.sql.types import *
…

sinsom
- 19
- 3
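A sketch of the usual full-outer-join approach to classifying rows in Scala, assuming a single id key column (the question's actual schema is truncated):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, when}

object DataFrameDiff {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("diff").getOrCreate()
    import spark.implicits._

    val oldDf = Seq((1, "a"), (2, "b")).toDF("id", "value")
    val newDf = Seq((2, "b2"), (3, "c")).toDF("id", "value")

    // Full outer join on the key, then classify each row:
    //   only in new -> insert; only in old -> delete; in both but changed -> update
    val diff = oldDf.as("o")
      .join(newDf.as("n"), col("o.id") === col("n.id"), "full_outer")
      .withColumn("change_type",
        when(col("o.id").isNull, "insert")
          .when(col("n.id").isNull, "delete")
          .when(col("o.value") =!= col("n.value"), "update")
          .otherwise("unchanged"))

    diff.show()
  }
}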
0
votes
1 answer
How does Spark broadcast the data in a broadcast join?
How does Spark broadcast the data when we use a broadcast join with a hint? As far as I can see, when we use the broadcast hint, it calls this function:
def broadcast[T](df: Dataset[T]): Dataset[T] = {
  Dataset[T](df.sparkSession,
    …

sho
- 176
- 2
- 12
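For context, a sketch of how the hint is used from user code: broadcast() only tags the plan with a hint, and at planning time Spark ships the small side to every executor instead of shuffling both sides (exact internals vary by version).

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

object BroadcastHint {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("bcast").getOrCreate()
    import spark.implicits._

    val large = Seq((1, "x"), (2, "y")).toDF("id", "payload")
    val small = Seq((1, "dim1")).toDF("id", "dim")

    // The hint nudges the planner toward a broadcast hash join: the small
    // side is collected to the driver and sent whole to each executor
    val joined = large.join(broadcast(small), "id")

    joined.explain() // the physical plan should show BroadcastHashJoin
    joined.show()
  }
}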
0
votes
0 answers
Joining large RDDs in Scala Spark
I want to join a large (1 TB) RDD with a medium-sized (10 GB) RDD. An earlier processing job on the large data was completing in 8 hours. I then joined the medium-sized data to get some info that needs to be added to the processing (it's a simple…

user0712
- 43
- 6
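With a 10 GB side that is usually too big to broadcast, one common tactic is to co-partition both RDDs with the same partitioner so the join doesn't reshuffle the large side on every run; a sketch with toy data and an illustrative partition count:

import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

object CoPartitionedJoin {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local[*]").setAppName("join"))

    // Toy stand-ins for the 1 TB and 10 GB keyed RDDs
    val large = sc.parallelize(1 to 1000000).map(k => (k % 1000, k))
    val medium = sc.parallelize(1 to 1000).map(k => (k, s"info-$k"))

    // Partition both sides the same way; in a real job the partition count
    // would be sized to the data (200 here is illustrative)
    val partitioner = new HashPartitioner(200)
    val largeP = large.partitionBy(partitioner).persist()
    val mediumP = medium.partitionBy(partitioner)

    // With matching partitioners the join is narrow: the already-partitioned
    // large side is not shuffled again
    val joined = largeP.join(mediumP)
    println(joined.count())
  }
}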
0
votes
1 answer
How to convert 'Jul 24 2022' to '2022-07-24' in spark sql
I want to convert a string date column to a date or timestamp (yyyy-MM-dd). How can I do it in Scala Spark SQL?
Input:
D1
Apr 24 2022
Jul 08 2021
Jan 16 2022
Expected:
D2
2022-04-24
2021-07-08
2022-01-16

Namrata
- 1
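This maps directly onto to_date with a "MMM dd yyyy" pattern; the same expression works in SQL as to_date(D1, 'MMM dd yyyy'). A sketch:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, to_date}

object ParseDates {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("dates").getOrCreate()
    import spark.implicits._

    val df = Seq("Apr 24 2022", "Jul 08 2021", "Jan 16 2022").toDF("D1")

    // "MMM dd yyyy" matches the abbreviated-month input; to_date yields a
    // DateType column, which renders as yyyy-MM-dd
    val result = df.withColumn("D2", to_date(col("D1"), "MMM dd yyyy"))
    result.show()
  }
}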
0
votes
1 answer
Need to add quotes around all values in a Spark DataFrame
I need to add quotes around all values in a Spark DataFrame.
Input:
val someDF = Seq(
  ("user1", "math", "algebra-1", "90"),
  ("user1", "physics", "gravity", "70")
).toDF("user_id", "course_id", "lesson_name", "score")
Actual…
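The expected output is truncated, but if the goal is to wrap every value in double quotes, one sketch is to rebuild each column with concat; if this is for CSV output, the writer's quoteAll option does it natively:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, concat, lit}

object QuoteAll {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("quotes").getOrCreate()
    import spark.implicits._

    val someDF = Seq(
      ("user1", "math", "algebra-1", "90"),
      ("user1", "physics", "gravity", "70")
    ).toDF("user_id", "course_id", "lesson_name", "score")

    // Wrap every column's value in literal double quotes
    val quoted = someDF.columns.foldLeft(someDF) { (df, c) =>
      df.withColumn(c, concat(lit("\""), col(c), lit("\"")))
    }
    quoted.show(false)

    // Alternatively, when writing CSV, quoteAll quotes every field on output:
    // someDF.write.option("quoteAll", "true").csv("/tmp/out") // path illustrative
  }
}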
0
votes
1 answer
Cannot stream files in subfolders with wildcards in PySpark streaming
This code works only if I set directory = "s3://bucket/folder/2022/10/18/4/*"
from pyspark.sql.functions import from_json
from pyspark.streaming import StreamingContext
ssc = StreamingContext(sc, 30)
directory =…

Salsa Steve
- 89
- 3
- 10
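A possibly relevant note: the snippet mixes the legacy DStream StreamingContext with a structured-streaming function (from_json). In structured streaming, glob patterns in file-source paths generally work; a Scala sketch, with the schema and glob pattern as illustrative assumptions:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{StringType, StructType}

object GlobStream {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("glob").getOrCreate()

    // Hypothetical schema; file streaming sources require one up front
    val schema = new StructType().add("id", StringType).add("body", StringType)

    // Globs can cover the variable year/month/day/hour levels instead of one
    // fixed path (pattern below is illustrative, not from the post)
    val stream = spark.readStream
      .schema(schema)
      .json("s3://bucket/folder/*/*/*/*/")

    stream.writeStream.format("console").start().awaitTermination()
  }
}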