Questions tagged [scala-spark]

49 questions
0
votes
1 answer

Add a tag to the list in the DataFrame based on the data from the second DataFrame

I have two DataFrames: the first with the columns model, cnd, age, tags (a repeatable field, i.e. a String list/array), min, and max, and the second with the main_model column. I would like to add the MAIN tag to the first DataFrame to the…
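One hedged way to approach this in Scala, a sketch only: the join key equality model == main_model and the "MAIN" literal are read from the question, and array_union assumes Spark 2.4+.

    import org.apache.spark.sql.functions._

    // Left-join the first DataFrame to the second on model == main_model;
    // where a match exists, append the MAIN tag to the tags array.
    val tagged = df1
      .join(df2, df1("model") === df2("main_model"), "left")
      .withColumn("tags",
        when(col("main_model").isNotNull,
          array_union(col("tags"), array(lit("MAIN"))))
          .otherwise(col("tags")))
      .drop("main_model")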
0
votes
1 answer

Convert Spark Scala dataset of one type to another

I have a dataset with the following case class type: case class AddressRawData( addressId: String, customerId: String, address: String ) I want to…
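The usual pattern is to define the target case class and map the typed Dataset into it. A minimal sketch, where AddressData and its extra field are hypothetical:

    import org.apache.spark.sql.{Dataset, SparkSession}

    case class AddressRawData(addressId: String, customerId: String, address: String)
    // Hypothetical target type for illustration.
    case class AddressData(addressId: String, customerId: String, address: String, isValid: Boolean)

    val spark = SparkSession.builder.getOrCreate()
    import spark.implicits._

    val raw: Dataset[AddressRawData] = Seq(AddressRawData("a1", "c1", "1 Main St")).toDS()
    // Map each source record into the target type, filling the new field.
    val converted: Dataset[AddressData] =
      raw.map(r => AddressData(r.addressId, r.customerId, r.address, isValid = true))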
Nikhil Padole
  • 97
  • 1
  • 13
0
votes
1 answer

Inequality test of two columns from the same dataframe in PySpark

In Scala Spark we can filter rows where column A is not equal to column B of the same dataframe with df.filter(col("A") =!= col("B")). How can we do the same in PySpark? I have tried different options like df.filter(~(df["A"] == df["B"])) and the != operator but…
0
votes
0 answers

Spark streaming dropping left join events when right side data is empty

I have two streams, a 'left' stream and a 'right' stream, and I would like to do a leftOuter join on them. I would like to collect the events on the 'left' stream that couldn't join with the 'right' stream. The watermark delay on both the streams is…
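For context, a stream-stream leftOuter join in Structured Streaming needs watermarks on both sides plus a time-range join condition, otherwise unmatched left rows can be held back or dropped. A sketch with hypothetical stream and column names:

    import org.apache.spark.sql.functions.expr

    // Both sides watermarked; the time-range condition bounds how long to wait for a match.
    val leftWm  = leftStream.withWatermark("leftTime", "10 minutes")
    val rightWm = rightStream.withWatermark("rightTime", "10 minutes")

    val joined = leftWm.join(
      rightWm,
      expr("""
        leftKey = rightKey AND
        rightTime >= leftTime AND
        rightTime <= leftTime + interval 5 minutes
      """),
      "leftOuter")  // unmatched left rows emit with nulls once the watermark passes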
0
votes
0 answers

DataFrame values changing after adding columns using withColumn

I have created a dataframe by reading data from DB2, and it looks like below. df1.show(): Table_Name | Source_count | Target_Count → Test_tab | 12750 | 12750. After that, I have added 4…
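One possible cause, an assumption not confirmed by the excerpt, is that each action re-reads the JDBC source, so later evaluations can see different rows; caching the source read before deriving columns pins the values. A sketch with hypothetical connection settings:

    import org.apache.spark.sql.functions._

    // Materialize the DB2 read once so withColumn derivations see stable rows.
    val df1 = spark.read
      .format("jdbc")
      .option("url", jdbcUrl)          // hypothetical connection settings
      .option("dbtable", "SOURCE_TAB")
      .load()
      .cache()
    df1.count()  // force the cache to populate

    val df2 = df1.withColumn("Count_Match",
      when(col("Source_count") === col("Target_Count"), lit("Y")).otherwise(lit("N")))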
0
votes
0 answers

PySpark equivalent of Scala Spark code

I have the following code in Scala: val checkedValues = inputDf.rdd.map(row => { val size = row.length val items = for (i <- 0 until size) yield { val fieldName = row.schema.fieldNames(i) val sourceField = sourceFields(fieldName)…
Tarique
  • 463
  • 3
  • 16
0
votes
1 answer

Why do we use val for accumulators and not var in Scala?

Why do we use val instead of var for accumulators? If it's like a counter that's shared across multiple executor nodes so they can update/change it, doesn't that mean reassigning a val? val accum = sc.longAccumulator("New Accumulator")
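The short answer: a val fixes the reference, not the accumulator's state; add() mutates the accumulator object in place, so no reassignment ever happens. A runnable sketch:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.master("local[*]").getOrCreate()
    val sc = spark.sparkContext

    // val: the name always points at this one accumulator object...
    val accum = sc.longAccumulator("New Accumulator")
    // ...but the object itself is mutable; add() updates its internal counter.
    sc.parallelize(1 to 4).foreach(x => accum.add(x))
    println(accum.value)  // 10 — mutated without ever reassigning accum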
user23062
  • 45
  • 1
  • 4
0
votes
2 answers

Coalesce dynamic column list from two datasets

I am trying to translate a PySpark job which dynamically coalesces the columns from two datasets with additional filters/conditions. conditions_ = [when(df1[c]!=df2[c], lit(c)).otherwise("") for c in df1.columns if c not in…
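A hedged Scala translation of that list comprehension; the key-column list and dataframe names are assumptions:

    import org.apache.spark.sql.functions._

    val keyCols = Seq("id")  // assumed join keys, excluded from the comparison
    // For each shared non-key column, emit its name where the two sides differ.
    val conditions = df1.columns
      .filterNot(keyCols.contains)
      .map(c => when(df1(c) =!= df2(c), lit(c)).otherwise(lit("")))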
sinsom
  • 19
  • 3
0
votes
2 answers

How to use the when().otherwise function in Spark with multiple conditions

This is my first post, so let me know if I need to give more details. I am trying to create a boolean column, "immediate", that is true when at least one of the columns has some data in it. If all are null then the column should be false. I am…
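One common pattern, with hypothetical column names: build an isNotNull predicate per column, OR-reduce them, then wrap the result in when().otherwise():

    import org.apache.spark.sql.functions._

    val watched = Seq("colA", "colB", "colC")  // hypothetical column list
    // True if at least one watched column is non-null, false when all are null.
    val anyPresent = watched.map(col(_).isNotNull).reduce(_ || _)
    val out = df.withColumn("immediate",
      when(anyPresent, lit(true)).otherwise(lit(false)))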
jackdotdi
  • 24
  • 3
0
votes
0 answers

ScalaSpark - Difference between 2 dataframes - Identify inserts, updates and deletes

I am trying to translate the below code from PySpark to Scala. I am able to successfully create the dataframes from the input data. from pyspark.sql.functions import col, array, when, array_remove, lit, size, coalesce from pyspark.sql.types import * …
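One way to classify rows in Scala, a sketch under the assumption of a single key column "id" on both sides:

    import org.apache.spark.sql.functions._

    // A full outer join keeps rows that exist on only one side.
    val joined = df1.alias("a").join(df2.alias("b"),
      col("a.id") === col("b.id"), "full_outer")

    val classified = joined.withColumn("op",
      when(col("a.id").isNull, lit("insert"))    // present only in df2
        .when(col("b.id").isNull, lit("delete")) // present only in df1
        .otherwise(lit("update")))               // present in both; compare fields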
0
votes
1 answer

How does Spark broadcast the data in a broadcast join?

How does Spark broadcast the data when we use a broadcast join with a hint? As far as I can see, when we use the broadcast hint it calls this function: def broadcast[T](df: Dataset[T]): Dataset[T] = { Dataset[T](df.sparkSession, …
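At a high level the hint only marks the plan; the driver later collects the small side and ships it to every executor, where a BroadcastHashJoin probes it locally. Usage sketch with hypothetical dataframe names:

    import org.apache.spark.sql.functions.broadcast

    // Mark smallDf for broadcasting; the planner then prefers BroadcastHashJoin,
    // so largeDf is joined against an executor-local copy of smallDf without
    // shuffling largeDf.
    val joined = largeDf.join(broadcast(smallDf), Seq("key"))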
sho
  • 176
  • 2
  • 12
0
votes
0 answers

Joining large RDDs in Scala Spark

I want to join a large (1 TB) RDD with a medium (10 GB) RDD. An earlier processing job on the large data was completing in 8 hours. I then joined the medium-sized data to get some info that needs to be added to the processing (it's a simple…
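10 GB is usually too big to broadcast, so one hedged option is to co-partition both pair RDDs with the same partitioner and persist the large side, avoiding repeated shuffles; the partition count and RDD names are assumptions:

    import org.apache.spark.HashPartitioner

    // Hash both RDDs by key into the same partitions so the join avoids a full
    // shuffle beyond the initial partitionBy; persist the large side if reused.
    val p = new HashPartitioner(2000)  // tune to cluster size
    val bigByKey = bigRdd.partitionBy(p).persist()
    val medByKey = medRdd.partitionBy(p)
    val joined = bigByKey.join(medByKey)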
user0712
  • 43
  • 6
0
votes
1 answer

How to convert 'Jul 24 2022' to '2022-07-24' in Spark SQL

I want to convert a string date column to a date or timestamp (yyyy-MM-dd). How can I do it in Scala Spark SQL? Input: D1: Apr 24 2022 | Jul 08 2021 | Jan 16 2022. Expected: D2: 2022-04-24 | 2021-07-08 | 2022-01-16
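With Spark 3.x datetime patterns, "MMM dd yyyy" parses strings like "Jul 24 2022". A sketch in both the DataFrame API and Spark SQL; the table name is hypothetical:

    import org.apache.spark.sql.functions._

    // DataFrame API: parse the string column D1 into a DateType column D2.
    val out = df.withColumn("D2", to_date(col("D1"), "MMM dd yyyy"))

    // Spark SQL equivalent:
    spark.sql("SELECT to_date(D1, 'MMM dd yyyy') AS D2 FROM input_table")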
0
votes
1 answer

Need to add quotes for all values in Spark

Need to add quotes around every value in a Spark dataframe. Input: val someDF = Seq( ("user1", "math", "algebra-1", "90"), ("user1", "physics", "gravity", "70") ).toDF("user_id", "course_id", "lesson_name", "score") Actual…
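A hedged sketch of one way to do this: fold over the columns and wrap each value in double quotes with concat:

    import org.apache.spark.sql.functions._

    // Wrap every column's value in literal double quotes.
    val quoted = someDF.columns.foldLeft(someDF) { (df, c) =>
      df.withColumn(c, concat(lit("\""), col(c), lit("\"")))
    }
    quoted.show(false)  // e.g. "user1" | "math" | "algebra-1" | "90"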
0
votes
1 answer

Cannot stream files in subfolders with wildcards in PySpark streaming

This code works only if I set directory = "s3://bucket/folder/2022/10/18/4/*". from pyspark.sql.functions import from_json from pyspark.streaming import StreamingContext ssc = StreamingContext(sc, 30) directory =…
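For comparison, the Structured Streaming file source in Scala accepts glob patterns written out per directory level; whether this matches the bucket layout above is an assumption, and the schema is hypothetical:

    // One wildcard per path segment (year/month/day/hour). Streaming file
    // sources require an explicit schema.
    val stream = spark.readStream
      .schema(inputSchema)
      .json("s3://bucket/folder/*/*/*/*/*")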