Questions tagged [scala-spark]

49 questions
0
votes
1 answer

Add a tag to the list in the DataFrame based on the data from the second DataFrame

I have two DataFrames: the first with the columns model, cnd, age, tags (a repeatable field, i.e. a String list/array), min, and max, and the second with the main_model column. I would like to add the MAIN tag to the first DataFrame to the…
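One hedged way to approach this in Scala, a sketch only: the join key equality model == main_model and the "MAIN" literal are read from the question, and array_union assumes Spark 2.4+.

    import org.apache.spark.sql.functions._

    // Left-join the first DataFrame to the second on model == main_model;
    // where a match exists, append the MAIN tag to the tags array.
    val tagged = df1
      .join(df2, df1("model") === df2("main_model"), "left")
      .withColumn("tags",
        when(col("main_model").isNotNull,
          array_union(col("tags"), array(lit("MAIN"))))
          .otherwise(col("tags")))
      .drop("main_model")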
0
votes
1 answer

Convert Spark Scala dataset of one type to another

I have a dataset with the following case class type: case class AddressRawData( addressId: String, customerId: String, address: String ) I want to…
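The usual pattern is to define the target case class and map the typed Dataset into it. A minimal sketch, where AddressData and its extra field are hypothetical:

    import org.apache.spark.sql.{Dataset, SparkSession}

    case class AddressRawData(addressId: String, customerId: String, address: String)
    // Hypothetical target type for illustration.
    case class AddressData(addressId: String, customerId: String, address: String, isValid: Boolean)

    val spark = SparkSession.builder.getOrCreate()
    import spark.implicits._

    val raw: Dataset[AddressRawData] = Seq(AddressRawData("a1", "c1", "1 Main St")).toDS()
    // Map each source record into the target type, filling the new field.
    val converted: Dataset[AddressData] =
      raw.map(r => AddressData(r.addressId, r.customerId, r.address, isValid = true))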
Nikhil Padole
  • 97
  • 1
  • 13
0
votes
1 answer

Inequality test of two columns from the same dataframe in PySpark

In Scala Spark we can filter rows where column A is not equal to column B of the same dataframe with df.filter(col("A") =!= col("B")). How can we do the same in PySpark? I have tried different options like df.filter(~(df["A"] == df["B"])) and the != operator but…
0
votes
0 answers

Spark streaming dropping left join events when right side data is empty

I have two streams, a 'left' stream and a 'right' stream, and I would like to do a leftOuter join on them. I would like to collect the events on the 'left' stream that couldn't join with the 'right' stream. The watermark delay on both the streams is…
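For context, a stream-stream leftOuter join in Structured Streaming needs watermarks on both sides plus a time-range join condition, otherwise unmatched left rows can be held back or dropped. A sketch with hypothetical stream and column names:

    import org.apache.spark.sql.functions.expr

    // Both sides watermarked; the time-range condition bounds how long to wait for a match.
    val leftWm  = leftStream.withWatermark("leftTime", "10 minutes")
    val rightWm = rightStream.withWatermark("rightTime", "10 minutes")

    val joined = leftWm.join(
      rightWm,
      expr("""
        leftKey = rightKey AND
        rightTime >= leftTime AND
        rightTime <= leftTime + interval 5 minutes
      """),
      "leftOuter")  // unmatched left rows emit with nulls once the watermark passes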
0
votes
0 answers

DataFrame values changing after adding columns using withColumn

I have created a dataframe by reading data from DB2, and it looks like below. df1.show(): Table_Name | Source_count | Target_Count → Test_tab | 12750 | 12750. After that, I have added 4…
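One possible cause, an assumption not confirmed by the excerpt, is that each action re-reads the JDBC source, so later evaluations can see different rows; caching the source read before deriving columns pins the values. A sketch with hypothetical connection settings:

    import org.apache.spark.sql.functions._

    // Materialize the DB2 read once so withColumn derivations see stable rows.
    val df1 = spark.read
      .format("jdbc")
      .option("url", jdbcUrl)          // hypothetical connection settings
      .option("dbtable", "SOURCE_TAB")
      .load()
      .cache()
    df1.count()  // force the cache to populate

    val df2 = df1.withColumn("Count_Match",
      when(col("Source_count") === col("Target_Count"), lit("Y")).otherwise(lit("N")))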
0
votes
0 answers

PySpark equivalent of Scala Spark code

I have the following code in Scala: val checkedValues = inputDf.rdd.map(row => { val size = row.length val items = for (i <- 0 until size) yield { val fieldName = row.schema.fieldNames(i) val sourceField = sourceFields(fieldName)…
Tarique
  • 463
  • 3
  • 16
0
votes
1 answer

Why do we use val for accumulators and not var in Scala?

Why do we use val instead of var for accumulators? If it's like a counter that's shared across multiple executor nodes so they can update/change it, doesn't that mean reassigning a val? val accum = sc.longAccumulator("New Accumulator")
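The short answer: a val fixes the reference, not the accumulator's state; add() mutates the accumulator object in place, so no reassignment ever happens. A runnable sketch:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.master("local[*]").getOrCreate()
    val sc = spark.sparkContext

    // val: the name always points at this one accumulator object...
    val accum = sc.longAccumulator("New Accumulator")
    // ...but the object itself is mutable; add() updates its internal counter.
    sc.parallelize(1 to 4).foreach(x => accum.add(x))
    println(accum.value)  // 10 — mutated without ever reassigning accum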
user23062
  • 45
  • 1
  • 4
0
votes
2 answers

Coalesce dynamic column list from two datasets

I am trying to translate a PySpark job which dynamically coalesces the columns from two datasets with additional filters/conditions. conditions_ = [when(df1[c]!=df2[c], lit(c)).otherwise("") for c in df1.columns if c not in…
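A hedged Scala translation of that list comprehension; the key-column list and dataframe names are assumptions:

    import org.apache.spark.sql.functions._

    val keyCols = Seq("id")  // assumed join keys, excluded from the comparison
    // For each shared non-key column, emit its name where the two sides differ.
    val conditions = df1.columns
      .filterNot(keyCols.contains)
      .map(c => when(df1(c) =!= df2(c), lit(c)).otherwise(lit("")))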
sinsom
  • 19
  • 3
0
votes
2 answers

How to use the when().otherwise function in Spark with multiple conditions

This is my first post, so let me know if I need to give more details. I am trying to create a boolean column, "immediate", that is true when at least one of the columns has some data in it. If all are null then the column should be false. I am…
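One common pattern, with hypothetical column names: build an isNotNull predicate per column, OR-reduce them, then wrap the result in when().otherwise():

    import org.apache.spark.sql.functions._

    val watched = Seq("colA", "colB", "colC")  // hypothetical column list
    // True if at least one watched column is non-null, false when all are null.
    val anyPresent = watched.map(col(_).isNotNull).reduce(_ || _)
    val out = df.withColumn("immediate",
      when(anyPresent, lit(true)).otherwise(lit(false)))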
jackdotdi
  • 24
  • 3
0
votes
0 answers

ScalaSpark - Difference between 2 dataframes - Identify inserts, updates and deletes

I am trying to translate the below code from PySpark to Scala. I am able to successfully create the dataframes from the input data. from pyspark.sql.functions import col, array, when, array_remove, lit, size, coalesce from pyspark.sql.types import * …
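One way to classify rows in Scala, a sketch under the assumption of a single key column "id" on both sides:

    import org.apache.spark.sql.functions._

    // A full outer join keeps rows that exist on only one side.
    val joined = df1.alias("a").join(df2.alias("b"),
      col("a.id") === col("b.id"), "full_outer")

    val classified = joined.withColumn("op",
      when(col("a.id").isNull, lit("insert"))    // present only in df2
        .when(col("b.id").isNull, lit("delete")) // present only in df1
        .otherwise(lit("update")))               // present in both; compare fields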
0
votes
1 answer

How does Spark broadcast the data in a broadcast join?

How does Spark broadcast the data when we use a broadcast join with a hint? As far as I can see, when we use the broadcast hint it calls this function: def broadcast[T](df: Dataset[T]): Dataset[T] = { Dataset[T](df.sparkSession, …
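At a high level the hint only marks the plan; the driver later collects the small side and ships it to every executor, where a BroadcastHashJoin probes it locally. Usage sketch with hypothetical dataframe names:

    import org.apache.spark.sql.functions.broadcast

    // Mark smallDf for broadcasting; the planner then prefers BroadcastHashJoin,
    // so largeDf is joined against an executor-local copy of smallDf without
    // shuffling largeDf.
    val joined = largeDf.join(broadcast(smallDf), Seq("key"))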
sho
  • 176
  • 2
  • 12
0
votes
0 answers

Joining large RDDs in Scala Spark

I want to join a large (1 TB) RDD with a medium (10 GB) RDD. An earlier processing job on the large data was completing in 8 hours. I then joined the medium-sized data to get some info that needs to be added to the processing (it's a simple…
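10 GB is usually too big to broadcast, so one hedged option is to co-partition both pair RDDs with the same partitioner and persist the large side, avoiding repeated shuffles; the partition count and RDD names are assumptions:

    import org.apache.spark.HashPartitioner

    // Hash both RDDs by key into the same partitions so the join avoids a full
    // shuffle beyond the initial partitionBy; persist the large side if reused.
    val p = new HashPartitioner(2000)  // tune to cluster size
    val bigByKey = bigRdd.partitionBy(p).persist()
    val medByKey = medRdd.partitionBy(p)
    val joined = bigByKey.join(medByKey)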
user0712
  • 43
  • 6
0
votes
1 answer

How to convert 'Jul 24 2022' to '2022-07-24' in Spark SQL

I want to convert a string date column to a date or timestamp (yyyy-MM-dd). How can I do it in Scala Spark SQL? Input: D1: Apr 24 2022 | Jul 08 2021 | Jan 16 2022. Expected: D2: 2022-04-24 | 2021-07-08 | 2022-01-16
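With Spark 3.x datetime patterns, "MMM dd yyyy" parses strings like "Jul 24 2022". A sketch in both the DataFrame API and Spark SQL; the table name is hypothetical:

    import org.apache.spark.sql.functions._

    // DataFrame API: parse the string column D1 into a DateType column D2.
    val out = df.withColumn("D2", to_date(col("D1"), "MMM dd yyyy"))

    // Spark SQL equivalent:
    spark.sql("SELECT to_date(D1, 'MMM dd yyyy') AS D2 FROM input_table")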
0
votes
1 answer

Need to add quotes for all values in Spark

Need to add quotes around every value in a Spark dataframe. Input: val someDF = Seq( ("user1", "math", "algebra-1", "90"), ("user1", "physics", "gravity", "70") ).toDF("user_id", "course_id", "lesson_name", "score") Actual…
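A hedged sketch of one way to do this: fold over the columns and wrap each value in double quotes with concat:

    import org.apache.spark.sql.functions._

    // Wrap every column's value in literal double quotes.
    val quoted = someDF.columns.foldLeft(someDF) { (df, c) =>
      df.withColumn(c, concat(lit("\""), col(c), lit("\"")))
    }
    quoted.show(false)  // e.g. "user1" | "math" | "algebra-1" | "90"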
0
votes
1 answer

Cannot stream files in subfolders with wildcards in PySpark streaming

This code works only if I set directory = "s3://bucket/folder/2022/10/18/4/*". from pyspark.sql.functions import from_json from pyspark.streaming import StreamingContext ssc = StreamingContext(sc, 30) directory =…
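For comparison, the Structured Streaming file source in Scala accepts glob patterns written out per directory level; whether this matches the bucket layout above is an assumption, and the schema is hypothetical:

    // One wildcard per path segment (year/month/day/hour). Streaming file
    // sources require an explicit schema.
    val stream = spark.readStream
      .schema(inputSchema)
      .json("s3://bucket/folder/*/*/*/*/*")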