
I have an Iceberg table something like:

Columns: [custID, X, uniqueTransId, ------]

[127, 2, a0, -----]
[127, 2, a1, -----]
[127, 3, a2, -----]
[127, 4, a3, -----]
[127, 5, a4, -----]
[128, 6, a5, -----]
[129, 7, a6, -----]
[129, 8, a7, -----]
[130, 2, a8, -----]
[130, 2, a9, -----]
[Null, 3, a10, -----]

Result Columns: [custId, X, uniqueTransId, newField, ------]

[127, 2, a0, 2_3_4_5, -----]
[127, 2, a1, 2_3_4_5, -----]
[127, 3, a2, 2_4_5, -----]
[127, 4, a3, 2_3_5, -----]
[127, 5, a4, 2_3_4, -----]
[128, 6, a5, Null, -----]
[129, 7, a6, 8, -----]
[129, 8, a7, 7, -----]
[130, 2, a8, 2, -----]
[130, 2, a9, 2, -----]
[Null, 3, a10, Null, -----]

How can we achieve this?

For the same customer ID, I need to put the X values of all the other rows into a new column. For example: for custId 127 there are five rows, with X values 2, 2, 3, 4, 5. For each row I need the distinct X values of the other transactions: for transaction a1 that is 2, 3, 4 and 5; similarly, for transaction a2 it is 2, 4 and 5.

For custId 128, there are no other rows with the same custID, so there is no value for the new field and it should be set to Null in the result.

Maybe it doesn't need a self-join and some other query can do the work here, but with my limited understanding it looks like a self-join problem.
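To pin down the logic being asked for, here is a plain-Python sketch over the sample rows above (illustrative only, not the Spark implementation; note the expected output keeps only the distinct X values of the *other* transactions, matched by uniqueTransId):

```python
from collections import defaultdict

# Sample rows: (custID, X, uniqueTransId); None models SQL NULL
rows = [
    (127, 2, "a0"), (127, 2, "a1"), (127, 3, "a2"), (127, 4, "a3"),
    (127, 5, "a4"), (128, 6, "a5"), (129, 7, "a6"), (129, 8, "a7"),
    (130, 2, "a8"), (130, 2, "a9"), (None, 3, "a10"),
]

# Index (X, uniqueTransId) pairs by customer so each row can find its peers
by_cust = defaultdict(list)
for cust, x, trans_id in rows:
    by_cust[cust].append((x, trans_id))

def new_field(cust, trans_id):
    """Sorted distinct X values of the other transactions, joined with '_'."""
    if cust is None:
        return None
    others = {x for x, t in by_cust[cust] if t != trans_id}
    return "_".join(str(x) for x in sorted(others)) or None

result = [(cust, x, t, new_field(cust, t)) for cust, x, t in rows]
```

Running this over the sample reproduces the result table, e.g. a0 gets 2_3_4_5 and a5 (the only row for custId 128) gets None.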

Alok Singh
  • What is the actual question? How is a self-join relevant? Where do the values for `newfield` come from? (or are they supposed to be generated? if so, explain the logic). Without more information I fear this question cannot be answered. – Paul Maxwell Aug 08 '23 at 04:14
  • The values for newField come from field X of all the other rows where custID is the same. I have updated with more information. Please check. – Alok Singh Aug 08 '23 at 04:18
  • How do you interact with the table normally (e.g. Spark, Hive, Presto) to perform SQL operations? What is the name of the table? – Paul Maxwell Aug 08 '23 at 04:24
  • We use Apache Spark to interact with the table. Table name can be taken as custTable – Alok Singh Aug 08 '23 at 04:26
  • https://stackoverflow.com/questions/76912694/extract-from-list-of-json This is not related to or an extension of this question, but I wanted you to take a look at it. – Alok Singh Aug 16 '23 at 10:30

1 Answer


For SQL (untested):

SELECT
      t1.custID
    , t1.X
    , t1.uniqueTransId
    , concat_ws('_', sort_array(collect_list(t2.X))) AS newField
FROM custTable t1
LEFT JOIN custTable t2
       ON t1.custID = t2.custID
      AND t1.uniqueTransId != t2.uniqueTransId
GROUP BY
      t1.custID
    , t1.X
    , t1.uniqueTransId
ORDER BY
      t1.custID
    , t1.X;

Here it is (in Postgres) as a fiddle: https://dbfiddle.uk/5vOBqjBA


original reply

Try the following, but note it is untested:

from pyspark.sql import SparkSession
from pyspark.sql.functions import collect_list, array_join, expr

# Create a Spark session
spark = SparkSession.builder.getOrCreate()

df = spark.read.format("iceberg").load("custTable")

# Group by 'custId' and collect all 'X' values into an array
df_grouped = df.groupBy("custId").agg(collect_list("X").alias("X_values"))

# Self-join back onto the original rows on 'custId'
df_joined = df.join(df_grouped, "custId")

# Exclude the current row's 'X' value and concatenate the rest with '_'
# (array_remove compares against the row's X, so it must be written as an
#  SQL expression rather than passing the column name as a literal)
df_final = df_joined.withColumn(
    "newField", array_join(expr("array_remove(X_values, X)"), "_")
)

df_final.show()
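One caveat with the snippet above: array_remove deletes every occurrence of the matching value, so when two rows for the same customer share an X (as for custId 130), all copies are dropped rather than only the current row's entry. A plain-Python sketch of the difference (illustrative only, not Spark code):

```python
# What array_remove(X_values, X) does: remove *all* matching values
def remove_all(values, x):
    return [v for v in values if v != x]

# What the expected output needs: drop only the current row's single entry
def remove_one(values, x):
    out = list(values)
    out.remove(x)  # removes one occurrence only
    return out

x_values = [2, 2, 3, 4, 5]      # custId 127's X values
print(remove_all(x_values, 2))  # [3, 4, 5]  -> the other row's 2 is lost
print(remove_one(x_values, 2))  # [2, 3, 4, 5]
```

This is why the comment thread below ends up excluding rows by uniqueTransId instead of by X.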
Paul Maxwell
  • Though Apache Spark is used on the backend, we only pass an SQL query, which is then handed to Spark and executed. So, effectively, I need an SQL query to do the same. (There are many other operations that I am doing with SQL queries only, by doing joins etc. with other tables.) – Alok Singh Aug 08 '23 at 06:03
  • I added an SQL approach but as I don't have an environment to test in I used Postgres - and hope the "translation" works – Paul Maxwell Aug 09 '23 at 05:59
  • This is working fine in Iceberg. One more thing I would like to ask: if I need the list sorted, how can I do that? Can you update your query above for the same? – Alok Singh Aug 09 '23 at 08:28
  • One case we missed is when X is the same for the custID in 2 different transactions. We still need that to be shown. I have updated the example above for custID 130. – Alok Singh Aug 09 '23 at 11:02
  • Also, as in the (updated) example, for custId 127 the X repeats in uniqueTransId a0 and a1, so the new field must contain the X of the other row even if it is the same. Refer to the example result for custId 127 above. – Alok Singh Aug 09 '23 at 11:13
  • Change I did: `ON (t1.custID = t2.custID AND t1.X != t2.X) OR (t1.custID = t2.custID AND t1.X == t2.X AND t1.uniqueTransId != t2.uniqueTransId)` This worked for all the extra things I mentioned. (I made a few tweaks to your solution.) But the order of newField is still not fixed. – Alok Singh Aug 09 '23 at 11:37
  • newField should come in order, as 2_3_4_5, and not as 3_2_4_5 or similar. Can you help with this? – Alok Singh Aug 09 '23 at 11:44
  • `sort_array(collect_list(t2.X))` worked for me. – Alok Singh Aug 09 '23 at 11:50
  • Good to know. So you do have the solution now? – Paul Maxwell Aug 09 '23 at 21:53
  • Yes, the solution works. One problem with this approach is that it takes around 40 min to run, as the table has around 2.5M rows. Is there some optimization we can do? – Alok Singh Aug 11 '23 at 09:20
  • Ask another question for optimisation. In that question you will need the query, the explain plan and current indexing of the tables. Add the url of that question, under this question, as a comment. – Paul Maxwell Aug 12 '23 at 00:09