
I have two columns, for example:

from to
1 2
1 3
2 4
4 2
4 2
4 3
3 3

And I want to create a transition matrix (where the values in each column add up to 1):

     1     2     3     4

1    0     0     0     0
2    0.5*  0     0     2/3
3    0.5   0     1     1/3
4    0     1     0     0

where the entry for 1 -> 2 (marked *) would be: (the number of times 1 (in 'from') is next to 2 (in 'to')) / (the total number of times 1 points to any value). For example, 1 appears twice in 'from', once pointing to 2 and once to 3, so both of those entries are 0.5; 4 appears three times, pointing to 2 twice (2/3) and to 3 once (1/3).

LaC
  • Your sample expected output does not match with the transformation requested. Please provide correct sample input/expected output. – Azhar Khan Nov 26 '22 at 10:42
  • @AzharKhan yes I updated the question. Let me know if it is more clear – LaC Nov 26 '22 at 15:15

1 Answer


You can create this kind of transition matrix using a window and pivot.

First, some dummy data:

import pandas as pd
import numpy as np
from pyspark.sql import SparkSession

# if `spark` does not already exist (e.g. outside a notebook/shell):
spark = SparkSession.builder.getOrCreate()

# 100 random transitions between the states 1-4
np.random.seed(42)
x = np.random.randint(1, 5, 100)
y = np.random.randint(1, 5, 100)

df = spark.createDataFrame(pd.DataFrame({'from': x, 'to': y}))
df.show()
+----+---+
|from| to|
+----+---+
|   3|  3|
|   4|  2|
|   1|  2|
...

To create a pct column, first group the data by unique combinations of from/to and get the counts. On that aggregated dataframe, create a new column, pct, that uses a Window to find the total number of records in each from group, which serves as the denominator.

Lastly, pivot the table so the to values become columns and the pct data becomes the values of the matrix.

from pyspark.sql import functions as F, Window

# denominator: the total number of records for each 'from' value
w = Window().partitionBy('from')
grp = df.groupBy('from', 'to').count().withColumn('pct', F.col('count') / F.sum('count').over(w))

# pivot so each 'to' value becomes a column; each row then sums to 1
res = grp.groupBy('from').pivot('to').agg(F.round(F.first('pct'), 2))
res.show()
+----+----+----+----+----+
|from|   1|   2|   3|   4|
+----+----+----+----+----+
|   1| 0.2| 0.2|0.25|0.35|
|   2|0.27|0.31|0.19|0.23|
|   3|0.46|0.17|0.21|0.17|
|   4|0.13|0.13| 0.5|0.23|
+----+----+----+----+----+
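
One caveat worth noting: pivot leaves a null in any cell whose from/to combination never occurs (the random dummy data happens to cover every combination, but the small sample in the question would not). A minimal follow-up, assuming you want explicit zeros in those cells:

res.na.fill(0).show()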
Zelazny7
  • Hi, thank you for your help @Zelazny7! This works! But I actually need the sum of the columns to be equal to 1 and not the sum of the rows. I also changed the order of to and from, as below – LaC Nov 26 '22 at 14:52
  • w = Window().partitionBy('to')
    grp = data.groupBy('to', 'from').count().withColumn('pct', F.col('count') / F.sum('count').over(w))
    res = grp.groupBy('to').pivot('from').agg(F.round(F.first('pct'), 2))
    res.show() – LaC Nov 26 '22 at 14:52
  • @LaC, I had the same understanding as Zelazny7 when you said you need the sum of columns. But it looks like what you are looking for is: for a given column, the sum of all its rows. Well in that case, just transpose the matrix. This, being a summary table, should work with "toPandas().T", or if you want the Spark way then check https://stackoverflow.com/questions/74574775/pivoting-a-single-row-dataframe-where-groupby-can-not-be-applied/74579589#74579589. – Azhar Khan Nov 27 '22 at 05:03
  • Hi, thank you for your help. I want to do it in Spark, and it looks like it is not that easy if you have that many rows/columns. @AzharKhan any chance we can sum the rows given one column? – LaC Nov 27 '22 at 10:15
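
Following up on the last two comments: instead of transposing afterwards, a minimal sketch (the names w_col, grp_col, res_col are just illustrative) that keeps from as the rows and to as the columns while making each column sum to 1 is to partition the window by to but still pivot on to:

from pyspark.sql import functions as F, Window

# denominator: the total number of records for each 'to' value
w_col = Window().partitionBy('to')
grp_col = df.groupBy('from', 'to').count().withColumn('pct', F.col('count') / F.sum('count').over(w_col))

# rows are 'from', columns are 'to'; each column now sums to 1
res_col = grp_col.groupBy('from').pivot('to').agg(F.round(F.first('pct'), 2))
res_col.show()

Each pivoted column t then holds count(from, t) / count(t), so the values in that column add up to 1, which matches the updated requirement without a transpose.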