I have a dataframe:

data = [['p1', 't1'], ['p4', 't2'], ['p2', 't1'], ['p4', 't3'],
        ['p4', 't3'], ['p3', 't1']]
sdf = spark.createDataFrame(data, schema = ['id', 'text'])
sdf.show()
+---+----+
| id|text|
+---+----+
| p1|  t1|
| p4|  t2|
| p2|  t1|
| p4|  t3|
| p4|  t3|
| p3|  t1|
+---+----+

I want to number the rows by text: rows with the same text should share the same rank, and the rank should increase only when the text changes. For example:

+---+----+----+
| id|text|rank|
+---+----+----+
| p1|  t1|   1|
| p2|  t1|   1|
| p3|  t1|   1|
| p4|  t2|   2|
| p4|  t3|   3|
| p4|  t3|   3|
+---+----+----+

Unfortunately, the rank function does not give what I need.

from pyspark.sql import Window
from pyspark.sql import functions as F

w = Window.partitionBy("text").orderBy("id")
sdf2 = sdf.withColumn("rank", F.rank().over(w))
sdf2.show()
+---+----+----+
| id|text|rank|
+---+----+----+
| p1|  t1|   1|
| p2|  t1|   2|
| p3|  t1|   3|
| p4|  t2|   1|
| p4|  t3|   1|
| p4|  t3|   1|
+---+----+----+
  • You want to group only by text or by id too? For instance, if the third row was `p3, t2` instead of `p3, t1`, what would that rank be? – Ric S Mar 13 '23 at 14:54
  • Yes, I want to group only by text. If the third row were p3, t2 instead of p3, t1, the rank would be 2. – Rory Mar 13 '23 at 14:55

1 Answer

It seems you are not looking to rank your observations within a group, but to convert a categorical variable into a numeric one. You can do this with StringIndexer. Note that the resulting index is 0-based, so add 1 if the ranks must start at 1:

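from pyspark.ml.feature import StringIndexer
from pyspark.sql import functions as F

# 'alphabetAsc' numbers the labels in ascending alphabetical order
indexer = StringIndexer(inputCol='text', outputCol='rank', stringOrderType='alphabetAsc')
indexer_fitted = indexer.fit(sdf)
sdf = indexer_fitted.transform(sdf)

# StringIndexer outputs a double column, so cast it to int
sdf = sdf.withColumn('rank', F.col('rank').cast('int'))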
sdf.show()
+---+----+----+
| id|text|rank|
+---+----+----+
| p1|  t1|   0|
| p2|  t1|   0|
| p3|  t1|   0|
| p4|  t2|   1|
| p4|  t3|   2|
| p4|  t3|   2|
+---+----+----+
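
To make the mapping concrete, here is a minimal plain-Python sketch (no Spark involved) of what the alphabetical index amounts to on the question's data. The names `index_of` and `ranked` are made up for illustration. The sketch shifts the index by 1 to get the 1-based rank the question asks for, and it sorts the rows only to make the output easy to read.

```python
# Same rows as the question's dataframe, as plain (id, text) tuples.
data = [("p1", "t1"), ("p4", "t2"), ("p2", "t1"),
        ("p4", "t3"), ("p4", "t3"), ("p3", "t1")]

# Distinct labels in ascending alphabetical order, numbered from 0.
# This mirrors the 0-based index shown in the output above.
distinct_labels = sorted({text for _, text in data})
index_of = {label: i for i, label in enumerate(distinct_labels)}

# Shift by 1 so the ranks start at 1, then sort rows for display.
ranked = sorted((id_, text, index_of[text] + 1) for id_, text in data)

for row in ranked:
    print(row)
```

Staying within Spark, `F.dense_rank().over(Window.orderBy('text'))` should give the same 1-based numbering without the ML import. A window with no `partitionBy` moves all rows into a single partition, though, so that alternative is probably only reasonable for small data.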