1

I am 'translating' Python code to PySpark and would like to use an existing column as the index for a DataFrame. I did this in Python using pandas; the small piece of code below shows what I did. Thanks for helping.

df.set_index('colx', drop=False, inplace=True)
# Sort the index
df.sort_index(inplace=True)

I expect the result to be a DataFrame with 'colx' as the index.

  • 3
  • Spark DataFrames do not have a concept of an index (or order in general). You *can* do `df = df.sort("colx")` but that's primarily for display purposes and you can't rely on that order for computations (without using a `Window`). Or maybe you want to add a [`row_number` ordering by `colx`?](https://stackoverflow.com/a/46740396/5858851) (see the sketch after these comments) – pault May 30 '19 at 17:20
  • Possible duplicate of [Spark Dataframe :How to add a index Column : Aka Distributed Data Index](https://stackoverflow.com/questions/43406887/spark-dataframe-how-to-add-a-index-column-aka-distributed-data-index) – Ram Ghadiyaram Sep 12 '19 at 19:31
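
Following up on pault's comment, here is a minimal sketch of the `row_number`-over-a-`Window` approach; the sample data is hypothetical and stands in for the asker's DataFrame:

from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# hypothetical data standing in for the asker's DataFrame
df = spark.createDataFrame([(3, 'c'), (1, 'a'), (2, 'b')], ['colx', 'coly'])

# row_number over a Window ordered by colx yields a stable, 1-based
# index column; with no partitionBy, every row passes through a single
# partition, so this can be expensive on large data
w = Window.orderBy('colx')
df_indexed = df.withColumn('index', F.row_number().over(w))
df_indexed.show()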

2 Answers

1

Add an index to the PySpark DataFrame as a column and use it:

rdd_df = df.rdd.zipWithIndex()
df_index = rdd_df.toDF()
# toDF() yields two columns: _1 (the original Row as a struct) and
# _2 (the index); extract the original columns from the _1 struct
df_index = df_index.withColumn('colA', df_index['_1'].getItem('colA'))
df_index = df_index.withColumn('colB', df_index['_1'].getItem('colB'))
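
To mirror the pandas `sort_index` call, one can then rename the generated `_2` position column, drop the packed `_1` struct, and sort; a sketch continuing from `df_index` above:

# _2 holds the position produced by zipWithIndex; keep it as the
# 'index' column and sort by it (the order only holds per action,
# it is not a persistent property of the DataFrame)
df_index = df_index.withColumnRenamed('_2', 'index').drop('_1')
df_index = df_index.sort('index')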
hanzgs
  • 1,498
  • 17
  • 44
0

This is not how it works with Spark; no such concept of an index exists.

One can add an index column with `zipWithIndex` by converting the DataFrame to an RDD and back, but that is a new column of data, not an index in the pandas sense, so it is not the same thing.
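
A compact sketch of the round trip described above, assuming an existing DataFrame `df`; the result carries each row's position as ordinary data, not as a pandas-style index:

from pyspark.sql import Row

# zipWithIndex pairs each Row with its position; flatten each pair
# back into a Row so toDF() yields the original columns plus 'index'
df_with_idx = (
    df.rdd
      .zipWithIndex()
      .map(lambda pair: Row(**pair[0].asDict(), index=pair[1]))
      .toDF()
)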

Ram Ghadiyaram
  • 28,239
  • 13
  • 95
  • 121
thebluephantom
  • 16,458
  • 8
  • 40
  • 83