
I defined the following function in PySpark:

from pyspark.sql.types import LongType

def add_ids(X):
    # extend the schema with a non-nullable long field "id_col"
    schema_new = X.schema.add("id_col", LongType(), False)
    # zip each row with its index and rebuild the dataframe
    _X = X.rdd.zipWithIndex().map(lambda l: list(l[0]) + [l[1]]).toDF(schema_new)
    # move id_col (the last column) to the front
    cols_arranged = [_X.columns[-1]] + _X.columns[0:len(_X.columns) - 1]
    return _X.select(*cols_arranged)

In the function above, I'm creating a new column (named id_col) that holds the index number of each row, appending it to the dataframe, and finally moving id_col to the leftmost position.

The data I'm using

>>> X.show(4)
+-----------+-------+-------------+-------------+-------+----+------------------------+---+-------+
|Pregnancies|Glucose|BloodPressure|SkinThickness|Insulin| BMI|DiabetesPedigreeFunction|Age|Outcome|
+-----------+-------+-------------+-------------+-------+----+------------------------+---+-------+
|          6|    148|           72|           35|      0|33.6|                   0.627| 50|      1|
|          1|     85|           66|           29|      0|26.6|                   0.351| 31|      0|
|          8|    183|           64|            0|      0|23.3|                   0.672| 32|      1|
|          1|     89|           66|           23|     94|28.1|                   0.167| 21|      0|
+-----------+-------+-------------+-------------+-------+----+------------------------+---+-------+
only showing top 4 rows

Output of the function

>>> add_ids(X).show(4)
+------+-----------+-------+-------------+-------------+-------+----+------------------------+---+-------+
|id_col|Pregnancies|Glucose|BloodPressure|SkinThickness|Insulin| BMI|DiabetesPedigreeFunction|Age|Outcome|
+------+-----------+-------+-------------+-------------+-------+----+------------------------+---+-------+
|     0|          6|    148|           72|           35|      0|33.6|                   0.627| 50|      1|
|     1|          1|     85|           66|           29|      0|26.6|                   0.351| 31|      0|
|     2|          8|    183|           64|            0|      0|23.3|                   0.672| 32|      1|
|     3|          1|     89|           66|           23|     94|28.1|                   0.167| 21|      0|
+------+-----------+-------+-------------+-------------+-------+----+------------------------+---+-------+
only showing top 4 rows

All of this works fine, but the issue shows up when I run the following two commands:

>>> X.show(4)
+-----------+-------+-------------+-------------+-------+----+------------------------+---+-------+
|Pregnancies|Glucose|BloodPressure|SkinThickness|Insulin| BMI|DiabetesPedigreeFunction|Age|Outcome|
+-----------+-------+-------------+-------------+-------+----+------------------------+---+-------+
|          6|    148|           72|           35|      0|33.6|                   0.627| 50|      1|
|          1|     85|           66|           29|      0|26.6|                   0.351| 31|      0|
|          8|    183|           64|            0|      0|23.3|                   0.672| 32|      1|
|          1|     89|           66|           23|     94|28.1|                   0.167| 21|      0|
+-----------+-------+-------------+-------------+-------+----+------------------------+---+-------+
only showing top 4 rows

>>> X.columns
['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome', 'id_col']

If you look at the result of X.columns, you'll notice id_col at the end. But when I ran X.show(4) a line earlier, it didn't show id_col as a column.

Now when I run add_ids(X).show(4) a second time, I get the following error:

pyspark.sql.utils.AnalysisException: "Reference 'id_col' is ambiguous, could be: id_col, id_col.;"

What is it that I am doing wrong?

  • Just echoing the answer below, your issue is that even though the function returns a new DataFrame, you modify the schema of `X` each time you call `add_ids(X)`. After the first call, `X` already has an `id_col`. That's why you have an error on the second call. – pault Sep 11 '18 at 13:59
  • @pault But `X` is within the function scope, right? Why is that changing my dataframe outside that scope? – Clock Slave Sep 11 '18 at 15:51

1 Answer


The mistake is here:

schema_new = X.schema.add("id_col", LongType(), False)

If you check the source, you'll see that the `add` method modifies the schema in place.

It is easier to see with a simplified example:

>>> from pyspark.sql.types import *
>>> schema = StructType()
>>> schema.add(StructField("foo", IntegerType()))
StructType(List(StructField(foo,IntegerType,true)))
>>> schema
StructType(List(StructField(foo,IntegerType,true)))

As you can see, the schema object has been modified in place.

Instead of using the `add` method, you should rebuild the schema:

schema_new = StructType(schema.fields + [StructField("id_col", LongType(), False)])
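
Applied to the original function, a minimal fixed sketch could look like this (same logic as in the question, only the schema construction changed):

from pyspark.sql.types import StructType, StructField, LongType

def add_ids(X):
    # build a brand new StructType instead of mutating X.schema in place
    schema_new = StructType(X.schema.fields + [StructField("id_col", LongType(), False)])
    _X = X.rdd.zipWithIndex().map(lambda l: list(l[0]) + [l[1]]).toDF(schema_new)
    # move id_col (the last column) to the front
    cols_arranged = [_X.columns[-1]] + _X.columns[:-1]
    return _X.select(*cols_arranged)

With this version, calling add_ids(X) repeatedly works, because X.schema is never modified.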

Alternatively, you can create a deep copy of the object:

>>> import copy
>>> old_schema = StructType()
>>> new_schema = copy.deepcopy(old_schema).add(StructField("foo", IntegerType()))
>>> old_schema
StructType(List())
>>> new_schema
StructType(List(StructField(foo,IntegerType,true)))
  • Your answer works for me, so I accepted it. One thing still confuses me, though: `X` is within the function scope of `add_ids`, right? Why is that changing my dataframe outside of that? – Clock Slave Sep 11 '18 at 15:52
  • There is nothing unusual or Spark-specific about it. You pass a reference to a mutable object to the function, so when the object is modified, the changes are visible outside the closure. There is no copy-on-change here, as there is, for example, in R. – zero323 Sep 11 '18 at 16:17
  • Could you elaborate a little more, or point me somewhere I can read more about this? For example, if I have a simple function like `def func(a): a = a + 2; return a`, declare `a = 2`, and then call `func(a)`, it doesn't change the value of `a`; it only returns `4`. I believe something is lacking in my understanding, but I don't know what terms to google for more info. – Clock Slave Sep 11 '18 at 17:52
  • https://stackoverflow.com/a/986145 would be a good place to start. Any decent introductory Python book should have a good explanation of the topic, but I cannot recommend anything in particular at the moment. – zero323 Sep 11 '18 at 19:59
  • @ClockSlave your example doesn't show what's happening because integers are [immutable](https://stackoverflow.com/questions/8056130/immutable-vs-mutable-types). Instead try `def func(a): a += [2]; return a` with `a = [1]`. – pault Sep 11 '18 at 20:26
  • Oh. Now I see what you are talking about. I was passing a mutable object all along. Thanks, @pault, user6910411. – Clock Slave Sep 12 '18 at 01:23
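
To make the mutability point from the comments concrete, here is a small runnable sketch (plain Python, no Spark required) contrasting Clock Slave's integer example with pault's list example:

def func_int(a):
    a = a + 2    # rebinds the local name; the caller's int is untouched
    return a

def func_list(a):
    a += [2]     # mutates the list object the caller also references
    return a

n = 2
func_int(n)
print(n)    # 2 -- unchanged, ints are immutable

lst = [1]
func_list(lst)
print(lst)  # [1, 2] -- changed, the same object was mutated

This is exactly what happens with `X.schema.add`: the function mutates the schema object that `X` still references, so the change is visible outside `add_ids`.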