
I am trying to join two PySpark dataframes. In the long run, I want to change a value in df1 if it has a match in df2. I am working in a Jupyter notebook.

When I try the join and afterwards call .show(), everything succeeds. However, as soon as I try to convert the result to a Pandas dataframe, I get a ValueError.

Code that works fine:

temp = df_1.join(df_2, on='number').limit(30)
temp.show()

I already found Pyspark - saveAsTable throws index error while show() dataframe works perfectly. However, other than increasing the limit, I have also tried to count; both operations succeeded, so I am not entirely sure whether this is caused by lazy evaluation.

Code that causes trouble:

temp = df_1.join(df_2, on='number').limit(30)
temp.toPandas()

What I would like to see is the well-formatted first few lines of the joined table.

Whenever I use .toPandas(), I get a ValueError:

ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

This error message does not make much sense to me in this context.

Does anybody have any hints? Could this be due to duplicate column headers? Am I overlooking something?
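For what it's worth, plain pandas raises this exact message whenever a boolean test hits a Series, and duplicate column labels make that easy to trigger, because selecting a duplicated label returns a DataFrame instead of a Series. A minimal sketch outside Spark (the column name here is made up):

```python
import pandas as pd

# A frame with a duplicated column label, as a join of two sources
# that both carry a column "number" might produce.
df = pd.DataFrame([[1, 2], [3, 4]], columns=["number", "number"])

col = df["number"]         # duplicated label -> a DataFrame, not a Series
print(type(col).__name__)  # DataFrame

try:
    bool(col.any(axis=1))  # boolean test on a Series raises ValueError
except ValueError as e:
    print(e)               # "The truth value of a Series is ambiguous. ..."
```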

edit: I am trying to provide some more code:

from pyspark.sql import functions as f
from pyspark.sql.window import Window

w2 = Window.partitionBy(df_0a["key"]).orderBy(df_0a["int_value_tf"].desc())
df_1 = df_0a.select('*', f.rank().over(w2).alias('rn'))\
    .filter(f.col("rn") == 1).drop("rn")

df_2 = df_0a.join(df_0b, df_0a.number == df_0b.c_number)\
    .drop(df_0b.c_number)\
    .withColumn("pn", f.when(f.col("p") == "NaN", f.col("ic1")).otherwise(f.col("p")))\
    .filter("NOT pn == 'NaN'")

@cs95: I know that this still isn't really a copy-and-paste reproducible example. If it helps, I can try to pseudonymize the first few lines of all the involved tables?

Both dataframes are fairly large; df_0b does have different column names from df_0a.

Anne
  • Do you have code to reproduce the problem? – cs95 Aug 18 '19 at 18:09
  • Does temp.show(30) work (given that temp only has 30 rows because you're limiting it)? Could be that there is one corrupted row (maybe introduced by some UDF), which would only be evaluated when calling toPandas() and not for show() (if it's not in the first 20 rows), and also not in count(). – Paul Aug 18 '19 at 20:54
  • So, it turns out that the error was thrown because the columns had duplicate names – I did a select on a subselection and the join worked fine. I find the error rather unintuitive. – Anne Aug 19 '19 at 07:54
  • you got it right – vikrant rana Aug 23 '19 at 17:47

1 Answer


Promoting Anne's answer from the comments above, for future reference:

This happens when columns have duplicate names, even if they were aliased and selected with .select():

joined = df_a.join(df_b, on="c")
joined.select("df_a.colname", "df_b.colname").toPandas()  <-- does not work
joined.select("df_a.colname", "df_b.colname").show()      <-- works correctly
zalmane