
I'm trying to run a function across the Cartesian product of two PySpark DataFrames with:

joined = dataframe1.rdd.cartesian(dataframe2.rdd)
collected = joined.collect()
for tuple in collected:
    print tuple

# new_rdd = joined.map(function_to_pass_in)

But I get the following error:

---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-72-bf547304ed8b> in <module>()
    29 collected = joined.collect()
    30 for tuple in collected:
---> 31     print tuple
    32 
    33 # new_rdd = joined.map(function_to_pass_in)

/opt/spark/spark-1.3.0/python/pyspark/sql/types.pyc in __repr__(self)
  1212             # call collect __repr__ for nested objects
  1213             return ("Row(%s)" % ", ".join("%s=%r" % (n, getattr(self, n))
-> 1214                                           for n in self.__FIELDS__))
  1215 
  1216         def __reduce__(self):

/opt/spark/spark-1.3.0/python/pyspark/sql/types.pyc in <genexpr>((n,))
  1212             # call collect __repr__ for nested objects
  1213             return ("Row(%s)" % ", ".join("%s=%r" % (n, getattr(self, n))
-> 1214                                           for n in self.__FIELDS__))
  1215 
  1216         def __reduce__(self):

IndexError: tuple index out of range

Interestingly enough, the following code works without error:

joined = dataframe1.rdd.cartesian(dataframe2.rdd)
print joined.count()
for tuple in joined.collect():
    print tuple

Why does calling `.count()` on the resulting RDD make this work? Shouldn't it work without that? Am I missing something?

lukewitmer
  • Could you provide a reproducible example? – zero323 Jun 25 '15 at 21:20
  • On a side note, an idiomatic way of obtaining the Cartesian product would be to use join: `joined = dataframe1.join(dataframe2)` (see the sketch after these comments). – zero323 Jun 25 '15 at 21:23
  • When I put together a simple example, I don't get the error. I'm digging into what I'm doing differently in my more complex code, which I can't paste here yet. More info coming today, hopefully. Thanks for asking for the example! It means I'm doing something else wrong somewhere. Also, good point on the alternative way of obtaining the Cartesian product with join. – lukewitmer Jun 26 '15 at 14:30
  • I think the reason for my error is that my Spark DataFrames contain variable amounts of data, derived from JSON with no particular format or schema. Rather than checking the schema of each row, Spark takes the schema from the first row, which is then incorrect for subsequent rows. I have implemented a workaround where I store my JSON in an RDD of objects instead of populating the DataFrame with the actual data (a hypothetical sketch of this approach follows the comments below). So far so good. – lukewitmer Jun 26 '15 at 21:38
  • @lukewitmer If I may ask, would you mind sharing the code you used to deal with this issue? I am new to Spark and am running into this issue. – Manuel G Mar 07 '16 at 13:24
  • @ManuelG very sorry, but the code I used is proprietary for my previous employer. Is your data of no particular format or schema, or is it all structured the same way? If it is structured the same, it should work. Otherwise, can you create some kind of object to hold the disparate data then fill the RDD with those objects? – lukewitmer Mar 09 '16 at 01:47
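
As a concrete version of zero323's join suggestion above, here is a minimal sketch, assuming dataframe1 and dataframe2 are the DataFrames from the question. In Spark 1.3, a join() call with no join condition yields the Cartesian product as a DataFrame, so the rows keep a consistent schema:

# Sketch of the join-based alternative from the comments.
# Assumes dataframe1 and dataframe2 are the DataFrames in the question.
joined_df = dataframe1.join(dataframe2)  # no join condition => Cartesian product
for row in joined_df.collect():
    print row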
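
Since the actual workaround code is proprietary (see the last comment), the following is only a hypothetical sketch of the approach lukewitmer describes: parse each record into a plain Python dict and keep the dicts in an RDD, so no schema is ever inferred from the first row. It assumes one JSON object per input line; the sc variable and the file paths are placeholders, not from the original post:

import json

# Hypothetical reconstruction of the workaround described above: hold the
# raw JSON as plain Python dicts in an RDD instead of letting a DataFrame
# infer a schema from the first row. "sc" is an existing SparkContext;
# the file paths are placeholders.
rdd1 = sc.textFile("data1.json").map(json.loads)
rdd2 = sc.textFile("data2.json").map(json.loads)

joined = rdd1.cartesian(rdd2)
for pair in joined.collect():
    print pair  # each element is a (dict, dict) pair; no Row __repr__ involved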

0 Answers