I have an RDD that contains the following [('column 1',value), ('column 2',value), ('column 3',value), ... , ('column 100',value)]. I want to create a dataframe that contains a single column with tuples.

The closest I have gotten is:

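from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType, StringType, StructField, StructType

schema = StructType([StructField("char", StringType(), False), StructField("count", IntegerType(), False)])
my_udf = udf(lambda w, c: (w, c), schema)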

and then

df.select(my_udf('char', 'int').alias('char_int'))

but this produces a dataframe with a column of lists, not tuples.

kostas
  • There is no such thing as column of tuples. Struct is the only representation for product types. – zero323 Jul 08 '16 at 11:55
  • I understand that but that does not help me with my question. Starting with a dataframe that contains two columns, how do I end up with a dataframe that contains a single column which is a tuple of the previous two? – kostas Jul 08 '16 at 12:12
  • I found [this](http://stackoverflow.com/questions/32799595/how-to-merge-two-columns-of-a-dataframe-in-spark-into-one-2-tuple) question that seems sort of similar. Maybe the answers posted there are helpful. – ffmmmm Jul 08 '16 at 12:23

1 Answer

struct is the correct way to represent product types, like tuple, in Spark SQL, and this is exactly what you get using your code:

df = (sc.parallelize([("a", 1)]).toDF(["char", "int"])
    .select(my_udf("char", "int").alias("pair")))
df.printSchema()

## root
##  |-- pair: struct (nullable = true)
##  |    |-- char: string (nullable = false)
##  |    |-- count: integer (nullable = false)

There is no other way to represent a tuple unless you want to create a UDT (no longer supported in 2.0.0) or store pickled objects as BinaryType.
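
To make the pickling option concrete, here is a rough sketch of only the serialization round trip, in plain Python. The function names `pack_pair` and `unpack_pair` are hypothetical; in Spark, `pack_pair` would presumably be wrapped as a UDF with a `BinaryType()` return type, and that wiring is an assumption that is not shown here.

```python
import pickle


def pack_pair(w, c):
    """Serialize a 2-tuple to bytes, suitable for a binary column."""
    return pickle.dumps((w, c))


def unpack_pair(blob):
    """Restore the tuple; bytes() also accepts a bytearray input."""
    return pickle.loads(bytes(blob))


blob = pack_pair("a", 1)
restored = unpack_pair(blob)
```

The trade-off is that such a column is opaque to Spark SQL: its fields cannot be selected or filtered without unpickling in Python first, which is why a struct is usually the better fit.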

Moreover, a struct value is represented locally as a tuple:

isinstance(df.first().pair, tuple)
## True

I guess you may be confused by square brackets when you call show:

df.show()

## +-----+
## | pair|
## +-----+
## |[a,1]|
## +-----+

The brackets are simply the representation chosen by the JVM counterpart when rendering the output and don't indicate Python types.

zero323