
How can I transpose rows to columns using an RDD or DataFrame, without using pivot?

SessionId,date,orig,dest,legind,nbr

1,9/20/16,abc0,xyz0,o,1
1,9/20/16,abc1,xyz1,o,2
1,9/20/16,abc2,xyz2,i,3
1,9/20/16,abc3,xyz3,i,4

I want to generate a new schema like:

SessionId,date,orig1,orig2,orig3,orig4,dest1,dest2,dest3,dest4

1,9/20/16,abc0,abc1,null,null,xyz0,xyz1,null,null

The logic is:

  • if nbr is 1 and legind = o, then orig1 takes that row's value (fetched from row 1) ...

  • if nbr is 3 and legind = i, then dest1 takes that row's value (fetched from row 3)

So how can I transpose the rows to columns?

Any idea will be greatly appreciated.

I tried the option below, but it just flattens everything into a single row:

val keys = List("SessionId")
val selectFirstValueOfNonGroupedColumns =
  df.columns
    .filterNot(keys.toSet)
    .map(_ -> "first").toMap
val grouped =
  df.groupBy(keys.head, keys.tail: _*)
    .agg(selectFirstValueOfNonGroupedColumns)
grouped.show()
  • Welcome to SO. Please spend some time on [help](http://stackoverflow.com/help/how-to-ask) to learn how to ask and format questions clearly. – Ram Ghadiyaram Oct 01 '16 at 17:59
  • Yes it is straightforward and it already has an answer! Follow the link I gave. – The Archetypal Paul Oct 01 '16 at 18:28
  • @TheArchetypalPaul : the link you provided, it's different question and approach. – Ankur Oct 01 '16 at 18:34
  • Ah, nothing to do with transpose then. It's not at all clear what your desired output is – please provide more data. It's particularly odd that you have abc0 and abc2 in the output, but xyz0, xyz1. It may be obvious to you what you're doing here, but it certainly isn't obvious to a reader without knowledge of the context. – The Archetypal Paul Oct 01 '16 at 18:38
  • My mistake... thanks for pointing out , output should be : `1,9/20/16,abc0,abc1,null, null, xyz0,xyz1, null, null` – Ankur Oct 01 '16 at 18:42
  • @TheArchetypalPaul Can you remove the duplicate tag? It's not a duplicate question. – Ankur Oct 01 '16 at 19:16
  • I have voted to re-open. – The Archetypal Paul Oct 01 '16 at 20:27

1 Answer


It is relatively simple if you use the pivot function. First, let's create a data set like the one in your question:

import org.apache.spark.sql.functions.{concat, first, lit, when}
import spark.implicits._  // for the $"colName" syntax and toDF

val df = Seq(
  ("1", "9/20/16", "abc0", "xyz0", "o", "1"),
  ("1", "9/20/16", "abc1", "xyz1", "o", "2"),
  ("1", "9/20/16", "abc2", "xyz2", "i", "3"),
  ("1", "9/20/16", "abc3", "xyz3", "i", "4")
).toDF("SessionId", "date", "orig", "dest", "legind", "nbr")

Then define and attach the helper columns:

// This will be the column name
val key = when($"legind" === "o", concat(lit("orig"), $"nbr"))
           .when($"legind" === "i", concat(lit("dest"), $"nbr"))

// This will be the value
val value = when($"legind" === "o", $"orig")     // If o take origin
              .when($"legind" === "i", $"dest")  // If i take dest

val withKV = df.withColumn("key", key).withColumn("value", value)

This will result in a DataFrame like this:

+---------+-------+----+----+------+---+-----+-----+
|SessionId|   date|orig|dest|legind|nbr|  key|value|
+---------+-------+----+----+------+---+-----+-----+
|        1|9/20/16|abc0|xyz0|     o|  1|orig1| abc0|
|        1|9/20/16|abc1|xyz1|     o|  2|orig2| abc1|
|        1|9/20/16|abc2|xyz2|     i|  3|dest3| xyz2|
|        1|9/20/16|abc3|xyz3|     i|  4|dest4| xyz3|
+---------+-------+----+----+------+---+-----+-----+

Next let's define a list of possible levels:

val levels = Seq("orig", "dest").flatMap(x => (1 to 4).map(y => s"$x$y"))
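A side note (my addition, not part of the original answer): if you call `pivot("key")` without the explicit level list, Spark first runs an extra job to collect the distinct key values, and keys absent from the data produce no column at all. Passing `levels` skips that pass and guarantees all eight columns, with nulls where a key is missing:

```scala
// Without explicit levels: Spark computes the distinct keys first, and only
// the keys actually present (orig1, orig2, dest3, dest4) become columns.
withKV.groupBy($"sessionId", $"date").pivot("key").agg(first($"value", true))

// With explicit levels: no extra distinct job, all eight columns guaranteed.
withKV.groupBy($"sessionId", $"date").pivot("key", levels).agg(first($"value", true))
```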

and finally pivot:

val result = withKV
  .groupBy($"sessionId", $"date")
  .pivot("key", levels)
  .agg(first($"value", true))

result.show

And the result is:

+---------+-------+-----+-----+-----+-----+-----+-----+-----+-----+
|sessionId|   date|orig1|orig2|orig3|orig4|dest1|dest2|dest3|dest4|
+---------+-------+-----+-----+-----+-----+-----+-----+-----+-----+
|        1|9/20/16| abc0| abc1| null| null| null| null| xyz2| xyz3|
+---------+-------+-----+-----+-----+-----+-----+-----+-----+-----+
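Note that this numbers the dest columns by the original nbr (dest3, dest4), whereas the expected output in the question restarts the numbering within each legind group (the first i row becomes dest1). If that is the intent, one possible sketch (my assumption, not part of the original answer) is to derive the position with a window function instead of using nbr directly:

```scala
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{concat, first, lit, row_number, when}

// Rank rows within each (SessionId, date, legind) group by nbr, so the
// first "i" row maps to dest1 rather than dest3.
val w = Window.partitionBy($"SessionId", $"date", $"legind").orderBy($"nbr")

val key2 = concat(
  when($"legind" === "o", lit("orig")).otherwise(lit("dest")),
  row_number().over(w).cast("string"))

val value2 = when($"legind" === "o", $"orig").otherwise($"dest")

df.withColumn("key", key2).withColumn("value", value2)
  .groupBy($"SessionId", $"date")
  .pivot("key", Seq("orig1", "orig2", "orig3", "orig4",
                    "dest1", "dest2", "dest3", "dest4"))
  .agg(first($"value", true))
  .show
```

With the sample data this should put xyz2 and xyz3 under dest1 and dest2, matching the layout the question asks for.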