
I have an RDD like this:

1 2 3
4 5 6
7 8 9

It is a matrix. Now I want to transpose the RDD like this:

1 4 7
2 5 8
3 6 9

How can I do this?

赵祥宇

3 Answers


Say you have an N×M matrix.

If both N and M are so small that you can hold N×M items in memory, it doesn't make much sense to use an RDD. But transposing it is easy:

val rdd = sc.parallelize(Seq(Seq(1, 2, 3), Seq(4, 5, 6), Seq(7, 8, 9)))
// Collect the whole matrix to the driver, transpose it locally, then re-parallelize it.
val transposed = sc.parallelize(rdd.collect.toSeq.transpose)

If N or M is so large that you cannot hold N or M entries in memory, then you cannot have an RDD row of this size. Either the original or the transposed matrix is impossible to represent in this case.
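As noted in the comments below, you could still store such a matrix in coordinate form, as an RDD[((Long, Long), Int)] of (row, column) -> value entries, and then transposing is trivial. A sketch on the toy matrix:

// The example matrix in coordinate form: (row, column) -> value.
val coo = sc.parallelize(Seq(
  ((0L, 0L), 1), ((0L, 1L), 2), ((0L, 2L), 3),
  ((1L, 0L), 4), ((1L, 1L), 5), ((1L, 2L), 6),
  ((2L, 0L), 7), ((2L, 1L), 8), ((2L, 2L), 9)))
// Transposing just swaps the row and column index of every entry.
val transposedCoo = coo.map { case ((row, column), value) => ((column, row), value) }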

N and M may be of a medium size: you can hold N or M entries in memory, but you cannot hold N×M entries. In this case you have to blow up the matrix and put it together again:

val rdd = sc.parallelize(Seq(Seq(1, 2, 3), Seq(4, 5, 6), Seq(7, 8, 9)))
// Split the matrix into one number per line.
val byColumnAndRow = rdd.zipWithIndex.flatMap {
  case (row, rowIndex) => row.zipWithIndex.map {
    case (number, columnIndex) => columnIndex -> (rowIndex, number)
  }
}
// Build up the transposed matrix. Group and sort by column index first.
val byColumn = byColumnAndRow.groupByKey.sortByKey().values
// Then sort by row index.
val transposed = byColumn.map {
  indexedRow => indexedRow.toSeq.sortBy(_._1).map(_._2)
}
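For the toy matrix you can sanity-check the result on the driver (collect is fine here only because the example is tiny):

transposed.collect.foreach(row => println(row.mkString(" ")))
// 1 4 7
// 2 5 8
// 3 6 9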
Daniel Darabos
  • Actually I have a very big matrix in a text file: 100000×100000. – 赵祥宇 Apr 01 '15 at 13:09
  • In the text file it is like what I say in the question: 1 2 3 \n 4 5 6 \n 7 8 9. Now I have to transpose the text file, and I don't think your method can work; it may run out of memory (an array of 100000×100000 is too large for memory). Do you have another method? – 赵祥宇 Apr 01 '15 at 13:12
  • You're right, I did not consider this case. I'll update the answer, hopefully with something useful! – Daniel Darabos Apr 01 '15 at 13:16
  • Done. Please take another look! – Daniel Darabos Apr 01 '15 at 13:34
  • @DevanMS: The case when N or M is very large is already addressed. In this case either the input or the output RDD cannot be represented in Spark as an `RDD[Seq[Int]]`. You could always represent it as an `RDD[((Long, Long), Int)]` (`(row, column) -> value`), but then transposing is trivial. What is your specific question? It's probably best to ask it in a separate question and maybe drop a link here. – Daniel Darabos Jan 11 '16 at 14:27
  • @DanielDarabos Can you provide the third solution for Dataset? I have a huge dataset, with multiple columns of type Array[String], all of same size (for one row). Using an explode and zip/transpose function takes too long. Thank you so much. – D. Müller Nov 07 '19 at 20:15
  • Oh, that sounds like a tough problem! Sorry, I don't know how to do it. If you figure something out, consider posting it as an answer here! It may be useful for others too. – Daniel Darabos Nov 08 '19 at 11:52

A first draft without using collect(), so everything runs on the worker side and nothing is done on the driver:

val rdd = sc.parallelize(Seq(Seq(1, 2, 3), Seq(4, 5, 6), Seq(7, 8, 9)))

rdd.flatMap(row => row.zipWithIndex)       // pair each element with its column position (indexOf would pick the wrong position for duplicate values)
   .map { case (value, col) => (col, value) } // key by column position
   .groupByKey().sortByKey()               // regroup on column position, so all elements of the first column end up in the first row
   .map(_._2)                              // discard the key, keep only the values

The problem with this solution is that the columns in the transposed matrix will end up shuffled if the operation is performed in a distributed system. I will think of an improved version.

My idea is that in addition to attaching the 'column number' to each element of the matrix, we also attach the 'row number'. We could then key by column position and regroup by key like in the example, but reorder each row by the row number and strip the row/column numbers from the result. I just don't have a way to know the row number when importing a file into an RDD.
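As it turns out (see the comments below), zipWithIndex supplies exactly that missing row number. A sketch of the idea, which ends up close to Daniel's answer:

val transposed = rdd.zipWithIndex // zipWithIndex provides the missing row number
  .flatMap { case (row, rowIdx) =>
    row.zipWithIndex.map { case (value, colIdx) => (colIdx, (rowIdx, value)) }
  }
  .groupByKey().sortByKey() // one group per column of the original matrix
  .values
  .map(column => column.toSeq.sortBy(_._1).map(_._2)) // reorder each row by row number, strip the indices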

You might think it is heavy to attach a column and a row number to each matrix element, but I guess that is the price to pay for the ability to process your input in chunks in a distributed fashion, and thus handle huge matrices.

I will update the answer when I find a solution to the ordering problem.

Martin
  • My missing part was the zipWithIndex from Daniel's answer. I didn't know about this one, so thanks for making me learn something. I didn't test his solution, but indeed zipWithIndex gives you the missing row number information and can thus be used to reorder the transposed rows. – Martin Apr 01 '15 at 14:04
  • I have tried Daniel's solution and it is correct. As you say, you missed zipWithIndex. Thank you for your answer! – 赵祥宇 Apr 07 '15 at 03:38
  • Great solution @Martin. Could you please tell me how I can write the same for Java 7 (without lambda expressions)? – Rajiur Rahman May 01 '15 at 23:20
  • @RajiurRahman I'm not sure you really want to write Spark code in Java 7, because without lambdas that must just be a pain in the a**. For the two lambdas I used you'll have to define separate functions. Spark is a good excuse to move to Scala or at least to Java 8. – Martin May 06 '15 at 11:53

As of Spark 1.6 you can use the pivot operation on DataFrames. Depending on the actual shape of your data, if you put it into a DataFrame you can pivot columns to rows. The Databricks blog post on pivoting is very useful, as it describes in detail a number of pivoting use cases with code examples.
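A minimal sketch of the idea on the toy matrix (the coordinate-form layout, the column names, and the choice of first as the aggregation are my assumptions, not from the blog post; assumes a sqlContext as in Spark 1.6):

import org.apache.spark.sql.functions.first

// The example matrix in coordinate form: one (row, col, value) record per cell.
val df = sqlContext.createDataFrame(Seq(
  (0, 0, 1), (0, 1, 2), (0, 2, 3),
  (1, 0, 4), (1, 1, 5), (1, 2, 6),
  (2, 0, 7), (2, 1, 8), (2, 2, 9))).toDF("row", "col", "value")

// Pivot the row index into columns: each output row is one column of the original matrix.
val transposedDf = df.groupBy("col").pivot("row").agg(first("value")).orderBy("col")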

51zero
  • your first link is dead (http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.GroupedData) – Juh_ Jun 14 '17 at 15:14