
I have the following DataFrame:

val transactions_with_counts = sqlContext.sql(
  """SELECT user_id AS user_id, category_id AS category_id,
  COUNT(category_id) FROM transactions GROUP BY user_id, category_id""")

I'm trying to convert the rows to Rating objects, but since x(0) returns Any this fails:

val ratings = transactions_with_counts
  .map(x => Rating(x(0).toInt, x(1).toInt, x(2).toInt))

error: value toInt is not a member of Any

Sam

2 Answers


Let's start with some dummy data:

// needed for toDF on a local Seq (the spark-shell imports this automatically)
import sqlContext.implicits._

val transactions = Seq((1, 2), (1, 4), (2, 3)).toDF("user_id", "category_id")

val transactions_with_counts = transactions
  .groupBy($"user_id", $"category_id")
  .count

transactions_with_counts.printSchema

// root
// |-- user_id: integer (nullable = false)
// |-- category_id: integer (nullable = false)
// |-- count: long (nullable = false)
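
The snippets below construct Rating objects from these rows; they assume the MLlib Rating case class is in scope (presumably org.apache.spark.mllib.recommendation.Rating, since the question does not show its imports):

import org.apache.spark.mllib.recommendation.Rating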

There are a few ways to access Row values and keep expected types:

  1. Pattern matching

    import org.apache.spark.sql.Row
    
    transactions_with_counts.map{
      case Row(user_id: Int, category_id: Int, rating: Long) =>
        Rating(user_id, category_id, rating)
    } 
    
  2. Typed get* methods like getInt, getLong:

    transactions_with_counts.map(
      r => Rating(r.getInt(0), r.getInt(1), r.getLong(2))
    )
    
  3. getAs method, which can use both names and indices:

    transactions_with_counts.map(r => Rating(
      r.getAs[Int]("user_id"), r.getAs[Int]("category_id"), r.getAs[Long](2)
    ))
    

    It can be used to properly extract user-defined types, including mllib.linalg.Vector. Obviously, accessing by name requires a schema.

  4. Converting to statically typed Dataset (Spark 1.6+ / 2.0+):

    transactions_with_counts.as[(Int, Int, Long)]
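
    As a minimal sketch (not part of the original answer), the typed Dataset can then be mapped straight into the question's Rating objects, assuming the mllib Rating class and the SQLContext / SparkSession implicits are imported:

    import org.apache.spark.mllib.recommendation.Rating

    val ratings = transactions_with_counts
      .as[(Int, Int, Long)]
      .map { case (user, category, count) => Rating(user, category, count.toDouble) }

    Calling .rdd on the result gives an RDD[Rating], which is what mllib's ALS.train expects.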
    
zero323
  • Which is the most efficient way out of above four approaches you have mentioned....? – Dilan Aug 11 '17 at 11:15
  • @Dilan Pattern matching and statically typed options might be slower (the latter one has some other performance implications). `getAs[_]` and `get*` should be similar but are painful to use. – zero323 Aug 11 '17 at 11:33
  • 1. What does it mean by "the latter one has some other performance implication"...? 2. Does getAs[_] and get* better than the pattern matching in terms of performance ? – Dilan Aug 12 '17 at 07:00
  • I'm using the 1st method described above for `DataFrame` having **nullable** columns like this `case Row(usrId: Int, usrName: String, null, usrMobile: Int) => ...` and `case Row(usrId: Int, usrName: String, usrAge: Int, null) => ...` which results in long *case expressions* (and I have several cases). Is there a cleaner way (more condensed, less boilerplate / repetitive stuff) to do it? Please answer with an example. – y2k-shubham Jan 22 '18 at 07:53
  • Excellent answers Mr 0323. – thebluephantom Aug 14 '19 at 18:22

Using Datasets, you can define Rating as follows:

case class Rating(user_id: Int, category_id: Int, count: Long)

The Rating class here uses the field name 'count' instead of 'rating' (which zero323's answer used). The rating value is then obtained as follows:

val transactions_with_counts = transactions.groupBy($"user_id", $"category_id").count

val rating = transactions_with_counts.as[Rating]

This way you will not run into runtime errors, because the Rating class field name is identical to the 'count' column name that Spark generates at runtime.
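
Since the result is statically typed, the fields can then be used by name without any casting; a minimal sketch (assuming the dummy transactions data from the answer above):

// no getAs or asInstanceOf needed; each element is a Rating, so fields are typed
val userCounts = rating.map(r => (r.user_id, r.count))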

user-asterix