
Currently I have an input file (millions of records) where every record contains a 2-character Identifier. Multiple lines in this input file will be concatenated into only one record in the output file, and how this is determined is SOLELY based on the sequential order of the Identifier.

For example, the records would begin as below

1A
1B
1C
2A
2B
2C
1A
1C
2B
2C
1A
1B
1C

1A marks the beginning of a new record, so the output file would have 3 records in this case. Everything between the "1A"s will be combined into one record

1A+1B+1C+2A+2B+2C
1A+1C+2B+2C
1A+1B+1C

The number of records between the "1A"s varies, so I have to iterate through and check the Identifier.

I am unsure how to approach this situation using Scala/Spark. My strategy is to:

  1. Load the input file into a DataFrame.

  2. Create an Identifier column based on a substring of each record.

  3. Create a new column, TempID, and a variable x that is set to 0.

  4. Iterate through the DataFrame.

  5. If Identifier = 1A, then x = x + 1.

  6. TempID = variable x.

Then create a UDF to concatenate records with the same TempID.

To summarize my question: how would I iterate through the DataFrame, check the value of the Identifier column, and then assign a TempID (whose value increases by 1 whenever the value of the Identifier column is 1A)?
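
Just to illustrate the logic I'm after, here is a rough sketch in plain (non-Spark) Scala; the file path and the "+" separator are only placeholders:

import scala.io.Source

// assumes the first line of the file starts with "1A"
val lines = Source.fromFile("/path/to/input.txt").getLines().toList

// start a new group whenever a line's identifier is "1A", otherwise append to the current group
val groups = lines.foldLeft(List.empty[List[String]]) { (acc, line) =>
  if (line.take(2) == "1A") List(line) :: acc
  else (line :: acc.head) :: acc.tail
}.reverse.map(_.reverse)

val output = groups.map(_.mkString("+"))
// List("1A+1B+1C+2A+2B+2C", "1A+1C+2B+2C", "1A+1B+1C")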

zero323

2 Answers


This is dangerous. The issue is that Spark is not guaranteed to keep the same order among elements, especially since they might cross partition boundaries. So when you iterate over them you could get a different order back. This also has to happen entirely sequentially, so at that point why not just skip Spark entirely and run it as regular Scala code as a preprocessing step before getting to Spark?

My recommendation would be to either look into writing a custom input format/data source, or perhaps to use "1A" as a record delimiter, similar to this question.
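
As a rough sketch of the delimiter approach (assuming a plain text file with one line per record, `sc` being the SparkContext, and a placeholder path), Hadoop's TextInputFormat honours a custom record delimiter via the textinputformat.record.delimiter setting:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

// split the file on "\n1A" instead of "\n", so everything between "1A" lines becomes one element
val conf = new Configuration(sc.hadoopConfiguration)
conf.set("textinputformat.record.delimiter", "\n1A")

val records = sc
  .newAPIHadoopFile("/path/to/input.txt", classOf[TextInputFormat],
    classOf[LongWritable], classOf[Text], conf)
  .map { case (_, text) => text.toString }
  // the delimiter itself is consumed, so put the "1A" marker back (the first chunk still has it)
  .map(chunk => if (chunk.startsWith("1A")) chunk else "1A" + chunk)
  .map(_.split("\n").map(_.trim).filter(_.nonEmpty).mkString("+"))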

Joe K

First - usually "iterating" over a DataFrame (or Spark's other distributed collection abstractions like RDD and Dataset) is either wrong or impossible. The term simply does not apply. You should transform these collections using Spark's functions instead of trying to iterate over them.

You can achieve your goal (or - almost, details to follow) using Window Functions. The idea here would be to (1) add an "id" column to sort by, (2) use a Window function (based on that ordering) to count the number of previous instances of "1A", and then (3) use these "counts" as the "group id" that ties all records of each group together, and group by it:

import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window
import spark.implicits._

// sample data:
val df = Seq("1A", "1B", "1C", "2A", "2B", "2C", "1A", "1C", "2B", "2C", "1A", "1B", "1C").toDF("val")

val result = df.withColumn("id", monotonically_increasing_id())          // add row ID
  .withColumn("isDelimiter", when($"val" === "1A", 1).otherwise(0))      // add group "delimiter" indicator
  .withColumn("groupId", sum("isDelimiter").over(Window.orderBy($"id"))) // add groupId using Window function
  .groupBy($"groupId").agg(collect_list($"val") as "list") // NOTE: order of list might not be guaranteed!
  .orderBy($"groupId").drop("groupId")                                   // removing groupId

result.show(false)
// +------------------------+
// |list                    |
// +------------------------+
// |[1A, 1B, 1C, 2A, 2B, 2C]|
// |[1A, 1C, 2B, 2C]        |
// |[1A, 1B, 1C]            |
// +------------------------+

(if having the result as a list does not fit your needs, I'll leave it to you to transform this column to whatever you need)

The major caveat here is that collect_list does not necessarily guarantee preserving order - once you use groupBy, the order is potentially lost. So - the order within each resulting list might be wrong (the separation to groups, however, is necessarily correct). If that's important to you, it can be worked around by collecting a list of a column that also contains the "id" column and using it later to sort these lists.


EDIT: realizing this answer isn't complete without solving this caveat, and realizing it's not trivial - here's how you can solve it:

Define the following UDF:

import scala.collection.mutable
import org.apache.spark.sql.Row

// sorts each collected list of (id, value) structs by id and keeps only the values
val getSortedValues = udf { (input: mutable.Seq[Row]) => input
  .map { case Row(id: Long, v: String) => (id, v) }
  .sortBy(_._1)
  .map(_._2)
}

Then, replace the line `.groupBy($"groupId").agg(collect_list($"val") as "list")` in the suggested solution above with these lines:

  .groupBy($"groupId")
  .agg(collect_list(struct($"id" as "_1", $"val" as "_2")) as "list")
  .withColumn("list", getSortedValues($"list"))

This way we necessarily preserve the order (at the price of sorting these small lists).
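
For completeness, putting those pieces back into the pipeline above would look roughly like this:

val result = df.withColumn("id", monotonically_increasing_id())
  .withColumn("isDelimiter", when($"val" === "1A", 1).otherwise(0))
  .withColumn("groupId", sum("isDelimiter").over(Window.orderBy($"id")))
  .groupBy($"groupId")
  .agg(collect_list(struct($"id" as "_1", $"val" as "_2")) as "list")  // collect (id, val) structs
  .withColumn("list", getSortedValues($"list"))                        // sort each list by id, keep the values
  .orderBy($"groupId").drop("groupId")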

Tzach Zohar
  • I tried this and it worked exactly as I had wanted. It seems that the lists are in order for now, but I am unsure if this will last when I test on a cluster (currently testing in an IDE). For your suggestion to maintain the list order, do you mean I simply do `collect_list($"id") as "list"`? Thank you – Timothy Emanuel Parker Jan 30 '18 at 21:49
  • Good to know! It worked well locally for me too, but indeed I'm not sure if that would be true with real data (partitioned over a number of nodes). Using `"collect_list($"id") as "list"` won't be enough as you'll lose the data itself, you'll need to create some kind of _struct_ that contains both the id and the data and then collect that, e.g. `collect_list(struct($"id", $"val") as "list")` and then find a way to sort these structs and get the "val"s (might require use of a UDF) – Tzach Zohar Jan 30 '18 at 21:59
  • OK - edited answer to show how this can be done as it's not trivial. Hope this helps! – Tzach Zohar Jan 30 '18 at 22:14