
Question:

I have the below two dataframes stored in an array. The data is already partitioned by SECURITY_ID.

Dataframe 1 (DF1):
+-------------+----------+--------+---------+-----------+--------+
| ACC_SECURITY|ACCOUNT_NO|LONG_IND|SHORT_IND|SECURITY_ID|QUANTITY|
+-------------+----------+--------+---------+-----------+--------+
|9161530335G71|  91615303|1111    |     1000|      35G71|  -20000|
|9161530435G71|  91615304|2222    |     2000|      35G71|   -2883|
|9161530235G71|  91615302|3333    |     3000|      35G71|    2000|
|9211530135G71|  92115301|4444    |     4000|      35G71|    8003|
+-------------+----------+--------+---------+-----------+--------+

Dataframe 2 (DF2):
+-------------+----------+--------+---------+-----------+--------+
| ACC_SECURITY|ACCOUNT_NO|LONG_IND|SHORT_IND|SECURITY_ID|QUANTITY|
+-------------+----------+--------+---------+-----------+--------+
|3FA34789290X2|  3FA34789|5555    |     5000|      290X2|  -20000|
|32934789290X2|  32934789|6666    |     6000|      290X2|   -2883|
|00000019290X2|  00000019|7777    |     7000|      290X2|    2000|
|3S534789290X2|  3S534789|8888    |     8000|      290X2|    8003|
+-------------+----------+--------+---------+-----------+--------+

Trial:

How do I process each dataframe separately and, within each dataframe, one row at a time? I tried the below:

def methodA(d1: DataFrame): Unit = {
    val securityIds = d1.select("SECURITY_ID").distinct.collect.flatMap(_.toSeq)
    val bySecurityArray = securityIds.map(securityId => d1.where($"SECURITY_ID" <=> securityId))

    for (i <- 0 until bySecurityArray.length) {
        val allocProcessDF = bySecurityArray(i).toDF()
        print("Number of partitions: " + allocProcessDF.rdd.getNumPartitions)
        methodB(allocProcessDF)
    }
}

def methodB(df: DataFrame): Unit = {
    df.foreachPartition(ds => {

        // Tried both of the below (one at a time); same result.
        // Option 1
        while (ds.hasNext) {
            allocProcess(ds.next())
        }

        // Option 2
        ds.foreach(row => allocProcess(row))

    })
}

I tried to use foreachPartition on each DataFrame coming from bySecurityArray, and then process each row of the resulting partition iterator using foreach.

But I see only the first Dataframe (SECURITY_ID = 35G71) executing, not the second Dataframe (290X2).

Errors received:

19/09/23 08:57:38 ERROR util.Utils: Exception encountered
java.io.StreamCorruptedException: invalid type code: 30
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1601)
    at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2287)
    at java.io.ObjectInputStream.defaultReadObject(ObjectInputStream.java:561)
    at org.apache.spark.rdd.ParallelCollectionPartition$$anonfun$readObject$1.apply$mcV$sp(ParallelCollectionRDD.scala:74)
    at org.apache.spark.rdd.ParallelCollectionPartition$$anonfun$readObject$1.apply(ParallelCollectionRDD.scala:70)
    at org.apache.spark.rdd.ParallelCollectionPartition$$anonfun$readObject$1.apply(ParallelCollectionRDD.scala:70)
    at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1371)
    at org.apache.spark.rdd.ParallelCollectionPartition.readObject(ParallelCollectionRDD.scala:70)
    at sun.reflect.GeneratedMethodAccessor35.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1170)
    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2178)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2069)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1573)
    at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2287)
    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2211)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2069)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1573)
    at java.io.ObjectInputStream.readObject(ObjectInputStream.java:431)
    at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:75)
    at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:114)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:376)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
19/09/23 08:57:38 ERROR util.Utils: Exception encountered

19/09/23 08:57:38 WARN scheduler.TaskSetManager: Lost task 3.0 in stage 218.0 (TID 10452, CANTSHARE_URL, executor 6): java.io.StreamCorruptedException: invalid type code: 30
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1601)
    at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2287)
    at java.io.ObjectInputStream.defaultReadObject(ObjectInputStream.java:561)
    at java.lang.Thread.run(Thread.java:748)

19/09/23 08:57:38 INFO scheduler.DAGScheduler: ShuffleMapStage 218 (run at ThreadPoolExecutor.java:1149) failed in 0.120 s due to Job aborted due to stage failure: Task 9 in stage 218.0 failed 4 times, most recent failure: Lost task 9.3 in stage 218.0 (TID 10466, CANTSHARE_URL, executor 6): java.io.StreamCorruptedException: invalid type code: 30
    at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:75)
    at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:114)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:376)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Voila
  • What is the size of the initial dataframe (d1)? and what is the cardinality of SECURITY_ID? – hagarwal Sep 23 '19 at 17:01
  • The cardinality of SECURITY_ID is around 1000 or so, and each of DF1, DF2, etc. is GROUPED BY SECURITY_ID values; these groups might contain anywhere from 10-100 rows of data. But the issue here is that I am not able to execute the 2nd group onwards. The 1st group executes fine. – Voila Sep 23 '19 at 17:18
  • The above code should work. I have written a sample spark app using the above data and methods, it is working fine in my local system. Could you try with: val bySecurityArray = securityIds.map(securityId => d1.where(d1("SECURITY_ID") === securityId)) – hagarwal Sep 23 '19 at 18:31
  • The code in methodA works fine, but the code in methodB has to work for each partition, and within each partition each row has to execute one after another from the top. Right now the row execution is not in order. This is the part where I need help, and correction if any, at the inner statement... df.foreachPartition(ds => { //Tried below while and also foreach... it's the same result. //Option 1 while (ds.hasNext) { allocProcess(ds.next()) } //Option 2 ds.foreach(row => allocProcess(row)) }) – Voila Sep 24 '19 at 13:19
  • Spark does not preserve order, as data is distributed across partitions. The data could be collected as an array/list to preserve order/sorting; however, that is not recommended, as it can choke the master. Could you please elaborate on the allocProcess() function and what exactly you are trying to achieve? It might be possible to do it in a Spark-native, optimized way. – hagarwal Sep 24 '19 at 13:32
  • When allocProcess acts on a group, it has to process the 1st/2nd rows and store the resulting NET_QUANTITY somewhere. Then, when the 3rd/4th rows get processed and LONG_IND or SHORT_IND is the same, the NET_QUANTITY from memory has to be used while processing them. So for SECURITY_ID = 290X2, say we process the 1st/2nd rows: NET_QUANTITY = -22883 and the LONG_IND are stored in memory. Then when the 3rd/4th rows are processed (say LONG_IND is the same), -22883 is added: 2000 + 8003 + (-22883) = -12880 is the end result. This is again stored in memory. We are trying to see if we can use a custom Accumulator. – Voila Sep 24 '19 at 14:57
  • As each dataset sits in its own partition (by SECURITY_ID), we feel the order of row execution should be maintained. Rows for different SECURITY_IDs don't need to work together to get a result. Only rows in the same partition (SECURITY_ID) need to be processed from top to bottom, based on a PRIORITY_ID assigned to each row. (A rough sketch of this per-group, ordered processing is shown below.) – Voila Sep 24 '19 at 15:03
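
Based on the accumulation logic described in the comments above, here is a minimal sketch of the per-group, ordered processing. It assumes a PRIORITY_ID column exists as mentioned, that QUANTITY is numeric, and that the netting rule is simply "keep a running sum of QUANTITY per indicator"; allocProcess itself is not shown in the question, so it appears only as a hypothetical hook.

import org.apache.spark.sql.{DataFrame, Row}
import scala.collection.mutable

// Sketch of the per-SECURITY_ID processing described in the comments.
def processSecurityGroup(groupDF: DataFrame): Unit = {
  // Each group holds only 10-100 rows (per the comments), so sorting by the
  // assumed PRIORITY_ID column and collecting to the driver keeps the
  // processing order deterministic.
  val orderedRows: Array[Row] = groupDF.sort("PRIORITY_ID").collect()

  // Running NET_QUANTITY kept in plain driver-side memory, keyed by LONG_IND.
  val netQuantity = mutable.Map.empty[String, Long]

  orderedRows.foreach { row =>
    val key = row.get(row.fieldIndex("LONG_IND")).toString
    val qty = row.getAs[Number]("QUANTITY").longValue()
    val updated = netQuantity.getOrElse(key, 0L) + qty
    netQuantity.update(key, updated)
    // allocProcess(row, updated)   // hypothetical hook into the existing logic
  }

  println(netQuantity)   // final net per indicator for this group
}

Because this runs on the driver over an already-filtered, small group, the ordering issue seen with foreachPartition does not arise; whether collecting each group is acceptable depends on the actual group sizes.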

1 Answer


Spark does not preserve order, as data is distributed across partitions; even within a partition the order is not guaranteed, since there can be multiple tasks. To obtain a logical order, a coalesce(1) followed by a sort(cols: _*) operation can be applied to the DataFrame to get a new DataFrame/Dataset sorted by the specified columns, all in ascending order.

def methodA(d1: DataFrame): Unit = {
  val securityIds = d1.select("SECURITY_ID").distinct.collect.flatMap(_.toSeq)
  val bySecurityArray = securityIds.map(securityId => d1.where(d1("SECURITY_ID") === securityId))

  for (i <- 0 until bySecurityArray.length) {
    val allocOneDF = bySecurityArray(i).toDF()
    print("Number of partitions: " + allocOneDF.rdd.getNumPartitions)
    methodB(allocOneDF)
  }
}

def methodB(df: DataFrame): Unit = {
  df.coalesce(1).sort("LONG_IND", "SHORT_IND").foreach(row => {
    println(row)
    //allocProcess(row)
  })
}
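
As a usage sketch, and assuming the two frames from the question are held in an array as described, each one can be passed through methodA. Note that coalesce(1) gives up parallelism within a group, which should be acceptable here since each SECURITY_ID group holds only 10-100 rows according to the comments.

import org.apache.spark.sql.DataFrame

// df1 and df2 are placeholders for the DF1/DF2 shown in the question.
val securityFrames: Array[DataFrame] = Array(df1, df2)
securityFrames.foreach(df => methodA(df))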
hagarwal