Is there any way to convert a Seq[Row] into a DataFrame in Scala? I have a dataframe and a list of strings holding the weight of each row in the input dataframe. I want to build a DataFrame that includes only the rows with unique weights. I was able to filter the unique rows and append them to a Seq[Row], but I want to build a DataFrame from it. This is my code. Thanks in advance.

 // `val` is a reserved word in Scala, so the parameter and set are renamed
 def dataGenerator(input: DataFrame, weights: List[String]): Dataset[Row] = {
    val weightItr = weights.iterator
    var testdata = Seq[Row]()
    val valset = scala.collection.mutable.HashSet[String]()
    input.collect().foreach { r =>
      val valnxt = weightItr.next()
      if (!valset.contains(valnxt)) {
        valset += valnxt
        testdata = testdata :+ r
      }
    }
//logic to convert testdata as DataFrame and return
}
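The intended first-row-per-weight filtering can be sketched with plain Scala collections, assuming the rows and weights line up by index as in the post (the name `firstPerWeight` is illustrative, not from the question):

```scala
// Keep only the first element for each distinct weight, pairing
// elements with weights by position. HashSet.add returns false when
// the weight was already seen, so duplicates are filtered out.
def firstPerWeight[A](rows: Seq[A], weights: Seq[String]): Seq[A] = {
  val seen = scala.collection.mutable.HashSet[String]()
  rows.zip(weights).filter { case (_, w) => seen.add(w) }.map(_._1)
}

val kept = firstPerWeight(
  Seq("r1", "r2", "r3", "r4"),
  Seq("w1", "w2", "w2", "w3"))
// kept == Seq("r1", "r2", "r4")
```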
  • Do you really need a collect to do this? If you try to perform this using the DataFrame methods (probably filter in this case I expect) then you will find that you do not need to convert a Seq[Row] to a DataFrame – mikelegg Mar 01 '21 at 13:39
  • @mikelegg Can you explain in detail how I can achieve this using the filter function –  Mar 01 '21 at 14:18
  •
    I'm not sure I understand what you want to do, please clarify. Is there an element in 'val' for each row in 'input'? And they go together in the same order? And you only want to include the first row for cases where the corresponding 'val' entries are the same? – mikelegg Mar 01 '21 at 15:03
  • Yes, for every row in my dataframe there is a corresponding value in the val list. Now I want to include only the first row if two rows correspond to the same value. –  Mar 01 '21 at 15:28
  • Then there might be a problem related to ordering. Using the order for the relationship between 'input' and 'val' might not be good. How do you know the ordering of 'input' is the same as 'val'? Is the input dataframe in a known order? – mikelegg Mar 01 '21 at 15:48
  • Yes, the ordering for both is the same. val is calculated using fields from inputdf itself –  Mar 01 '21 at 15:57

1 Answer

You said that 'val is calculated using fields from inputdf itself'. If this is the case, then you should be able to make a new dataframe with a new column for the 'val', like this:

+------+------+
|item  |weight|
+------+------+
|item 1|w1    |
|item 2|w2    |
|item 3|w2    |
|item 4|w3    |
|item 5|w4    |
+------+------+

This is the key thing. Then you will be able to work on the dataframe instead of doing a collect.

What is bad about doing collect? Well, there is no point in going to the trouble and overhead of using a distributed big data processing framework just to pull all the data into the memory of one machine. See here: Spark dataframe: collect () vs select ()

When you have the input dataframe how you want it, as above, you can get the result. Here is a way that works, which groups the data by the weight column and picks the first item for each grouping.

    // requires `import spark.implicits._` for the .toDF at the end
    val result = input
        .rdd                              // get the underlying RDD
        .groupBy(r => r.get(1))           // group by the "weight" field
        .map(x => x._2.head.getString(0)) // take the first "item" for each weight
        .toDF("item")                     // back to a DataFrame

Then you get only the first item in case of a duplicated weight:

+------+
|item  |
+------+
|item 1|
|item 2|
|item 4|
|item 5|
+------+
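For reference, once the weight is a column, Spark's built-in dropDuplicates can express the same intent without dropping to the RDD API. A sketch only, assuming a DataFrame `input` with "item" and "weight" columns as above; note that which row counts as "first" per weight is not deterministic unless you also control the ordering:

```scala
// Keep one row per distinct weight, then project the item column.
// dropDuplicates picks an arbitrary surviving row per weight unless
// the data has a deterministic order.
val result = input
  .dropDuplicates("weight")
  .select("item")
```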
mikelegg