-3

I have a DataFrame with a string column. It looks like this:

`+--------------------------------------------------------------------------------------------------------------------------------------+
|queue_sequence                                                                                                                        |
+--------------------------------------------------------------------------------------------------------------------------------------+
|In Queue,In-Progress,Internally,Development Done/ Eng testing,In-Progress,Development Done/ Eng testing,Complete                      |
|In Queue,In-Progress,Complete,In-Progress,Complete                                                                                    |
|In Queue,Development,Development Ready,In Queue,Development,In Queue,Complete                                                         |
|In Queue,Analyze,In-Progress,ISRM,Externally,ISRM,Complete                                                                            |
|In Queue,Complete,In-Progress,Complete                                                                                                |
|In Queue,DSM/UCL,Complete                                                                                                             |
|In Queue,In-Progress,Development Done/ Eng testing,Complete,In Queue,In-Progress,Development Done/ Eng testing,Complete               |
|In Queue,In-Progress,Externally,Development Done/ Eng testing,Complete                                                                |
|In Queue,In-Progress,Development Done/ Eng testing,DSM/UCL,In-Progress,ISRM,In-Progress,Development Done/ Eng testing,Complete        |
|In Queue,Development,Development Ready,In Queue,Development,Development Done/ Eng testing,Development,Complete                        |
|In Queue,In-Progress,In Queue,In-Progress,ISRM,Complete                                                                               |
|In Queue,Development Ready,In-Progress,Done,Complete                                                                                  |`

I want the unique values among all the comma-separated words across the rows.

I have tried the following code:

`df.select("queue_sequence").collect().map(_.mkString)`

and stored the result in a variable, which is an Array[String]:

Array[String] = Array(In Queue,
                      In-Progress,
                      Internally,
                      Development Done/ Eng testing,
                      In-Progress,
                      Development Done/ Eng testing,
                      Complete, 
                      In Queue,
                      In-Progress,
                      Complete,
                      In-Progress,
                      Complete, 
                      In Queue,
                      Analyze,
                      In-Progress,
                      ISRM,
                      Externally,
                      ISRM,
                      Complete, 
                      In Queue,
                      Development,
                      Development Ready,
                      In Queue,
                      Development,
                      In Queue,Complete
                     )

But this list is not unique. How do I get only the distinct values?

I tried the following:

.toSet.toList
.toList.Distinct

I am unable to get distinct words from that array. I tried the above-mentioned methods, but none of them worked.

Leothorn
Vikrant

2 Answers

1

This works normally. Here are some examples with your data:

Your array:

arr: Array[String] = Array(In Queue, In-Progress, Internally, Development Done/ Eng testing, In-Progress, Development Done/ Eng testing, Complete, In Queue, In-Progress, Complete, In-Progress, Complete, In Queue, Analyze, In-Progress, ISRM, Externally, ISRM, Complete, In Queue, Development, Development Ready, In Queue, Development, In Queue, Complete)

Distinct elements:

Approach 1: Use distinct directly on the array

val distinct_array=arr.distinct
distinct_array: Array[String] = Array(In Queue, In-Progress, Internally, Development Done/ Eng testing, Complete, Analyze, ISRM, Externally, Development, Development Ready)

Approach 2: Convert it to a set, which keeps only distinct values and also lets you do unions and intersections

val set_arr=arr.toSet
set_arr: scala.collection.immutable.Set[String] = Set(Complete, ISRM, Development, In Queue, Internally, Development Done/ Eng testing, Analyze, In-Progress, Development Ready, Externally)

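// set2 is a hypothetical second set, defined here only so the examples run
val set2 = Set("Complete", "Cancelled")

// union example
set_arr.union(set2)

// intersection example
set_arr.intersect(set2)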
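One possible reason `distinct` appears not to work: if the array comes from `df.select("queue_sequence").collect().map(_.mkString)` as in the question, each element is presumably a whole comma-separated row rather than a single word, so whole rows are being compared. Under that assumption, the rows need to be split before deduplicating. A minimal sketch in plain Scala (the object and function names are hypothetical, and the two rows are copied from the question's sample data):

```scala
object DistinctStatuses {
  // Split every comma-separated row into its values, drop surrounding
  // whitespace and empty entries, and keep each value once, in order of
  // first appearance.
  def distinctValues(rows: Array[String]): Array[String] =
    rows.flatMap(_.split(",")).map(_.trim).filter(_.nonEmpty).distinct

  def main(args: Array[String]): Unit = {
    // Stand-in for the collected rows; each element is one whole row.
    val rows = Array(
      "In Queue,In-Progress,Complete,In-Progress,Complete",
      "In Queue,DSM/UCL,Complete"
    )
    // prints: In Queue | In-Progress | Complete | DSM/UCL
    println(distinctValues(rows).mkString(" | "))
  }
}
```

Calling `.toSet` on the result instead of `.distinct` gives the same values without any ordering guarantee.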
partha_devArch
  • arr.distinct doesn't work as the array is produced in the following way: `df.select("queue_sequence") .collect() .map(_.mkString)` – Vikrant Sep 05 '19 at 10:55
  • If you have a DataFrame, why do you want to get an array out of it? You can run distinct on the DataFrame column directly, which is more efficient. It is also easier to do union and intersection on a DataFrame. Another important note: applying the `collect()` function pulls all your data into the driver, which is not a good way to handle operations in Spark. – partha_devArch Sep 05 '19 at 11:01
  • Because each row of my df column is a collection of words. First row: `In Queue,In-Progress,Internally,Development Done/ Eng testing,In-Progress,Development Done/ Eng testing,Complete`. Second row: `In Queue,In-Progress,Complete,In-Progress,Complete`. I want a union of all these words. – Vikrant Sep 05 '19 at 11:04
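The suggestion in the comments to stay on the DataFrame could look roughly like the sketch below. It assumes Spark SQL is on the classpath and that the column is named `queue_sequence` as in the question; the object name, the `status` column alias, and the local session are illustrative assumptions, not part of the original posts.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, explode, split, trim}

object DistinctStatusesSpark {
  // Split each row on commas, turn every value into its own row,
  // trim whitespace, and keep one row per distinct value.
  def distinctStatuses(df: DataFrame): DataFrame =
    df.select(explode(split(col("queue_sequence"), ",")).as("status"))
      .select(trim(col("status")).as("status"))
      .distinct()

  def main(args: Array[String]): Unit = {
    // Local session for illustration only; a real job would reuse its own session.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("distinct-statuses")
      .getOrCreate()
    import spark.implicits._

    // Two rows copied from the question's sample data.
    val df = Seq(
      "In Queue,In-Progress,Complete,In-Progress,Complete",
      "In Queue,DSM/UCL,Complete"
    ).toDF("queue_sequence")

    // Row order in the output is not guaranteed.
    distinctStatuses(df).show(truncate = false)

    spark.stop()
  }
}
```

This keeps the deduplication on the executors, so only the small set of distinct values would need to be collected to the driver, if at all.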
0

The easiest way to get the unique elements is to convert the array to a set.

scala> val ar=Array("abc","def","abc")

ar: Array[String] = Array(abc, def, abc)

scala> ar.toSet

res1: scala.collection.immutable.Set[String] = Set(abc, def)

Sam91