
I've been searching through Stack Overflow for several days now and I'm just not finding an answer to the following question. I'm really new to Scala coding, so this might be a very basic question. Any help will be much appreciated.

The problem I'm having (getting an error on) is with the last bit of code.
I'm trying to get a filtered subset of records from a DataFrame, where every record in the subset is missing data in one or more of the specified fields.

I'm using Scala IDE Build 4.7.0 in Eclipse.
The pom.xml file I'm using has spark-core_2.11, version 2.0.0.

Thank you.
Jesse

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructType, StructField, LongType, StringType}

val source_path = args(0)
val source_file = args(1)

val vFile = sc.textFile(source_path + "/" + source_file)

val vSchema = StructType(
            StructField("FIELD_1",LongType,false)::
            StructField("FIELD_2",LongType,false)::
            StructField("FIELD_3",StringType,true)::
            StructField("FIELD_4",StringType,false)::
            StructField("FIELD_ADD_1",StringType,false)::
            StructField("FIELD_ADD_2",StringType,false)::
            StructField("FIELD_ADD_3",StringType,false)::
            StructField("FIELD_ADD_4",StringType,false)::
            StructField("FIELD_5",StringType,false)::
            StructField("FIELD_6",StringType,false)::
            StructField("FIELD_7",StringType,false)::
            StructField("FIELD_8",StringType,false)::
            Nil)

// val vRow = vFile.map(x=>x.split((char)30, -1)).map(x=> Row(
// the delimiter between the quotes below is the non-printing ASCII 30 (hex 1E) record separator character
val vRow = vFile.map(x=>x.split("\u001e", -1)).map(x=> Row(
                            x(1).toLong,
                            x(2).toLong,
                            x(3).toString.trim(),
                            x(4).toString.trim(),
                            x(5).toString.trim(),
                            x(6).toString.trim(),
                            x(7).toString.trim(),
                            x(8).toString.trim(),
                            x(9).toString.trim(),
                            x(10).toString.trim(),
                            x(11).toString.trim(),
                            x(12).toString.trim()
                        ))

val dfData = sqlContext.createDataFrame(vRow.distinct(),vSchema)

val dfBlankRecords = dfData.filter(x => (
                    x.trim(col("FIELD_ADD_1")) == "" ||
                    x.trim(col("FIELD_ADD_2")) == "" ||
                    x.trim(col("FIELD_ADD_3")) == "" ||
                    x.trim(col("FIELD_ADD_4")) == ""
                ))
Zugabo
  • I'd add the `apache-spark` tag for some better visibility. In the `val vRow ...` line, you are doing an `x.split("", -1)`. Is the empty string there intentional? That splits into an array of single characters. – Travis Hegner Jun 21 '18 at 20:22
  • Also, what version of spark? If `>= 1.6` there are better methods for [reading text files](https://stackoverflow.com/a/36766853/2639647) directly into Datasets. – Travis Hegner Jun 21 '18 at 20:31
  • @TravisHegner, there is actually an unprinted character between those quotes that exists as the column delimiter in the file. I was going to try to use the following line, so that it was a little clearer, but I haven't figured out how to correctly write it yet: `val vSrcRow = vSrcFile.map(x=>x.split((char)30, -1)).map(x=> Row(`. Also, according to the pom.xml file I'm using in conjunction with the .scala code file, I have spark-core_2.11, version 2.0.0. I would be more than glad to see a better method for reading text files into datasets, if you're willing to share. – Zugabo Jun 22 '18 at 13:53

1 Answer


The spark.read.* functions read data directly into the Dataset/DataFrame API, avoiding (somewhat) the need for schema definitions and the need to work with the RDD API at all.

val source_path = args(0)
val source_file = args(1)

import spark.implicits._  // needed for the tuple encoder in flatMap and for .toDF()

val dfData = spark.read.textFile(source_path + "/" + source_file)
  .flatMap(l => {
    // split on the ASCII 30 (hex 1E) record separator and trim every field
    val a = l.split('\u001e'.toString, -1).map(_.trim())
    val f1 = a(0).toLong
    val f2 = a(1).toLong
    val Array(f3, f4, fa1, fa2, fa3, fa4, f5, f6, f7, f8) = a.slice(2,12)

    // keep the row only if at least one of the four address fields is blank
    if (fa1 == "" ||
        fa2 == "" ||
        fa3 == "" ||
        fa4 == "") {
      Some(f1, f2, f3, f4, fa1, fa2, fa3, fa4, f5, f6, f7, f8)
    } else {
      None
    }
  }).toDF("FIELD_1", "FIELD_2", "FIELD_3", "FIELD_4",
          "FIELD_ADD_1", "FIELD_ADD_2", "FIELD_ADD_3", "FIELD_ADD_4",
          "FIELD_5", "FIELD_6", "FIELD_7", "FIELD_8")

I think this will result in what you want. I'm sure someone better than myself could optimize this more, and with more succinct code.

Notice the array is zero-indexed; if you were selecting specific fields intentionally, you'll have to adjust those indexes. I'm also unsure whether '\u001e' (decimal 30, hex 1E) is the appropriate value you need for your split string.
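
If you'd rather keep your original DataFrame approach, the last filter in your question can also be written with Column expressions instead of a row lambda. A rough, untested sketch (note `===` instead of `==`, and that `trim` here is the `org.apache.spark.sql.functions` version, not a method on the row):

import org.apache.spark.sql.functions.{col, trim}

// keep rows where any of the four address fields is blank after trimming
val dfBlankRecords = dfData.filter(
                    trim(col("FIELD_ADD_1")) === "" ||
                    trim(col("FIELD_ADD_2")) === "" ||
                    trim(col("FIELD_ADD_3")) === "" ||
                    trim(col("FIELD_ADD_4")) === ""
                )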

Travis Hegner
  • that looks excellent. I'll try to incorporate it into the rest of my code and let you know how it works. Do I need to define only the fields that are long? I'm sorry, I forgot to specify the base of the character. I'm looking for a HEX 1E or an ASCII 30. So, that looks like it should work also. Your help is greatly appreciated!! – Zugabo Jun 22 '18 at 15:33
  • No problem. `.split()` returns an `Array[String]` already so converting the string fields is not necessary. – Travis Hegner Jun 22 '18 at 15:56
  • Thank you very much for the information. I've got most of it adapted to what I'm trying to do. However, the file I'm trying to use is rather large (it has 268 columns). So, when I put all of them listed in the Some() method, I'm getting the following error: "_too many arguments for method apply: (x: A)Some[A] in object Some_". Since, I'm not familiar with the Some() method, I'm not really sure how to fix this. What is the limit on arguments for that method? – Zugabo Jun 25 '18 at 01:40
  • The `Some()` method actually generates an `Option`, which is basically an `Iterable` of length 1. The Option gets flattened out when doing a `.flatMap()`. So the parameters are actually being represented as `Some[TupleX]`, however a `TupleX` is limited in length to 22 items (read: `Tuple2`-`Tuple22`). – Travis Hegner Jun 25 '18 at 12:06
  • Unfortunately, this solution won't work at all with that number of columns. Is it OK to keep the rest of the columns as `String`s? If so, you could emit `Some[(Long, Long, Array[String])]` (see the sketch after these comments), and it would actually be a cleaner solution. If not, what are all of the column types you have? – Travis Hegner Jun 25 '18 at 12:08
  • thank you very much for your help!! I was able to successfully use your suggested solution for one of the files, and I was able to figure out how to use my original option for the second file by driving down the number of columns to just those that I actually needed. It's a bloated file and not all the data was necessary. – Zugabo Jun 26 '18 at 18:23
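
A minimal sketch of the `Some[(Long, Long, Array[String])]` idea from the comments above, assuming only the first two columns are Longs, all remaining columns can stay as Strings, and that the blank check covers whichever indexes hold the address fields (the slice bounds and the output column names below are placeholders):

import spark.implicits._

// carry the two Long key columns as a tuple and the rest of the row as an
// Array[String], which sidesteps the Tuple22 limit entirely
val dfWide = spark.read.textFile(source_path + "/" + source_file)
  .flatMap(l => {
    val a = l.split('\u001e'.toString, -1).map(_.trim())
    // adjust the slice bounds to cover the columns that must be non-blank
    if (a.slice(4, 8).exists(_ == "")) Some((a(0).toLong, a(1).toLong, a.drop(2)))
    else None
  }).toDF("FIELD_1", "FIELD_2", "REMAINING_FIELDS")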