The code below should:

  • iterate over a sequence of strings
  • parse each one as JSON
  • filter out fields whose names could not be used as identifiers in most languages
  • lowercase the remaining names
  • serialize the result as a string

It behaves as expected on small tests, but on an 8.6M item sequence of live data the output sequence is significantly longer than the input sequence:

import org.json4s._
import org.json4s.jackson.JsonMethods._
import org.apache.spark._

val txt = sc.textFile("s3n://...")
val patt="""^[a-zA-Z]\w*$""".r.findFirstIn _
val json = (for {
         line <- txt
         JObject(children) <- parse(line)
         children2 = (for {
           JField(name, value) <- children

           // filter fields with invalid names
           // patt(name) returns Option[String]
           _ <- patt(name)

         } yield JField(name.toLowerCase, value))
} yield compact(render(JObject(children2))))

I have checked that it actually increases the number of unique items, so it is not just duplicating items. Given my understanding of Scala for-comprehensions and json4s, I do not see how this is possible. The large live data collection is a Spark RDD, while my tests used an ordinary Scala Seq, but that should not make any difference.

How can json have more elements than txt in the above code?

Daniel Mahler

2 Answers

Maybe parse(line) is returning multiple JSON objects for a single line?

lmm
  • That was my first thought, but using `val txt = Seq("""{"a": 1, "B": 2, " ": 3}{"c": 4}""")` in the above code only parses the first record, and the return type of `parse` is `JValue`. – Daniel Mahler Oct 27 '14 at 19:42
  • Wait, `parse` returns a `JValue`? And you're then `flatMap`ing over that? Are you sure that's not iterating over every field of the json value or some such? – lmm Oct 27 '14 at 21:51
  • JValue implements `map` and `flatMap`. It behaves somewhat like `Option`, `JNone` acts like `None` and the other subclasses of `JValue` act as if they were wrapped in an invisible `Some`. – Daniel Mahler Oct 27 '14 at 21:57
  • Thanks for making me double check. That is how I found the actual problem. – Daniel Mahler Oct 27 '14 at 22:13

I was not aware that

JObject(children) <- parse(line)

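matches recursively inside the result of parse. So even though parse returns a single value, every nested object inside it also matches the pattern and produces its own binding for children, which is why the output has more elements than the input. The fix is to bind with = instead of <-, so that only the top-level value is matched: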

JObject(children) = parse(line)

The corrected code is:

import org.json4s._
import org.json4s.jackson.JsonMethods._
import org.apache.spark._

val txt = sc.textFile("s3n://...")
val patt="""^[a-zA-Z]\w*$""".r.findFirstIn _
val json = (for {
         line <- txt
         JObject(children) = parse(line) // CHANGED <- TO =
         children2 = (for {
           JField(name, value) <- children

           // filter fields with invalid names
           // patt(name) returns Option[String]
           _ <- patt(name)

         } yield JField(name.toLowerCase, value))
} yield compact(render(JObject(children2))))
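
The difference between the two forms can be checked on a single nested value without Spark. The following is a minimal sketch, assuming Scala 2 and a json4s version where JField is an alias for a (String, JValue) pair; the sample input and the helper topLevelFields are made up for illustration and are not part of json4s.

```scala
import org.json4s._
import org.json4s.jackson.JsonMethods._

// One input "line" that contains a nested object.
val nested = parse("""{"a": 1, "b": {"c": 2}}""")

// Generator form: json4s filters the whole tree, so the pattern also
// matches the nested object and the body runs once per JObject found.
val viaGenerator = for { JObject(children) <- nested } yield children.map(_._1)

// Assignment form: an ordinary pattern match against the top-level value only.
val JObject(topLevel) = nested
val viaAssignment = topLevel.map(_._1)

// Hypothetical helper (not part of json4s): returns the top-level fields,
// or None when the parsed value is not a JSON object.
def topLevelFields(value: JValue): Option[List[JField]] = value match {
  case JObject(fields) => Some(fields)
  case _               => None
}

println(viaGenerator.size)   // one entry per JObject in the tree
println(viaAssignment)       // names of the top-level fields only
```

Note that the assignment form is a plain pattern binding, so a line whose top-level value is not an object (an array or a bare number, for example) should raise a MatchError rather than being skipped. If such lines can occur in the data, an explicit match like the helper above is one way to drop them instead.
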
Daniel Mahler