
I have a bunch of JSON files, each containing a very large array of complex data. The JSON files look something like:

ids.json:

{
    "ids": [1,2,3]
}

names.json:

{
    "names": ["Tyrion","Jaime","Cersei"]
}

and so on. (In reality, the array elements are complex struct objects with tens of fields.)

I want to extract just the tag that specifies what kind of array it contains. Currently I'm using encoding/json to unmarshal the whole file into a map[string]interface{} and iterate through the map but that is too costly an operation.

Is there a faster way of doing this, preferably without unmarshaling the entire data?

Jonathan Hall
  • If the arrays are very large and you only want to read a part of the file you could read it character by character until you have the label you want and then stop. It's not particularly flexible if your JSON object format changes - that's where a parser is handy, but if all the files are like your examples it might be enough to get the job done. – Matt Oct 03 '18 at 06:37
  • Alternatively, there's a streaming API for decoding JSON. This is a much nicer approach. It still might be slower than my first suggestion, but it's more flexible and the next programmer that comes across your code won't hate you for following my advice lol. See this [question](https://stackoverflow.com/questions/31794355/stream-large-json). – Matt Oct 03 '18 at 06:41
  • I have edited the question to emphasise the complexity and largeness of the json file. That's exactly why I can't afford to use unmarshaling. – Himanshu Patel Oct 03 '18 at 07:07
  • Hence the use of a streaming decoder. The key point here is that it does not unmarshal the entire JSON, it only reads one token at a time and you can stop right after you read the key (i.e. before you read the large array value). See @zerkms's answer: he reads the first `{` (manually), then he reads the next item in the JSON, which is what you want, the key `ids`. If you follow my link, there is a nicer way of consuming the first bracket with the `Token()` method. – Matt Oct 03 '18 at 07:38
  • Thanks Matt, that was really helpful :) – Himanshu Patel Oct 03 '18 at 09:08
  • @Matt how would you decode only the key after `dec.Token()`? https://play.golang.org/p/r4PF9FLi8Aq – zerkms Oct 03 '18 at 21:06
  • @zerkms I had a play around with it in my editor and included a new answer to demonstrate. This streaming parser is new to me too. :) – Matt Oct 05 '18 at 06:27

2 Answers


You can advance the reader to just past the opening curly brace, then use json.Decoder to decode only the first value from the reader, which is the first key.

Something along these lines:

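package main

import (
    "encoding/json"
    "fmt"
    "strings"
)

func main() {
    sr := strings.NewReader(`{
    "ids": [1,2,3]
}`)

    // Skip everything up to and including the opening curly brace.
    for {
        b, err := sr.ReadByte()
        if err != nil {
            fmt.Println(err)
            return
        }
        if b == '{' {
            break
        }
    }

    // The reader is now positioned before the first key, so Decode
    // reads only that string and stops before the array value.
    d := json.NewDecoder(sr)

    var key string
    if err := d.Decode(&key); err != nil {
        fmt.Println(err)
        return
    }

    fmt.Println(key)
}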

https://play.golang.org/p/xJJEqj0tFk9

Additionally, you may wrap the io.Reader you obtained from os.Open in a bufio.Reader to avoid multiple single-byte reads against the file.

This solution assumes the contents are a valid JSON object. Not that you could avoid that anyway.

zerkms
  • `strings.NewReader()` takes a string argument. However JSON data is read as a `[]byte`. Is there a workaround to directly use `[]byte`? Because converting it into string is a linear operation. – Himanshu Patel Oct 03 '18 at 08:50
  • Never mind, found it! Do an `os.Open()` and pass the file pointer to `bufio.NewReader()` instead of using `strings.NewReader()`. Thanks anyway :) – Himanshu Patel Oct 03 '18 at 09:01

I had a play around with Decoder.Token() reading one token at a time (see this example, line 87), and this works to extract your array label:

const jsonStream = `{
    "ids": [1,2,3]
}`

dec := json.NewDecoder(strings.NewReader(jsonStream))

t, err := dec.Token()
if err != nil {
    log.Fatal(err)
}

fmt.Printf("First token: %v\n", t)

t, err = dec.Token()
if err != nil {
    log.Fatal(err)
}

fmt.Printf("Second token (array label): %v\n", t)
Matt