
I'm using NiFi to retrieve data and push it to Kafka. I'm currently in the test phase and I'm using a large JSON file.

My JSON file contains 500K records.

Currently, I have a GetFile processor to fetch the file and a SplitJson processor.

JsonPath Expression : $..posts.*

This configuration works with small files containing 50K records, but it crashes on larger files.

My JSON file looks like this, with the 500K records in "posts":[]

{ 
    "meta":{ 
        "requestid":"request1000",
        "http_code":200,
        "network":"twitter",
        "query_type":"realtime",
        "limit":10,
        "page":0
    },
    "posts":[ 
        { 
            "network":"twitter",
            "posted":"posted1",
            "postid":"id1",
            "text":"text1",
            "lang":"lang1",
            "type":"type1",
            "sentiment":"sentiment1",
            "url":"url1"
        },
        { 
            "network":"twitter",
            "posted":"posted2",
            "postid":"id2",
            "text":"text2",
            "lang":"lang2",
            "type":"type2",
            "sentiment":"sentiment2",
            "url":"url2"
        }
    ]
}

I read some documentation on this problem, but the topics are about text files, and people suggest chaining several SplitText processors to split the file progressively. With a rigid structure like my JSON, I don't see how I can do that.

I'm looking for a solution that handles 500K records reliably.

BastienB

3 Answers


Try using the SplitRecord processor in NiFi.

Define Record Reader/Writer controller services in the SplitRecord processor.

Then configure Records Per Split to 1 and use the splits relationship for further processing.
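
A minimal sketch of that setup (the reader/writer services and the downstream routing here are assumptions for this JSON layout, not something stated in the answer):

    SplitRecord
        Record Reader     : JsonTreeReader controller service
        Record Writer     : JsonRecordSetWriter controller service
        Records Per Split : 1

Route the splits relationship onward (for example to a PublishKafkaRecord processor) and handle the original and failure relationships separately.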

(OR)

If you want to flatten and fork the record, use the ForkRecord processor in NiFi.

For usage, refer to this link.

notNull

Unfortunately I think this case (large array inside a record) is not handled very well right now...

SplitJson requires the entire flow file to be read into memory, and it also doesn't have an outgoing split size. So this won't work.

SplitRecord would generally be the correct solution, but currently there are two JSON record readers: JsonTreeReader and JsonPathReader. Both of them stream records, but the issue here is that there is only one huge record, so each will read the entire document into memory.

There have been a couple of efforts around this specific problem, but unfortunately none of them have made it into a release.

This PR, which is now closed, added a new JSON record reader that could stream records starting from a JSON path, which in your case could be $.posts:

https://github.com/apache/nifi/pull/3222

With that reader you wouldn't even do a split; you would just send the flow file to PublishKafkaRecord_2_0 (or whichever version of PublishKafkaRecord is appropriate), and it would read each record and publish it to Kafka.

There is also an open PR for a new SelectJson processor that looks like it could potentially help:

https://github.com/apache/nifi/pull/3455

Bryan Bende
  • I tried the solution with SplitRecord but I had the same issue. I'm using Hortonworks, so I think the JSON record reader isn't implemented in my NiFi version. – BastienB Oct 29 '19 at 09:35

I had the same issue with JSON and ended up writing a streaming parser.

Use the ExecuteGroovyScript processor with the following code.

It should split the large incoming file into small ones:

@Grab(group='acme.groovy', module='acmejson', version='20200120')
import groovyx.acme.json.AcmeJsonParser
import groovyx.acme.json.AcmeJsonOutput

def ff=session.get()
if(!ff)return

def objMeta=null
def count=0


ff.read().withReader("UTF-8"){reader->
    new AcmeJsonParser().withFilter{
        onValue('$.meta'){ 
            //just remember it to use later
            objMeta=it 
        }
        onValue('$.posts.[*]'){objPost->
            def ffOut = ff.clone(false) //clone without content
            ffOut.post_index=count      //add attribute with the post index
            //write small json
            ffOut.write("UTF-8"){writer->
                AcmeJsonOutput.writeJson([meta:objMeta, post:objPost], writer, true)
            }
            REL_SUCCESS << ffOut        //transfer to success
            count++
        }
    }.parse(reader)
}
ff.remove()

Output file example:

{
  "meta": {
    "requestid": "request1000",
    "http_code": 200,
    "network": "twitter",
    "query_type": "realtime",
    "limit": 10,
    "page": 0
  },
  "post": {
    "network": "twitter",
    "posted": "posted11",
    "postid": "id11",
    "text": "text11",
    "lang": "lang11",
    "type": "type11",
    "sentiment": "sentiment11",
    "url": "url11"
  }
}
daggett