I'm processing a file and I'd like to remove (trim) the first X header lines so that only the data remains, preferably without using regular expressions.

Thanks

Dennis Jaheruddin
Filippo Loddo

1 Answer

You can remove the first X header lines by using the ExecuteScript processor in NiFi.

The following is an example Jython script which I wrote for myself:

from org.apache.commons.io import IOUtils
from java.nio.charset import StandardCharsets
from org.apache.nifi.processor.io import StreamCallback

# Number of header lines to drop from the start of the flow file content
HEADER_LINES = 3

class PyStreamCallback(StreamCallback):
    def __init__(self):
        pass

    def process(self, inputStream, outputStream):
        # Reads the whole content into memory as a list of lines
        lines = IOUtils.readLines(inputStream, StandardCharsets.UTF_8)
        for line in lines[HEADER_LINES:]:
            # Encode explicitly so the Java OutputStream receives bytes
            outputStream.write(bytearray((line + "\n").encode('utf-8')))

flowFile = session.get()
if flowFile is not None:
    flowFile = session.write(flowFile, PyStreamCallback())
    # Renames the output file; adjust or remove this line to suit your flow
    flowFile = session.putAttribute(flowFile, "filename", flowFile.getAttribute('filename').split('.')[0] + '_translated.json')
    session.transfer(flowFile, REL_SUCCESS)

This removes the first 3 lines, but you can easily change that number to remove more or fewer lines.

Hope that helps.

Biplob Biswas
  • Thanks! Can this be done in Python as well? No need to prepare the code, I just want to know if ExecuteScript can be written in Python. – Filippo Loddo Feb 10 '17 at 08:05
  • Short answer: **No**. Long answer: **Maybe**. The script engine uses Jython internally, so in ExecuteScript you are limited to pure Python modules and would have to make your code work with those. You can get more information [here](https://community.hortonworks.com/questions/53645/cannot-use-numpy-or-scipy-in-python-in-nifi-execut.html). It says that if other Python modules are needed, "consider ExecuteProcess or (if you have incoming flow files) ExecuteStreamCommand which can execute the command-line python." If you liked the answer, please consider upvoting. Thanks. – Biplob Biswas Feb 10 '17 at 16:10
  • @BiplobBiswas, after removing n header lines, can we send each line to a separate flowfile instead of a single flowfile? – Mister X Jun 14 '17 at 12:34
  • @prabhu This is a separate question, and I would generally recommend creating a new question for it. Still, to create individual flowfiles from a single flowfile, try the SplitText processor. For more info on the processor, see https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-standard-nar/1.3.0/org.apache.nifi.processors.standard.SplitText/index.html – Biplob Biswas Jun 16 '17 at 10:54
  • I have a 10 GB file. Processing it in SplitText takes a huge amount of heap memory to create the splits and sometimes makes the UI hang, which is why I tried ExecuteScript. – Mister X Jun 16 '17 at 10:57
  • I am not sure, but maybe you can try two stages of SplitText: first split by 30k-40k lines (Line Split Count = 30k-40k), then use a second SplitText with Line Split Count = 1. If that doesn't work, maybe add another stage in between. I am really sorry, but I don't know a better way to split such a huge file using NiFi. – Biplob Biswas Jun 16 '17 at 13:12
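
For the command-line Python route mentioned in the comments (ExecuteStreamCommand), a minimal sketch of a standalone script might look like the one below. It assumes the processor hands the flow file content to the command on stdin and takes the command's stdout as the new content; `skip_header` and `main` are hypothetical names chosen for illustration, not part of NiFi. Lines are streamed one at a time, so the file is never loaded into memory in full, which may matter for very large inputs.

```python
import sys
from itertools import islice


def skip_header(src, dst, header_lines):
    """Copy src to dst, dropping the first header_lines lines.

    src is any iterable of lines (for example a binary file object);
    lines are written to dst as they are read.
    """
    for line in islice(src, header_lines, None):
        dst.write(line)


def main(argv):
    # Hypothetical command line: python skip_header.py 3 < input > output
    # Defaults to 3 header lines, matching the Jython example above.
    count = int(argv[1]) if len(argv) > 1 else 3
    # Binary streams keep the bytes untouched, whatever the encoding is
    skip_header(sys.stdin.buffer, sys.stdout.buffer, count)
```

To use it as a script, call `main(sys.argv)` at the bottom of the file and pass the number of header lines as the first command argument.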