Groovy script for streamsets to parse string of about 1500 characters

Question

This is for StreamSets; I am trying to write a Groovy script. I have a string of about 1500 characters with no delimiter. The pattern is: the first 4 characters are a code, the next 4 characters are the length of a word, and then comes the word itself. The pattern then repeats: 4 characters of code, 4 characters of word length, followed by the word. E.g. 22010005PHONE00010002IN00780004ROSE

When decoded, it looks like this:

2201 - code 0005 - Length of the word PHONE - Word

0001 - code 0002 - Length of the word IN - Word

0078 - code 0004 - Length of the word ROSE - Word, and so on.

I need help with a Groovy script that builds a string from the words whose code starts with 00. For the example above, the final string would be INROSE.

I am trying to do this with a while loop and str:substring. Any help is very much appreciated.

Thanks

def dtx_buf = record.value['TXN_BUFFER']
def fieldid = []
def fieldlen = []
def dtx_out = []
def i = 13
def j = 0
while (i < dtx_buf.size())
{    
//   values = record.value['TXN_BUFFER']
    fieldid[j] = str.substring(values,j,4)      
    output.write(record)
}

Expected result "INROSE"

tim_yates · Answer 1 · 2019-05-02T10:41:24.337

One way would be to write an Iterator that contains the rules for parsing the input:

class Tokeniser implements Iterator {
    String buf
    String code
    String len
    String word

    // hasNext is true if there are still chars left in `buf`
    // (a non-empty String is truthy in Groovy)
    boolean hasNext() { buf }

    Object next() {
        // Get the code and the remaining string
        (code, buf) = token(buf)

        // Get the length and the remaining string
        (len, buf) = token(buf)

        // Get the word (of the given length), and the remaining string
        (word, buf) = token(buf, len as Integer)

        // Return a map of the code and the word
        [code: code, word: word]
    }

    // This splits the string into the first `length` chars, and the rest
    private token(String input, int length = 4) {
        [input.take(length), input.drop(length)]
    }

}

Then, we can use this to do:

def result = new Tokeniser(buf: '22010005PHONE00010002IN00780004ROSE')
    .findAll { it.code.startsWith('00') }
    .word
    .join()

And result is INROSE
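
As a small illustration of what the token helper above relies on, here is how take and drop behave on a fragment of the sample input. The variable names are only for this sketch. Note that both methods are lenient on short input, so a truncated buffer would not fail at the split itself but later, when an empty length field is converted to an Integer.

```groovy
def sample = '22010005PHONE'

// Split off the first 4 characters and keep the remainder
def (head, rest) = [sample.take(4), sample.drop(4)]
assert head == '2201'
assert rest == '0005PHONE'

// take/drop do not throw when the string is shorter than requested
assert 'AB'.take(4) == 'AB'
assert 'AB'.drop(4) == ''
```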

Take 2

We can try another iterative method without an internal class, to see if that works any better in your environment:

def input = '22010005PHONE00010002IN00780004ROSE'
def pos = 0
def words = []

while (pos < input.length() - 8) {
    def code = input.substring(pos, pos + 4)
    def len = input.substring(pos + 4, pos + 8) as Integer
    def word = input.substring(pos + 8, pos + 8 + len)
    if (code.startsWith('00')) {
        words << word
    }
    pos += 8 + len
}

def result = words.join()
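
If the buffer coming from a previous stage might be truncated or contain a malformed length field, the same loop can be wrapped in a function that stops instead of throwing. This is only a sketch under that assumption: wordsForPrefix and its prefix parameter are hypothetical names, not part of StreamSets or Groovy.

```groovy
// Hypothetical helper: collects the words whose 4-char code starts with `prefix`
String wordsForPrefix(String input, String prefix = '00') {
    def words = []
    int pos = 0
    // Each entry needs at least an 8-character header (code + length)
    while (pos + 8 <= input.length()) {
        String code = input.substring(pos, pos + 4)
        String lenText = input.substring(pos + 4, pos + 8)
        // Assumption: a length field is always exactly four digits
        if (!(lenText ==~ /\d{4}/)) {
            break   // malformed length field: stop rather than throw
        }
        int end = pos + 8 + lenText.toInteger()
        if (end > input.length()) {
            break   // word is cut off at the end of the buffer
        }
        if (code.startsWith(prefix)) {
            words << input.substring(pos + 8, end)
        }
        pos = end
    }
    words.join('')
}

assert wordsForPrefix('22010005PHONE00010002IN00780004ROSE') == 'INROSE'
```

In the pipeline, the argument would presumably be the record field read with bracket syntax, as in the question (record.value['TXN_BUFFER']), assuming that field holds a plain String.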

Hi Tim, First of all thanks for your inputs and code. This works when I send hard coded values in the streamset pipeline. But when I pass it as variable (output from previous stage), it fails. def result = new Tokeniser(buf: record.value(dec_buf)) .findAll { it.code.startsWith('00') } .word .join() //SCRIPTING_05 - Script error while processing record: javax.script.ScriptException: groovy.lang.MissingMethodException: No signature of method: org.codehaus.groovy.jsr223.GroovyScriptEngineImpl.$() wait()// — Mani, May 02 '19 at 08:13
Can you edit the question, and copy what you're trying and the full exception into the end of it? Comments are hard to read code in... — tim_yates, May 02 '19 at 08:23
But it works in the same environment with a hard coded string? — tim_yates, May 02 '19 at 08:23
@Mani Added a second way of doing it, to see if that works better — tim_yates, May 02 '19 at 10:41
