I am trying to ingest text data from a local directory into HDFS. Before ingesting, I need to convert the text into valid JSON. For that, I am using the JavaScript Evaluator processor.

In the JavaScript Evaluator, I am unable to read any record.

Here is my sample code:

for (var i = 0; i < records.length; i++) {
  try {
    output.write(records[i]);
  } catch (e) {
    error.write(records[i], e);
  }
}

Is there a better option than the JavaScript Evaluator?

Here is my sample input data:

{
    1046=
    1047=
    1048=5324800
    1049=20180508194648
    1095=2297093400,
    1111=up_default
    1118=01414011002101251
    1139=1
}
{
    1140=1
    1176=mdlhggsn01_1.mpt.com;3734773893;2472;58907
    1183=4
    1211=07486390
    1214=0
    1227=51200
    1228=111
    1229=0
    1250=614400,
}

UPDATE:

As per @metadaddy's answer, I tried using Groovy instead of JavaScript. I am getting the following exception for the same data that @metadaddy showed in his answer.

Here is my error screenshot: [screenshot of the exception]


2 Answers


Your JavaScript needs to read through the input, building output records.

Using Text format, the Directory origin will create a record with a /text field for each line of input.
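For illustration, the batch handed to the script looks roughly like the mock below (this is just a sketch of the record shape, not the actual Data Collector record API):

```javascript
// Illustrative mock of the batch the Directory origin (Data Format: Text)
// passes to the JavaScript Evaluator -- one record per input line, each
// carrying the raw line in a single /text field.
var records = [
  { value: { text: '{' } },
  { value: { text: '    1048=5324800' } },
  { value: { text: '}' } }
];

// The script reads each raw line via value.text:
var line = records[1].value.text.trim();  // '1048=5324800'
```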

This JavaScript will build the record structure you need:

for(var i = 0; i < records.length; i++) {
  try {
    // Start of new input record
    if (records[i].value.text.trim() === '{') {
      // Use starting input record as output record
      // Save in state so it persists across batches
      state.outRecord = records[i];
      // Clean out the value
      state.outRecord.value = {};
      // Move to next line
      i++;
      // Read values to end of input record
      while (i < records.length && records[i].value.text.trim() !== '}') {
        // Split the input line on '='
        var kv = records[i].value.text.trim().split('=');
        // Check that there is something after the '='
        if (kv.length > 1 && kv[1].length > 0) {
          state.outRecord.value[kv[0]] = kv[1];   
        } else if (kv[0].length > 0) {
          state.outRecord.value[kv[0]] = NULL_STRING;
        }
        // Move to next line of input
        i++;
      }

      // Did we hit the '}' before the end of the batch?
      if (i < records.length) {
        // Write record to processor output
        output.write(state.outRecord);
        log.debug('Wrote a record with {} fields', 
            Object.keys(state.outRecord.value).length);
        state.outRecord = null;        
      }
    }
  } catch (e) {
    // Send record to error
    log.error('Error in script: {}', e);
    error.write(records[i], e);
  }
}

Here is a preview of the transformation on your sample input data:

[screenshot: pipeline preview of the transformation]

Now, to write the entire record to HDFS as JSON, simply set the Data Format in the Hadoop FS destination to JSON.
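If it helps to verify the parsing logic outside Data Collector, the core of the script above can be sketched as a plain function runnable under Node.js. The name `parseBlocks` and the inlined sample are illustrative; plain `null` stands in for the scripting processor's `NULL_STRING`:

```javascript
// Standalone sketch of the same key=value parsing logic, for testing
// outside a pipeline. Takes raw lines, returns one object per {...} block.
function parseBlocks(lines) {
  var out = [];
  var i = 0;
  while (i < lines.length) {
    if (lines[i].trim() === '{') {
      var rec = {};
      i++;
      // Read values until the closing brace or end of input
      while (i < lines.length && lines[i].trim() !== '}') {
        var kv = lines[i].trim().split('=');
        if (kv.length > 1 && kv[1].length > 0) {
          rec[kv[0]] = kv[1];
        } else if (kv[0].length > 0) {
          rec[kv[0]] = null;   // empty value after '='
        }
        i++;
      }
      // Only emit the record if the '}' was actually seen
      if (i < lines.length) {
        out.push(rec);
      }
    }
    i++;
  }
  return out;
}

var sample = ['{', '  1046=', '  1048=5324800', '  1095=2297093400,', '}'];
var recs = parseBlocks(sample);
// recs[0] -> { '1046': null, '1048': '5324800', '1095': '2297093400,' }
```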

  • Thanks for the response. My data is not in valid JSON format; I need to convert it to valid JSON. For that, I am using the JavaScript Evaluator. – user6325753 May 16 '18 at 05:04
  • Just to clarify - you're reading data in text format, that looks like the sample you provided, and you want the output JSON to look like `{"1":1234,"2":6788,"3":56778}` - is this correct? – metadaddy May 16 '18 at 13:54
  • OK - I figured out the JavaScript for your use case – metadaddy May 16 '18 at 14:38
  • Hi @metadaddy, can you please suggest any documentation for learning other than the official docs? I want to build pipelines for Hadoop FS, Hive, and HBase. – user6325753 May 21 '18 at 05:55

Groovy script in StreamSets Data Collector executes much faster than JavaScript, so here is the same solution in Groovy.

Using Text format, the Directory origin will create a record with a /text field for each line of input.

This script will build the record structure you need:

for (i = 0; i < records.size(); i++) {
  try {
    // Start of new input record
    if (records[i].value['text'].trim() == "{") {
      // Use starting input record as output record
      // Save in state so it persists across batches
      state['outRecord'] = records[i]
      // Clean out the value
      state['outRecord'].value = [:]
      // Move to next line
      i++
      // Read values to end of input record
      while (i < records.size() && records[i].value['text'].trim() != "}") {
        // Split the input line on '='
        def kv = records[i].value['text'].trim().split('=')
        // Check that there is something after the '='
        if (kv.length == 2) {
          state['outRecord'].value[kv[0]] = kv[1]
        } else if (kv[0].length() > 0) {
          state['outRecord'].value[kv[0]] = NULL_STRING
        }
        // Move to next line of input
        i++
      }

      // Did we hit the '}' before the end of the batch?
      if (i < records.size()) {
        // Write record to processor output
        output.write(state['outRecord'])        
        log.debug('Wrote a record with {} fields', 
            state['outRecord'].value.size());
        state['outRecord'] = null;        
      }
    }
  } catch (e) {
    // Write a record to the error pipeline
    log.error(e.toString(), e)
    error.write(records[i], e.toString())
  }
}

Running this on input data:

{
    1=959450992837
    2=95973085229
    3=1525785953
    4=29
    7=2
    8=
    9=
    16=abd
    20=def
    21=ghi;jkl
    22=a@b.com
    23=1525785953
    40=95973085229
    41=959450992837
    42=0
    43=0
    44=0
    45=0
    74=1
    96=1
    98=4
    99=3
}

Gives output:

{
  "1": "959450992837",
  "2": "95973085229",
  "3": "1525785953",
  "4": "29",
  "7": "2",
  "8": null,
  "9": null,
  "16": "abd",
  "20": "def",
  "21": "ghi;jkl",
  "22": "a@b.com",
  "23": "1525785953",
  "40": "95973085229",
  "41": "959450992837",
  "42": "0",
  "43": "0",
  "44": "0",
  "45": "0",
  "74": "1",
  "96": "1",
  "98": "4",
  "99": "3"
}
  • Hello @metadaddy, I have updated my question with the recent exception. Please have a look. – user6325753 May 23 '18 at 05:36
  • I added a check for `i < records.size()` to the while loop condition - this should catch a missing closing brace. – metadaddy May 23 '18 at 06:18
  • Hello @metadaddy, thank you for the quick response. If you don't mind, I have shared a file to your FB account that includes my input data. If I use JavaScript, it works fine for up to 10 records; after that I get an ArrayIndexOutOfBoundsException. – user6325753 May 23 '18 at 06:22
  • If you're going to provide a sample, then make it representative. Your first one had no empty values. Your next one was all integers, but the real data has strings, too? You're really not making it easy to help you. – metadaddy May 23 '18 at 06:28
  • I have updated my question with sample data, and the size is greater than 10 records within { }. JavaScript and Groovy only work for 8 records within { }. – user6325753 May 23 '18 at 06:32
  • Fixed it for strings and tabs. Works fine on your one.txt – metadaddy May 23 '18 at 06:42