
I receive the result of an API call, apply some transformations, and store the output in S3. Right now it stores one file per API call, resulting in a LOT of files. The flow is:

InvokeHTTP -> SplitJson -> JoltTransformJSON (I don't need all the data) -> EvaluateJsonPath -> InferAvroSchema (500 samples) -> ConvertJSONToAvro -> PutS3Object

The json format is:

{
  "data": {"value1": "test", "value2": "test2"},
  "actions": [{"buy": 5, "sell": 6}, {"buyAgain": 5, "sellAgain": 6}],
  "Reactions": [{"buy": 5, "sell": 6}],
  "otherValue": "1",
  "otherValue2": "2"
}

Sometimes actions has values inside; in other cases it is "actions": []. I drop Reactions using JoltTransformJSON with the remove operation, since it has a LOT of data I don't need.
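For reference, a minimal Jolt remove spec that drops a top-level Reactions field looks like this (a sketch, assuming Reactions sits at the root of each record):

[
  {
    "operation": "remove",
    "spec": {
      "Reactions": ""
    }
  }
]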

To join the values I tried MergeContent, but it DROPS a lot of records. First I read the possible configurations, then I started modifying parameters just to see how the output changes; it always DROPS a lot of records.

So now I'm storing one file per JSON in S3. That's a lot of files, and you can feel it when querying the data.

How can I improve the flow to store less files? Thank you!

---- EDIT: image added ----

Current MergeContent configuration; I don't quite understand the Attribute Strategy property. Can this fix the changes in schema (actions with values vs. "actions": [])?

(screenshot: current MergeContent configuration)

---- EDIT 2 ---- Now I can confirm that it is grouping by state as expected, but it drops the JSON flows that have "actions": []; they have the same state as some of the flows where that field is populated. Any ideas? Thanks!

Alejandro

1 Answer


The MergeContent processor is the correct solution here. Set Merge Format to Avro and the Avro contents of the flowfiles will be concatenated into a single flowfile. Your problem of dropped data is related to the Metadata Strategy property:

For FlowFiles whose input format supports metadata (Avro, e.g.), this property determines which metadata should be added to the bundle. If 'Use First Metadata' is selected, the metadata keys/values from the first FlowFile to be bundled will be used. If 'Keep Only Common Metadata' is selected, only the metadata that exists on all FlowFiles in the bundle, with the same value, will be preserved. If 'Ignore Metadata' is selected, no metadata is transferred to the outgoing bundled FlowFile. If 'Do Not Merge Uncommon Metadata' is selected, any FlowFile whose metadata values do not match those of the first bundled FlowFile will not be merged.

Flowfiles whose schema is not equal to the schema of the first bundled flowfile will be dropped. I can think of two possible solutions to prevent that:

Use Correlation Attribute Name to merge Avro flowfiles that share the same schema

You have to ensure that only files with the same schema get merged. So if you can put an attribute on the flowfile, like type=CAR or type=BIKE, you can set Correlation Attribute Name to "type". MergeContent will then build bundles based on type. Since the schema of the files in a bundle is the same, no records will be dropped.

Set a specific schema

Replace InferAvroSchema and ConvertJSONToAvro with a single processor: ConvertRecord. Configure a JsonTreeReader as the reader and leave the default properties. Configure an AvroRecordSetWriter as the writer and set the following properties:

(screenshot: AvroRecordSetWriter configuration)

In the AvroRecordSetWriter, configure the following Schema Text:

{
  "name": "MyClass",
  "type": "record",
  "namespace": "com.acme.avro",
  "fields": [
    {
      "name": "data",
      "type": {
        "name": "data",
        "type": "record",
        "fields": [
          {
            "name": "value1",
            "type": "string"
          },
          {
            "name": "value2",
            "type": "string"
          }
        ]
      }
    },
    {
      "name": "actions",
      "type": {
        "type": "array",
        "items": {
          "name": "actions_record",
          "type": "record",
          "fields": [
            {
              "name": "buyAgain",
              "type": ["int", "null"]
            },
            {
              "name": "sellAgain",
              "type": ["int", "null"]
            },
            {
              "name": "buy",
              "type": ["int", "null"]
            },
            {
              "name": "sell",
              "type": ["int", "null"]
            }
          ]
        }
      }
    },
    {
      "name": "Reactions",
      "type": {
        "type": "array",
        "items": {
          "name": "Reactions_record",
          "type": "record",
          "fields": [
            {
              "name": "buy",
              "type": "int"
            },
            {
              "name": "sell",
              "type": "int"
            }
          ]
        }
      }
    }
  ]
}

Notice that actions now includes all the fields. If you need help converting JSON to an Avro schema, use a JSON-to-Avro schema generator.
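If the actions field itself can be absent (not just empty), Avro also lets you declare the field as a union of null and array, making it optional. A sketch, reusing the actions_record type already defined in the schema above:

{
  "name": "actions",
  "type": ["null", {
    "type": "array",
    "items": "actions_record"
  }],
  "default": null
}

Note that referring to actions_record by name only works if that record type is defined earlier in the same schema.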

PS: if you need more control over the number of records per merge, see MergeContent's Minimum/Maximum Number of Entries properties.

DarkLeafyGreen
  • I tried option 1, but it still dropped flows, a LOT. For the second approach I understand I need to extract a field from the JSON and use that, for example DATE. If I hardcoded a string there, would all files go to the same file? (I'm partitioning by date in PutS3Object.) Will this ignore the metadata and bin options, or do I have to take other things into account? Thank you! – Alejandro Nov 03 '19 at 19:12
    @Alejandro with option 2 you have to ensure that only files with the same schema get merged. So if you can put an attribute on the flowfile like type=CAR or type=BIKE, you can set Correlation Attribute to "type". MergeContent will then make bundles based on type. Since the schema of the files in a bundle is the same, no records will be dropped. With option 3 you basically define a giant schema that includes all the fields your data may have. – DarkLeafyGreen Nov 03 '19 at 19:22
  • The bigger change from flow to flow is that sometimes the array "actions" is empty; I thought that was the problem, which I couldn't solve even with option 1 for some reason. The main difference between flows is the data contained (or not) in that field. You are saying to use a variable like actions[1].buy, so if there is something inside it will group by that, and if it's null it will not drop it? In this case buy has 3 possible values... or doesn't exist. – Alejandro Nov 03 '19 at 20:42
  • Added an image of my current configuration with your recommendations. I'm extracting the state from the JSON in EvaluateJsonPath; still not sure how to set Attribute Strategy, I don't understand the NiFi help for that parameter. – Alejandro Nov 03 '19 at 22:01
  • @Alejandro An empty array again requires another schema. Probably it is better if you define a "god" schema that includes all the fields. I updated my answer! – DarkLeafyGreen Nov 04 '19 at 06:38
  • Thanks! Last question to understand the process: using the "god schema", when it finds a missing attribute, like the empty array, will it fill in null values for the keys in the schema, so that the merge then works with the same schema for all flows? – Alejandro Nov 05 '19 at 11:24
  • @Alejandro exactly! For the array you have to define the type explicitly as a union of null and array, to make the array optional. This is how the schema would look: https://stackoverflow.com/a/9955865/401025 – DarkLeafyGreen Nov 05 '19 at 15:02
  • Thank you @Upvote, I marked your solution as accepted because it helped me in another case. I tried it on my dataset and the API has 15+ types of responses: 30 fields are common and there are another 15 with changing properties. They are not all unique, but it's hard to make a "god" schema; will see what I can do. For now I'm ingesting by record, and then with Athena using try(field) to get the field if it exists or null if not. Thanks again – Alejandro Nov 05 '19 at 21:30