
In NiFi we can design a flow in two ways:

  1. Content-based modification (UpdateContent) - In this approach we directly modify the content of the flow files. With this, at each stage the flow file content gets persisted in the content repository.

Sample flow:

ListFile -> FetchFile -> ValidateRecord (sanity) -> UpdateContent -> CSVtoAvro -> AvrotoORC -> PutHDFS
  2. Attribute-based modification (UpdateAttribute) - In this approach we store the contents of the flow files in memory as attributes and modify them directly. Once the updates are done we write the attributes back to the flow file content and then merge the flow files using MergeContent. (Both styles are sketched in the code below.)
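
For concreteness, a minimal sketch of the two styles inside a custom processor's onTrigger(), assuming the standard NiFi processor API; the class name and the attribute key "col_42" are made up for illustration:

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Collections;
import java.util.Set;
import org.apache.nifi.flowfile.FlowFile;
import org.apache.nifi.processor.AbstractProcessor;
import org.apache.nifi.processor.ProcessContext;
import org.apache.nifi.processor.ProcessSession;
import org.apache.nifi.processor.Relationship;
import org.apache.nifi.processor.exception.ProcessException;
import org.apache.nifi.processor.io.StreamCallback;

// Hypothetical processor contrasting the two modification styles.
public class TwoStylesSketch extends AbstractProcessor {
    static final Relationship REL_SUCCESS = new Relationship.Builder().name("success").build();

    @Override
    public Set<Relationship> getRelationships() {
        return Collections.singleton(REL_SUCCESS);
    }

    @Override
    public void onTrigger(ProcessContext context, ProcessSession session) throws ProcessException {
        FlowFile flowFile = session.get();
        if (flowFile == null) return;

        // Style 1: content-based. session.write() is copy-on-write: the modified
        // bytes are appended to the content repository on disk at every stage.
        flowFile = session.write(flowFile, new StreamCallback() {
            @Override
            public void process(InputStream in, OutputStream out) throws IOException {
                String csv = new String(in.readAllBytes(), StandardCharsets.UTF_8);
                out.write(csv.toUpperCase().getBytes(StandardCharsets.UTF_8));
            }
        });

        // Style 2: attribute-based. putAttribute() only touches the in-heap
        // attribute map; the on-disk content claim is left alone.
        flowFile = session.putAttribute(flowFile, "col_42", "sanitized-value");

        session.transfer(flowFile, REL_SUCCESS);
    }
}
```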

In terms of performance we see much better results in the first case; in the second case many of the processors are slow, like ExtractText and especially MergeContent. Having said that, I have also tuned concurrent threads and backpressure levels, but still could not achieve better performance.

ListFile -> FetchFile -> ExtractText -> UpdateAttribute -> AttributeToCSV -> CSVtoAvro -> AvrotoORC -> MergeContent -> PutHDFS (rough flow)

I want to understand why the attribute approach is less performant and whether I am doing something wrong. Please suggest.

We have a file with 200 columns, all of them treated as attributes for modification. The machine has 32 GB of RAM (16 GB for NiFi), a quad-core Intel Core i7-4771, and a 500 GB HDD.

Aviral Kumar

1 Answer


A little bit of theory

  1. Content-based modification is based on the Content Repository. It's just multiple binary append-only files on NiFi's local disk that are linked to flow files by file path and offset (here you can find more).
  2. Attribute-based modification: attributes are just a map inside the JVM heap, backed by a write-ahead log (here you can find more). So attribute-based modification works with in-memory data and is faster. The read-path difference is sketched below.
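
A small sketch of where each read lands, assuming the NiFi processor API (the helper class and the key "col_42" are hypothetical):

```java
import java.io.IOException;
import java.io.InputStream;
import org.apache.nifi.flowfile.FlowFile;
import org.apache.nifi.processor.ProcessSession;
import org.apache.nifi.processor.io.InputStreamCallback;

// Hypothetical helper contrasting the two read paths described above.
final class RepoReads {
    static void read(ProcessSession session, FlowFile flowFile) {
        // Attribute read: an in-heap map lookup, no content-repository I/O.
        String col = flowFile.getAttribute("col_42");

        // Content read: streams bytes from the content repository on local disk,
        // located via the flow file's claim (file path + offset).
        session.read(flowFile, new InputStreamCallback() {
            @Override
            public void process(InputStream in) throws IOException {
                byte[] firstKb = in.readNBytes(1024);
            }
        });
    }
}
```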

Two possible issues

  1. It doesn't look to me like you're really working with attribute-based modification. MergeContent still works on content, so you need to drop the flow file content after UpdateAttribute and before MergeContent (see the sketch after this list).

  2. Alternatively, you may also check the volume of attributes. If you have too many attributes, the in-memory map will be spilled to disk and you will lose the benefit of working in memory. But I think the first point is the issue.
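
For the first point, a minimal sketch of what "dropping the content" means, assuming you do it in a custom processor (the helper is hypothetical; I believe a stock processor that truncates content, such as ModifyBytes, serves the same purpose):

```java
import java.io.OutputStream;
import org.apache.nifi.flowfile.FlowFile;
import org.apache.nifi.processor.ProcessSession;
import org.apache.nifi.processor.io.OutputStreamCallback;

// Hypothetical helper: replace the flow file's content with zero bytes so
// only the attributes travel on to MergeContent.
final class ContentDropper {
    static FlowFile dropContent(ProcessSession session, FlowFile flowFile) {
        return session.write(flowFile, new OutputStreamCallback() {
            @Override
            public void process(OutputStream out) {
                // intentionally write nothing: the new content claim is empty
            }
        });
    }
}
```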

P.S.

If you think that's not the case, update your question with the number of flow files, the volume of text extracted into attributes, the machine characteristics, and maybe details about the content-based approach so I will be able to compare...

UPD after question update

Your content-based flow:

(1) ListFile -> (2) FetchFile -> (3) ValidateRecord (sanity) -> (4) UpdateContent -> (5) CSVtoAvro -> (6) AvrotoORC -> (7) PutHDFS

Here, at steps 3, 4, 5 and 6 you're doing copy-on-write: read from the Content Repository (local file system) for each flow file, modify it, and append the result back to the Content Repository. So you're doing 4 read-write iterations.
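
To put rough numbers on that (all of them assumed): if one batch is 1 GB of CSV, four copy-on-write stages mean about 4 GB read plus 4 GB written to the content repository, i.e. ~8 GB of disk traffic. On an HDD doing ~100 MB/s sequential, that is already ~80 seconds of pure I/O before any actual processing.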

Your attribute-based flow:

(1) ListFile -> (2) FetchFile -> (3) ExtractText -> (4) UpdateAttribute -> (5) AttributeToCSV -> (6) CSVtoAvro -> (7) AvrotoORC -> (8) MergeContent -> (9) PutHDFS

Here, at steps 6 and 7 you are still doing 2 read-write iterations. Moreover, MergeContent is another bottleneck that is absent in the first option. MergeContent reads all the input data from disk, merges it (in memory, I think) and copies the result back to disk. So steps 6, 7 and 8 are already slow enough to give you performance as bad as the content-based flow. On top of that, step 3 copies content into memory (another read from disk), and you may experience disk swapping.

So with the attribute-based flow it looks like you have almost the same volume/number of disk read/write transactions. At the same time you may also have contention for RAM (JVM heap), because all of your content is stored in memory multiple times:

  • Each version (sanitized, updated, etc.) of the attributes is stored in memory.
  • MergeContent may hold another part of the data in memory.

So maybe you have even more disk iterations because of swapping (but this should be checked; it depends on the volume of files processed simultaneously). A back-of-envelope heap estimate is below.
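
A back-of-envelope estimate, with every number assumed: 200 attributes per flow file at, say, ~50 bytes per value plus ~100 bytes of String/map-entry overhead is roughly 30 KB of heap per flow file version. If 100,000 flow files are queued at once, that is already ~3 GB, and each retained version along the flow multiplies it, out of the 16 GB given to NiFi.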

Another point is that the answer depends on how you are doing the transformations.

Also, what processors are you using for the first approach? Are you aware of the QueryRecord processor?
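
For reference (the column name is assumed): with QueryRecord you add a dynamic property such as `filtered = SELECT * FROM FLOWFILE WHERE col1 IS NOT NULL`; matching records are routed to a relationship named `filtered`, and the configured record reader/writer handles the CSV/Avro (de)serialization in a single content pass.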

VB_
  • I have updated the question with the required info. – Aviral Kumar Jun 22 '20 at 11:12
  • @AviralKumar updated the answer. But I think that you have a lot of content-based operations in your *attribute-based approach*, so it doesn't look strange that your performance may be slow in the second case. – VB_ Jun 22 '20 at 12:43
  • @AviralKumar but performance is a complex question, and only you can debug where the bottleneck is. I can provide you with considerations only; you still need to verify how much truth is in them) – VB_ Jun 22 '20 at 12:47
  • Sure VB_, this is highly appreciated. I will dig more and ask you anything further required. – Aviral Kumar Jun 22 '20 at 12:48
  • We have written our own custom processor for doing the transformations. We did this for both cases (attribute and content). One single processor can be used to specify all the transformations given by the user. – Aviral Kumar Jun 22 '20 at 13:26
  • What is the special purpose of QueryRecord? I already have a custom processor for the transformations – Aviral Kumar Jun 22 '20 at 16:28
  • @AviralKumar QueryRecord allows you to write SQL over content; the output will be another content. It depends on the details, but it is generally more about simplicity than about performance. That was a side suggestion, though; I suppose you aren't going to re-implement the logic, and you have reasons to implement a custom processor – VB_ Jun 22 '20 at 19:16
  • Okay, this I am already achieving in my current processor. Do you think we should look at writing ORC files directly from my custom processor? – Aviral Kumar Jun 23 '20 at 06:58
  • @AviralKumar that should be your solution. On one hand, every time you do CsvToAvro or AvroToORC you are doing copy-on-write. On the other hand, you should look at the simplicity-performance balance: complicate stuff only when it's required. So if it's possible to go with the simpler solution, go with it. – VB_ Jun 23 '20 at 08:00
  • Actually my flow is much more complicated, catering to ETL requirements like filtering and mapping values at different levels, and also duplicate data checks. So the flow file version is updated in multiple places. – Aviral Kumar Jun 23 '20 at 10:19