A little bit of theory
- Content-based modification is based on the Content Repository. It is just a set of binary append-only files on NiFi's local disk, linked to FlowFiles by file path and offset (you can find more here).
- Attribute-based modification: attributes are just a map inside the JVM heap, backed by a write-ahead log (you can find more here). So attribute-based modification works with in-memory data and is faster (see the sketch after this list).
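To make the difference concrete, here is a minimal sketch of the two storage models. This is a simplified toy model in Python, not NiFi's actual Java implementation, and all names in it are illustrative:

```python
import os

class ToyContentRepository:
    """Very simplified model of NiFi's Content Repository:
    one append-only file, content addressed by (offset, length)."""

    def __init__(self, path="content_claims.bin"):
        self.path = path
        open(self.path, "wb").close()  # start with an empty claim file

    def append(self, data: bytes):
        # Every content modification writes a brand-new claim (copy-on-write);
        # the old bytes stay on disk until background cleanup.
        with open(self.path, "ab") as f:
            offset = f.seek(0, os.SEEK_END)  # new claim starts at current end of file
            f.write(data)
        return (offset, len(data))

    def read(self, claim):
        offset, length = claim
        with open(self.path, "rb") as f:
            f.seek(offset)
            return f.read(length)


repo = ToyContentRepository()

# A "FlowFile" is essentially a pointer to a content claim plus an attribute map.
flowfile = {"claim": repo.append(b"raw,csv,line\n"), "attributes": {"filename": "a.csv"}}

# Content-based modification: read from disk, transform, append a new claim (disk write).
new_content = repo.read(flowfile["claim"]).upper()
flowfile["claim"] = repo.append(new_content)

# Attribute-based modification: just an in-memory map update, no content I/O
# (NiFi persists attribute changes to a write-ahead log, omitted here).
flowfile["attributes"]["sanitized"] = "true"

print(flowfile, os.path.getsize(repo.path), "bytes on disk")
```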
Two possible issues
It doesn't look to me like you're really getting an attribute-based flow. MergeContent still works on content, so you need to drop the FlowFile content after UpdateAttribute and before MergeContent (for example, with a ModifyBytes processor with Remove All Content set to true).
Alternatively, you may also check the volume of attributes. If you have too many (or too large) attributes, the in-memory map puts pressure on the heap and can spill to disk, and you lose the benefit of working in memory. But I think the first point is the issue.
P.S.
If you think that's not the case, update your question with the number of FlowFiles, the volume of text extracted into attributes, the machine characteristics, and maybe details about the content-based approach, so I will be able to compare...
UPD after question update
Your content-based flow:
(1) ListFile -> (2) FetchFile -> (3) ValidateRecord (sanity) -> (4) UpdateContent -> (5) CSVtoAvro -> (6) AvrotoORC -> (7) PutHDFS
Here, at steps 3, 4, 5 and 6 you're doing copy-on-write: the content is read from the Content Repository (local file system) for each FlowFile, modified, and appended back to the Content Repository. So you're doing 4 read-write iterations.
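As a rough back-of-the-envelope illustration (the numbers below are placeholders, since your real file count and sizes are not in the question, and I'm ignoring that the Avro/ORC conversions change the size), the content-based flow pushes roughly 4x the total content volume through the disk:

```python
# Hypothetical numbers, only to illustrate the copy-on-write cost.
n_files = 10_000          # assumed number of FlowFiles
avg_size_mb = 1.0         # assumed average content size per FlowFile

copy_on_write_steps = 4   # steps 3-6: ValidateRecord, UpdateContent, CSVtoAvro, AvrotoORC
total_mb = n_files * avg_size_mb

disk_read_mb = copy_on_write_steps * total_mb
disk_write_mb = copy_on_write_steps * total_mb
print(f"content-based flow: ~{disk_read_mb:.0f} MB read, ~{disk_write_mb:.0f} MB written")
```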
Your attribute-based flow:
(1) ListFile -> (2) FetchFile -> (3) ExtractText -> (4) UpdateAttribute -> (5) AttributeToCSV -> (6) CSVtoAvro -> (7) AvrotoORC -> (8) MergeContent -> (9) PutHDFS
Here, at steps 6 and 7 you are still doing 2 read-write iterations. Moreover, MergeContent is another bottleneck that is absent from the first option: it reads all input data from disk, merges it (in memory, I think) and copies the result back to disk. So steps 6, 7 and 8 are already slow enough to give you performance as bad as the content-based flow. Moreover, step 3 copies the content into memory (another read from disk), and you may experience disk swaps. A rough comparison is sketched below.
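Here is the same back-of-the-envelope sketch for the attribute-based flow, with the same placeholder numbers as above; it lands in the same ballpark because ExtractText, the two conversions, and MergeContent each touch the full content volume again:

```python
# Same hypothetical numbers as in the previous sketch.
n_files = 10_000
avg_size_mb = 1.0
total_mb = n_files * avg_size_mb

extract_text_read = total_mb     # step 3: content copied into attributes / heap
conversions_read = 2 * total_mb  # steps 6-7: CSVtoAvro, AvrotoORC (copy-on-write)
conversions_write = 2 * total_mb
merge_read = total_mb            # step 8: MergeContent reads every input...
merge_write = total_mb           # ...and writes the merged result back to disk

disk_read_mb = extract_text_read + conversions_read + merge_read
disk_write_mb = conversions_write + merge_write
print(f"attribute-based flow: ~{disk_read_mb:.0f} MB read, ~{disk_write_mb:.0f} MB written")
```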
So with the attribute-based flow it looks like you have almost the same volume of disk read/write transactions. At the same time you may also have contention for RAM (JVM heap), because all your content is stored in memory multiple times:
- Each version (sanitized, updated, etc.) of the extracted attributes is stored in memory
- MergeContent may hold another portion of the data in memory
So maybe you have even more disk operations because of swapping (but this should be checked; it depends on the volume of files processed simultaneously). A rough heap estimate is sketched below.
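A similarly rough estimate of the heap pressure from keeping content in attributes, which is what can trigger the swapping mentioned above (again, every number is an assumed placeholder, not data from your question):

```python
# Hypothetical numbers for illustration only.
flowfiles_in_queues = 5_000   # assumed FlowFiles sitting in queues at once
avg_content_mb = 1.0          # assumed content volume copied into attributes per FlowFile
attribute_versions = 2        # e.g. raw extracted text + sanitized/updated version
merge_buffer_mb = 1_000       # assumed volume MergeContent is currently binning

heap_mb = flowfiles_in_queues * avg_content_mb * attribute_versions + merge_buffer_mb
print(f"rough extra heap demand: ~{heap_mb:.0f} MB")
```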
Another point is that the answer depends on how you are doing the transformations.
Also, what processors are you using for the first approach? Are you aware of the QueryRecord processor? It evaluates SQL directly against record-oriented content, which may let you combine some of these steps.