5

I'm looking into ETL tools (like Talend) and investigating whether Apache Nifi could be used. Could Nifi be used to perform the following:

  1. Pick up two CSV files that are placed on local disk
  2. Join the CSVs on a common column
  3. Write the joined CSV to disk

I've tried setting up a job in Nifi, but couldn't see how to perform the join of two separate CSV files. Is this task possible in Apache Nifi?

It looks like the QueryDNS processor could be used to perform enrichment of one CSV file using the other, but that seems to be over-complicated for this use case.

Here's an example of the input CSVs, which need to be joined on state_id:

Input files

customers.csv

id | name | address      | state_id
---|------|--------------|---------
1  | John | 10 Blue Lane | 100
2  | Bob  | 15 Green St. | 200

states.csv

state_id | state
---------|---------
100      | Alabama
200      | New York

Output file

output.csv

id | name | address      | state
---|------|--------------|---------
1  | John | 10 Blue Lane | Alabama
2  | Bob  | 15 Green St. | New York
Community
  • 1
  • 1
Andy Longwill
  • 624
  • 9
  • 25

4 Answers4

13

Apache NiFi is more of a dataflow tool and not really made to perform arbitrary joins of streaming data. Typically those types of operations are better suited to stream processing systems like Storm, Flink, Apex, etc, or ETL tools.

The types of joins that NiFi can do well are enrichment look ups where there is a fixed size lookup dataset, and for each record in the incoming data you use the lookup dataset to retrieve some value. For example, in your case there could be a processor called LookUpState which has a property "State Data" which points to a file containing all the states, then the customers.csv could be the input to this processor.

A community member started a project to make a generic lookup service for NiFi: https://github.com/jfrazee/nifi-lookup-service

Bryan Bende
  • 18,320
  • 1
  • 28
  • 39
4

The typical pattern one follows for this is to load the reference set into a map cache controller service in NiFi. In this case that is the states.csv data. Then the live feed of customer data comes in and is enriched with this reference data using something like ReplaceText or you could even write a custom processor in Groovy. There are a lot of ways to slice this. There is also a JIRA/PR coming for making this even easier. There are elements of live stream joins that are best done in processing systems like Apache Storm, Spark, and Flink, but for the case you mention it can be done well in NiFi.

Andy
  • 13,916
  • 1
  • 36
  • 78
Joe Witt
  • 2,152
  • 10
  • 9
1

workflow

I also tried to join two CSV files using the common column and did it sucessfully using the lookup record attribute in nifi lookup record config

Here, I used simplecsvlookup service as my lookup service and I am also attaching it's configuaration simplecsvlookup configuration

The first thing we should learn is how to use the lookup record attribute. Here, I have two csv files:

sample.csv: id,msisdn,recharge_amount 1,9048108594,399

new1: msisdn,type 9048108594,1

output: id,msisdn,recharge_amount,type 1,9048108594,399,1

The most important thing to notice is that result record path and key in this case key is msisdn (because this one is the common one in both files) and, for the result record path, we should use the column name which we need to merge with us, which in this case is "type:"

result record path--->> /type
key----->> /msisdn

And, in the lookup service, give the respective key and value names.

It will work.

Adrian Mole
  • 49,934
  • 160
  • 51
  • 83
  • @Adrian Mole, do you have the template for this please ? When I try the above example, the Flowfile reaches LookupRecord but does not generate any Output to any one of the Matched/Unmatched / Failure . – Vizag Jul 15 '20 at 15:55
  • @Vizag You're asking the wrong person. All I did here was to improve the typographical format of the answer. But your ping will also have reached the author of this answer. – Adrian Mole Jul 15 '20 at 15:58
  • In fact, I didn't even do the edit myself: I just approved it in the Review Queue. – Adrian Mole Jul 15 '20 at 16:00
  • Ah thanks Adrain for alerting me on my mistake and sorry for that. @Bruce, if possible can you share the template xml here please. – Vizag Jul 15 '20 at 16:10
  • @Vizag you can look those images and configure the lookup service correctly – Bruce wayne - The Geek Killer Jul 24 '20 at 07:35
0

NiFi example on how to join CSV files to create a merged result

use of this template: https://gist.github.com/ijokarumawak/7e20af1cd222fb2adf13acb2b0f46aed

Javad
  • 51
  • 1
  • 3