
I am trying to build a data integration process using an ETL tool (Talend).

The challenge I am facing is bringing data from various sources (in different formats) into a single format.

The sources may have different column names and structures (order, datatypes, etc.), i.e. different metadata. As I see it, this is a very common case, but the tool does not seem to handle it, as it does not provide any dynamic mapping feature.

What is the best approach to handle such a scenario?


1 Answer


Talend does provide a dynamic mapping tool: tMap, or tXmlMap for XML data.

There's also tHMap (the Hierarchical Mapping tool), which is a lot more powerful, but I've yet to use it because it's very raw in the version of Talend I'm using (5.4); it should be more usable in 5.5.

Your best approach here may be to use a tMap after each of your components to standardise the schema of your data.

First you should pick what the output schema should look like (this could be the same as one of your current schemas or something entirely different if necessary) and then simply copy and paste the schema to the output table of each tMap. Then map the relevant data across.

An example job may look something like this:

[Image: example job layout]

Where the schemas and the contained data for each "file" (I'm using tFixedFlowInput components to hardcode the data in the job rather than read in a file, but the premise is the same) are as follows:

File 1: [Image: file1 schema and data]
File 2: [Image: file2 schema and data]
File 3: [Image: file3 schema and data]

And they are then mapped to match the schema of the first "file":

File 1: [Image: file1 tMap configuration]
File 2: [Image: file2 tMap configuration]
File 3: [Image: file3 tMap configuration]

Notice how the first tMap configuration shows no changes as we are keeping the schema exactly the same.
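Conceptually, each of these tMaps just renames and reorders the source columns onto one shared schema. Here's a minimal Java sketch of the same idea, using the column names from the question's comments (the record and method names are invented for illustration; in Talend you draw this mapping in the tMap editor rather than writing code):

```java
import java.util.List;

// Sketch: map rows with differing column names/order onto one standard schema.
public class StandardiseDemo {

    // The shared output schema every source is mapped onto
    record Standard(String name, String number, String sex) {}

    // Source rows as they arrive, each with its own column names and order
    record File1Row(String name, String mobileNumber, String sex) {}
    record File2Row(String names, String gender, String mobNo) {}

    // One mapping per source; only the column wiring differs
    static Standard fromFile1(File1Row r) {
        return new Standard(r.name(), r.mobileNumber(), r.sex());
    }

    static Standard fromFile2(File2Row r) {
        return new Standard(r.names(), r.mobNo(), r.gender());
    }

    public static void main(String[] args) {
        Standard a = fromFile1(new File1Row("Alice", "0123", "female"));
        Standard b = fromFile2(new File2Row("Bob", "0456", "male"));
        System.out.println(List.of(a, b));
    }
}
```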

Now that our inputs all share the same schema we can use a tUnite component to union (much like SQL's UNION operator) the data.
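In code terms the tUnite step is nothing more than appending rows that already share a schema (strictly, it keeps duplicates, so SQL's UNION ALL is the closer analogy). A small sketch, reusing an invented Standard record:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: tUnite appends rows that share one schema; duplicates are kept,
// so it behaves like SQL's UNION ALL rather than a deduplicating UNION.
public class UniteDemo {
    record Standard(String name, String number, String sex) {}

    public static void main(String[] args) {
        List<Standard> file1 = List.of(new Standard("Alice", "0123", "F"));
        List<Standard> file2 = List.of(new Standard("Bob", "0456", "M"));

        List<Standard> united = new ArrayList<>();
        united.addAll(file1);
        united.addAll(file2);
        System.out.println(united); // every row, one schema
    }
}
```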

After this we take one final step and use a tReplace component so that we can easily standardise the "sex" field to M or F:

[Image: tReplace configuration mapping "male" to "M" and "female" to "F"]
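The logic behind that tReplace configuration is just a small value lookup. A Java sketch of the equivalent (the method name is invented; the fall-through for unmatched values mirrors tReplace leaving non-matching rows untouched):

```java
import java.util.Map;

// Sketch: normalise free-text "sex" values to the codes M/F.
public class SexCodeDemo {
    static final Map<String, String> SEX_CODES =
            Map.of("male", "M", "female", "F");

    static String normalise(String raw) {
        // Unmatched values pass through unchanged
        return SEX_CODES.getOrDefault(raw.trim().toLowerCase(), raw);
    }

    public static void main(String[] args) {
        System.out.println(normalise("male"));   // M
        System.out.println(normalise("Female")); // F
    }
}
```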

And then lastly I output that to the console, but it could go to any available output component.

For a truly dynamic option without having to predefine the mapping you would need to read in all of your data with a dynamic schema. You could then parse the structure out into a defined output.

In this case you could read the data from your files as a dynamic schema (a single column) and then drop it straight into a temporary database table. Talend will automatically create the columns as per the headers in the original file.

From here you could then use a transformation mapping file and the databases' data dictionary to extract the data in the source columns and map it directly to the output columns.
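To make that concrete, here's a rough Java sketch of the lookup-file idea. Everything specific is an assumption for illustration: the mapping.csv format (one source_column,target_column pair per line), the staging_file table name, and the JDBC URL. A real job would also need to handle unmapped columns, identifier quoting and type conversion:

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch: drive the source->target column mapping from a lookup file
// plus the database's data dictionary, instead of hard-coding each tMap.
public class DynamicMappingDemo {
    public static void main(String[] args) throws Exception {
        // mapping.csv holds lines like "Mob_No,Number" (source,target)
        Map<String, String> mapping = new LinkedHashMap<>();
        for (String line : Files.readAllLines(Path.of("mapping.csv"))) {
            String[] parts = line.split(",");
            mapping.put(parts[0].trim().toLowerCase(), parts[1].trim());
        }

        try (Connection conn = DriverManager.getConnection(
                 "jdbc:postgresql://localhost/staging");
             Statement stmt = conn.createStatement();
             // Data dictionary: which columns did the dynamic-schema load create?
             ResultSet cols = stmt.executeQuery(
                 "SELECT column_name FROM information_schema.columns " +
                 "WHERE table_name = 'staging_file'")) {

            // Build "SELECT source AS target, ..." for every column we can map
            StringBuilder select = new StringBuilder();
            while (cols.next()) {
                String source = cols.getString(1);
                String target = mapping.get(source.toLowerCase());
                if (target == null) continue; // no rule for this column
                if (select.length() > 0) select.append(", ");
                select.append(source).append(" AS ").append(target);
            }
            System.out.println("SELECT " + select + " FROM staging_file");
        }
    }
}
```

When a new vendor format arrives, only mapping.csv changes; the job itself stays untouched.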

  • In the tMap component, the source metadata cannot change; it has to be fixed. How do I handle files with different column names and order? Say I have multiple source files: File 1 has columns Name, Mobile_Number, Sex; File 2 has columns Names, Gender, Mob_No; File 3: ...; File 4: ... The target has the schema Names, Number, Gender. How do I do the mapping? Only one file can be mapped to the target file, right? – Kaushal Jul 28 '14 at 09:03
  • What are you trying to do here? Join them or unite them? – ydaetskcoR Jul 28 '14 at 09:21
  • I've added a worked example where we take disparate schemas, map them to one standard schema and then union the data so it's all in one place. Does that cover your question? – ydaetskcoR Jul 28 '14 at 09:42
  • I understood what you are suggesting, but in the above solution I still need to define all the mappings in advance. What if I have thousands of files with different structures? Can I define this mapping using some lookup file or table, which can be consulted for the mapping? Basically, in my scenario I have thousands of files coming in which might have different schemas (the number, names and order of columns may vary), and I have to map them to a standard schema (which is predefined). – Kaushal Jul 28 '14 at 09:55
  • Of course you need to define everything in advance. There's never going to be a way that a system could automatically guess that "sex" needs to be mapped to "gender". If you genuinely have thousands of files with different structures but still related data then you have to question the process that is outputting those files. Your only other option would be to read everything in as a dynamic schema and then attempt a dynamic mapping there. You'd need an enterprise licence for that though. – ydaetskcoR Jul 28 '14 at 09:58
  • Yeah, I understand that. That is why I wanted to define this relationship in a file. Let's say I have a lookup file which contains the source and target columns that need to be mapped, so the job could read this file and map the columns accordingly. The point of doing it is that development need not be redone every time a new type of file is received; the mapping file can be edited and more column mappings added to it. – Kaushal Jul 28 '14 at 10:02
  • I tried using a dynamic schema; it does not allow mapping a dynamic schema to a static schema. I may need custom code to achieve that, and it will be a really complex task. – Kaushal Jul 28 '14 at 10:05
  • Yeah, it's pretty complex. Off the top of my head, you could use a dynamic schema to read the data in and then drop it straight into a temporary database table. Then, using a transformation mapping file and the data dictionary of the database, you could retrieve the data in the columns you are mapping from and map it to the appropriate output schema. Unless you really do have hundreds or thousands of disparate file formats, I'd be inclined to hard-code it as above. – ydaetskcoR Jul 28 '14 at 10:35
  • I have files coming in from hundreds of vendors, and each vendor may have different file formats and standards. Thanks for all the advice. :) – Kaushal Jul 28 '14 at 10:39