
I have a product that imports certain data files from clients (e.g., user directories) and exports other types of data (e.g., reports). All imports and exports are currently in CSV format (RFC 4180), and files are passed back and forth through managed file transfers.

Increasingly, I'm seeing requests from clients to transform and reconfigure these data files for use in their legacy systems. For import data files, it's bizarre requests like the following (I've sketched what this kind of transform looks like in code after the examples):

"We're passing you 20 columns, from that apply $business_logic to columns 4,7,5,18,19 to determine the actual value your system needs in column 21, then drop those original columns cuz they aren't really useful by themselves"

or

"The value in column 2 is padded with zeros; please strip that off."
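
A minimal sketch of what one of these import transforms might look like, assuming plain Python and the csv module (business_logic is a placeholder for whatever the client's actual rule is; the column numbers are the ones from the requests above, and everything else is hypothetical):

    import csv

    SOURCE_COLS = [4, 7, 5, 18, 19]  # 1-based column numbers from the client's spec

    def business_logic(values):
        # placeholder: the client's actual derivation rule goes here
        return "|".join(values)

    def transform_import(in_path, out_path):
        with open(in_path, newline="", encoding="utf-8") as fin, \
             open(out_path, "w", newline="", encoding="utf-8") as fout:
            writer = csv.writer(fout)
            for row in csv.reader(fin):
                row[1] = row[1].lstrip("0")  # strip leading zero-padding from column 2
                # derive the new value, then drop the source columns and
                # append the derived value as the new final column
                derived = business_logic([row[i - 1] for i in SOURCE_COLS])
                row = [v for i, v in enumerate(row, start=1) if i not in SOURCE_COLS]
                writer.writerow(row + [derived])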

For data export files, it's requests like the following (again, a rough sketch in code follows the examples):

"You are sending us .csv, but we need it in our special fixed width format."

or

"You are formatting numbers with decimals. Remove those, and prefix with 8 zeros."
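
And a minimal sketch of the fixed-width export side, under the same assumptions (the field widths and the choice of which column is numeric are made up for illustration):

    import csv

    FIELD_WIDTHS = [10, 8, 15, 12]  # hypothetical per-column widths from the client's layout

    def format_number(value):
        # drop the decimal portion and prefix with 8 zeros, per the request
        return "00000000" + value.split(".")[0]

    def csv_to_fixed_width(in_path, out_path):
        with open(in_path, newline="", encoding="utf-8") as fin, \
             open(out_path, "w", encoding="utf-8") as fout:
            for row in csv.reader(fin):
                row[2] = format_number(row[2])  # assume column 3 is the numeric one
                padded = [field.ljust(width)[:width]  # pad (or truncate) each field
                          for field, width in zip(row, FIELD_WIDTHS)]
                fout.write("".join(padded) + "\n")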

Of course, every client we onboard has different requirements. I'm hesitant to dive in and write something from scratch, as I imagine there are all sorts of gotchas in building out files of different formats (CSV, TSV, fixed width, Excel, stone tablets) and in dealing with character encoding, etc. What I'm looking for is some sort of dev framework (or commercial product) that would let us quickly satisfy the increasing number (and variety) of data transformation requests. Something lightweight and simple is much preferred.
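
To make "lightweight and simple" concrete, the shape I have in mind is roughly a per-client pipeline of small row-level steps, so each new request becomes one more registered function rather than another one-off script (all names here are hypothetical):

    from typing import Callable, Dict, List

    Row = List[str]
    Step = Callable[[Row], Row]

    # one named pipeline of small row-level transform steps per client
    PIPELINES: Dict[str, List[Step]] = {}

    def register(client: str, step: Step) -> None:
        PIPELINES.setdefault(client, []).append(step)

    def run(client: str, rows: List[Row]) -> List[Row]:
        result = []
        for row in rows:
            for step in PIPELINES.get(client, []):
                row = step(row)
            result.append(row)
        return result

    # e.g. one client wants column 2 unpadded:
    register("clientA", lambda row: [row[0], row[1].lstrip("0"), *row[2:]])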

Any thoughts or experiences appreciated.

rcourtna
  • What's the reason for preferring light and simple? Sounds to me like you need something more full-featured. I use Pentaho Data Integration. Increased my productivity on things like this by a factor of 10. Yes, there's a learning curve, but if you do this a lot, it's definitely worth it IMO. – Brian.D.Myers Aug 12 '13 at 23:17
  • Thx for the reply - I actually spent an hour or so dabbling in Pentaho Kettle yesterday. It could work for us. My concern is that the responsibility to meet new customer data transformation requirements will fall into the hands of developers, in which case working in code is preferred (and a framework to do the heavy lifting and set the ground rules would be beneficial). Ideally, however, a sales engineer, devops or other would build the transformations, which Kettle would allow for. – rcourtna Aug 13 '13 at 13:15
  • I suppose it would be a training issue then. ETL tools are designed for analysts and non-programmers to use, but they're not completely there yet. You have to have at least minimal development skills, and it IS a paradigm shift. That's the learning curve. But if they can get past it, as I said, it's definitely worth it IMO. – Brian.D.Myers Aug 13 '13 at 16:43

1 Answer


I'm not sure if it's a total fit, but you can check out streamsets.com.

It's an open-source tool for data movement and lightweight transformations. It lets you provide a minimal input schema (e.g., "I have CSV files"), so you don't have to deal with a lot of the things you mentioned.

Full disclosure: I'm an engineer at StreamSets.

ramblingpolak