Centralized System for Managing Data Ingest/Output

Question

The company I work for produces software that takes in gigabytes of data a day and outputs a smaller number of gigabytes a day. We have a significant infrastructure for managing data flow, and by significant I mean significantly bad. When we need a new type of data in our software, we write a script that pulls the data from an HTTP or FTP source, drop it on a server, and go. There are frequent complaints of over- or under-utilization across most of our servers, but we have no good way to manage the balance of resources spent retrieving and writing out data.

Is there a software package designed to manage both external data transfer (retrieval) and passing data from server to server inside our network?

I'm looking for something that offers:

A common base for simplifying the design of download scripts.
A logging/monitoring framework for knowing what's going on, what may not have been retrieved, etc. (Bonus points if the monitoring aspect can integrate with Nagios.)
A single location for viewing the status of download tasks, etc. (i.e. a dashboard)
An implementation meant to run across multiple servers, ultimately amounting to a pool of download slave servers.
Simplicity (relatively speaking). Again, we're trying to get off of a nightmare infrastructure.
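
For the Nagios bonus point above, here is a minimal sketch of what a check could look like. It assumes each download script touches a marker file whenever it succeeds; the marker path, thresholds, and function names are hypothetical and not part of any existing tool. The only fixed convention is that Nagios plugins report state through their exit code: 0 for OK, 1 for WARNING, 2 for CRITICAL, 3 for UNKNOWN.

```python
import os
import time

# Exit codes defined by the Nagios plugin convention.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def classify_age(age_seconds, warn_after, crit_after):
    """Map the age of the last successful download to a Nagios state."""
    if age_seconds >= crit_after:
        return CRITICAL
    if age_seconds >= warn_after:
        return WARNING
    return OK


def check_marker(path, warn_after, crit_after, now=None):
    """Return (exit_code, status_line) for a marker file.

    The marker file is an assumption of this sketch: a file whose
    modification time is updated by a download script on success.
    """
    now = time.time() if now is None else now
    try:
        mtime = os.path.getmtime(path)
    except OSError:
        return UNKNOWN, "UNKNOWN - cannot read marker %s" % path
    age = now - mtime
    code = classify_age(age, warn_after, crit_after)
    return code, "%s - last successful download %d seconds ago" % (LABELS[code], age)
```

A wrapper script would print the status line and pass the code to sys.exit(), for example after calling check_marker("/var/run/ingest/feed.ok", 3600, 7200) with a path and thresholds of your choosing.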

We have several homegrown tools to do this sort of thing. Some are Java, some are Perl, some are sh, some are Windows .NET apps. One team even built a tracker and made a spiffy AJAX frontend for theirs to find lost/stranded. Of course, it only works for their tool, not any of the other ones we have in-house. — mfinni, Sep 27 '10 at 18:10
Some advice: use an existing logging framework, like log4j, that has levels (debug, info, warn, error). For tracking, you need to assign a GUID or something similar to each item as soon as it is seen by your tools. Something like Java can be run everywhere: Windows, Unix, etc. Run it as a server with client instances, so you can start a new source/destination without having to take down the whole copy process. — mfinni, Sep 27 '10 at 18:12
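
As a rough illustration of the advice above (leveled logging plus an ID assigned as soon as a file is seen), here is a minimal sketch in Python using only the standard library. The logger name "ingest" and the function names are made up for the example; log4j would play the same role in Java.

```python
import logging
import uuid

LOGGER_NAME = "ingest"  # hypothetical name shared by all download scripts


def new_transfer_id():
    """Return a fresh random ID (UUID4) to tag one file from first sight onward."""
    return str(uuid.uuid4())


def configure_logging(level=logging.INFO):
    """Attach one stderr handler whose format includes the level and transfer ID."""
    logger = logging.getLogger(LOGGER_NAME)
    logger.setLevel(level)
    if not logger.handlers:  # avoid duplicate handlers on repeated calls
        handler = logging.StreamHandler()
        # %(transfer_id)s is filled in by the adapter below; records logged
        # on this logger without an adapter would lack that field.
        handler.setFormatter(logging.Formatter(
            "%(asctime)s %(levelname)s [%(transfer_id)s] %(message)s"))
        logger.addHandler(handler)
    logger.propagate = False


def get_transfer_logger(transfer_id):
    """Wrap the shared logger so every record carries the transfer ID."""
    base = logging.getLogger(LOGGER_NAME)
    return logging.LoggerAdapter(base, {"transfer_id": transfer_id})
```

A download script would call configure_logging() once, then for each file do something like log = get_transfer_logger(new_transfer_id()) followed by log.info(...), log.warning(...) or log.error(...). Searching the logs for one ID then shows everything that happened to that file.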
I appreciate the suggestions; they reminded me that I didn't specify an OS. We're using CentOS 5 pretty much across the board, so portability (regarding OS) is irrelevant. Sounds like you're an advocate for "rolling your own". It's certainly quite possible for me/us too, I just don't want to :). — VxJasonxV, Sep 27 '10 at 18:28
I'm not necessarily an advocate of "DIY", it's just pretty common for your case, in my experience. As @zerolagtime said below, this is something that workflow software can do, and it's usually hella expensive. Lombardi (bought by IBM) does stuff like this. I have also seen scheduling software (the big stuff, like Espresso) used for file copies. You can be sure those packages have logging, retries, even job distribution. And $$$$$$ — mfinni, Sep 27 '10 at 21:09

score 1 · Accepted Answer · answered Sep 27 '10 at 19:30

The OASIS group published a standard language for defining "business processes" called BPEL (Business Process Execution Language). We have used the expensive, commercial version for a few years for the very reasons you listed: maintainability, scalability, extensibility, verifiability. Fortunately for you, many projects have taken these tasks on. Some open-source versions are listed at freshmeat.net. Us mere mortals call this "workflow software." One package that caught my eye was Orchestra, as it is both LGPL and comes with commercial support options. Given that this is a critical business process, you may want some level of support, or at least want to contribute any changes back to the community.

Fortunately for you, many, many others have blazed this trail. It's just a matter of following it in a way that is politically acceptable in your organization. I recommend building a private prototype and giving a demonstration to get senior management to buy in to a larger, business-relevant test. Get the boss to buy in fairly early and you might be able to change this core business practice.

Yikes. At first I thought this was just modeling/UML, a glorified "this is the flow, let's put it into practice this way" chart, but I dove in and started to understand more about what's going on. This is most certainly a brand new direction for me, and I admit it seems a bit overkill for what I expected to be relatively common, but couldn't find the words for it. Thanks for your help. [edit] I really wish I would stop pressing enter in comments. These aren't the Line Breaks you're looking for. — VxJasonxV, Sep 28 '10 at 17:15
