
Could anyone please point me in the right direction on how to design/build a web service client that will consume terabytes of data and perform some computation on the retrieved data?

I inherited a project at my new job. The project was designed and started by the group a few weeks before I joined the team. It is about retrieving data from several web services (SOAP & REST) and performing some computation on the data before storing it in a database, displaying it to the user, and generating reports.

The process of getting the data involves pulling some data from web services A, B, and C and using the responses to make further requests to web services X, Y, and Z (we don’t have control over the web service producers). The current implementation is very slow, and most of the time we run out of memory when trying to do some computation on the retrieved data. The data runs to terabytes or more. The current implementation uses Maven/Spring.

I am at the point of drawing up a new design for this project (introducing a bit of caching, etc.), but I would appreciate some suggestions from anyone who has encountered this kind of problem before.

Aside from the obvious, are there any special tricks or approaches to this? I know this might sound like a stupid question to some people, but any pointers would help.

  • Is it possible for any of these sources of data to throw away some of the data once dealt with? If e.g. one of the streams relates to something you will process and then not process again, and you throw it away, then that should benefit both speed and memory. – Jon Hanna Dec 16 '11 at 11:05

2 Answers


I've never done this sort of thing myself (I'd love to, though), but it sounds to me like you could temporarily store this data in a data grid of some sort that scales horizontally over many machines (so you don't run out of memory). You could then apply an aggregating function across the data to get the result you're looking for, before storing the final result in your results database.

Off the top of my head, I'd recommend looking into Cassandra or HDFS for the distributed data grid (NoSQL cluster), and then Hadoop for creating jobs to query/aggregate/manipulate that data.
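To make that concrete, here's a rough sketch of what such an aggregation job could look like with Hadoop's MapReduce API. It's only an illustration: the tab-separated key/value input layout and the sum-per-key computation are assumptions, not anything from the question.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class AggregateJob {

    // Parses each input line ("key<TAB>value") into a (key, value) pair.
    public static class ParseMapper
            extends Mapper<LongWritable, Text, Text, DoubleWritable> {
        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            String[] fields = line.toString().split("\t");
            if (fields.length == 2) {
                context.write(new Text(fields[0]),
                              new DoubleWritable(Double.parseDouble(fields[1])));
            }
        }
    }

    // Sums all values seen for a key; swap in whatever computation you actually need.
    public static class SumReducer
            extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
        @Override
        protected void reduce(Text key, Iterable<DoubleWritable> values, Context context)
                throws IOException, InterruptedException {
            double sum = 0;
            for (DoubleWritable v : values) {
                sum += v.get();
            }
            context.write(key, new DoubleWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "aggregate");
        job.setJarByClass(AggregateJob.class);
        job.setMapperClass(ParseMapper.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(DoubleWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

The nice property is that neither the mapper nor the reducer ever holds more than one record (or one key's worth of values) in memory; the cluster handles the horizontal scaling for you.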

I hope that helps.

simonlord
  • Thanks for your response. However, I’m not sure the team would want to consider setting up a Hadoop cluster or implementing any sort of distributed computing, given the expertise and time it would entail. Also, the project does not involve storing any of the data retrieved from the web services. What we store in the database is some data generated from computations on the data from the various web services. What we need to display to the user is actually what we got from the web services, and we don’t have to store this in any database. – Joseph Samz Dec 16 '11 at 10:25

It's always awkward to deal with terabytes of data because you can't really have it all in memory at once (well, not without an absolutely ridiculous machine). So instead you should ask whether it is necessary to have all that data, or even just a large chunk of it, in memory at once. Can it be processed a little bit at a time? (A few MB would count as a “little bit” these days; don't be too worried about minimizing everything to the nth degree.) If it can, redesign the application and its deployment (with that much data, you can't really separate them) so that data stays on the wire or on disk rather than in memory.
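For example, if the services return large XML payloads, a pull parser lets you deal with one record at a time straight off the response stream instead of materialising the whole document. Here is a minimal sketch with StAX (the URL handling and the "record" element name are assumptions, not details from the question):

```java
import java.io.InputStream;
import java.net.URL;

import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

public class StreamingClient {

    // Walks a large XML response one element at a time; only the current
    // record is ever in memory, however big the overall payload is.
    public static void process(String serviceUrl) throws Exception {
        try (InputStream in = new URL(serviceUrl).openStream()) {
            XMLStreamReader reader =
                    XMLInputFactory.newInstance().createXMLStreamReader(in);
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT
                        && "record".equals(reader.getLocalName())) {  // assumed element name
                    handle(reader.getElementText());                  // your computation here
                }
            }
            reader.close();
        }
    }

    private static void handle(String value) {
        // Placeholder: aggregate, transform, or forward to the next service call.
    }
}
```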

You probably want to think in terms of streaming filters and transforms; MapReduce-based algorithms are a good plan. Have you looked at Hadoop yet? Yes, I know you're not keen on setting something like that up, but you really do have a large amount of data there and you have to think in terms of doing it right. That said, MapReduce is only one way of configuring a pattern of filters and transforms; there are others too. For example, you can treat the subsequent service requests as a type of transform, though with that much data you need to be careful that the service owner doesn't treat you as a denial-of-service attack! You might also want to consider using a scientific workflow system (Kepler, Taverna), as they're designed for running the same set of tasks over a long list of items.
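To show the shape of that filter/transform pipeline in plain Java, here is a small sketch; the two client interfaces are hypothetical stand-ins for your real SOAP/REST clients, not anything from the question:

```java
import java.util.stream.Stream;

public class Pipeline {

    // Hypothetical stand-ins for the real SOAP/REST clients in the project.
    interface ServiceA { Stream<String> records(); }       // yields raw records lazily
    interface ServiceX { String lookup(String key); }      // the chained follow-up request

    // Each record flows through the stages one at a time and is then discarded,
    // so memory use stays flat regardless of how much data passes through.
    static Stream<String> results(ServiceA a, ServiceX x) {
        return a.records()
                .filter(r -> !r.isEmpty())     // filter: drop records you don't need, early
                .map(x::lookup)                // transform: the follow-up service request
                .map(Pipeline::compute);       // transform: the actual computation
    }

    private static String compute(String enriched) {
        return enriched.toUpperCase();         // placeholder for the real computation
    }
}
```

MapReduce gives you the same shape, just distributed over a cluster; throttling or batching the lookup stage is also where you'd deal with the denial-of-service concern.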

You also need to be careful with data transfers; with that much data, the standard checksum algorithms built into TCP/IP have a surprisingly high likelihood of missing something. (Luckily, actual error rates with modern hardware are mostly really low…) Also, when processing this much data you need to be ever so careful to ensure you don't have memory leaks. Even a leak of 1% of 1% is likely to mean a GB-sized leak overall, which can be very noticeable.
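On the transfer point: if you do decide you want an end-to-end integrity check on top of TCP's, an application-level digest is cheap to compute as the data streams through. A sketch with the standard MessageDigest API (where you obtain the producer's expected hash is an assumption; many services don't publish one):

```java
import java.io.IOException;
import java.io.InputStream;
import java.security.DigestInputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class TransferCheck {

    // Streams the data through while computing SHA-256, so the check costs no
    // extra memory; compare the result against a digest published by the producer.
    public static String sha256(InputStream in) throws IOException, NoSuchAlgorithmException {
        MessageDigest digest = MessageDigest.getInstance("SHA-256");
        try (DigestInputStream din = new DigestInputStream(in, digest)) {
            byte[] buffer = new byte[8192];
            while (din.read(buffer) != -1) {
                // The data could be processed here as well; the digest updates as a side effect.
            }
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : digest.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }
}
```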

Donal Fellows