2

I'm building an app to serve large amounts of data via a REST API, and I'm looking for some input on how to architect it. I'm using .NET (C# 4.0), ASP.NET MVC and SQL Server 2008.

Right now I have about 400k rows in a relational database, with roughly 5% of them updated throughout the day by an internal app that goes directly to the database. I need to serve this data via a REST API that returns a custom XML format. However, the data needs to be processed before I can output it. The good thing is that I can pre-process it ahead of time if needed.

I wrote a small POC that gets the data, processes it and caches it into a local XML file. Because of the processing, it takes about an hour to work through all 400k rows. Once the cache is built, I just return the physical file on every request.
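The serving side is trivial, something along these lines (the path is just a placeholder):

// ASP.NET MVC action that returns the pre-generated cache file.
using System.Web.Mvc;

public class ProductsController : Controller
{
    public ActionResult Index()
    {
        // Return the physical XML file produced by the pre-processing job.
        return File(@"C:\cache\products.xml", "application/xml");
    }
}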

Now I need to be able to update the data as it gets updated in the source and update my cache, so I don't need to regenerate everything every time a single row gets updated.

I'm thinking about using AppFabric to keep an in-memory cache and using physical files just to make sure that, in case the memory cache goes down, I don't need to start from scratch. As soon as a row gets updated in the source I would update the memory cache and rewrite the physical file to make sure it's up to date.

So my primary source would be the AppFabric cache, then the physical cache file, and as a last resort I'd regenerate the file from the database, which would take about an hour and make the file unavailable to whoever calls it.
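In rough, simplified code, the lookup order I have in mind is something like this (class, key and path names are just placeholders, not actual code I have):

using System.IO;
using Microsoft.ApplicationServer.Caching;

public class XmlCacheProvider
{
    private const string CacheKey = "processed-products-xml";
    private readonly DataCache _cache = new DataCacheFactory().GetDefaultCache();
    private readonly string _filePath = @"C:\cache\products.xml";   // placeholder path

    public string GetCachedXml()
    {
        // 1. Primary source: the AppFabric in-memory cache.
        var xml = _cache.Get(CacheKey) as string;
        if (xml != null)
            return xml;

        // 2. Fallback: the physical file written on every update.
        if (File.Exists(_filePath))
        {
            xml = File.ReadAllText(_filePath);
            _cache.Put(CacheKey, xml);
            return xml;
        }

        // 3. Last resort: regenerate everything from the database (~1 hour).
        xml = RegenerateFromDatabase();
        _cache.Put(CacheKey, xml);
        File.WriteAllText(_filePath, xml);
        return xml;
    }

    private string RegenerateFromDatabase()
    {
        // Stand-in for the existing hour-long processing job.
        throw new System.NotImplementedException();
    }
}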

I'm not very happy with this, but it is what I got. Any suggestions?

Thanks a lot!

tucaz
  • Pretty clear explanation, but it left a question in my mind: is the output file you generate from the ~400k rows some combination of all of them that gets invalidated whenever one row changes, or is this a process that can happen row by row? A possible answer presents itself if it's the latter case. – coffeetocode Apr 08 '11 at 22:55
  • No, the output file that results from this process is a list of the input rows (Products, this is related to e-commerce) with some data modified (price). So if one row is updated I just need to update this one row. Not everything. – tucaz Apr 10 '11 at 18:32

3 Answers

1

Can you put all the processing in the database? If so, I think you should keep it all in there. 400k rows is a small number of rows and SQL Server can handle it fine (an hour is far too long for that!). Try using Integration Services to load staging tables and do all inserts/updates/selects using set-based processing instead of cursor/row-by-row processing. My 2 cents.
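Just to illustrate what I mean by set processing (the table and column names are made up), a single set-based UPDATE issued from your .NET code could replace a per-row loop:

using System.Data.SqlClient;

public static class SetBasedUpdate
{
    // Apply all staged price changes in one statement instead of row by row.
    public static int ApplyStagedPrices(string connectionString)
    {
        const string sql = @"
            UPDATE p
            SET    p.Price = s.Price
            FROM   Products AS p
            JOIN   Products_Staging AS s ON s.ProductId = p.ProductId;";

        using (var connection = new SqlConnection(connectionString))
        using (var command = new SqlCommand(sql, connection))
        {
            connection.Open();
            return command.ExecuteNonQuery();   // rows affected
        }
    }
}

One statement updates every changed row at once, which is where the big win over cursor/row processing comes from.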

rboaretto
  • Hey my man! Long time no see, huh? Unfortunately I can't :( It's a .NET component that does the processing. Thanks, though! – tucaz Apr 04 '11 at 17:24
  • Long time, haha. Let me ask you another question: can you rewrite this .NET component to use only the database? I mean, some procedure that does the processing using set processing instead of row processing. Load a table in the database that has all the values that should be updated, and using some joins/temp tables you can update the target table based on this temp table, or something like that. Set processing is the best approach in this case. – rboaretto Apr 11 '11 at 21:17
1

Have you thought about using LINQ to XML?

You will know best whether it fits your other constraints, but here are my 2 pence:

pros:

  • Xml is your target format, so no problem with conversion
  • it is readable
  • very easy to use (CRUD)
  • load/save to file would address your 'reload processed data on service restart' issue
  • has a reasonably efficient implementation, meaning it could handle your 400k rows (depending on your record size)
  • using LINQ means you can easily parallelise it (with PLINQ), thus using your cores efficiently (a sketch follows the example below)

cons:

  • memory use might be a problem, but this depends on your record size

The code below (for 500k elements) generated the document in about 2 seconds and updated all the element prices in under 0.5 seconds on a good dev box:

// Requires: using System.Linq; using System.Xml.Linq;

var root = new XElement("root");

// Build 500k product elements.
for (int i = 0; i < 500000; i++)
{
    root.Add(new XElement("product", new XAttribute("name", "product_" + i),
                          new XElement("price", 13.0 + 1.0/(i + 1))));
}

// Update the price of every matching product.
foreach (XElement updateElem in root
    .Elements("product")
    .Where(x => x.Attribute("name").Value.StartsWith("product")))
{
    updateElem.Element("price").Value = "16.0";
}

root.Save(@"c:\temp\huge.xml");
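And a rough sketch of the PLINQ point from the pros list (same namespaces as above, not benchmarked): build the 500k elements in parallel and attach them on a single thread, since LINQ to XML instances aren't safe to mutate concurrently.

// Build the product elements in parallel, then attach them on one thread.
var products = Enumerable.Range(0, 500000)
    .AsParallel()
    .AsOrdered()
    .Select(i => new XElement("product",
                              new XAttribute("name", "product_" + i),
                              new XElement("price", 13.0 + 1.0 / (i + 1))))
    .ToList();

var parallelRoot = new XElement("root", products);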
grzeg
1

Thanks for your clarification above. Here's an option based on that.

Add a table to your DB. Call it Products_Processed (or Prices, whatever). This new table has one row for each row in Products (i.e., 1-to-1 with the source data). Each row in this new table contains the processed data for the corresponding source row.

Each time a row is updated in Products by the external app, you compute just that row and update the corresponding row in Products_Processed.

Here are a few ways to get code to run on just the newly updated entries:

  • Have a thread in your program that polls the DB once a second (or minute) and runs your processing logic on any rows updated since the last poll (see the sketch after this list). This implies you keep a timestamp of when each Product row was updated (which is probably a good idea anyway).
  • If you don't want a timestamp, have a trigger in the DB on your Products table that adds updated rows to a Products_ToProcess table. Poll that, run your processing logic on the entries that appear there, and remove them from Products_ToProcess.
  • Use a trigger on the Products table that actually calls your C# code: Create and Run CLR SQLServer Trigger
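Here's a rough sketch of the first (polling) option. UpdatedAt, RawPrice and ProcessRow are made-up names standing in for your actual schema and your existing .NET processing component:

using System;
using System.Collections.Generic;
using System.Data.SqlClient;

public class ProcessedRowUpdater
{
    private readonly string _connectionString;
    private DateTime _lastRun = DateTime.MinValue;

    public ProcessedRowUpdater(string connectionString)
    {
        _connectionString = connectionString;
    }

    public void PollOnce()
    {
        var thisRun = DateTime.UtcNow;
        var changed = new List<Tuple<int, decimal>>();

        using (var connection = new SqlConnection(_connectionString))
        {
            connection.Open();

            // 1. Grab only the rows touched since the last poll.
            using (var select = new SqlCommand(
                "SELECT ProductId, RawPrice FROM Products WHERE UpdatedAt >= @since", connection))
            {
                select.Parameters.AddWithValue("@since", _lastRun);
                using (var reader = select.ExecuteReader())
                {
                    while (reader.Read())
                        changed.Add(Tuple.Create(reader.GetInt32(0), reader.GetDecimal(1)));
                }
            }

            // 2. Re-process and write just those rows into Products_Processed.
            foreach (var row in changed)
            {
                decimal processedPrice = ProcessRow(row.Item2); // your existing logic

                using (var update = new SqlCommand(
                    "UPDATE Products_Processed SET Price = @price WHERE ProductId = @id", connection))
                {
                    update.Parameters.AddWithValue("@price", processedPrice);
                    update.Parameters.AddWithValue("@id", row.Item1);
                    update.ExecuteNonQuery();
                }
            }
        }

        _lastRun = thisRun;
    }

    private decimal ProcessRow(decimal rawPrice)
    {
        // Placeholder for the real processing component.
        return rawPrice;
    }
}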

This approach keeps your derived data logically close to the source of truth (in the DB with the source data) and reduces the number of times you copy/format/handle the data. Also, importantly, using the tried-and-true DB-provided mechanisms for detecting/triggering on changed data will save you from writing a lot of your own syncing code.

Now, returning your results is essentially streaming out a select * from Products_Processed. If you want to return the processed data for just specific products, you have the full power of SQL and your schema; likewise for sorting. This whole setup should be fast enough that you don't need to cache the file on disk. In fact, SQL Server's caching should probably keep most or all of the processed data rows in RAM if you have enough, so you'll rarely have to do a cold select (and if you don't have enough RAM, consider what a few extra gigs are worth compared to your time; throwing hardware at a problem is never cheating ;).
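For example (just a sketch; the connection string handling, table and column names are made up), a streaming action result could look something like this:

using System.Data.SqlClient;
using System.Globalization;
using System.Web.Mvc;
using System.Xml;

public class ProcessedProductsXmlResult : ActionResult
{
    private readonly string _connectionString;

    public ProcessedProductsXmlResult(string connectionString)
    {
        _connectionString = connectionString;
    }

    public override void ExecuteResult(ControllerContext context)
    {
        var response = context.HttpContext.Response;
        response.ContentType = "application/xml";

        using (var connection = new SqlConnection(_connectionString))
        using (var command = new SqlCommand("SELECT Name, Price FROM Products_Processed", connection))
        using (var writer = XmlWriter.Create(response.Output))
        {
            connection.Open();
            writer.WriteStartElement("root");

            using (var reader = command.ExecuteReader())
            {
                while (reader.Read())
                {
                    // Stream each processed row straight out as XML.
                    writer.WriteStartElement("product");
                    writer.WriteAttributeString("name", reader.GetString(0));
                    writer.WriteElementString("price",
                        reader.GetDecimal(1).ToString(CultureInfo.InvariantCulture));
                    writer.WriteEndElement();
                }
            }

            writer.WriteEndElement();
        }
    }
}

The point is that nothing is materialized in memory or on disk: rows go straight from the reader to the response.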

(However, if you really want to write it out to disk, you can store the offsets into the physical file for each row record, and quickly update individual data in the file as the corresponding processed data row is updated.)

coffeetocode
  • This was the approach I used. The only difference is that I keep the generated data in a XDocument (as @grzeg suggested +1) object in memory and update it directly whenever a product changes. I didn't want to store the whole thing in the database to avoid generating XML for it at each request. Thanks – tucaz Apr 14 '11 at 14:27
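For reference, that in-memory approach might look roughly like this (the names are placeholders, and the lock is an assumption added here because LINQ to XML instances aren't thread-safe when web requests read while the updater writes):

using System.Linq;
using System.Xml.Linq;

public static class ProductXmlCache
{
    private static readonly object Sync = new object();
    private static readonly XDocument Document = XDocument.Load(@"C:\cache\products.xml"); // placeholder path

    public static void UpdatePrice(string productName, decimal newPrice)
    {
        lock (Sync)
        {
            // Patch only the element for the product that changed.
            var product = Document.Root
                                  .Elements("product")
                                  .FirstOrDefault(x => (string)x.Attribute("name") == productName);
            if (product != null)
                product.Element("price").Value =
                    newPrice.ToString(System.Globalization.CultureInfo.InvariantCulture);
        }
    }

    public static string GetXml()
    {
        lock (Sync)
        {
            return Document.ToString();
        }
    }
}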