
There are 0.5 million records per day, each record is ~500 bytes, and we have to analyze a year's worth of records. To speed up the process it would be better to load all the records at once, but we can't, as that requires ~88 GB of memory. The number of records may also grow in the future.
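(Back-of-the-envelope: 500,000 records/day × 500 bytes × 365 days is roughly 90 GB of raw record data, which is in line with the ~88 GB figure.)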

Another approach was to load the records by group, since we analyze them as groups, but there are 25,000 groups, and that number may grow too.

We could load one group at a time, analyze it, discard it and load the next... but this results in a very slow process, going to the database server 25,000 times! A single-threaded process with the data available in memory is much faster than a multithreaded process (thread count 32) that makes trips to the database.

Is there an approach for loading this huge amount of data that minimizes the number of trips to the database, or for loading a collection larger than the available memory, or a library that wraps the on-demand loading of data (a partial collection)?

bjan
  • Not sure what your problem is - you seem to have verified that "with data available in Memory is much faster", so just go with that approach... – Alexei Levenkov Jan 08 '13 at 06:43
  • @AlexeiLevenkov We don't have this much memory available :) – bjan Jan 08 '13 at 06:44
  • @bjan But you are aware that even small servers these days can hold 256 GB of memory, and there are these things called "shops" that sell it? Just to make sure. – TomTom Jan 08 '13 at 06:53
  • @TomTom We can not ask clients to purchase this much memory or they will opt for other solutions :) – bjan Jan 08 '13 at 07:02
  • @bjan given also all your other answers maybe you should hire a lead developer who knows how to deal with both .NET and large data? Seems you really WANT to have bad technology choices, either through bad judgement or lack of skill. That, plus budget problems on the hardware (and misrepresentation of facts) - well, that is not a good mixture. – TomTom Jan 08 '13 at 07:29
  • @TomTom I am facing the same problem: what is the best way to load millions of records from 2 different RDBMSs for some operation? Can you please provide me some insights on this? – I Love Stackoverflow Feb 16 '18 at 10:16

4 Answers


Have you considered getting them in one request, running along them and then discarding them as you go? Looking into Hadoop clusters?

Without knowing what your analysis needs, it is quite futile to make any recommendations.
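That said, here is a minimal sketch of the stream-and-discard idea, assuming SQL Server and made-up table/column names (Transactions, GroupId, RecordedAt, Amount): the query is ordered by GroupId so each group arrives contiguously and can be analyzed and thrown away as soon as the group id changes.

    // Hypothetical schema; only one group is ever held in memory at a time.
    using System;
    using System.Collections.Generic;
    using System.Data.SqlClient;

    class StreamingAnalyzer
    {
        class Txn { public DateTime RecordedAt; public decimal Amount; }

        public static void Run(string connectionString)
        {
            const string sql =
                @"SELECT GroupId, RecordedAt, Amount
                  FROM   Transactions
                  WHERE  RecordedAt >= DATEADD(year, -1, GETDATE())
                  ORDER  BY GroupId";                      // makes each group contiguous

            using (var conn = new SqlConnection(connectionString))
            using (var cmd = new SqlCommand(sql, conn))
            {
                conn.Open();
                using (var reader = cmd.ExecuteReader())
                {
                    var group = new List<Txn>();
                    int? currentGroupId = null;

                    while (reader.Read())
                    {
                        int groupId = reader.GetInt32(0);

                        // Group discriminator changed: analyze the finished group, then discard it.
                        if (currentGroupId.HasValue && groupId != currentGroupId.Value)
                        {
                            AnalyzeGroup(currentGroupId.Value, group);
                            group.Clear();
                        }

                        currentGroupId = groupId;
                        group.Add(new Txn { RecordedAt = reader.GetDateTime(1), Amount = reader.GetDecimal(2) });
                    }

                    if (currentGroupId.HasValue)
                        AnalyzeGroup(currentGroupId.Value, group);     // last group
                }
            }
        }

        static void AnalyzeGroup(int groupId, List<Txn> records)
        {
            // Placeholder: the whole group is in memory here, so LINQ to Objects works fine.
        }
    }

The point is a single round trip: the reader never materializes the whole result set on the client.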

TomTom
  • We can't load all the data at once because we don't have 88 GB of RAM. Let me see what Hadoop clusters are. – bjan Jan 08 '13 at 06:45
  • 1
    I did not say at once. I said stream and discard - read. Query ONCE, run along the data with a reader but do not remember what you already processed. – TomTom Jan 08 '13 at 06:52
  • To analyze a group we need to have all of the group's transactions available, and we are using LINQ for grouping and filtering, so a data reader cannot be used. Reading data for each group requires a database trip, which is comparatively slow – bjan Jan 08 '13 at 06:59
  • @bjan, why can't you move the grouping to the database - instead of filtering, you can choose the ordering so that each group's data is contiguous! You can also think about multiple result sets, where your database query generates one result set per group. – VinayC Jan 08 '13 at 07:16
  • Hm, last time I dealt with this stuff I also used LINQ (the technology) and had no problem with either. Maybe your choice of a LINQ provider is crappy, and you just don't know how to use it properly? Hint: neither LINQ2Sql nor EntityFramework will work nicely with requirements like this, but LINQ in itself will. And do the grouping in a second step. Pull ungrouped data out of a reader, group when you see a group change, process. – TomTom Jan 08 '13 at 07:28
  • @TomTom That is what we are doing: LINQ to Objects, grouping and filtering data in memory (a collection). And once again, the data should be available in memory... – bjan Jan 08 '13 at 07:34
  • @bjan Does not work. LINQ is not suited. Get rid of your grouping and use something like bltoolkit for the access (it does not remember loaded data and can enumerate along a result set). Do the grouping in memory when you see the group discriminator change. Worked like a charm for me processing 40+ million rows in one run. – TomTom Jan 08 '13 at 07:38

Apart from taking a distributed approach (i.e. having separate machines run the analysis in parallel, coordinated by a central controller), the only thing I can think of is streaming the data from the database directly into a file on the filesystem of the machine that will run the analysis (this could be done as a precursor to running the analysis).

If the storage hardware is fast (e.g. SSD), then replacing the database calls with file reads within the analysis program may provide better performance.
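A rough sketch of that precursor step, assuming SQL Server and invented table/column names (Transactions, GroupId, RecordedAt, Amount): one trip spools everything to a tab-separated file, and the analysis then streams the file with File.ReadLines.

    using System;
    using System.Data.SqlClient;
    using System.IO;

    class Spooler
    {
        // Precursor step: a single database trip that writes everything to local disk.
        public static void DumpToFile(string connectionString, string path)
        {
            const string sql = "SELECT GroupId, RecordedAt, Amount FROM Transactions ORDER BY GroupId";
            using (var conn = new SqlConnection(connectionString))
            using (var cmd = new SqlCommand(sql, conn))
            using (var writer = new StreamWriter(path))
            {
                conn.Open();
                using (var reader = cmd.ExecuteReader())
                {
                    while (reader.Read())
                        writer.WriteLine("{0}\t{1:o}\t{2}",
                            reader.GetInt32(0), reader.GetDateTime(1), reader.GetDecimal(2));
                }
            }
        }

        // Analysis step: sequential file reads replace the 25,000 database calls.
        public static void Analyze(string path)
        {
            foreach (var line in File.ReadLines(path))      // streams the file line by line
            {
                var parts = line.Split('\t');
                // parse the fields and feed them into the per-group analysis here
            }
        }
    }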

Chamila Chulatunga

Is it necessary to load all of your data into memory? Maybe the analysis you want to do only requires 3 fields of a record instead of all 50, for example. Consider creating a temporary dataset with a hash to lessen the memory you will require. Maybe your data is unnecessarily large, e.g. you're using bigints when you only need 3 significant figures, date and time when you only need the date, or varchar(100) when you only need the first 5 letters of the last name. Try truncating the data to allow a less memory-intensive initial processing. Then you can go back using your hash and look at finer details, like the time, after the date has been sorted. So you would load records in a block, dump the portion of the data that is unneeded, and move on.
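As a sketch of that idea (the column names and the 16-byte projection are made up), a first pass might load only a key and a couple of truncated fields, which brings a year of data down to a few gigabytes:

    using System.Collections.Generic;
    using System.Data.SqlClient;

    // ~16 bytes per record instead of ~500: a key to go back for details, plus the
    // two fields the first pass actually needs (values are illustrative).
    struct SlimRecord
    {
        public long Id;          // key used later to fetch the full row
        public int GroupId;
        public int AmountCents;  // truncated/scaled value is enough for the first pass
    }

    class SlimLoader
    {
        public static List<SlimRecord> LoadSlim(string connectionString)
        {
            var result = new List<SlimRecord>();
            const string sql = "SELECT Id, GroupId, CAST(Amount * 100 AS int) FROM Transactions";
            using (var conn = new SqlConnection(connectionString))
            using (var cmd = new SqlCommand(sql, conn))
            {
                conn.Open();
                using (var reader = cmd.ExecuteReader())
                {
                    while (reader.Read())
                        result.Add(new SlimRecord
                        {
                            Id = reader.GetInt64(0),
                            GroupId = reader.GetInt32(1),
                            AmountCents = reader.GetInt32(2)
                        });
                }
            }
            return result;   // ~182 million records * 16 bytes is roughly 3 GB, not ~88 GB
        }
    }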

It would be helpful if you gave us more details of what your data looks like, what you are trying to do with it, etc., or at least a facsimile of it if security/privacy keeps you from giving us the real thing. Sorry this is so general; I'm working with what I've got.

VoteCoffee

If you have 25,000 groups which are independent of each other, a multithreaded approach is better: a controller thread spawns "worker" threads according to load and gives them data to work with.

The controller thread fetches the amount of data that can be handled optimally (multiple groups that can be processed in one iteration, limited by the amount of available memory) and decides on the number of threads to spawn. This can also be made more scalable by adding multiple app servers, each with its own set of worker threads.
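A minimal sketch of that controller/worker split (FetchGroupBatch and AnalyzeGroup are placeholders, and Parallel.ForEach is used here as one way to cap the worker count):

    using System.Collections.Generic;
    using System.Threading.Tasks;

    class GroupController
    {
        public static void Run(IEnumerable<int> allGroupIds, int groupsPerBatch, int workerCount)
        {
            var batch = new List<int>(groupsPerBatch);
            foreach (var groupId in allGroupIds)
            {
                batch.Add(groupId);
                if (batch.Count == groupsPerBatch)
                {
                    ProcessBatch(batch, workerCount);
                    batch.Clear();
                }
            }
            if (batch.Count > 0)
                ProcessBatch(batch, workerCount);           // trailing partial batch
        }

        static void ProcessBatch(List<int> groupIds, int workerCount)
        {
            // One database trip brings the whole batch into memory.
            Dictionary<int, List<object>> data = FetchGroupBatch(groupIds);

            // Workers only touch in-memory data, so they are not competing for the database.
            Parallel.ForEach(data,
                new ParallelOptions { MaxDegreeOfParallelism = workerCount },
                pair => AnalyzeGroup(pair.Key, pair.Value));
        }

        static Dictionary<int, List<object>> FetchGroupBatch(List<int> groupIds)
        {
            // e.g. a single query with WHERE GroupId IN (...)
            return new Dictionary<int, List<object>>();
        }

        static void AnalyzeGroup(int groupId, List<object> records) { }
    }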

patil.rahulk
  • Loading one (or many) group at a time, which should not exceed available memory, requires trips to the database, and these trips are slowing us down – bjan Jan 08 '13 at 06:50
  • If you simply do not want multiple trips to the DB, then the only option you have is to get the entire data in a single trip... you have figured that out yourself! If you don't have that much memory, the only option is to split the data fetching into multiple trips: calculate how much memory is required for one group and then calculate the maximum (optimal) number of groups which can be processed at a time. – patil.rahulk Jan 08 '13 at 06:56
  • That is what I am asking for: loading 88 GB of data while the available RAM is less than 88 GB. How? – bjan Jan 08 '13 at 07:00
  • Calculate and split... let's say 1 group = 1 MB and you have 4 GB of RAM, so you can process 4 GB / 1 MB = 4000 groups at a time. Tweak your database fetch operation to return 4000 groups at a time, process them and then go for the next 4000 groups. You will have to do the counter management yourself, within the DB or the application. – patil.rahulk Jan 08 '13 at 07:02
  • This is good, but since the group size is not fixed, 4000 groups may have fewer records than a single group, so it is a bit complicated – bjan Jan 08 '13 at 07:08
  • But the transaction record size in the database is fixed, right? So calculate the number of groups which can be handled at a time using the number of transactions contained in those groups, which can vary for each iteration. You have to get down to the finite record level to be able to calculate the size, and then use it to decide how much you should read in each iteration. – patil.rahulk Jan 08 '13 at 07:31
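A rough sketch of that sizing step, assuming the per-group record counts are known up front (e.g. from a cheap GROUP BY count query) and using the ~500-byte record size from the question:

    using System.Collections.Generic;

    class BatchSizer
    {
        // How many of the next groups fit within the memory budget.
        // groupRecordCounts: transactions per group, in the order they will be fetched.
        public static int GroupsForNextBatch(IList<long> groupRecordCounts, int startIndex,
                                             long memoryBudgetBytes, int bytesPerRecord = 500)
        {
            long used = 0;
            int taken = 0;
            for (int i = startIndex; i < groupRecordCounts.Count; i++)
            {
                long groupBytes = groupRecordCounts[i] * bytesPerRecord;
                if (taken > 0 && used + groupBytes > memoryBudgetBytes)
                    break;                                  // the next group would not fit
                used += groupBytes;
                taken++;                                    // always take at least one group
            }
            return taken;
        }
    }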