
Basically, the job operates on a large number of records in a database, and more records can be inserted at any time:

Select <1000> records with status "NEW" -> process the records -> update the records to status "DONE".

This sounds to me like "MapReduce".

I think that the job described above may be done in parallel, even by different machines, but then my concern is:

When I select <1000> records with status "NEW", how can I know that none of these records are already being processed by some other job?

The same records must not be selected and processed more than once, of course. Performance is critical. The naive solution is to do the basic job above in a single loop, roughly like the sketch below.
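
For concreteness, here is a minimal sketch of that naive single-process loop, assuming an Oracle database (as mentioned in the comments below) and a hypothetical `records(id, payload, status)` table; all table, column, and connection names are illustrative, not from the original post:

```java
import java.sql.*;

public class NaiveWorker {
    public static void main(String[] args) throws Exception {
        // Connection details are placeholders.
        try (Connection conn = DriverManager.getConnection(
                "jdbc:oracle:thin:@//dbhost:1521/ORCL", "user", "pass")) {
            conn.setAutoCommit(false);
            while (true) {
                boolean sawWork = false;
                // Fetch the next batch of up to 1000 unprocessed records.
                try (PreparedStatement select = conn.prepareStatement(
                         "SELECT id, payload FROM records " +
                         "WHERE status = 'NEW' AND ROWNUM <= 1000");
                     ResultSet rs = select.executeQuery()) {
                    while (rs.next()) {
                        sawWork = true;
                        long id = rs.getLong("id");
                        process(rs.getString("payload")); // application logic
                        try (PreparedStatement done = conn.prepareStatement(
                                "UPDATE records SET status = 'DONE' WHERE id = ?")) {
                            done.setLong(1, id);
                            done.executeUpdate();
                        }
                    }
                }
                conn.commit();
                if (!sawWork) Thread.sleep(1000); // no NEW records; back off
            }
        }
    }

    static void process(String payload) { /* the actual work */ }
}
```

This is safe precisely because only one loop runs; the question is how to keep that safety with several concurrent workers.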

It seems related to big data processing / NoSQL / MapReduce, etc.

Thanks

Liran
  • It depends. What database is being used? – twid Jul 24 '13 at 19:21
  • The database being used is Oracle, with a small chance that it will be HBase instead. So please provide two solutions (if they differ at all): 1. For an Oracle database. 2. For HBase. Thanks – Liran Jul 25 '13 at 15:34
  • What would be a reason to start the second job processing new records before the first one is done? – Olaf Jul 25 '13 at 16:30

1 Answer


Considering the performance requirement, this can be achieved. The main goal is to distribute records to clients in such a way that no two clients get the same record, irrespective of the database:

  1. If you can add one more column that is used for locking records, then on fetching those records you can set the lock, preventing them from being fetched a second time (see the sketch after this list).

  2. But if you do not have such a capability, then my best bet would be to create another table or an in-memory key-value store, keyed by the record's primary key plus a lock; on fetching records you then need to check that the record does not already exist in that other table.
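
As a hedged sketch of approach 1 on Oracle: a worker atomically stamps a batch of unclaimed rows with its own id, then processes only the rows it stamped. Oracle's write consistency means two concurrent workers never stamp the same row. The `records` table and the `locked_by` column are illustrative assumptions:

```java
import java.sql.*;
import java.util.UUID;

public class ClaimingWorker {
    public static void claimAndProcess(Connection conn) throws SQLException {
        String workerId = UUID.randomUUID().toString();
        conn.setAutoCommit(false);

        // Atomically claim up to 1000 unclaimed NEW records. A concurrent
        // worker blocked on these rows re-evaluates the predicate after our
        // commit and skips them, so no row is stamped twice.
        try (PreparedStatement claim = conn.prepareStatement(
                "UPDATE records SET locked_by = ? " +
                "WHERE status = 'NEW' AND locked_by IS NULL AND ROWNUM <= 1000")) {
            claim.setString(1, workerId);
            claim.executeUpdate();
        }
        conn.commit();

        // Process only the rows stamped with our worker id.
        try (PreparedStatement select = conn.prepareStatement(
                "SELECT id FROM records WHERE locked_by = ?")) {
            select.setString(1, workerId);
            try (ResultSet rs = select.executeQuery()) {
                while (rs.next()) {
                    long id = rs.getLong("id");
                    // ... application processing here ...
                    try (PreparedStatement done = conn.prepareStatement(
                            "UPDATE records SET status = 'DONE', locked_by = NULL " +
                            "WHERE id = ?")) {
                        done.setLong(1, id);
                        done.executeUpdate();
                    }
                }
            }
        }
        conn.commit();
    }
}
```

Note that a worker that crashes mid-batch leaves its rows stamped, which is exactly the concern raised in the comments below; the TTL sketch after those comments addresses that.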

If you have HBase, this can be achieved easily; the first approach in particular is achievable there with good performance.

twid
  • I thought about doing it in parallel only for performance and scaling, so it can be done by multiple threads, and moreover maybe by multiple machines. If it cannot be done in parallel, there is no scaling: no more than one machine does the processing. Regarding 1, the lock on records: I thought of that, but I think it complicates things. The process becomes: lock <1000> records -> select <1000> records with status "NEW" -> process the records -> update the records to status "DONE" and unlock them. And if a process locks records and crashes in the middle, the records are left locked. More handling. – Liran Jul 26 '13 at 09:04
  • So I think what you can do is introduce a TTL; after the TTL expires, the next thread/process can access that record. The TTL value needs to be big enough to allow the first client to process the record and submit it (see the lease sketch after these comments). – twid Jul 26 '13 at 09:17
  • Thought about that too :) Again, these are good and valid solutions; as I wrote, more handling, but I am not sure whether it should be that complex. So I will keep all of that in mind: the first solution without parallelism, or the second solution with locking and more complexity. Thanks! – Liran Jul 26 '13 at 09:28
  • For record-level locking, look at [InnoDB locking reads (MySQL docs)](http://dev.mysql.com/doc/refman/5.0/en/innodb-locking-reads.html) – twid Jul 26 '13 at 10:59
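
Following up on the TTL idea from the comments, here is a minimal sketch on Oracle where each claim carries a timestamp, and an expired claim can be re-taken, so a crashed worker cannot leave records locked forever. The `locked_by`/`locked_at` columns and the 10-minute TTL are illustrative assumptions, and auto-commit is assumed to be off:

```java
import java.sql.*;

public class LeaseClaim {
    // Claim up to 1000 NEW records that are either unclaimed or whose
    // previous claim is older than the TTL (that worker presumably crashed).
    public static int claimBatch(Connection conn, String workerId)
            throws SQLException {
        try (PreparedStatement claim = conn.prepareStatement(
                "UPDATE records SET locked_by = ?, locked_at = SYSTIMESTAMP " +
                "WHERE status = 'NEW' " +
                "AND (locked_by IS NULL " +
                "     OR locked_at < SYSTIMESTAMP - INTERVAL '10' MINUTE) " +
                "AND ROWNUM <= 1000")) {
            claim.setString(1, workerId);
            int claimed = claim.executeUpdate();
            conn.commit();
            return claimed;
        }
    }
}
```

The processing step then selects `WHERE locked_by = ?` and marks rows DONE, as in the earlier sketch; the TTL must comfortably exceed the worst-case batch processing time, as twid notes above.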