I have two different applications which are running on two different machines.
- Application A is receiving data from Source A.
- Application B is receiving data from Source B.
Technically, Source A and Source B are supposed to provide the same data. We do not control either source; both are managed by another team.
Now I want to sample records coming from both Source A and Source B in those two applications. If I sample 1 million records from Source A, I want to sample the same 1 million records from Source B as well. I am using the userId that comes from both sources to sample the records. So, given a userId, I need some logic that selects 1 million records, and I will then use the same logic in both applications to get the same 1 million sampled records from Source A and Source B.
We get a bunch of userIds from both sources, and there is no specific pattern to them.
What algorithm or logic should I use so that I can sample 1 million records? I want to use the same logic in both applications. Is there any way to do this? I was thinking of using modulus here. I have exactly the same code below in both of my applications:
    public void writeToDatabase(final Holder holder) {
        String userId = holder.getUserId();
        // how to make sure that we are storing only 1 million user data in database
        // and it should be same user data from both the system.
        // need some logic on userId
        // write to database
    }
After storing the same data from both sources (A and B), I need to do some data quality comparisons between them. Basically, I will compare the data for the same 1 million userIds from Source A with the data from Source B.
Note: One million is just a number; 10,000 samples or 5,000 samples would also be fine.
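
To make the modulus idea concrete, here is a minimal sketch of what I have in mind, not a settled solution. It assumes userId is a non-null string that is identical in both sources. The class name UserSampler and the parameters keepBuckets and totalBuckets are hypothetical names I made up for illustration. The decision depends only on the userId, so both applications would keep the same users without talking to each other. It keeps a fixed fraction of users rather than exactly 1 million, so the fraction would have to be tuned to the expected number of distinct userIds.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

// Hypothetical helper: decides from the userId alone whether a record is sampled.
final class UserSampler {

    private UserSampler() {}

    /**
     * Returns true if the userId falls into one of the kept buckets.
     * Roughly keepBuckets / totalBuckets of all distinct userIds are kept.
     * Assumption: userId is non-null and spelled the same way in both sources.
     */
    static boolean isSampled(String userId, int keepBuckets, int totalBuckets) {
        // CRC32 over the UTF-8 bytes gives the same value on every machine and JVM.
        CRC32 crc = new CRC32();
        crc.update(userId.getBytes(StandardCharsets.UTF_8));
        // getValue() is in the range 0..2^32-1, so the remainder is never negative.
        long bucket = crc.getValue() % totalBuckets;
        return bucket < keepBuckets;
    }
}
```

Inside writeToDatabase, the call would be something like `if (!UserSampler.isSampled(userId, 100, 10_000)) return;` before the write, which would keep about 1% of users in both applications.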