Sample the same 1 million records in two applications with the same logic

Question

I have two different applications which are running on two different machines.

Application A is receiving data from Source A.
Application B is receiving data from Source B.

Source A and Source B are supposed to provide the same data. We do not control either source; both are managed by another team.

I want to sample the records coming from both sources in those two applications. If I sample 1 million records from Source A, I want to sample the same 1 million records from Source B as well. Both sources provide a userId with every record, so I plan to sample on that: given a userId, I need some logic that decides whether the record is part of the sample. I will then use the same logic in both applications to get the same sampled records from Source A and Source B.

We get a bunch of userIds from both sources, and they follow no specific pattern.

What algorithm should I use to sample 1 million records with the same logic in both applications? Is there any way to do this? I was thinking of using a modulus. I have exactly the same code below in both of my applications:

  public void writeToDatabase(final Holder holder) {
    String userId = holder.getUserId();
    // how to make sure that we are storing only 1 million user data in database
    // and it should be same user data from both the system.
    // need some logic on userId


    // write to database
  }

After storing the same data from both sources (A and B), I need to run some data quality comparisons between them. Basically, I will compare the data for the same 1 million userIds from Source A with the data from Source B.

Note: One million is just a number; 10,000 or 5,000 samples would also be fine.

How many records are there in total? What is effective value range of `userId`? — Andreas, Jun 12 '17 at 22:46
@Andreas we don't know that.. All these records are activities happening on site in real time. — , Jun 12 '17 at 22:54

score 0 · Answer 1 · answered Jun 12 '17 at 23:48

Run a decent hash function over each user ID and keep every ID whose hash falls below some threshold. The hash needs to be fast and well distributed; cryptographic security doesn't matter here.

For example, if you take all user IDs whose MD5 hash (written in hex) starts with '00', you'll get approximately 0.4% of all records (1 in 256), and you'll get the same 0.4% on both sides. You need to know nothing about the way they have chosen user IDs, and there should be no observable pattern in what gets selected. You can adjust the range of hashes you accept in any way that you want.

(You will get much closer to 0.5% by taking the ones whose uppercase hex hashes are alphabetically less than '0147B', since 0x0147B / 16^5 = 5243 / 1048576 ≈ 0.5%. Or you can arrange any fraction that you want.)
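As a rough illustration of the threshold idea, here is a minimal Java sketch. The class name `HashSampler`, the method `isSampled`, and the `fraction` parameter are made up for this example. It assumes user IDs are strings encoded as UTF-8, and it compares the first 32 bits of the MD5 digest against a numeric cutoff instead of comparing hex strings.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper: decides deterministically whether a userId is in the sample.
final class HashSampler {

    private HashSampler() {}

    /**
     * Returns true for roughly `fraction` of all user IDs (0.0 to 1.0).
     * The same userId always gives the same answer, on any machine.
     */
    static boolean isSampled(String userId, double fraction) {
        if (fraction < 0.0 || fraction > 1.0) {
            throw new IllegalArgumentException("fraction must be in [0, 1]");
        }
        // Cutoff on a 32-bit scale: fraction * 2^32.
        long threshold = (long) (fraction * 4294967296.0);

        byte[] digest = md5(userId);
        // First four digest bytes as an unsigned 32-bit number.
        long prefix = ((digest[0] & 0xFFL) << 24)
                    | ((digest[1] & 0xFFL) << 16)
                    | ((digest[2] & 0xFFL) << 8)
                    |  (digest[3] & 0xFFL);
        return prefix < threshold;
    }

    private static byte[] md5(String s) {
        try {
            // MessageDigest is not thread-safe, so a fresh instance is used per call.
            return MessageDigest.getInstance("MD5")
                                .digest(s.getBytes(StandardCharsets.UTF_8));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 is not available", e);
        }
    }
}
```

In the question's `writeToDatabase`, a check such as `if (!HashSampler.isSampled(holder.getUserId(), 0.005)) return;` placed before the write would keep about 0.5% of user IDs, and the same ones in both applications as long as both use the same fraction. The number of records you end up with still depends on how many distinct user IDs the sources emit.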

Without knowing how many user IDs they have, I cannot tell you how many you will probably get.

(If the sources use different user IDs for the same person, well, then you've got your work cut out for you...)

For the same person there will always be the same `userId`. If I am doing any activity on the site, then my `userId` data should flow from both sources (A and B). Can you tell me which hash algorithm I should use? Any example will be of great help. — , Jun 12 '17 at 23:51
@shortcut Any fast cryptographic hash algorithm will work for your purposes. MD5 is no longer secure, but is widely available and fast. It will be fine. — btilly, Jun 13 '17 at 00:44

score 0 · Answer 2 · answered Jun 23 '17 at 17:38

Check out Reservoir Sampling: https://en.wikipedia.org/wiki/Reservoir_sampling

You could use reservoir sampling to select n records from Source A in a single pass, with every record having an equal chance of being chosen, even when the total number of records is unknown. Algorithm R might be the relevant method for you. After you retrieve the samples from Source A, you can then use the user IDs of those samples to retrieve the same records from Source B.

Keep in mind that this method gives a uniformly random sample, but it does not use the same logic to sample from Source A and Source B. In addition, if you do not store your data, you may not be able to find the matching user IDs in Source B if the dataset is too large.
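For reference, a minimal sketch of Algorithm R in Java follows. The `Reservoir` class and its method names are invented for this example, and it assumes each item is offered once (a stream with repeated activity for the same `userId` would need deduplication first).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Hypothetical helper: keeps a uniform random sample of fixed size from a stream.
final class Reservoir<T> {
    private final int capacity;
    private final Random random;
    private final List<T> sample;
    private long seen = 0; // number of items offered so far

    Reservoir(int capacity, Random random) {
        if (capacity <= 0) {
            throw new IllegalArgumentException("capacity must be positive");
        }
        this.capacity = capacity;
        this.random = random;
        this.sample = new ArrayList<>(capacity);
    }

    /** Offers the next item of the stream to the reservoir. */
    void offer(T item) {
        seen++;
        if (sample.size() < capacity) {
            // The first `capacity` items fill the reservoir directly.
            sample.add(item);
            return;
        }
        // Draw j (approximately) uniformly from [0, seen); the new item
        // replaces slot j only when j lands inside the reservoir.
        long j = (long) (random.nextDouble() * seen);
        if (j < capacity) {
            sample.set((int) j, item);
        }
    }

    /** Returns a copy of the current sample. */
    List<T> snapshot() {
        return new ArrayList<>(sample);
    }
}
```

Because the choice depends on the random draws and on arrival order, running this independently on Source B would pick different records. That is why the sampled user IDs from Source A have to be carried over to Source B.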
