
I've taken one of the many online DNA tests, and it has identified genetic relatives, but most of them are third cousins or more distant. These other users have their family trees online, which I can access as JSON data.
I'm adopted, so the 'ground truth' is unknown for me.

But I need some sort of algorithm to crunch this data. The simplest one that I can think of is to find the most common last names from the data, but that doesn't seem very sophisticated. I'd like some more suggestions or links to relevant discussions or algorithms.

I don't want a discussion of whether I should do this. I'm not 100% sure whether I'm interested in this for my own benefit or more as an academic exercise.

amit
coding_hero

1 Answer


Maximum-likelihood estimation is one of the standard approaches to this kind of problem. Once you've pieced together the family trees, compute, for each person in the combined tree, how likely it is that this person would get your test results (making independence assumptions freely to simplify the math). Do this for every person in the tree (hopefully this won't take too long) and report the k largest likelihoods.
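As a rough sketch of that ranking step, assuming the matches are treated as independent so that their log-probabilities simply add up: the names `rank_candidates` and `match_likelihood` are hypothetical, and `match_likelihood` stands in for whatever likelihood model you end up with (it is not something the DNA site provides).

```python
import heapq
import math


def rank_candidates(candidates, matches, match_likelihood, k=5):
    """Return the k (log_likelihood, candidate) pairs with the highest score.

    match_likelihood(candidate, match) is a hypothetical, user-supplied
    function returning P(observed test result for `match` | you are
    `candidate`). Matches are assumed independent, so logs are summed.
    """
    scored = []
    for cand in candidates:
        log_lik = 0.0
        for m in matches:
            p = match_likelihood(cand, m)
            # A zero probability rules the candidate out entirely.
            log_lik += math.log(p) if p > 0 else float("-inf")
        scored.append((log_lik, cand))
    return heapq.nlargest(k, scored, key=lambda pair: pair[0])


# Toy usage with made-up probabilities for three candidates and two matches.
toy = {
    ("A", "m1"): 0.5, ("A", "m2"): 0.4,
    ("B", "m1"): 0.1, ("B", "m2"): 0.9,
    ("C", "m1"): 0.0, ("C", "m2"): 0.9,
}
top = rank_candidates(["A", "B", "C"], ["m1", "m2"],
                      lambda c, m: toy[(c, m)], k=2)
```

Summing logs rather than multiplying raw probabilities avoids underflow once there are many matches.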

The tricky part here is getting reasonable likelihood estimates. Here's one approach; I have no idea whether it's any good. Your family "tree" is a directed acyclic graph (hopefully no one has a time machine) where each node has exactly zero or two predecessors. Iterate over the nodes in topological (i.e., a plausible chronological) order. For nodes with zero predecessors, initialize a "chromosome" consisting of 2k random bits grouped into k pairs of 1-bit alleles (not sure how to set k; maybe a thousand?). For nodes with two predecessors, generate the chromosome by choosing at random, for each of the k pairs, one of the mother's two alleles and one of the father's. At the end, you can get genetic similarity scores via Hamming distances between chromosomes. You'll have to find a mapping between test results and distances, perhaps by simulating or working out the math for simple cases such as third cousins, etc.
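A minimal sketch of that simulation, under several assumptions: the tree is given as a dict `parents` mapping each person to a `(mother, father)` tuple or `None` (a made-up format, not the site's JSON), Python 3.9+ is available for `graphlib`, and the Hamming distance compares the pairs position by position, which is a simplification.

```python
import random
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def simulate_genomes(parents, k=1000, rng=None):
    """Simulate a k-pair 'chromosome' for everyone in the tree.

    parents maps person -> (mother, father), or None for founders.
    Anyone who appears only as a parent is treated as a founder too.
    """
    rng = rng or random.Random()
    graph = {person: (pa or ()) for person, pa in parents.items()}
    genomes = {}
    # static_order() yields every person after both of their parents.
    for person in TopologicalSorter(graph).static_order():
        pa = parents.get(person)
        if not pa:
            # Founder: k pairs of random 1-bit alleles.
            genomes[person] = [(rng.getrandbits(1), rng.getrandbits(1))
                               for _ in range(k)]
        else:
            mother, father = genomes[pa[0]], genomes[pa[1]]
            # Child: one allele from each parent at every position.
            genomes[person] = [(rng.choice(mother[i]), rng.choice(father[i]))
                               for i in range(k)]
    return genomes


def hamming(a, b):
    """Count differing bits between two genomes, position by position."""
    return sum(x != y for pa, pb in zip(a, b) for x, y in zip(pa, pb))


# Toy tree: two siblings plus an unrelated founder.
tree = {"mom": None, "dad": None, "stranger": None,
        "sib1": ("mom", "dad"), "sib2": ("mom", "dad")}
g = simulate_genomes(tree, k=2000, rng=random.Random(0))
sibling_distance = hamming(g["sib1"], g["sib2"])
unrelated_distance = hamming(g["sib1"], g["stranger"])
```

Running this many times for known relationships (siblings, first cousins, third cousins) would give a distribution of distances per relationship, which is one way to build the mapping between test results and distances.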

David Eisenstat
  • Feel free to edit this answer if I've screwed up the biology terms. It's been a while =P – David Eisenstat Mar 25 '14 at 17:51
  • If I understand your approach correctly, it assumes I have access to the other users' DNA records. But I don't. I just have access to their family trees. – coding_hero Mar 25 '14 at 21:54
  • @coding_hero No, you're faking DNA records according to the family tree to get an idea of how close their actual DNA records would be if you had access. – David Eisenstat Mar 25 '14 at 21:55
  • Ahh, I see. I read it correctly the first time, but the talk about 'alleles' on the second read-through threw me. – coding_hero Mar 26 '14 at 16:52