
I'm working on a real-world problem where I have theoretical protein sequences and associated experimental protein masses. I'm attempting to decompose the experimental protein masses into all possible compomers of amino acids of known masses to identify all possible impurities which could result in the given experimental protein mass, defining an impurity as any compomer which differs from the theoretical protein sequence.

To solve this problem, I've framed it as a coin change problem and implemented the algorithms outlined by Böcker and Lipták in their 2005 paper Efficient Mass Decomposition. However, the algorithm I've implemented identifies all compomers, and the number of results scales exponentially with the experimental mass, quickly becoming intractable well within my use cases. Peptides of 1000 Da are found to have 48,530 compomers!

Because I don't just have the experimental peptide masses, but theoretical sequences too, I'm looking to use the theoretical sequences to narrow the scope of output compomers and, hopefully, place an upper limit on the counts in the returned compomers. For example, if I had a theoretical sequence containing 2 As, 3 Cs, and 5 Ds, I would place the limit at double, so no returned compomer could have more than 4 As, 6 Cs, and 10 Ds. Is it possible to implement such an upper bound in the solution to the coin change problem? Is there a better or more efficient way to solve this problem?
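To make the bound concrete, here is a minimal sketch of what I mean by a bounded decomposition: a recursive coin-change enumeration where each amino acid's count is capped. The integer masses, bounds, and function name here are illustrative placeholders, not my real data (real monoisotopic masses would need to be scaled to integers with a tolerance).

```python
def bounded_decompositions(target, masses, bounds):
    """Yield every compomer (tuple of counts) such that
    sum(count[i] * masses[i]) == target and count[i] <= bounds[i]."""
    def recurse(i, remaining, counts):
        if i == len(masses):
            if remaining == 0:
                yield tuple(counts)
            return
        # The per-amino-acid cap prunes the search here.
        max_count = min(bounds[i], remaining // masses[i])
        for c in range(max_count + 1):
            yield from recurse(i + 1, remaining - c * masses[i], counts + [c])

    yield from recurse(0, target, [])


# Toy usage: masses 3 and 5, capped at 4 and 2 occurrences, target 11.
print(list(bounded_decompositions(11, [3, 5], [4, 2])))
```

This is only the naive bounded enumeration; my question is whether the same cap can be folded into the extended-residue-table machinery of the Böcker–Lipták algorithm, or whether a different formulation is better.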

michaelmccarthy404
  • One idea: produce a list of values representing the masses, with as many duplicates as your theoretical sequences allow for. Solve the subset sum problem for a sum equal to your protein mass. – Dillon Davis Jul 30 '18 at 22:53
  • If your masses are positive values that span a smallish range, you can probably use the dynamic programming solution to solve it fairly quickly. If you had only a few "coins" (although it doesn't sound like it), you could brute force it quickly instead. – Dillon Davis Jul 30 '18 at 22:55
  • @DillonDavis I was actually thinking that the subset sum problem may be a valid choice, but running the calculations in my head, I could potentially have 160 max items in my set. Is that too large for the subset sum problem? I've never implemented it before. – michaelmccarthy404 Jul 30 '18 at 22:57
  • I've implemented the brute force method before, and after 40 unique items my highly-optimized C solver started taking about 5 minutes to generate all possible solutions. – Dillon Davis Jul 30 '18 at 23:04
  • I'm not sure about the dynamic programming solution. I haven't implemented that one before because all my items have been floating point numbers. – Dillon Davis Jul 30 '18 at 23:06
  • Also, is it 160 unique items, or 160 with duplicates? If they are unique, your only option is dynamic programming. If there are duplicates, how many unique items do you have? If it's less than 40, you can probably still make the brute force solution work. – Dillon Davis Jul 30 '18 at 23:16
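For reference, the subset-sum idea from the comments can be sketched as follows (the function name and the small integer masses are illustrative assumptions): each amino acid is duplicated up to its bound, subsets summing to the target are enumerated by brute force, and matching subsets are collapsed back into compomers. This is the exponential brute-force variant the comments warn about past ~40 items.

```python
from itertools import combinations

def subset_sum_compomers(target, masses, bounds):
    """Return the set of compomers (count tuples) reachable by choosing a
    subset of duplicated amino-acid items that sums to target."""
    # Duplicate each amino acid up to its allowed count, tagging each
    # copy with the index of the amino acid it came from.
    items = []
    for i, (m, b) in enumerate(zip(masses, bounds)):
        items.extend([(m, i)] * b)

    found = set()
    for r in range(len(items) + 1):
        for combo in combinations(items, r):
            if sum(m for m, _ in combo) == target:
                counts = [0] * len(masses)
                for _, i in combo:
                    counts[i] += 1
                found.add(tuple(counts))
    return found


# Toy usage: masses 3 and 5, bounds 4 and 2, target 11.
print(subset_sum_compomers(11, [3, 5], [4, 2]))
```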

0 Answers