1

I am running a program called dnadist from PHYLIP (http://evolution.genetics.washington.edu/phylip/doc/dnadist.html). This creates a dna distance matrix from the number of sequences you input.

Currently, I want to create a matrix from 14,778 sequences. I am submitting this to run on my University's HPCC and based on my calculated estimate it will take 10 days to run.

I want to request more cores to speed up the time, but I am getting confused on if this is even possible to split up the algorithm running? Or does it have to run all on 1 core? My assumption is I would have to alter the algorithm itself to spilt up the matrix being produced and then concatenate it all back together. Is this correct to assume?

Oscar Foley
  • 6,817
  • 8
  • 57
  • 90
Gina
  • 71
  • 9
  • Distance matrix computations should be fairly trivial to parallel: for example for three elements A, B, C, the AB, BC and AC can be each calculated independently. However, it's not sure whether this is implemented in `dnadist`. You might want to look at other distance matrix calculations tools for example in `R` (`stats::dist` function) or they probably also exist in `C` or C derivatives. – Thomas Guillerme Sep 24 '18 at 00:12
  • Which HPCC version do you have? – Oscar Foley Sep 24 '18 at 07:18

2 Answers2

2

Yes, you can parallelize, that is the main point of using HPCC. Without reading your code is hard to answer. I assume you code would something like:

EXPORT CalculateDistances :=FUNCTION(parameters)
    // For each parameter do your DNA magic(matrix calculation) 
    RETURN something;
END;

result:= CalculateDistances(A,B,C,D...);
OUTPUT(result, named('result'));

You can parallelize coding your function with the basic matrix calculation, using PARALLEL ECL command and running workunit in Thor (not in HThor).

EXPORT CalculateDistance :=FUNCTION(Matrix1, Matrix2)
    // Your DNA magic(basic matrix calculation) for Matrix1&Matrix2
    RETURN something;
END;

result01:= CalculateDistance(A,B);
result02:= CalculateDistance(A,C);
result03:= CalculateDistance(A,D);
result04:= CalculateDistance(B,A);
result05:= CalculateDistance(B,C);
result06:= CalculateDistance(B,D);
result07:= CalculateDistance(C,A);
result08:= CalculateDistance(C,B);
result09:= CalculateDistance(C,D);
result10:= CalculateDistance(D,A);
result11:= CalculateDistance(D,B);
result12:= CalculateDistance(D,C);

executeCalculates := PARALLEL(
        result01,
        result02,
        result03,
        result04,
        result05,
        result06,
        result07,
        result08,
        result09,
        result10,
        result11,
        result12
);

executeCalculates;
Oscar Foley
  • 6,817
  • 8
  • 57
  • 90
  • 2
    may get cumbersome for 15,000 sequences however, that's over 100 million unique combinations! You could #Loop (HPCC macro function) I suppose. https://hpccsystems.com/training/documentation/ecl-language-reference/html/LOOP.html – DataMacGyver Sep 24 '18 at 08:23
  • Good idea. Macros are fun :-) – Oscar Foley Sep 24 '18 at 08:55
2

I'm not sure of a few things here: How phylip is running the pairwise comparisons (if it's all at once that's a helluva calculation!), what you are sequencing (bacterial proteins are orders of magnitude easier to fit in memory than the wheat genome would be) and how this is setup to run on HPCC (phylip is C I believe so how has it been deployed?).

In short, genetic analyses are done all the time so writing bespoke programs to do it yourself is probably a non-starter. There are other tools, such as MEGA that can chew through distance calculations for you but it's worth seeing what's being used in the literature for your problem and on what hardware. Perhaps also try R's dist.dna() function? You could if you wanted to parellelise this (link) but you'd need some jiggery pokery to make sure you got all the distances done before you combine them.

Is speed of calculation important? If you have 15,000 whole bacterial sequences (assuming 1,300 kbp each) then they will fit in memory on a decent machine. Again, my guess is that someone has already got something that will do this, a few days on a desktop you've got lying around is fine, gives you chance to write up your intro and methods!

DataMacGyver
  • 386
  • 3
  • 10
  • As discussed in Oscar's answer, you could write an ECL #LOOP to do this on HPCC. If you used an R, Python, C++ or Java function to do this then you could embed one of these languages in your code. – DataMacGyver Sep 24 '18 at 09:37
  • Thank you for the detailed response, I greatly appreciate it! To answer your questions first- (1) dnadist, based on my understanding, runs 1 organism at a time against all the others - so for example, A will be compared to B-Z and then B will run next (See "output format" in http://evolution.genetics.washington.edu/phylip/doc/dnadist.html) (2) I am using bacterial 16S RNA gene sequences that are 28 kb long (3) Again, my understanding is that PHYLIP is still deployed through C still via C compiler, but was uploaded into linux (http://evolution.genetics.washington.edu/phylip/install.html) – Gina Sep 24 '18 at 16:02
  • I am basing this analysis off of Adam Martiny's lab that uses PHYLIP before using their consenTRAIT algorithm - which requires neighbor trees & a binary trait table. PHYLIP creates 100 bootstraps of randomized sequences w/in a set of 16S via Seqboot. Each bootstrap is run through Dnadist. These 100 matrices are run through the Neighbor program. The speed is important since 10-15 days per bootstrap with needing to test 8 pathways = ~4 months if I send mult jobs on the HPCC & no troubleshooting. I'd be thrilled if I can get 1 bootstrap to run in 3-5 days. I'll check on other programs though too! – Gina Sep 24 '18 at 16:21