
Goal: Map a mutation location from file 1 to a region (feature) in file 2. Before comparing a chromosome location from file 1 against the regions in file 2, the chromosome (e.g. chr1) and the strand (+/-) must match.

Question: How can I use MapReduce or Disco to map a single location to a region? In other words, how do I formulate the location -> chromosomal region lookup as a MapReduce job?

Description: I have two medium-sized files (10 GB) of two different file types that I want to process. I have already parsed these files in plain Python, but I will likely have to process many larger, similar files in the future, so I want to try it with MapReduce (Hadoop/Pig, to be more specific) or Disco as a learning exercise.

I can run the nodes on an EC2 cluster, but ideally I would use a single-node Hadoop setup (yes, I know it defeats the purpose) or something like Disco or Spark.

I like the idea of using Pig because it would reduce the job to just processing the .csv files, but I have no idea how to use MapReduce to map something to a region rather than to a plain key/value pair.
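To make the question concrete, here is a rough sketch of one formulation I can imagine: a reduce-side join keyed on (chromosome, strand), simulated in plain Python rather than on Hadoop or Disco. The tuple layouts, the `MUT`/`UTR` tags, the function names, and the inclusive end coordinate are all my own assumptions for illustration, not the real file formats.

```python
from collections import defaultdict

def map_mutation(rec):
    # rec is assumed to be (chrom, strand, pos, sample_id)
    chrom, strand, pos, sample_id = rec
    yield (chrom, strand), ("MUT", pos, sample_id)

def map_region(rec):
    # rec is assumed to be (chrom, strand, start, end, gene_id)
    chrom, strand, start, end, gene_id = rec
    yield (chrom, strand), ("UTR", start, end, gene_id)

def reduce_group(key, values):
    # All records sharing a chromosome and strand arrive together,
    # so the region test only runs against comparable intervals.
    regions = [v[1:] for v in values if v[0] == "UTR"]
    for v in values:
        if v[0] != "MUT":
            continue
        _, pos, sample_id = v
        for start, end, gene_id in regions:
            if start <= pos <= end:  # assumes inclusive coordinates
                yield (sample_id, key[0], key[1], pos, gene_id)

def run_local(mutations, regions):
    # Stand-in for the shuffle phase: group mapper output by key.
    groups = defaultdict(list)
    for rec in mutations:
        for k, v in map_mutation(rec):
            groups[k].append(v)
    for rec in regions:
        for k, v in map_region(rec):
            groups[k].append(v)
    out = []
    for k in sorted(groups):
        out.extend(reduce_group(k, groups[k]))
    return out
```

The linear scan in the reducer is only there to keep the sketch short. With at most a few dozen (chromosome, strand) keys, each reducer group would be large on real data, so the key would probably also need a position bin.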

Here is a visual representation of what I was thinking of.

File info:

  1. The first file contains TCGA cancer SNP mutations. Some important features include:

    • Chromosome location
    • Chromosome number
    • strand
    • sample id
    • the rest is not so important
  2. The second file contains 3' UTR sequences. Some important features include:

    • Chromosome start location: int
    • Chromosome end location: int
    • Chromosome number: chrX
    • strand +/-
    • gene id
    • the rest is not so important
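For reference, the per-record lookup I have in mind looks roughly like the sketch below. The `Mutation` and `Utr` field names mirror the lists above but are my own naming, and the code assumes that regions on the same chromosome and strand do not overlap and that end coordinates are inclusive.

```python
import bisect
from typing import NamedTuple

class Mutation(NamedTuple):
    chrom: str
    strand: str
    pos: int
    sample_id: str

class Utr(NamedTuple):
    chrom: str
    strand: str
    start: int
    end: int
    gene_id: str

def build_index(utrs):
    # Group regions by (chrom, strand), then sort each group by start.
    grouped = {}
    for u in utrs:
        grouped.setdefault((u.chrom, u.strand), []).append(u)
    index = {}
    for key, group in grouped.items():
        group.sort(key=lambda u: u.start)
        index[key] = ([u.start for u in group], group)
    return index

def find_region(index, m):
    # Return the region containing the mutation, or None if there is none.
    entry = index.get((m.chrom, m.strand))
    if entry is None:
        return None
    starts, group = entry
    # Rightmost region whose start is <= the mutation position.
    i = bisect.bisect_right(starts, m.pos) - 1
    if i >= 0 and group[i].end >= m.pos:
        return group[i]
    return None
```

This works on one machine when the region file fits in memory. What I cannot see is how to express the same lookup as a distributed job.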

Sample files are here: two sample files

Finally, Python is my language of choice for this, if it matters.

Glorfindel
prussiap
  • You should ask on http://www.biostars.org/. See also: http://stackoverflow.com/questions/1832103 – Pierre Jun 04 '13 at 09:11
  • @pierre Yeah, I've seen some of the similar approaches. I'll ignore overlaps and missing ones for now. I'm asking this in a more CS-oriented manner than as a bio-specific parsing question. I'd like to know how a software engineer would go about mapping this. I've added a link to a picture that hopefully shows what I'm trying to do. – prussiap Jun 04 '13 at 20:15

0 Answers