-6

My 26 data files (.txt) (imputed genotypes for each chromosome) look like:

CHR SNP1 SNP2 SNP3
1   3   1   2   3 
1   3   0   2   1
1   0   0   1   0
1   0   3   3   1
1   1   1   0   2


CHR SNP1 SNP2 SNP3
2   1   1   2   2 
2   0   3   1   1
2   0   0   1   0
2   0   3   3   2
2   3   2   0   1

How I can convert them in plink format (.map and .ped)?

Martin Prikryl
  • 188,800
  • 56
  • 490
  • 992
user2808642
  • 95
  • 1
  • 9
  • you should likely be looking at various tools in bioconductor to help, esp since one of the pkgs there has https://www.rdocumentation.org/packages/snpStats/versions/1.22.0/topics/write.plink – hrbrmstr Nov 10 '17 at 15:35
  • Not know how you generated your imputed genotypes, I have no idea what these indicate. Is zero missing genotypes? Are these 26 files for 26 chromosomes of one individual? Or 26 files for 26 individuals? – caw5cv Nov 10 '17 at 15:56
  • these are 26 files for chromosomes (1-26), here I only show 2 files (chromosome 1 and 2). Zero missings are not important here. I just need to convert them in plink fromat, after combining all 26 files into one plink file – user2808642 Nov 10 '17 at 16:19
  • 1
    plink text (ped) format includes two columns per locus (for a diploid individual). You only have one value for the genotype (0, 1, 2, and 3) and without knowing what those mean, it's impossible to translate into two columns. We'd need to know, for example, if... 0 maps to "0 0", 1 maps to "1 1", 2 maps to "2 2", and 3 maps to "1 2" for your specific case. – caw5cv Nov 10 '17 at 17:25
  • The columns containing the SNP information (except first column) coded as 0, 1, and 2 stand for the homozygous aa, the heterozygous aA or Aa, and the homozygous AA cases, respectively – user2808642 Nov 10 '17 at 17:31
  • additionaly - there is another column represents IDs. – user2808642 Nov 10 '17 at 17:32
  • 2
    cross-posted: https://www.biostars.org/p/282886/ – Pierre Nov 10 '17 at 22:18

1 Answers1

0

I'll be making some assumptions here. You don't specify what "3" indicates for your genotypes, so I coded it as the default missing genotype for Plink (0). I'm using "R" and "A" as simply shorthand for "Reference" and "Alternate" alleles; plink can read these characters just fine. You'd need to edit the dictionary below if you need it to do something else.

This script essentially transposes the first line of your input to create the columns required in the .map file. It reads the second line to get the chromosome number. The third column is left at the default dummy value of "0" because centimorgans is unknown.

The ped file is generated by converting genotypes from your numeric values to what I'm assuming they represent (again, defined in the python dictionary), including the first six required columns (see https://www.cog-genomics.org/plink2/formats#fam for specifics). Importantly, the FID is simply repeating the IID because I don't know relatedness here. IID (individual ID) is just the row number (1st row is ind_1, 2nd row is ind_2, etc. This is fine so long as the order is identical between your different chromosome files. Phenotype is set to the "missing" value of -9. Format is fine on the test files shown below.

#!/usr/bin/env python

import sys

infile_name = sys.argv[1]

pedDict = {
"0" : "R R",
"1" : "R A",
"2" : "A A",
"3" : "0 0"
}

def convertToPlink(infile_name):
    with open(infile_name, 'r') as infile:
        header = infile.readline().rstrip().split()
        chromosome = infile.readline().split()[0]
        with open('chr_' + chromosome + '.map', 'w') as mapfile:
            for POS in header[1:]:
                mapfile.write("\t".join([chromosome, chromosome+"_"+POS+"_SNP", "0", POS])+"\n")
    with open(infile_name, 'r') as infile:
        with open('chr_' + chromosome + '.ped', 'w') as pedfile:
            id_index = 0
            for line in infile:
                if not line.startswith("CHR"):
                    id_index += 1
                    IID = "ind_" + str(id_index)
                    line = line.rstrip().split()
                    pedfile.write(" ".join([IID, IID, "0", "0", "0", "-9"]+[pedDict[genotype] for genotype in line][1:])+ "\n")


convertToPlink(infile_name)

test input: chr_1.txt

CHR     59      100     130     165
1       3       1       2       3
1       3       0       2       1
1       0       0       1       0
1       0       3       3       1
1       1       1       0       2

executing

convertToPlink.py chr_1.txt

yields two plink-compatible files:

chr_1.map

1       1_59_SNP        0       59
1       1_100_SNP       0       100
1       1_130_SNP       0       130
1       1_165_SNP       0       165

chr_1.ped

ind_1 ind_1 0 0 0 -9 0 0 R A A A 0 0
ind_2 ind_2 0 0 0 -9 0 0 R R A A R A
ind_3 ind_3 0 0 0 -9 R R R R R A R R
ind_4 ind_4 0 0 0 -9 R R 0 0 0 0 R A
ind_5 ind_5 0 0 0 -9 R A R A R R A A
caw5cv
  • 701
  • 3
  • 9