-4

I am working on a program that estimates the statistic Tajima's D in a series of sliding windows across a chromosome. The chromosome itself is also divided into a number of different regions with (hopefully) functional significance. The sliding window analysis is performed by my script on each region.

At the start of the program, I define the size of the sliding windows and the size of the steps that move from one window to the next. I import a file which contains the coordinates for each different chromosomal region, and import another file which contains all the SNP data I am working with (this is read line-by-line, as it is a large file). The program loops through the list of chromosomal locations. For each location, it generates an index of steps and windows for the analysis, partitions the SNP data into output files (corresponding with the steps), calculates key statistics for each step file, and combines these statistics to estimate Tajima's D for each window.

The program works well for small files of SNP data. It also works well for the first iteration over the first chromosomal break point. However, for large files of SNP data, the step size in the analysis is inexplicably decreased as the program iterates over each chromosomal regions. For the first chromosomal regions, the step size is 2500 nucleotides (this is what it is suppose to be). For the second chromosome segment, however, the step size is 1966, and for the third it is 732.

If anyone has any suggestions at to why this might be the case, please let me know. I am especially stumped as this program seems to work size for small files but not for larger ones.

My code is below:

import sys
import math
import fileinput
import shlex
import string
windowSize = int(500)
stepSize = int(250)
n = int(50)     #number of individuals in the anaysis
SNP_file = open("SNPs-1.txt",'r')
SNP_file.readline()
breakpoints = open("C:/Users/gwilymh/Desktop/Python/Breakpoint coordinates.txt", 'r')
breakpoints = list(breakpoints)
numSegments = len(breakpoints)
# Open a file to store the Tajima's D results:
outputFile = open("C:/Users/gwilymh/Desktop/Python/Sliding Window Analyses-2/Tajima's D estimates.txt", 'a')
outputFile.write(str("segmentNumber\tchrSegmentName\tsegmentStart\tsegmentStop\twindowNumber\twindowStart\twindowStop\tWindowSize\tnSNPs\tS\tD\n"))

#Calculating parameters a1, a2, b1, b2, c1 and c2
numPairwiseComparisons=n*((n-1)/2)
b1=(n+1)/(3*(n-1))
b2=(2*(n**2+n+3))/(9*n*(n-1))
num=list(range(1,n))                # n-1 values as a list
i=0
a1=0
for i in num:
   a1=a1+(1/i)
   i=i+1
j=0
a2=0
for j in num:
    a2=a2+(1/j**2)
    j=j+1
c1=(b1/a1)-(1/a1**2)
c2=(1/(a1**2+a2))*(b2 - ((n+2)/(a1*n))+ (a2/a1**2) )

counter6=0
#For each segment, assign a number and identify the start and stop coodrinates and the segment name
for counter6 in range(counter6,numSegments):
    segment = shlex.shlex(breakpoints[counter6],posix = True)
    segment.whitespace += '\t'
    segment.whitespace_split = True
    segment = list(segment)
    segmentName = segment[0]
    segmentNumber = int(counter6+1)
    segmentStartPos = int(segment[1])
    segmentStopPos = int(segment[2])
    outputFile1 = open((("C:/Users/gwilymh/Desktop/Python/Sliding Window Analyses-2/%s_%s_Count of SNPs and mismatches per step.txt")%(str(segmentNumber),str(segmentName))), 'a')

#Make output files to index the lcoations of each window within each segment
    windowFileIndex = open((("C:/Users/gwilymh/Desktop/Python/Sliding Window Analyses-2/%s_%s_windowFileIndex.txt")%(str(segmentNumber),str(segmentName))), 'a')
    k = segmentStartPos - 1
    windowNumber = 0
    while (k+1) <=segmentStopPos:
        windowStart = k+1
        windowNumber = windowNumber+1
        windowStop = k + windowSize 
        if windowStop > segmentStopPos:
            windowStop = segmentStopPos
        windowFileIndex.write(("%s\t%s\t%s\n")%(str(windowNumber),str(windowStart),str(windowStop)))
        k=k+stepSize
    windowFileIndex.close()

# Make output files for each step to export the corresponding SNP data into + an index of these output files
    stepFileIndex = open((("C:/Users/gwilymh/Desktop/Python/Sliding Window Analyses-2/%s_%s_stepFileIndex.txt")%(str(segmentNumber),str(segmentName))), 'a')
    i = segmentStartPos-1
    stepNumber = 0
    while (i+1) <= segmentStopPos:
        stepStart = i+1
        stepNumber = stepNumber+1
        stepStop = i+stepSize 
        if stepStop > segmentStopPos:
            stepStop = segmentStopPos
        stepFile = open((("C:/Users/gwilymh/Desktop/Python/Sliding Window Analyses-2/%s_%s_step_%s.txt")%(str(segmentNumber),str(segmentName),str(stepNumber))), 'a')
        stepFileIndex.write(("%s\t%s\t%s\n")%(str(stepNumber),str(stepStart),str(stepStop)))
        i=i+stepSize
    stepFile.close()
    stepFileIndex.close()

# Open the index file for each step in current chromosomal segment
    stepFileIndex = open((("C:/Users/gwilymh/Desktop/Python/Sliding Window Analyses-2/%s_%s_stepFileIndex.txt")%(str(segmentNumber),str(segmentName))), 'r')
    stepFileIndex = list(stepFileIndex)
    numSteps = len(stepFileIndex)

    while 1:
        currentSNP = SNP_file.readline()
        if not currentSNP: break
        currentSNP = shlex.shlex(currentSNP,posix=True)
        currentSNP.whitespace += '\t'
        currentSNP.whitespace_split = True
        currentSNP = list(currentSNP)
        SNPlocation = int(currentSNP[0])
        if SNPlocation > segmentStopPos:break
        stepIndexBin = int(((SNPlocation-segmentStartPos-1)/stepSize)+1)
        #print(SNPlocation, stepIndexBin)
        writeFile = open((("C:/Users/gwilymh/Desktop/Python/Sliding Window Analyses-2/%s_%s_step_%s.txt")%(str(segmentNumber),str(segmentName),str(stepIndexBin))), 'a')
        writeFile.write((("%s\n")%(str(currentSNP[:]))))
        writeFile.close()

    counter3=0
    for counter3 in range(counter3,numSteps):
# open up each step in the list of steps across the chromosomal segment:
        L=shlex.shlex(stepFileIndex[counter3],posix=True)
        L.whitespace += '\t'
        L.whitespace_split = True
        L=list(L)
        #print(L)
        stepNumber = int(L[0])
        stepStart = int(L[1])
        stepStop = int(L[2])
        stepSize = int(stepStop-(stepStart-1))
#Now open the file of SNPs corresponding with the window in question and convert it into a list:
        currentStepFile = open(("C:/Users/gwilymh/Desktop/Python/Sliding Window Analyses-2/%s_%s_step_%s.txt")%(str(segmentNumber),str(segmentName),str(counter3+1)),'r')
        currentStepFile = list(currentStepFile)
        nSNPsInCurrentStepFile = len(currentStepFile)
        print("number of SNPs in this step is:", nSNPsInCurrentStepFile)
        #print(currentStepFile)
        if nSNPsInCurrentStepFile == 0:
            mismatchesPerSiteList = [0] 
        else:
# For each line of the file, estimate the per site parameters relevent to Tajima's D
            mismatchesPerSiteList = list()
            counter4=0
            for counter4 in range(counter4,nSNPsInCurrentStepFile):
                CountA=0
                CountG=0
                CountC=0
                CountT=0
                x = counter4
                lineOfData = currentStepFile[x]
                counter5=0
                for counter5 in range(0,len(lineOfData)):
                    if lineOfData[counter5]==("A" or "a"): CountA=CountA+1
                    elif lineOfData[counter5]==("G" or "g"): CountG=CountG+1
                    elif lineOfData[counter5]==("C" or "c"): CountC=CountC+1
                    elif lineOfData[counter5]==("T" or "t"): CountT=CountT+1
                    else: continue
                AxG=CountA*CountG
                AxC=CountA*CountC
                AxT=CountA*CountT
                GxC=CountG*CountC
                GxT=CountG*CountT
                CxT=CountC*CountT
                NumberMismatches = AxG+AxC+AxT+GxC+GxT+CxT
                mismatchesPerSiteList=mismatchesPerSiteList+[NumberMismatches]
        outputFile1.write(str(("%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\n")%(segmentNumber, segmentName,stepNumber,stepStart,stepStop,stepSize,nSNPsInCurrentStepFile,sum(mismatchesPerSiteList))))
    outputFile1.close()

    windowFileIndex = open((("C:/Users/gwilymh/Desktop/Python/Sliding Window Analyses-2/%s_%s_windowFileIndex.txt")%(str(segmentNumber),str(segmentName))), 'r')
    windowFileIndex = list(windowFileIndex)
    numberOfWindows = len(windowFileIndex)
    stepData = open((("C:/Users/gwilymh/Desktop/Python/Sliding Window Analyses-2/%s_%s_Count of SNPs and mismatches per step.txt")%(str(segmentNumber),str(segmentName))), 'r')
   stepData = list(stepData)
    numberOfSteps = len(stepData)

    counter = 0
    for counter in range(counter, numberOfWindows):
        window = shlex.shlex(windowFileIndex[counter], posix = True)
        window.whitespace += "\t"
        window.whitespace_split = True
        window = list(window)
        windowNumber = int(window[0])
        firstCoordinateInCurrentWindow = int(window[1])
        lastCoordinateInCurrentWindow = int(window[2])
        currentWindowSize = lastCoordinateInCurrentWindow - firstCoordinateInCurrentWindow +1
        nSNPsInThisWindow = 0
        nMismatchesInThisWindow = 0

        counter2 = 0
        for counter2 in range(counter2,numberOfSteps):
            step = shlex.shlex(stepData[counter2], posix=True)
            step.whitespace += "\t"
            step.whitespace_split = True
            step = list(step)
            lastCoordinateInCurrentStep = int(step[4])
            if lastCoordinateInCurrentStep < firstCoordinateInCurrentWindow: continue
            elif lastCoordinateInCurrentStep <= lastCoordinateInCurrentWindow:
                nSNPsInThisStep = int(step[6])
                nMismatchesInThisStep = int(step[7])
                nSNPsInThisWindow = nSNPsInThisWindow + nSNPsInThisStep
                nMismatchesInThisWindow = nMismatchesInThisWindow + nMismatchesInThisStep
            elif lastCoordinateInCurrentStep > lastCoordinateInCurrentWindow: break
        if nSNPsInThisWindow ==0 :
            S = 0
            D = 0
        else:
            S = nSNPsInThisWindow/currentWindowSize
            pi = nMismatchesInThisWindow/(currentWindowSize*numPairwiseComparisons)
            print(nSNPsInThisWindow,nMismatchesInThisWindow,currentWindowSize,S,pi)
            D = (pi-(S/a1))/math.sqrt(c1*S + c2*S*(S-1/currentWindowSize))
        outputFile.write(str(("%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\n")%(segmentNumber,segmentName,segmentStartPos,segmentStopPos,windowNumber,firstCoordinateInCurrentWindow,lastCoordinateInCurrentWindow,currentWindowSize,nSNPsInThisWindow,S,D)))
gwilymh
  • 415
  • 1
  • 7
  • 20
  • 8
    First of all, stop. Split your code up into separate functions, test each function. If you still can't get it to work, post here only the *relevant portion* of the code. Also, search StackOverflow for "sliding window iterator" there are plenty of implementations out there. – Joel Cornett Mar 06 '13 at 22:26
  • 2
    Also, your problem probably has something to do with the line `stepSize = int(stepStop-(stepStart-1))`. – Joel Cornett Mar 06 '13 at 22:28
  • 1
    Don't `import string`: `Module 'string'` `Warning:` most of the code you see here isn't normally used nowadays. Beginning with Python 1.6, many of these functions are implemented as methods on the standard string object. – askewchan Mar 06 '13 at 22:34
  • 2
    I thought Python code was supposed to be easy to read – John La Rooy Mar 06 '13 at 23:20
  • 1
    @gnibbler: OP is using the Excel spreadsheet naming convention. – Joel Cornett Mar 08 '13 at 02:02

1 Answers1

5

A quick search shows that you do change your stepSize on line 110:

    stepStart = int(L[1])
    stepStop = int(L[2])
    stepSize = int(stepStop-(stepStart-1))

stepStop and stepStart appear to depend on your files' contents, so we can't debug it further.

Pavel Anossov
  • 60,842
  • 14
  • 151
  • 124