Matching parts of lines in a file (python)

Question

I currently have a list of genes in a file. Each line has a chromosome with it's information. Such an entry appears as:

NM_198212 chr7 + 115926679 115935830 115927071 11593344 2 115926679,'115933260', 115927221,'115935830',

The sequence for the chromosome starts at base 115926679 and continues up to(but not including) base 115935830

If we want the spliced sequence, we use the exons.The first extends from 115926679 to 155927221, and the second goes from '115933260' to '115935830'

However, I have run across a problem when on a complementary sequence such as:

NM_001005286 chr1 - 245941755 245942680 245941755 245942680 1 245941755, '245942680'

Since column 3 is a '-', these coordinates are in reference to the anti-sense strand (the complement to the strand). The first base (in bold) matches the last base on the sense strand (in italics). Since the file only has the sense stand, I need to try to translate coordinates on the anti-sense strand to the sense strand, pick out the right sequence and then reverse-complement it.

That said, I have only been programming for about half a year and and not sure how to starts going about doing this.

I have written a regular expression:

'(NM_\d+)\s+(chr\d+)([(\+)|(-)])\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+),(\d+),s+(\d+),(\d+),'

but am now unsure as to how to start this function... If anyone can help me get started at all on this, perhaps making me see how to do this, I would very much appreciate it.

OK: suppose this is chromosome 25:

AAAAAAAAAACCCCCCCCCCTTTTTTTTTTGGGGGGGGGG

(there are 10 of each character).

Now: if I am looking for an unspliced gene on: chr25 + 10 20

Then the gene starts on position 10 (starting from 0), and goes up to but not including position 20. So its:

CCCCCCCCCC

This is easy. It matches python string slicing really well.

Its more confusing if I give you:

chr25 - 10 20

What you have is the positive strand. But this gene is on the negative (complementary) strand. Remember what the chromosome looks like as a double-strand:

AAAAAAAAAACCCCCCCCCCTTTTTTTTTTGGGGGGGGGG
TTTTTTTTTTGGGGGGGGGGAAAAAAAAAACCCCCCCCCC

We are looking for the gene on the bottom strand. Meaning we count from 0 starting on the right. Number the top strand from the left, and the bottom strand from the right. So what I want here is AAAAAAAAAA.

The catch is that I'm only giving you the top strand. I'm not giving you the bottom strand. (You could generate yourself from the top strand — but given how large it is, I advise against that.)

So you need to convert coordinates. On the bottom strand, base 0 (the right-most C) is opposed to base 39 on the top strand. Base 1 is against base 38. Base 2 is against case 37. (Important point: notice what happens when you add these two numbers up — every time.) So base 10 is against base 29, and base 19 is against base 20.

So: if I want to find base 10-20 on the bottom strand, I can look at base 20-29 on the top (and then reverse-complement it).

I need to figure out how to translate to coordinates on the bottom strand to the equivalent coordinates on the bottom strand. Yes: it is very confusing

I have tried weronika's original answer:

fields = line.split(' \t')
geneID, chr, strand = fields[:2]
start = int(fields[3])
end = int(fields[4])
if strand == '-':
    start,end = -(start + 1), -(end + 1) # this part was changed from original answer.

which is on the right track, but it's not enough. This would take the 10 and 20, and turn it into a 20 and 10.

And I know I can reverse complement the string by doing this:

r = s[::-1]
bc = {'A': 'T', 'C': 'G', 'G': 'C', 'T': 'A'}
l = list(r)
o = [bc[base] for base in l]
     return ''.join(o)

Edited! Does this look correct?!

fp2 = open('chr22.fa', 'r')
fp = open('chr22.fa', 'r')
for line in fp2:
    newstring = ''
    z = line.strip()
    newstring += z
for line in fp:
    fields = line.split('\t')
    gene_ID, chr, strand = fields[:2]
    start = int(fields[3])
    end = int(fields[4])
    bc = {'A': 'T', 'C': 'G', 'G': 'C', 'T': 'A', 'a': 't', 't': 'a', 'c':'g', 'g':'c', 'N':'N', 'n':'n'}
    l = list(newstring)        
    if strand == '+':
        geneseq = ''.join([bc[base] for base in l[start:end]]) 
    if strand == '-':
        newstart, newend = -(start + 1), -(end + 1)
        genseq = ''.join([bc[base] for base in l[newstart:newend]])

I'm sorry, I don't understand what you're actually trying to do or what the strange syntax of your lines means. — Katriel, Apr 21 '12 at 20:54
maybe the `fuzzy string matching` can help you:http://seatgeek.com/blog/dev/fuzzywuzzy-fuzzy-string-matching-in-python . Check this post http://stackoverflow.com/questions/362231/python-finding-partial-string-matches-in-a-large-corpus-of-strings as well. :) — Thanasis Petsas, Apr 21 '12 at 21:01
I just made some bold, some italics, and some in quotes to show the different parts I was trying to use — Peter Hanson, Apr 21 '12 at 21:01
@PatrickCampbell OK, I understand that you care about the third, fourth and fifth columns. But what are you _actually trying to do_ with them? Reverse the numbers? Do the other numbers matter? What are the commas? — Katriel, Apr 21 '12 at 21:10
I am trying to match the '-' sequence with the missing numbers. The commas separate the spliced sequence. The first being between the two italicized numbers and the second part being between the two numbers in "" — Peter Hanson, Apr 21 '12 at 21:16
aka to its complementary strand, and pick out the coordinating sequence from that strand — Peter Hanson, Apr 21 '12 at 21:18
The numbers in single quotes indicate the start and end of the second part of the spliced sequence — Peter Hanson, Apr 21 '12 at 21:21
@PatrickCampbell: Trye to explain your problem more abstractly. Instead of saying "genes, chromosomes, sense, antisense, etc.". Say (for example) "I have a file that I need to parse. The columns are space delimited. Based on whether the 3rd column is a '+' or '-', I need to read remainder of the row differently. If the row has a '-' I need to reverse the order of the remainder of the row." Of course, you would replace what I said with what you really mean (as I have no idea what that is). — Joel Cornett, Apr 21 '12 at 21:22
I think many people here have a hard time understanding all this biology terminology. How about you give a simple input string and tell us what you expect the output string to be. And break down what each section of the string means. — Jack, Apr 21 '12 at 21:24
Yeah, I gotcha. I am going to try to come up with a different way to say it with less biology in it and then repost the question. — Peter Hanson, Apr 21 '12 at 21:33
Are you SURE this is how your input file works?? In all the biology formats I've worked with in the past, the coordinates given in a case like this are still sense strand coordinates. — weronika, Apr 22 '12 at 16:04
it is a file called human.genes and looks like the example given above. I am not sure what you mean by that though... — Peter Hanson, Apr 22 '12 at 16:05
Also, a note on normal SO usage: instead of taking a part of my answer and putting it into your question as something you tried, you should just say you tried my answer and it didn't do what you wanted (and preferably say it in a comment on my answer too, so I get a notification). — weronika, Apr 22 '12 at 16:09
What I mean about your input file: in my experience, if a file says a gene is on the - strand in positions 10-20, it's still in positions 10-20 *counting from the beginning of the sense strand*, not from the beginning of the antisense strand. The sequence needs to be reverse-complemented, but the coordinates are fine. Of course, it's possible your file really works differently than I'm used to, but please double-check that, it's really very odd. — weronika, Apr 22 '12 at 16:11
Well, I think the since the file only contains the numbers of the genes, I want to manipulate the coordinates, but is there a way to just reverse complement the strand and have it mean the same thing? I am confused as to how to change this into what I am wanting. I may be over complicating — Peter Hanson, Apr 22 '12 at 16:14
Since the file is too large to actually reverse compliment the chromosome strand, I need to figure out how to just say to do it with the matched coordinates. That is why I was doing what I did above — Peter Hanson, Apr 22 '12 at 16:20
I am pretty sure you can avoid all the biology nonsense in this problem and look at just the numbers, I know that is what I really need to be looking at here — Peter Hanson, Apr 22 '12 at 16:33
All right, if you're really sure your coordinates work in the way you described, then hopefully my answer will work. That said, you really might want to talk to whoever gave you that file about why they're using such an odd system. — weronika, Apr 22 '12 at 16:34
@PatrickCampbell: Are the files you're accessing pure text and completely composed of bases (no extra newlines, chars, headers, etc.)? — Joel Cornett, Apr 22 '12 at 17:32
The first file actually has no bases, it has 3 lines of headers, and then lines as the example above. The second file is of chr22 and is purely DNA sequences — Peter Hanson, Apr 22 '12 at 18:05
Well, *does* it work? If you have the files and the code, why are you asking if it will work instead of trying it yourself? But I imagine not. To start with, you appear to be getting your chromosome sequence and your gene coordinates from the same file, which doesn't make any sense to me. — weronika, Apr 25 '12 at 03:40

score 0 · Answer 1 · answered Apr 21 '12 at 21:51

I didn't understand the domain problem, but it looks like you're trying to cram too much into one regex. Try to break it down into simpler subproblems, like the following (in pseudocode):

if third column is a '+'
    parseRegularSequence()
else
    parseComplementarySequence()

weronika · Answer 2 · 2012-04-22T16:21:05.553

0

As noted in my comment to the question, this seems like a very odd file format, thus my original confusion.

Note: if this is one of the standard biology file formats, you would probably be better off using Biopython or something similar to parse it.

If you want to do your own parsing, regular expressions still seem like the wrong way to go - difficult to read and unnecessary with a simple space/tab separated file. I'm assuming you have parsed your chromosome fasta file and have the sequences of all the chromosomes as a chrom_sequences name:seq dictionary, and also that you have a reverse_complement function (both of those things are easily implemented by hand but probably better done with biopython).

fields = line.split(' ')  # or '\t' instead of ' ' if the file is tab-separated
gene_ID, chr, strand = fields[:2]
start, end = [int(x) for x in fields[3:5])
this_chromosome_seq = chrom_sequences[chr]
# if strand is +, just take the sequence based on the start-end position
if strand == '+':
  # be careful about off-by-one errors here - some formats give you a 1-based position, 
  #  other formats make it 0-based, and they can also be end-inclusive or end-exclusive
  gene_sequence = this_chromosome_seq[start:end]
# if your coordinates really are given as antisense strand coordinates when strand is -,
#  you just need to subtract them from the chromosome length to get sense-strand coordinates,
#  (switching start and end so they're still in smaller-to-larger order),
#  and then reverse-complement the resulting sequence. 
if strand == '-':
  chrom_length = len(this_chromosome_seq)
  # again, be careful about off-by-one errors here!
  new_start,new_end = chrom_length-end, chrom_length-start
  gene_sequence = reverse_complement(this_chromosome_seq[new_start:new_end])

Original answer, didn't actually do what was asked:

If you just want to get the start/end positions, do something like this:

fields = line.split(' ')  # or '\t' instead of ' ' if the file is tab-separated
gene_ID, chr, strand = fields[:2]
start = int(fields[3])
end = int(fields[4])
if strand == '-':
  start,end = end,start

Then you'll have to parse your fasta file to actually get the sequence for these start-end coordinates, and reverse-complement it if strand=='-'. Again, I think Biopython can do most of that for you.

Another note - Biostar is a good StackExchange-type site specifically for bioinformatics, you may get better answers there.

edited Apr 22 '12 at 16:21

answered Apr 21 '12 at 22:42

weronika

2,561
2
24
30

you are right, using regular expressions is a bad idea, i dont remember who said it but it goes like this: "you have a problem so you use regular expressions to solve it. now you have two problems." – jojo Apr 22 '12 at 16:02
Sorry, I have tried this. Yet it is not quite what I was anticipating as seen above – Peter Hanson Apr 22 '12 at 16:15
@PatrickCampbell Now that you edited the question so I have a better idea what's going on, I gave the answer another try - does that work? – weronika Apr 22 '12 at 16:21
I may have gotten lost in your code, but what does this do: this_chromosome_seq = chrom_sequences[chr] – Peter Hanson Apr 22 '12 at 16:29
@PatrickCampbell I'm assuming you have a `chrom_sequences` dictionary containing multiple chromosome sequences, like this: `{'chr1': 'AAATTTGGGC', 'chr2': 'AGGGGGGCCCC', ...}`, so I'm just pulling out the sequence for your particular chromosome from it. Maybe your fasta parsing resulted in a different data structure - just substitute whatever code results in `this_chromosome_seq` being the sequence of the chromosome you want (like AAAAAAAAAACCCCCCCCCCTTTTTTTTTTGGGGGGGGGG for chromosome 25) – weronika Apr 22 '12 at 16:32
Will parsing like this create said dictionary? Otherwise I just have a file with each one on different lines. I really appreciate all the help you are providing me! I believe you can help me through this, but would you actually like me to send you the file so you can see what I am talking about. Again, thanks for all this help! – Peter Hanson Apr 22 '12 at 16:37
You're welcome! No, my code doesn't create this dictionary - as I said in my answer, "I'm assuming you have parsed your chromosome fasta file and have the sequences of all the chromosomes as a chrom_sequences name:seq dictionary". If you need help with how to parse a fasta file, you should probably ask another question, since this one's getting complicated enough as it is! – weronika Apr 22 '12 at 16:41
I know, I have made this much more complicated than it needed to be. But, you have been a lot of help, and I know I am close to getting it. But, if I could email you, I feel as though it would be easier for you to see what I am talking about here. I have a file (about 2mb) that has each chromosome strand like the example from the start. – Peter Hanson Apr 22 '12 at 16:45
actually I may have found the key! In my other posted question, I just need to incorporate this into the expression somehow. If you are going backwards, for a given start a and stop b (not inclusive), you'd want a slice from -(a + 1) to -(b + 1). – Peter Hanson Apr 22 '12 at 16:48
my code already does that, in a more verbose manner, by setting `new_start,new_end = chrom_length-end, chrom_length-start` and taking `this_chromosome_seq[new_start:new_end]` to get reverse-complemented. The `[...:...]` thing is "classic string slicing". As noted in my comment to the relevant code section, you may want to add +1's to the coordinates, depending on whether they're 1-based or 0-based etc - you'll need to figure that part out. – weronika Apr 22 '12 at 17:00
You're right, it does. I was just thrown off by the dictionary part. I am going to edit you're original post answer in my question. If you could take a look at that, it would be great. I think I am understanding it, but I may be missing a key part. – Peter Hanson Apr 22 '12 at 17:02
@PatrickCampbell - yes, what you posted in the question should work, now you just need to grab the sequence from those coordinates and reverse-complement it (which the newer version of my answer includes somewhat). So, what else are you missing? – weronika Apr 22 '12 at 17:38
I am just trying to figure out how to incorporate what I had for the reverse complement into what the now functional coordinate function (the part I added to your original one). I am trying to understand the way the dictionary should be set up in order for this to work. I want a list of the other file with bases? Not sure... – Peter Hanson Apr 22 '12 at 18:24
@PatrickCampbell There are three steps to this: 1) get the gene coordinates, 2) get the gene SEQUENCE, and 3) reverse-complement the gene sequence. 1 is done already, and you have a function for 3. So you just need to do 2 - get the chromosome sequence (as a string), and slice the gene sequence out of it using the new start,end coordinates. What format do you have the chromosome sequence in, in python? It doesn't need to be a dictionary, whatever you have will be fine. – weronika Apr 22 '12 at 19:08
I have one file in the format in my example (in a .genes format) and then a .fa file that has every base (the actual DNA sequence on the chr) as well. You're right I have part 1 and 3 in the things I need. I just need to get the sequence... – Peter Hanson Apr 22 '12 at 19:22
So you don't know how to get the chromosome sequence from the file read into python, is that it? I.e. you don't know how to parse a fasta file. As I said, that's really a separate question from this one, but here are [some Biostar answers](http://www.biostars.org/post/show/710/correct-way-to-parse-a-fasta-file-in-python/). – weronika Apr 22 '12 at 19:30

Katriel · Accepted Answer · 2012-04-24T19:44:39.603

0

If you slice a string with a negative number, Python counts backward from the end.

complement = {'A': 'T', 'C': 'G', 'G': 'C', 'T': 'A'}
s = "AAAAAAAAAACCCCCCCCCCTTTTTTTTTTGGGGGGGGGG"
"".join(complement[c] for c in s[-20:-10])

EDIT:

Your edit looks about right to me, yes. I am very bad at checking for fencepost errors, but you're better placed to see those than I am anyway!

I've rewritten your code to be more Pythonic, but not changed anything substantial.

bc = {'A': 'T', 'C': 'G', 'G': 'C', 'T': 'A', 'N':'N'}

f = open('chr22.fa')
l = ''.join(line.strip() for line in f)
f.seek(0)

for line in f:
    fields = line.split('\t')
    gene_ID, chr, strand = fields[:2]

    start = int(fields[3])
    end = int(fields[4])

    if strand == '-':
        start, end = -(start + 1), -(end + 1)

    geneseq = ''.join(bc[base.upper()] for base in l[start:end])

edited Apr 24 '12 at 19:44

answered Apr 22 '12 at 16:16

Katriel

120,462
19
136
170

Right, I understand this. But to do the size of the file I am trying to do this to, about 50mb I am trying to manipulate the coordinates and not do an reverse compliment with letters like this. So I need to try to figure out how to match coordinates.... – Peter Hanson Apr 22 '12 at 16:19
"manipulate coordinates" = "subtract from `len(s)`". Or am I missing something? – Katriel Apr 22 '12 at 16:53
No, this is correct. I just want to do it without the inclusion of actual letters, therefore I want to use the first coded part in my question, instead of the second one – Peter Hanson Apr 22 '12 at 16:55
I need something along the lines of: If you are going backwards, for a given start a and stop b (not inclusive), you'd want a slice from -(a + 1) to -(b + 1) – Peter Hanson Apr 22 '12 at 16:56
Possibly, can you take a look at my revision above? – Peter Hanson Apr 24 '12 at 19:32
@PatrickCampbell looks good to me. See edit for cleaner code. – Katriel Apr 24 '12 at 19:44

score 0 · Answer 4 · answered Apr 23 '12 at 01:15

I imagine (especially since your files are large) that it would be MUCH easier to read and write directly from the file buffer.

Let's say you've already parsed your header file. The line that you parsed looks like so:

line = "NM_001005286 chr1 - 245941755 245942680 245941755 245942680 1"

you then determine what the start/end positions are (in antisense coords):

name, chromosome, direction, start, end = line[:5]

Then, do the following:

#Open up the file `chr1.txt`.
f = open("chr1.txt", "r")

#Determine the read length.
read_len = end - start

#Seek to the appropriate position (notice the second argument, 2 -- this means
#seek from the end of the file)
f.seek(-end, 2)

#Read the data
sense_seq = f.read(read_len)

After that it's just a matter of converting the sequence to antisense.

Easy :)

Matching parts of lines in a file (python)

4 Answers4