
I'm trying to write Python code that finds the string between two specific strings. When I run the code on a single string, I get the desired output. However, I need to match the pattern against an array of sequences, and that raises an error.

I defined a function to look for a pattern between two user-specified sequences:

import re

def find_between(prefix, suffix, text):
    pattern = r"{}\s*(.*)\s*{}".format(re.escape(prefix), re.escape(suffix))
    result = re.search(pattern, text, re.DOTALL)
    if result:
        return result.group(1)
    else:
        return None

When I try a single string, it works:

text = "AGGTCCTGTAAACCT"
prefix = "TCCT"
suffix = "ACCT"
find_between(prefix, suffix, text)

Output: 'GTAA'

But when I read the FASTQ file and run the same search, it does not work:

seqs = readFastq('FN1.fastq')

text = seqs
prefix = "TCCT"
suffix = "ACCT"
find_between(prefix, suffix, text)

It throws this error:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-26-9c35672e7561> in <module>()
  2 prefix = "TCCT"
  3 suffix = "ACCT"
----> 4 find_between(prefix, suffix, text)

<ipython-input-19-5f42599c717f> in find_between(prefix, suffix, text)
  3 def find_between(prefix, suffix, text):
  4     pattern = r"{}\s*(.*)\s*{}".format(re.escape(prefix),     re.escape(suffix))
----> 5     result = re.search(pattern, text, re.DOTALL)
  6     if result:
  7         return result.group(1)

/Users/shravantikrishna/anaconda/lib/python3.6/re.py in search(pattern, string, flags)
180     """Scan through string looking for a match to the pattern, returning
181     a match object, or None if no match was found."""
--> 182     return _compile(pattern, flags).search(string)
183 
184 def sub(pattern, repl, string, count=0, flags=0):

TypeError: expected string or bytes-like object
  • The text variable is probably not a string or bytes object. What do you get if you print type(text)? You may be able to convert 'text' to an actual string or bytes object before calling find_between... – Ron Norris Jun 20 '17 at 16:01
  • It still doesn't work. Also, do you know how I can allow up to two mismatched letters in the prefix and suffix? In the real case, the suffix and prefix are going to be the same string. – user8033590 Jun 20 '17 at 21:10
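
On the mismatch question in the comment above: one possible approach, sketched here without regex, is to slide a window over the sequence and accept any window whose Hamming distance to the prefix (or suffix) is at most two. The names hamming, find_between_fuzzy, and max_mismatches are made up for this illustration and are not from the question. The sketch only handles substitutions, not insertions or deletions, and it returns the first acceptable prefix/suffix pair it finds.

```python
def hamming(a, b):
    """Count the positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def find_between_fuzzy(prefix, suffix, text, max_mismatches=2):
    """Return the text between the first approximate prefix match and the
    first approximate suffix match after it, or None if there is no such pair.

    Hypothetical helper: only substitutions count as mismatches.
    """
    p, s = len(prefix), len(suffix)
    # Slide a window over the text looking for a near-match to the prefix.
    for i in range(len(text) - p + 1):
        if hamming(text[i:i + p], prefix) > max_mismatches:
            continue
        start = i + p
        # From the end of that prefix window, look for a near-match to the suffix.
        for j in range(start, len(text) - s + 1):
            if hamming(text[j:j + s], suffix) <= max_mismatches:
                return text[start:j]
    return None


# Both flanks are mutated copies of AGGCCCCC (two and one substitutions).
text = "AGGTCCCA" + "ATGC" + "AGGCCGCC"
print(find_between_fuzzy("AGGCCCCC", "AGGCCCCC", text))  # ATGC
```

With short flanks such as the 4-letter TCCT and ACCT, allowing two mismatches accepts many windows by chance, so the tolerance probably only makes sense for longer flanking sequences.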

1 Answer


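I wouldn't use regex for matching in this apparently simple case. If you're interested in finding the text between a prefix and suffix, you can use:

result = text.lstrip(prefix[:2]).rstrip(suffix[:2])

Note that lstrip and rstrip treat their argument as a set of characters to remove, not as a literal prefix or suffix, so this only works when the data itself does not begin or end with those characters. Also, you did not say which 2 characters you don't need to match in the prefix and suffix.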

Here's some sample code and data...

text = 'XXsome data that needs to be parsedXX'
prefix = 'XXYY'
suffix = 'XXYY'
result = text.lstrip(prefix[:2]).rstrip(suffix[:2])
print(result)

some data that needs to be parsed
Ron Norris
  • Thanks! To be clear, I have one sequence, say ATGC, between two sequences AGGCCCCC and AGGCCCCC. I want to extract ATGC even if any two letters in AGGCCCCC do not match. The above code won't work because it doesn't account for mismatches at arbitrary positions. – user8033590 Jun 20 '17 at 21:31
  • I'll take back what I said about not needing regex, because it sounds like you're facing potentially greater data variability in both the text and the prefix/suffix. So that would bring you back to looking at what type of object readFastq('FN1.fastq') returns to see why regex is throwing an exception. It seems to be complaining about text, which is also seqs. – Ron Norris Jun 21 '17 at 01:18
  • Yes, I think that could be the issue. With one string it works, but even with two strings it throws that error. Is there a way I can make it read an array of strings and return the results as a list? – user8033590 Jun 21 '17 at 16:21
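
On the last comment's question about an array of strings: re.search accepts a single str (or bytes) object, so passing a whole list raises the TypeError shown in the question. A minimal sketch of one way around it, assuming readFastq (not shown in the question) returns a plain list of str sequences, is to call find_between once per element. The wrapper name find_between_all and the sample list are invented for this illustration.

```python
import re


def find_between(prefix, suffix, text):
    """Return the text between prefix and suffix, or None if there is no match."""
    pattern = r"{}\s*(.*)\s*{}".format(re.escape(prefix), re.escape(suffix))
    result = re.search(pattern, text, re.DOTALL)
    return result.group(1) if result else None


def find_between_all(prefix, suffix, texts):
    """Apply find_between to each string in an iterable and collect the results."""
    return [find_between(prefix, suffix, t) for t in texts]


# Stand-in for the output of readFastq (assumed here to be a list of str).
seqs = ["AGGTCCTGTAAACCT", "TTTTTTTT", "TCCTGGACCT"]
print(find_between_all("TCCT", "ACCT", seqs))
# ['GTAA', None, 'GG']
```

If readFastq actually returns something else, for example a tuple of sequences and quality strings, only the sequence list should be passed in; printing type(seqs) and type(seqs[0]) would show which case applies.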