0

I would like to edit a sequencing Fastq file, and delete lines that are repetitive only at certain character positions. Ideally I would iterate over every line in the input file and output a file that has only a single instance of any unique set of characters.

So as shown below. I am only interested at looking at the first 6 chars, last 6 chars, and a portion of the intervening chars of every line, and keep only one instance of each unique combination of the three sequences.

AAAAAACCCCCCCCCCCCTTTTTTTTTTCCCCCCCCAAAAAA    Start by comparing to this line
AAAAAACCCAAACCCCCCTTTTTTTTTTCCCCCCCCAAAAAA    1-6, 19-28, 37-42 are same, so delete
AAAAAACCCCCCCCCCCCTTTTTTTTTTCCCAAACCAAAAAA    1-6, 19-28, 37-42 are same, so delete
TTTTTTCCCCCCCCCCCCTTTTTTTTTTCCCCCCCCAAAAAA    1-6 and 36-42 are same but 37-42 are different so keep

As shown in the above example if we take a file that only contains 4 lines, and i am looking at chars 1-6, 19-28, 37-42, lines 2 and 3 would be deleted, or not output to an output file because they have the same characters at each desired postion, but because line 4 is different it will not be deleted.

I have started with this the following code, and my idea is to set each position to a variable (but I don't know have to get intervening sequence), and then compare with each line as we iterate through the input file.

with open(current_file, 'r') as f:
    next(f)
    for line in f:
        start = line[:6]
        end = line[-7:]

If it helps, these files are also 5-10GB, so not tiny. I would appreciate any help. Thanks.

ZdaR
  • 22,343
  • 7
  • 66
  • 87
The Nightman
  • 5,609
  • 13
  • 41
  • 74

2 Answers2

1

A simple approach is to use a dictionary with a key made out of the sections you want to compare. Each new instance will simply overwrite the last one and you'll save unique instances. For the examples you give:

a = 'AAAAAACCCCCCCCCCCCTTTTTTTTTTCCCCCCCCAAAAAA'    #Start by comparing to this line
b = 'AAAAAACCCAAACCCCCCTTTTTTTTTTCCCCCCCCAAAAAA'    #1-6, 19-28, 37-42 are same, so delete
c = 'AAAAAACCCCCCCCCCCCTTTTTTTTTTCCCAAACCAAAAAA'    #1-6, 19-28, 37-42 are same, so delete
d = 'TTTTTTCCCCCCCCCCCCTTTTTTTTTTCCCCCCCCAAAAAA'    #1-6 and 36-42 are same but 37-42 are different so keep
save_dict = {}
for fastq in (a,b,c,d):
    save_dict['%s%s%s' % (fastq[:6], fastq[19:28], fastq[37:42])] = fastq

ends up with save_dict containing

{'AAAAAACTTTTTTTTTCAAAAA': 'AAAAAACCCCCCCCCCCCTTTTTTTTTTCCCAAACCAAAAAA',
 'TTTTTTCTTTTTTTTTCAAAAA': 'TTTTTTCCCCCCCCCCCCTTTTTTTTTTCCCCCCCCAAAAAA'}

(Check the indexes, I may have not included the ones you're after)

iayork
  • 6,420
  • 8
  • 44
  • 49
  • Thanks this is helpful and pretty close to what I'm looking for. It does seem to me though that adding a 10GB file into a dict would be a problem on RAM space. No? – The Nightman Jun 12 '15 at 17:45
  • If the sequences are redundant enough (4:1?) then you may easily fit the dict in memory. If not, I think you have two relatively straightforward options: Either write into a Pandas data frame with one column keyed as here, and then use drop_duplicates (http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop_duplicates.html) or load your data into an SQL database, SQLite probably being the easiest, and work with that. – iayork Jun 12 '15 at 19:11
1

Here is a script that does the following: Pull the elements from each line as a combined string, check it against a list of combined string it has already seen, add the line to a list if it is a new instance.

with open('path/to/file', 'r') as f:
    lineCharsList = []
    outLines = []
    for line in f:
        lineChars = line[0:6]+line[18:28]+line[36:42]
        if not (lineChars in lineCharsList):
            lineCharsList.append(lineChars)
            outLines.append(line)
James
  • 32,991
  • 4
  • 47
  • 70
  • Appreciate the help, this work well for me. I can't however write to an output file. When I try something like `open(output_file 'w')` then `output_file.write(line)` I get the following error: `TypeError: unsupported operand type(s) for >>: 'str' and 'str'` – The Nightman Jun 12 '15 at 17:34