Finding substrings in a DNA sequence; script returns higher values than expected

Question

I'm struggling with a really frustrating problem, I've spent the past 2.5 hours trying to find the bug, but I can't manage. The problem is this: I have to find the amount of occurrences of each combination of 4 DNA nucleotides (AAAA-TTTT) in a string. The problem is; my script returns wrong values.

The idea is: create a dict of all possible 4-mers AAAA-TTTT, then iterate over every character in the DNA sequence, and check which k-mer in the dict is a match for the current index. Add 1 to that k-mer in the dict, and then return the value.

Seems simple enough, but I really can't figure out why it doesn't work.

Input file: https://drive.google.com/file/d/1hDRaQ76hhhCQO4mFocC6yIKSY5mhgDjt/view?usp=sharing

Output: 340 319 348 337 331 329 343 348 336 345 347 370 307 356 313 368 324 315 365 338 322 327 332 341 336 352 350 354 381 339 330 377 346 318 337 346 383 326 311 335 343 326 354 349 326 367 355 344 313 314 320 356 370 347 327 369 340 337 335 340 368 308 363 346 331 324 341 324 344 330 326 382 323 360 355 355 326 360 341 357 329 342 313 360 335 354 320 359 331 350 311 355 350 345 335 338 308 359 321 316 332 348 331 354 312 351 340 339 356 353 365 343 384 331 363 379 341 329 346 378 356 329 316 342 354 371 357 320 345 331 346 347 350 337 359 343 334 324 338 319 319 327 344 336 322 376 339 332 340 346 360 333 317 308 337 365 355 351 328 330 338 344 313 345 331 333 339 340 345 338 293 333 326 357 319 325 331 374 335 339 378 333 344 351 340 354 307 343 330 340 341 365 329 368 366 339 318 326 359 342 364 320 338 346 351 366 356 326 357 361 375 351 343 336 328 336 317 361 340 350 375 356 357 354 377 367 348 319 317 363 343 342 333 321 317 302 367 340 315 368 378 326 346 321 348 336 348 344 379 342 319 372 324 353 362 358

Expected output: I don't know, I only know that the output I'm currently getting is wrong.

from sys import argv

# functions

def kmer():
    """Returns a dict of all possible 4-mers, and sets the occurrences to 0"""
    kmers = {}
    bases = ['A', 'C', 'G', 'T']
    for i in range(4**4):
        kmer = bases[i // 64] + bases[(i % 64) // 16] + bases[(i % 16) // 4] + bases[(i % 4)]
        kmers[kmer] = 0
    return kmers

def count_kmer(seq, kmer_dict):
    """Returns occurrences of each k-mer in the DNA sequence"""
    for nt in range(len(seq)):
        if seq[nt : nt + 4] in kmer_dict:
            kmer_dict[seq[nt : nt + 4]] += 1
    return kmer_dict

def input_parser(filename):
    """Returns contents of input filename as a string"""
    lines = open(filename, 'r').readlines()[1:]
    seq = ""
    for line in lines:
        line.replace('\n', '')
        seq += line
    return seq

def write_output(result):
    output = open('kmers.txt', 'w')
    for i in result:
        value = str(result[i]) + ' '
        output.write(value)

# main
if __name__ == '__main__':
    # Parse input
    seq = input_parser('rosalind_kmer.txt')
    # Create k-mer dict
    kmer_dict = kmer()
    # Count k-mer occurrences
    result = count_kmer(seq, kmer_dict)
    # Print results
    write_output(result)

See [How do I ask a good question?](https://stackoverflow.com/help/how-to-ask) particularly the **Help others reproduce the problem** section. In this case, that would mean including a small sample input file and the expected output, along with the wrong results you are currently getting. — martineau, May 13 '21 at 17:51
No, a *small* sample input file and include the data in your question, not an offsite link to it — and small enough that you can figure out what the output would be and post that in your question, too. — martineau, May 13 '21 at 17:59
Can the kmers overlap (as your code does) or must they not overlap? — Chris Charley, May 13 '21 at 18:02
They can overlap. I found the problem though, solution below — user2296226, May 13 '21 at 18:15

score 0 · Answer 1 · answered May 13 '21 at 18:15

After 3 hours I found the problem.. It was quite simple: Because I read the sequence from a file, newline characters were included. These newline characters broke the sequence up, while the sequence should have been a continuous string. After creating a small for loop (shown below), the script performed as expected.

def input_parser(filename):
    """Returns contents of input filename as a string"""
    text = open(filename, 'r').read()
    seq = ''
    for ch in text:
        if ch in ['A', 'C', 'G', 'T']:
            seq += ch
    return seq

Finding substrings in a DNA sequence; script returns higher values than expected

1 Answers1