CS50 Problem Set 6 (DNA), Python: I can't count non-consecutive DNA sequences; my code succeeds with the small database but fails with the large one

Question

I am a beginner in programming, so I decided to take the CS50 course. For Problem Set 6 (Python) I wrote code that works with the small database but fails with the large one, so I am only asking for help with the idea. Here is the course page, and you can download the files here (from Google Drive).

My Code

import csv
from sys import argv


class DnaTest(object):

    """CLASS HELP: the DNA test, simply give DNA sequence to the program, and it searches in the database to
       determine the person who owns the sample.

    type the following in cmd to run the program:
    python dna.py databases/small.csv sequences/1.txt """

    def __init__(self):
        # get filename from the command line without directory names "database" and "sequence"
        self.sequence_argv = str(argv[2][10:])
        self.database_argv = str(argv[1][10:])

        # Automatically open and close the database file
        with open(f"databases/{self.database_argv}", 'r') as database_file:
            self.database_file = database_file.readlines()

        # Automatically open and close the sequence file
        with open(f"sequences/{self.sequence_argv}", 'r') as sequence_file:
            self.sequence_file = sequence_file.readline()

        # Read CSV file as a dictionary, function: compare_database_with_sequence()
        self.csv_database_dictionary = csv.DictReader(self.database_file)
        # Read CSV file to take the first row, function: get_str_list()
        self.reader = csv.reader(self.database_file)
        # computed dictionary from the sequence file
        self.dict_from_sequence = {}

    # returns the first row of the CSV file (database file)
    def get_str_list(self):
        # get first row from CSV file
        self.keys = next(self.reader)

        # remove 'name' from list, get STR only.
        self.keys.remove("name")
        return self.keys

    # returns dictionary of computed STRs from the sequence file (key(STR): value(count))
    def get_str_count_from_sequence(self):  # PROBLEM HERE AND RETURN DICTIONARY FROM IT !
        for dna_seq in self.get_str_list():
            self.dict_from_sequence.update({dna_seq: self.sequence_file.count(dna_seq)})

    # compare computed dictionary with the database dictionaries and get the person name
    def compare_database_with_sequence(self):
        for dictionary in self.csv_database_dictionary:
            dict_from_database = dict(dictionary)
            dict_from_database.pop('name')

            # compare the database dictionaries with sequence computed dictionary
            shared_items = {k: self.dict_from_sequence[k] for k in self.dict_from_sequence if
                            k in dict_from_database and self.dict_from_sequence[k] == int(dict_from_database[k])}

            if len(self.dict_from_sequence) == len(shared_items):
                dict_from_database = dict(dictionary)
                print(dict_from_database['name'])
                break


# run the class and its functions (Program control)
if __name__ == '__main__':
    RunTest = DnaTest()
    RunTest.get_str_count_from_sequence()
    RunTest.compare_database_with_sequence()

The problem

In the function get_str_count_from_sequence(self) I use count, and it works when the repeats of an STR all sit in one consecutive run. In some sequence files (for example 5.txt) the repeats are not all consecutive, and I cannot work out how to count only the consecutive repeats. I searched, but I did not find anything simple. Some solutions use the regex module and others use the re module, and I have not found a solution.
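
To make the difference concrete, here is a minimal sketch using a made-up sequence. The helper names total_occurrences and longest_run are hypothetical and are not part of the code above. It shows that str.count adds up every occurrence, while the problem asks for the longest run of back-to-back copies.

```python
def total_occurrences(sequence: str, str_unit: str) -> int:
    """Count every non-overlapping occurrence, wherever it appears."""
    return sequence.count(str_unit)


def longest_run(sequence: str, str_unit: str) -> int:
    """Return the largest number of back-to-back copies of str_unit."""
    best = 0
    for start in range(len(sequence)):
        repeats = 0
        # Step forward one STR length at a time for as long as copies keep matching.
        while sequence.startswith(str_unit, start + repeats * len(str_unit)):
            repeats += 1
        best = max(best, repeats)
    return best


# Made-up sequence: a run of two AGAT copies, then a separate single copy.
sample = "AGATAGATTTAGAT"
print(total_occurrences(sample, "AGAT"))  # 3
print(longest_run(sample, "AGAT"))        # 2
```

With a single uninterrupted run the two numbers agree, which would explain why a count-based approach can pass on some inputs and fail on others.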

TEST CODE:

From the CS50 site: Run your program as python dna.py databases/large.csv sequences/6.txt. Your program should output Luna.

specification

From CS50 site.

re is a Python library for regular expressions. Regex, or regular expression, is a very common technique in programming (not a library) for matching patterns in text. If you're a beginner in programming, it is worth taking the time to understand regex properly. There are plenty of resources online, such as https://regexr.com/, which lets you check your regex. — Piyush Singh, Mar 24 '20 at 17:41
When asking questions here it is best to reduce your example code to a working minimal example. This might take some effort on your part to find the portion of code that is at fault and then write a functional (stand-alone) toy example including the faulty code. If your code operates on data, you need to include a minimal example of the data and the expected results. Please read [mcve]. The code and data in your question should reproduce the problem. The key is a minimum amount of code and data - something *we* can copy and paste, then test. — wwii, Mar 24 '20 at 19:03
First of all, thank you for your comment, advice, and guidance. I divided my question into parts: first the code, together with a direct download link and the attachments; second the problem; and finally the expected solution. I cannot include part of the code or the attachments in the question itself. When you download the files you will see that I have shortened everything as far as possible: the sequence is too large to paste into the question, and so is the Excel file. — MrAhmedElsayed, Mar 25 '20 at 16:16

score 1 · Accepted Answer · answered Apr 27 '20 at 13:39

Thank you, Piyush Singh. I followed your advice and used re to solve the problem. For each STR, I use re to find every run of consecutive repeats and store each match with its length in a dictionary. I then take the largest value for that STR and clear the dictionary before moving on to the next STR. I also updated the function that compares the two dictionaries (the one read from the database and the one computed from the sequence file).

import csv
from sys import argv
import re


class DnaTest(object):
    """CLASS HELP: the DNA test, simply give DNA sequence to the program, and it searches in the database to
       determine the person who owns the sample.

    type the following in cmd to run the program:
    python dna.py databases/small.csv sequences/1.txt """

    def __init__(self):
        # get filename from the command line without directory names "database" and "sequence"
        self.sequence_argv = str(argv[2][10:])
        self.database_argv = str(argv[1][10:])

        # Automatically open and close the database file
        with open(f"databases/{self.database_argv}", 'r') as database_file:
            self.database_file = database_file.readlines()

        # Automatically open and close the sequence file
        with open(f"sequences/{self.sequence_argv}", 'r') as sequence_file:
            self.sequence_file = sequence_file.readline()

        # Read CSV file as a dictionary, function: compare_database_with_sequence()
        self.csv_database_dictionary = csv.DictReader(self.database_file)
        # Read CSV file to take the first row, function: get_str_list()
        self.reader = csv.reader(self.database_file)
        # computed dictionary from the sequence file
        self.dict_from_sequence = {}
        self.select_max = {}

    # returns the first row of the CSV file (database file)
    def get_str_list(self):
        # get first row from CSV file
        keys = next(self.reader)

        # remove 'name' from list, get STR only.
        keys.remove("name")
        return keys

    # fills self.dict_from_sequence with the longest run of each STR (key(STR): value(count))
    def get_str_count_from_sequence(self):
        for str_key in self.get_str_list():
            # each match is one run of back-to-back copies of the STR
            regex = rf"({str_key})+"
            matches = re.finditer(regex, self.sequence_file, re.MULTILINE)

            # record the length in characters of every run, keyed by its match object
            for match in matches:
                self.select_max[match] = len(match.group())

            # longest run length divided by STR length gives the repeat count
            # (0 if the STR never appears, so the key is always present for the comparison)
            longest = max(self.select_max.values(), default=0)
            self.dict_from_sequence[str_key] = longest // len(str_key)

            # clear compare dictionary to select new key
            self.select_max.clear()

    # compare computed dictionary with the database dictionaries and get the person name
    def compare_database_with_sequence(self):
        # comparison function between database dictionary and sequence computed dictionary
        def dicts_equal(from_sequence, from_database):
            """ return True if all keys and values are the same """
            return all(k in from_database and int(from_sequence[k]) == int(from_database[k]) for k in from_sequence) \
                and all(k in from_sequence and int(from_sequence[k]) == int(from_database[k]) for k in from_database)

        def check_result():
            for dictionary in self.csv_database_dictionary:
                dict_from_database = dict(dictionary)
                dict_from_database.pop('name')

                if dicts_equal(self.dict_from_sequence, dict_from_database):
                    dict_from_database = dict(dictionary)
                    print(dict_from_database['name'])
                    return True

        # check_result() prints the name and returns True on a match, otherwise returns None
        if not check_result():
            print("No match")


# run the class and its functions (Program control)
if __name__ == '__main__':
    RunTest = DnaTest()
    RunTest.get_str_count_from_sequence()
    RunTest.compare_database_with_sequence()
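
The run-counting step in the class above can also be isolated into a small function, which makes it easier to test on its own. This is only a minimal sketch and not the code from the answer: the name longest_run_regex and the sample strings are made up for illustration, and re.escape and the non-capturing group are additions.

```python
import re


def longest_run_regex(sequence: str, str_unit: str) -> int:
    """Return the longest number of consecutive copies of str_unit in sequence."""
    # (?:...)+ matches one or more back-to-back copies without capturing them.
    pattern = re.compile(rf"(?:{re.escape(str_unit)})+")
    # Each match is one run; its length in characters is a multiple of len(str_unit).
    longest = max((len(m.group()) for m in pattern.finditer(sequence)), default=0)
    return longest // len(str_unit)


# Made-up sequence: a run of three AGAT copies, then a separate single copy.
print(longest_run_regex("TTAGATAGATAGATCCAGAT", "AGAT"))  # 3
print(longest_run_regex("TTTT", "AGAT"))                  # 0
```

One caveat: finditer scans left to right and returns non-overlapping matches, so for an STR that can overlap with itself (a hypothetical ATA in ATATAATA, for example) a longer run starting inside an earlier match could in principle be missed.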

Check Solution

Run your program as python dna.py databases/small.csv sequences/1.txt. Your program should output Bob.
Run your program as python dna.py databases/small.csv sequences/2.txt. Your program should output No match.

For more checks, visit the CS50 DNA problem set.

jliu1999 · Answer 2 · 2020-08-29T13:57:34.690

To get the maximum number of consecutive repeats of each STR, I needed only a few lines of code. The idea is: search for the STR; if you find it, search for the STR repeated twice; if you find that, search for the STR repeated three times, and so on. When the STR repeated n times can no longer be found, the maximum is n-1. Because the STR repeated n times is always a consecutive run, non-consecutive occurrences cannot be counted by mistake. You don't need any Python library other than sys and csv. My whole piece of code is less than 30 lines.


import csv
import sys

# check command-line arguments, expect 3 including dna.py
n = len(sys.argv)
if n != 3:
    print("Usage: python dna.py data.csv sequence.txt")
    sys.exit(1)  # non-zero exit status signals incorrect usage

with open(sys.argv[1], 'r') as database:  # read database
    data_lines = csv.reader(database)  # read line-by-line, store in data_lines
    data = [row for row in data_lines]  # convert to list of lists, store in data

with open(sys.argv[2], 'r') as sequences:
    dna = sequences.read()  # read sequence data, store in string dna

counts = []  # list to store counts of the longest run of consecutive repeats of each STR

for i in range(1, len(data[0])):  # loop through all STR
    count = 1
    string = data[0][i]  # assign each STR to a string
    while string * count in dna:  # if find 1 string, then try to find string*2, and so on
        count += 1
    counts.append(str(count - 1))  # should be decreased by 1 as initialized to 1. int to str

for j in range(1, len(data)):  # loop through all rows in database
    if data[j][1:len(data[0])] == counts:  # compare only the numbers in each row to counts
        print(data[j][0])  # print corresponding name
        sys.exit(0)
print('No match')  # the check quoted from CS50 expects exactly "No match"
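
The same idea can be packaged as functions so that it can be tried without files or command-line arguments. This is a rough sketch under stated assumptions: the function names, the two-STR database text, and the DNA string are all made up for illustration, and csv.DictReader over an in-memory string stands in for the real CSV file.

```python
import csv
import io


def longest_run(dna: str, str_unit: str) -> int:
    """Grow the repeated STR until it is no longer found in dna."""
    count = 0
    while str_unit * (count + 1) in dna:
        count += 1
    return count


def find_match(database_text: str, dna: str) -> str:
    """Return the name whose STR counts all match dna, or "No match"."""
    reader = csv.DictReader(io.StringIO(database_text))
    # Every column except "name" is an STR.
    str_units = [field for field in reader.fieldnames if field != "name"]
    profile = {unit: longest_run(dna, unit) for unit in str_units}
    for row in reader:
        if all(int(row[unit]) == profile[unit] for unit in str_units):
            return row["name"]
    return "No match"


# Hypothetical database and sequence, not taken from the CS50 files.
database_text = "name,AGAT,AATG\nAlice,3,1\nBob,2,2\n"
dna = "CCAGATAGATGGAATGAATGTT"
print(find_match(database_text, dna))  # Bob
```

Comparing integers per column, rather than a list of strings, avoids the int-to-str conversion and keeps the comparison independent of column order.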

*My whole piece of code is less than 30 lines.* – If you've got a better solution / answer than OP's own (and accepted) answer then it'd be best to include the mentioned code in your answer. – A good (code) answer should have both the explanation behind the code and the code itself. Please consider revising your answer. — Ivo Mori, Aug 29 '20 at 13:50
Sorry, I thought people might be interested in the idea, not the code itself. Anyway, I just edited my answer and included my code as you suggested. — jliu1999, Aug 29 '20 at 14:00
