I CANNOT USE ANY IMPORTED LIBRARY. I have this task where I have some directories containing some files; every file contains, besides some words, the name of the next file to be opened in its first line. Once every word of every file contained in a directory has been read, they have to be combined into a single string: its first character is the most frequent first letter of all the words seen, its second character is the most frequent second letter, and so on. I have managed to do this for a directory containing 3 files, but without any chain-like mechanism, just by passing local variables around. Some of my colleagues suggested using list slicing, but I can't figure out how. This is what I got:

'''
    The objective of the homework assignment is to design and implement a function
    that reads some strings contained in a series of files and generates a new
    string from all the strings read.
    The strings to be read are contained in several files, linked together to
    form a closed chain. The first string in each file is the name of another
    file that belongs to the chain: starting from any file and following the
    chain, you always return to the starting file.
    
    Example: the first line of file "A.txt" is "B.txt," the first line of file
    "B.txt" is "C.txt," and the first line of "C.txt" is "A.txt," forming the 
    chain "A.txt"-"B.txt"-"C.txt".
    
    In addition to the string with the name of the next file, each file also
    contains other strings separated by spaces, tabs, or carriage return 
    characters. The function must read all the strings in the files in the chain
    and construct the string obtained by concatenating the characters with the
    highest frequency in each position. That is, in the string to be constructed,
    at position p, there will be the character with the highest frequency at 
    position p of each string read from the files. In the case where there are
    multiple characters with the same frequency, consider the alphabetical order.
    The generated string has a length equal to the maximum length of the strings
    read from the files.
    
    Therefore, you must write a function that takes as input a string "filename"
    representing the name of a file and returns a string.
    The function must construct the string according to the directions outlined
    above and return the constructed string.
    
    Example: if the contents of the three files A.txt, B.txt, and C.txt in the
    directory test01 are as follows
    
    
    test01/A.txt          test01/B.txt         test01/C.txt
    -------------------------------------------------------
    test01/B.txt          test01/C.txt         test01/A.txt
    house                 home                 kite
    garden                park                 hello
    kitchen               affair               portrait
    balloon                                    angel
                                               surfing
    
    the function most_frequent_chars("test01/A.txt") will return "hareennt".
    '''

def file_names_list(filename):
    intermezzo = []
    lista_file = []

    a_file = open(filename)

    lines = a_file.readlines()
    for line in lines:
        intermezzo.extend(line.split())
    del intermezzo[1:]
    lista_file.append(intermezzo[0])
    intermezzo.pop(0)
    return lista_file


def words_list(filename):
    lista_file = []
    a_file = open(filename)

    lines = a_file.readlines()[1:]
    for line in lines:
        lista_file.extend(line.split())
    return lista_file


def stuff_list(filename):
    file_list = file_names_list(filename)
    the_rest = words_list(filename)
    second_file_name = file_names_list(file_list[0])

    the_lists = words_list(file_list[0]) and words_list(second_file_name[0])
    the_rest += the_lists[0:]
    return the_rest


def most_frequent_chars(filename):
    huge_words_list = stuff_list(filename)
    maxOccurs = ""
    list_of_chars = []
    for i in range(len(max(huge_words_list, key=len))):
        for item in huge_words_list:
            try:
                list_of_chars.append(item[i])
            except IndexError:
                pass

        maxOccurs += max(sorted(set(list_of_chars)), key = list_of_chars.count)
        list_of_chars.clear()
    return maxOccurs

print(most_frequent_chars("test01/A.txt"))
  • Please provide a complete example with the inputs, and the expected output. https://stackoverflow.com/help/minimal-reproducible-example – C-3PO Nov 19 '22 at 14:52
  • Done, now it should be more clear. – youngsoyuz Nov 19 '22 at 15:02
  • Too much arbitrary bunny-hopping around. If you stopped separating everything and instead made one clear function that does the whole job, your problem will likely disappear. Look at your filenames function, you do like 50 things just to get the very first thing in the file... – OneMadGypsy Nov 19 '22 at 15:03
  • Segmenting the code is often a good idea though; personally I'd first implement a most_frequent_letter(list_of_words, index) and test it, then work on the rest of the assignment. – Swifty Nov 19 '22 at 15:08
  • @Swifty ~ Not everybody can juggle 3 or more balls, but anyone with working hands and arms can certainly juggle one. Segmentation only makes sense if you have similar but not identical operations that share a generic part. If you are just trying to do one thing, there is no benefit to breaking it up into smaller chunks. – OneMadGypsy Nov 19 '22 at 15:12
  • I managed to do the most_frequent_letter function. The problem is I don't know how to open the files in the specific way my homework wants me to; that is the main problem I have. I do 50 things to get to the first line of the file because that is what is required, and I don't know how (the first line of the file contains the name of the next one to be opened). – youngsoyuz Nov 19 '22 at 15:13
  • I segmented things because it made sense to me at first, but this is not the main problem, again. – youngsoyuz Nov 19 '22 at 15:14
  • "I do 50 things to get to the first line of the file because this is what is required to be done" ~ this isn't true, at all. You could do the exact same thing in two lines of code, and then you realize there is no reason for it to be a function. If you keep following that line, you end up with everything in one function, and it probably works because you don't have code that is 80% noise. – OneMadGypsy Nov 19 '22 at 15:16
  • My guy you are still ignoring my problem, please help me to solve it, if you can, otherwise goodbye. – youngsoyuz Nov 19 '22 at 15:19
  • I just showed a very poor attempt to solve the problem. I asked a question because I am looking for a better solution. – youngsoyuz Nov 19 '22 at 15:20
  • OK ~ "goodbye" sounds good. You're still too new to realize that I gave you the best help of all. My help forces you to help yourself, make realizations, and become a more independent programmer. You just want answers handed to you, and due to how you just talked to me I don't intend to give you any, champ. – OneMadGypsy Nov 19 '22 at 15:22
  • From what I understand, the test batteries will provide a filename as input and let you work from that; so your code should start with "filename = input()" ; then write the code that open this file, get what you need from it (next filename and words list) and go on to the next file (keep a set of already parsed filenames so you'll know when you've cycled, and use a while loop); when all files are done, go on to the words parsing. And use the 3 files provided as an example to test your code, of course. – Swifty Nov 19 '22 at 16:38
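The chain-walk that Swifty's comment describes (a seen-set plus a while loop that stops once the chain cycles back) can be sketched with an in-memory stand-in for the files; the dict below simulates the three example files, it is not the assignment's actual I/O:

```python
# Simulated chain: filename -> (next filename, words in the file).
# These names mirror the test01 example but nothing is read from disk.
files = {"A.txt": ("B.txt", ["house"]),
         "B.txt": ("C.txt", ["home"]),
         "C.txt": ("A.txt", ["kite"])}

seen, words, name = set(), [], "A.txt"
while name not in seen:          # stop once the chain cycles back
    seen.add(name)
    name, content = files[name]  # first entry names the next file
    words.extend(content)
print(words)  # ['house', 'home', 'kite']
```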

2 Answers

This assignment is relatively easy if the code is well structured. Here is a full implementation:

def read_file(fname):
    with open(fname, 'r') as f:
        return list(filter(None, [y.rstrip(' \n').lstrip(' ') for x in f for y in x.split()]))

def read_chain(fname):
    seen   = set()
    new    =  fname
    result = []
    while new not in seen:
        A          = read_file(new)
        seen.add(new)
        new, words = A[0], A[1:]
        result.extend(words)
    return result

def most_frequent_chars (fname):
    all_words = read_chain(fname)
    result    = []
    for i in range(max(map(len,all_words))):
        chars = [word[i] for word in all_words if i<len(word)]
        result.append(max(sorted(set(chars)), key = chars.count))
    return ''.join(result)

print(most_frequent_chars("test01/A.txt"))
# output: "hareennt"

In the code above, we define 3 functions:

  1. read_file: simple function to read the contents of a file and return a list of strings. The command x.split() takes care of any spaces or tabs used to separate words. The final command list(filter(None, arr)) makes sure that empty strings are erased from the solution.

  2. read_chain: Simple routine to iterate through the chain of files, and return all the words contained in them.

  3. most_frequent_chars: for each character position, collects the characters of every word long enough to reach that position and appends the most frequent one; sorting the candidate set before taking max breaks ties in favor of the alphabetically smallest character.
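As a quick check of the parsing behaviour (plain Python, independent of the assignment files): split() with no argument already discards runs of spaces, tabs, and newlines, and filter(None, ...) drops any empty strings left over:

```python
line = "  home \t park  \n"
# split() with no argument splits on any whitespace run and
# never yields empty tokens
print(line.split())  # ['home', 'park']

# filter(None, ...) keeps only truthy items, i.e. non-empty strings
print(list(filter(None, ["kite", "", "hello"])))  # ['kite', 'hello']
```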


PS. This line of code you had is very interesting:

maxOccurs += max(sorted(set(list_of_chars)), key = list_of_chars.count)

I edited my code to include it.
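On the example words, that expression is exactly what resolves ties alphabetically: among the second letters, 'a' and 'o' both occur three times, and 'a' wins:

```python
# Second letters of the twelve words from the test01 example
words = ["house", "garden", "kitchen", "balloon", "home", "park",
         "affair", "kite", "hello", "portrait", "angel", "surfing"]
chars = [w[1] for w in words]  # every example word has at least 2 letters

# sorted(set(...)) orders the candidates alphabetically, so max()
# returns the first (smallest) of the characters sharing the top count
print(max(sorted(set(chars)), key=chars.count))  # -> a
```

This is why position 1 of the result string "hareennt" is 'a' rather than 'o'.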


Space complexity optimization

The space complexity of the previous code can be improved by orders of magnitude, if the files are scanned without storing all the words:

def scan_file(fname, database):
    with open(fname, 'r') as f:
        next_file = None
        for x in f:
            for y in x.split():
                if next_file is None:
                    next_file = y
                else:
                    for i,c in enumerate(y):
                        while len(database) <= i:
                            database.append({})
                        if c in database[i]:
                            database[i][c] += 1
                        else:
                            database[i][c]  = 1
        return next_file

def most_frequent_chars (fname):
    database  =  []
    seen      =  set()
    new       =  fname
    while new not in seen:
        seen.add(new)
        new  =  scan_file(new, database)
    return ''.join(max(sorted(d.keys()),key=d.get) for d in database)

print(most_frequent_chars("test01/A.txt"))
# output: "hareennt"

Now we scan the files tracking the frequency of the characters in database, without storing intermediate arrays.
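The database ends up as a list of dicts, one per character position, each mapping a character to its count. The same counting logic, run on two in-memory sample words instead of files:

```python
# Mirrors scan_file's counting: database[i] maps character -> count
# of that character at position i across all words seen so far
database = []
for word in ["kite", "kit"]:  # two sample words, not read from any file
    for i, c in enumerate(word):
        while len(database) <= i:
            database.append({})
        if c in database[i]:
            database[i][c] += 1
        else:
            database[i][c] = 1
print(database)  # [{'k': 2}, {'i': 2}, {'t': 2}, {'e': 1}]
```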

C-3PO
  • It can be slightly improved, I noticed, by passing encoding="utf-8" to the first open() call, so it can read every type of character. Thank you again – youngsoyuz Nov 19 '22 at 18:27
  • Nice, I am glad I was able to help. Sorry for the delay in answering, cheers, – C-3PO Nov 19 '22 at 18:28
  • I added a new version, optimizing the space complexity of the code. – C-3PO Nov 20 '22 at 10:46

Ok, here's my solution:

def parsi_file(filename):
    
    visited_files = set()
    words_list = []
    
    # Getting words from all files
    while filename not in visited_files:
        visited_files.add(filename)
        with open(filename) as f:
            filename = f.readline().strip()
        words_list += [w for line in f for w in line.split()]  # split() also handles words separated by spaces or tabs within a line
    
    # Creating dictionaries of letters:count for each index
    letters_dicts = []
    for word in words_list:
        for i in range(len(word)):    
            if i > len(letters_dicts)-1:
                letters_dicts.append({})
            letter = word[i]
            if letters_dicts[i].get(letter):
                letters_dicts[i][letter] += 1
            else:
                letters_dicts[i][letter] = 1
        
    # Sorting dicts and getting the "best" letter
    code = ""
    for dic in letters_dicts:
        sorted_letters = sorted(dic, key = lambda letter: (-dic[letter],letter))
        code += sorted_letters[0]
        
    return code
  • We first get the words_list from all files.
  • Then, for each index, we create a dictionary of the letters in all words at that index, with their count.
  • Now we sort the dictionary keys by descending count (-count) then by alphabetical order.
  • Finally we get the first letter (thus the one with the max count) and add it to the "code" word for this test battery.
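A tiny illustration of that sort key (the counts here are invented for the example): (-count, letter) sorts by descending count first, then alphabetically among ties:

```python
dic = {'o': 3, 'a': 3, 'i': 2}  # hypothetical counts for one position

# negate the count so the most frequent letters sort first;
# the letter itself breaks ties alphabetically
sorted_letters = sorted(dic, key=lambda letter: (-dic[letter], letter))
print(sorted_letters)     # ['a', 'o', 'i']
print(sorted_letters[0])  # 'a': wins its tie with 'o' alphabetically
```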

Edit: in terms of efficiency, parsing through all words for each index will get worse as the number of words grows, so it would be better to tweak the code to simultaneously create the dictionaries for each index and parse through the list of words only once. Done.

Swifty
  • I am very sorry but I didn't specify I couldn't use an input() function either, as you did at the start of your code. I still thank you very much for your help. – youngsoyuz Nov 19 '22 at 18:08
  • 1
    Indeed, I focused on "input" and forgot "function". Well, I'll just modify the 1st line :) Done; and since they didn't force the function name, I decided to go full Wagnerian ;) – Swifty Nov 19 '22 at 18:45