I have multiple text files in a folder that I'm trying to read and write into a dictionary. The files look like this:

file1.txt:

chr17   1   1   T   C   C   5
chr13   2   2   A   A   G   4

file2.txt:

chr17   1   1   T   C   C   5
chr17   2   2   A   A   G   4

Code:

import os,csv, glob


mydict = {}
for file in glob.glob(os.path.join(os.getcwd(), '*.txt')):
    with open(file) as f:
        for line in f:
            mydict[",".join(line.split()[0:4])] = ",".join(line.split()[4:6])
    for (key,val) in mydict.items():
        print file, key, val

I expect it to print all four rows from the two files, with the first four columns as the key and columns 5 and 6 as the value:

file1.txt chr17,1,1,T C,C
file1.txt chr13,2,2,A A,G
file2.txt chr17,1,1,T C,C
file2.txt chr17,2,2,A A,G

But I get this instead:

file1.txt chr17,1,1,T C,C
file1.txt chr13,2,2,A A,G
file2.txt chr17,1,1,T C,C
file2.txt chr13,2,2,A A,G (extra row!!! This row's in file1, but not file2)
file2.txt chr17,2,2,A A,G
pam
  • Where in your code do you limit the lines to those that are common to file1 and file2? – zondo Feb 23 '16 at 17:49
  • @zondo I don't want to limit the duplicate lines. I want it to print all the lines that are actually there. This extra row is not in file2. It's getting the line from file1. – pam Feb 23 '16 at 17:51
  • @pam: Right, but why do you think your code shouldn't do that? Your code just loops through one file and gets all the rows, then loops through the other file and gets all the rows. There's nothing in your code that says a row shouldn't be printed if it's only in file1. – BrenBarn Feb 23 '16 at 17:52
  • You are reading file1, however, so *I* would expect your program to print the line. You say you want to print all the lines that are actually there. Well, it *is* there...in file1. Since you are printing all the lines in all files in that folder, you will print all lines in file1. – zondo Feb 23 '16 at 17:53
  • @BrenBarn It should print even if the row is only in one of the files, but the association is wrong. The fourth row in the output says it's from file2, when it's not actually in file2. Ideally, the output should only have the four rows that are in both the files (unique or duplicate) – pam Feb 23 '16 at 17:54
  • @zondo The line with chr13,2,2,A A,G is already printed for file1 (and it's totally fine). But it is printing again for file2. – pam Feb 23 '16 at 17:55
  • You need to create a fresh `mydict` for each file. Put `mydict = {}` under the `with` line. – PM 2Ring Feb 23 '16 at 17:55
  • @pam: Please clarify your intent. In your comment you just said "It should print even if the row is only in one of the files", and then in the same comment you said "the output should only have the four rows that are in both the files". Those aren't the same thing. Which of those do you mean? – BrenBarn Feb 23 '16 at 17:55
  • @pam It's because during the second loop value of file is `file2.txt`, it doesn't mean the line came from that file. Your dict has no knowledge of file name. – Ashwini Chaudhary Feb 23 '16 at 17:55
  • You define `mydict` as `{}` only *before* the loop. You have to redefine it each time, or it will always be printing everything that was already there. – zondo Feb 23 '16 at 17:57
  • @BrenBarn My bad! I'm not a native English speaker. PM 2Ring's comment worked. Thanks – pam Feb 23 '16 at 17:58
  • You could probably do with a better data structure. Consider building a `collections.namedtuple` for each line. This won't solve the problem in the question that PM 2Ring fixed already, but..... – Adam Smith Feb 23 '16 at 17:58
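
Adam Smith's `collections.namedtuple` suggestion, combined with Ashwini Chaudhary's point that the dict has no knowledge of the file name, could look roughly like the sketch below. It is written for Python 3 and is only an illustration: the field names (`source`, `chrom`, `start`, `end`, `ref`, `alt1`, `alt2`) are made up here, since the question never says what the columns mean, and the sample lines are pasted in directly instead of being read from disk.

```python
import collections

# Hypothetical field names: the question does not name its columns.
# "source" records which file a row came from, so that information
# travels with the row instead of being inferred from the loop variable.
Row = collections.namedtuple(
    "Row", ["source", "chrom", "start", "end", "ref", "alt1", "alt2"]
)


def parse_line(source, line):
    """Build a Row from one whitespace-separated line (7th column is ignored)."""
    fields = line.split()
    return Row(source, *fields[:6])


# Sample lines copied from the question.
rows = [
    parse_line("file1.txt", "chr17   1   1   T   C   C   5"),
    parse_line("file1.txt", "chr13   2   2   A   A   G   4"),
    parse_line("file2.txt", "chr17   1   1   T   C   C   5"),
    parse_line("file2.txt", "chr17   2   2   A   A   G   4"),
]

for row in rows:
    # Same layout as the output the question asks for.
    print(row.source, ",".join(row[1:5]), ",".join(row[5:7]))
```

Because each row carries its own `source`, a row can never be reported under the wrong file, and duplicate keys across files do not overwrite each other.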

1 Answer

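You need to create a fresh `mydict` for each file. Otherwise the entries read from earlier files stay in the dictionary and are printed again under the name of whichever file is currently being processed.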

import os, glob

for file in glob.glob(os.path.join(os.getcwd(), '*.txt')):
    mydict = {}  # fresh dictionary for each file
    with open(file) as f:
        for line in f:
            fields = line.split()  # split once and reuse the result
            mydict[",".join(fields[0:4])] = ",".join(fields[4:6])
    for key, val in mydict.iteritems():
        print file, key, val
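
The code above uses Python 2 syntax (the `print` statement and `iteritems`). For anyone on Python 3, the same idea might be sketched as follows. The helper names `read_pairs` and `report` are invented for this example, the sample files are written to a temporary directory so the snippet runs on its own, and `os.path.basename` is used so the output shows bare file names as in the question.

```python
import glob
import os
import tempfile


def read_pairs(path):
    """Return a fresh dict for one file: columns 1-4 -> columns 5-6."""
    pairs = {}  # created per call, so nothing leaks between files
    with open(path) as f:
        for line in f:
            fields = line.split()
            if len(fields) < 6:
                continue  # skip blank or short lines
            pairs[",".join(fields[0:4])] = ",".join(fields[4:6])
    return pairs


def report(folder):
    """Collect (file name, key, value) for every .txt file in folder."""
    out = []
    for path in sorted(glob.glob(os.path.join(folder, "*.txt"))):
        for key, val in read_pairs(path).items():
            out.append((os.path.basename(path), key, val))
    return out


# Demo: recreate the two sample files from the question in a temp folder.
samples = {
    "file1.txt": "chr17   1   1   T   C   C   5\nchr13   2   2   A   A   G   4\n",
    "file2.txt": "chr17   1   1   T   C   C   5\nchr17   2   2   A   A   G   4\n",
}
with tempfile.TemporaryDirectory() as folder:
    for name, text in samples.items():
        with open(os.path.join(folder, name), "w") as f:
            f.write(text)
    result = report(folder)

for name, key, val in result:
    print(name, key, val)
```

Moving the dictionary into a function makes the "one dict per file" rule hard to break by accident, since each call starts from an empty `pairs`.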
PM 2Ring