0

Hi there. I'm a beginner in Python and I'm struggling to do the following:

I have a file like this (10k+ lines):

EgrG_000095700 /product="ubiquitin carboxyl terminal hydrolase 5"
EgrG_000095800 /product="DNA polymerase epsilon subunit 3"
EgrG_000095850 /product="crossover junction endonuclease EME1"
EgrG_000095900 /product="lysine specific histone demethylase 1A"
EgrG_000096000 /product="charged multivesicular body protein 6"
EgrG_000096100 /product="NADH ubiquinone oxidoreductase subunit 10"

and this one (600+ lines):

EgrG_000076200.1
EgrG_000131300.1
EgrG_000524000.1
EgrG_000733100.1
EgrG_000781600.1
EgrG_000094950.1

All the IDs in the second file are also in the first one, so I want the lines of the first file that correspond to the IDs in the second one.

I wrote the following script:

f1 = open('egranulosus_v3_2014_05_27.tsv').readlines()
f2 = open('eg_es_final_ids').readlines()
fr = open('res.tsv','w')

for line in f1:
     if line[0:14] == f2[0:14]:
        fr.write('%s'%(line))

fr.close()
print "Done!"

My idea was to match the IDs by slicing a fixed number of characters from each line, so that the EgrG_XXXX of one file is compared with the other, and then write the matching lines to a new file. I tried some modifications; this is just the "core" of my idea. I got nothing. In one of the modifications, I got just one line.

3 Answers

4

I'd store the ids from f2 in a set and then check f1 against that.

id_set = set()
with open('eg_es_final_ids') as f2:
    for line in f2:
        # "EgrG_000076200.1\n" -> "EgrG_000076200": drop the newline and the ".1" suffix
        id_set.add(line.strip().split('.')[0])

with open('egranulosus_v3_2014_05_27.tsv') as f1:
    with open('res.tsv', 'w') as fr:
        for line in f1:
            if line[:14] in id_set:
                fr.write(line)
Patrick Haugh
  • This only checks for existence in the other file; it doesn't write the lines in the order of `eg_es_final_ids` – roganjosh Sep 26 '16 at 18:47
  • That did not work either. The problem, as I see it, is that Python is comparing the whole line `EgrG_000095700 /product="ubiquitin carboxyl terminal hydrolase 5"` with `EgrG_000095700`, and because the second lacks the `/product...` part, Python writes nothing. That's why I want to match on a delimited prefix and then copy the line. – Tiago Minuzzi Sep 26 '16 at 18:47
  • Try replacing `line[:14]` in the above with `line.split()[0].strip()` – Patrick Haugh Sep 26 '16 at 18:52
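A minimal sketch of the set-based approach with the suggestion from the last comment applied. It works on in-memory lists of lines instead of files so it can be run as-is (Python 3). The function name `filter_by_ids` and the sample data are illustrative assumptions, and it assumes the only dot in an ID line is the one before the version suffix.

```python
def filter_by_ids(annotation_lines, id_lines):
    """Return the annotation lines whose first field appears in id_lines."""
    # "EgrG_000076200.1\n" -> "EgrG_000076200"; blank lines are skipped
    wanted = {line.strip().split('.')[0] for line in id_lines if line.strip()}
    kept = []
    for line in annotation_lines:
        fields = line.split()
        # Compare only the first whitespace-separated field, not the whole line
        if fields and fields[0] in wanted:
            kept.append(line)
    return kept


# Sample data copied from the question's first file; the IDs are chosen to match
annotations = [
    'EgrG_000095700 /product="ubiquitin carboxyl terminal hydrolase 5"\n',
    'EgrG_000095800 /product="DNA polymerase epsilon subunit 3"\n',
    'EgrG_000095850 /product="crossover junction endonuclease EME1"\n',
]
ids = ['EgrG_000095850.1\n', 'EgrG_000095700.1\n']

result = filter_by_ids(annotations, ids)
```

Here `result` holds the lines for `EgrG_000095700` and `EgrG_000095850`, in the order they appear in `annotations` rather than in `ids`.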
0

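`f2` is a list of the lines in the second file, but you never iterate over it the way you do over the lines of the first file (`f1`). `f2[0:14]` is therefore a slice of that list (its first 14 lines), and comparing it with the string `line[0:14]` is always false. That seems to be the problem.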
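The difference can be seen in a few lines. This is only an illustrative sketch: the two-element `f2` list and the `/product="example"` text are made-up stand-ins for the real files.

```python
# Stand-ins for what readlines() would return (hypothetical sample data)
f2 = ["EgrG_000076200.1\n", "EgrG_000131300.1\n"]
line = 'EgrG_000076200 /product="example"\n'

prefix = line[0:14]     # slicing a string gives a string: the 14-character ID
first_lines = f2[0:14]  # slicing a list gives a list: up to 14 whole lines

# A string never equals a list, so the original test can never succeed
same = prefix == first_lines

# Comparing against each line of f2 in turn is what the loop was missing
found = any(prefix == other[0:14] for other in f2)
```

Checking every line of `f2` for every line of `f1` works but is slow for large files, which is why the other answers put the IDs in a set or dict first.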

prabodhprakash
0
with open('egranulosus_v3_2014_05_27.txt', 'r') as infile:
    # Map each ID (the first whitespace-separated field) to its full line
    line_storage = {}
    for line in infile:
        data = line.split()
        key = data[0]
        value = line.replace('\n', '')
        line_storage[key] = value

with open('eg_es_final_ids.txt', 'r') as infile, open('my_output.txt', 'w') as outfile:
    # Walk the ID file so the output follows its order
    for line in infile:
        lookup_key = line.split('.')[0]  # strip the ".1" suffix
        match = line_storage.get(lookup_key)  # None if the ID is not in the first file
        outfile.write(''.join([str(match), '\n']))
roganjosh
  • Made an edit so you don't get loads of `None` written to the file by using `if match` – roganjosh Sep 26 '16 at 18:55
  • That worked, man! Many thanks! The only "problem" is that the output lines have square brackets at the beginning and the end of each line, but that's easy to get rid of. :) – Tiago Minuzzi Sep 26 '16 at 18:56
  • @TiagoMinuzzi Most welcome :) Change `str(match)` to just `match` and let me know if it persists. – roganjosh Sep 26 '16 at 18:58
  • If I change it, this is what happens: `Traceback (most recent call last): File "line_copying.py", line 23, in <module> outfile.write((match) + '\n') TypeError: can only concatenate list (not "str") to list` – Tiago Minuzzi Sep 26 '16 at 19:02
  • @TiagoMinuzzi working on it, I was sloppy trying to get an answer out that kept the order for you, sorry. Do you understand why my code works otherwise or would you like an elaboration on that too? – roganjosh Sep 26 '16 at 19:04
  • Yeah, I think I understood it all. Don't worry, the way it is now is good for me. – Tiago Minuzzi Sep 26 '16 at 19:07
  • @TiagoMinuzzi _think_ I've fixed it. If you don't want `None` added to the file then you can still put `if match: outfile.write(''.join([str(match), '\n']))` and it's still fixed either way. – roganjosh Sep 26 '16 at 19:10
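The order-preserving lookup with the `None` guard from the comments can be sketched as follows. It runs on in-memory lists rather than files (Python 3). The name `lookup_in_order` and the ID `EgrG_000999999` are hypothetical, introduced only to show a miss.

```python
def lookup_in_order(annotation_lines, id_lines):
    """Return annotation lines in the order their IDs appear in id_lines."""
    # Index the annotation lines by their first field (the ID)
    by_id = {}
    for line in annotation_lines:
        fields = line.split()
        if fields:  # ignore blank lines
            by_id[fields[0]] = line.rstrip('\n')

    ordered = []
    for line in id_lines:
        key = line.strip().split('.')[0]  # drop the ".1" suffix
        match = by_id.get(key)
        if match is not None:  # skip IDs with no annotation instead of writing "None"
            ordered.append(match)
    return ordered


annotations = [
    'EgrG_000095700 /product="ubiquitin carboxyl terminal hydrolase 5"\n',
    'EgrG_000095800 /product="DNA polymerase epsilon subunit 3"\n',
    'EgrG_000095850 /product="crossover junction endonuclease EME1"\n',
]
# The middle ID is made up and has no matching annotation line
ids = ['EgrG_000095850.1\n', 'EgrG_000999999.1\n', 'EgrG_000095700.1\n']

ordered = lookup_in_order(annotations, ids)
```

Because the dict stores plain strings, each match can be written with `outfile.write(match + '\n')` without the `str()` wrapper.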