Code to compare two files is too slow - looking to improve my code

Question

I am not a Python specialist, and I wrote a script to compare two files (earthquake location data). What I wrote is quite ugly and very slow. Does anyone have an idea how to improve my code? Thanks!

#!/usr/bin/env python
# -*- coding: utf-8 -*-

file_1 = 'Loc_1D.txt'
file_2 = 'Loc_3D.txt'
output_file = 'result_file.txt'

with open(file_1, "r") as f1:
    for line1 in f1:
        yr1, mth1, d1, hr1, m1, s1, lat1, lon1, z1, mag  = line1.split()
        time1 = [yr1, mth1, d1, hr1, m1]
        with open(file_2, "r") as f2:
            for line2 in f2:
                yr2, mth2, d2, hr2, m2, s2, lat2, lon2, z2, *_ = line2.split()
                time2 = [yr2, mth2, d2, hr2, m2]
                with open(output_file, "w") as oup:
                    if time1 == time2 and abs(float(s1)-float(s2)) <= 2:
                        Event = [yr2, mth2, d2, hr2, m2, s2, lat2, lon2, z2, mag]
                        print (Event)
                        oup.write(str(Event))

You're opening file_2 for each line in file_1. Open each file once and parse the lines into 2 dictionaries with time as their key. — Matt, Jan 08 '20 at 09:13
Thanks, I tried something, but it doesn't work, it's writing just one line (I put it below) — OcéF, Jan 08 '20 at 09:40

Samay Gupta · Accepted Answer · 2020-01-08T10:33:44.823

This is similar to Matt's comment. Assuming that each timestamp (year, month, day, hour, minute) occurs at most once per file, this should be much faster:

file_1 = 'Loc_1D.txt'
file_2 = 'Loc_3D.txt'
output_file = 'result_file.txt'
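with open(file_1) as f1:
    f1_data = {}
    for line in f1:
        line_data = line.split()
        if not line_data:  # skip blank lines (e.g. a trailing newline)
            continue
        # Key: year-month-day-hour-minute
        f1_data["-".join(line_data[:5])] = line_data

with open(file_2) as f2:
    f2_data = {}
    for line in f2:
        line_data = line.split()
        if not line_data:  # skip blank lines (e.g. a trailing newline)
            continue
        # Key: year-month-day-hour-minute
        f2_data["-".join(line_data[:5])] = line_data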

output_data = []
for data_key in [key for key in f1_data.keys() if key in f2_data.keys()]:
    if abs(float(f1_data[data_key][5])-float(f2_data[data_key][5])) <= 2.0:
        Event = str(f2_data[data_key][5])
        print(Event)
        output_data.append(Event)

with open(output_file, 'w') as f:
    f.write("\n".join(output_data))

The reason is efficiency. In Big O terms, you move from O(n^2) to O(n): the nested loops compare every line of one file with every line of the other, whereas this version makes three linear passes (about 3n steps). In simple terms, the number of iterations drops. For instance, if each file has 100 lines of data, the computer has to process around 10,000 line pairs in the original code and about 300 steps with this approach.
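As a rough illustration of that count, here is a small sketch with made-up keys. It is not part of the script above, and the function names `nested_comparisons` and `dict_steps` are hypothetical; it only counts the iterations each approach performs.

```python
def nested_comparisons(keys_1, keys_2):
    """Count the pairs visited by two nested loops (the original approach)."""
    count = 0
    for _ in keys_1:
        for _ in keys_2:
            count += 1  # one comparison per pair of lines
    return count


def dict_steps(keys_1, keys_2):
    """Count the steps of the dictionary approach: two indexing passes, one lookup pass."""
    steps = 0
    index_1 = {}
    for key in keys_1:
        index_1[key] = True
        steps += 1
    index_2 = {}
    for key in keys_2:
        index_2[key] = True
        steps += 1
    for key in index_1:
        found = key in index_2  # hash lookup, constant time on average
        steps += 1
    return steps


# Hypothetical data: 100 unique minute-level timestamps per file
keys = [f"2020-01-08-{h:02d}-{m:02d}" for h in range(10) for m in range(10)]

print(nested_comparisons(keys, keys))  # 10000
print(dict_steps(keys, keys))          # 300
```

The gap widens quickly: with 10,000 lines per file the nested loops visit 100,000,000 pairs, while the dictionary approach takes about 30,000 steps.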

edited Jan 08 '20 at 10:33

answered Jan 08 '20 at 09:29

Samay Gupta


thank you so much. It seems there is a syntax error on it : `File "./comp_time_test2.py", line 22 for data_key in [key for key f1_data.keys() if key in f2_data.keys()]: ^ SyntaxError: invalid syntax` – OcéF Jan 08 '20 at 09:47
What is the Big O notations ? – OcéF Jan 08 '20 at 09:49
@OcéF Yeah, I missed an `in`. I corrected the code, so it should work now. Big O notation is a way to describe the efficiency of a program: how its running time (or memory use) grows with the input size n. It's an estimate, but it gives a good idea of how the code will scale; a function that grows more slowly with n means a faster program on large inputs. – Samay Gupta Jan 08 '20 at 09:54
Thank you for that information. It seems that there is still an issue. Apparently f1.read.split(\n) cannot have a split attribute. ` File "./comp_time_test2.py", line 11, in for line in f1.read.split("\n"): AttributeError: 'builtin_function_or_method' object has no attribute 'split' ` – OcéF Jan 08 '20 at 09:58
Fixed that too, I missed a couple of brackets. Should hopefully work now – Samay Gupta Jan 08 '20 at 10:01
Don't be sorry, you are helping me a lot. I should have seen those missing brackets. I still do not understand why Matt's solution does not work but ... Anyway, there is still an issue ^^ `if abs(str(f1_data[data_key][5])-str(f2_data[data_key][5])) <= 2.0: TypeError: unsupported operand type(s) for -: 'str' and 'str'` – OcéF Jan 08 '20 at 10:07
Yeah, I forgot to convert it back to float. Should work now (hopefully) – Samay Gupta Jan 08 '20 at 10:16
Cool thanks really fast. I changed two little things to obtain really what I wanted. – OcéF Jan 08 '20 at 10:29
Update: I changed it so that it prints each result on a separate line. Converting the final list to a string would make a less readable output file. – Samay Gupta Jan 08 '20 at 10:34

score 0 · Answer 2 · answered Jan 08 '20 at 10:30

This is the correction, thanks to @Samay Gupta

file_1 = 'Loc_1D.txt'
file_2 = 'Loc_3D.txt'
output_file = 'result_file.txt'

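with open(file_1) as f1:
    f1_data = {}
    for line in f1:
        line_data = line.split()
        if not line_data:  # skip blank lines (e.g. a trailing newline)
            continue
        # Key: year-month-day-hour-minute
        f1_data["-".join(line_data[:5])] = line_data

with open(file_2) as f2:
    f2_data = {}
    for line in f2:
        line_data = line.split()
        if not line_data:  # skip blank lines (e.g. a trailing newline)
            continue
        # Key: year-month-day-hour-minute
        f2_data["-".join(line_data[:5])] = line_data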

output_data = []
for data_key in [key for key in f1_data.keys() if key in f2_data.keys()]:
    if abs(float(f1_data[data_key][5])-float(f2_data[data_key][5])) <= 2.0:
        Event = f2_data[data_key][:9], f1_data[data_key][9]
        print(Event)
        output_data.append(Event)

with open(output_file, 'w') as f:
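    # Write one event per line; joining str(output_data) would put each character on its own line
    f.write("\n".join(str(event) for event in output_data))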

answered Jan 08 '20 at 10:30

OcéF


So there are no two events in one file happening in the same minute? Why didn't you say that in the question? – Stefan Pochmann Jan 08 '20 at 14:23
Because that wasn't my question. My question was how to make my script faster. Some events do happen in the same minute, which is why I compare the seconds. – OcéF Jan 09 '20 at 07:17
Uh, no, with this code you might in many cases *not* compare the seconds anymore, because before you even get to the seconds comparison, you already threw entries away because of minute collisions. – Stefan Pochmann Jan 09 '20 at 07:21
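One way to avoid the minute collisions described in this comment is to keep a list of events per minute instead of a single entry. The following is only a sketch under assumptions: the helper names `index_by_minute` and `match_events` and the sample lines are hypothetical, and the column layout (10 whitespace-separated fields in the first file with magnitude last, at least 9 in the second) is taken from the question's code. Like the original script, it only pairs events that fall within the same minute.

```python
from collections import defaultdict


def index_by_minute(lines):
    """Group parsed lines by (year, month, day, hour, minute); keeps every event."""
    index = defaultdict(list)
    for line in lines:
        fields = line.split()
        if not fields:  # skip blank lines
            continue
        index[tuple(fields[:5])].append(fields)
    return index


def match_events(lines_1, lines_2, max_dt=2.0):
    """Pair events from the same minute whose seconds differ by at most max_dt."""
    index_2 = index_by_minute(lines_2)
    matches = []
    for line in lines_1:
        f1 = line.split()
        if not f1:
            continue
        # Compare only against events from the same minute, not the whole file
        for f2 in index_2.get(tuple(f1[:5]), []):
            if abs(float(f1[5]) - float(f2[5])) <= max_dt:
                # Location from file 2, magnitude from file 1 (as in the question)
                matches.append(f2[:9] + [f1[9]])
    return matches


# Made-up sample: two events in the same minute in each file
lines_1 = ["2020 01 08 09 13 10.0 45.0 6.0 5.0 2.1",
           "2020 01 08 09 13 40.0 45.1 6.1 7.0 3.0"]
lines_2 = ["2020 01 08 09 13 11.5 45.2 6.2 5.5 0.0",
           "2020 01 08 09 13 41.0 45.3 6.3 7.5 0.0"]

matches = match_events(lines_1, lines_2)
for event in matches:
    print(" ".join(event))
```

With this sample both events are matched, whereas a dictionary holding one entry per minute would keep only the last event of each file. For real files, the open file objects can be passed in directly, e.g. `match_events(f1, f2)` inside the `with` blocks.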
