Removing identical coordinates on a specific place within a file

Question

I have this file with x/y-coordinates that I am trying to sort out. The file consists of various information, but with the coordinates at the same place within a line, like this:

IMPORTANT information 12213   1541515      COORDINATEX.COORDINATEY
IMPORTANT assadad213114141 asdadad         COORDINATEX.COORDINATEY
IMPORTANT assadad2ssss4141 asdadad         COORDINATEX.COORDINATEY
IMPORTANT ass 141 asd135566666666d         COORDINATEX.COORDINATEY

What I want is to remove all lines where the coordinates (COORDINATEX.COORDINATEY) are identical AND the first 10 characters (marked IMPORTANT) are identical, keeping only the first such line. I have tried using sort -u in Unix, but that won't work, because it requires the whole line to be identical, which is not the case here.

Example:

IMPORTANTLINE1 713)#!=%!3839413!"¤#(!¤! COORDINATEX.COORDINATEY1
IMPORTANTLINE1 1339220"##"#"#"""""""""" COORDINATEX.COORDINATEY144
IMPORTANTLINE1 fsafasdasd!38aaa!"¤#(!¤! COORDINATEX.COORDINATEY1
IMPORTANTLINE1 713)#!=%!3839413!"¤#(!¤! COORDINATEX.COORDINATEY1
IMPORTANTLINE2 sadasdasdadadadadadadada COORDINATEX.COORDINATE2
IMPORTANTLINE2 sadasdasdadadadadadadada COORDINATEX.COORDINATE1
IMPORTANTLINE2 sadasda333333333dadadada COORDINATEX.COORDINATE1

should look like this:

IMPORTANTLINE1 713)#!=%!3839413!"¤#(!¤! COORDINATEX.COORDINATEY1
IMPORTANTLINE1 1339220"##"#"#"""""""""" COORDINATEX.COORDINATEY144
IMPORTANTLINE2 sadasdasdadadadadadadada COORDINATEX.COORDINATE2
IMPORTANTLINE2 sadasdasdadadadadadadada COORDINATEX.COORDINATE1

Thanks in advance!

So do you have some kind of structure in this file? For example, are the coordinates always last on the line, or is the number of columns separated by \t always the same? From your examples I can't really tell. – Bogdan, Mar 30 '12 at 14:49
Yes, the coordinates are always last and always the same length. The 1, 2, and 144 were just there to make them different, but I can see how that muddled my question. Sorry about that. – niicepants, Mar 30 '12 at 15:16

score 1 · Answer 1 · answered Mar 30 '12 at 14:49

For each line you read from the file, take the parts that define a duplicate and combine them into a single string. Check whether a set already contains that string; if it doesn't, write the line to the output and add the string to the set.
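The approach described above could be sketched roughly like this. It assumes the key is the first 10 characters of the line plus the last whitespace-separated field (the coordinates, which the asker says are always last). The names `dedupe_lines`, `prefix_len`, and `sample` are made up for illustration, and the sample data is a shortened version of the question's example. A tuple is used as the key here; a joined string would work the same way for fixed-width fields.

```python
def dedupe_lines(lines, prefix_len=10):
    """Yield each line whose (prefix, last field) key has not been seen yet."""
    seen = set()
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        # Assumed key: first `prefix_len` characters plus the last field.
        key = (line[:prefix_len], fields[-1])
        if key not in seen:
            seen.add(key)
            yield line


# Hypothetical sample data modelled on the question's example.
sample = [
    "IMPORTANTLINE1 713)#!=%! COORDINATEX.COORDINATEY1",
    "IMPORTANTLINE1 1339220## COORDINATEX.COORDINATEY144",
    "IMPORTANTLINE1 fsafasdas COORDINATEX.COORDINATEY1",
    "IMPORTANTLINE2 sadasdasd COORDINATEX.COORDINATE2",
    "IMPORTANTLINE2 sadasdasd COORDINATEX.COORDINATE1",
    "IMPORTANTLINE2 sadasda33 COORDINATEX.COORDINATE1",
]

for kept in dedupe_lines(sample):
    print(kept)
```

Because it is a generator, it can be fed an open file object directly and never holds more than the set of keys in memory. Note that with this sample the first 10 characters are "IMPORTANTL" on every line, so only the coordinates actually tell the lines apart; a larger `prefix_len` would be needed if the label at the start of the line should count as well.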

Mark Ransom

I'd probably use a tuple of the two parts of the 'key' rather than making a single string. Feels more natural. – DSM Mar 30 '12 at 14:54
@DSM, for me it feels more natural to have the key be a single thing. I see what you mean though, it's a matter of personal preference I guess. I'm not sure which would be more performant, or even if there'd be a detectable difference. – Mark Ransom Mar 30 '12 at 14:57
@MarkRansom: a tuple in Python is a "single thing". It is a pretty nice language and I think you'd like to learn more about it. Gluing separate things together into strings is the way to go in languages where keeping them as a sequence would be harder to deal with. – jsbueno Mar 30 '12 at 18:38
@jsbueno, I agree Python is a very nice language, and I do understand that a tuple *is* a single thing. I hope I didn't imply that it was wrong to use a tuple, I was just stating a preference. – Mark Ransom Mar 30 '12 at 19:13
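As a small illustration of the tuple-versus-string discussion above: both kinds of key are hashable and work in a set, but a plain concatenation can, in general, map two different pairs to the same string. The values below are invented for the demonstration. With fixed-width parts, as in the question, this collision cannot happen, so either choice is fine there.

```python
# Two different (prefix, coordinate) pairs, invented for illustration.
pair_a = ("AB", "C")
pair_b = ("A", "BC")

# Concatenating the parts loses the boundary between them.
joined_a = pair_a[0] + pair_a[1]
joined_b = pair_b[0] + pair_b[1]

print(joined_a == joined_b)  # True: both are "ABC"
print(pair_a == pair_b)      # False: the tuples keep the parts separate

# A tuple is hashable, so it can be stored in a set as a single key.
seen = {pair_a}
print(pair_b in seen)        # False
```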

score 1 · Answer 2 · answered Mar 30 '12 at 18:46

So, you have four fields per line, separated by whitespace, and the comparison is on the second field. Is that it?

lines = []
found_lines = set()
with open("mydatafile.dat", "rt") as data_file:
    for line in data_file:
        # skip blank lines (usually the last line in the file is blank)
        if not line.strip():
            continue
        # separate fields; assumes exactly four whitespace-separated fields,
        # otherwise this unpacking raises ValueError
        imp, field1, x, y = line.split()
        # significant characters of field1, as this answer reads the question:
        # of the first 10 characters, skip the first
        field1 = field1[1:10]
        key = (field1, x, y)
        if key in found_lines:
            continue
        # set.add() takes a single argument, so the key is added as one tuple
        found_lines.add(key)
        lines.append(line)

score 0 · Answer 3 · answered Mar 30 '12 at 15:24

This does it I think:

import re

data='''
IMPORTANTLINE1 713)#!=%!3839413!"#(!! COORDINATEX.COORDINATEY1
IMPORTANTLINE1 1339220"##"#"#"""""""""" COORDINATEX.COORDINATEY144
IMPORTANTLINE1 fsafasdasd!38aaa!"#(!! COORDINATEX.COORDINATEY1
IMPORTANTLINE1 713)#!=%!3839413!"#(!! COORDINATEX.COORDINATEY1
IMPORTANTLINE2 sadasdasdadadadadadadada COORDINATEX.COORDINATE2
IMPORTANTLINE2 sadasdasdadadadadadadada COORDINATEX.COORDINATE1
IMPORTANTLINE2 sadasda333333333dadadada COORDINATEX.COORDINATE1
'''
d={}
data_out=[]

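for i,line in enumerate(data.split('\n')):
    # capture the line label and both halves of the coordinate
    m=re.search(r'^(IMPORTANTLINE\d+).*(COORDINATEX)\.(COORDINATE(Y)?\d+)',line)
    if m:
        # key: label + coordinates; only the first line seen with a key is kept
        h=m.group(1)+m.group(2)+m.group(3)
        if h not in d:
            d[h]=i
            data_out.append(line)

for line in data_out:
    print(line)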

Output:

IMPORTANTLINE1 713)#!=%!3839413!"#(!! COORDINATEX.COORDINATEY1
IMPORTANTLINE1 1339220"##"#"#"""""""""" COORDINATEX.COORDINATEY144
IMPORTANTLINE2 sadasdasdadadadadadadada COORDINATEX.COORDINATE2
IMPORTANTLINE2 sadasdasdadadadadadadada COORDINATEX.COORDINATE1

You should use a separate "set" to keep track of the structures already read. Membership tests with the "in" operator on a long list are extremely expensive and turn this problem from O(N log(N)) into O(N²). Also, there is no need to use regular expressions. – jsbueno, Mar 30 '12 at 18:41
@jsbueno: Your solution is good if there really are only 4 fields separated by whitespace. However, I disagree that going from a dict to a set will go from O(N log(N)) to O(N²). [Look at Alex Martelli's performance comparison between a dict and a set](http://stackoverflow.com/a/1419324/298607). They are about the same for key look-up. – dawg, Mar 30 '12 at 19:59
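To make the point in these comments concrete, here is a rough sketch of the same first-occurrence filter written three ways. The function names and sample keys are invented for illustration. All three return the same result; what differs is the cost of the membership test, which scans the whole list for a list but is a hash lookup (constant time on average) for both a set and a dict.

```python
def dedupe_with_list(keys):
    """Track seen keys in a list: each `in` test scans the list."""
    seen, out = [], []
    for key in keys:
        if key not in seen:
            seen.append(key)
            out.append(key)
    return out


def dedupe_with_set(keys):
    """Track seen keys in a set: each `in` test is a hash lookup."""
    seen, out = set(), []
    for key in keys:
        if key not in seen:
            seen.add(key)
            out.append(key)
    return out


def dedupe_with_dict(keys):
    """Track seen keys in a dict (storing the index): also a hash lookup."""
    seen, out = {}, []
    for index, key in enumerate(keys):
        if key not in seen:
            seen[key] = index
            out.append(key)
    return out


# Hypothetical (label, coordinate) keys.
keys = [("LINE1", "Y1"), ("LINE1", "Y144"), ("LINE1", "Y1"), ("LINE2", "Y1")]
print(dedupe_with_list(keys) == dedupe_with_set(keys) == dedupe_with_dict(keys))
```

So a dict, as used in this answer, and a set behave much the same for lookups; it is only a list of seen keys that would make the whole pass quadratic.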
