0

I have this file with x/y-coordinates that I am trying to sort out. The file consists of various information, but with the coordinates at the same place within a line, like this:

IMPORTANT information 12213   1541515      COORDINATEX.COORDINATEY
IMPORTANT assadad213114141 asdadad         COORDINATEX.COORDINATEY
IMPORTANT assadad2ssss4141 asdadad         COORDINATEX.COORDINATEY
IMPORTANT ass 141 asd135566666666d         COORDINATEX.COORDINATEY

What I want is to remove all lines where the coordinates (COORDINATEX.COORDINATEY) are identical AND the first 10 characters marked IMPORTANT are identical, keeping only the first such line. I have tried using sort -u in Unix, but that won't work, as it requires the whole line to be identical, which is not the case here.

Example:

IMPORTANTLINE1 713)#!=%!3839413!"¤#(!¤! COORDINATEX.COORDINATEY1
IMPORTANTLINE1 1339220"##"#"#"""""""""" COORDINATEX.COORDINATEY144
IMPORTANTLINE1 fsafasdasd!38aaa!"¤#(!¤! COORDINATEX.COORDINATEY1
IMPORTANTLINE1 713)#!=%!3839413!"¤#(!¤! COORDINATEX.COORDINATEY1
IMPORTANTLINE2 sadasdasdadadadadadadada COORDINATEX.COORDINATE2
IMPORTANTLINE2 sadasdasdadadadadadadada COORDINATEX.COORDINATE1
IMPORTANTLINE2 sadasda333333333dadadada COORDINATEX.COORDINATE1

should look like this:

IMPORTANTLINE1 713)#!=%!3839413!"¤#(!¤! COORDINATEX.COORDINATEY1
IMPORTANTLINE1 1339220"##"#"#"""""""""" COORDINATEX.COORDINATEY144
IMPORTANTLINE2 sadasdasdadadadadadadada COORDINATEX.COORDINATE2
IMPORTANTLINE2 sadasdasdadadadadadadada COORDINATEX.COORDINATE1

Thanks in advance!

Thomas Wouters
niicepants
  • So do you have some kind of structure in this file? Like those coordinates are always last on the line, or the number of columns separated by \t is the same? Because from your examples I can't really tell. – Bogdan Mar 30 '12 at 14:49
  • Yes, the coordinates are always last, always the same length. 1, 2, and 144 were just to make them different, but I can see how that messed up my question. Sorry about that. – niicepants Mar 30 '12 at 15:16

3 Answers

1

For each line you read from the file, take the parts that define a duplicate and join them into a single string. Check whether a set already contains that string; if it doesn't, write the line to the output and add the string to the set.
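
The steps above can be sketched roughly as follows. This is only an illustration under stated assumptions: the key is taken to be the first 10 characters of the line plus the last whitespace-separated field (the asker says the coordinates are always last), and the names `dedupe_lines` and `prefix_len` are made up for this example.

```python
def dedupe_lines(lines, prefix_len=10):
    """Keep only the first line seen for each (prefix, coordinates) key."""
    seen = set()
    kept = []
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        # Assumption: a duplicate is defined by the first `prefix_len`
        # characters of the line plus its last field (the coordinates).
        key = line[:prefix_len] + "|" + fields[-1]
        if key not in seen:
            seen.add(key)
            kept.append(line)
    return kept


sample = [
    "IMPORTANTLINE1 aaa COORDINATEX.COORDINATEY1",
    "IMPORTANTLINE1 bbb COORDINATEX.COORDINATEY144",
    "IMPORTANTLINE1 ccc COORDINATEX.COORDINATEY1",
    "IMPORTANTLINE2 ddd COORDINATEX.COORDINATE2",
    "IMPORTANTLINE2 eee COORDINATEX.COORDINATE1",
    "IMPORTANTLINE2 fff COORDINATEX.COORDINATE1",
]
result = dedupe_lines(sample)
for line in result:
    print(line)
```

Note that in the question's example the first 10 characters are the same ("IMPORTANTL") on every line, so if the whole first field is what actually matters, `fields[0]` may be a better choice for the first part of the key.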

Mark Ransom
  • I'd probably use a tuple of the two parts of the 'key' rather than making a single string. Feels more natural. – DSM Mar 30 '12 at 14:54
  • @DSM, for me it feels more natural to have the key be a single thing. I see what you mean though, it's a matter of personal preference I guess. I'm not sure which would be more performant, or even if there'd be a detectable difference. – Mark Ransom Mar 30 '12 at 14:57
  • @MarkRansom: a tuple in Python is a "single thing". It is a pretty nice language and I think you'd like to learn more about it. Gluing separate things together into strings is the way to go in languages where keeping them as a sequence would be harder to deal with. – jsbueno Mar 30 '12 at 18:38
  • @jsbueno, I agree Python is a very nice language, and I do understand that a tuple *is* a single thing. I hope I didn't imply that it was wrong to use a tuple, I was just stating a preference. – Mark Ransom Mar 30 '12 at 19:13
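
To make the two options in these comments concrete, here is a small sketch of both key styles. The helper names `string_key` and `tuple_key` and the sample values are invented for illustration. Both work as set members; the one practical difference shown is that plain concatenation without a separator can make different pairs collide, which matters only when the parts are not fixed-width.

```python
def string_key(prefix, coords):
    # A separator keeps ("ab", "c") distinct from ("a", "bc").
    return prefix + "\t" + coords


def tuple_key(prefix, coords):
    # A tuple of strings is hashable, so it can go straight into a set.
    return (prefix, coords)


pairs = [
    ("IMPORTANT1", "X1.Y1"),
    ("IMPORTANT1", "X1.Y1"),  # duplicate of the first pair
    ("IMPORTANT2", "X1.Y1"),
]

seen_strings = set()
seen_tuples = set()
for prefix, coords in pairs:
    seen_strings.add(string_key(prefix, coords))
    seen_tuples.add(tuple_key(prefix, coords))

# Concatenating without a separator loses the boundary between the parts.
naive_collision = ("ab" + "c") == ("a" + "bc")
tuples_distinct = ("ab", "c") != ("a", "bc")
```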
1

So, you have four fields per line, separated by whitespace, and the significant characters are in the second field. Is that it?

lines = []
found_lines = set()
with open("mydatafile.dat", "rt") as data_file:
    for line in data_file:
        # skip blank lines (the last line in the file is often blank)
        if not line.strip():
            continue
        # separate the fields
        imp, field1, x, y = line.split()
        # keep only the significant characters in field1
        field1 = field1[1:10]  # "first 10 chars, except first"
        key = (field1, x, y)
        if key in found_lines:
            continue
        # set.add() takes a single argument, so the key must be one tuple
        found_lines.add(key)
        lines.append(line)
jsbueno
0

This does it I think:

import re

data='''
IMPORTANTLINE1 713)#!=%!3839413!"#(!! COORDINATEX.COORDINATEY1
IMPORTANTLINE1 1339220"##"#"#"""""""""" COORDINATEX.COORDINATEY144
IMPORTANTLINE1 fsafasdasd!38aaa!"#(!! COORDINATEX.COORDINATEY1
IMPORTANTLINE1 713)#!=%!3839413!"#(!! COORDINATEX.COORDINATEY1
IMPORTANTLINE2 sadasdasdadadadadadadada COORDINATEX.COORDINATE2
IMPORTANTLINE2 sadasdasdadadadadadadada COORDINATEX.COORDINATE1
IMPORTANTLINE2 sadasda333333333dadadada COORDINATEX.COORDINATE1
'''
d={}
data_out=[]

for i,line in enumerate(data.split('\n')):
    m=re.search(r'^(IMPORTANTLINE\d+).*(COORDINATEX)\.(COORDINATE(Y)?\d+)',line)
    if m:
        h=m.group(1)+m.group(2)+m.group(3)
        if h not in d:
            d[h]=i
            data_out.append(line)

for line in data_out:
    print(line)

Output:

IMPORTANTLINE1 713)#!=%!3839413!"#(!! COORDINATEX.COORDINATEY1
IMPORTANTLINE1 1339220"##"#"#"""""""""" COORDINATEX.COORDINATEY144
IMPORTANTLINE2 sadasdasdadadadadadadada COORDINATEX.COORDINATE2
IMPORTANTLINE2 sadasdasdadadadadadadada COORDINATEX.COORDINATE1
dawg
  • you should use a separate "set" to keep the structures already read. The verification with the operator "in" in a long list is extremely expensive, and transforms this problem from O(N log(N) ) to O(N²) . Also, no need to use regular expressions – jsbueno Mar 30 '12 at 18:41
  • @jsbueno: Your solution is good if there really are only 4 fields separated by white space. However, I disagree that going from a dict to a set will go from O(N log(N) ) to O(N²). [Look at Alex Martelli's performance comparison between a dict and a set](http://stackoverflow.com/a/1419324/298607). They are about the same for key look-up. – dawg Mar 30 '12 at 19:59
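
The lookup-cost point in these comments can be checked with a rough timing sketch like the one below. The key shapes and sizes are arbitrary choices for illustration, and the absolute numbers will vary by machine. Membership tests on a set and on a dict are both hash-based, whereas `in` on a list scans the elements one by one, so only the list should stand out as slow.

```python
import timeit

# Hypothetical keys, shaped like (prefix, coordinates) pairs.
keys = [("IMPORTANT%d" % i, "X.Y%d" % i) for i in range(2000)]

as_set = set(keys)
as_dict = dict.fromkeys(keys)
as_list = list(keys)

probe = keys[-1]  # worst case for the list: the last element

containers = [("set", as_set), ("dict", as_dict), ("list", as_list)]
timings = {}
for name, container in containers:
    # Time 200 membership tests against each container.
    timings[name] = timeit.timeit(lambda: probe in container, number=200)
    print("%-4s %.6f s" % (name, timings[name]))
```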