
I have a text file with thousands of lines. Each line starts with a domain name ending in a period, followed by other information (numbers, spaces, other data).

Some domains have more than one line of information, with different numbers and data after the name, such as domain1 and domain4 in this example:

domain1.foo. 3600 ...
domain1.foo. 1800 ...
domain2.foo. 900 ...
domain3.foo. 60 ...
domain4.foo. 3600 ...
domain4.foo. 1200 ...
domain4.foo. 1200 ...

Duplicate listings are always on consecutive lines (for example, lines involving domain4 could be lines 50, 51, 52, but never 50, 60, and 400).

So what I am trying to do is use sed to delete any duplicate lines for each domain name, regardless of what comes after the name, keeping only the first one. The example would become:

domain1.foo. 3600 ...
domain2.foo. 900 ...
domain3.foo. 60 ...
domain4.foo. 3600 ...

I only have a basic knowledge of regex and would appreciate some help as to how to go about this. I managed to get the list formatted so tabs and double spaces are removed, but I need a little help for this part.

Joen499

2 Answers


Awk to the rescue:

$ awk 'last != $1; {last = $1}' file
domain1.foo. 3600 ...
domain2.foo. 900 ...
domain3.foo. 60 ...
domain4.foo. 3600 ...

This works by setting the variable last to the first column of each line once that line has been handled. A line is only printed if its first column isn't the same as last, that is, the first column of the previous line.
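To make the comparison explicit, here is a minimal sketch of the same idea written out longhand and run on the sample data from the question. The helper names sample_input and dedupe_adjacent are made up for illustration, and the NR == 1 guard is my addition so the first line is always printed, even if its first field happens to be empty.

```shell
# Hypothetical helper: prints the sample data from the question.
sample_input() {
  printf '%s\n' \
    'domain1.foo. 3600 ...' \
    'domain1.foo. 1800 ...' \
    'domain2.foo. 900 ...' \
    'domain3.foo. 60 ...' \
    'domain4.foo. 3600 ...' \
    'domain4.foo. 1200 ...' \
    'domain4.foo. 1200 ...'
}

# Hypothetical helper: reads stdin and keeps only the first line of each
# run of consecutive lines that share the same first field.
dedupe_adjacent() {
  awk '
    NR == 1 || $1 != last { print }   # first line, or first field changed
    { last = $1 }                     # remember this line'"'"'s first field
  '
}

sample_input | dedupe_adjacent
```

Because only the previous line's first field is remembered, this relies on duplicates being adjacent, which is what the question guarantees.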

You can also do it with sed, but you really shouldn't:

sed ':s;$!N;/^\([^ ]*\) [^\n]*\n\1 /{s/\n.*//;bs};P;D' file

The above works by appending the next line to the pattern space and checking whether the first column of the two lines is the same.

If they are the same, the second line is removed and the script jumps back to the start.

Once the first columns differ, the first line is printed and then deleted, and the script is repeated for the second line.

:s                                     # Label called `s'
$!N                                    # Append next line to pattern space (except on the last line)
/^\([^ ]*\) [^\n]*\n\1 / {             # If the first columns are the same...
    s/\n.*//                           # Remove last line
    b s                                # Goto `s'
}                                      # If the columns are not the same...
P                                      # Print first line from pattern space
D                                      # Delete the printed line
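For anyone who wants to try the loop, here is a rough sketch of it as a multi-line script run on the sample data. It uses `$!N` so the last line is not lost on seds that quit when N finds no next line, `.*\n` in place of `[^\n]*\n` since `\n` inside a bracket expression is not portable, and a space after `\1` so that only a whole first column counts as a match. The names sample_input and dedupe_sed are hypothetical.

```shell
# Hypothetical helper: prints the sample data from the question.
sample_input() {
  printf '%s\n' \
    'domain1.foo. 3600 ...' \
    'domain1.foo. 1800 ...' \
    'domain2.foo. 900 ...' \
    'domain3.foo. 60 ...' \
    'domain4.foo. 3600 ...' \
    'domain4.foo. 1200 ...' \
    'domain4.foo. 1200 ...'
}

# Hypothetical helper: reads stdin, keeps the first line of each run of
# consecutive lines whose first space-delimited column is the same.
dedupe_sed() {
  sed '
    :s
    $!N
    /^\([^ ]*\) .*\n\1 /{
      s/\n.*//
      bs
    }
    P
    D
  '
}

sample_input | dedupe_sed
```

The pattern space never holds more than two lines here, so `.*\n` can only stop at the one newline between them.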
Andreas Louv

andlcr's helpful awk answer is the way to go, especially given that it is portable (POSIX-compliant) and works with variable-length domain names.

In this simple case,

  • given the fixed number of characters in the line prefixes (12 in the example, the length of domain1.foo.),

  • if your platform has the GNU implementation of uniq (verify with uniq --version),

the following will work too:

uniq -w 12 file
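As a quick illustration of where the 12 comes from, here is a small sketch that derives the width from one sample prefix instead of hard-coding it. It assumes GNU uniq (for -w) and that every domain name in the file has the same length; sample_input, prefix and width are names I made up for the example.

```shell
# Hypothetical helper: prints the sample data from the question.
sample_input() {
  printf '%s\n' \
    'domain1.foo. 3600 ...' \
    'domain1.foo. 1800 ...' \
    'domain2.foo. 900 ...' \
    'domain3.foo. 60 ...' \
    'domain4.foo. 3600 ...' \
    'domain4.foo. 1200 ...' \
    'domain4.foo. 1200 ...'
}

# Width of the fixed-length prefix, taken from one representative name.
prefix='domain1.foo.'
width=${#prefix}

# GNU uniq: compare only the first $width characters of adjacent lines
# and keep the first line of each run.
sample_input | uniq -w "$width"
```

If the names vary in length (say domain10.foo. shows up), a fixed width no longer lines up with the first column, and the awk approach is the safer choice.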
mklement0