
I have a text file in which each line should contain fields separated by four pipe characters (|). Some problem lines contain fewer than four pipes. The data in those lines is not needed, so I want to run a command that deletes every line containing fewer than four pipes. I would also like to know how many lines were deleted, so ideally the count would be printed on the screen once the command has run.

Sample data:

865|Blue Moon Club|Havana Project|34d|879
899|Soya Plates|Dimsby|78a|699
657|Sherlock
900|Forestry Commission|Eden Project|68d|864

Desired output:

865|Blue Moon Club|Havana Project|34d|879
899|Soya Plates|Dimsby|78a|699
900|Forestry Commission|Eden Project|68d|864

I have tried awk '|>=3' file.txt, which didn't work. There is a lot of information about awk out there, but the sheer volume makes it hard to find exactly what I need.

Martijn Pieters
neilH

2 Answers


To eliminate the lines:

grep '|.*|.*|.*|' file > newfile

To count the number of bad lines:

grep -cv '|.*|.*|.*|' file

That doesn't do the edit in place; you could do that with sed but it is often safer to do edits like this to a newfile, in order to avoid losing data if you make a mistake.

The first grep pattern matches any line with at least four pipe symbols. (By default, grep uses "basic" regular expressions, in which | is an ordinary character; alternation, where supported, is written \|.)

The second invocation counts (-c) the number of non-matching (-v) lines.
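
As a quick check of the two grep commands, the sketch below runs them against the sample data from the question. The temporary directory and the file names sample.txt and kept.txt are made up for illustration.

```shell
# Work in a throwaway directory (hypothetical file names, for illustration only).
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Recreate the sample data from the question.
cat > sample.txt <<'EOF'
865|Blue Moon Club|Havana Project|34d|879
899|Soya Plates|Dimsby|78a|699
657|Sherlock
900|Forestry Commission|Eden Project|68d|864
EOF

# Keep the lines that contain at least four pipes.
grep '|.*|.*|.*|' sample.txt > kept.txt

# Count the lines that are dropped (-v inverts the match, -c counts).
deleted=$(grep -cv '|.*|.*|.*|' sample.txt)
echo "Deleted $deleted line(s)"   # prints: Deleted 1 line(s)
```

Storing the count in a variable makes it easy to print a message or to act on the number later in a script.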

Here's a simple sed solution:

sed -n -i.bak  '/|.*|.*|.*|/p' file

The -n option turns off automatic printing, so the command only prints the lines which match the pattern. (Again, by default, sed uses basic regexes.) The -i.bak option does the edit in place, creating a backup of the original with the name file.bak.
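
Because the in-place edit leaves a backup behind, the number of deleted lines can be recovered by comparing the two files. This is only a sketch: data.txt and the temporary directory are hypothetical names, and the data is a shortened version of the sample from the question.

```shell
# Throwaway directory and a hypothetical input file, for illustration only.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf '%s\n' \
  '865|Blue Moon Club|Havana Project|34d|879' \
  '657|Sherlock' \
  '900|Forestry Commission|Eden Project|68d|864' > data.txt

# Edit in place; the untouched original is saved as data.txt.bak.
sed -n -i.bak '/|.*|.*|.*|/p' data.txt

# Lines deleted = lines in the backup minus lines in the edited file.
before=$(wc -l < data.txt.bak)
after=$(wc -l < data.txt)
echo "Deleted $((before - after)) line(s)"   # prints: Deleted 1 line(s)
```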

If you wanted to select lines with exactly four pipes, you could use awk:

awk -F'|' 'NF==5' file > newfile

which will set the field separator to a pipe symbol and then select the lines with exactly five fields, which are the lines with four pipes.

A useful tool to count lines is wc:

wc -l file

will tell you how many lines are in file; if you count lines in both file and newfile, the difference will obviously be the number of deletions. You could do that computation in awk, too, but it's a bit wordier:

awk -F'|' 'NF==5{print;next}{del++}END{print del+0 > "/dev/stderr"}' file > newfile
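
To see the one-pass awk approach in action, the sketch below runs it on the sample data and sends the count to a separate file so it can be inspected. The names sample.txt, kept.txt and count.txt are hypothetical, and the del+0 is an addition here so that a file with no bad lines reports 0 rather than an empty line.

```shell
# Throwaway directory and hypothetical file names, for illustration only.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
cat > sample.txt <<'EOF'
865|Blue Moon Club|Havana Project|34d|879
899|Soya Plates|Dimsby|78a|699
657|Sherlock
900|Forestry Commission|Eden Project|68d|864
EOF

# Good lines go to stdout (kept.txt). The count goes to stderr, which is
# redirected to count.txt here; without 2> it would appear on the terminal.
awk -F'|' 'NF==5 {print; next} {del++} END {print del+0 > "/dev/stderr"}' \
  sample.txt > kept.txt 2> count.txt

cat count.txt   # prints: 1
```

Writing the count to stderr keeps it out of the filtered data, so stdout can be redirected to the new file without the number ending up in it.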
rici

This will do:

sed -i.bak '/\([^|]*|\)\{4\}/!d' file

Or (as suggested in Cyrus's comment):

sed -i.bak -E '/(\|[^|]*){4}/!d' file

Or

sed -n '/^[^|]*|[^|]*|[^|]*|[^|]*|[^|]*$/p' file > newfile

Or

sed -e '/^[^|]*|[^|]*|[^|]*|[^|]*$/d' \
    -e '/^[^|]*|[^|]*|[^|]*$/d' \
    -e '/^[^|]*|[^|]*$/d' \
    -e '/^[^|]*$/d' \
    -i.bak file

None of these print the number of deleted lines, though. To get it, run grep -cv '^[^|]*|[^|]*|[^|]*|[^|]*|[^|]*$' file on the original file, as rici mentioned, or compare the line counts before and after using wc -l file.


Explanation:

The first two sed commands match loosely: they keep lines with at least four pipes (not fewer, but possibly more). The third keeps only lines with exactly four pipes (no more, no fewer).

The fourth sed command matches lines with exactly 3, 2, 1 or 0 pipes (|), deletes them in place, and keeps a backup of the original as file.bak.
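
The difference between the loose and the exact match only shows up on lines with more than four pipes. The sketch below uses made-up test data (test.txt is a hypothetical name) and an anchored pattern that allows text after the last pipe.

```shell
# Throwaway directory and hypothetical test data: lines with 4, 5 and 1 pipes.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf '%s\n' 'a|b|c|d|e' 'a|b|c|d|e|f' 'a|b' > test.txt

# Loose: keep lines with at least four pipes.
loose=$(sed -n '/\([^|]*|\)\{4\}/p' test.txt | grep -c '')

# Exact: keep lines with four pipes and no more.
exact=$(sed -n '/^[^|]*|[^|]*|[^|]*|[^|]*|[^|]*$/p' test.txt | grep -c '')

echo "loose=$loose exact=$exact"   # prints: loose=2 exact=1
```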

Jahid