
I have a text file (more correctly, a “German style” CSV file, i.e. semicolon-separated with decimal commas) which has a date and a measurement value on each line.
There are stretches of faulty values which I want to remove before further work. I'd like to store these cuts in some script so that my corrections are documented and I can replay those corrections if necessary.

The lines look like this:

28.01.2005 14:48:38;5,166
28.01.2005 14:50:38;2,916
28.01.2005 14:52:38;0,000
28.01.2005 14:54:38;0,000
(long stretch of values that should be removed; could also be something else beside 0)
01.02.2005 00:11:43;0,000
01.02.2005 00:13:43;1,333
01.02.2005 00:15:43;3,250

Now I'd like to store a list of begin and end patterns like 28.01.2005 14:52:38 + 01.02.2005 00:11:43, and the script would cut the lines matching these begin/end pairs and everything that's between them.

I'm thinking about hacking an awk script, but perhaps I'm missing an already existing tool.

Robert Harvey
Florian Jenn

5 Answers


Have a look at sed:

sed '/start_pat/,/end_pat/d'

will delete lines between start_pat and end_pat (inclusive).
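Applied to a shortened version of the question's sample data (the printf is only there to make the sketch self-contained; the dots in the timestamps are escaped so they match literally):

```shell
# Delete the faulty stretch, inclusive of both boundary lines.
printf '%s\n' \
  '28.01.2005 14:50:38;2,916' \
  '28.01.2005 14:52:38;0,000' \
  '01.02.2005 00:11:43;0,000' \
  '01.02.2005 00:13:43;1,333' \
  | sed '/28\.01\.2005 14:52:38/,/01\.02\.2005 00:11:43/d'
# prints the two kept lines:
# 28.01.2005 14:50:38;2,916
# 01.02.2005 00:13:43;1,333
```

sed writes to stdout, so redirect to a new file rather than overwriting the input (or use GNU sed's -i).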

To delete multiple such pairs, you can combine them with multiple -e options:

sed -e '/s1/,/e1/d' -e '/s2/,/e2/d' -e '/s3/,/e3/d' ...
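A toy sketch with two cut ranges, where s1/e1 and s2/e2 stand in for real begin/end patterns:

```shell
# Both ranges are deleted independently; lines outside them pass through.
printf '%s\n' keep1 s1 x e1 keep2 s2 y e2 keep3 \
  | sed -e '/s1/,/e1/d' -e '/s2/,/e2/d'
# prints: keep1, keep2, keep3
```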
Alok Singhal
  • 93,253
  • 21
  • 125
  • 158
  • Great! I knew I was missing something… I always used sed with single patterns and never recalled that it offers ranges. – Florian Jenn Jan 03 '10 at 23:53
  • Also, I can put the expressions in a file, where I can also use comments (with `#`). The command line then is `sed -f scriptfile infile >outfile`. – Florian Jenn Jan 04 '10 at 00:08
  • Be careful: if the `end_pat` does **not** exist, **everything** in the file after the `start_pat` is deleted. Also, if you have multiple occurrences of any of the patterns, you will get different results depending on the order. – FireEmerald Feb 21 '20 at 18:46
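FireEmerald's pitfall is easy to demonstrate on toy data:

```shell
# 'end' never occurs, so the range never closes and the d command
# eats everything from 'start' to the end of the input.
printf '%s\n' a start b c | sed '/start/,/end/d'
# prints only: a
```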

Firstly, why do you need to keep a record of what you have done? Why not keep a backup of the original file, or take a diff between the old & new files, or put it under source control?

For the actual changes I suggest using Vim.

The Vim :global command (abbreviated to :g) can be used to run :ex commands on lines that match a regex. This is in many ways more powerful than awk since the commands can then refer to ranges relative to the matching line, plus you have the full text processing power of Vim at your disposal.

For example, this will do something close to what you want (untested, so caveat emptor):

:g!/^\d\d\.\d\d\.\d\d\d\d/ -1 write >> tmp.txt | delete

This matches lines that do NOT start with a date (the ! negates the match), appends the previous line to the file tmp.txt, then deletes the current line.

You will probably end up with duplicate lines in tmp.txt, but they can be removed by running the file through uniq.
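Untested sketch: since in this data every line does begin with a date, a plain :ex range delete (the direct analogue of the sed answer, using the question's timestamps) may be the closer fit, while :g earns its keep when the cut is defined by line content rather than by a begin/end pair:

```vim
" Delete from the line matching the first timestamp through the
" line matching the second, inclusive (dots escaped to match literally).
:/28\.01\.2005 14:52:38/,/01\.02\.2005 00:11:43/d
```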

Dave Kirby
  • I'd like to keep short notes about the records I threw out and why. I will work with these data not very frequently, and I know I might forget what I had done. Also, someone else may need to understand and reproduce what I've done. Sadly, your vi/ex example doesn't really solve my problem, because all lines start with a date. But I understand the direction you're pointing to. – Florian Jenn Jan 04 '10 at 00:00

You can also use awk. A bare range pattern prints the matching stretch, so to delete it, skip those lines and print the rest:

awk '/start/,/end/ {next} {print}' file
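A quick check on toy data, with start/end standing in for the timestamp patterns:

```shell
# Lines inside the start..end range are skipped; everything else is printed.
printf '%s\n' keep1 start x end keep2 \
  | awk '/start/,/end/ {next} {print}'
# prints: keep1, keep2
```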
ghostdog74
  • Somewhere it was mentioned that awk is appropriate where data is represented in column format. Is that correct? Could you please explain whether awk is better than sed for **this** particular task. – Talespin_Kit Mar 27 '14 at 12:22

I would seriously suggest learning the basics of Perl (i.e. not the OO stuff). It will repay you in bucket-loads.

Once you have grasped the fundamentals, which are pretty simple if you are used to awk, sed, grep etc., it is fast and easy to write a bit of Perl to do this (and many other such tasks).

You won't have to remember how to use lots of different tools, and where you would previously have piped several tools together to solve a problem, you can use a single Perl script (usually much faster to execute).

And Perl is installed on virtually every Unix/Linux distro now.

(That sed is neat, though :-)

DaveC

use grep -v (prints non-matching lines)

Sorry - thought you just wanted lines without 0,000 at the end

Martin Beckett