I'm using sed because the files are large. I'd like to match strings of the form

'09/07/15 16:56:36,333000000','DD/MM/RR HH24:MI:SSXFF'

and replace them with

'09/07/15 16:56:36','DD/MM/RR HH24:MI:SS'

According to a regex tester, this regex matches:
'\d{2}\/\d{2}\/\d{2}\s\d{2}:\d{2}:\d{2},\d{9}','DD\/MM\/RR HH24:MI:SSXFF'

but when I do

sed -ie "s#\(\x27\d{2}\/\d{2}\/\d{2}\s\d{2}:\d{2}:\d{2}\),\d{9}  
\(\x27,\x27DD\/MM\/RR HH24:MI:SS\)XFF\x27#\1\2\x27#g" inputfile  

it does not replace anything. What am I missing?

McQuack
  • Please note that `sed -ie` probably doesn't do what you want. `-i` actually takes an optional argument which it uses to create a backup of the file before modifying it. So in your case it will create `inputfilee`. If you didn't actually want to do a backup, I'd propose to change `sed -ie` to `sed -i -e` or even `sed -i` (`-e` is unnecessary if you provide only one expression at the command line). – werkritter Jul 19 '15 at 18:21
  • I tried with only -i switch but it does not work either. Does the regex given seem right ? I also tried with -r, but gave an error "invalid reference on s command". – McQuack Jul 19 '15 at 18:56
  • That was just another, somewhat separate problem. It may cause some potentially unexpected results (new files being created), but doesn't concern the main problem — that's why I described it in the comment. – werkritter Jul 19 '15 at 19:00

2 Answers

Why not just use something like this?

#!/usr/bin/sed -f
s/,[[:digit:]]*//
s/XFF//
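
A quick way to try these two substitutions on the sample line from the question, as a minimal sketch assuming a POSIX shell with sed on the PATH. The variable names and the stricter variant are illustrative assumptions, not part of the script above:

```shell
# Sample line in the format given in the question.
input="'09/07/15 16:56:36,333000000','DD/MM/RR HH24:MI:SSXFF'"

# The two substitutions from the script, passed as -e expressions.
# Without the g flag, each one changes only the first match on a line.
result=$(printf '%s\n' "$input" | sed -e 's/,[[:digit:]]*//' -e 's/XFF//')
printf '%s\n' "$result"

# Hypothetical stricter variant: require exactly nine digits followed by
# the closing quote, so other commas on the line are left untouched.
strict=$(printf '%s\n' "$input" | sed -e "s/,[[:digit:]]\{9\}'/'/g" -e "s/XFF'/'/g")
printf '%s\n' "$strict"
```

Because `,[[:digit:]]*` also matches a bare comma, the first expression removes whichever comma comes first on the line. The stricter variant may be safer if the lines contain other comma-separated values.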
Zombo
  • Thank you, it worked although [[:digit:]] did not seem to work. Being on debian, I adapted to : `#!/bin/sed -f s/,[0-9]*//g s/XFF//g` – McQuack Jul 19 '15 at 19:17
  • I must be tired, I tried again and [[:digit:]] works as expected. – McQuack Jul 19 '15 at 19:50

NOTE: in the answer below I describe why your expression doesn't work in general. I would strongly suggest that you try to simplify your expression as much as possible first, or use @StevenPenny's excellent answer, because:

  • applying the changes described below in your present expression would turn it into a hulking, unmaintainable regex nightmare;
  • my remarks may not be exhaustive — they point out the cause, some of the particular problems, and sources for further investigation.

The problem is that sed and http://regexr.com/ use different regex engines. See the "RegEx engine" section on the website:

While the core feature set of regular expressions is fairly consistent, different implementations (ex. Perl vs Java) may have different features or behaviours.

RegExr uses your browser's RegExp engine for matching, and its syntax highlighting and documentation reflect the JavaScript RegExp standard.

GNU sed, by contrast, is mostly compatible with POSIX.2 Basic Regular Expressions (BREs). See this excerpt from the sed(1) manpage for GNU sed, version 4.2.2:

REGULAR EXPRESSIONS

POSIX.2 BREs should be supported, but they aren't completely because of performance problems. The \n sequence in a regular expression matches the newline character, and similarly for \a, \t, and other sequences.

The POSIX regex languages, BRE (Basic Regular Expressions) and ERE (Extended Regular Expressions), are described in the regex(7) manpage.

In particular, concerning your expression:

  • Character class notation is different: for example, for digits you're using \d, while in BRE you should write [[:digit:]]; for white space, you're using \s, whereas in BRE there's [[:space:]].
  • Some characters need a preceding backslash to take on their special meaning instead of their literal one. That applies to the interval braces: { and } must be written \{ and \} in BRE.
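
To make the two points above concrete, here is a sketch of the original expression rewritten in BRE syntax and run on the sample line from the question. It assumes a POSIX shell with sed on the PATH. The variable names are illustrative, and literal single quotes inside a double-quoted shell string stand in for \x27:

```shell
# Sample line in the format given in the question.
input="'09/07/15 16:56:36,333000000','DD/MM/RR HH24:MI:SSXFF'"

# BRE rewrite: [[:digit:]] replaces \d, [[:space:]] replaces \s, and the
# repetition counts use \{n\}. Since '#' is the s-command delimiter,
# the slashes in the date need no escaping.
bre="s#\('[[:digit:]]\{2\}/[[:digit:]]\{2\}/[[:digit:]]\{2\}[[:space:]][[:digit:]]\{2\}:[[:digit:]]\{2\}:[[:digit:]]\{2\}\),[[:digit:]]\{9\}\(','DD/MM/RR HH24:MI:SS\)XFF'#\1\2'#g"

# Group 1 keeps the timestamp without the fraction, group 2 keeps the
# format string without XFF; the closing quote is added back literally.
result=$(printf '%s\n' "$input" | sed -e "$bre")
printf '%s\n' "$result"
```

With GNU sed's -r (or -E) switch the expression is read as an ERE instead, where grouping is written ( ) and intervals {n}. In that mode \( is a literal parenthesis, which explains the "invalid reference" error for \1 when -r is added to the original command.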
werkritter
  • Ok I see, thanks for the explanation. I'm new on both GNU sed tool and regular expressions. I was inspired by [this question](https://stackoverflow.com/questions/9721253/sed-regex-substitute) and my very basic knowledge without thinking enough about different implementations. Please, forgive my english, it's not my native language. – McQuack Jul 19 '15 at 19:46
  • The POSIX regexes are hard to approach just by themselves. If you looked into `regex(7)`, you could see that the manpage authors themselves have a negative attitude towards having multiple kinds of regexes: "Having two kinds of REs is a botch". – werkritter Jul 19 '15 at 19:50