Perl, delete everything after first three characters

Question

I promise you all I've searched the site for about two hours now. I've found several answers that should have worked, but they didn't.

I have a line that consists of a varying amount of numbers separated by spaces. I want to delete everything after the third number.

I should say that everything I've been writing has assumed that \S\s\S\s\S would match the first three numbers, with spaces between 1 and 2, and between 2 and 3.

I anticipated the following working:

s/^.*?[\S\s\S\s\S].{5}//s;

but it did the exact opposite of what I wanted.

I would like 2 3 0 4 5 6 7 1 0 1 2 to become 2 3 0

I would really prefer to keep it as a substitution. I've tried look-behind as one person mentioned and I had no luck. Should I be saving the first 3 numbers as a string before I try these commands?

EDIT:

I should have clarified that these numbers could be in the form 1.57 or 1.00E01 as well. I only used integers while I was trying to get a baseline version working.

`\S\s\S\s\S` will indeed match `2 3 0`; however, square brackets denote a [character class](http://www.regular-expressions.info/charclass.html), so `[\S\s\S\s\S]` will match exactly one character, provided that character is either white-space (`\s`) or non-white-space (`\S`). So `s/^.*?[\S\s\S\s\S].{5}//s;` is equivalent to `s/^.*?.{6}//s;`, which (since `.*?` will match as little as it can, which in this case will always be the empty string) is equivalent to `s/^.{6}//s;` -- deleting the first six characters, provided the string *has* at least six characters. — ruakh, Aug 08 '12 at 17:55
@ruakh Thank you for linking to the character class page. I had originally just had one Character and one Space in my brackets to test if it worked, then expanded it without re-testing that. — caleb.breckon, Aug 08 '12 at 18:20
To get things a bit clearer: does your input consists of ONLY numbers, always separated by spaces? — pavel, Aug 08 '12 at 18:49
No, the program that produces the file I am editing with Perl loves to use 1.50E02 formatted exponents. — caleb.breckon, Aug 08 '12 at 22:23

score 5 · Accepted Answer · answered Aug 08 '12 at 18:05

\S\s\S\s\S will indeed match three non-space characters separated by space characters. However, ^.*?[\S\s\S\s\S].{5} does something completely different:

^ matches the beginning of the line.
.*? matches as few characters as it can (not as many as it can), extending only until the rest of the pattern can match. Since you specify /s, . will match newline as well.
[\S\s\S\s\S] is a character class, and so is the same as [\S\s]—match either \S or \s, which is to say anything.
.{5} will match five characters.

Since [\S\s] and . with /s match the same things, the .*? will never match any characters as it wants to match as little as possible. Thus, this is the same as s/^.{6}//s—delete the first six characters from the string. As you can see, that's not what you wanted!
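To see the equivalence described above in action, here is a small sketch that runs the substitution from the question next to the shorter s/^.{6}//s on the sample line. The sub names original_attempt and drop_six are made up for this illustration.

```perl
use strict;
use warnings;

# Apply the substitution from the question to a copy of the input.
sub original_attempt {
    my ($line) = @_;
    $line =~ s/^.*?[\S\s\S\s\S].{5}//s;
    return $line;
}

# The equivalent short form: delete the first six characters.
sub drop_six {
    my ($line) = @_;
    $line =~ s/^.{6}//s;
    return $line;
}

my $input = "2 3 0 4 5 6 7 1 0 1 2";
print original_attempt($input), "\n";   # prints "4 5 6 7 1 0 1 2"
print drop_six($input), "\n";           # prints the same line
```

Both calls remove "2 3 0 " (six characters, counting the trailing space), which is exactly the part the question wanted to keep.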

One way to keep the first three numbers is to match them explicitly: s/^(\d \d \d).*/$1/s. Here, \d matches a single digit (0–9), with literal spaces in between. We match the first three numbers followed by anything at all, and then replace the whole match (since it ends in .*, that's the whole string) with just the bit between the parentheses, i.e. the first three numbers.

If your numbers can be more than one digit long, then s/^(\d+ \d+ \d+).*/$1/s will do what you want. If arbitrary space-like characters (space, tab, newline) can separate them, then s/^(\d\s\d\s\d).*/$1/s is what you want (or use \s+ if there can be several in a row). If the fields can contain things other than digits, you can use \S or \S+ in place of \d, just as you were.
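As a minimal sketch of the capture-and-replace approach, the following uses the most general variant mentioned above (\S+ fields separated by \s+). The sub name keep_first_three is hypothetical, and the second sample line is made-up data in the formats the question mentions.

```perl
use strict;
use warnings;

# Keep the first three whitespace-separated fields and drop the rest.
# Lines with fewer than three fields, or with leading whitespace,
# do not match and are returned unchanged.
sub keep_first_three {
    my ($line) = @_;
    $line =~ s/^(\S+\s+\S+\s+\S+).*/$1/s;
    return $line;
}

print keep_first_three("2 3 0 4 5 6 7 1 0 1 2"), "\n";   # prints "2 3 0"
print keep_first_three("1.57 1.00E01 -3 42 7"), "\n";    # prints "1.57 1.00E01 -3"
```

Because \S matches any non-whitespace character, this version does not check that the fields are really numbers; it only counts fields.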

Another approach, using lookbehind, would be s/(?<=^\d \d \d).*//s. In other words, delete any characters which are preceded by ^\d \d \d—the beginning of the string followed by three space-separated numbers. There's no real advantage to this approach—I'd probably do it the other way—but since you mentioned lookbehind, here's how you can do it. (Again, things like s/(?<=^\S\s\S\s\S).*//s are more general.)
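A sketch of the lookbehind form follows, alongside a related construct that is not part of the answer above: \K, available since Perl 5.10. Lookbehind in Perl has to be of bounded length, so \S+ cannot be used inside it; \K ("keep everything matched so far out of the replacement") is one way around that. The sub names are made up for this illustration.

```perl
use strict;
use warnings;

# Fixed-width lookbehind: works when every field is a single character.
sub lookbehind_single_char {
    my ($line) = @_;
    $line =~ s/(?<=^\S\s\S\s\S).*//s;
    return $line;
}

# \K variant: match three fields of any length, then delete only what
# follows them. Nothing needs to be captured or put back.
sub truncate_with_keep {
    my ($line) = @_;
    $line =~ s/^\S+\s+\S+\s+\S+\K.*//s;
    return $line;
}

print lookbehind_single_char("2 3 0 4 5 6"), "\n";     # prints "2 3 0"
print truncate_with_keep("1.57 1.00E01 20 4 5"), "\n"; # prints "1.57 1.00E01 20"
```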

Thank you so much. Your solution was what I ultimately used, just substituting S for d so I could have decimal and E formed answers. `s/^(\S+ \S+ \S+).*/$1/s` — caleb.breckon, Aug 08 '12 at 18:20
Glad I could help! If this answer solved your problem, you could "accept" it by clicking the green check mark next to it. ([The FAQ entry on how to ask](http://stackoverflow.com/faq/#howtoask) explains what this is for.) — Antal Spector-Zabusky, Aug 08 '12 at 19:22

DavidO · Answer 2 · 2012-08-08T18:26:05.000

So match the first three numbers explicitly, and drop everything else.

s/^([\dE.]+)\s+([\dE.]+)\s+([\dE.]+).*$/$1 $2 $3/;

This works as follows:

$ perl -MYAPE::Regex::Explain -E 'say YAPE::Regex::Explain->new(q{^([\dE.]+)\s+([\dE.]+)\s+([\dE.]+).*$})->explain;'
The regular expression:

(?-imsx:^([\dE.]+)\s+([\dE.]+)\s+([\dE.]+).*$)

matches as follows:

NODE                     EXPLANATION
----------------------------------------------------------------------
(?-imsx:                 group, but do not capture (case-sensitive)
                         (with ^ and $ matching normally) (with . not
                         matching \n) (matching whitespace and #
                         normally):
----------------------------------------------------------------------
  ^                        the beginning of the string
----------------------------------------------------------------------
  (                        group and capture to \1:
----------------------------------------------------------------------
    [\dE.]+                  any character of: digits (0-9), 'E', '.'
                             (1 or more times (matching the most
                             amount possible))
----------------------------------------------------------------------
  )                        end of \1
----------------------------------------------------------------------
  \s+                      whitespace (\n, \r, \t, \f, and " ") (1 or
                           more times (matching the most amount
                           possible))
----------------------------------------------------------------------
  (                        group and capture to \2:
----------------------------------------------------------------------
    [\dE.]+                  any character of: digits (0-9), 'E', '.'
                             (1 or more times (matching the most
                             amount possible))
----------------------------------------------------------------------
  )                        end of \2
----------------------------------------------------------------------
  \s+                      whitespace (\n, \r, \t, \f, and " ") (1 or
                           more times (matching the most amount
                           possible))
----------------------------------------------------------------------
  (                        group and capture to \3:
----------------------------------------------------------------------
    [\dE.]+                  any character of: digits (0-9), 'E', '.'
                             (1 or more times (matching the most
                             amount possible))
----------------------------------------------------------------------
  )                        end of \3
----------------------------------------------------------------------
  .*                       any character except \n (0 or more times
                           (matching the most amount possible))
----------------------------------------------------------------------
  $                        before an optional \n, and the end of the
                           string
----------------------------------------------------------------------
)                        end of grouping
----------------------------------------------------------------------

(Updated in consideration of the changes the OP made to the original specification.)

This fails as currently written on negative numbers and on negative exponents. — Mark, Aug 08 '12 at 18:51
I don't see where the OP mentioned the possibility of negative exponents or negative numbers. But you are correct. At some point it becomes advantageous to just use Regexp::Common to employ a community-tested solution. — DavidO, Aug 08 '12 at 19:00
These are applying to program flags that for some reason can be negative. I'm still learning to clearly state my question. Thanks for your help! — caleb.breckon, Aug 08 '12 at 22:24
Just add a `-` as the last item inside each character-class group to allow for negatives. ...but we're getting to the point that Regexp::Common is probably going to be more robust. — DavidO, Aug 08 '12 at 23:29
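The suggestion in the last comment can be sketched like this: the substitution from the answer with a `-` appended to each character class. The sub name and the sample input are invented for the example. Note that the class still only accepts an upper-case E and has no `+`, so this is an illustration of the comment and not a complete number matcher.

```perl
use strict;
use warnings;

# Same substitution as in the answer, with '-' added at the end of each
# character class so that negative numbers and negative exponents match.
sub first_three_numbers {
    my ($line) = @_;
    $line =~ s/^([\dE.-]+)\s+([\dE.-]+)\s+([\dE.-]+).*$/$1 $2 $3/;
    return $line;
}

print first_three_numbers("-1.5 2.00E-01 3 4 5"), "\n";   # prints "-1.5 2.00E-01 3"
```

Since the replacement is rebuilt as "$1 $2 $3", runs of whitespace between the first three fields are collapsed to single spaces.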

Shawn Blakesley · Answer 3 · 2012-08-08T18:39:40.300


Where you wrote s/^.*?[\S\s\S\s\S].{5}//s; I would write s/^(\S\s\S\s\S).*$/$1/ instead. You're forgetting to capture the part you want to keep with parentheses and put it back with $1, and having a .* at the beginning could lead to the leading numbers being removed instead of the trailing ones. Also, I'm not sure whether you have a guarantee of single-digit numbers or of single whitespace characters, so you could write s/^(\S+\s+\S+\s+\S+).*$/$1/ to allow for runs of whitespace and numbers longer than one character. Let me know if I need to clarify that a little more.

Here's a website I find super helpful for Perl regex: http://pubcrawler.org/perl-reference.html


Since this was the first time I've answered a question, it would be helpful to have an explanation for a down vote on this particular answer. – Shawn Blakesley Aug 08 '12 at 18:27
Did you test your solution? `s/^[\S+\s+\S+\s+\S+].*$/$1/` is wrong, because of the `[...]` character class brackets, and because there are no capturing parens (which your answer called attention to but didn't fix). – DavidO Aug 08 '12 at 18:30
Ah, I see. I totally mistyped that part my bad. I'll fix it. Thanks :D – Shawn Blakesley Aug 08 '12 at 18:39

im8bit · Answer 4 · 2012-08-08T20:41:49.800

The question is, why do you want to do this with a regexp? It seems easier to me with:

$string = substr $string, 0, 5;

or, if you really want a substitution (I didn't test it):

s/^(.{5})(.*)/$1/

Parentheses allow you to "remember" parts of a pattern; this is the way to say that you want to replace pretty much everything with just the first part of the pattern (the first five characters). This pattern will match any line of text with at least five characters and leave just the first five. You may want to modify it to match three numbers with spaces between them.
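A small sketch of the substr idea, assuming the intent is to keep the first five characters (offset 0, length 5). The sub name first_five_chars is hypothetical. The second call shows why a fixed character count only fits single-digit fields.

```perl
use strict;
use warnings;

# Keep only the first five characters of the string.
sub first_five_chars {
    my ($string) = @_;
    return substr $string, 0, 5;
}

print first_five_chars("2 3 0 4 5 6"), "\n";    # prints "2 3 0"
print first_five_chars("1.57 2 3 4"), "|\n";    # prints "1.57 |"
```

With input such as 1.57 or 1.00E01, five characters no longer line up with three numbers, so a field-based pattern is the safer choice there.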
