Break string after specific word and put remains on new line (Regex)

Question

Suppose that I have a text field in which a user can submit code snippets. I want to detect when a specific word occurs in the string and then do something with the words/characters that come after that word.

Let's say we have a string and that after the word pyjamas I want to start the rest of the code on a new line without an indent. (Very similar to how code beautifiers work.) The output will be rendered inside pre, so I don't want any <br> tags or other HTML tags.

There are some catches though.

Everything following a word (pyjamas) has to start on a new line at the same "level" (the same number of tab indents) as the line before.
Commas should always start on a new line, outdented by one tab.
When there is another character, let's say an exclamation mark !, the code following it has to start on a new line, indented by one more tab.

Example:

Input:

Bananas! Apples and pears walk down pyjamas the street! and they say pyjamas hi to eachother, pyjamas But then! some one else comes pyjamas along pyjamas Who is he?, pyjamas I don't know who! he is pyjamas whatever,,

Output:

Bananas!
    Apples and pears walk down pyjamas
    the street!
        and they say pyjamas
        hi to eachother
    , pyjamas
    But then!
        some one else comes pyjamas
        along pyjamas
        Who is he?
    , pyjamas
    I don't know who!
        he is pyjamas
        whatever
    ,
,

I am working with jQuery, so you can use it if you want.

Here is a fiddle with the code above, so you can test it out. My result thus far is not great at all. (Type something in the textarea, the output will change.) As I currently know very little about regex, I could use some help.

What I have so far:

var a = $("textarea").val(),
    b = a.split('!').join("!\n  "),
    c = b.split('pyjamas').join("pyjamas \n");

$("textarea").keyup(function() {
    $("#output>pre").html(c);
});

I would find an open source version and modify it to do what you want. There can be thousands of rules associated with formatting code. — Adam Zuckerman, Mar 07 '14 at 19:54
@AdamZuckerman Don't see why that's necessary in this case. There are only a few restrictions and not many options. — Bram Vanroy, Mar 07 '14 at 19:56
Use a RegEx to make the breaks in a first pass, then come back with a loop to add the indentation. It may be possible to do the indentation with the RegEx, but I don't know how you would accomplish that. — Adam Zuckerman, Mar 07 '14 at 20:00
I think a recursive function would be useful for tracking your current indent level. I'm still thinking about what that would look like though... — Drewness, Mar 07 '14 at 20:33

Martin Ender · Accepted Answer · 2014-03-09T23:32:35.277

Here is a simple approach that doesn't require recursive functions and could even be done without regular expressions (but I find them convenient here).

function indent(str)
{
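    // Returns a string of n tabs.
    var tabs = function(n) { return new Array(n+1).join('\t'); };

    // match() returns null when nothing matches (e.g. for an empty string).
    var tokens = str.match(/!|,|pyjamas|(?:(?!pyjamas)[^!,])+/g) || [];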
    var depth = 0;
    var result = '';
    for (var i = 0; i < tokens.length; ++i)
    {
        var token = tokens[i];
        switch(token)
        {
        case '!':
            ++depth;
            result += token + '\n' + tabs(depth);
            break;
        case ',':
            --depth;
            result += '\n' + tabs(depth) + token;
            break;
        case 'pyjamas':
            result += token + '\n' + tabs(depth);
            break;
        default:
            result += token;
            break;
        }
    }
    return result;
}

First, we define a function that returns a string of n tabs (for convenience).

Then we split up the process into two steps. First we tokenise the string - that is we split it into !, ,, pyjamas and anything else. (There's an explanation of the regex at the end, but you could do the tokenisation some other way as well.) Then we simply walk the tokens one by one keeping the current indentation level in depth.

If it's an ! we increment the depth, print the !, a line break and the tabs.
If it's a , we decrement the depth, print a line break, the tabs and then the ,.
If it's pyjamas, we simply print that and a line break and the tabs.
If it's anything else we just print that token.

That's it. You might want to add a sanity check that depth doesn't go negative (i.e. when you have more , than !). Currently a depth of -1 is simply rendered without any tabs, but you'd need to write extra ! after that to get the depth back up to 1, and a depth of -2 or lower makes new Array(n+1) throw a RangeError (Invalid array length). This is quite easy to deal with, but I don't know what your assumptions or requirements about that are.
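
One way to add that sanity check, as a minimal sketch: clamp the depth at 0 wherever it is decremented, and guard the tabs helper with Math.max as well. The name indentClamped is made up for this illustration, and clamping is only one possible policy (it silently ignores surplus commas); which behaviour is right depends on your requirements.

```javascript
// Sketch only: indentClamped is a hypothetical variant of indent() above.
function indentClamped(str) {
    // Math.max keeps new Array() from ever receiving a negative length.
    var tabs = function (n) { return new Array(Math.max(0, n) + 1).join('\t'); };

    // Same tokeniser as above; "|| []" covers input without any match.
    var tokens = str.match(/!|,|pyjamas|(?:(?!pyjamas)[^!,])+/g) || [];
    var depth = 0;
    var result = '';
    for (var i = 0; i < tokens.length; ++i) {
        var token = tokens[i];
        switch (token) {
        case '!':
            ++depth;
            result += token + '\n' + tabs(depth);
            break;
        case ',':
            // Assumed policy: surplus commas stay at depth 0.
            depth = Math.max(0, depth - 1);
            result += '\n' + tabs(depth) + token;
            break;
        case 'pyjamas':
            result += token + '\n' + tabs(depth);
            break;
        default:
            result += token;
        }
    }
    return result;
}

// Two surplus commas leave the depth at 0, so a single ! brings it to 1.
console.log(JSON.stringify(indentClamped('a,,b!c')));
```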

It also doesn't take care of additional spaces after line breaks yet (see the edit at the end).

Working demo.

Now for the regex:

/
  !               # Match a literal !
|                 # OR
  ,               # Match a literal ,
|                 # OR
  pyjamas         # Match pyjamas
|                 # OR
  (?:             # open a non-capturing group
    (?!pyjamas)   # make sure that the next character is not the 'p' of 'pyjamas'
    [^!,]         # match a non-!, non-, character
  )+              # end of group, repeat once or more (as often as possible)
/g

The g modifier is there to find all matches (as opposed to just the first one). ECMAScript 6 will come with a y modifier, which will make tokenisation even easier - but annoyingly this y modifier is ECMAScript's own invention, whereas every other flavour that provides this feature uses a \G anchor within the pattern.
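
The y (sticky) modifier did ship with ECMAScript 2015. As a rough sketch of how it could be used for the tokenisation step (the function name tokenise is made up here): each exec call has to match exactly at lastIndex, so unlike with g, a character that no alternative matches would stop the loop instead of being silently skipped. With this particular pattern every character belongs to some token, so both modifiers should yield the same token list.

```javascript
// Sketch: tokenise with the sticky flag (requires an ES2015+ engine).
function tokenise(str) {
    var re = /!|,|pyjamas|(?:(?!pyjamas)[^!,])+/y;
    var tokens = [];
    var m;
    // A sticky regex only matches at re.lastIndex; exec() advances
    // lastIndex past each match and returns null when nothing matches there.
    while ((m = re.exec(str)) !== null) {
        tokens.push(m[0]);
    }
    return tokens;
}

console.log(tokenise('a pyjamas b!c,'));
// [ 'a ', 'pyjamas', ' b', '!', 'c', ',' ]
```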

If some of the more advanced concepts in the regex are not familiar to you, I refer you to this great tutorial:

EDIT:

Here is an updated version that fixes the above caveat I mentioned regarding spaces after line breaks. At the end of the processing we simply remove all spaces after tabs with:

result = result.replace(/^(\t*)[ ]+/gm, '$1');

The regex matches the beginning of a line and then captures zero or more tabs, and then as many spaces as possible. The square brackets around the space are not necessary but improve readability. The modifier g is again to find all such matches and m makes ^ match at the beginning of a line (as opposed to just the beginning of the string). In the replacement string $1 refers to what we captured in the parentheses - i.e. all those tabs. So we write back the tabs but swallow the spaces.
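
To see that replacement in isolation, here is a small check on a hand-written string (the sample text is made up for illustration). Only spaces that directly follow the leading tabs of a line are removed; spaces inside a line are left alone.

```javascript
// Hypothetical sample: two indented lines that each start with a stray space.
var sample = 'Bananas!\n\t Apples and pears pyjamas\n\t the street';
var cleaned = sample.replace(/^(\t*)[ ]+/gm, '$1');

console.log(JSON.stringify(cleaned));
// "Bananas!\n\tApples and pears pyjamas\n\tthe street"
```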

Working demo.

edited Mar 09 '14 at 23:32

answered Mar 09 '14 at 23:21

Martin Ender


Woah, this is great! I am still in the process of learning some decent regex, and this is some great stuff. If you have any good RegEx resources to study it, please let me know. Thanks again. +1 and bounty (as soon as I can). EDIT: could you explain the first `tabs` function? – Bram Vanroy Mar 10 '14 at 08:51
@BramVanroy, the tutorial from which I linked the individual regex concepts is about the best online resource you'll find on regular expressions. If you really want to learn regex I recommend reading it front to back. I'll post an explanation of the tabs later when I'm not on my phone, but try googling "JavaScript repeat string". – Martin Ender Mar 10 '14 at 09:11
@BramVanroy I'd also recommend waiting the full seven days before awarding the bounty. Someone might have a better answer for you than I do, and now that you've "given up" 150 rep anyway, you might as well try to make them count ;). – Martin Ender Mar 10 '14 at 09:25
@BramVanroy [this](http://stackoverflow.com/a/202627/1633117) is the article I was hoping you'd find with that search. Basically, you create an array of `n+1` empty elements. And then you join them with `\t`, yielding `n` tabs without anything in between. For instance, if you have `n` as `3`, you'd do `new Array(4)` giving you `[undefined, undefined, undefined, undefined]`, and when you call `join` on it that gives you `undefined + '\t' + undefined + '\t' + undefined + '\t' + undefined`, where the `undefined` are coerced to empty strings. Hence, you get `'\t\t\t'`. – Martin Ender Mar 10 '14 at 11:11
@m.buettner I am making some modifications to this, but I can't figure out why that last non-capturing group is necessary. (Even though when I remove it, the function doesn't work as expected.) – Bram Vanroy Apr 21 '14 at 16:01
@BramVanroy it's there to match everything that is not a special token. If you don't include that you won't have anything but those tokens in your result. I'm ruling out the single-character tokens with a negated character class and the multi-character one with a negative lookahead (so that I can match a `p` that is not part of `pyjamas`). You should be able to remove the non-capturing parentheses and the `+` though - that's just an optimisation to get as much non-token text as possible into a single match instead of creating one match for every individual character. – Martin Ender Apr 21 '14 at 16:22
@m.buettner Ah, I see - thanks. Something strange, though: I am trying to add `)` to behave similarly to `]`. So what I did was this: `str.match(/\[|\]|\)|and |(?:(?!and )[^\[\]\)])+/g);` and added a fall-through in the switch. But console returns: `Uncaught RangeError: Invalid array length`. I'm sorry to keep bothering you with this. – Bram Vanroy Apr 21 '14 at 16:38
@BramVanroy what line is that thrown in? – Martin Ender Apr 21 '14 at 17:03
You can see it happen [here](http://jsfiddle.net/nnpRj/9/) (changed the characters that get matched). It happens on the line where `tabs` is declared. But when you remove `\)` from `tokens` it does work... – Bram Vanroy Apr 21 '14 at 17:35
@BramVanroy your indentation is not balanced, because you outdent on `)` but you don't indent `(`, so you get a negative depth which doesn't work in the `tabs` function. Either add `(` for indentation, or (what you should actually do in any case), add some logic to handle unbalanced indentation - that is, either make sure that `depth` never goes below `0` or use something like `Math.max(0,depth)` in the tabs function. The behaviour is slightly different, but I don't know how you want to handle such cases. – Martin Ender Apr 21 '14 at 17:49
@m.buettner I've been at it for a while now, because I wanted to figure it out myself, but unfortunately I cannot. [Here](http://jsfiddle.net/nnpRj/10/) is the test case. I also want to match a closing parenthesis `)`, with the same behaviour as `]`. And also `< number(` (and other possibilities with `>`, `<`, `<=` or `>=`), with the same behaviour as `[`. Can you help me out? – Bram Vanroy Apr 27 '14 at 15:30
@BramVanroy http://jsfiddle.net/LJ5mh/ ... three things: a) you had `>=` twice in the switch block, so I changed one to `>`. b) you do need to include a bare `(`, otherwise your input is still not balanced (because you are counting all closing parentheses but not the opening ones that aren't after `number`). c) because `[<>]=? number` is a multi-character token, it needs to go into the lookahead as well. – Martin Ender Apr 27 '14 at 18:22
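
The string-repeat trick discussed in these comments can be checked on its own. As a small sketch: the first helper is the one from the answer, while tabsRepeat is a hypothetical alternative built on String.prototype.repeat (ES2015). That method also throws a RangeError for a negative count, so it gets the same Math.max guard suggested above.

```javascript
// From the answer: n+1 empty slots joined by '\t' give n tabs.
var tabs = function (n) { return new Array(n + 1).join('\t'); };

// Hypothetical alternative; Math.max guards against a negative depth.
var tabsRepeat = function (n) { return '\t'.repeat(Math.max(0, n)); };

console.log(tabs(3) === '\t\t\t');       // true
console.log(tabsRepeat(3) === tabs(3));  // true
console.log(tabsRepeat(-2) === '');      // true, whereas tabs(-2) throws a RangeError
```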

Casimir et Hippolyte · Answer 2 · 2014-03-12T16:25:20.720

Not so different from m.buettner's solution; you can do it using the replace method:

var lvl = 1;
var res = str.replace(/(!)\s*|\s*(,)|(\bpyjamas)\s+/g, function (m, g1, g2, g3) {
    if (g1) return g1 + "\n" + Array(++lvl).join("\t");
    if (g2) return "\n" + Array((lvl>1)?--lvl:lvl).join("\t") + g2;
    return g3 + "\n" + Array(lvl).join("\t"); });

console.log(res);

The idea is to use three different capturing groups and to test them in the callback function. Depending on which group matched, the level is incremented, decremented or left unchanged (the ground is level 1). When the level is 1 and a comma is found, the level stays at 1. I added \s* and \s+ to trim spaces before commas and after ! and pyjamas. If you don't want this, you can remove them.

With your code:

$("#output>pre").html($("textarea").val());

$("textarea").keyup(function() {
    $("#output>pre").html(function() {
        var lvl = 1;
        return $("textarea").val().replace(/(!)\s*|\s*(,)|(\bpyjamas)\s+/g,
            function (m, g1, g2, g3) {
                if (g1) return g1 + "\n" + Array(++lvl).join("\t");
                if (g2) return "\n" + Array((lvl>1)?--lvl:lvl).join("\t") + g2;
                return g3 + "\n" + Array(lvl).join("\t"); });
    });
});

Note: it is probably cleaner to define a function that you can reuse later.
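
A minimal sketch of such a reusable function, wrapping the same replace call (the name breakAndIndent is made up for this example):

```javascript
// Hypothetical wrapper around the replace-based approach above.
function breakAndIndent(str) {
    var lvl = 1; // ground level; Array(lvl).join("\t") yields lvl - 1 tabs
    return str.replace(/(!)\s*|\s*(,)|(\bpyjamas)\s+/g, function (m, g1, g2, g3) {
        if (g1) return g1 + "\n" + Array(++lvl).join("\t");
        if (g2) return "\n" + Array(lvl > 1 ? --lvl : lvl).join("\t") + g2;
        return g3 + "\n" + Array(lvl).join("\t");
    });
}

console.log(JSON.stringify(breakAndIndent('Bananas! Apples pyjamas the street, done')));
// "Bananas!\n\tApples pyjamas\n\tthe street\n, done"
```

Inside the keyup handler, the callback passed to html() could then simply return breakAndIndent($("textarea").val()).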

I like this answer as well. Thanks for the explanation, this way I actually understand what I'm doing. I accepted buettner's answer though, because it seems a bit more straightforward to me. Thanks anyway! — Bram Vanroy, Mar 16 '14 at 11:36
