8

I am trying to extract a code block from a Markdown document using PCRE RegEx. For the uninitiated, a code block in Markdown is defined thus:

To produce a code block in Markdown, simply indent every line of the block by at least 4 spaces or 1 tab. A code block continues until it reaches a line that is not indented (or the end of the article).

So, given this text:

This is a code block:

    I need capturing along with
    this line

This is a code fence below (to be ignored):

``` json
This must have three backticks
flanking it
```

I love `inline code` too but don't capture

and one more short code block:

    Capture me

So far I have this RegEx:

(?:[ ]{4,}|\t{1,})(.+)

But it simply captures each line prefixed with at least four spaces or one tab. It doesn't capture the whole block.

What I need help with is how to set the condition to capture everything after 4 spaces or 1 tab until you either get to a line that is not indented or the end of the text.

Here's an online work in progress:

https://www.regex101.com/r/yMQCIG/5

Garry Pettet
  • What options are you setting on the regex? If you want to analyse text as a block rather than line by line, then try `/regex/m`, where `m` means switching on the "multiline" option. – halfer Dec 27 '16 at 20:44
  • I've tried toggling the `m` switch on regex101.com but it doesn't help the RegEx I currently have. Updated question to include a link to the online RegEx I have. – Garry Pettet Dec 27 '16 at 20:47
  • Enabling the single-line switch (`s`) on regex101.com actually causes the RegEx in my question to match all of the example text, which isn't right either... – Garry Pettet Dec 27 '16 at 20:55
  • *Capture me* was 3 space indented, see https://www.regex101.com/r/yMQCIG/3 with 4 spaces. – Wiktor Stribiżew Dec 27 '16 at 21:03
  • @WiktorStribiżew. Thanks. I've updated the regex101 sample text and the question to reflect this. – Garry Pettet Dec 27 '16 at 21:05
  • A regexp question on Stack Overflow with a prior attempt is the eighth wonder of the world! Good work. – halfer Dec 27 '16 at 22:27
  • [Here is a small update](https://www.regex101.com/r/VmQ4lQ/1) to the answer's regex, where [the `^` does not look to be in the right place](https://www.regex101.com/r/F8vgNR/1). – bobble bubble Dec 27 '16 at 23:01
  • Thanks, @bobblebubble, for spotting that. I updated my answer as well. – trincot Dec 27 '16 at 23:10

3 Answers

9

You should use begin/end-of-line anchors (^ and $ in combination with the m modifier, which makes them match at every line boundary rather than only at the ends of the string). Also, your test text had only 3 leading spaces in the final block:

^((?:(?:[ ]{4}|\t).*(\R|$))+)

With \R and the repetition you match one whole block with each single match, instead of a line per match.

See demo on regex101
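
As a rough illustration outside regex101, the same idea can be tried in Python. This is only a sketch under assumptions: Python's `re` module has no `\R`, so `(?:\r?\n|$)` stands in for it, and the names `INDENTED_BLOCK`, `find_indented_blocks` and `SAMPLE` are made up for this example.

```python
import re

# Adaptation of the pattern above for Python's `re` module.
# Assumption: `re` has no \R, so (?:\r?\n|$) is used as a stand-in.
INDENTED_BLOCK = re.compile(
    r"^((?:(?:[ ]{4}|\t).*(?:\r?\n|$))+)",
    re.MULTILINE,  # the "m" modifier: ^ and $ match at line boundaries
)


def find_indented_blocks(text):
    """Return each run of consecutive indented lines as a single string."""
    return [m.group(1) for m in INDENTED_BLOCK.finditer(text)]


# The example text from the question (final block indented by 4 spaces).
SAMPLE = (
    "This is a code block:\n"
    "\n"
    "    I need capturing along with\n"
    "    this line\n"
    "\n"
    "This is a code fence below (to be ignored):\n"
    "\n"
    "``` json\n"
    "This must have three backticks\n"
    "flanking it\n"
    "```\n"
    "\n"
    "I love `inline code` too but don't capture\n"
    "\n"
    "and one more short code block:\n"
    "\n"
    "    Capture me\n"
)

for block in find_indented_blocks(SAMPLE):
    print(repr(block))
```

Each match is a whole block, including its indentation and trailing newline; the fence and the inline code are skipped because their lines do not start with 4 spaces or a tab.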

Disclaimer: The rules of markdown are more complicated than the presented example text shows. For instance, when (nested) lists have code blocks in them, these need to be prefixed with 8, 12 or more spaces. Regular expressions are not suitable to identify such code blocks, or other code blocks embedded in markdown notation that uses the wider range of format combinations.

trincot
  • What if the indented text is a paragraph nested in a list item? This doesn't account for that. – Waylan Dec 27 '16 at 23:02
  • @Waylan, indeed, it was not intended to account for that. The rules for dealing with lists in combination with indented blocks are more complicated, as the number of prefixed spaces would then need to be 8, 12, or whatever corresponds with the list indentation level. I doubt regular expressions would be the right tool for such parsing. – trincot Dec 27 '16 at 23:05
  • @trincot, I agree, which was my point. While your solution works in the simple case, it is hardly a complete solution. If the OP wants a complete solution, then REGEX is not the answer. – Waylan Dec 27 '16 at 23:15
  • We agree on that. I added a disclaimer to the answer. – trincot Dec 27 '16 at 23:33
1

There are 3 ways to mark up code: 1) start-of-line indentation, 2) 3 or more backticks enclosing a multiline block of code, or 3) inline code.
1 and 3 are part of John Gruber's original Markdown specification.
Here is a way to match each of them. You need to perform 3 separate regexp tests; the patterns are written in extended (x) mode, so whitespace and # comments inside them are ignored:

  1. Using indentation

     (?:\n{2,}|\A)                   # Starting at beginning of string or with 2 new lines
     (?<code_all>
         (?:
             (?<code_prefix>         # Lines must start with a tab or a tab-width of spaces
                 [ ]{4}
                 |
                 \t
             )
             (?<code_content>.*\n+)  # with some content, possibly nothing followed by a new line
         )+
     )
     (?<code_after>
         (?=^[ ]{0,4}\S)             # Lookahead for non-space at line-start
         |
         \Z                          # or end of doc
     )
    

2a) Using code block with backticks (vanilla markdown)

    (?:\n+|\A)?                         # Necessarily at the beginning of a new line or start of string
    (?<code_all>
        (?<code_start>
            [ ]{0,3}                    # Possibly up to 3 leading spaces
            \`{3,}                      # 3 code marks (backticks) or more
        )
        \n+
        (?<code_content>.*?)            # enclosed content
        \n+
        (?<!`)
        \g{code_start}                  # balanced closing block marks
        (?!`)
        [ \t]*                          # possibly followed by some space
        \n
    )
    (?<code_trailing_new_line>\n|\Z)    # and a new line or end of string

2b) Using code block with backticks with some class specifier (extended markdown)

    (?:\n+|\A)?                 # Necessarily at the beginning of a new line
    (?<code_all>
        (?<code_start>
            [ ]{0,3}            # Possibly up to 3 leading spaces
            \`{3,}              # 3 code marks (backticks) or more
        )
        [ \t]*                  # Possibly some spaces or tab
        (?:
            (?:
                (?<code_class>[\w\-\.]+)    # or a code class like html, ruby, perl
                (?:
                    [ \t]*
                    \{(?<code_def>[^\}]+)\} # a definition block like {.class#id}
                )?                          # Possibly followed by class and id definition in curly braces
            )
            |
            (?:
                [ \t]*
                \{(?<code_def>[^\}]+)\} # a definition block like {.class#id}
            )                           # Followed by class and id definition in curly braces
        )
        \n+
        (?<code_content>.*?)    # enclosed content
        \n+
        (?<!`)
        \g{code_start}          # balanced closing block marks
        (?!`)
    )
    (?:\n|\Z)                # and a new line or end of string
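
For readers who want to try the fenced-block idea in Python, here is a simplified sketch rather than a port of the patterns above. Assumptions: Python's `re` spells named groups `(?P<name>...)` and backreferences `(?P=name)`, the `{.class#id}` definition part is left out, and the names `FENCED_BLOCK` and `text` are invented for this example.

```python
import re

# Simplified Python adaptation of the fenced-block patterns (2a/2b).
# Assumptions: (?P<name>...) and (?P=name) replace (?<name>...) and \g{name};
# the {.class#id} definition block is omitted to keep the sketch short.
FENCED_BLOCK = re.compile(
    r"""
    ^(?P<code_start>[ ]{0,3}`{3,})   # up to 3 leading spaces, then 3+ backticks
    [ \t]*
    (?P<code_class>[\w.-]+)?         # optional code class such as json
    [ \t]*\n
    (?P<code_content>.*?)            # enclosed content (DOTALL lets . cross lines)
    \n
    (?P=code_start)                  # the same fence closes the block
    (?!`)                            # and is not followed by another backtick
    [ \t]*$
    """,
    re.VERBOSE | re.MULTILINE | re.DOTALL,
)

text = (
    "This is a code fence below:\n"
    "\n"
    "``` json\n"
    "This must have three backticks\n"
    "flanking it\n"
    "```\n"
    "\n"
    "Some trailing text\n"
)

match = FENCED_BLOCK.search(text)
if match:
    print(match.group("code_class"))
    print(match.group("code_content"))
```

Like the patterns above, this sketch expects at least one line of content between the fences and requires the closing fence to repeat the opening one exactly, leading spaces included.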

  3. Using 1 or more backticks for inline code

     (?<!\\)                     # Ensuring this is not escaped
     (?<code_all>
         (?<code_start>\`{1,})   # One or more backtick(s)
         (?<code_content>.+?)    # Code content in between backticks
         (?<!`)                  # Not preceded by a backtick
         \g{code_start}          # Balanced closing backtick(s)
         (?!`)                   # And not followed by a backtick
     )
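
The inline-code pattern can be tried in Python along the following lines. Again this is a sketch under assumptions: named groups and the backreference are rewritten in Python's `(?P<name>...)` / `(?P=name)` syntax, and `INLINE_CODE` and `find_inline_code` are names made up for this example.

```python
import re

# Python adaptation of the inline-code pattern.
# Assumption: (?P<name>...) and (?P=name) replace (?<name>...) and \g{name}.
INLINE_CODE = re.compile(
    r"""
    (?<!\\)                      # the opening backtick must not be escaped
    (?P<code_all>
        (?P<code_start>`+)       # one or more backticks
        (?P<code_content>.+?)    # code content between the backticks
        (?<!`)                   # content does not end with a backtick
        (?P=code_start)          # balanced closing backtick(s)
        (?!`)                    # and not followed by a further backtick
    )
    """,
    re.VERBOSE,
)


def find_inline_code(text):
    """Return the content of every inline code span found in text."""
    return [m.group("code_content") for m in INLINE_CODE.finditer(text)]


print(find_inline_code("I love `inline code` too but don't capture"))
print(find_inline_code("Use ``a ` b`` here"))
```

Because the closing run must repeat the opening run, a span opened with two backticks can contain a single backtick in its content.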
    
Jacques
  • The pattern for example 3 is wrong - it's the same as for pattern 1. A copy-paste error? – Senipah Sep 04 '20 at 11:05
  • Yes, copy/paste error. It should have been: (?<!\\) # Ensuring this is not escaped (?<code_all> (?<code_start>\`{1,}) # One or more backtick(s) (?<code_content>.+?) # Code content in between backticks (?<!`) # Not preceded by a backtick \g{code_start} # Balanced closing backtick(s) (?!`) # And not followed by a backtick ) See here – Jacques Sep 05 '20 at 12:16
0

Try this?

[a-z]*\n[\s\S]*?\n

It will extract from your example

This must have three backticks
flanking it
tzatalin