I am trying to read a .java file into a Perl variable, and I want to match a function, for instance:

public String example(){
    return "hello";
}

What would the regex pattern for this look like?

Current Attempt:

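use strict;
use warnings;

# Three-argument open with a lexical filehandle; report the OS error on failure.
open(my $fh, '<', 'example.java') or die "can't open file: $!";
my @lines = <$fh>;
close($fh);

# Each $line holds a single line of the file, so this pattern cannot
# span a function body that runs over several lines, even with /s.
foreach my $line (@lines) {
    if ($line =~ /String example(.*)}/s) {
        print $line;
    }
}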
Conor Thompson
  • Not very simple to do. The scope delimiters are balanced text, and they can be hidden inside quotes and comments. That's at least 3 things you have to consider, which means writing a pretty big and complex regex. I would charge money to do this one. And, of course, the function itself can be hidden in comments or quotes. –  Sep 05 '17 at 15:10
  • Why? As sln said, this is going to be complex. But what is the purpose? And how many matches of `String example(` are there across all the directories? If there are fewer than 20 matches, it will be quicker by hand. – Drag and Drop Sep 05 '17 at 15:16
  • This is just to replace a function in a file, without having to manually do it, for reasons. – Conor Thompson Sep 05 '17 at 16:01

2 Answers

Adapted from this answer

Regex:

^\s*([\w\s]+\(.*\)\s*(\{((?>"(?:[^"\\]*+|\\.)*"|'(?:[^'\\]*+|\\.)*'|//.*$|/\*[\s\S]*?\*/|[^{}<'"/]++|[^{}]++|(?2))*)}))

Breakdown:

^ \s*
(                             # (1 start)
  [\w\s]+ \( .* \) \s*                 # How it matches a function definition
  (                             # (2 start)
      \{                                     # Opening curly bracket
      (                             # (3 start)
           (?>                               # Atomic grouping (for its non-capturing purpose only)
                "(?: [^"\\]*+ | \\ . )*"     # Double quoted strings
             |  '(?: [^'\\]*+ | \\ . )*'     # Single quoted strings
             |  // .* $                      # A line comment starting with //
             |  /\* [\s\S]*? \*/             # A multi-line comment block /*...*/
             |  [^{}<'"/]++                  # Ordinary characters; stops at a brace, quote, slash or < so the other alternatives get a try (possessive)
             |  [^{}]++                      # Default matching behavior (possessive)
             |  (?2)                         # Recurse into the 2nd capturing group
           )*                                # Zero to many times of atomic group
      )                             # (3 end)
      }                                      # Closing curly bracket
  )                             # (2 end)
)                             # (1 end)
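
To try a pattern of this shape from Perl, the file has to sit in a single scalar rather than an array of lines, and the pattern needs /m so that ^ and $ work per line. The sketch below is one way to do that, under a few assumptions of its own: the heredoc stands in for the contents of example.java, $name and $replacement are hypothetical, the pattern is written with /x and narrowed to one method name, and it only knows about quoted literals, // and /* */ comments, and nested braces. It is not a Java parser.

```perl
use strict;
use warnings;

# Stand-in for the contents of example.java. With a real file, slurp it
# into one scalar instead of an array of lines, for example:
#   open(my $fh, '<', 'example.java') or die "can't open file: $!";
#   my $source = do { local $/; <$fh> };
my $source = <<'JAVA';
public class Example {
    public String example(){
        if (true) { /* a } inside a comment */ }
        return "hello }";
    }

    public int other() {
        return 1;
    }
}
JAVA

# Hypothetical target name; \Q...\E quotes it for use inside the pattern.
my $name = 'example';

my $function_re = qr~
    ^ \s*
    (                                      # (1) signature plus body
        [\w\s]+ \b \Q$name\E \s* \( .* \) \s*
        (                                  # (2) one balanced brace block
            \{
            (?>
                "(?: [^"\\]*+ | \\. )*"    # double-quoted string
              | '(?: [^'\\]*+ | \\. )*'    # single-quoted literal
              | // .* $                    # line comment
              | /\* [\s\S]*? \*/           # block comment
              | [^{}<'"/]++                # run of ordinary characters
              | [^{}]++                    # fallback for a lone < or /
              | (?2)                       # nested block: recurse into (2)
            )*
            \}
        )
    )
~mx;

# Hypothetical replacement text for the matched method.
my $replacement = join "\n",
    'public String example(){',
    '        return "goodbye";',
    '    }';

my $matched;
if ($source =~ $function_re) {
    $matched = $1;
    # @- and @+ hold the start and end offsets of each capture group, so
    # only group 1 is swapped out; the indentation before it is untouched.
    substr($source, $-[1], $+[1] - $-[1], $replacement);
}

print defined $matched ? $source : "no match\n";
```

Going through the offsets in @- and @+ rather than s/// keeps whatever ^\s* consumed in front of the method, so the surrounding layout of the file is left alone.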
revo

Revo's regex is the Right Way To Do it (as much as a regex ever can be!).

But sometimes you just need something quick, to manipulate a file you have control over. I find, when using regexes, that it's often important to define "Good enough".

So, it may be "good enough" to assume the indentation is correct. In that case, you can just detect the start of the fn, then read until you find the next closing curly with the same indentation:

(               # Capture \1.
    ^([\t ]+)   # Match and capture leading whitespace to \2.
    (?:\w+\s+)* # Modifiers and return type, if any.
    \w+\s*\(    # Name and opening round brace: is a function.
    .*?         # Need Dot-matches-newline, to match fn body.
    \n\2}       # Curly brace is as indented as start of fn.
)               # End capture of \1.

Should work on clean code that you wrote yourself, code you can pass through an auto-formatter first, etc.
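
A minimal Perl sketch of the indentation approach, assuming the file has already been auto-formatted. The heredoc stands in for example.java, and the variable names are made up for the example. The pattern in the sketch captures the full run of leading whitespace and allows several words (modifiers and a return type) before the name, so that a Java signature like public String example() matches. It needs /m for the line anchor, /s so the dot crosses newlines, and /x for the layout.

```perl
use strict;
use warnings;

# Stand-in for an auto-formatted example.java.
my $source = <<'JAVA';
public class Example {
    public String example(){
        if (true) {
            return "hello";
        }
        return "bye";
    }

    public int other() {
        return 1;
    }
}
JAVA

my $indent_re = qr~
    (                   # (1) the whole function, indentation included
        ^([\t ]+)       # (2) leading whitespace of the signature line
        (?:\w+\s+)*     # modifiers and return type, if any
        \w+\s*\(        # name and opening parenthesis
        .*?             # body; /s lets . cross newlines
        \n\2\}          # closing brace at the same indentation
    )
~msx;

# Collect every function the pattern finds, in file order.
my @functions;
while ($source =~ /$indent_re/g) {
    push @functions, $1;
}

print scalar(@functions), " functions found\n";
print "$_\n---\n" for @functions;
```

The closing brace of the inner if block is indented further than the signature line, so it does not end the match early. That is the whole trick, and also why badly indented code defeats it.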

Will work with K&R, Horstmann and Allman indent styles.

Will fail with one-line and in-line functions, and indent styles like GNU, Whitesmiths, Ratliff and Pico - things which Revo's answer handles with no problems at all.

Also fails on lambdas, nested functions, and functions which use generics, but even Revo's doesn't recognize those, and they're not that common.

And neither of our regexes captures the comments preceding a function, which is pretty sinful.
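
One possible way to pick up a preceding comment with the indentation approach is to make a Javadoc block an optional prefix of the match. This is only a sketch under assumptions: the heredoc and variable names are invented for the example, only a /** ... */ block sitting directly above the signature is recognised, and anything between the two (an annotation line, a blank line) means the comment is not captured.

```perl
use strict;
use warnings;

# Stand-in for an auto-formatted example.java with one documented method.
my $source = <<'JAVA';
public class Example {
    /**
     * Says hello.
     */
    public String example(){
        return "hello";
    }

    public int other() {
        return 1;
    }
}
JAVA

my $with_doc_re = qr~
    (                               # (1) optional Javadoc plus the function
        (?:
            ^[\t ]*/\*\*            # opening of a Javadoc comment
            (?:(?!\*/).)*           # comment text, stopping at the first */
            \*/[\t ]*\n             # end of the comment and of its line
        )?
        ^([\t ]+)                   # (2) leading whitespace of the signature
        (?:\w+\s+)*                 # modifiers and return type, if any
        \w+\s*\(                    # name and opening parenthesis
        .*?                         # body; /s lets . cross newlines
        \n\2\}                      # closing brace at the same indentation
    )
~msx;

# Collect every function, each with its Javadoc block when it has one.
my @functions;
while ($source =~ /$with_doc_re/g) {
    push @functions, $1;
}

print "$_\n---\n" for @functions;
```

The comment text is matched with a negative lookahead rather than a lazy dot so that a failed attempt cannot stretch the comment across code to some later */.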

Dewi Morgan
  • This one seems a lot neater! – Conor Thompson Sep 07 '17 at 08:01
  • @ConorThompson Neater perhaps, but only because I'm making assumptions about the data we're working with. So my suggestion is far less rugged than Revo's code, which makes exactly zero assumptions about formatting. Even then, neither of them is a Java parser, so both break in some cases; Revo's just breaks in far *fewer* cases. – Dewi Morgan Sep 07 '17 at 15:34