I want to extract the return value from a given text that represents a method.

For example:

g: x and: y
Transcript show: x; show: y.
^x+y.

To solve this, I used the regular expression:

\^\s*(\w+.*).

When I run this on regex testing websites such as https://regex101.com/ it seems to work and do what I want.

But when I run the following program, Squeak returns nil (it can't find a match). I suspect that is because I am using the character ^.

However, I escaped that character, so I have no idea why it fails.

The code I used to test it:

|aString regexObj |
aString := 'g: x and: y
Transcript show: x; show: y.
^x+y.'.

regexObj := '\^\s*(\w+.*).' asRegex.
regexObj matches: aString.
returnedType:= (regexObj  subexpression:2). 
Transcript show: returnedType.

Does anyone know why, and how to solve it?

Thanks.

Leandro Caniglia
  • Aviad, why did you try to access subexpression *2*? There is only one capturing group, so it should be `(regexObj subexpression:1)` – Wiktor Stribiżew May 16 '17 at 07:04
  • Depending on what you are trying to do, it may be simpler to generate the AST and access the return node. E.g. `(MyClass>>#myMethod) parseTree nodesDo:[ :node | ... ]` (untested). – Max Leske May 16 '17 at 07:38
  • @WiktorStribiżew From some side testing and from reading the documentation, you will see that subexpression 1 gives you the entire match, while subexpression 2 gives you the group in the parentheses. The documentation: [link](https://ci.inria.fr/pharo-contribution/job/UpdatedPharoByExample/lastSuccessfulBuild/artifact/book-result/Regex/Regex.html) – Aviad Shiber May 16 '17 at 07:42
  • @MaxLeske Can you please explain in more detail, and maybe give a reference on how to do it? Thanks! – Aviad Shiber May 16 '17 at 07:43
  • @AviadShiber: Good link, it is clearer now. But still, look at your code and examples in the docs. Yours should look like `regexObj := '\^\s*(\w.*)\.' asRegex.` => `regexObj search: aString` => `regexObj subexpression: 2` - NOTE: `search` is used to look for partial matches, while `matches` needs a whole string match. At least try replacing `matches:` with `search:` in your code. – Wiktor Stribiżew May 16 '17 at 07:48
  • It would also help if you could clarify the actual requirements: do you only need to extract a part of the line that starts with `^`, then has 0+ whitespaces, and then your required value starts with a word char after which you grab the rest of the line? Or do you need to parse arithmetic expressions (then, AST is the way to go)? – Wiktor Stribiżew May 16 '17 at 07:52
  • @WiktorStribiżew Replacing matches with search fixed it! Thanks!! :) Basically, my task is to check types at runtime against some given types that I already extracted from comments in the code. So after extraction I need to inject the code into the compile method in the Behavior class. – Aviad Shiber May 16 '17 at 08:13
  • Ok, I will post the answer to explain the difference between `matches` and `search`, but if you have a problem with the regex, let me know. – Wiktor Stribiżew May 16 '17 at 08:16

1 Answer

You need to replace the matches: call with search: in your code. See 139.6. Matching:

matches: aString — true if the whole argument string (aString) matches.

and

search: aString — Search the string for the first occurrence of a matching substring. Note that the first two methods only try matching from the very beginning of the string. Using the above example with a matcher for a+, this method would answer success given a string 'baaa', while the previous two would fail.

The "first two methods" are matches: (which requires a full string match) and matchesPrefix: (which anchors the match at the start of the input string). search: allows matching the pattern anywhere inside the string.
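To make the difference concrete, here is a minimal workspace sketch based on the a+ / 'baaa' example from the quoted documentation. It assumes the Regex package is loaded in your image (so that asRegex is available on strings); the variable name matcher is only illustrative.

```smalltalk
| matcher |
"A matcher for one or more 'a' characters, as in the documentation example."
matcher := 'a+' asRegex.

"matches: needs the whole string to match, so the leading 'b' makes it fail."
Transcript show: (matcher matches: 'baaa') printString; cr.

"matchesPrefix: only tries at the very beginning, so it fails here too."
Transcript show: (matcher matchesPrefix: 'baaa') printString; cr.

"search: scans for the first matching substring, so it succeeds."
Transcript show: (matcher search: 'baaa') printString; cr.
```

Going by the documentation quoted above, the first two lines should show false and the last one true. Your method source starts with g: x and: y rather than with ^, which is why matches: could not succeed on it.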

A note on your regex: the final . is not escaped and matches any non-line break char. You should escape it to match a literal dot:

'\^\s*(\w.*)\.'

See the regex demo.

Also, \s can match across lines. If you do not want that, replace \s with \h (a PCRE pattern that matches only horizontal whitespace). Watch out for the .* pattern: it will match any 0+ chars other than line break chars, as many as possible, so it will match up to the last . on a matching line.
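Putting both changes together, your test code could look roughly like the sketch below. It is only a sketch: it reuses your variable names, keeps subexpression: 2 for the capturing group as described in the documentation you linked, and adds a guard on the result of search: that was not in your original code.

```smalltalk
| aString regexObj returnedType |
aString := 'g: x and: y
Transcript show: x; show: y.
^x+y.'.

"The final dot is escaped so that it only matches a literal '.'"
regexObj := '\^\s*(\w.*)\.' asRegex.

"search: looks for the pattern anywhere in the string, unlike matches:"
(regexObj search: aString)
    ifTrue: [
        "Index 1 is the whole match, index 2 is the first capturing group."
        returnedType := regexObj subexpression: 2.
        Transcript show: returnedType; cr ]
    ifFalse: [ Transcript show: 'no return statement found'; cr ].
```

Checking the Boolean answered by search: before asking for a subexpression avoids passing nil to Transcript show: when the method has no explicit return. If more text follows the return statement, the greedy .* may run further than you intend; a negated character class such as [^.]* stops at the first dot instead.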

Wiktor Stribiżew
  • Thanks! I really missed it, and now it is much clearer to me :) – Aviad Shiber May 16 '17 at 08:31
  • I don't know why, but with search it catches more than needed, for example: `|aString regexObj returnedType | aString:='g: x and: y "@Private" Transcript show: x; show: y. ^ x+y . "@ArgsTypes: SmallInteger, SmallInteger. "@RetType: Boolean"'. regexObj := '\^\s*(.+)\.' asRegex. regexObj search: aString. returnedType:= (regexObj subexpression:2). Transcript show: returnedType.` Also, \h does not work in Squeak. – Aviad Shiber May 16 '17 at 09:18
  • The documentation says it uses the PCRE regex flavor. That is not right, then, if `\h` is not supported. As I noted, `.*` matches as many chars as possible. Maybe you need `.*?`? Could you please share a http://regex101.com fiddle showcasing the issue? – Wiktor Stribiżew May 16 '17 at 09:24
  • Fixed it by using the following regex: \^([^.]*)\. On regex101 the problem did not show up, only in Smalltalk. – Aviad Shiber May 16 '17 at 09:32
  • Yes, just make it non-greedy. The `[^.]*` matches 0+ chars other than a dot, so it will make it up to the *first* dot. – Wiktor Stribiżew May 16 '17 at 09:37
  • Just from interest: doesn't `.*?` work? Are lazy quantifiers supported? – Wiktor Stribiżew May 16 '17 at 09:40
  • No, they are not supported; I get the "invalid lookaround expression" error. – Aviad Shiber May 16 '17 at 09:56
  • Ok, I found that [*"the matcher passes H. Spencer's test suite"*](http://live.exept.de/doc/online/english/programming/goody_regex.html#IMPLEMENTATION). No lazy quantifier support mentioned. – Wiktor Stribiżew May 16 '17 at 10:06
  • Hi again Wiktor. When I use the following regex: https://regex101.com/r/oY82RO/2 I also get a match on the following ArgTypes line. I want to match only the part between "@RetTypes:(...).". Do you have any idea what I can do? Thanks. – Aviad Shiber May 23 '17 at 13:34
  • Sorry, regex101 does not work on a mobile. If you need to extract part of text between `@RetTypes(` and `)`, use `@RetTypes\(([^)]+)\)`. – Wiktor Stribiżew May 23 '17 at 14:49