
I have to tokenize a conditional expression given as a string:

Arithmetic operators are: +, -, *, /, %

Boolean operators are: &&, ||

Conditional operators are: ==, >=, >, <, <=, !=

An example expression is: (x+3>5*y)&&(z>=3 || k!=x)

What I want is to tokenize this string into operators and operands.

Because some operators share characters (">" and ">=", "=" and "!="), I have problems with tokenizing.

PS1: I do not want to do complex lexical analysis. I just want to parse it simply, with regular expressions if possible.

PS2: In other words, I am looking for a regular expression which, given a sample expression such as

(x+3>5*y)&&(z>=3 || k!=x) 

will produce output in which each token is separated by a single space, like:

( x + 3 > 5 * y ) && ( z >= 3 || k != x )
dav_i
Hippias Minor
  • "PS: I do not want to do complex lexical analysis. I just want to parse it simply, with regular expressions if possible." - yes, but that presumes that what you are parsing is "regular" (which has a definition, etc). When it comes to processing complex expressions like this (the parentheses in particular), personally I'd be using a simple tokenizer that tracks "what did we just read" (only yielding the token when it knows it has changed - so you don't yield `<` as a token until you've read the next character, so you know that it isn't `<=`), then a shunting yard algorithm to create an AST – Marc Gravell Jul 09 '13 at 08:07
  • I will use such an algorithm. Evaluation is no problem. But first I have to tokenize this correctly. I am looking for a simple tokenizer that can tell ">" and ">=" apart properly... – Hippias Minor Jul 09 '13 at 08:11
  • I've written several such tokenizers; they are indeed pretty simple - but none of them involve regex, for the reason that IMO this is not a problem that is ideally suited to regular expressions. – Marc Gravell Jul 09 '13 at 08:20
  • Well, any simple tokenizer example which parses this string and gives the tokens as an array would be great. But be careful: I have operators that share characters, like ">" and ">=", so I have to check the next character, which makes a simple parser "ugly"... – Hippias Minor Jul 09 '13 at 08:27
  • Here's some hacky Java code I just came up with which seems to solve your problem - `"(x+3>5*y)&&(z>=3 || k!=x)".replaceAll("==?|>=?|<=?|!=|&&|\\|\\||[-()+*/%]", " $0 ").replaceAll(" {2,}", " ").trim()`. I'm sure you can manage to convert that to C# if it is sufficient. – Bernhard Barker Jul 09 '13 at 08:30
  • Dukeling... It can be hacky, but it does not even work for the simple expression "3>=4". And be careful: I have both ">" and ">=" operators. – Hippias Minor Jul 09 '13 at 08:41
  • Do you need floating point literals? What about unary expressions (e.g. -5, -a, !myBool)? Boolean literals? – Bas Jul 09 '13 at 08:47
  • No just simple expression for now... – Hippias Minor Jul 09 '13 at 08:48
  • [It seems to work just fine here](https://ideone.com/d4HUKI). – Bernhard Barker Jul 09 '13 at 11:18
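
For reference, the Java one-liner from the comments above might be ported to C# roughly as follows. This is only a sketch: the class name `SpaceOutTokens` and the method name `Separate` are made up for illustration, and the pattern is copied from the comment unchanged. It relies on `Regex.Replace` and the `$0` substitution (the whole match) from `System.Text.RegularExpressions`.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SpaceOutTokens
{
    // Pattern taken from the Java comment. The optional '=' (e.g. ">=?")
    // lets ">=" be consumed as one match instead of ">" followed by "=".
    private const string OperatorPattern = @"==?|>=?|<=?|!=|&&|\|\||[-()+*/%]";

    // Hypothetical helper: returns the expression with tokens separated by one space.
    public static string Separate(string expression)
    {
        // Surround every operator or parenthesis with spaces...
        string padded = Regex.Replace(expression, OperatorPattern, " $0 ");
        // ...then collapse runs of spaces and trim both ends.
        return Regex.Replace(padded, " {2,}", " ").Trim();
    }

    public static void Main()
    {
        Console.WriteLine(Separate("(x+3>5*y)&&(z>=3 || k!=x)"));
        // ( x + 3 > 5 * y ) && ( z >= 3 || k != x )
        Console.WriteLine(Separate("3>=4"));
        // 3 >= 4
    }
}
```

Note that this only inserts spaces; it does not validate the expression, and operands are whatever is left between the operators.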

2 Answers


Not a regex, but a basic tokenizer that might just work (note that you don't need to do the string.Join - you can use the IEnumerable<string> via foreach):

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
static class Program
{
    static void Main()
    {
        // goal: each token separated by a single space, e.g. ( x + 3 > 5 * y ) && ( z >= 3 || k != x )
        string recombined = string.Join(" ", Tokenize("(x+3>5*y)&&(z>=3 || k!=x)"));
        // output: ( x + 3 > 5 * y ) && ( z >= 3 || k != x )
    }
    public static IEnumerable<string> Tokenize(string input)
    {
        var buffer = new StringBuilder();
        foreach (char c in input)
        {
            if (char.IsWhiteSpace(c))
            {
                if (buffer.Length > 0)
                {
                    yield return Flush(buffer);
                }
                continue; // just skip whitespace
            }

            if (IsOperatorChar(c))
            {
                if (buffer.Length > 0)
                {
                    // we have back-buffer; could be a>b, but could be >=
                    // need to check if there is a combined operator candidate
                    if (!CanCombine(buffer, c))
                    {
                        yield return Flush(buffer);
                    }
                }
                buffer.Append(c);
                continue;
            }

            // so here, the new character is *not* an operator; if we have
            // a back-buffer that *is* operators, yield that
            if (buffer.Length > 0 && IsOperatorChar(buffer[0]))
            {
                yield return Flush(buffer);
            }

            // append
            buffer.Append(c);
        }
        // out of chars... anything left?
        if (buffer.Length != 0)
            yield return Flush(buffer);
    }
    static string Flush(StringBuilder buffer)
    {
        string s = buffer.ToString();
        buffer.Clear();
        return s;
    }
    static readonly string[] operators = { "+", "-", "*", "/", "%", "=", "&&", "||", "==", ">=", ">", "<", "<=", "!=", "(",")" };
    static readonly char[] opChars = operators.SelectMany(x => x.ToCharArray()).Distinct().ToArray();

    static bool IsOperatorChar(char newChar)
    {
        return Array.IndexOf(opChars, newChar) >= 0;
    }
    static bool CanCombine(StringBuilder buffer, char c)
    {
        foreach (var op in operators)
        {
            if (op.Length <= buffer.Length) continue;
            // check starts with same plus this one
            bool startsWith = true;
            for (int i = 0; i < buffer.Length; i++)
            {
                if (op[i] != buffer[i])
                {
                    startsWith = false;
                    break;
                }
            }
            if (startsWith && op[buffer.Length] == c) return true;
        }
        return false;
    }

}
Marc Gravell
  • Thanks. It works. For now I will just upvote the answer. I am still looking for an interesting regular expression answer, although this may not be a good candidate for regular expressions – Hippias Minor Jul 09 '13 at 08:55
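
The comments on the question mention feeding the tokens to a shunting-yard step. A minimal sketch of that step is shown below, under some loud assumptions: the precedence table follows C-style rules (it is not given anywhere in the question), every operator is treated as binary and left-associative, and the output is a postfix (RPN) token list rather than the AST mentioned in the comment. The names `ShuntingYardSketch` and `ToPostfix` are hypothetical.

```csharp
using System;
using System.Collections.Generic;

public static class ShuntingYardSketch
{
    // Assumed precedence levels (higher binds tighter), C-style.
    private static readonly Dictionary<string, int> Precedence = new Dictionary<string, int>
    {
        { "||", 1 }, { "&&", 2 },
        { "==", 3 }, { "!=", 3 },
        { "<", 4 }, { "<=", 4 }, { ">", 4 }, { ">=", 4 },
        { "+", 5 }, { "-", 5 },
        { "*", 6 }, { "/", 6 }, { "%", 6 }
    };

    // Converts infix tokens to postfix order. Unary operators are not handled.
    public static List<string> ToPostfix(IEnumerable<string> tokens)
    {
        var output = new List<string>();
        var pending = new Stack<string>(); // operators and "(" waiting to be emitted
        foreach (string token in tokens)
        {
            if (token == "(")
            {
                pending.Push(token);
            }
            else if (token == ")")
            {
                // Emit operators back to the matching "(".
                while (pending.Count > 0 && pending.Peek() != "(")
                    output.Add(pending.Pop());
                if (pending.Count == 0)
                    throw new FormatException("Unbalanced ')'");
                pending.Pop(); // discard the "("
            }
            else if (Precedence.ContainsKey(token))
            {
                // Left-associative: emit operators of equal or higher precedence first.
                while (pending.Count > 0 && pending.Peek() != "("
                       && Precedence[pending.Peek()] >= Precedence[token])
                    output.Add(pending.Pop());
                pending.Push(token);
            }
            else
            {
                output.Add(token); // anything else is treated as an operand
            }
        }
        while (pending.Count > 0)
        {
            string op = pending.Pop();
            if (op == "(") throw new FormatException("Unbalanced '('");
            output.Add(op);
        }
        return output;
    }

    public static void Main()
    {
        // Tokens as produced by a tokenizer such as the one above.
        string[] tokens = "( x + 3 > 5 * y ) && ( z >= 3 || k != x )".Split(' ');
        Console.WriteLine(string.Join(" ", ToPostfix(tokens)));
        // x 3 + 5 y * > z 3 >= k x != || &&
    }
}
```

A postfix list like this can be evaluated with a single stack, or turned into a tree by pushing nodes instead of values.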

If you can predefine all the operators that you're going to use, something like this might work for you.

Be sure to put the double-character operators earlier in the regex, so that it tries to match '<=' before it tries to match '<'.

using System;
using System.Text.RegularExpressions;

public class Example
{
   public static void Main()
   {
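      // two-character operators first; "==" and "%" added so every operator from the question is covered
      string pattern = "==|!=|<=|>=|\\|\\||\\&\\&|\\d+|[a-z()+\\-*/%<>]";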
      string sentence = "(x+35>5*y)&&(z>=3 || k!=x)";

      foreach (Match match in Regex.Matches(sentence, pattern))
         Console.WriteLine("Found '{0}' at position {1}", 
                           match.Value, match.Index);
   }
}

Output:

Found '(' at position 0
Found 'x' at position 1
Found '+' at position 2
Found '35' at position 3
Found '>' at position 5
Found '5' at position 6
Found '*' at position 7
Found 'y' at position 8
Found ')' at position 9
Found '&&' at position 10
Found '(' at position 12
Found 'z' at position 13
Found '>=' at position 14
Found '3' at position 16
Found '||' at position 18
Found 'k' at position 21
Found '!=' at position 22
Found 'x' at position 24
Found ')' at position 25
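
To see why the order of the alternatives matters, here is a small sketch comparing two orderings on the input `a<=b`. The .NET regex engine tries alternatives from left to right and takes the first one that matches, so a single-character alternative listed first will split a two-character operator. The class name `AlternationOrderDemo` and the helper `Tokens` are invented for this illustration.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class AlternationOrderDemo
{
    // Hypothetical helper: returns the matched tokens in order of appearance.
    public static string[] Tokens(string input, string pattern)
    {
        return Regex.Matches(input, pattern)
                    .Cast<Match>()
                    .Select(m => m.Value)
                    .ToArray();
    }

    public static void Main()
    {
        // Two-character operator listed first: "<=" stays in one piece.
        Console.WriteLine(string.Join(" ", Tokens("a<=b", "<=|<|=|[a-z]")));
        // a <= b

        // Single character listed first: "<" wins and "=" is matched separately.
        Console.WriteLine(string.Join(" ", Tokens("a<=b", "<|<=|=|[a-z]")));
        // a < = b
    }
}
```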
Gustav Bertram