C++ Polynomial Tokenizer

Question

I am currently working on creating a tokenizer that takes in a polynomial as a string and outputs an array of monomials (individual terms) within the polynomial.

ex:

input: 4x^2+3x^-2+2

output: { "4x^2", "3x^-2", "2" }

I am not sure where to start, because polynomials are trickier to split than they look and have special cases to handle. Can anyone provide any insight?

Can't you just split on plus/minus, and then trim whitespace? Also, polynomials cannot have negative powers. Once you allow negative powers it basically becomes equivalent to the space of regular expressions, which is a different (strictly larger) space. — Nir Friedman, Sep 25 '16 at 01:05
I could but exponents can be negative and I'm not sure how to account for that. — star, Sep 25 '16 at 01:07
Do not use a regex. Just scan through, character by character, and if the character is plus OR minus, you split off a new token. You should really show at least an attempted solution; imho right now this question fails the criteria of making a reasonable attempt to solve on your own. — Nir Friedman, Sep 25 '16 at 01:10
I don't have any code typed up currently, only writing out my algorithm on paper. I'm not really familiar with the c++ code required to tokenize and so I was just wanting suggestions on any built-ins that would be helpful. — star, Sep 25 '16 at 01:16
"I was just wanting suggestions on any built-ins that would be helpful." no such built-ins, prepare yourself to exchange the paper with your preferred C++ IDE — Adrian Colomitchi, Sep 25 '16 at 01:27

score 2 · Accepted Answer · edited Nov 20 '17 at 15:03

There may be some quick and dirty hacks for this using regular expressions or pattern matching.
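
One such quick approach is the character-by-character scan suggested in the comments: start a new term at each plus or minus, unless the sign directly follows ^ and therefore belongs to an exponent. Here is a minimal sketch, assuming well-formed input. The function name split_monomials, and the choice to keep a minus sign with its term while dropping a plus, are assumptions made for illustration:

```cpp
#include <string>
#include <vector>

// Split a polynomial string into its terms (monomials).
// A '+' or '-' starts a new term unless it is the first character
// or directly follows '^' (the sign of an exponent, as in "3x^-2").
// Assumes well-formed input; nothing is validated here.
std::vector<std::string> split_monomials(const std::string& input) {
    std::vector<std::string> terms;
    std::string current;
    char prev = '\0';  // last non-whitespace character seen
    for (char c : input) {
        if (c == ' ' || c == '\t') continue;  // ignore whitespace
        const bool is_sign = (c == '+' || c == '-');
        if (is_sign && prev != '^' && prev != '\0') {
            // Term boundary: finish the current term, start the next.
            terms.push_back(current);
            current.clear();
            if (c == '-') current += c;  // keep '-' with its term, drop '+'
        } else if (c == '+' && prev == '\0') {
            // A leading '+' carries no information, so skip it.
        } else {
            current += c;
        }
        prev = c;
    }
    if (!current.empty()) terms.push_back(current);
    return terms;
}
```

With this, split_monomials("4x^2+3x^-2+2") returns the three strings 4x^2, 3x^-2 and 2. It only splits the text; it does not check that each piece is a valid term.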

However, the robust way of implementing this parsing is using standard tools that have been (or should've been) taught in our fine institutions of higher learning. Or, at least they were in my time. I am, of course, referring to lexical analyzers and LALR(1) parser generators.

A lexical analyzer, such as flex, takes a list of token definitions in the form of regular expressions, and generates code that tokenizes the input stream. In this case, the following simple flex ruleset should be sufficient for tokenizing your polynomial, I think:

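%{
#include <stdlib.h>   /* atoi */
#include <string.h>   /* strdup */
#include "y.tab.h"    /* token codes and yylval, generated by bison */
%}

digit         [0-9]
letter        [a-zA-Z]

%%
"+"                  { return PLUS;       }
"-"                  { return MINUS;      }
"*"                  { return TIMES;      }
"/"                  { return SLASH;      }
"^"                  { return EXPONENT;   }
{letter}+            { yylval.id = strdup(yytext);  /* parser must free this */
                       return IDENT;      }
{digit}+             { yylval.num = atoi(yytext);
                       return NUMBER;     }
[ \t\r\n]+           { /* added: skip whitespace instead of echoing it */ }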

This handles the first step: splitting your input string into the individual elements (tokens) of the polynomial.
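
For comparison, the same token classes can be produced by a small hand-written scanner in C++. This is only a sketch of what such a lexer does, not what flex generates. TokenKind, Token and lex are hypothetical names, and skipping whitespace and throwing on unknown characters are assumptions of the sketch:

```cpp
#include <cctype>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Token classes mirroring the flex rules above.
enum class TokenKind { Plus, Minus, Times, Slash, Exponent, Ident, Number };

struct Token {
    TokenKind kind;
    std::string text;  // the matched characters, like yytext in flex
};

std::vector<Token> lex(const std::string& input) {
    std::vector<Token> tokens;
    std::size_t i = 0;
    while (i < input.size()) {
        const unsigned char c = static_cast<unsigned char>(input[i]);
        if (std::isspace(c)) { ++i; continue; }  // skip whitespace
        if (std::isdigit(c) || std::isalpha(c)) {
            // {digit}+ or {letter}+: take the longest run of one class.
            const bool digits = std::isdigit(c) != 0;
            const std::size_t start = i;
            while (i < input.size()) {
                const unsigned char d = static_cast<unsigned char>(input[i]);
                if (digits ? !std::isdigit(d) : !std::isalpha(d)) break;
                ++i;
            }
            tokens.push_back({digits ? TokenKind::Number : TokenKind::Ident,
                              input.substr(start, i - start)});
            continue;
        }
        TokenKind kind;
        switch (c) {
            case '+': kind = TokenKind::Plus;     break;
            case '-': kind = TokenKind::Minus;    break;
            case '*': kind = TokenKind::Times;    break;
            case '/': kind = TokenKind::Slash;    break;
            case '^': kind = TokenKind::Exponent; break;
            default:
                throw std::invalid_argument(
                    std::string("unexpected character: ") + input[i]);
        }
        tokens.push_back({kind, std::string(1, input[i])});
        ++i;
    }
    return tokens;
}
```

Note that the minus in 3x^-2 comes out as an ordinary Minus token. Deciding that it belongs to the exponent is the parser's job, not the lexer's.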

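The lexical analyzer works together with an LALR(1) parser generator, such as bison. From your grammar specification, bison generates the parser and (with the -d option; -y selects the traditional y.tab.* file names) the y.tab.h header, which declares the tokens used in the grammar, like PLUS, MINUS and all the others, along with the type of yylval.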

Bison takes a specification for a context-free grammar, and generates a parser for it. Grammar specifications, even for simple polynomials like this one, tend to be fairly drawn out, so this is just a subset of the grammar specification for your polynomials:

polynomial: additive_expression;

additive_expression: additive_term
                   | additive_expression plus_or_minus additive_term
                   ;

plus_or_minus: PLUS | MINUS;

/* additive_term then fleshes out the structure of each polynomial term */

This would be supplemented, of course, with fragments of code that build a parse tree as part of the ruleset.
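
To give an idea of what such tree-building code produces, here is a hand-written recursive-descent sketch in C++. It is a different technique from the LALR(1) parser that bison generates, and it builds a flat list of terms rather than a full parse tree. The shape it assumes for additive_term (optional coefficient, optional variable, optional signed integer exponent) is an assumption, since the grammar above leaves that rule open. Monomial and PolyParser are hypothetical names:

```cpp
#include <cctype>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// One parsed term, coeff * var^exp (hypothetical structure for illustration).
struct Monomial {
    long coeff;
    std::string var;  // empty for a constant term
    long exp;         // 0 for a constant term
};

class PolyParser {
public:
    explicit PolyParser(std::string text) : src_(std::move(text)) {}

    // additive_expression: additive_term
    //                    | additive_expression plus_or_minus additive_term
    std::vector<Monomial> parse() {
        std::vector<Monomial> terms;
        terms.push_back(term(accept('-') ? -1 : 1));  // optional leading minus
        for (;;) {
            if (accept('+')) terms.push_back(term(1));
            else if (accept('-')) terms.push_back(term(-1));
            else break;
        }
        if (peek() != '\0') fail("unexpected character");
        return terms;
    }

private:
    // Assumed shape: [NUMBER] [IDENT [EXPONENT [MINUS] NUMBER]], not both empty.
    Monomial term(long sign) {
        Monomial m{sign, "", 0};
        const bool has_coeff = is_digit(peek());
        if (has_coeff) m.coeff = sign * number();
        if (is_alpha(peek())) {
            while (is_alpha(raw())) m.var += src_[pos_++];
            m.exp = 1;  // a bare variable means exponent 1
            if (accept('^')) {
                const long exp_sign = accept('-') ? -1 : 1;
                m.exp = exp_sign * number();
            }
        } else if (!has_coeff) {
            fail("expected a term");
        }
        return m;
    }

    long number() {
        if (!is_digit(peek())) fail("expected a number");
        long value = 0;
        while (is_digit(raw())) value = value * 10 + (src_[pos_++] - '0');
        return value;
    }

    static bool is_digit(char c) { return std::isdigit(static_cast<unsigned char>(c)) != 0; }
    static bool is_alpha(char c) { return std::isalpha(static_cast<unsigned char>(c)) != 0; }

    // Current character without skipping whitespace; '\0' at end of input.
    char raw() const { return pos_ < src_.size() ? src_[pos_] : '\0'; }

    // Skip whitespace, then return the current character.
    char peek() {
        while (std::isspace(static_cast<unsigned char>(raw()))) ++pos_;
        return raw();
    }

    // Consume c if it is the next non-whitespace character.
    bool accept(char c) {
        if (peek() != c) return false;
        ++pos_;
        return true;
    }

    [[noreturn]] void fail(const std::string& what) const {
        throw std::invalid_argument(what + " at position " + std::to_string(pos_));
    }

    std::string src_;
    std::size_t pos_ = 0;
};
```

For the input from the question, PolyParser("4x^2+3x^-2+2").parse() yields three terms: coefficient 4 with x to the power 2, coefficient 3 with x to the power -2, and the constant 2. A bison grammar would express the same structure as rules with actions attached, instead of as member functions.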

flex and bison have been around for a long time. They originally generated only C code (hence the C fragments in my flex example), but they can now generate C++ code as well. If you are not familiar with these tools, expect a steep learning curve; but this is the time-tested way of implementing a parser for non-trivial syntax, such as your polynomials.
