
I have a string variable that can be one of three things:

  1. A number
  2. A number in scientific notation
  3. Text

In cases 1 and 3 I want to do nothing and pass the data along. But in case 2 I need to convert it to a regular number. If I simply always convert the variable to a regular number, then when it contains actual text it becomes "0". So I need to know if the string is a number in scientific notation. The obvious, dirty answer is an algorithm like this:

Iterate through string as long as you see numbers. If the first encountered letter is "e" or "E" and it is followed by either "+" or "-", OR strictly more numbers, then it's a number in scientific notation, otherwise it's just a regular number or text.

But I assume there is a better way to do this in C++98 (without boost). Is there any built-in method to help? Even if it's something that just uses a try/catch.

EDIT: The question was closed because it was assumed to be homework. It is not homework, so it should be reopened. Also, to clarify, I am forced to use C++98 due to technical constraints.

I've drawn a proposed finite state automaton based on my initial idea ("other" implies ALL characters not otherwise specified for that given state). I believe it is correct.

Some example inputs that should accept:

1.453e-8
0.05843E5
8.43e6
5.2342E-7

Some example inputs that should fail:

hello
03HX_12
8432
8432E
e-8
fail-83e1

[Image: the proposed finite state automaton]

juanchopanza
Mike S
  • Sounds like it's time to read up on [regular expressions](http://en.cppreference.com/w/cpp/regex). – tadman May 11 '16 at 20:43
  • This isn't homework... – Mike S May 11 '16 at 20:44
  • Please undo your vote if that was your reason. – Mike S May 11 '16 at 20:44
  • http://stackoverflow.com/questions/8986488/how-to-read-float-with-scientific-notation-from-file-c – pm100 May 11 '16 at 20:45
  • This seems like a job for regular expression. I'm pretty sure that c++98 has support for RE in std::tr1, but I'm taking that info from this answer: http://stackoverflow.com/questions/4716098/regular-expressions-in-c-stl – BHawk May 11 '16 at 20:46
  • @tadman I read that regex in C++ was not added until 11, is that not true? – Mike S May 11 '16 at 20:46
  • @pm100 the link you shared is not relevant. – Mike S May 11 '16 at 20:47
  • Whether or not it is homework, you didn't show any effort in solving the problem. Therefore, the question should remain closed. – Fred Larson May 11 '16 at 20:51
  • @FredLarson so if I write my proposed solution, and add it to the question, then you support reopening my question? – Mike S May 11 '16 at 20:51
  • here is the solution to your predicament [s="1.23e-5";Internal`StringToDouble...](http://mathematica.stackexchange.com/questions/1737/how-do-you-convert-a-string-containing-a-number-in-c-scientific-notation-to-a-ma) – DIEGO CARRASCAL May 11 '16 at 20:52
  • If you provide an [MCVE](http://stackoverflow.com/help/mcve) with example input and expected vs. actual output, yes. – Fred Larson May 11 '16 at 20:52
  • @FredLarson got it, thanks. – Mike S May 11 '16 at 20:54
  • @Mike That's why the documentation has flags like that on the definitions. C++11 is widely supported by pretty much every compiler so generally it's safe to use unless you're dealing with historical hardware and compilers. There's no reason to do this in C++98 unless you have a specific technical constraint. – tadman May 11 '16 at 20:57
  • @tadman I specified C++98 in both my title **and** the body. I think it's quite obvious that wasn't an oversight on my part. It is a restriction based on a specific technical constraint far beyond my control. – Mike S May 11 '16 at 21:00
  • Then I wish you good luck working with a compiler that was released during the Clinton administration. – tadman May 11 '16 at 21:01
  • C++ brings out the pedantic people because C++ is extremely difficult to get right. Each community is a product of how much abuse the compiler or runtime inflicts on them on a daily basis. To crack this nut here: Write a state machine that steps through a `std::string`. – tadman May 11 '16 at 21:07
  • FWIW, [this lecture handout](http://www.cs.arizona.edu/~collberg/Teaching/453/2009/Handouts/Handout-3.pdf) contains a DFA for floating-point literals. – mindriot May 11 '16 at 21:16
  • I added the C++03 tag again because C++ is taken to mean the current standard. I don't think there is a C++98 tag, but I doubt that you need something that is only available in 98 and not 03, given that the latter consists mainly of fixes on top of the former. – juanchopanza May 11 '16 at 23:04

4 Answers


The main problem with your automaton is in the area of specification/requirements: it requires one or more digits on both sides of the decimal point, rejecting inputs like this:

.123
.123E3
123.
123.E+3

It is not immediately obvious that these should be rejected; some programming languages allow these forms. It's probably a good idea to reject something with no mantissa digits at all, though:

.
.E+03

If this is a missed requirement rather than an intentional restriction, then your state machine has to be adjusted. Because some elements are now optional, the easiest way to fix the state machine is to make it non-deterministic (an NFA graph), but that just introduces unnecessary difficulties. Since you will end up writing code anyway, it's easiest to handle this with ad hoc procedural code.

The advantage that ad hoc procedural code has over a formal treatment with conventional automata is lookahead: it can peek at the next character without consuming it, whereas an automaton consumes every symbol it inspects. Based on the lookahead, the code can decide whether or not to consume the character, independently of which code path it takes next. In other words, in pseudo code:

have_digits_flag = false

while (string begins with a digit character) {
   have_digits_flag = true
   consume digit character 
}

if (!string begins with a decimal point)
   goto bad;
consume decimal point

while (string begins with digit) {
  consume digit
  have_digits_flag = true;
}

if (!have_digits_flag)
  goto bad; // we scanned just a decimal point not flanked by digits!

if (string begins with e or E) {
   consume character
   if (string begins with + or -)
     consume character
   if (!string begins with digit)
     goto bad;
   while (string begins with digit)
     consume character
}

if (string is empty)
  return true;

// oops, trailing junk

bad:
  return false;

The "string begins with" operation obviously has to be safe against the string being empty. An empty string simply does not satisfy the "string begins with" predicate for any character.

How you implement "consume character" is up to you. You can literally remove a character from the std::string object, or move an iterator through it which indicates the input position.
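
As a rough illustration, the pseudo code above could be written in C++98 roughly as follows, using an iterator as the input position. The names `at_digit` and `looks_like_float` are made up for this sketch. Note that, like the pseudo code, it requires a decimal point and treats the exponent as optional, so it checks for a floating-point literal in general rather than for scientific notation specifically.

```cpp
#include <cctype>
#include <string>

// Hypothetical helper: the "string begins with a digit" predicate.
// It is safe on empty input because it checks for the end first.
static bool at_digit(std::string::const_iterator it,
                     std::string::const_iterator end) {
    return it != end && std::isdigit(static_cast<unsigned char>(*it)) != 0;
}

// Hypothetical name. Follows the pseudo code step by step:
// digits, a mandatory '.', digits, then an optional exponent.
bool looks_like_float(const std::string& s) {
    std::string::const_iterator it = s.begin();
    const std::string::const_iterator end = s.end();
    bool have_digits = false;

    while (at_digit(it, end)) { have_digits = true; ++it; }

    if (it == end || *it != '.') return false;  // decimal point is required
    ++it;                                       // consume decimal point

    while (at_digit(it, end)) { have_digits = true; ++it; }

    if (!have_digits) return false;  // just a '.' not flanked by digits

    if (it != end && (*it == 'e' || *it == 'E')) {
        ++it;                                            // consume e or E
        if (it != end && (*it == '+' || *it == '-')) ++it;
        if (!at_digit(it, end)) return false;            // exponent needs a digit
        while (at_digit(it, end)) ++it;
    }

    return it == end;  // anything left over is trailing junk
}
```

With this sketch, `.123E3` and `123.E+3` are accepted, while `.`, `.E+03` and `8432` (no decimal point) are rejected.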

The disadvantage of this type of approach is that it is not efficient. It perhaps doesn't matter here, but in general the ad hoc code suffers from the problem that it keeps testing multiple cases against the lookahead. In a complicated pattern, these can be numerous. A table-driven automaton never has to look at any input symbol a second time.
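
For comparison, here is a minimal sketch of what a table-driven automaton might look like in C++98. The state numbering, the character classes and the name `is_scientific_table` are assumptions made for this example, not something taken from the question's diagram. It accepts an optional leading `-`, allows digits on either side of the decimal point to be omitted (but not both), and requires an exponent with at least one digit.

```cpp
#include <cctype>
#include <string>

namespace {

// Hypothetical character classes; every input character maps to exactly one.
enum CharClass { DIGIT, DOT, EXP, MINUS, PLUS, OTHER, NUM_CLASSES };
enum { ACCEPT = 7, DEAD = 8, NUM_STATES = 9 };

CharClass classify(char c) {
    if (std::isdigit(static_cast<unsigned char>(c))) return DIGIT;
    switch (c) {
        case '.': return DOT;
        case 'e': case 'E': return EXP;
        case '-': return MINUS;
        case '+': return PLUS;
        default:  return OTHER;
    }
}

// One row per state, one column per character class.
const int TRANSITIONS[NUM_STATES][NUM_CLASSES] = {
    //               DIGIT  DOT   EXP   MINUS PLUS  OTHER
    /* 0 start    */ { 2,    3,    DEAD, 1,    DEAD, DEAD },
    /* 1 minus    */ { 2,    3,    DEAD, DEAD, DEAD, DEAD },
    /* 2 int      */ { 2,    4,    5,    DEAD, DEAD, DEAD },
    /* 3 lone dot */ { 4,    DEAD, DEAD, DEAD, DEAD, DEAD },
    /* 4 fraction */ { 4,    DEAD, 5,    DEAD, DEAD, DEAD },
    /* 5 e or E   */ { 7,    DEAD, DEAD, 6,    6,    DEAD },
    /* 6 exp sign */ { 7,    DEAD, DEAD, DEAD, DEAD, DEAD },
    /* 7 exp digs */ { 7,    DEAD, DEAD, DEAD, DEAD, DEAD },
    /* 8 dead     */ { DEAD, DEAD, DEAD, DEAD, DEAD, DEAD }
};

}  // namespace

// Each character is classified once and drives exactly one table lookup.
bool is_scientific_table(const std::string& s) {
    int state = 0;
    for (std::string::size_type i = 0; i < s.size() && state != DEAD; ++i)
        state = TRANSITIONS[state][classify(s[i])];
    return state == ACCEPT;
}
```

Changing the accepted language then means editing rows of the table rather than control flow, which is the main design trade-off against the ad hoc version.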

Kaz
  • I did intentionally leave out those examples, but now that I think more about it I think you're right. The ones you mentioned should be included. Thanks for the in-depth writeup and pseudo code. Performance is somewhat important because it will run hundreds of thousands of times but it's not crucial that it be as fast as possible. I'm going to play with thomas' answer a bit to see if it covers all the cases before implementing something along the lines of your pseudo code. – Mike S May 12 '16 at 16:25

JSON uses a pretty good state machine for parsing numbers. It is not overly strict but does not accept garbage like "e" or "-.e2".

It is:

  • "-", or,
  • nothing,

followed by

  • '0', or,
  • (one digit [1-9], followed by zero or more [0-9])

followed by

  • ('.' followed by one or more [0-9]), or,
  • nothing

followed by

  • (('e' or 'E'), followed by ('+' or '-' or nothing), followed by one or more [0-9]), or,
  • nothing

If you wish to see the format specified more formally, see RFC 7159, Section 6.

number = [ minus ] int [ frac ] [ exp ]
decimal-point = %x2E       ; .
digit1-9 = %x31-39         ; 1-9
e = %x65 / %x45            ; e E
exp = e [ minus / plus ] 1*DIGIT
frac = decimal-point 1*DIGIT
int = zero / ( digit1-9 *DIGIT )
minus = %x2D               ; -
plus = %x2B                ; +
zero = %x30                ; 0

This is what I do in my JSON parser for numbers (I support an extended format that allows full-range 64-bit integers; the JSON specification notes that integers outside the range -2^53+1 to 2^53-1 are not reliably interoperable):

template<typename InputIt,
         typename V = typename 
            std::iterator_traits<InputIt>::value_type>
static InputIt extractNumber(Variant& result, InputIt st, InputIt en);

template<typename InputIt, typename V>
InputIt JsonParser::extractNumber(Variant& result, InputIt st, InputIt en)
{
    if (st == en)
        parseError("Expected number at end of input");

    std::vector<V> text;

    auto accept = [&] {
        if (st == en)
            parseError("Expected number at end of input");
        text.emplace_back(*st++);
    };

    // -?(?:0|[1-9][0-9]*)(?:\.[0-9]+)?(?:[eE][+-]?[0-9]+)?
    // A    B C              D            E

    bool isFloatingPoint = false;

    if (*st == '-')
    {
        // A
        accept();

        // A lone "-" would leave nothing to dereference below
        if (st == en)
            parseError("Invalid number");
    }

    if (*st == '0')
    {
        // B
        accept();
    }
    else if (std::isdigit(*st, cLocale))
    {
        // C
        do
        {
            accept();
        }
        while (st != en && std::isdigit(*st, cLocale));
    }
    else
    {
        parseError("Invalid number");
    }

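    if (st != en && *st == '.')
    {
        accept();

        isFloatingPoint = true;

        // D: the grammar requires at least one digit after the point
        if (st == en || !std::isdigit(*st, cLocale))
            parseError("Invalid number");

        while (st != en && std::isdigit(*st, cLocale))
            accept();
    }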

    if (st != en && (*st == 'E' || *st == 'e'))
    {
        isFloatingPoint = true;

        // E
        accept();

        if (st != en && (*st == '+' || *st == '-'))
            accept();

        if (st == en || !std::isdigit(*st, cLocale))
            parseError("Invalid number");

        while (st != en && std::isdigit(*st, cLocale))
            accept();
    }

    text.emplace_back(0);

    if (isFloatingPoint)
        result.assign(std::atof(text.data()));
    else
        result.assign(std::int64_t(std::atoll(text.data())));

    return st;
}

I had to tweak it a little: the surrounding implementation ensured that st is never equal to en on entry, so originally I only asserted that. It passes my unit tests.

doug65536
  • Thanks, I suppose I should have searched for an existing FSA for scientific notation before drawing one. But oh well, it was kind of fun to work it out on the whiteboard. – Mike S May 12 '16 at 15:59
  • Added back the implementation of strict number parse from my json parser. It can tell whether it is floating point and specializes integers to allow non-standard full-range 64-bit numbers. – doug65536 May 13 '16 at 01:09
bool is_valid(std::string src){
    std::stringstream ss;
    ss << src;
    double d=0;
    ss >> d; 
    if (ss){
       return true;
    }
    else{
       return false;
    }
}

I have a simple solution: use a C++ stream to test whether the string is a number.
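
One possible refinement, sketched here under the assumption that the whole string must be a number and that the presence of `e` or `E` is what marks scientific notation: after extracting the `double`, try to read one more character and reject the input if that succeeds. The names `is_number_strict` and `is_scientific_stream` are invented for this example. `std::istringstream` can be constructed directly from the string, so no separate `<<` is needed.

```cpp
#include <sstream>
#include <string>

// Hypothetical helper: true only if the stream can parse a double
// and nothing except optional whitespace follows it.
bool is_number_strict(const std::string& src) {
    std::istringstream iss(src);
    double d = 0;
    if (!(iss >> d)) return false;      // not a number at all
    char leftover;
    if (iss >> leftover) return false;  // trailing junk such as "HX_12"
    return true;
}

// Hypothetical helper: a number that also contains an exponent marker.
bool is_scientific_stream(const std::string& src) {
    return src.find_first_of("eE") != std::string::npos
        && is_number_strict(src);
}
```

Stream extraction skips leading whitespace, and how edge cases such as a dangling `E` or an out-of-range exponent are reported may differ between standard library implementations, so this is worth testing on the target compiler before relying on it.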

Mike S
thomas
  • Thanks! I actually tried something similar to this, but I was printing out `ss` and it printed `0.000000` so I thought it was coercing the text to 0. But it seems like your function would work. – Mike S May 12 '16 at 16:02
  • Although technically this doesn't determine if the input is a number in scientific notation versus a regular number. – Mike S May 12 '16 at 16:41
  • This function does not work for some examples in my question. For example, it returns `true` for the input "03HX_12". So ultimately "03HX_12" would be changed to "3.000000". – Mike S May 12 '16 at 18:08
  • I'm surprised there isn't a way to get the contents of `src` into the `stringstream` via a constructor. – Kaz May 12 '16 at 21:51
  • What happens if the input is `123985958940581409827405987453.8478374845738982734234E+987654321`? What is the requirement? The original state machine will accept this. – Kaz May 12 '16 at 21:56
  • @Kaz good point, it returns `false`. And the resulting value of `d` is `1.79769e+308` – Mike S May 12 '16 at 22:17
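
If out-of-range values need to be detected explicitly, one C++98 option is `std::strtod`, which reports overflow through `errno`. The following is only a sketch; `to_double_checked` is a hypothetical name, and the policy of rejecting on any `ERANGE` is an assumption about the requirement.

```cpp
#include <cerrno>
#include <cstdlib>
#include <string>

// Hypothetical helper: converts s with strtod. Returns true only if the
// whole string was consumed and no range error was reported.
bool to_double_checked(const std::string& s, double& out) {
    const char* begin = s.c_str();
    char* end = 0;
    errno = 0;                               // strtod only sets errno on error
    const double value = std::strtod(begin, &end);
    if (end == begin) return false;          // nothing could be parsed
    if (*end != '\0') return false;          // trailing junk after the number
    if (errno == ERANGE) return false;       // result out of range for double
    out = value;
    return true;
}
```

Be aware that `strtod` is more permissive than the state machine in the question: it skips leading whitespace, honours the current C locale, and depending on the C library may accept forms such as `inf`, `nan` or hexadecimal floats.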

I ended up coding a function based on the FSA I drew in my question, plus a few tweaks based on Kaz's information and advice (thanks), and minus some determinism because it's code and I can cut those corners :). It's bulky but it's simple to understand and it works. See the sample test cases below the code.

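#include <cctype>    // isdigit
#include <iostream>  // cout, endl (used by the tests below)
#include <string>
using namespace std;

bool is_scientific_notation(string input) {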
    int state = 0;
    for (string::size_type i = 0; i < input.size(); i++) {
        char character = input.at(i);
        //cout << character << endl;
        switch(state) {
            case 0: {
                // state 0: accept on '.' or '-' or digit
                if (character == '.') {
                    state = 3;
                } else if (character == '-') {
                    state = 1;
                } else if (isdigit(character)) {
                    state = 2;
                } else {
                    goto reject; // reject 
                }
                break;
            }
            case 1: {
                // state 1: accept on '.' or digit
                if (character == '.') {
                    state = 3;
                } else if (isdigit(character)) {
                    state = 2;
                } else {
                    goto reject; // reject 
                }
                break;
            }
            case 2: {
                // state 2: accept on '.' or 'e' or 'E' or digit
                if (character == '.') {
                    state = 4;
                } else if ((character == 'e') || (character == 'E')) {
                    state = 5;
                } else if (isdigit(character)) {
                    state = 2;
                } else {
                    goto reject; // reject 
                }
                break;
            }
            case 3: {
                // state 3: accept on digit
                if (isdigit(character)) {
                    state = 4;  
                } else {
                    goto reject; // reject 
                }
                break;
            }
            case 4: {
                // state 4: accept on 'e' or 'E' or digit
                if ((character == 'e') || (character == 'E')) {
                    state = 5;
                } else if (isdigit(character)) {
                    state = 4;  
                } else {
                    goto reject; // reject 
                }
                break;
            }
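            case 5: {
                // state 5: accept on '+' or '-' or digit
                if ((character == '+') || (character == '-')) {
                    state = 6;
                } else if (isdigit(character)) {
                    state = 7;
                } else {
                    goto reject; // reject 
                }
                break;
            }
            case 6: {
                // state 6: a sign was seen, so a digit must follow
                if (isdigit(character)) {
                    state = 7;
                } else {
                    goto reject; // reject 
                }
                break;
            }
            case 7: {
                // state 7: accept on digit
                if (isdigit(character)) {
                    state = 7;
                } else {
                    goto reject; // reject 
                }
                break;
            }
        }
    }
    // only state 7 (at least one exponent digit) is accepting,
    // so inputs such as "8e+" are rejected
    if (state == 7) {
        return true;
    } else {
        reject:
            return false;
    }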
}

Tests:

// is_scientific_notation should return true
cout << ((is_scientific_notation("269E-9")) ? ("pass") : ("fail")) << endl;
cout << ((is_scientific_notation("269E9")) ? ("pass") : ("fail")) << endl;
cout << ((is_scientific_notation("269e-9")) ? ("pass") : ("fail")) << endl;
cout << ((is_scientific_notation("1.453e-8")) ? ("pass") : ("fail")) << endl;
cout << ((is_scientific_notation("8.43e+6")) ? ("pass") : ("fail")) << endl;
cout << ((is_scientific_notation("5.2342E-7")) ? ("pass") : ("fail")) << endl;
cout << ((is_scientific_notation(".2342E-7")) ? ("pass") : ("fail")) << endl;
cout << ((is_scientific_notation("8.e+2")) ? ("pass") : ("fail")) << endl;
cout << ((is_scientific_notation("-853.4E-2")) ? ("pass") : ("fail")) << endl;

// is_scientific_notation should return false
cout << ((is_scientific_notation("hello")) ? ("fail") : ("pass")) << endl;
cout << ((is_scientific_notation("03HX_12")) ? ("fail") : ("pass")) << endl;
cout << ((is_scientific_notation("8432")) ? ("fail") : ("pass")) << endl;
cout << ((is_scientific_notation("8432E")) ? ("fail") : ("pass")) << endl;
cout << ((is_scientific_notation("fail-83e1")) ? ("fail") : ("pass")) << endl;
cout << ((is_scientific_notation(".e8")) ? ("fail") : ("pass")) << endl;
cout << ((is_scientific_notation("E-8")) ? ("fail") : ("pass")) << endl;
cout << ((is_scientific_notation("2e.2")) ? ("fail") : ("pass")) << endl;
cout << ((is_scientific_notation("-E3")) ? ("fail") : ("pass")) << endl;

Output:

pass
pass
pass
pass
pass
pass
pass
pass
pass
pass
pass
pass
pass
pass
pass
pass
pass
pass
Mike S