6

So basically, I might have some string that looks like: "hey this is a string * this string is awesome 97 * 3 = 27 * this string is cool".

However, this string might be huge. I'm trying to remove all the asterisks from the string, unless that asterisk appears to represent multiplication. Efficiency is somewhat important here, and I'm having trouble coming up with a good algorithm to remove all the non-multiplication asterisks from this.

In order to determine whether an asterisk is for multiplication, I can obviously just check whether it's sandwiched in between two numbers.

Thus, I was thinking I could do something like (pseudocode):

wasNumber = false
Loop through string
   if number 
      set wasNumber = true
   else
      set wasNumber = false
   if asterisk
      if wasNumber
         if the next word is a number
            do nothing
         else
            remove asterisk
      else
         remove asterisk

However, that^ is ugly and inefficient on a huge string. Can you think of a better way to accomplish this in C++?

Also, how could I actually check whether a word is a number? It's allowed to be a decimal. I know there's a function to check if a character is a number...

Ioan Paul Pirau
  • 2,733
  • 2
  • 23
  • 26
Casey Patton
  • 4,021
  • 9
  • 41
  • 54
  • 4
    It seems like you could use a regex here. –  Jul 28 '11 at 16:53
  • 1
    Before being ugly or inefficient, that code is buggy. It never sets wasNumber back to false. – R. Martinho Fernandes Jul 28 '11 at 16:55
  • 1
    yeah just fixed that. I've honestly never used regular expressions in C++. Probably something like [^(0-9{1,}(\.+0-9*)].......man this seems complicated! – Casey Patton Jul 28 '11 at 16:58
  • This assumes that you cannot multiply anything that is not a number, which you might be fine with or not... just to make sure that you are conscious that multiplication cannot only apply to numbers, it can also be applied to symbols (ie. `3*pi` is a multiplication) – David Rodríguez - dribeas Jul 28 '11 at 17:23
  • @David Rodriguez: Good point, but for this program just handling numerical multiplication will suffice – Casey Patton Jul 28 '11 at 17:47
  • what about them? A multiplication symbol wouldn't ever be the first or last thing in parenthesis so I don't see how this relates. – Casey Patton Jul 28 '11 at 18:09
  • They would relate: "5*(6*3)", also you have whitespaces to skip. – sanosdole Jul 28 '11 at 18:15
  • ah got it. Good point, that does add a layer of complexity. However...my program doesn't really need to be perfect. Handling numbers should be enough for this. – Casey Patton Jul 28 '11 at 19:25
  • @Casey Patton you now have two variants that you can use for your problem. I would suggest testing both of them and choosing the appropriate one for your task. Please also inform us on the timings for both solutions :) – Ioan Paul Pirau Jul 29 '11 at 10:30
  • I will do this as soon as I get a chance and I will select an answer! – Casey Patton Jul 29 '11 at 15:56

4 Answers4

4

Fully functioning code:

#include <iostream>
#include <string>
using namespace std;

string RemoveAllAstericks(string);
void RemoveSingleAsterick(string&, int);
bool IsDigit(char);

int main()
{
    string myString = "hey this is a string * this string is awesome 97 * 3 = 27 * this string is cool";
    string newString = RemoveAllAstericks(myString);

    cout << "Original: " << myString << "\n";
    cout << "Modified: " << newString << endl;

    system("pause");
    return 0;
}

string RemoveAllAstericks(string s)
{
    int len = s.size();
    int pos;

    for(int i = 0; i < len; i++)
    {
       if(s[i] != '*') 
          continue;

       pos = i - 1;
       char cBefore = s[pos];
       while(cBefore == ' ')
       {
          pos--;
          cBefore = s[pos];
       }

       pos = i + 1;
       char cAfter  = s[pos];
       while(cAfter == ' ')
       {
          pos++;
          cAfter = s[pos];
       }

       if( IsDigit(cBefore) && IsDigit(cAfter) )
          RemoveSingleAsterick(s, i);
    }

    return s;
}

void RemoveSingleAsterick(string& s, int i)
{
    s[i] = ' '; // Replaces * with a space, but you can do whatever you want
}

bool IsDigit(char c)
{
   return (c <= 57 && c >= 48);
}

Top level overview:

Code searches the string until it encounters an *. Then, it looks at the first non-whitespace character before AND after the *. If both characters are numeric, the code decides that this is a multiplication operation, and removes the asterick. Otherwise, it is ignored.

See the revision history of this post if you'd like other details.

Important Notes:

  • You should seriously consider adding boundary checks on the string (i.e. don't try to access an index that is less than 0 or greater than len
  • If you are worried about parentheses, then change the condition that checks for whitespaces to also check for parentheses.
  • Checking whether every single character is a number is a bad idea. At the very least, it will require two logical checks (see my IsDigit() function). (My code checks for '*', which is one logical operation.) However, some of the suggestions posted were very poorly thought out. Do not use regular expressions to check if a character is numeric.

Since you mentioned efficiency in your question, and I don't have sufficient rep points to comment on other answers:

A switch statement that checks for '0' '1' '2' ..., means that every character that is NOT a digit, must go through 10 logical operations. With all due respect, please, since chars map to ints, just check the boundaries (char <= '9' && char >= '0')

user807566
  • 2,828
  • 3
  • 20
  • 27
  • Have you tried this solution with the proposed string ? As it is implemented it will definitely not work because in **all cases** `cBefore` and `cAfter` will not be numerals but empty spaces (`' '`) – Ioan Paul Pirau Jul 28 '11 at 21:14
  • @John Paul: No, I did not compile or test this. And you are right; the code will only work for `#*#` patterns. My fourth bullet addresses this. The code is fairly trivial to implement - but I didn't want to detract from the main algorithm. Also, I wanted to emphasize that every other posted solution involves checking every character for a numeric value, which is not necessary and most definitely not optimal. – user807566 Jul 29 '11 at 02:49
  • My code has been updated. I checked it with your sample, and it appears to be working. – user807566 Jul 29 '11 at 03:39
  • 1
    Your suggestion with the checks was good, I made an update adressing that. +1 because you corrected your code (and so that you may post comments in the future :) ). At this point, the sollution you proposed is a little bit better than mine if you expect relatively rare occurences of the asterisk, otherwise the sollution I proposed sails trough the string in a single dirrection and will do less navigation. – Ioan Paul Pirau Jul 29 '11 at 08:24
  • Yeah, I was thinking about that when I submitted. If there are a lot of *'s or if, for some reason, there is a lot of white space in between *'s, this code will require a lot of backtracking and extra operations. And thanks for the vote, by the way. :) – user807566 Jul 29 '11 at 11:40
3

You can start by implementing the slow version, it could be much faster than you think. But let's say it's too slow. It then is an optimization problem. Where does the inefficiency lies?

  • "if number" is easy, you can use a regex or anything that stops when it finds something that is not a digit
  • "if the next word is a number" is just as easy to implement efficiently.

Now, it's the "remove asterisk" part that is an issue to you. The key point to notice here is that you don't need to duplicate the string: you can actually modify it in place since you are only removing elements.

Try to run through this visually before trying to implement it.

Keep two integers or iterators, the first one saying where you are currently reading your string, and the second one saying where you are currently writing your string. Since you only erase stuff, the read one will always be ahead of the writing one.

If you decide to keep the current string, you just need to advance each of your integers/iterators one by one, and copying accordingly. If you don't want to keep it, just advance the reading string! Then you only have to cut the string by the amount of asterisks you removed. The complexity is simply O(n), without any additional buffer used.

Also note that your algorithm would be simpler (but equivalent) if written like this:

wasNumber = false
Loop through string
   if number 
      set wasNumber = true
   else
      set wasNumber = false
      if asterisk and wasNumber and next word is a number
          do nothing // using my algorithm, "do nothing" actually copies what you intend to keep
      else
          remove asterisk
Quentin Pradet
  • 4,691
  • 2
  • 29
  • 41
  • yeah my initial thought was just to erase from the middle of the string, though this is definitely something that would hurt efficiency. – Casey Patton Jul 28 '11 at 19:28
3

I found your little problem interesting and I wrote (and tested) a small and simple function that would do just that on a std::string. Here u go:

// TestStringsCpp.cpp : Defines the entry point for the console application.
//
#include "stdafx.h"
#include <string>
#include <iostream>

using namespace std;

string& ClearAsterisk(string& iString)
{
    bool bLastCharNumeric = false;
    string lString = "0123456789";

    for (string::iterator it = iString.begin(); it != iString.end() ; ++it) {
        switch (*it) {
        case ' ':   break;//ignore whitespace characters
        case '*':
            if (bLastCharNumeric) {
                //asterisk is preceded by numeric character. we have to check if
                //the following non space character is numeric also
                for (string::iterator it2 = it + 1; it2 != iString.end() ; ++it2) {
                    if (*it2 != ' ') {
                        if (*it2 <= '9' && *it2 >= '0') break;
                        else iString.erase(it);
                        break;  //exit current for
                    }
                }
            }
            else iString.erase(it);;
            break;

        default:
            if (*it <= '9' && *it >= '0') bLastCharNumeric= true;
            else bLastCharNumeric = false;  //reset flag
        }
    }
    return iString;
}

int _tmain(int argc, _TCHAR* argv[])
{
    string testString = "hey this is a string * this string is awesome 97 * 3 = 27 * this string is cool";

    cout<<ClearAsterisk(testString).c_str();
    cin >> testString;  //this is just for the app to pause a bit :)

    return 0;
}

It will work perfectly with your sample string but it will fail if you have a text like this: "this is a happy 5 * 3day menu" because it checks only for the first nonspace character after the '*'. But frankly I can't immagine a lot of cases you would have this kind of construct in a sentence.

HTH,
JP.

Ioan Paul Pirau
  • 2,733
  • 2
  • 23
  • 26
0

A regular expression wouldn't necessarily be any more efficient, but it would let you rely on somebody else to do your string parsing and manipulation.

Personally, if I were worried about efficiency, I would implement your pseudocode version while limiting needless memory allocations. I might even mmap the input file. I highly doubt that you'll get much faster than that.

Max Lybbert
  • 19,717
  • 4
  • 46
  • 69