
I am looking for a (relatively) simple way to parse an arbitrary string, extract all of the integers from it, and put them into an array. This differs from other similar questions because my strings have no standard format.

Example:

pt112parah salin10n m5:isstupid::42$%&%^*%7first3

I would need to eventually get an array with these contents:

112 10 5 42 7 3

And I would like a method more efficient than going character by character through the string.

Thanks for your help

Thomas Dickey
RBaxter
  • I'm pretty sure the only way to do this is going character by character unless you know a specific number to search for – 小太郎 Jun 07 '11 at 21:03
  • There is no more efficient way than going through character-by-character. However, you may find a library function that hides the loop under the hood. – Oliver Charlesworth Jun 07 '11 at 21:03
  • All I know is that it will be a number less than 256, non-negative. I could just find the index of a character that is a digit then call sscanf at that location and repeat, but I would think there is a more efficient (or at least cleaner) way of doing this. – RBaxter Jun 07 '11 at 21:05
  • What's wrong with going character by character? Can you imagine another way? – Joe Jun 07 '11 at 21:05
  • I would imagine a function that parses your string like strtok, except for any number, instead of a set token. Guess not ~_~ – RBaxter Jun 07 '11 at 21:07
  • [`strcspn()`](http://pubs.opengroup.org/onlinepubs/9699919799/functions/strcspn.html)? – pmg Jun 07 '11 at 21:36
  • Why are you doing this in C? Honestly? – Karl Knechtel Jun 08 '11 at 04:27

6 Answers


A quick solution. I'm assuming that there are no numbers that exceed the range of long, and that there are no minus signs to worry about. If those are problems, then you need to do a lot more work analyzing the results of strtol() and you need to detect '-' followed by a digit.

The code does loop over all characters; I don't think you can avoid that. But it does use strtol() to process each sequence of digits (once the first digit is found), and resumes where strtol() left off (and strtol() is kind enough to tell us exactly where it stopped its conversion).

#include <stdlib.h>
#include <stdio.h>
#include <ctype.h>

int main(void)
{
    const char data[] = "pt112parah salin10n m5:isstupid::42$%&%^*%7first3";
    long results[100];
    int  nresult = 0;

    const char *s = data;
    char c;

    while ((c = *s++) != '\0')
    {
        if (isdigit((unsigned char)c))   // cast avoids undefined behavior for negative char values
        {
            char *end;
            results[nresult++] = strtol(s-1, &end, 10);
            s = end;
        }
    }

    for (int i = 0; i < nresult; i++)
        printf("%d: %ld\n", i, results[i]);
    return 0;
}

Output:

0: 112
1: 10
2: 5
3: 42
4: 7
5: 3
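The caveats above (array capacity, values outside the expected range, and overflow reported by strtol()) can be handled in a small helper. The following is only a sketch under stated assumptions: the name `extract_numbers` and its `cap` and `max` parameters are hypothetical, and skipping out-of-range values (rather than reporting an error) is just one possible policy.

```c
#include <ctype.h>
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical helper (not part of the code above): stores up to `cap`
 * numbers found in `s` into `out` and returns how many were stored.
 * Digit runs that overflow long or exceed `max` are skipped (assumed policy). */
size_t extract_numbers(const char *s, long *out, size_t cap, long max)
{
    size_t n = 0;

    while (*s != '\0' && n < cap)
    {
        if (isdigit((unsigned char)*s))
        {
            char *end;
            long v;

            errno = 0;
            v = strtol(s, &end, 10);    /* end points just past the digit run */
            if (errno != ERANGE && v <= max)
                out[n++] = v;
            s = end;                    /* resume where strtol() stopped */
        }
        else
        {
            s++;
        }
    }
    return n;
}
```

Calling `extract_numbers(data, results, 100, 255)` would fill `results` the same way as the loop in main() while refusing to write past the array and ignoring values above 255.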
Jonathan Leffler
  • I note that the variable `end` is superfluous; the loop body could be the single statement `results[nresult++] = strtol(s-1, &s, 10);`. I also note that I didn't include an overflow check on the array of integers - that should be in there too. – Jonathan Leffler Jun 07 '11 at 21:36
  • Correction: using `end` is better because it is const-correct. Using `&s` instead of `&end` leads to a compilation warning on the call to `strtol()`. – Jonathan Leffler Jun 07 '11 at 22:09

Just because I've been writing Python all day and I want a break. Declaring an array will be tricky. Either you have to run it twice to work out how many numbers you have (and then allocate the array) or just use the numbers one by one as in this example.

NB the ASCII characters for '0' to '9' are 48 to 57 (i.e. consecutive).

#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <stdbool.h>

int main(int argc, char **argv)
{
    const char *input = "pt112par0ah salin10n m5:isstupid::42$%&%^*%7first3";

    size_t length = strlen(input);
    int value = 0;
    size_t i;
    bool gotnumber = false;
    for (i = 0; i <= length; i++) // <= so the terminating '\0' flushes a number at the end of the string
    {
        if (input[i] >= '0' && input[i] <= '9')
        {
            gotnumber = true;
            value = value * 10; // shift up a column
            value += input[i] - '0'; // convert the digit character to its numeric value
        }
        else if (gotnumber) // we hit this the first time we encounter a non-number after we've had numbers
        {
            printf("Value: %d \n", value);
            value = 0;
            gotnumber = false;
        }
    }

    return 0;
}

EDIT: the previous version didn't deal with 0
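The "run it twice" option mentioned at the top could look roughly like the sketch below: one pass counts the runs of digits, the array is allocated, and a second pass fills it. The helper names `count_numbers` and `collect_numbers` are hypothetical, and the code assumes every value fits in an int (the question says the numbers are below 256).

```c
#include <ctype.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical helper: counts maximal runs of digits in `s`. */
size_t count_numbers(const char *s)
{
    size_t count = 0;
    int in_number = 0;

    for (; *s != '\0'; s++)
    {
        int digit = isdigit((unsigned char)*s) != 0;
        if (digit && !in_number)
            count++;                /* first digit of a new run */
        in_number = digit;
    }
    return count;
}

/* Hypothetical helper: returns a malloc'd array holding the numbers in `s`
 * and stores its length in *count. Returns NULL (with *count == 0) when
 * there are no numbers or allocation fails. The caller frees the array. */
int *collect_numbers(const char *s, size_t *count)
{
    int *values;
    size_t n = 0;
    int in_number = 0;

    *count = count_numbers(s);      /* first pass: size the array */
    if (*count == 0)
        return NULL;
    values = malloc(*count * sizeof *values);
    if (values == NULL)
    {
        *count = 0;
        return NULL;
    }

    for (; *s != '\0'; s++)         /* second pass: fill the array */
    {
        if (isdigit((unsigned char)*s))
        {
            if (!in_number)
                values[n++] = 0;    /* start a new number */
            values[n - 1] = values[n - 1] * 10 + (*s - '0');
            in_number = 1;
        }
        else
        {
            in_number = 0;
        }
    }
    return values;
}
```

The cost is a second scan of the string; in exchange the array is exactly as large as needed and no upper bound has to be guessed.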

Joe
  • Interesting solution. Will adapt it to my code, thanks for the help! – RBaxter Jun 07 '11 at 21:15
  • what is gotnumber for? It's not used. – C.J. Jun 07 '11 at 21:19
  • Not a problem. Mark the tick if it answered your question! – Joe Jun 07 '11 at 21:21
  • See the comment on `else if`. It's used to indicate whether or not `value` has been used (i.e. we've encountered a number). It's set `false` after the value is used. Look at my edits to see why it's used. – Joe Jun 07 '11 at 21:21

More efficient than going through character by character?

Not possible, because you must look at every character to know that it is not an integer.

Now, given that you have to go through the string character by character, I would recommend simply casting each character as an int and checking that:

//string tmp = ""; declared outside of loop.
//pseudocode for inner loop:
int intVal = (int)c;
if(intVal >=48 && intVal <= 57){ //0-9 are 48-57 when char casted to int.
    tmp += c;
}
else if(tmp.length > 0){
    array[?] = (int)tmp; // ? is where to add the int to the array.
    tmp = "";
}

array will contain your solution.

Phil
  • The cast to `int` is pointless, you can use directly `c`; also, 48 and 57 are obscure and correct only for ASCII-based codepages, just use `'0'` and `'9'`, or the `isdigit` function. – Matteo Italia Jun 07 '11 at 21:22

Another solution is to use the strtok function

/* strtok example */
#include <stdio.h>
#include <string.h>

int main ()
{
  char str[] = "pt112parah salin10n m5:isstupid::42$%&%^*%7first3";
  char * pch;
  printf ("Splitting string \"%s\" into tokens:\n",str);
  pch = strtok (str," abcdefghijklmnopqrstuvwxyz:$%&^*");
  while (pch != NULL)
  {
    printf ("%s\n",pch);
    pch = strtok (NULL, " abcdefghijklmnopqrstuvwxyz:$%&^*");
  }
  return 0;
}

Gives:

112
10
5
42
7
3

Perhaps not the best solution for this task, since you need to list every character that should be treated as a delimiter. But it is an alternative to the other solutions.
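If listing every delimiter (and having the input modified) is a concern, a related approach is to describe the digits instead and let strcspn() and strspn() do the skipping. This is only a sketch, not the strtok() method shown above; the function name `scan_numbers` and its `cap` parameter are hypothetical, and values are assumed to fit in a long.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: stores up to `cap` numbers from `s` into `out`
 * without modifying `s`, and returns how many were stored. */
size_t scan_numbers(const char *s, long *out, size_t cap)
{
    static const char digits[] = "0123456789";
    size_t n = 0;

    s += strcspn(s, digits);              /* skip leading non-digits */
    while (*s != '\0' && n < cap)
    {
        out[n++] = strtol(s, NULL, 10);   /* convert the run starting at s */
        s += strspn(s, digits);           /* step over the digit run */
        s += strcspn(s, digits);          /* skip to the next digit or the end */
    }
    return n;
}
```

Because only the ten digit characters are named, upper-case letters or other unexpected characters in the input need no special handling, and string literals can be scanned directly.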

Fredrik Pihl
  • `strtok()` immediately rules out scanning literal strings or other constant strings because it modifies the array it is parsing. – Jonathan Leffler Jun 07 '11 at 21:31
  • True, that was however not in the problem description and since the OP asked for something other than a loop over each character, I added this since it's an interesting function :-) – Fredrik Pihl Jun 07 '11 at 21:42
  • Independently of my reservations about modifying the string using `strtok()` (and parsing code should IMNSHO never modify input strings without explicit permission to do so), the code shown is not resilient to new characters showing up in the input. For example, it runs into problems with the first upper-case or accented letter. Making it resilient requires a 245-character second argument to `strtok()` (256 - 10 digits - NUL). – Jonathan Leffler Jun 07 '11 at 21:53
#include <stdio.h>
#include <string.h>
#include <math.h>

int main(void)
{
    const char *input = "pt112par0ah salin10n m5:isstupid::42$%&%^*%7first3";
    const char *pos = input;
    unsigned int integers[(strlen(input) + 1) / 2];   // At most ceil(length / 2) integers: each needs at least 1 digit, and neighbors are separated by at least 1 other character
    unsigned int numInts = 0;

    while ((pos = strpbrk(pos, "0123456789")) != NULL) // strpbrk() prototype in string.h
    {
        sscanf(pos, "%u", &integers[numInts]);         // %u matches the unsigned int element type

        if (integers[numInts] == 0)
            pos++;
        else
            pos += (int) log10(integers[numInts]) + 1;        // requires math.h

        numInts++;
    }

    for (unsigned int i = 0; i < numInts; i++)
        printf("%u ", integers[i]);

    return 0;
}

Finding the integers is accomplished via repeated calls to strpbrk() on the offset pointer. After each match, the pointer is advanced by the number of digits in the integer, calculated by taking the base-10 logarithm of the integer and adding 1 (with a special case for when the integer is 0). There is no need to use abs() on the integer when calculating the logarithm, as you stated the integers will be non-negative. Note that this digit count assumes numbers are written without leading zeros: for input such as 007 the pointer advances by only one character, so the remaining digits are scanned again. If you wanted to be more space-efficient, you could store the results in an unsigned char array instead, as you stated the integers will all be <256, but that isn't a necessity.
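An alternative to computing the digit count with a logarithm is to let sscanf() report how many characters it consumed through the standard `%n` conversion. The sketch below assumes a hypothetical helper name (`sscanf_numbers`) and a caller-supplied capacity `cap`, and it assumes each number fits in an unsigned int.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: stores up to `cap` numbers from `s` into `out`
 * and returns how many were stored. */
size_t sscanf_numbers(const char *s, unsigned int *out, size_t cap)
{
    const char *pos = s;
    size_t n = 0;

    while (n < cap && (pos = strpbrk(pos, "0123456789")) != NULL)
    {
        int consumed = 0;

        /* %n stores the number of characters read so far; it does not
         * count toward sscanf()'s return value. */
        if (sscanf(pos, "%u%n", &out[n], &consumed) != 1)
            break;
        pos += consumed;        /* skip the whole digit run, leading zeros included */
        n++;
    }
    return n;
}
```

With this variant math.h is no longer needed, and input such as 007 is read once as the single value 7.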

JAB
  • Made my answer a bit more proper... there's likely some ways it could be simplified, though. – JAB Jun 08 '11 at 15:04

And if you don't mind using C++ instead of C (usually there isn't a good reason why not), then you can reduce your solution to just two lines of code (using AXE parser generator):

vector<int> numbers;
auto number_rule = *(*(axe::r_any() - axe::r_num()) 
   & *axe::r_num() >> axe::e_push_back(numbers));

now test it:

std::string str = "pt112parah salin10n m5:isstupid::42$%&%^*%7first3";
number_rule(str.begin(), str.end());
std::for_each(numbers.begin(), numbers.end(), [](int i) { std::cout << "\ni=" << i; });

and sure enough, you got your numbers back.

And as a bonus, you don't need to change anything when parsing Unicode wide strings:

std::wstring str = L"pt112parah salin10n m5:isstupid::42$%&%^*%7first3";
number_rule(str.begin(), str.end());
std::for_each(numbers.begin(), numbers.end(), [](int i) { std::cout << "\ni=" << i; });

and sure enough, you got the same numbers back.

Gene Bushuyev