3

Let's say, for example, I have a list of user IDs, access times, program names, and version numbers as a list of CSV strings, like this:

1,1342995305,Some Program,0.98
1,1342995315,Some Program,1.20
2,1342985305,Another Program,15.8.3
1,1342995443,Bob's favorite game,0.98
3,1238543846,Something else,
...

Assume this list is not a file, but is an in-memory list of strings.

Now let's say I want to find out how often certain programs have been accessed, broken down by version number. (e.g. "Some Program version 1.20" was accessed 193 times, "Some Program version 0.98" was accessed 876 times, and "Some Program 1.0.1" was accessed 1,932 times)

Would it be better to build a regular expression and then use regexec() to find the matches and pull out the version numbers, or strstr() to match the program name plus comma, and then just read the following part of the string as the version number? If it makes a difference, assume I am using GCC on Linux.

Is there a performance difference? Is one method "better" or "more proper" than the other? Does it matter at all?

Ωmega
cegfault

4 Answers

3

Go with strstr(). Using a regex to count occurrences is not a good idea, since you would need a loop anyway. I would suggest a simple loop that searches for the position of the substring, increments a counter, and advances the starting search position after each match.
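A minimal sketch of that approach, assuming each list entry is a NUL-terminated line with no trailing newline and that program names contain no commas. Because the data is a list of lines, the loop here runs over the list with one `strstr()` call per line. The function name `count_version` and its parameters are hypothetical, chosen only for illustration.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: counts the lines whose program field equals
   `program` and whose version field equals `version`. */
size_t count_version(const char *const *lines, size_t n,
                     const char *program, const char *version)
{
    /* Search for ",<program>," so that "Program" does not match
       inside "Some Program". */
    char needle[256];
    int len = snprintf(needle, sizeof needle, ",%s,", program);
    if (len < 0 || (size_t)len >= sizeof needle)
        return 0;  /* program name too long for this sketch */

    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        const char *hit = strstr(lines[i], needle);
        /* Assumption: the version is everything after the needle. */
        if (hit != NULL && strcmp(hit + len, version) == 0)
            count++;
    }
    return count;
}
```

Calling `count_version(lines, n, "Some Program", "1.20")` on the sample data from the question would count only the second line. A line with an empty version field can be matched by passing `""` as the version.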

Ωmega
  • 2
    *I* would go with this method rather than what is in some of the other answers, because it reduces the number of calls to the C string functions compared with `strchr`, `strtok`, etc. Some of the string functions call `strlen` as their first operation, which can eat up some time. Note that using a regex and compiling it just once _could_ rival the speed of `strstr`. – JimR Jul 22 '12 at 23:19
  • This is what I have been leaning towards. It seems like `strstr` would help with readability and maintainability because it has a low complexity (there is no prep needed). However, does this mean regex should be reserved *only* for complex searches, or is that a preference issue? – cegfault Jul 23 '12 at 00:05
  • @cegfault - Regex is the right tool when you need to capture one or more parts of an input, as in parsing. It is not a good tool for counting substring occurrences. – Ωmega Jul 23 '12 at 00:55
  • but isn't that what I'm doing here? ie, capturing the version number from the line... – cegfault Jul 23 '12 at 01:19
  • @cegfault - No, because you know exactly what the match should look like. A good example for regex is log files: say you are looking for the list of IP addresses that accessed some part of your site. You don't know in advance which IP addresses will match, because the match is determined by the URL, so you would search for the URL and use a regex capture to pull out the IP address. In your case, if I understand your question correctly, you know the list of all program names with the version(s) you want to search for, correct? – Ωmega Jul 23 '12 at 01:57
  • Got it. In this example I know all the program names, but not all the versions. – cegfault Jul 23 '12 at 02:33
  • @JimR: Unless `strstr` is implemented very poorly, there's no way regex could rival it. Regex is at best `O(n)` (where `n` is the length of the string to search). `strstr` is `O(n)` as well if implemented correctly, but the common cases' runtimes are proportional to `n/m`, where `m` is the length of the needle (substring) being searched for. – R.. GitHub STOP HELPING ICE Jul 23 '12 at 14:37
  • @R..: We'll have to agree to disagree here. I can't find the web page where this was benchmarked a few years ago, but I distinctly remember it. – JimR Jul 23 '12 at 15:35
  • @R: do you have a reference that shows (strstr=O(n)) <= (regex=O(n))? Or, perhaps even better, pseudo code for strstr and regex so I can look at them and count it up myself? – cegfault Aug 03 '12 at 21:15
1

strchr/memcmp is how most libc versions implemented strstr. Hardware-dependent implementations of strstr in glibc do better. Both SSE2 and SSE4.2 (x86) instruction sets can do way better than scanning byte-by-byte. If you want to see how, I posted a couple blog articles a while back --- SSE2 and strstr and SSE2 and BNDM search --- that you might find interesting.

Mischa
0

Use strtok() to break the data up into something more structured (like a list of structs).

Neuron
  • From the strtok man page: This interface is obsoleted by strsep(3). – eyalm Jul 23 '12 at 07:50
  • strtok() is ANSI/ISO, strsep() isn't. –  Jul 23 '12 at 10:16
  • This is the slowest and most bloated approach. Plus the usual stuff about `strtok` considered harmful... – R.. GitHub STOP HELPING ICE Jul 23 '12 at 14:38
  • I think searching for delimiters and extracting substrings is hacky. It's fine if it's a one-off. If you're going to end up wanting to do more, or to have more flexibility, getting the data into an easier-to-use form is useful. –  Jul 23 '12 at 14:56
0

I'd do neither: I'm betting it would be faster to use strchr() to find the commas, and strcmp() to check the program name.
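A rough sketch of the comma-scanning idea, assuming the four-field layout from the question (id, time, name, version). Since the name field inside the line is not NUL-terminated, this sketch swaps in a length check plus `memcmp()` where the text says `strcmp()`. The function name `match_program` is hypothetical.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical helper: returns true if the third CSV field of `line`
   equals `program`. On success, *version points at the fourth field
   inside `line` (no copy is made). */
bool match_program(const char *line, const char *program,
                   const char **version)
{
    const char *p = strchr(line, ',');       /* end of user id */
    if (p == NULL) return false;
    p = strchr(p + 1, ',');                  /* end of access time */
    if (p == NULL) return false;

    const char *name = p + 1;
    const char *end = strchr(name, ',');     /* end of program name */
    if (end == NULL) return false;

    /* Compare lengths first, then bytes: the field has no terminator. */
    size_t len = (size_t)(end - name);
    if (len != strlen(program) || memcmp(name, program, len) != 0)
        return false;

    *version = end + 1;
    return true;
}
```

Unlike a substring search, this only ever compares against the third field, so a program name that happens to appear elsewhere in the line cannot produce a false match.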

As for performance, I expect the string functions (strtok/strstr/strchr/strcmp...) to all run at more or less the same speed (i.e. really, really fast), and regex to run appreciably slower, albeit still quite fast.

The real performance benefit would come from properly designing the search though: how many times it must run, is the number of programs fixed...?

For example, a single scan that collects ALL the frequency data for all the programs would be much slower than a single scan looking for one given program. But if it is properly designed, all subsequent queries for other programs would run far faster.
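One way the "scan once, query many times" idea could look, as a sketch only. It assumes the same four-field lines as the question and treats everything after the second comma ("name,version") as the key. The names `struct tally`, `tally_all`, and `MAX_KEYS` are invented for this example, and the linear lookup stands in for whatever hash table or sorted structure a real implementation would use.

```c
#include <stddef.h>
#include <string.h>

#define MAX_KEYS 64   /* arbitrary cap for this sketch */

struct tally {
    const char *key;  /* points into an input line: "name,version" */
    size_t count;
};

/* Hypothetical helper: fills `out` with one entry per distinct
   "name,version" suffix and returns the number of entries used. */
size_t tally_all(const char *const *lines, size_t n,
                 struct tally out[MAX_KEYS])
{
    size_t used = 0;
    for (size_t i = 0; i < n; i++) {
        /* Skip the id and time fields. */
        const char *p = strchr(lines[i], ',');
        if (p != NULL) p = strchr(p + 1, ',');
        if (p == NULL) continue;             /* malformed line */
        const char *key = p + 1;

        /* Linear lookup; fine for a handful of keys only. */
        size_t k = 0;
        while (k < used && strcmp(out[k].key, key) != 0)
            k++;
        if (k == used) {
            if (used == MAX_KEYS) continue;  /* table full: drop line */
            out[used].key = key;
            out[used].count = 0;
            used++;
        }
        out[k].count++;
    }
    return used;
}
```

After one call to `tally_all`, answering "how often was Some Program 1.20 accessed?" is a lookup in the small result table rather than another pass over every line.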

Ωmega
LSerni
  • This is unlikely to be faster if `strstr` is implemented well. – R.. GitHub STOP HELPING ICE Jul 23 '12 at 14:37
  • I think all those functions are more or less on par; to get any appreciable gain one would have to take into account the overall search. If searches were done repeatedly, several optimizations are possible (see e.g. http://blog.phusion.nl/2010/12/06/efficient-substring-searching/ ). – LSerni Jul 23 '12 at 18:21