5

I'm wondering if there would be any merit in trying to code a strlen function to find the \0 sequence in parallel. If so, what should such a function take into account? Thanks.

Dervin Thunk

6 Answers

8

strlen() is inherently sequential: reading even one byte past the null terminator can be undefined behavior, and the terminator can be anywhere, from the first character to the one-millionth, so you have to scan sequentially.
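To illustrate the dependency, here is a minimal byte-at-a-time sketch (the name `seq_strlen` is hypothetical, not a library function). Each read is only known to be safe because the previous byte was not the terminator, which is what makes it hard to hand later bytes to another thread.

```c
#include <stddef.h>

/* Hypothetical name; a plain sequential scan for the terminator.
   The read of *p is only known to be valid because every earlier
   byte was non-zero, so the string extends at least this far. */
size_t seq_strlen(const char *s)
{
    const char *p = s;
    while (*p != '\0')
        ++p;
    return (size_t)(p - s);
}
```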

sharptooth
  • Yes, I thought so, but that is also the spirit of all O(n) algorithms. The idea is to perhaps divide n by the number of processors, scanning words (64-bit words, I mean) in parallel and, on thread join, picking the lowest-numbered thread that found a zero... I'm just blabbing here, but bear with me: one objective of parallel processing is splitting n over the number of processors... what do you think? – Dervin Thunk Feb 25 '11 at 15:56
  • 1
    @Dervin Thunk: Okay, say the string is of length 1. The first processor reads the first 8 bytes, the second one reads the next 8 bytes, and those happen not to be mapped into the address space, so the program crashes. The point is that you can only read memory that belongs to the string, and you don't know where the string ends. – sharptooth Feb 25 '11 at 16:00
  • 2
    You could pass in the buffer size, like strnlen – onof Feb 25 '11 at 16:08
  • @Onof is right. Suppose I have a string buffer of 32K (not horrifically large) and need to find the \0; maybe I can divide the buffer into p chunks, where p is the number of processors. Is it worth it to have a parallel strlen? However, @sharp's concerns are duly noted (I'm writing a paper on this, and the points he raised are vital). – Dervin Thunk Feb 25 '11 at 16:20
  • 1
    @Dervin: then it's not a parallel `strlen` as such. It's a parallel find that happens to be looking for a `0` byte. The only thing that distinguishes `strlen` from that is precisely this property that `strlen` can't over-read (except perhaps by an implementation-specific amount of "rounding up" according to how memory is mapped). – Steve Jessop Feb 25 '11 at 17:01
  • accessing bytes beyond the terminator is not necessarily undefined behaviour. It only depends on the actual length of the allocated object. – codymanix Feb 25 '11 at 17:05
  • @Steve. I'm not sure about what you're trying to say, but maybe the case can be made that `strlen` is a subclass of a string finding algorithm (which it is: brute-force search is a `strlen` with a variable pattern and that is what makes it O(n)), and for parallel archs we need a superclass because for bigger strings it will be faster... I hope I didn't digress too much – Dervin Thunk Feb 25 '11 at 17:24
  • @Dervin: I think my point is that parallel find in an array of known length surely is a previously studied problem (I know I have some rough ideas how I'd implement it and how I'd test whether it was improving performance, and that one of the crucial tweaks is going to be how you stop all the "later" threads if an "earlier" thread finds a match). `strlen` is different from that *only* in that it falls foul of what sharptooth says. You can't look very far ahead down the string. The `strnlen` onof suggests turns the question into a boring old parallel find. – Steve Jessop Feb 25 '11 at 19:25
  • That said, with suitable requirements on the platform you could just invent a number to use as the array size (probably one that gives each available thread a "large enough" chunk to work on that you'd expect a significant performance gain). Then if you get a segfault, handle it. You can't do that in portable C, but then I suppose to be pedantic you can't do parallelization in portable C anyway, so that might not be an issue. – Steve Jessop Feb 25 '11 at 19:31
4

You'd have to make sure the NUL found by a thread is the first NUL in the string, which means that the threads would need to synchronize on what their lowest NUL location is. So while it could be done, the overhead for the sync would be far more expensive than any potential gain from parallelization.

Also, there's the issue of caching. A single thread can read a string contiguously, which is cache friendly. Multiple threads run the risk of stepping on each other's toes.
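As a rough sketch of the "agree on the lowest NUL location" step, assuming the buffer size is known (the `strnlen`-style case raised in the comments above) and an OpenMP-capable compiler: each chunk is searched with `memchr` and the results are combined with a `min` reduction. The function name and the `chunk_size` parameter are hypothetical. Without `-fopenmp` the pragma is ignored and the loop runs serially with the same result. The sketch makes no attempt to stop later chunks early and says nothing about whether it is faster than a plain scan.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: returns the offset of the first NUL byte in
   buf[0..buf_size), or buf_size if there is none (like strnlen).
   Never reads outside buf[0..buf_size). */
size_t chunked_strnlen(const char *buf, size_t buf_size, size_t chunk_size)
{
    size_t first = buf_size;              /* sentinel: "no NUL found" */
    if (chunk_size == 0)
        chunk_size = 1;
    long nchunks = (long)((buf_size + chunk_size - 1) / chunk_size);
    long i;

    /* Each iteration scans one chunk; the reduction keeps the lowest hit. */
    #pragma omp parallel for reduction(min:first)
    for (i = 0; i < nchunks; ++i) {
        size_t start = (size_t)i * chunk_size;
        size_t len = buf_size - start < chunk_size ? buf_size - start
                                                   : chunk_size;
        const char *hit = memchr(buf + start, '\0', len);
        if (hit != NULL) {
            size_t pos = (size_t)(hit - buf);
            if (pos < first)
                first = pos;
        }
    }
    return first;
}
```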

chrisaycock
  • 1
    NUL, not NULL: NUL is ASCII 0, NULL is the null pointer. http://en.wikipedia.org/wiki/Null_character – lhf Feb 25 '11 at 15:59
  • You completely ignore the fact that memory beyond the end of the string may not be accessible. – sharptooth Feb 25 '11 at 16:02
  • 2
    @sharptooth I had already voted-up your answer on undefined behaviour. I'm addressing the performance issues with my answer, since that's what the OP is interested in. There's no need for me to parrot your earlier response. – chrisaycock Feb 25 '11 at 16:07
  • I think you meant the "first" NUL in the string, not the last. You also wouldn't need to synchronize if the work was purposely divided to look at different sections of the string, and only that section. – Ioan Feb 25 '11 at 16:48
1

It would be possible on some parallel architectures, but only if one could guarantee that a substantial amount of memory beyond the string could be accessed safely; it would only be practical if the strings were expected to be quite long and thread communication and synchronization were cheap. For example, if one had sixteen processors and one knew that one could safely access 256KB beyond the end of a string, one could start by dispatching the sixteen processors to handle sixteen 4K chunks. Each time a processor finished without finding a zero, it could either start on the next 4K chunk (if that chunk was within 256KB of the lowest chunk still in progress) or wait for the lowest processor to complete. In practice, unless strings were really huge, synchronization delays and excess work would negate any gains from parallelism, but if one needed to find the length of a multi-megabyte string, the task could be done in parallel.
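One way to read the "within 256KB of the lowest chunk still in progress" rule is as a bounds check, sketched below with hypothetical names. The reasoning assumed here: if every chunk before the lowest unfinished one has been scanned without finding a zero, the terminator lies at or after that chunk's start, so everything up to that start plus the guaranteed slack is readable.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical predicate. All offsets are in bytes from the start of
   the string. safe_slack is the number of bytes past the terminator
   that are assumed to be readable (256KB in the example above). */
bool chunk_is_safe(size_t chunk_start, size_t chunk_size,
                   size_t lowest_unfinished_start, size_t safe_slack)
{
    /* The chunk's last byte must not lie past the guaranteed region. */
    return chunk_start + chunk_size <= lowest_unfinished_start + safe_slack;
}
```

With 4K chunks and 256KB of slack, a worker that finishes while chunk 0 is still being scanned may take any chunk that ends at or before offset 256K, and must otherwise wait.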

supercat
0

You could use this on FIXED-WIDTH strings, but not much more than that.

Spidey
0

It depends on the architecture. There is nothing wrong with having multiple compute units hunt for that first null character, but you will have to keep them fed with a steady stream of data from memory. You will probably want to do platform-specific tuning of the exact parameters, keeping cache boundaries in mind.

Chad Brewbaker
0

To parallelize tasks, you have to split input data and dispatch it to multiple threads. Without knowing the length of the string in advance, you cannot split up the data.

So you must know the allocated size of the input data (which is not necessarily the same as the string length) in advance; then it will work.

The threads might find several NUL bytes. Your function can only know that it has found the correct (first) NUL once every thread processing data that comes before it has finished.

Say we have our string split up into 8 chunks (0-7). If we find a NUL in chunk 3, we cannot know whether there are other NULs in chunks 0-2, so we have to wait for all of those threads, and we can immediately stop all the others. If a NUL is then found by thread 1, we only have to wait for thread 0 to complete to get a definitive answer.
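The waiting rule can be sketched as a pure function over per-chunk results. The type and function names are hypothetical, and a real multithreaded version would need atomics or a lock around the `done` flags, which this sketch leaves out.

```c
#include <stdbool.h>
#include <stddef.h>

#define NOT_FOUND ((size_t)-1)

typedef struct {
    bool   done;     /* worker has finished scanning this chunk */
    size_t nul_pos;  /* absolute offset of the chunk's first NUL,
                        or NOT_FOUND if the chunk has none */
} chunk_result;

/* Hypothetical helper: returns true and stores the string length in
   *len once the lowest chunk that reported a NUL has no unfinished
   chunk before it. Returns false while the answer is still undecided,
   and also if every chunk finished without finding a NUL. */
bool try_resolve(const chunk_result *r, size_t nchunks, size_t *len)
{
    for (size_t i = 0; i < nchunks; ++i) {
        if (!r[i].done)
            return false;   /* an earlier chunk could still hold a NUL */
        if (r[i].nul_pos != NOT_FOUND) {
            *len = r[i].nul_pos;
            return true;    /* all chunks before i are done and empty */
        }
    }
    return false;
}
```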

codymanix