15

Is there any comparable function like Pos that is not case-sensitive in D2010 (unicode)?

I know I can use Pos(AnsiUpperCase(FindString), AnsiUpperCase(SourceString)) but that adds a lot of processing time by converting the strings to uppercase every time the function is called.

For example, on a 1000000 loop, Pos takes 78ms while converting to uppercase takes 764ms.

str1 := 'dfkfkL%&/s"#<.676505';
  for i := 0 to 1000000 do
    PosEx('#<.', str1, 1); // Takes 78ms

  for i := 0 to 1000000 do
    PosEx(AnsiUpperCase('#<.'), AnsiUpperCase(str1), 1); // Takes 764ms

I know that to improve the performance of this specific example I can convert the strings to uppercase first before the loop, but the reason why I'm looking to have a Pos-like function that is not case-sensitive is to replace one from FastStrings. All the strings I'll be using Pos for will be different so I will need to convert each and every one to uppercase.

Is there any other function that might be faster than Pos + convert the strings to uppercase?

smartins
  • 3,808
  • 7
  • 42
  • 54

12 Answers12

29

The built-in Delphi function to do that is in both the AnsiStrings.ContainsText for AnsiStrings and StrUtils.ContainsText for Unicode strings.

In the background however, they use logic very similar to your logic.

No matter in which library, functions like that will always be slow: especially to be as compatible with Unicode as possible, they need to have quite a lot of overhead. And since they are inside the loop, that costs a lot.

The only way to circumvent that overhead, is to do those conversions outside the loop as much as possible.

So: follow your own suggestion, and you have a really good solution.

--jeroen

Jeroen Wiert Pluimers
  • 23,965
  • 9
  • 74
  • 154
  • 6
    -1: AnsiStrings.ContainsText does not return the position of the substring, only whether the sub-string exists within the string. – Steve Oct 12 '09 at 07:15
  • 2
    Sorry - I misread the original post in that it was only interested if it contained the position. Anyhow, ContainsText is similar enough to shows how the Delphi team solved the issue, hence the conclusion in my message still holds. – Jeroen Wiert Pluimers Oct 12 '09 at 10:23
  • 5
    Steve, in my version, the `ContainsText` function is implemented by changing the case of the two input arguments, calling `Pos`, and then checking that the result is greater than zero. Return the result instead of checking that it's positive, and you have a case-insensitive `Pos`. – Rob Kennedy Oct 12 '09 at 14:45
  • 1
    Very late update. Even in 10.1 Berlin ContainsText uses AnsiUpperCase which in turn calls Windows api CharUpperBuffW – Gerry Coll Jul 31 '18 at 23:54
13

This version of my previous answer works in both D2007 and D2010.

  • In Delphi 2007 the CharUpCaseTable is 256 bytes
  • In Delphi 2010 it is 128 KB (65535*2).

The reason is Char size. In the older version of Delphi my original code only supported the current locale character set at initialization. My InsensPosEx is about 4 times faster than your code. Certainly it is possible to go even faster, but we would lose simplicity.

type
  TCharUpCaseTable = array [Char] of Char;

var
  CharUpCaseTable: TCharUpCaseTable;

procedure InitCharUpCaseTable(var Table: TCharUpCaseTable);
var
  n: cardinal;
begin
  for n := 0 to Length(Table) - 1 do
    Table[Char(n)] := Char(n);
  CharUpperBuff(@Table, Length(Table));
end;

function InsensPosEx(const SubStr, S: string; Offset: Integer = 1): Integer;
var
  n:            Integer;
  SubStrLength: Integer;
  SLength:      Integer;
label
  Fail;
begin
  Result := 0;
  if S = '' then Exit;
  if Offset <= 0 then Exit;

  SubStrLength := Length(SubStr);
  SLength := Length(s);

  if SubStrLength > SLength then Exit;

  Result := Offset;
  while SubStrLength <= (SLength-Result+1) do 
  begin
    for n := 1 to SubStrLength do
      if CharUpCaseTable[SubStr[n]] <> CharUpCaseTable[s[Result+n-1]] then
        goto Fail;
      Exit;
Fail:
    Inc(Result);
  end;
  Result := 0;
end;

//...

initialization
  InitCharUpCaseTable({var}CharUpCaseTable);
Ian Boyd
  • 246,734
  • 253
  • 869
  • 1,219
GJ.
  • 10,810
  • 2
  • 45
  • 62
  • Indeed, your updated version is faster than any of the other functions I tried. UpCasePosEx: 171ms, PosEx(AnsiUpperCase... 827ms, XPos: 343ms – smartins Oct 12 '09 at 17:44
  • And if you still need more speed you can (it is good idea) switch off in compiler options "String format checking" or put at beginning of the unit {$STRINGCHECKS OFF} you can read more about: http://www.micro-isv.asia/2008/10/needless-string-checks-with-ensureunicodestring/ Now is about 10 times faster than your privius code. – GJ. Oct 13 '09 at 02:28
  • Thanks for the heads-up, I'm already using that compile directive on all my code via a include file. – smartins Oct 13 '09 at 09:34
  • Strange, becuse as I have measured the XPos is faster (better algorithem I have use CharUpCaseTable with 16 bit char size) for about 25%. What about other compiler switches? – GJ. Oct 13 '09 at 17:11
5

I have also faced the problem of converting FastStrings, which used a Boyer-Moore (BM) search to gain some speed, for D2009 and D2010. Since many of my searches are looking for a single character only, and most of these are looking for non-alphabetic characters, my D2010 version of SmartPos has an overload version with a widechar as the first argument, and does a simple loop through the string to find these. I use uppercasing of both arguments to handle the few non-case-sensitive case. For my applications, I believe the speed of this solution is comparable to FastStrings.

For the 'string find' case, my first pass was to use SearchBuf and do the uppercasing and accept the penalty, but I have recently been looking into the possibility of using a Unicode BM implementation. As you may be aware, BM does not scale well or easily to charsets of Unicode proportions, but there is a Unicode BM implementation at Soft Gems. This pre-dates D2009 and D2010, but looks as if it would convert fairly easily. The author, Mike Lischke, solves the uppercasing issue by including a 67kb Unicode uppercasing table, and this may be a step too far for my modest requirements. Since my search strings are usually short (though not as short as your single three-character example) the overhead for Unicode BM may also be a price not worth paying: the BM advantage increases with the length of the string being searched for.

This is definitely a situation where benchmarking with some real-world application-specific examples will be needed before incorporating that Unicode BM into my own applications.

Edit: some basic benchmarking shows that I was right to be wary of the "Unicode Tuned Boyer-Moore" solution. In my environment, UTBM results in bigger code, longer time. I might consider using it if I needed some of the extras this implementation provides (handling surrogates and whole-words only searches).

frogb
  • 2,040
  • 15
  • 22
4

Here's one that I wrote and have been using for years:

function XPos( const cSubStr, cString :string ) :integer;
var
  nLen0, nLen1, nCnt, nCnt2 :integer;
  cFirst :Char;
begin
  nLen0 := Length(cSubStr);
  nLen1 := Length(cString);

  if nLen0 > nLen1 then
    begin
      // the substr is longer than the cString
      result := 0;
    end

  else if nLen0 = 0 then
    begin
      // null substr not allowed
      result := 0;
    end

  else

    begin

      // the outer loop finds the first matching character....
      cFirst := UpCase( cSubStr[1] );
      result := 0;

      for nCnt := 1 to nLen1 - nLen0 + 1 do
        begin

          if UpCase( cString[nCnt] ) = cFirst then
            begin
              // this might be the start of the substring...at least the first
              // character matches....
              result := nCnt;

              for nCnt2 := 2 to nLen0 do
                begin

                  if UpCase( cString[nCnt + nCnt2 - 1] ) <> UpCase( cSubStr[nCnt2] ) then
                    begin
                      // failed
                      result := 0;
                      break;
                    end;

                end;

            end;


          if result > 0 then
            break;
        end;


    end;
end;

Steve
  • 1,769
  • 2
  • 22
  • 33
  • Doesn't this exacerbate the problem about which the OP asked? This will do the uppercase conversion inside the loop, perhaps multiple times. – Argalatyr Oct 11 '09 at 04:31
  • 2
    I disagree because the UpCase function is Char based only so there is no memory allocation involved. Remember, the slow part of most string functions is allocating the memory for the result and any local strings used within the function. If you look closely at my function you will realise that there is no memory allocation for strings. However, I would agrue that the proof of the pudding is in the eating, so let smartins benchmark indicate whether it is efficient of not. – Steve Oct 11 '09 at 09:37
  • Actually, I just tested it and it comes up at 93ms against 764ms of PosEx(AnsiUpperCase.... Would this be Unicode compatible? – smartins Oct 11 '09 at 19:45
  • 1
    I don't know what your benchmarking methodology is smartins, but this isn't what I found. Even removing the bounds checking of the parameters ("taking the safety off") this routine clocked in about 10% slower than Pos + Uppercase in my tests. – Deltics Oct 12 '09 at 06:35
  • However, I'm giving this +1 because the char-wise handling of the problem goes some way toward avoiding potential issues introduced by case conversion of strings that may affect the *length* of those strings, thereby causing errors in any char-position based results, w.r.t the input strings. – Deltics Oct 12 '09 at 06:45
  • I'm doing the following: for i := 0 to 1000000 do j := XPos(searchstr, str1); //296ms and: for i := 0 to 1000000 do j := Pos(AnsiUpperCase(searchstr), AnsiUpperCase(str1)); // 780ms I call GetTickCount before and after the loop to get the time it takes in ms – smartins Oct 12 '09 at 06:56
  • To make this routine fully Unicode compatible you will need to change the UpCase functions calls with ToUpper calls (defined in Character.pas). My benchmarks give more or less the same timings for my unicode/non-unicode versions, both of which are still faster than your original version. – Steve Oct 12 '09 at 07:05
  • Actually, my benchmark after replacing UpCase with ToUpper shows your function takes 3432ms to complete while "mine" takes 874ms. – smartins Oct 12 '09 at 10:23
  • Very strange?? My benchmarks for your code and my code and any variants are all pretty much in the same ballpark. – Steve Oct 12 '09 at 11:06
2

Why not just convert the both the substring and the source string to lower or upper case within the regular Pos statement. The result will effectively be case-insensitive because both arguments are all in one case. Simple and lite.

CongSpace
  • 21
  • 1
1

The Jedi Code Library has StrIPos and thousands of other useful functions to complement Delphi's RTL. When I still worked a lot in Delphi, JCL and its visual brother JVCL were among the first things I added to a freshly installed Delphi.

fvu
  • 32,488
  • 6
  • 61
  • 79
  • This is what I'm seeing in the current JEDI: `function StrIPos(const SubStr, S: string): SizeInt; begin Result := Pos(StrUpper(SubStr), StrUpper(S)); end;` Not exactly the most sophisticated way of doing it, and basically what the OP said they were trying to avoid due to the performance hit. – jep Dec 14 '15 at 21:54
0

Instead 'AnsiUpperCase' you can use Table it is much faster. I have reshape my old code. It is very simple and also very fast. Check it:

type
  TAnsiUpCaseTable = array [AnsiChar] of AnsiChar;

var
  AnsiTable: TAnsiUpCaseTable;

procedure InitAnsiUpCaseTable(var Table: TAnsiUpCaseTable);
var
  n: cardinal;
begin
  for n := 0 to SizeOf(TAnsiUpCaseTable) -1 do
  begin
    AnsiTable[AnsiChar(n)] := AnsiChar(n);
    CharUpperBuff(@AnsiTable[AnsiChar(n)], 1);
  end;
end;

function UpCasePosEx(const SubStr, S: string; Offset: Integer = 1): Integer;
var
  n              :integer;
  SubStrLength   :integer;
  SLength        :integer;
label
  Fail;
begin
  SLength := length(s);
  if (SLength > 0) and (Offset > 0) then begin
    SubStrLength := length(SubStr);
    result := Offset;
    while SubStrLength <= SLength - result + 1 do begin
      for n := 1 to SubStrLength do
        if AnsiTable[SubStr[n]] <> AnsiTable[s[result + n -1]] then
          goto Fail;
      exit;
Fail:
      inc(result);
    end;
  end;
  result := 0;
end;

initialization
  InitAnsiUpCaseTable(AnsiTable);
end.
GJ.
  • 10,810
  • 2
  • 45
  • 62
  • 1
    -1: Actually, contrary to the naming, AnsiUpperCase does support Unicode. Your function does not. – Jeroen Wiert Pluimers Oct 11 '09 at 12:19
  • @Jeroen Pluimers: You are wrong. Check my code again! My code at initialization call InitAnsiUpCaseTable which inside call API CharUpperBuff. So my function use at initialization current locale character set. – GJ. Oct 11 '09 at 12:51
  • GJ: Jeroen is right in the sense that you don't handle any multibyte encodings properly. Be it the older MBCS (BIG-5 etc) or UTF-8. jeroen: I didn't see the D2010 bit originally. If the ansi* definition changed to UTF-8 nowadays, forget what I said. – Marco van de Voort Oct 11 '09 at 15:18
  • @Marco van de Voort: In fact the algorithem must work! In D2009 and D2010 size of char is 16bit in older versions 8bit. API call CharUpperBuff return Unicode uppercase character. In my code the size of TAnsiUpCaseTable is 128 Kbytes if compiled under D2009 or D2010, compiled under older versions the size is only 256 bytes and support only at initialization current locale character set. OK instead 'SizeOf(TAnsiUpCaseTable)' must be 'Length(Table)'. I didn't test code under D2009 but I believe that work. – GJ. Oct 11 '09 at 19:01
  • GJ your code doesn't even compile in D2010. "E2010 Incompatible types: 'AnsiChar' and 'Char'" on AnsiTable[SubStr[n]] because SubStr is a string and AnsiTable is an array of AnsiChar. – smartins Oct 11 '09 at 21:13
  • GJ, as Smartins points out, you don't handle the conversion from UnicodeChar to AnsiChar. Even in non-Unicode Delphi, though, your code is wrong. In an AnsiString, a character can be represented by **multiple bytes**. Although you call CharUpperBuffer, you only handle a single byte, so if the character has multiple bytes, your code will fail to convert it properly. – Rob Kennedy Oct 12 '09 at 15:02
  • Rob Kennedy, I aggry, but the character size in D2010 is now 16bit and compiler will call API CharUpperBufferW instead CharUpperBufferA the UpCaseTable size is now 128KBytes. Check my second version. – GJ. Oct 12 '09 at 15:36
0

I think, converting to upper or lower case before Pos is the best way, but you should try to call AnsiUpperCase/AnsiLowerCase functions as less as possible.

samir105
  • 949
  • 11
  • 16
0

On this occasion I couldn't find any approach that was even as good as, let alone better than Pos() + some form of string normalisation (upper/lowercase conversion).

This is not entirely surprising as when benchmarked the Unicode string handling in Delphi 2009 I found that the Pos() RTL routine has improved significantly since Delphi 7, explained in part by the fact that aspects of the FastCode libraries have been incorporated into the RTL for some time now.

The FastStrings library on the other hand has not - iirc - been significantly updated for a long time now. In tests I found that many FastStrings routines have in fact been overtaken by the equivalent RTL functions (with a couple of exceptions, explained by the unavoidable overhead incurred by the additional complications of Unicode).

The "Char-Wise" processing of the solution presented by Steve is the best so far imho.

Any approach that involves normalising the entire strings (both string and sub-string) risks introducing errors in any character-based position in the results due to the fact that with Unicode strings a case conversion may result in a change in the length of the string (some characters convert to more/fewer characters in a case conversion).

These may be rare cases but Steve's routine avoids them and is only about 10% slower than the already quite fast Pos + Uppercase (your benchmarking results don't tally with mine on that score).

Deltics
  • 22,162
  • 2
  • 42
  • 70
0

Often the simple solution is the one you'd want to use:

if AnsiPos(AnsiupperCase('needle'), AnsiupperCase('The Needle in the haystack')) <> 0 then
    DoSomething;

Reference:

T.S
  • 355
  • 4
  • 18
0

Any program on Windows can call a shell-API function, which keeps your code-size down. As usual, read the program from the bottom up. This has been tested with Ascii-strings only, not wide strings.

program PrgDmoPosIns; {$AppType Console} // demo case-insensitive Pos function for Windows

// Free Pascal 3.2.2 [2022/01/02], Win32 for i386
// FPC.EXE -vq -CoOr -Twin32 -oPrgStrPosDmo.EXE PrgStrPosDmo.LPR
//         -vq Verbose: Show message numbers
//             -C Code generation:
//               o Check overflow of integer operations
//                O Check for possible overflow of integer operations - Integer Overflow checking turns on Warning 4048
//                 r Range checking
//                   -Twin32 Target 32 bit Windows operating systems
// 29600 bytes code, 1316 bytes data, 35,840 bytes file

function StrStrIA( pszHaystack, pszNeedle : PChar ) : PChar; stdcall; external 'shlwapi.dll'; // dynamic link to Windows API's case-INsensitive search
// https://learn.microsoft.com/en-us/windows/win32/api/shlwapi/nf-shlwapi-strstria
// "FPC\3.2.2\Source\Packages\winunits-base\src\shlwapi.pp" line 557

function StrPos(        strNeedle, strHaystk : string ) : SizeInt; // return the position of Needle within Haystack, or zero if not found
var
  intRtn       : SizeInt; // function result
  ptrHayStk             , // pointers to
  ptrNeedle             , //   search strings
  strMchFnd    : PChar  ; // pointer to match-found string, or null-pointer/empty-string when not found
  bolFnd       : boolean; // whether Needle was found within Haystack
  intLenHaystk          , // length of haystack
  intLenMchFnd : SizeInt; // length of needle
begin
  strHayStk :=       strHayStk + #0            ; // strings passed to API must be
  strNeedle :=       strNeedle + #0            ; //   null-terminated

  ptrHayStk := Addr( strHayStk[ 1 ] )          ; // set pointers to point at first characters of
  ptrNeedle := Addr( strNeedle[ 1 ] )          ; //   null-terminated strings, so API gets C-style strings

  strMchFnd := StrStrIA( ptrHayStk, ptrNeedle ); // call Windows to perform search; match-found-string now points inside the Haystack
  bolFnd    := ( strMchFnd <> '' )             ; // variable is True when           match-found-string is not null/empty

  if bolFnd then begin                         ; // when Needle was yes found in Haystack
    intLenMchFnd := Length( strMchFnd )        ; // get length of needle
    intLenHaystk := Length( strHayStk )        ; // get length of haystack
    intRtn       := intLenHaystk - intLenMchFnd; // set  function result to the position of needle within haystack, which is the difference in lengths
  end       else                                 // when Needle was not found in Haystack
    intRtn       := 0                          ; // set  function result to tell caller needle does not appear within haystack
  StrPos := intRtn                             ; // pass function result back to caller
end; // StrPos

procedure TstOne( const strNeedle, strHayStk : string ); // run one test with this Needle
var
  intPos : SizeInt; // found-match location of Needle within Haystack, or zero if none
begin
  write  ( 'Searching for : [', strNeedle, ']' ); // bgn output row for this test
  intPos := StrPos(  strNeedle, strHaystk      ); // get Needle position
  writeln(' StrPos is '       , intPos         ); // end output row for this test
end; // TstOne

procedure TstAll(                                     ); // run all tests with various Needles
const
  strHayStk = 'Needle in a Haystack'; // all tests will search in this string
begin
  writeln( 'Searching in  : [', strHayStk, ']' ); // emit header row
  TstOne ( 'Noodle'           , strHayStk      ); // test not-found
  TstOne ( 'Needle'           , strHayStk      ); // test found at yes-first character
  TstOne ( 'Haystack'         , strHayStk      ); // test found at not-first character
end; // TstAll

begin // ***** MAIN *****
  TstAll( ); // run all tests
end.
Bilbo
  • 358
  • 1
  • 10
0
function TextPos(const ASubText, AText: UnicodeString): Integer;
var
    res: Integer;
begin
{
    Locates a substring in a given text string without case sensitivity. 

    Returns the index of the first occurence of ATextin AText,
    or zero if the text was not found
}

    res := FindNLSString(LOCALE_USER_DEFAULT, FIND_FROMSTART or LINGUISTIC_IGNORECASE, PWideChar(AText), Length(AText), PWideChar(ASubText), Length(ASubText), nil);
    Result := (res+1); //convert zero-based to one-based index, and -1 not found to zero.
end;

And in case you don't have the definitions:

function FindNLSString(Locale: LCID; dwFindNLSStringFlags: DWORD; lpStringSource: PWideChar; cchSource: Integer; lpStringValue: PWideChar; cchValue: Integer; cchFound: PInteger): Integer; stdcall; external 'Kernel32.dll';

const
    FIND_FROMSTART = $00400000;  // look for value in source, starting at the 
    LINGUISTIC_IGNORECASE       = $00000010;  // linguistically appropriate 'ignore 
Ian Boyd
  • 246,734
  • 253
  • 869
  • 1,219