16

I need to strip out all non-standard text characters from a string. I need to remove all non-ASCII and control characters (except line feeds/carriage returns).

IElite

6 Answers

24

And here's a variant of Cosmin's that only walks the string once, but still uses an efficient allocation pattern:

function StrippedOfNonAscii(const s: string): string;
var
  i, Count: Integer;
begin
  SetLength(Result, Length(s));
  Count := 0;
  for i := 1 to Length(s) do begin
    if ((s[i] >= #32) and (s[i] <= #127)) or (s[i] in [#10, #13]) then begin
      inc(Count);
      Result[Count] := s[i];
    end;
  end;
  SetLength(Result, Count);
end;
David Heffernan
  • 2
    Very good variant: only one reallocation, and possibly no reallocations if the string doesn't contain any non-ASCII chars. – Cosmin Prund Apr 13 '11 at 18:46
  • var l, i, Count: Integer; begin l := Length(s); SetLength(Result, l); if l = 0 then Exit; Count := 0; for i := 1 to l do begin if ((s[i] >= #32) and (s[i] <= #127)) or (s[i] in [#10, #13]) then begin inc(Count); Result[Count] := s[i]; end; end; if l <> Count then SetLength(Result, Count); end; – Zam Feb 21 '20 at 17:53
15

Something like this should do:

// For those who need a disclaimer: 
// This code is meant as a sample to show you how the basic check for non-ASCII characters goes
// It will perform poorly when called often with long strings.
// Use a TStringBuilder, or SetLength & Integer loop index to optimize.
// If you need really optimized code, pass this on to the FastCode people.
function StripNonAsciiExceptCRLF(const Value: AnsiString): AnsiString;
var
  AnsiCh: AnsiChar;
begin
  for AnsiCh in Value do
    if (AnsiCh >= #32) and (AnsiCh <= #127) and (AnsiCh <> #13) and (AnsiCh <> #10) then
      Result := Result + AnsiCh;
end;

For UnicodeString you can do something similar.
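As written, the condition above can never let #10 or #13 through, since both are below #32 and are also excluded explicitly. A minimal sketch of a check that does keep line breaks, written for `string` (UnicodeString in Unicode-enabled Delphi), might look like the following. The function name is hypothetical, and treating #127 (DEL) as a control character to drop is my reading of the question rather than something the answer states; use #127 as the upper bound to match the original range. Like the sample above, it appends one character at a time and is not tuned for long strings.

```pascal
// Hypothetical helper: a sketch, not the answer author's code.
function StripNonAsciiKeepCRLF(const Value: string): string;
var
  Ch: Char;
begin
  // Initialize explicitly; a string Result is not guaranteed to start empty.
  Result := '';
  for Ch in Value do
    // Keep printable ASCII (#32..#126) plus line feed (#10) and carriage return (#13).
    // Assumption: #127 (DEL) counts as a control character and is dropped.
    if ((Ch >= #32) and (Ch <= #126)) or (Ch = #10) or (Ch = #13) then
      Result := Result + Ch;
end;
```

The comparisons use ordinal ranges rather than a `set of Char`, which avoids the set-size limits that apply to wide characters.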

Jeroen Wiert Pluimers
  • 6
    I would not reallocate Result over and over. –  Apr 13 '11 at 14:28
  • 3
    I would fix it if speed became a problem. – Jeroen Wiert Pluimers Apr 13 '11 at 17:36
  • 1
    There are two potential problems: 1) speed, 2) memory fragmentation. Neither may be an issue if the function is called occasionally and with small strings. Either could become one if the function is called often with large strings. As usual, optimization requires understanding where the code is expected to run. –  Apr 13 '11 at 18:17
  • This will probably work well with small strings because the memory manager is optimised to deal with this pattern of allocation and because the small blocks make the required mem copy operation fairly fast. But given a reallocation-free drop-in alternative was offered (David's code, not mine) I'd never use this. – Cosmin Prund Apr 13 '11 at 18:57
  • 1
    @David: wow, you are harsh on me today. First of all, this is a code sample showing how to do the proper comparisons. Optimizing it distracts from that point. Furthermore, premature optimization causes a lot of evil code. That's why I optimize code when performance is indeed an issue. I've added some comments in the code to warn, but for me those warnings would go with most sample code I encounter that prove a basic algorithm. – Jeroen Wiert Pluimers Apr 13 '11 at 19:07
  • @Jeroen This is pretty trivial stuff and to do it right isn't hard or particularly long-winded. It's a very common pattern. I wouldn't class this as an optimisation. I'd regard it as the baseline for reasonable code. Any optimised version would involve unrolling the loop. – David Heffernan Apr 13 '11 at 19:17
  • 2
    @David: for you this is trivial, for me this is trivial, but for a lot of SO readers this is not trivial. It's the classic example of the Pareto Principle. I teach software developers for a part of my living and see that 80/20 rule on a very regular basis. Hence my samples are meant to be understood by lots of people, and the people that need optimization will figure that out themselves. I can understand you see that in a different way, but I think commenting 'sloppy programmer' based on one code sample is way too harsh, especially since there is no secondary communication involved. – Jeroen Wiert Pluimers Apr 13 '11 at 19:35
  • @Jeroen You contradict yourself. In an earlier comment you stated, "I would fix it if speed became a problem." – David Heffernan Apr 13 '11 at 19:40
  • @David: I didn't see Shane indicate that speed is a problem here. If he does, I can now point him to your optimized code (I upvoted it). If you hadn't posted it, I would optimize the code myself, and split the code into two methods: the regular one to show the basics, and the optimized one. That way anyone can make a comparison and see why things were optimized in a certain way. – Jeroen Wiert Pluimers Apr 13 '11 at 20:00
  • 1
    Wow, #13 and #10 will always be stripped as the code stands, how could this be the accepted answer? – LU RD Oct 10 '13 at 18:27
  • @LURD probably because of the disclaimer. – Jeroen Wiert Pluimers Oct 13 '13 at 13:51
  • 3
    @JeroenWiertPluimers Premature micro-optimization and worrying about technical details below the abstraction of the language appear to be unfortunate traits of many Delphi developers (although I have no idea where or why it became part of the culture). Thus, I feel that your lesson about writing clean, clear code first and only optimizing if necessary (and normally after profiling) is even more important than your instruction about stripping characters from strings! – alcalde Feb 02 '14 at 00:08
5

If you don't need to do it in place and generating a copy of the string is acceptable, try this code:

 type CharSet=Set of Char;

 function StripCharsInSet(s:string; c:CharSet):string;
  var i:Integer;
  begin
     result:='';
     for i:=1 to Length(s) do
       if not (s[i] in c) then 
         result:=result+s[i];
  end;  

and use it like this

 s := StripCharsInSet(s,[#0..#9,#11,#12,#14..#31,#127]);

EDIT: added #127 for DEL ctrl char.

EDIT2: this is a faster version, thanks ldsandon

 function StripCharsInSet(s:string; c:CharSet):string;
  var i,j:Integer;
  begin
     SetLength(result,Length(s));
     j:=0;
     for i:=1 to Length(s) do
       if not (s[i] in c) then 
        begin
         inc(j);
         result[j]:=s[i];
        end;
     SetLength(result,j);
  end;  
PA.
3

Here's a version that doesn't build the string by appending char-by-char, but allocates the whole string in one go. It requires going over the string twice, once to count the "good" chars and once to actually copy them, but it's worth it because it doesn't do multiple reallocations:

function StripNonAscii(s:string):string;
var Count, i:Integer;
begin
  Count := 0;
  for i:=1 to Length(s) do
    if ((s[i] >= #32) and (s[i] <= #127)) or (s[i] in [#10, #13]) then
      Inc(Count);
  if Count = Length(s) then
    Result := s // No characters need to be removed, return the original string (no mem allocation!)
  else
    begin
      SetLength(Result, Count);
      Count := 1;
      for i:=1 to Length(s) do
        if ((s[i] >= #32) and (s[i] <= #127)) or (s[i] in [#10, #13]) then
        begin
          Result[Count] := s[i];
          Inc(Count);
        end;
    end;
end;
Cosmin Prund
  • 1
    Why would anyone downvote this? Not that it matters much, just curious. – Cosmin Prund Apr 14 '11 at 06:42
  • I would not have used StringOfChar but just SetLength(). Anyway, that's not a reason to downvote, although it requires walking the string twice. –  Apr 14 '11 at 07:31
  • It does require walking the string twice, but it *guarantees* optimal allocation. If this is done for many, many strings, optimal allocation is going to matter a lot more than walking the string only once. – Cosmin Prund Apr 14 '11 at 08:16
  • Edited the answer to use `SetLength` and to implement a tiny optimization that allows the routine to do its job with ZERO or 1 string allocations. – Cosmin Prund Apr 14 '11 at 08:19
  • @Cosmin one downside of multiple walks is that this code has two identical if statements which violates DRY – David Heffernan Apr 14 '11 at 08:44
  • @David, that's true. To be honest I value DRY a lot more than runtime performance. I don't write speed-critical applications. – Cosmin Prund Apr 14 '11 at 08:51
  • @Cosmin As a maintainer of a 25 year old codebase, I agree, DRY comes first – David Heffernan Apr 14 '11 at 09:01
0

My performance-oriented solution:

function StripNonAnsiChars(const AStr: String; const AIgnoreChars: TSysCharSet): string;
var
  lBuilder: TStringBuilder;
  I: Integer;
begin
  lBuilder := TStringBuilder.Create;
  try
    for I := 1 to AStr.Length do
      if CharInSet(AStr[I], [#32..#127] + AIgnoreChars) then
        lBuilder.Append(AStr[I]);
    Result := lBuilder.ToString;
  finally
    FreeAndNil(lBuilder);
  end;
end;

I wrote this in Delphi XE7.

0

My version, which returns the result as an array of byte:

interface

type
  TSBox = array of byte;

And the function:

function StripNonAscii(buf: array of byte): TSBox;
var temp: TSBox;
    countr, countr2: integer;
const validchars : TSysCharSet = [#32..#127];
begin
Result := nil; // return an empty array when the input is empty
if Length(buf) = 0 then exit;
countr2:= 0;
SetLength(temp, Length(buf)); // size temp to the length of buf
// Loop to High(buf), not Length(buf): open arrays are zero-based,
// so index Length(buf) would read past the end.
for countr := 0 to High(buf) do if CharInSet(chr(buf[countr]), validchars) then
  begin
    temp[countr2] := buf[countr];
    inc(countr2); // count valid chars
  end;
SetLength(temp, countr2); // trim to the number of bytes kept
Result := temp;
end;