5

I'm working with very large (45,000,000+ character) strings in VBA, and I need to remove superfluous whitespace.

One space (aka, ASCII Code 32) is okay but any sections with two or more consecutive spaces should be reduced to only one.

I found a similar question here, although that OP's definition of a "very long string" was only 39,000 characters. The accepted answer was a loop using Replace:

Function MyTrim(s As String) As String
    Do While InStr(s, "  ") > 0
        s = Replace$(s, "  ", " ")
    Loop
    MyTrim = Trim$(s)
End Function

I tried this method and it was "worked", but was painfully slow:

Len In:  44930886 
Len Out: 35322469
Runtime: 247.6 seconds

Is there a faster way to remove whitespace from a "very large" string?

ashleedawg
  • 20,365
  • 9
  • 72
  • 105

2 Answers2

6

I suspect the performance problem is due to creating a very large number of large intermediate strings. So, any method that does things without creating intermediate strings or with much fewer would perform better.

A Regex replace has a good chance of that.

Option Explicit

Sub Test(ByVal text As String)

  Static Regex As Object
  If Regex Is Nothing Then
    Set Regex = CreateObject("VBScript.RegExp")
    Regex.Global = True
    Regex.MultiLine = True
  End If

  Regex.Pattern = " +" ' space, one or more times

  Dim result As String: result = Regex.Replace(text, " ")
  Debug.Print Len(result), Left(result, 20)
End Sub

With an input string of 45 million characters takes about a second.

Runner:

Sub Main()

  Const ForReading As Integer = 1
  Const FormatUTF16 As Integer = -1 ' aka TriStateTrue
  Dim fso As Object: Set fso = CreateObject("Scripting.FileSystemObject")
  Dim file As Object: Set file = fso.OpenTextFile("C:\ProgramData\test.txt", ForReading, False, FormatUTF16)
  Dim text As String: text = file.ReadAll()
  Set file = Nothing
  Set fso = Nothing
  Debug.Print Len(text), Left(text, 20)

  Test (text)

End Sub

Test data creator (C#):

var substring = "××\n× ××   ";
var text = String.Join("", Enumerable.Repeat(substring, 45_000_000 / substring.Length));
var encoding = new UnicodeEncoding(false, false);
File.WriteAllText(@"C:\ProgramData\test.txt", text, encoding);

BTW—Since VBA (VB4, Java, JavaScript, C#, VB, …) uses UTF-16, the space character is the one UTF-16 code unit ChrW(32). (Any similarity to or comparison with ASCII, is unnecessary mental gymnastics, and if put into code as ANSI [Chr(32)], unnecessary conversion behind the scenes, with different behavior for different machines, users and times.)

Tom Blodget
  • 20,260
  • 3
  • 39
  • 72
  • 1
    **Excellent**, your `Test` procedure works amazingly: `44930886 bytes in 1.6 seconds`. I found that using your method of *opening* the file with FileSystemObject [after I changed `FormatUTF16` to `1`; the only option that worked with my file] takes about twice as long as good ol' `Open fName For Input As #1: myText= Input(LOF(1), #1): Close #1`, but that's couple seconds is irrelevant either way since my question was talking about *4-hour runtimes*! I appreciate it; this opens up a new door with my overall project... Thanks! (also to @florent-b!) – ashleedawg Feb 10 '18 at 20:40
  • (This was the first question I answered myself, but it turn out better than I expected... should I "flag" my own answer as ***"lame"***? lol) – ashleedawg Feb 10 '18 at 20:41
1

In VBA, the size of a String is limited to approximately 2 Billion Characters. The "Replace-Loop" method above took 247 seconds for a 45 Million character string, which is over 4 minutes.

Theoretically, that means a 2 Billion character string would take at least 3 hours — if it even finished without crashing — so it's not exactly practical.

Excel has a built-in worksheet function Trim which is not the same as VBA's Trim function.

Worksheet function Trim removes all spaces from text except for single spaces between words.

The problem is that Trim, like all functions called with Application.WorksheetFunction, has a size limit of 32,767 characters, and this [unfortunately] applies even when calling the function from VBA with a string that's not even in a cell.

However, we can still use the function if we use it to loop through our "gigantic string" in sections, like this:

EDIT: Don't even bother with this crap (my function, below)! See the RegEx answer above.

Function bigTrim(strIn As String) As String

    Const maxLen = 32766
    Dim loops As Long, x As Long
    loops = Int(Len(strIn) / maxLen)
    If (Len(strIn) / maxLen) <> loops Then loops = loops + 1

    For x = 1 To loops
        bigTrim = bigTrim & _
            Application.WorksheetFunction.Trim(Mid(strIn, _
            ((x - 1) * maxLen) + 1, maxLen))
    Next x

End Function

Running this function on the same string that was used with the "Replace-Loop" method yielded much better results:

Len In:  44930886 
Len Out: 35321845
Runtime: 33.6 seconds

That's more than 7x faster than the "Replace-Loop" method, and managed to remove 624 spaces that were somehow missed by the other method.

(I though about looking into why the first method missed characters, but since I know my string isn't missing anything, and the point of this exercise was to save time, that would be silly!)

ashleedawg
  • 20,365
  • 9
  • 72
  • 105
  • 3
    There's 2 issues with your solution. It will not work if a serie of white-space is separated by the `Mid` and string concatenation is an expressive operation which should no be used within a loop. You could easily trim a string of 40000000 characters with a regular expression in about a second: `re.Pattern = "\s+"`, `Trim(re.Replace(strIn, " "))`. – Florent B. Feb 10 '18 at 15:13
  • 2
    @FlorentB. That sounds like it has the makings of an answer. – QHarr Feb 10 '18 at 17:11
  • OMG *I didn't even think of **RegEx** (probably partly since I haven't had much luck figuring it out in the past!) I will try that now and report back...! – ashleedawg Feb 10 '18 at 18:13
  • well this has gone on my favs list. – QHarr Feb 10 '18 at 18:16