I'm trying to parse a large TXT file line by line (6 million lines, 200 MB) using if statements with the String.contains(String) method. At the moment it is very slow. Is there a way to improve the speed?

I know there's also String.firstIndexOf but that seems to be slower. Regex is probably slower too.

Importing the TXT and splitting lines:

    let content = try String(contentsOfFile: path, encoding: .ascii)
    print("LOADED 0")
    return content.components(separatedBy: "\n")

Parsing:

    if line.contains("<TAG1>") {
        let thisline = line
            .replacingOccurrences(of: "<TAG1>", with: "")
            .replacingOccurrences(of: "</TAG1>", with: "")
        text = "\(text)\n\(thisline): "
    } else if line.contains("<TAG2>") {
        let thisline = line
            .replacingOccurrences(of: "<TAG2>", with: "")
            .replacingOccurrences(of: "</TAG2>", with: "")
        text = "\(text) - \(thisline) "
    }

There will probably be more if statements, which will probably slow down the parsing even more.

It would be awesome if the speed could be improved. It currently takes approx. 5-10 minutes on my MacBook, depending on the file size.

Edit: It seems like string + " \n " + string2 is faster than "\(string) \n \(string2)", but it doesn't help much.

Edit2: I've added a progress bar to the application, and the parsing seems to start fast and get slower towards the end.

user2531284
  • Are you parsing XML? HTML? – rmaddy Sep 27 '19 at 18:18
  • It's actually XML but I renamed it to TXT and am acting like it's a Textdocument to parse it easier (I only need very specific parts of it) – user2531284 Sep 27 '19 at 18:21
  • I'm not sure how smart the compiler is, but `text = "\(text)\n\(thisline): "` could get very slow/expensive as the string gets ever longer. Try `text.append("\n\(thisline): ")` - you may be surprised at the speedup. (Your "Edit2" comment strongly suggests that it is the copying of an ever growing string that is eating the time) – Chris Sep 27 '19 at 18:22
  • Use an XML parser such as the `XMLParser` class. – rmaddy Sep 27 '19 at 18:23
  • @Chris: Awesome! It seems obvious now, but I didn't see it. Thank you very much. If you post an answer, I'll accept it as the solution. – user2531284 Sep 27 '19 at 18:26

1 Answer

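Building up your final text variable this way causes an ever-growing string to be copied (with a small addition) for every line and then reassigned to text. Because each copy is longer than the last, the total work grows roughly quadratically with the number of lines, which matches the slowdown towards the end described in your Edit2.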

    // Slow
    text = "\(text)\n\(thisline): "

Appending just the addition to the original variable will be much quicker:

    // Fast(er)
    text.append("\n\(thisline): ")
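
As a rough illustration, the loop from the question could be rewritten around append like this. This is only a minimal sketch: the function names stripTag and buildText and the sample lines are made up for the example, and the tag handling simply mirrors the two branches shown in the question.

```swift
import Foundation

/// Hypothetical helper (not from the question): removes the opening and
/// closing form of `tag` from a line.
func stripTag(_ line: String, tag: String) -> String {
    return line
        .replacingOccurrences(of: "<\(tag)>", with: "")
        .replacingOccurrences(of: "</\(tag)>", with: "")
}

/// Hypothetical wrapper around the question's loop. Each matching line is
/// appended to one buffer in place instead of rebuilding the whole string.
func buildText(from lines: [String]) -> String {
    var text = ""
    for line in lines {
        if line.contains("<TAG1>") {
            text.append("\n\(stripTag(line, tag: "TAG1")): ")
        } else if line.contains("<TAG2>") {
            text.append(" - \(stripTag(line, tag: "TAG2")) ")
        }
    }
    return text
}

// Made-up sample input, standing in for the lines of the real file.
let sample = ["<TAG1>Title</TAG1>", "ignored", "<TAG2>Detail</TAG2>"]
let result = buildText(from: sample)
print(result)
```

The output format is the same as in the original code; only the way the string grows has changed.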

Depending on the required level of sophistication (and on whether this is a one-time transformation or something that will happen frequently), you may want to look into @rmaddy's suggestion of using a proper parser.
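
For reference, here is a minimal sketch of what the XMLParser route might look like. The class name TagCollector, the root element, and the sample XML are invented for the example, and it assumes the tag contents are plain text without nested elements. The output format copies the one from the question.

```swift
import Foundation
#if canImport(FoundationXML)
import FoundationXML  // XMLParser lives here on non-Apple platforms
#endif

/// Hypothetical delegate: collects the text of TAG1 and TAG2 elements
/// in the same format the question builds by hand.
final class TagCollector: NSObject, XMLParserDelegate {
    private(set) var text = ""
    private var current = ""

    func parser(_ parser: XMLParser, didStartElement elementName: String,
                namespaceURI: String?, qualifiedName qName: String?,
                attributes attributeDict: [String: String] = [:]) {
        current = ""  // start collecting characters for the new element
    }

    func parser(_ parser: XMLParser, foundCharacters string: String) {
        current.append(string)  // may be called several times per element
    }

    func parser(_ parser: XMLParser, didEndElement elementName: String,
                namespaceURI: String?, qualifiedName qName: String?) {
        switch elementName {
        case "TAG1": text.append("\n\(current): ")
        case "TAG2": text.append(" - \(current) ")
        default: break  // every other element is ignored
        }
        current = ""
    }
}

// Made-up sample document, standing in for the real file.
let xml = "<root><TAG1>Title</TAG1><other>skip</other><TAG2>Detail</TAG2></root>"
let collector = TagCollector()
let parser = XMLParser(data: Data(xml.utf8))
parser.delegate = collector
let ok = parser.parse()
print(ok, collector.text)
```

For a real file, XMLParser(contentsOf:) takes a file URL, so the document does not have to be split into an array of lines first.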

Chris