2

I wrote my own maximum matching function in Swift to divide Chinese sentences into words. It works fine, except with abnormally long sentences the memory usage goes up over 1 gb. I need help figuring out how to modify my code so that there isn't this memory problem. I'm not sure if it has to do with how I am using RealmSwift or if it is my algorithm in general.

Here is my code:

    func splitSentenceIntoWordsWithDictionaryMaximumMatching(string: String) -> [String] {
    var string = string
    var foundWordsArray: [String] = []
    var position = count(string)

    while position > 0
    {
        var index = advance(string.startIndex, position)
        let partialString = string.substringToIndex(index)
        if let found = Realm().objects(Word).filter("simplified == '\(partialString)'").first
        {
            foundWordsArray.append(partialString)
            position = position - 1
            var partialStringCount = count(partialString)
            while partialStringCount > 0
            {
                string = dropFirst(string)
                partialStringCount -= 1
            }
            position = count(string)
            index = advance(string.startIndex, position)
        }
        else if count(partialString) == 1
        {
            addNewEntryToDictionaryInTransaction(partialString, "", [partialString], partialString)
            foundWordsArray.append(partialString)
            var partialStringCount = count(partialString)
            while partialStringCount > 0
            {
                string = dropFirst(string)
                partialStringCount -= 1
            }
            position = count(string)
            index = advance(string.startIndex, position)
        }
        else
        {
            position = position - 1
            index = advance(string.startIndex, position)
        }
    }

    return foundWordsArray
}
webmagnets
  • 2,266
  • 3
  • 33
  • 60

1 Answers1

2

In this case, you should use autoreleasepool (see: Use Local Autorelease Pool Blocks to Reduce Peak Memory Footprint) in the loop:

    while position > 0 {
        autoreleasepool {
            var index = advance(string.startIndex, position)
            ...
        }
    }

BTW, your code has too many memory copy operations and O(N) operations on String.Index (count() and advance()), that may cause serious performance problems. Instead, you should effectively use String.Index something like this :

import Foundation

var  words:Set = ["今日", "献立", "魚", "味噌汁", "定食", "焼き魚", "です"]
func splitSentenceIntoWordsWithDictionaryMaximumMatching(string: String) -> [String] {
    var foundWordsArray: [String] = []
    var start = string.startIndex
    var end = string.endIndex

    while start != end {
        autoreleasepool { // In this case(using builtin `Set`), I think we don't need `autoreleasepool` here. But this is just a demo :)
            let partialString = string[start ..< end]
            if !words.contains(partialString) {
                if end.predecessor() != start { // faster way of `count(partialString) == 1`
                    end = end.predecessor()
                    return // we cannot use `continue` here because we are in `autoreleasepool` closure
                }
                words.insert(partialString)
            }
            foundWordsArray.append(partialString)
            start = end
            end = string.endIndex
        }
    }

    return foundWordsArray
}

var str = "今日の献立は焼き魚定食と味噌汁です"
let result = splitSentenceIntoWordsWithDictionaryMaximumMatching(str)

debugPrintln(result) // ["今日", "の", "献立", "は", "焼き魚", "定食", "と", "味噌汁", "です"]
debugPrintln(words) // Set(["献立", "焼き魚", "今日", "魚", "味噌汁", "定食", "は", "と", "です", "の"])

I used builtin Set, but I think you can easily adopts your Realm codes here :)


ADDED: in reply to the comments:

reversed version:

var  words:Set = ["研究", "研究生", "生命", "起源"]
func splitSentenceIntoWordsWithDictionaryMaximumMatchingReversed(string: String) -> [String] {
    var foundWordsArray: [String] = []
    var start = string.startIndex
    var end = string.endIndex

    while start != end {
        autoreleasepool {
            let partialString = string[start ..< end]
            if !words.contains(partialString) {
                if start.successor() != end {
                    start = start.successor()
                    return
                }
                words.insert(partialString)
            }
            foundWordsArray.append(partialString)
            end = start
            start = string.startIndex
        }
    }

    return foundWordsArray.reverse()
}

var str = "研究生命起源"
let result = splitSentenceIntoWordsWithDictionaryMaximumMatchingReversed(str)

debugPrintln(result) // ["研究", "生命", "起源"]
rintaro
  • 51,423
  • 14
  • 131
  • 139
  • Thanks that worked. With the Realm code you still need the autoreleasepool. – webmagnets May 25 '15 at 15:02
  • Could you give me a hint how to do this in reverse? In addition to the way this function does it, I need another function that matches from the back of the string instead of from the front. The part I am having most trouble with is ```let partialString = string[start ..< end]```. Should it be something like ```let partialString = string[end ..< start]```? (of course that doesn't work)` ```fatal error: Can't form Range with end < start ``` – webmagnets May 25 '15 at 16:26
  • I can award you a bounty if you help me do the same thing in reverse. – webmagnets May 26 '15 at 19:42
  • I'm not sure what "reverse" you mean. When you have words `"ABC", "CDE", "EDC", "CBA"` and a string `"ABCDE"`, what you expect as a result? BTW, I don't need bounty or reps for this :) – rintaro May 27 '15 at 08:42
  • I mean if the text is "研究生命起源" then forward would return "["研究生","命","起源"] because "研究生" is the longest word in the front. Reverse would return ["研究","生命","起源"]. because it would start from the end and find "生命" before it finally finds "研究". – webmagnets May 27 '15 at 12:03
  • Both functions would be returning valid results because both "研究生" and "研究" are in the dictionary as well as both "生命" and "命". – webmagnets May 27 '15 at 12:09
  • Thank you. Bounty in 23 hours. – webmagnets May 27 '15 at 13:27