4

What I want is something like

"word1 word2 word3".rangeOfWord(2) => 6 to 10

The result could come as a Range or a tuple or whatever.

I'd rather not do the brute force of iterating over the characters and using a state machine. Why reinvent the lexer? Is there a better way?

Brian Tompsett - 汤莱恩
Andrew Duncan
  • It's hardly a "lexer" you'd be implementing! – Noldorin Dec 23 '15 at 22:18
  • Hi, Andrew - Do you know about NSLinguisticTagger? — Or, in your rather simple-minded example, wouldn't NSRegularExpression be sufficient? – matt Dec 23 '15 at 22:35
  • You know, as a (former) Perl hacker, I should have thought of REs. Although I'm not interested in just finding the Nth word, but finding its range. Could I do that with an RE? Not a pure computer-science RE, of course, but maybe with enhanced ones. – Andrew Duncan Dec 24 '15 at 00:19
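
The regular-expression idea raised in these comments can be sketched as follows. This is only an illustration in current Swift, not code from the thread: the helper name rangeOfWordUsingRegex is hypothetical, and it assumes a "word" is any run of non-whitespace characters (pattern \S+), with n counted from 1.

```swift
import Foundation

extension String {
    /// Hypothetical helper: range of the n-th (1-based) run of
    /// non-whitespace characters, or nil if there is no such word.
    func rangeOfWordUsingRegex(_ n: Int) -> Range<String.Index>? {
        guard n > 0, let regex = try? NSRegularExpression(pattern: "\\S+") else {
            return nil
        }
        // NSRegularExpression works on UTF-16 based NSRange values.
        let whole = NSRange(startIndex..<endIndex, in: self)
        let matches = regex.matches(in: self, options: [], range: whole)
        guard n <= matches.count else { return nil }
        // Convert the match's NSRange back to a Range<String.Index>.
        return Range(matches[n - 1].range, in: self)
    }
}

let text = "word1 word2 word3"
if let r = text.rangeOfWordUsingRegex(2) {
    print(text[r]) // word2
    print(text.distance(from: text.startIndex, to: r.lowerBound)) // 6
}
```

The resulting range is half-open, so for the question's example it covers offsets 6..<11 rather than "6 to 10".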

5 Answers

4

In your example, your words are unique, and you can use the following method:

let myString = "word1 word2 word3"
let wordNum = 2
let myRange = myString.rangeOfString(myString.componentsSeparatedByString(" ")[wordNum-1])
    // 6..<11

As pointed out by Andrew Duncan in the comments below, the above is only valid if your words are unique. If you have non-unique words, you can use this somewhat less neat method:

let myString = "word1 word2 word3 word2 word1 word3 word1"
let wordNum = 7 // 2nd instance (out of 3) of "word1"
let arr = myString.componentsSeparatedByString(" ")
var fromIndex = arr[0..<wordNum-1].map { $0.characters.count }.reduce(0, combine: +) + wordNum - 1

let myRange = Range<String.Index>(start: myString.startIndex.advancedBy(fromIndex), end: myString.startIndex.advancedBy(fromIndex+arr[wordNum-1].characters.count))
let myWord = myString.substringWithRange(myRange) 
    // string "word1" (from range 36..<41)

Finally, let's use the latter to construct an extension of String, as you asked for in your question's example:

extension String {
    private func rangeOfNthWord(wordNum: Int, wordSeparator: String) -> Range<String.Index>? {
        // Operate on self, not on a global variable.
        let arr = self.componentsSeparatedByString(wordSeparator)

        // wordNum is 1-based; reject out-of-range values instead of crashing.
        if wordNum < 1 || arr.count < wordNum {
            return nil
        }
        else {
            // Offset of the word = lengths of all preceding words + lengths of the separators between them.
            let fromIndex = arr[0..<wordNum-1].map { $0.characters.count }.reduce(0, combine: +) + (wordNum - 1)*wordSeparator.characters.count
            return Range<String.Index>(start: self.startIndex.advancedBy(fromIndex), end: self.startIndex.advancedBy(fromIndex+arr[wordNum-1].characters.count))
        }
    }
}

let myString = "word1 word2 word3 word2 word1 word3 word1"
let wordNum = 7 // 2nd instance (out of 3) of "word1"

if let myRange = myString.rangeOfNthWord(wordNum, wordSeparator: " ") {
        // myRange: 36..<41
    print(myString.substringWithRange(myRange)) // prints "word1"
}

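If the separator can repeat (say, some words are separated by two spaces "  "), the .rangeOfNthWord(...) method needs tweaking: componentsSeparatedByString then yields empty strings between consecutive separators, and those would be counted as words.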
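
One possible tweak for repeated separators, sketched in current Swift rather than the Swift 2 syntax used above. The method name rangeOfNthWordCollapsingSeparators is hypothetical, and the sketch assumes a single-Character separator. It relies on split(separator:) dropping empty pieces by default and on each Substring sharing its indices with the original string.

```swift
extension String {
    /// Hypothetical variant: 1-based word lookup that ignores the empty
    /// pieces produced by consecutive separators.
    func rangeOfNthWordCollapsingSeparators(_ wordNum: Int, separator: Character = " ") -> Range<String.Index>? {
        guard wordNum > 0 else { return nil }
        // split(separator:) omits empty subsequences by default, and each
        // Substring keeps indices that are valid in the original string.
        let words = self.split(separator: separator)
        guard wordNum <= words.count else { return nil }
        let word = words[wordNum - 1]
        return word.startIndex..<word.endIndex
    }
}

let spaced = "word1  word2   word3"
if let r = spaced.rangeOfNthWordCollapsingSeparators(3) {
    print(spaced[r]) // word3
}
```

Because the Substring already carries its position, no manual offset arithmetic is needed.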


As also pointed out in the comments below, the use of .rangeOfString(...) is not, per se, pure Swift. It is, however, by no means bad practice. From Swift Language Guide - Strings and Characters:

Swift’s String type is bridged with Foundation’s NSString class. If you are working with the Foundation framework in Cocoa, the entire NSString API is available to call on any String value you create when type cast to NSString, as described in AnyObject. You can also use a String value with any API that requires an NSString instance.

See also the NSString class reference for the rangeOfString method:

// Swift Declaration:
func rangeOfString(_ searchString: String) -> NSRange
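
Note that the declaration above is the NSString one, which returns an NSRange; called on a Swift String, the method returns an optional Range<String.Index> instead. A small sketch of moving between the two, written in current Swift where the method is spelled range(of:) (the variable names are illustrative only):

```swift
import Foundation

let sentence = "word1 word2 word3"

// Called on a Swift String, range(of:) yields an optional Range<String.Index>.
if let swiftRange = sentence.range(of: "word2") {
    // Convert to the UTF-16 based NSRange that NSString APIs expect.
    let nsRange = NSRange(swiftRange, in: sentence)
    print(nsRange.location, nsRange.length) // 6 5

    // Converting back is failable, since an NSRange may not fall on
    // valid boundaries of the string.
    if let roundTripped = Range(nsRange, in: sentence) {
        print(sentence[roundTripped]) // word2
    }
}
```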
dfrib
1

I went ahead and wrote the state machine. (Grumble..) FWIW, here it is:

extension String {
    private func halfOpenIntervalOfBlock(n:Int, separator sep:Character? = nil) -> (Int, Int)? {
        enum State {
            case InSeparator
            case InPrecedingSeparator
            case InWord
            case InTarget
            case Done
        }

        guard n > 0 else {
            return nil
        }

        var state:State
        if n == 1 {
            state = .InPrecedingSeparator
        } else {
            state = .InSeparator
        }

        var separatorNum = 0
        var startIndex:Int = 0
        var endIndex:Int = 0

        for (i, c) in self.characters.enumerate() {
            let inSeparator:Bool
            // A bit inefficient to keep doing this test.
            if let s = sep {
                inSeparator = c == s
            } else {
                inSeparator = c == " " || c == "\n"
            }
            endIndex = i

            switch state {
            case .InPrecedingSeparator:
                if !inSeparator {
                    state = .InTarget
                    startIndex = i
                }

            case .InTarget:
                if inSeparator {
                    state = .Done
                }

            case .InWord:
                if inSeparator {
                    separatorNum += 1
                    if separatorNum == n - 1 {
                        state = .InPrecedingSeparator
                    } else {
                        state = .InSeparator
                    }
                }

            case .InSeparator:
                if !inSeparator {
                    state = .InWord
                }

            case .Done:
                break
            }

            if state == .Done {
                break
            }
        }

        if state == .Done {
            return (startIndex, endIndex)
        } else if state == .InTarget {
            return (startIndex, endIndex + 1) // We ran off end.
        } else {
            return nil
        }
    }

    func rangeOfWord(n:Int) -> Range<Index>? {
        guard let (s, e) = self.halfOpenIntervalOfBlock(n) else {
            return nil
        }
        let ss = self.startIndex.advancedBy(s)
        let ee = self.startIndex.advancedBy(e)
        return Range(start:ss, end:ee)
    }

 }
Andrew Duncan
1

It's not really clear whether the string should be treated as words divided by whatever separators it contains, or whether you're just looking for a specific occurrence of a substring. Either way, I think both cases can be addressed like this:

import Foundation // range(of:) comes from Foundation

extension String {
    func enumerateOccurencies(of pattern: String, _ body: (Range<String.Index>, inout Bool) throws -> Void) rethrows {
        guard
            !pattern.isEmpty,
            count >= pattern.count
        else { return }
    
        var stop = false
        var lo = startIndex
        while !stop && lo < endIndex {
            guard 
                let r = self[lo..<endIndex].range(of: pattern)
            else { break }
            
            try body(r, &stop)
            lo = r.upperBound
        }
    }
    
}

In the body closure, set stop to true once you reach the desired occurrence number, and capture the range passed to the closure:

let words = "word1, word1, word2, word3, word1, word3"
var matches = 0
var rangeOfThirdOccurencyOfWord1: Range<String.Index>? = nil
words.enumerateOccurencies(of: "word1") { range, stop in 
    matches += 1
    stop = matches == 3
    if stop {
        rangeOfThirdOccurencyOfWord1 = range
    } 
}

Regarding the DFA: I recently wrote one that leverages Hashable and uses an Array of Dictionaries as its state nodes, but I found the method above to be faster, perhaps because range(of:) uses fingerprinting.

UPDATE

Alternatively, you could get the API you asked for like this:

import Foundation

extension String {
    func rangeOfWord(order: Int, separator: String) -> Range<String.Index>? {
        precondition(order > 0)
        guard
            !isEmpty,
            !separator.isEmpty,
            separator.count < count
        else { return nil }
        
        var wordsSoFar = 0
        var lo = startIndex
        while let r = self[lo..<endIndex].range(of: separator) {
            guard
                r.lowerBound != lo
            else {
                lo = r.upperBound
                continue
            }
            wordsSoFar += 1
            guard
                wordsSoFar < order
            else { return lo..<r.lowerBound }
            
            lo = r.upperBound
        }
        
        if
            lo < endIndex,
            wordsSoFar + 1 == order
        {
            return lo..<endIndex
        }
        
        return nil
    }
}

let words = "word anotherWord oneMore lastOne"
if let r = words.rangeOfWord(order: 4, separator: " ") {
    print(words[r])
} else {
    print("not found")
}

Here the order parameter is the position of the word in the string, counting from 1. I've also added the separator parameter to specify the string token used to delimit words (it could be given a default value of " " so the function can be called without specifying it).

valeCocoa
1

Here's my attempt at an updated answer in Swift 5.5:

import Foundation

extension String {

    func rangeOfWord(atPosition wordAt: Int) -> Range<String.Index>? {
        let fullrange = self.startIndex..<self.endIndex
        var count = 0
        var foundAt: Range<String.Index>? = nil

        self.enumerateSubstrings(in: fullrange, options: .byWords) { _, substringRange, _, stop in
            count += 1
            if count == wordAt {
                foundAt = substringRange
                stop = true  // Stop the enumeration after the word range is found.
            }
        }

        return foundAt
    }
}

let lorem = "Morbi leo risus, porta ac consectetur ac, vestibulum at eros."

if let found = lorem.rangeOfWord(atPosition: 8) {
    print("found: \(lorem[found])")
} else {
    print("not found.")
}

This solution doesn't build a new array to hold the words, so in theory it should use less memory (I have not measured this). It also relies on a built-in method as much as possible, which leaves less room for bugs.
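
One practical difference from the separator-based answers is worth illustrating: .byWords uses linguistic word boundaries, so punctuation is not part of the reported word. The sketch below collects the words into an array purely to make the comparison visible (the names sample, bySpaces and words are illustrative only).

```swift
import Foundation

let sample = "Morbi leo risus, porta ac"

// Splitting on spaces keeps punctuation attached to the word.
let bySpaces = sample.components(separatedBy: " ")
print(bySpaces[2]) // risus,

// .byWords reports only the word itself, so the comma is excluded.
var words: [String] = []
sample.enumerateSubstrings(in: sample.startIndex..<sample.endIndex, options: .byWords) { word, _, _, _ in
    if let word = word { words.append(word) }
}
print(words[2]) // risus
```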

weyhan
0

Swift 5 solution, which allows you to specify the word separator (wordIndex is zero-based):

import Foundation

extension String {
    func rangeOfWord(atIndex wordIndex: Int, separator: String = " ") -> Range<String.Index>? {
        let wordComponents = self.components(separatedBy: separator)
        guard wordIndex >= 0, wordIndex < wordComponents.count else {
            return nil
        }
        // Offsets are counted in UTF-16 code units to match String.Index(utf16Offset:in:).
        let precedingCount = wordComponents[0..<wordIndex].map { $0.utf16.count }.reduce(0, +)
        // Each preceding word is followed by exactly one separator.
        let startOffset = precedingCount + wordIndex * separator.utf16.count
        let start = String.Index(utf16Offset: startOffset, in: self)
        let end = String.Index(utf16Offset: startOffset + wordComponents[wordIndex].utf16.count, in: self)
        return start..<end
    }
}
ultraflex