I want to separate a string into two parts if a token from an array is found at the end of the string. I have tried this:

x = "Canton Female"
GENDER_TOKENS = ["m", "male", "men", "f", "w", "female", "wom"]

x.partition(/(^|[[:space:]]+)[#{Regexp.union(GENDER_TOKENS)}]$/i)
 #=> ["Canton Female", "", ""]

But although the word "female" is part of my tokens, it is not getting split out. How do I adjust my regex so that it gets split properly?

Sagar Pandya
Dave
  • What value do you want returned? – Sagar Pandya Dec 21 '17 at 18:09
  • You are making the same mistake: you use `Regexp.union` inside a regex literal and the `i` is not affecting these alternations. Also, you put this group into a character class, and it ruins the pattern altogether. Not sure what you need here, see [this demo](https://ideone.com/jCz5le), try `x.partition(/(?:^|[[:space:]]+)(?:#{Regexp.union(GENDER_TOKENS).source})$/i)` – Wiktor Stribiżew Dec 21 '17 at 18:13

3 Answers

I'm a little unclear what you are asking - what is the desired result? However, here's what I think you're looking for:

GENDER_TOKENS = ["m", "male", "men", "f", "w", "female", "wom"]

"Canton Female".split(/\b(#{Regexp.union(GENDER_TOKENS).source})$/i)
#=> ["Canton ", "Female"]

"Tom Lord".split(/\b(#{Regexp.union(GENDER_TOKENS).source})$/i)
#=> ["Tom Lord"]
  • String#split splits the string on each match and, because the pattern contains a capture group, includes the captured text in the result. String#partition instead returns [head, match, tail]. I think split is probably what you wanted?
  • \b is a word boundary anchor. This is a cleaner solution than trying to match on "start of line or whitespace".
  • The Regexp union is wrapped in round brackets to group the values together, not square brackets. The latter makes it a character set, which is clearly not what you wanted.
  • Regexp#source returns only the inner "text" of the regexp, unlike the (implicit) Regexp#to_s you were using, which also includes the option toggles, i.e. (?-mix:m|male|men|f|w|female|wom). The -i toggle in there switches case-insensitivity off for the group, so the outer /i flag never reaches the tokens.
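
To make the points above concrete, here is a minimal sketch that assumes the GENDER_TOKENS array from the question. The local variable names (`union`, `case_sensitive`, `case_insensitive`) are only illustrative.

```ruby
GENDER_TOKENS = ["m", "male", "men", "f", "w", "female", "wom"]
union = Regexp.union(GENDER_TOKENS)

# Regexp#source is the bare alternation; Regexp#to_s wraps it in option toggles.
source_text = union.source  #=> "m|male|men|f|w|female|wom"
to_s_text   = union.to_s    #=> "(?-mix:m|male|men|f|w|female|wom)"

# Interpolating the Regexp itself calls to_s, so the group's own "-i"
# overrides the outer /i flag. Interpolating the source avoids that.
case_sensitive   = /\b(#{union})$/i
case_insensitive = /\b(#{union.source})$/i

case_sensitive.match?("Canton Female")    #=> false
case_insensitive.match?("Canton Female")  #=> true

# partition keeps the empty tail; split drops it.
partitioned = "Canton Female".partition(case_insensitive)
  #=> ["Canton ", "Female", ""]
split_up = "Canton Female".split(case_insensitive)
  #=> ["Canton ", "Female"]
```

If you would rather keep String#partition, the same corrected pattern works; you just get an extra empty string for the tail.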
Tom Lord
  • Worth noting the original example had the `Regexp.union` part within `[...]` brackets (set of characters) which makes it behave completely differently. – tadman Dec 21 '17 at 19:37

Why not split first?

parts = x.split
if GENDER_TOKENS.include? parts.last.downcase
  # ...
end

Probably not much slower, and far more readable.
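
For illustration, the same idea fleshed out into a complete method. This is only a sketch: the name `split_gender_token` is hypothetical, and it assumes you are happy for runs of whitespace in the leading part to be collapsed into single spaces.

```ruby
GENDER_TOKENS = %w[m male men f w female wom]

# Hypothetical helper: returns [head, token] when the last word is a
# gender token, otherwise the original string in a one-element array.
def split_gender_token(str)
  *head, last = str.split
  # last is nil for an empty or whitespace-only string
  return [str] unless last && GENDER_TOKENS.include?(last.downcase)

  [head.join(" "), last]
end

split_gender_token("Canton Female")  #=> ["Canton", "Female"]
split_gender_token("Tom Lord")       #=> ["Tom Lord"]
split_gender_token("wom")            #=> ["", "wom"]
```

The nil check matters because `"".split` returns an empty array, so calling `downcase` on its last element would raise a NoMethodError.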

Max
GENDER_TOKENS = %w[m male men f w female wom]
GENDER_REGEX = /\b(?:#{GENDER_TOKENS.join('|')})\z/i
  #=> /\b(?:m|male|men|f|w|female|wom)\z/i

def split_off_token(str)
  idx = str =~ GENDER_REGEX
  case idx
  when nil
    [str]
  when 0
    ['', str]
  else
    [str[0, idx].rstrip, str[idx..-1]]
  end
end

split_off_token("Canton Female")
  #=> ["Canton", "Female"]
split_off_token("Canton M")
  #=> ["Canton", "M"]
split_off_token("wom")
  #=> ["", "wom"]
split_off_token("Canton Fella")
  #=> ["Canton Fella"]
Cary Swoveland