How do I set word delimiters?

Question

The User's Guide, chapter 6.1.5 "The Word Chunk", says: "A word is a string of characters delimited by space, tab, or return characters or enclosed by double quotes." Is it possible to have additional word delimiters?

I have the following code snippet taken from the User's Guide chapter 6.5.1 'When to use arrays', p. 184

on mouseUp

   --cycle through each word adding each instance to an array
   repeat for each word tWord in field "sample text"
      add 1 to tWordCount[tWord]
   end repeat

   -- combine the array into text
   combine tWordCount using return and comma
   answer tWordCount

end mouseUp

It counts the number of occurrences of each word form in the field "sample text".

I realize that full stops after words are counted as part of the word with the default setting.

How do I change the settings so that a full stop (and/or a comma) is treated as a word boundary?

score 1 · Accepted Answer · answered May 18 '13 at 06:45

Alternatively, you could simply remove the offending characters before processing. This can be done with either the replace command or the replaceText function. replaceText accepts a regular expression as its match string but is slower than replace, so here I am using replace.

on mouseUp
   put field "sample" into twords
   --remove all trailing punctuation and quotes
   replace "." with "" in twords
   replace "," with "" in twords
   replace "?" with "" in twords
   replace ";" with "" in twords
   replace ":" with "" in twords
   replace quote with "" in twords
   --hyphenated words need to be separated?
   replace "-" with " " in twords

   repeat for each word tword in twords
      add 1 to twordcount[tword]
   end repeat
   combine twordcount using return and comma
   answer twordcount
end mouseUp
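
For comparison, here is a minimal sketch of the replaceText variant mentioned above. It assumes the same field "sample" and removes the same set of characters in a single regular-expression pass. The variable names (tText, tWord, tWordCount) are illustrative, and no timing claim is made beyond the remark in the answer that replaceText tends to be slower.

```livecode
on mouseUp
   local tText, tWord, tWordCount
   put field "sample" into tText
   -- delete the same punctuation marks and quotes in one regex pass
   put replaceText(tText, "[.,?;:" & quote & "]", empty) into tText
   -- hyphens still become spaces, as in the handler above
   replace "-" with space in tText
   repeat for each word tWord in tText
      add 1 to tWordCount[tWord]
   end repeat
   combine tWordCount using return and comma
   answer tWordCount
end mouseUp
```

The character class makes it easy to add further marks (for example "!") in one place instead of adding another replace line.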

This is an interesting solution as well. I wonder which one is faster. — z--, May 18 '13 at 19:40

dunbarx · Answer 2 · 2013-05-19T21:48:31.780

1

I think you are asking a question about delimiters. Some delimiters are built-in:

spaces for words,

commas for items,

return (CR) for lines.

The ability to create your own custom delimiter property (the itemDelimiter) is a powerful feature of the language, and pertains to "items". You can set this to any single character:

set the itemDelimiter to "C"

answer the number of items in "XXCXXCXX" --call this string "theText"

The result will be "3"

As others have pointed out, replacing one string with another gives you a great deal of control over custom parsing of text:

replace "C" with space in theText

yields "XX XX XX"

Craig Newman

edited May 19 '13 at 21:48

answered May 19 '13 at 20:13


Yes, for items I can set the delimiters and this is very useful. For the words however it is set to space, tab or return. And this means that a full stop following a word immediately is considered part of the word. My question is about the easiest way to get the "effective" words, i.e. the wordforms without punctuation included. – z-- May 20 '13 at 04:45

z-- · Answer 3 · 2013-05-18T06:39:52.127

As the User's Guide says in chapter 6.1.5, "The Word Chunk": "A word is a string of characters delimited by space, tab, or return characters or enclosed by double quotes."

There is itemDelimiter but not wordDelimiter.

So punctuation has to be removed first, before adding the word to the word count array.

This may be done with a function effectiveWord.

function effectiveWord aWord
   put last char of aWord into it
   if it is "." then delete last char of aWord
   if it is "," then delete last char of aWord
   if it is ":" then delete last char of aWord
   if it is ";" then delete last char of aWord
   return aWord
end effectiveWord



on mouseUp

   --cycle through each word adding each instance to an array
   repeat for each word tWord in field "Sample text"
      add 1 to tWordCount[effectiveWord(tWord)]
   end repeat

   -- combine the array into text
   combine tWordCount using return and comma
   answer tWordCount

end mouseUp
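
Note that effectiveWord as written removes at most one trailing character, so a word followed by an ellipsis ("Wait...") keeps two of its dots. A possible drop-in variant, sketched here under the assumption that the same four marks are the only ones of interest, loops until no trailing punctuation is left:

```livecode
function effectiveWord aWord
   -- keep deleting while the word ends in one of the four marks
   repeat while aWord is not empty and (the last char of aWord) is in ".,:;"
      delete the last char of aWord
   end repeat
   return aWord
end effectiveWord
```

A token made up only of punctuation comes back empty, so the calling handler may want to skip empty results before adding to the array.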

You can do this much more effectively with regex: replaceText(myVar,"[^a-zA-Z0-9]",empty). — Mark, May 18 '13 at 11:43
The type of text you seem to need this for can easily be converted to ASCII text. — Mark, May 20 '13 at 15:20
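
As a rough sketch, the regex from the comment above could be wrapped in a function of its own and called in place of effectiveWord. The name effectiveWordRegex is hypothetical and chosen only for this illustration.

```livecode
function effectiveWordRegex aWord
   -- strip every character that is not an ASCII letter or digit
   return replaceText(aWord, "[^a-zA-Z0-9]", empty)
end effectiveWordRegex
```

Because the pattern keeps only ASCII letters and digits, it also removes apostrophes, hyphens inside words, and accented letters, so it fits best when the text has already been reduced to plain ASCII as the second comment suggests.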
