
Hello programmers around the world. Is there any way to create a tokenizer that detects only English words in a given string? For example, given this string:

"JGS (8/8/01 5:20:19 PM); We need enabled; disabled & hover icons for the following actions:; CopyToClipboardActionDelegate; RelaunchActionDelegate; TerminateAndRemoveActionDelegate; ; DW "

the expected result would be: we, need, enabled, disabled, hover, icons, for, the, following, actions, Copy, To, Clipboard, Action, Delegate, Relaunch, Action, Delegate, Terminate, And, Remove, Action, Delegate, and so on. Is such a thing even possible?

I've tried word_tokenize and also tried to find patterns within the string using regexes, but so far I can't get the result I need.

This is what I have so far. P.S. I know that this method of tokenization cannot achieve what I am looking for; I just don't know how to write regexes :(

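from nltk.tokenize import word_tokenize

def tokenization(series):
    # series is expected to hold one string per entry (e.g. a pandas Series)
    token_lists = []
    for text in series.to_numpy():
        token_lists.append(word_tokenize(text))
    return token_lists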

I have also tried this and failed miserably

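import re

def tokenization(series):
    token_lists = []
    for text in series.to_numpy():
        # ^ and $ anchor the pattern to the whole string, so this only
        # matches when the entire string consists of letters.
        token_lists.append(re.findall('^[a-zA-Z]*$', text))
    return token_lists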

The result I get is: 'JGS', '(', '8/8/01', '5:20:19', 'PM', ')', ';', 'We', 'need', 'enabled', ';', 'disabled', '&', 'hover', 'icons', 'for', 'the', 'following', 'actions', ':', ';', 'CopyToClipboardActionDelegate', ';', 'RelaunchActionDelegate', ';', 'TerminateAndRemoveActionDelegate', ';', ';', 'DW', '(', '9/24/2001', '2:22:48', 'PM', ')', ';', 'Use', 'the', 'standard', 'copy', 'icon', 'for', 'copy', 'to', 'clipboard', '(', 'desktop', 'likely', 'exposes', 'it', ')', '.', ';', ';', 'DW', '(', '9/24/2001', '2:23:05', 'PM', ')', ';', 'Made', 'requests', 'for', ';', 'Relaunch', ';', 'Terminate', 'All', ';', 'Terminate', '&', 'Remove'

As stated above, the output should be pretty much the same as far as tokenizing goes, but only words should be present. If anybody has any ideas, all help would be appreciated.
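To make the target concrete, here is a minimal sketch of one way to get close to the desired output with `re` alone. It assumes that a "word" is a run of ASCII letters and that camel-case identifiers should be split at case boundaries. The name `alpha_tokens` and both patterns are made up for this illustration and are not taken from NLTK or any other library.

```python
import re

# Runs of ASCII letters; digits and punctuation act as separators.
LETTERS = re.compile(r"[A-Za-z]+")
# Pieces inside a letter run: an all-caps chunk (e.g. "PM"), or an
# optionally capitalised lowercase word (e.g. "Copy", "need").
CAMEL = re.compile(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+")

def alpha_tokens(text):
    """Return letter-only tokens, splitting camel-case runs into parts."""
    tokens = []
    for run in LETTERS.findall(text):
        tokens.extend(CAMEL.findall(run))
    return tokens

print(alpha_tokens("We need icons for: CopyToClipboardActionDelegate; 8/8/01"))
```

Note that this still keeps letter-only tokens such as JGS, PM and DW. Dropping those needs a separate filtering step, because a regex cannot tell whether a run of letters is a meaningful word.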

  • Please define "English words", especially in the light of them being not separated by anything, not even non-alpha characters. Do you have a list of valid words? To ask differently, you have created the desired output - how did you do that? "I simply recognise English words" would in my opinion confirm the "list of valid words" theory. Given the string "thought", would you recognise "thou" as an English word? It is one, though an old one. – Yunnosch Oct 01 '19 at 20:37
  • Why is "Clipboard" in your desired output? I.e why not "Clip" and "board"? Both are English words. – Yunnosch Oct 01 '19 at 20:39
  • I would tokenize it in a regular manner, and then validate whether each token is a word or not as per your criteria. In order to do that, try the `nltk` package; this might also be useful: https://stackoverflow.com/a/3789057/11610186 – Grzegorz Skibinski Oct 01 '19 at 20:51
  • Question has nothing to do with `machine-learning` - kindly do not spam irrelevant tags (removed). – desertnaut Oct 01 '19 at 21:29
  • @Yunnosch Thank you for your questions! Basically, what I would like to achieve is a list of tokens which contains only words. By words I mean strings containing only letters, i.e. only alphabetical characters. So just letters, no numbers and no signs like brackets or other characters like that. Hope that helps. – Христо Петков Oct 01 '19 at 21:59
  • By that explanation, why is "PM" and "JGS" not in your desired result? – Yunnosch Oct 01 '19 at 22:03
  • @Yunnosch I am not saying that it is not there. I am saying that I don't need it to be there. I am doing sentiment analysis on bug reports so I need words that have meaning. – Христо Петков Oct 01 '19 at 22:05
  • How do you tell the words with meaning apart from those without meaning? What is the difference between the meaningful "actions" and the meaningless "JGS" ? – Yunnosch Oct 01 '19 at 22:08
  • @Yunnosch I don't think I need to. Provided I get a list containing only words, I can run through it to remove "stopwords" like names, or words like "the", "and", "or" that have no meaning other than making a sentence grammatically correct. So I think all I need is a list of all the possible words from a bug report. The words should be strings which contain only letters, not numbers or any other symbols. – Христо Петков Oct 01 '19 at 22:11
  • How about explaining that in the question, instead of asking for English words? You could [edit] for that. You could also update the desired result, to match that. I.e. include JGS and PM and TerminateAndRemoveActionDelegate in the desired result and remove the versions which are split into shorter English words. – Yunnosch Oct 01 '19 at 23:33
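
The "tokenize first, then validate" idea from the comments could be sketched roughly as follows. The vocabulary here is a tiny hand-written set used purely for illustration. A real run would need an actual English word list, and which list to use is an assumption the thread leaves open. The names `VOCAB` and `keep_known_words` are hypothetical.

```python
import re

# Hypothetical, hand-picked vocabulary standing in for a real word list.
VOCAB = {"we", "need", "enabled", "disabled", "hover", "icons", "copy"}

def keep_known_words(text, vocab=VOCAB):
    """Return lowercased letter-only tokens that appear in vocab."""
    candidates = re.findall(r"[A-Za-z]+", text)
    return [tok.lower() for tok in candidates if tok.lower() in vocab]

print(keep_known_words("JGS (8/8/01 5:20:19 PM); We need enabled icons"))
```

With this kind of filter, tokens such as JGS and PM drop out simply because they are not in the list. Unsplit identifiers like CopyToClipboardActionDelegate drop out too, unless they are broken into parts before the lookup.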

0 Answers