Hello programmers around the world. Is there any way to create a tokenizer that extracts only the English words from an arbitrary string? For example, given the string "JGS (8/8/01 5:20:19 PM); We need enabled; disabled & hover icons for the following actions:; CopyToClipboardActionDelegate; RelaunchActionDelegate; TerminateAndRemoveActionDelegate; ; DW ", the expected result is: we, need, enabled, disabled, hover, icons, for, the, following, actions, Copy, To, Clipboard, Action, Delegate, Relaunch, Action, Delegate, Terminate, And, Remove, Action, Delegate, and so on. Is such a thing even possible?
I've tried NLTK's word_tokenize and also tried to find patterns within the string using regexes, but so far I can't get the result I need.
This is what I have so far. P.S. I know this method of tokenization cannot achieve what I am looking for; I just don't know how to write regexes :(
from nltk.tokenize import word_tokenize

def tokenization(series):
    # Tokenize each string in the pandas Series with NLTK's word_tokenize.
    result = []
    for text in series.to_numpy():
        result.append(word_tokenize(text))
    return result
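I have also tried this and failed miserably:
import re

def tokenization(series):
    result = []
    for text in series.to_numpy():
        # The ^ and $ anchors require the whole string to consist of letters,
        # so this finds nothing in a string with spaces, digits, or punctuation.
        result.append(re.findall('^[a-zA-Z]*$', text))
    return result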
The result I get is: 'JGS', '(', '8/8/01', '5:20:19', 'PM', ')', ';', 'We', 'need', 'enabled', ';', 'disabled', '&', 'hover', 'icons', 'for', 'the', 'following', 'actions', ':', ';', 'CopyToClipboardActionDelegate', ';', 'RelaunchActionDelegate', ';', 'TerminateAndRemoveActionDelegate', ';', ';', 'DW', '(', '9/24/2001', '2:22:48', 'PM', ')', ';', 'Use', 'the', 'standard', 'copy', 'icon', 'for', 'copy', 'to', 'clipboard', '(', 'desktop', 'likely', 'exposes', 'it', ')', '.', ';', ';', 'DW', '(', '9/24/2001', '2:23:05', 'PM', ')', ';', 'Made', 'requests', 'for', ';', 'Relaunch', ';', 'Terminate', 'All', ';', 'Terminate', '&', 'Remove'. As stated above, the tokenization should stay pretty much the same, but only words should be present. If anybody has any ideas, all help would be appreciated.
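To make the target concrete, here is a rough sketch of the kind of thing I am after, not a definitive solution. It assumes a single regex can pull out runs of letters and split CamelCase identifiers, and that deciding what counts as an English word needs a separate word list. The names `WORD_RE`, `split_words` and `vocabulary` are made up for illustration, and the word list is whatever set of lowercase words you choose to supply (NLTK's `words` corpus might be one option).

```python
import re

# Alternative 1: a run of capitals not followed by a lowercase letter
#                (an acronym such as "JGS", or "HTML" in "HTMLParser").
# Alternative 2: an optional capital followed by lowercase letters
#                ("Copy", "need").
# Digits and punctuation never match, so they are dropped.
WORD_RE = re.compile(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+")


def split_words(text, vocabulary=None):
    """Return alphabetic tokens from text, splitting CamelCase identifiers.

    If vocabulary (a set of lowercase words, assumed to be supplied by the
    caller) is given, keep only tokens whose lowercase form is in it.
    """
    tokens = WORD_RE.findall(text)
    if vocabulary is not None:
        tokens = [t for t in tokens if t.lower() in vocabulary]
    return tokens


print(split_words("We need CopyToClipboardActionDelegate; DW"))
# ['We', 'need', 'Copy', 'To', 'Clipboard', 'Action', 'Delegate', 'DW']
```

Without a word list, abbreviations such as JGS, PM and DW still come through, so the regex alone is not enough for the "only English words" part. For a pandas Series, something like `series.apply(split_words)` should apply it to every row.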