Questions tagged [tokenize]

Tokenizing is the act of splitting a string into discrete elements called tokens.

Tokenizing is the act of splitting a stream of text into discrete elements called tokens using a delimiter present in the stream. These tokens can then be processed further, for example to search for a value or to assign them to an array for looping.

Example (VBA):

Sub TokenizeExample()
    Dim tokens As Variant
    Dim sampleString As String
    Dim i As Long

    sampleString = "The quick brown fox jumps over the lazy dog."

    ' tokenize the string based on the space delimiter
    tokens = Split(sampleString, " ")

    ' list the tokens
    For i = LBound(tokens) To UBound(tokens)
        MsgBox tokens(i)
    Next i
End Sub

2964 questions
9 votes · 4 answers

C++ tokenize std::string

Possible Duplicate: How do I tokenize a string in C++? Hello I was wondering how I would tokenize a std string with strtok string line = "hello, world, bye"; char * pch = strtok(line.c_str(),","); I get the following error error: invalid…
Daniel Del Core
  • 3,071
  • 13
  • 38
  • 52
8 votes · 4 answers

Basic NLP in CoffeeScript or JavaScript -- Punkt tokenization, simple trained Bayes models -- where to start?

My current web-app project calls for a little NLP: Tokenizing text into sentences, via Punkt and similar; Breaking down the longer sentences by subordinate clause (often it’s on commas except when it’s not) A Bayesian model fit for chunking…
fish2000
  • 4,289
  • 2
  • 37
  • 76
8 votes · 1 answer

What is so special about special tokens?

what exactly is the difference between "token" and a "special token"? I understand the following: what is a typical token what is a typical special token: MASK, UNK, SEP, etc when do you add a token (when you want to expand your vocab) What I…
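
A minimal sketch of the distinction, assuming the question is about the Hugging Face transformers API (the bert-base-uncased checkpoint and the [TOPIC] token are only illustrative):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")

# A plain added token just extends the vocabulary, so the string is no longer
# broken into sub-word pieces.
tok.add_tokens(["deeplearning"])

# A special token is also kept whole, but in addition it is registered in
# tok.all_special_tokens and can be stripped when decoding.
tok.add_special_tokens({"additional_special_tokens": ["[TOPIC]"]})

ids = tok.encode("[TOPIC] deeplearning is great")
print(tok.decode(ids))                            # keeps [CLS], [SEP] and [TOPIC]
print(tok.decode(ids, skip_special_tokens=True))  # special tokens removed
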
8 votes · 3 answers

RegEx Tokenizer: split text into words, digits, punctuation, and spacing (do not delete anything)

I almost found the answer to this question in this thread (samplebias's answer); however I need to split a phrase into words, digits, punctuation marks, and spaces/tabs. I also need this to preserve the order in which each of these things occurs…
floer32
  • 2,190
  • 4
  • 29
  • 50
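
A rough Python sketch of such a lossless split; the exact character classes below are an assumption, not the pattern from the linked thread:

import re

# Four alternatives: letters, digits, whitespace, and everything else.
# Together they match every character, so joining the tokens restores the input.
TOKEN_RE = re.compile(r"[A-Za-z]+|[0-9]+|\s+|[^A-Za-z0-9\s]")

text = "Hello, world 42!\tDone."
tokens = TOKEN_RE.findall(text)
print(tokens)
assert "".join(tokens) == text   # nothing deleted, order preserved
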
8 votes · 1 answer

What is the difference between len(tokenizer) and tokenizer.vocab_size?

I'm trying to add a few new words to the vocabulary of a pretrained HuggingFace Transformers model. I did the following to change the vocabulary of the tokenizer and also increase the embedding size of the model: tokenizer.add_tokens(['word1',…
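
In short, assuming the Hugging Face transformers API: vocab_size is the base vocabulary the pretrained tokenizer shipped with, while len(tokenizer) also counts tokens added afterwards, which is why the embedding matrix is resized to len(tokenizer). A sketch, with bert-base-uncased standing in for the actual checkpoint:

from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

print(tokenizer.vocab_size, len(tokenizer))  # equal before anything is added

tokenizer.add_tokens(["word1", "word2"])
print(tokenizer.vocab_size, len(tokenizer))  # vocab_size unchanged, len grew by 2

# Resize to len(tokenizer), not vocab_size, or the new ids fall outside the table.
model.resize_token_embeddings(len(tokenizer))
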
8 votes · 3 answers

What is the most accurate open-source tool for sentence splitting?

I need to split text into sentences. I'm currently playing around with OpenNLP's sentence detector tool. I've also heard of NLTK and Stanford CoreNLP tools. What are the most accurate English sentence detection tools out there? I don't need too many…
samxli
  • 1,536
  • 5
  • 17
  • 28
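
For comparison, the NLTK (Punkt) baseline is only a few lines; whether it is the most accurate option is exactly what the question asks, so treat this as a starting point rather than a verdict:

import nltk
nltk.download("punkt")  # one-time download of the pretrained Punkt model

from nltk.tokenize import sent_tokenize

text = "Dr. Smith arrived at 10 a.m. He left before noon. It was raining."
for sentence in sent_tokenize(text):
    print(sentence)
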
8 votes · 1 answer

Reloading Keras Tokenizer during Testing

I followed the tutorial here: (https://blog.keras.io/using-pre-trained-word-embeddings-in-a-keras-model.html) However, I modified the code to be able to save the generated model through h5py. Thus, after running the training script, I have a…
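
A common pattern is to persist the fitted Tokenizer next to the saved h5 model and reload it in the test script; a sketch assuming tf.keras, with the file name and the texts list as placeholders:

import pickle
from tensorflow.keras.preprocessing.text import Tokenizer

texts = ["a training sentence", "another training sentence"]  # placeholder corpus

# --- in the training script ---
tokenizer = Tokenizer(num_words=20000)
tokenizer.fit_on_texts(texts)
with open("tokenizer.pickle", "wb") as f:
    pickle.dump(tokenizer, f)

# --- in the testing script ---
with open("tokenizer.pickle", "rb") as f:
    tokenizer = pickle.load(f)
sequences = tokenizer.texts_to_sequences(["a new sentence to classify"])
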
8 votes · 6 answers

How do I split a word's letters into an Array in C#?

How do I split a string into an array of characters in C#? Example String word used is "robot". The program should print out: r o b o t The original code snippet: using System; using System.Collections.Generic; using System.Linq; using…
JavaNoob
  • 3,494
  • 19
  • 49
  • 61
8 votes · 2 answers

NLTK French tokenizer in Python not working

Why is the french tokenizer that comes with python not working for me? Am I doing something wrong? I'm doing import nltk content_french = ["Les astronomes amateurs jouent également un rôle important en recherche; les plus sérieux participant…
Atirag
  • 1,660
  • 7
  • 32
  • 60
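
Passing the language explicitly is usually the missing piece; a sketch assuming NLTK with its Punkt data downloaded (the sample sentence is taken from the question):

import nltk
nltk.download("punkt")

from nltk.tokenize import sent_tokenize, word_tokenize

content_french = "Les astronomes amateurs jouent également un rôle important en recherche."

print(sent_tokenize(content_french, language="french"))
print(word_tokenize(content_french, language="french"))
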
8 votes · 1 answer

How can I prevent spacy's tokenizer from splitting a specific substring when tokenizing a string?

How can I prevent spacy's tokenizer from splitting a specific substring when tokenizing a string? More specifically, I have this sentence: Once unregistered, the folder went away from the shell. which gets tokenized as…
Franck Dernoncourt
  • 77,520
  • 72
  • 342
  • 501
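
One route is a tokenizer special case, which keeps an exact whitespace-delimited chunk intact; a sketch assuming spaCy with the en_core_web_sm model installed (keeping "unregistered," whole is only an illustration, since the excerpt is truncated):

import spacy
from spacy.symbols import ORTH

nlp = spacy.load("en_core_web_sm")
text = "Once unregistered, the folder went away from the shell."

print([t.text for t in nlp(text)])  # default: the trailing comma is split off

# Special cases match whole whitespace-delimited chunks exactly.
nlp.tokenizer.add_special_case("unregistered,", [{ORTH: "unregistered,"}])
print([t.text for t in nlp(text)])  # "unregistered," stays a single token
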
8 votes · 4 answers

PHP: split a string of alternating groups of characters into an array

I have a string whose correct syntax is the regex ^([0-9]+[abc])+$. So examples of valid strings would be: '1a2b' or '00333b1119a555a0c' For clarity, the string is a list of (value, letter) pairs and the order matters. I'm stuck with the input…
Stilez
  • 558
  • 5
  • 14
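
Since ^([0-9]+[abc])+$ defines the grammar, tokenizing amounts to matching every (digits, letter) pair globally; sketched here in Python with re.findall, which has a close analogue in PHP's preg_match_all:

import re

s = "00333b1119a555a0c"

# Validate the whole string, then pull out each (value, letter) pair in order.
if re.fullmatch(r"(?:[0-9]+[abc])+", s):
    pairs = re.findall(r"([0-9]+)([abc])", s)
    print(pairs)  # [('00333', 'b'), ('1119', 'a'), ('555', 'a'), ('0', 'c')]
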
8 votes · 2 answers

How to tokenize Perl source code?

I have some reasonable (not obfuscated) Perl source files, and I need a tokenizer, which will split it to tokens, and return the token type of each of them, e.g. for the script print "Hello, World!\n"; it would return something like this: keyword…
pts
  • 80,836
  • 20
  • 110
  • 183
8 votes · 0 answers

What should I choose? ngram filter, ngram tokenizer or fuzzy match query?

I am a little confused about usage of filter, tokenizer vs query. I can select ngram filter or tokenizer during indexing (through an analyzer) I can also use multi_field to store different variation of same field for different usage of a query so I…
8 votes · 2 answers

Text tokenization with Stanford NLP : Filter unrequired words and characters

I use Stanford NLP for string tokenization in my classification tool. I want to get only meaningful words, but I get non-word tokens (like ---, >, . etc.) and not important words like am, is, to (stop words). Does anybody know a way to solve this…
dmitrievanthony
  • 1,501
  • 1
  • 15
  • 41
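
Whichever tokenizer produces the tokens, a post-filter is the usual fix; a rough sketch that drops non-alphabetic tokens and NLTK's English stop words from an already-tokenized list (the Stanford tokenizer call itself is not shown):

import nltk
nltk.download("stopwords")
from nltk.corpus import stopwords

STOP_WORDS = set(stopwords.words("english"))

# Stand-in for whatever the Stanford tokenizer returned.
tokens = ["I", "am", "---", "going", ">", "to", "the", "conference", "."]

content_words = [t for t in tokens if t.isalpha() and t.lower() not in STOP_WORDS]
print(content_words)  # ['going', 'conference']
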
8 votes · 3 answers

A string tokenizer in C++ that allows multiple separators

Is there a way to tokenize a string in C++ with multiple separators? In C# I would have done: string[] tokens = "adsl, dkks; dk".Split(new [] { ",", " ", ";" }, StringSplitOptions.RemoveEmptyEntries);
Hao Wooi Lim
  • 3,928
  • 4
  • 29
  • 35