Questions tagged [tokenize]

Tokenizing is the act of splitting a string into discrete elements called tokens.

Tokenizing is the act of splitting a stream of text into discrete elements called tokens using a delimiter present in the stream. These tokens can then be processed further, for example to search for a value or to assign them to an array for looping.

Example (VBA):

Sub TokenizeExample()
    Dim tokens As Variant
    Dim sampleString As String
    Dim i As Long

    sampleString = "The quick brown fox jumps over the lazy dog."

    ' tokenize the string based on the space delimiter
    tokens = Split(sampleString, " ")

    ' list the tokens
    For i = LBound(tokens) To UBound(tokens)
        MsgBox tokens(i)
    Next i
End Sub

2964 questions
9 votes · 4 answers

C++ tokenize std::string

Possible Duplicate: How do I tokenize a string in C++? Hello I was wondering how I would tokenize a std string with strtok string line = "hello, world, bye"; char * pch = strtok(line.c_str(),","); I get the following error error: invalid…
Daniel Del Core
  • 3,071
  • 13
  • 38
  • 52
8 votes · 4 answers

Basic NLP in CoffeeScript or JavaScript -- Punkt tokenization, simple trained Bayes models -- where to start?

My current web-app project calls for a little NLP: Tokenizing text into sentences, via Punkt and similar; Breaking down the longer sentences by subordinate clause (often it’s on commas except when it’s not) A Bayesian model fit for chunking…
fish2000
  • 4,289
  • 2
  • 37
  • 76
8 votes · 1 answer

What is so special about special tokens?

what exactly is the difference between "token" and a "special token"? I understand the following: what is a typical token what is a typical special token: MASK, UNK, SEP, etc when do you add a token (when you want to expand your vocab) What I…
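
A minimal sketch of the distinction, assuming the question is about the Hugging Face transformers API (the bert-base-uncased checkpoint and the [TOPIC] token are only illustrative):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")

# A plain added token just extends the vocabulary, so the string is no longer
# broken into sub-word pieces.
tok.add_tokens(["deeplearning"])

# A special token is also kept whole, but in addition it is registered in
# tok.all_special_tokens and can be stripped when decoding.
tok.add_special_tokens({"additional_special_tokens": ["[TOPIC]"]})

ids = tok.encode("[TOPIC] deeplearning is great")
print(tok.decode(ids))                            # keeps [CLS], [SEP] and [TOPIC]
print(tok.decode(ids, skip_special_tokens=True))  # special tokens removed
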
8 votes · 3 answers

RegEx Tokenizer: split text into words, digits, punctuation, and spacing (do not delete anything)

I almost found the answer to this question in this thread (samplebias's answer); however I need to split a phrase into words, digits, punctuation marks, and spaces/tabs. I also need this to preserve the order in which each of these things occurs…
floer32
  • 2,190
  • 4
  • 29
  • 50
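
A rough Python sketch of such a lossless split; the exact character classes below are an assumption, not the pattern from the linked thread:

import re

# Four alternatives: letters, digits, whitespace, and everything else.
# Together they match every character, so joining the tokens restores the input.
TOKEN_RE = re.compile(r"[A-Za-z]+|[0-9]+|\s+|[^A-Za-z0-9\s]")

text = "Hello, world 42!\tDone."
tokens = TOKEN_RE.findall(text)
print(tokens)
assert "".join(tokens) == text   # nothing deleted, order preserved
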
8 votes · 1 answer

What is the difference between len(tokenizer) and tokenizer.vocab_size?

I'm trying to add a few new words to the vocabulary of a pretrained HuggingFace Transformers model. I did the following to change the vocabulary of the tokenizer and also increase the embedding size of the model: tokenizer.add_tokens(['word1',…
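
In short, assuming the Hugging Face transformers API: vocab_size is the base vocabulary the pretrained tokenizer shipped with, while len(tokenizer) also counts tokens added afterwards, which is why the embedding matrix is resized to len(tokenizer). A sketch, with bert-base-uncased standing in for the actual checkpoint:

from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

print(tokenizer.vocab_size, len(tokenizer))  # equal before anything is added

tokenizer.add_tokens(["word1", "word2"])
print(tokenizer.vocab_size, len(tokenizer))  # vocab_size unchanged, len grew by 2

# Resize to len(tokenizer), not vocab_size, or the new ids fall outside the table.
model.resize_token_embeddings(len(tokenizer))
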
8 votes · 3 answers

What is the most accurate open-source tool for sentence splitting?

I need to split text into sentences. I'm currently playing around with OpenNLP's sentence detector tool. I've also heard of NLTK and Stanford CoreNLP tools. What are the most accurate English sentence detection tools out there? I don't need too many…
samxli
  • 1,536
  • 5
  • 17
  • 28
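
For comparison, the NLTK (Punkt) baseline is only a few lines; whether it is the most accurate option is exactly what the question asks, so treat this as a starting point rather than a verdict:

import nltk
nltk.download("punkt")  # one-time download of the pretrained Punkt model

from nltk.tokenize import sent_tokenize

text = "Dr. Smith arrived at 10 a.m. He left before noon. It was raining."
for sentence in sent_tokenize(text):
    print(sentence)
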
8 votes · 1 answer

Reloading Keras Tokenizer during Testing

I followed the tutorial here: (https://blog.keras.io/using-pre-trained-word-embeddings-in-a-keras-model.html) However, I modified the code to be able to save the generated model through h5py. Thus, after running the training script, I have a…
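
A common pattern is to persist the fitted Tokenizer next to the saved h5 model and reload it in the test script; a sketch assuming tf.keras, with the file name and the texts list as placeholders:

import pickle
from tensorflow.keras.preprocessing.text import Tokenizer

texts = ["a training sentence", "another training sentence"]  # placeholder corpus

# --- in the training script ---
tokenizer = Tokenizer(num_words=20000)
tokenizer.fit_on_texts(texts)
with open("tokenizer.pickle", "wb") as f:
    pickle.dump(tokenizer, f)

# --- in the testing script ---
with open("tokenizer.pickle", "rb") as f:
    tokenizer = pickle.load(f)
sequences = tokenizer.texts_to_sequences(["a new sentence to classify"])
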
8 votes · 6 answers

How do I split a word's letters into an Array in C#?

How do I split a string into an array of characters in C#? Example String word used is "robot". The program should print out: r o b o t The original code snippet: using System; using System.Collections.Generic; using System.Linq; using…
JavaNoob
  • 3,494
  • 19
  • 49
  • 61
8 votes · 2 answers

NLTK French tokenizer in Python not working

Why is the french tokenizer that comes with python not working for me? Am I doing something wrong? I'm doing import nltk content_french = ["Les astronomes amateurs jouent également un rôle important en recherche; les plus sérieux participant…
Atirag
  • 1,660
  • 7
  • 32
  • 60
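
Passing the language explicitly is usually the missing piece; a sketch assuming NLTK with its Punkt data downloaded (the sample sentence is taken from the question):

import nltk
nltk.download("punkt")

from nltk.tokenize import sent_tokenize, word_tokenize

content_french = "Les astronomes amateurs jouent également un rôle important en recherche."

print(sent_tokenize(content_french, language="french"))
print(word_tokenize(content_french, language="french"))
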
8 votes · 1 answer

How can I prevent spacy's tokenizer from splitting a specific substring when tokenizing a string?

How can I prevent spacy's tokenizer from splitting a specific substring when tokenizing a string? More specifically, I have this sentence: Once unregistered, the folder went away from the shell. which gets tokenized as…
Franck Dernoncourt
  • 77,520
  • 72
  • 342
  • 501
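
One route is a tokenizer special case, which keeps an exact whitespace-delimited chunk intact; a sketch assuming spaCy with the en_core_web_sm model installed (keeping "unregistered," whole is only an illustration, since the excerpt is truncated):

import spacy
from spacy.symbols import ORTH

nlp = spacy.load("en_core_web_sm")
text = "Once unregistered, the folder went away from the shell."

print([t.text for t in nlp(text)])  # default: the trailing comma is split off

# Special cases match whole whitespace-delimited chunks exactly.
nlp.tokenizer.add_special_case("unregistered,", [{ORTH: "unregistered,"}])
print([t.text for t in nlp(text)])  # "unregistered," stays a single token
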
8 votes · 4 answers

PHP: split a string of alternating groups of characters into an array

I have a string whose correct syntax is the regex ^([0-9]+[abc])+$. So examples of valid strings would be: '1a2b' or '00333b1119a555a0c' For clarity, the string is a list of (value, letter) pairs and the order matters. I'm stuck with the input…
Stilez
  • 558
  • 5
  • 14
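
Since ^([0-9]+[abc])+$ defines the grammar, tokenizing amounts to matching every (digits, letter) pair globally; sketched here in Python with re.findall, which has a close analogue in PHP's preg_match_all:

import re

s = "00333b1119a555a0c"

# Validate the whole string, then pull out each (value, letter) pair in order.
if re.fullmatch(r"(?:[0-9]+[abc])+", s):
    pairs = re.findall(r"([0-9]+)([abc])", s)
    print(pairs)  # [('00333', 'b'), ('1119', 'a'), ('555', 'a'), ('0', 'c')]
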
8 votes · 2 answers

How to tokenize Perl source code?

I have some reasonable (not obfuscated) Perl source files, and I need a tokenizer, which will split it to tokens, and return the token type of each of them, e.g. for the script print "Hello, World!\n"; it would return something like this: keyword…
pts
  • 80,836
  • 20
  • 110
  • 183
8 votes · 0 answers

What should I choose? ngram filter, ngram tokenizer or fuzzy match query?

I am a little confused about usage of filter, tokenizer vs query. I can select ngram filter or tokenizer during indexing (through an analyzer) I can also use multi_field to store different variation of same field for different usage of a query so I…
8 votes · 2 answers

Text tokenization with Stanford NLP : Filter unrequired words and characters

I use Stanford NLP for string tokenization in my classification tool. I want to get only meaningful words, but I get non-word tokens (like ---, >, . etc.) and not important words like am, is, to (stop words). Does anybody know a way to solve this…
dmitrievanthony
  • 1,501
  • 1
  • 15
  • 41
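
Whichever tokenizer produces the tokens, a post-filter is the usual fix; a rough sketch that drops non-alphabetic tokens and NLTK's English stop words from an already-tokenized list (the Stanford tokenizer call itself is not shown):

import nltk
nltk.download("stopwords")
from nltk.corpus import stopwords

STOP_WORDS = set(stopwords.words("english"))

# Stand-in for whatever the Stanford tokenizer returned.
tokens = ["I", "am", "---", "going", ">", "to", "the", "conference", "."]

content_words = [t for t in tokens if t.isalpha() and t.lower() not in STOP_WORDS]
print(content_words)  # ['going', 'conference']
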
8 votes · 3 answers

A string tokenizer in C++ that allows multiple separators

Is there a way to tokenize a string in C++ with multiple separators? In C# I would have done: string[] tokens = "adsl, dkks; dk".Split(new [] { ",", " ", ";" }, StringSplitOptions.RemoveEmptyEntries);
Hao Wooi Lim
  • 3,928
  • 4
  • 29
  • 35