Questions tagged [tokenize]

Tokenizing is the act of splitting a stream of text into discrete elements called tokens, using a delimiter present in the stream. These tokens can then be processed further, for example searched for a value or assigned to an array for looping.

Example (VBA):

Dim tokens As Variant
Dim sampleString As String
Dim i As Long

sampleString = "The quick brown fox jumps over the lazy dog."

' tokenize string based on space delimiter
tokens = Split(sampleString, " ")

' list tokens
For i = LBound(tokens) To UBound(tokens)
  MsgBox tokens(i)
Next i

2964 questions
12 votes, 2 answers

Is it a bad idea to use regex to tokenize a string for a lexer?

I'm not sure how I'm going to tokenize source text for a lexer. For now, I can only think of using regex to parse the string into an array with given rules (identifiers, symbols such as +, -, etc.). For instance, given begin x:=1;y:=2; I want to tokenize each word and variable…
REALFREE
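
Regex is in fact a common and workable way to drive a lexer: compile one alternation of named groups, ordered so overlapping patterns resolve sensibly, and iterate over the matches. A minimal sketch in Python (the begin x:=1;y:=2; sample comes from the question; the token names and rules are illustrative):

import re

# one named group per token class; a real lexer would also add an error rule
TOKEN_SPEC = [
    ("NUMBER", r"\d+"),
    ("IDENT",  r"[A-Za-z_]\w*"),   # also matches keywords such as "begin"
    ("ASSIGN", r":="),
    ("OP",     r"[+\-*/;]"),
    ("SKIP",   r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def tokenize(source):
    for m in MASTER.finditer(source):
        if m.lastgroup != "SKIP":
            yield m.lastgroup, m.group()

print(list(tokenize("begin x:=1;y:=2;")))
# [('IDENT', 'begin'), ('IDENT', 'x'), ('ASSIGN', ':='), ('NUMBER', '1'), ...]
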
12 votes, 1 answer

Difference between WhitespaceTokenizerFactory and StandardTokenizerFactory

I am new to Solr. After reading Solr's wiki, I still don't understand the difference between WhitespaceTokenizerFactory and StandardTokenizerFactory. What is the real difference between them?
trillions
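
In short, WhitespaceTokenizerFactory splits on whitespace only and leaves punctuation attached to the tokens, while StandardTokenizerFactory applies Unicode word-break rules and discards most punctuation. A rough illustration of the behavioral difference in Python (the regex is only a crude stand-in for the standard tokenizer's grammar, not Solr's actual rules):

import re

text = "Visit http://example.com, it's great!"

# whitespace tokenization: punctuation stays glued to the words
print(text.split())
# ['Visit', 'http://example.com,', "it's", 'great!']

# crude approximation of standard tokenization: word runs, punctuation dropped
print(re.findall(r"\w+(?:['.]\w+)*", text))
# ['Visit', 'http', 'example.com', "it's", 'great']
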
11 votes, 3 answers

Boost::Split using whole string as delimiter

I would like to know if there is a method using boost::split to split a string using whole strings as a delimiter. For example: str = "xxaxxxxabcxxxxbxxxcxxx" is there a method to split this string using "abc" as a delimiter? Therefore…
andre
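
boost::split with boost::is_any_of treats the delimiter string as a set of single characters, which is why "abc" does not act as a unit; in C++ the usual escape hatch is boost::algorithm::split_regex. The semantics the asker wants are easiest to show with Python's str.split, which takes whole substrings (purely an illustration of the desired behavior, not the Boost API):

s = "xxaxxxxabcxxxxbxxxcxxx"
# split on the whole substring "abc", not on the characters a, b, c
print(s.split("abc"))
# ['xxaxxxx', 'xxxxbxxxcxxx']
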
11 votes, 1 answer

Class hierarchy of tokens and checking their type in the parser

I'm attempting to write a reusable parsing library (for fun). I wrote a Lexer class which generates a sequence of Tokens. Token is a base class for a hierarchy of subclasses, each representing a different token type with its own specific properties.…
SasQ
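
A common alternative to a deep subclass hierarchy plus type checks is a flat Token carrying a kind tag, which keeps the parser's dispatch simple. A minimal sketch in Python (the names are illustrative, not taken from the question's code):

from dataclasses import dataclass
from enum import Enum, auto

class Kind(Enum):
    NUMBER = auto()
    IDENT = auto()
    OP = auto()

@dataclass
class Token:
    kind: Kind
    text: str

def expect(token, kind):
    # the parser checks the tag instead of the concrete class
    if token.kind is not kind:
        raise SyntaxError(f"expected {kind.name}, got {token.text!r}")
    return token

expect(Token(Kind.NUMBER, "42"), Kind.NUMBER)   # passes
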
11 votes, 3 answers

How to reconstruct text entities with Hugging Face's transformers pipelines without IOB tags?

I've been looking to use Hugging Face's Pipelines for NER (named entity recognition). However, it returns the entity labels in inside-outside-beginning (IOB) format but without the IOB labels, so I'm not able to map the output of the pipeline…
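
The pipeline can merge the sub-token pieces itself: passing aggregation_strategy="simple" (grouped_entities=True in older transformers releases) returns whole entities with character offsets instead of raw fragments. A sketch, assuming the pipeline's default NER checkpoint:

from transformers import pipeline

# aggregation_strategy="simple" groups B-/I- pieces into whole entities
ner = pipeline("ner", aggregation_strategy="simple")

for entity in ner("Hugging Face is based in New York City"):
    print(entity["entity_group"], entity["word"], entity["start"], entity["end"])
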
11 votes, 1 answer

How to find "num_words" or vocabulary size of Keras tokenizer when one is not assigned?

So if I don't pass the num_words argument when initializing Tokenizer(), how do I find the vocabulary size after it has been used to tokenize the training dataset? The reason I ask: I don't want to limit the tokenizer's vocabulary size, so that I can see how well my…
karthiks
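
After fit_on_texts, the fitted vocabulary lives in tokenizer.word_index, so the effective size is its length plus one (index 0 is reserved for padding). A short sketch with made-up training text:

from tensorflow.keras.preprocessing.text import Tokenizer

tokenizer = Tokenizer()                      # no num_words cap
tokenizer.fit_on_texts(["the quick brown fox", "the lazy dog"])

vocab_size = len(tokenizer.word_index) + 1   # +1 for the reserved index 0
print(vocab_size)
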
11 votes, 2 answers

Division/RegExp conflict while tokenizing JavaScript

I'm writing a simple JavaScript tokenizer which detects basic types: Word, Number, String, RegExp, Operator, Comment and Newline. Everything is going fine, but I can't understand how to detect whether the current character is a RegExp delimiter or a division…
Orme
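
The character itself cannot tell you; the standard trick is to look at the previous significant token. A / begins a regex literal when the previous token cannot end an expression (an operator, an opening bracket, a keyword like return), and is division otherwise. The rule sketched in Python (token-type names are illustrative):

# token types after which "/" must be division, because an expression just ended
EXPRESSION_ENDERS = {"IDENT", "NUMBER", "STRING", "RPAREN", "RBRACKET"}

def slash_starts_regex(prev_token_type):
    # at start of input (None) or after "=", "(", an operator, etc. => regex;
    # keywords such as "return" need special-casing despite looking like idents
    return prev_token_type not in EXPRESSION_ENDERS

print(slash_starts_regex("NUMBER"))   # False: 10 / 2 is division
print(slash_starts_regex("OP"))       # True:  x = /ab+c/ is a regex
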
11 votes, 2 answers

Using multiple tokenizers in Solr

What I want to be able to do is perform a query and get results back that are not case sensitive and that match partial words from the index. I have a Solr schema set up at the moment that has been modified so that I can query and return results no…
Matt Dell
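
A Solr analyzer chain allows exactly one tokenizer; case-insensitive, partial-word matching comes from stacking token filters (e.g. LowerCaseFilterFactory and EdgeNGramFilterFactory) after it. The pipeline idea, sketched conceptually in Python rather than schema XML (all names here are stand-ins):

def analyze(text, tokenize, filters):
    # one tokenizer, then each filter transforms the token stream in turn
    tokens = tokenize(text)
    for f in filters:
        tokens = f(tokens)
    return list(tokens)

lowercase = lambda toks: (t.lower() for t in toks)
edge_ngrams = lambda toks: (t[:n] for t in toks for n in range(2, len(t) + 1))

print(analyze("Hello World", str.split, [lowercase, edge_ngrams]))
# ['he', 'hel', 'hell', 'hello', 'wo', 'wor', 'worl', 'world']
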
11 votes, 3 answers

get indices of original text from nltk word_tokenize

I am tokenizing a text using nltk.word_tokenize and I would like to also get the index in the original raw text of the first character of every token, i.e. import nltk x = 'hello world' tokens = nltk.word_tokenize(x) >>> ['hello', 'world'] How can…
genekogan
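
One workable approach is to walk the original string with str.find, since word_tokenize does not report offsets itself. A sketch (assumes NLTK's punkt data is installed; note word_tokenize rewrites some tokens, e.g. quotes, so those need special handling):

import nltk

def token_spans(text):
    offset = 0
    for token in nltk.word_tokenize(text):
        # find the token in the raw text, starting after the previous one
        offset = text.find(token, offset)
        yield token, offset, offset + len(token)
        offset += len(token)

print(list(token_spans("hello world")))
# [('hello', 0, 5), ('world', 6, 11)]
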
11 votes, 5 answers

How to convert CSV to a table in Oracle

How can I make a package that returns results in table format when passed CSV values? select * from table(schema.mypackage.myfunction('one, two, three')) should return one, two, three as rows. I tried something from Ask Tom, but that only works with sql…
Mehur
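
The standard Oracle answer is a PL/SQL pipelined table function that loops over the string and PIPEs one row per value, which select * from table(...) can then consume. The row-at-a-time idea is sketched below in Python, with a generator standing in for the pipelined function (illustration only, not Oracle syntax):

def csv_to_rows(csv_string):
    # yield one "row" per value, the way a pipelined function PIPEs rows
    for value in csv_string.split(","):
        yield value.strip()

print(list(csv_to_rows("one, two, three")))
# ['one', 'two', 'three']
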
11 votes, 4 answers

How to build a parse tree of a mathematical expression?

I'm learning how to write tokenizers and parsers, and as an exercise I'm writing a calculator in JavaScript. I'm using a parse tree approach (I hope I got this term right) to build my calculator. I'm building a tree of tokens based on operator…
bodacydo
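
The usual shape is a recursive-descent parser with one function per precedence level, each returning a subtree. A compact sketch in Python (the question's calculator is in JavaScript, but the structure carries over directly; trees are nested tuples here for brevity):

import re

def parse(expr):
    tokens = re.findall(r"\d+|[()+\-*/]", expr)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take():
        nonlocal pos
        pos += 1
        return tokens[pos - 1]

    def expression():              # expression := term (("+" | "-") term)*
        node = term()
        while peek() in ("+", "-"):
            node = (take(), node, term())
        return node

    def term():                    # term := factor (("*" | "/") factor)*
        node = factor()
        while peek() in ("*", "/"):
            node = (take(), node, factor())
        return node

    def factor():                  # factor := NUMBER | "(" expression ")"
        if peek() == "(":
            take()                 # consume "("
            node = expression()
            take()                 # consume ")"
            return node
        return int(take())

    return expression()

print(parse("1+2*(3-4)"))
# ('+', 1, ('*', 2, ('-', 3, 4)))
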
11 votes, 4 answers

Google-like search query tokenization & string splitting

I'm looking to tokenize a search query similar to how Google does it. For instance, if I have the following search query: the quick "brown fox" jumps over the "lazy dog" I would like to have a string array with the following…
jamesaharvey
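
Python's shlex does exactly this quoted-phrase splitting out of the box; in other languages the same behavior falls out of a regex that alternates quoted runs and bare words. A sketch using the query from the question:

import shlex

query = 'the quick "brown fox" jumps over the "lazy dog"'
print(shlex.split(query))
# ['the', 'quick', 'brown fox', 'jumps', 'over', 'the', 'lazy dog']
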
11 votes, 4 answers

Using boost::tokenizer with string delimiters

I've been looking at boost::tokenizer, and I've found that the documentation is very thin. Is it possible to make it tokenize a string such as "dolphin--monkey--baboon" and make every word a token, as well as every double dash a token? From the…
Martin
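
With boost::tokenizer this is awkward, but the desired output (the words and each -- separator as tokens) is easy to state with a splitting regex whose capturing group keeps the delimiter. Illustrated in Python, purely to pin down the expected result:

import re

s = "dolphin--monkey--baboon"
# the capturing group makes re.split keep each delimiter as a token
print(re.split(r"(--)", s))
# ['dolphin', '--', 'monkey', '--', 'baboon']
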
11 votes, 1 answer

Is there a way to boost the original term more while using Solr synonyms?

For example, I have the synonyms laptop,netbook,notebook in index_synonyms.txt. When a user searches for netbook, I want to boost the original term more than the ones expanded by synonyms. Is there a way to specify this in SynonymFilterFactory? For example, use the original term…
yura
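
SynonymFilterFactory itself has no per-term boost, so a usual workaround is query-time expansion: leave the query unexpanded by the filter and build one where the user's original term carries a higher boost than its synonyms. A sketch of that query construction in Python (the synonym map and boost value are illustrative):

SYNONYMS = {"netbook": ["laptop", "notebook"]}

def expand_query(term, boost=2.0):
    # original term boosted, synonyms added at default weight
    return " OR ".join([f"{term}^{boost}"] + SYNONYMS.get(term, []))

print(expand_query("netbook"))
# netbook^2.0 OR laptop OR notebook
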
10 votes, 2 answers

Boost::tokenizer comma separated (C++)

Should be an easy one for you guys..... I'm playing around with tokenizers using Boost and I want to create a token that is comma separated. Here is my code: string s = "this is, , , a test"; boost::char_delimiters_separator
Lexicon
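
In Boost, the comma-separated behavior (including the empty tokens between adjacent commas) comes from boost::char_separator with keep_empty_tokens; char_delimiters_separator is the older, deprecated interface. The intended result, shown in Python for comparison with the question's sample string:

s = "this is, , , a test"
tokens = [t.strip() for t in s.split(",")]
print(tokens)
# ['this is', '', '', 'a test']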