Questions tagged [tokenize]

Tokenizing is the act of splitting a stream of text into discrete elements called tokens, using a delimiter present in the stream. These tokens can then be processed further, for example searched for a value or assigned to an array for looping.

Example (VBA):

Dim tokens As Variant
Dim sampleString As String
Dim i As Long

sampleString = "The quick brown fox jumps over the lazy dog."

' tokenize string based on space delimiter
tokens = Split(sampleString, " ")

' list tokens
For i = LBound(tokens) To UBound(tokens)
  MsgBox tokens(i)
Next i

2964 questions
12 votes, 2 answers

Is it a bad idea to use regex to tokenize a string for a lexer?

I'm not sure how I'm going to tokenize source text for a lexer. For now, I can only think of using regex to parse the string into an array with given rules (identifiers, symbols such as +, -, etc.). For instance, given begin x:=1;y:=2; I want to tokenize each word and variable…
REALFREE
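
Regex is in fact a common and workable way to drive a lexer: compile one alternation of named groups, ordered so overlapping patterns resolve sensibly, and iterate over the matches. A minimal sketch in Python (the begin x:=1;y:=2; sample comes from the question; the token names and rules are illustrative):

import re

# one named group per token class; a real lexer would also add an error rule
TOKEN_SPEC = [
    ("NUMBER", r"\d+"),
    ("IDENT",  r"[A-Za-z_]\w*"),   # also matches keywords such as "begin"
    ("ASSIGN", r":="),
    ("OP",     r"[+\-*/;]"),
    ("SKIP",   r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def tokenize(source):
    for m in MASTER.finditer(source):
        if m.lastgroup != "SKIP":
            yield m.lastgroup, m.group()

print(list(tokenize("begin x:=1;y:=2;")))
# [('IDENT', 'begin'), ('IDENT', 'x'), ('ASSIGN', ':='), ('NUMBER', '1'), ...]
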
12 votes, 1 answer

Difference between WhitespaceTokenizerFactory and StandardTokenizerFactory

I am new to Solr. After reading Solr's wiki, I still don't understand the difference between WhitespaceTokenizerFactory and StandardTokenizerFactory. What is the real difference between them?
trillions
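
In short, WhitespaceTokenizerFactory splits on whitespace only and leaves punctuation attached to the tokens, while StandardTokenizerFactory applies Unicode word-break rules and discards most punctuation. A rough illustration of the behavioral difference in Python (the regex is only a crude stand-in for the standard tokenizer's grammar, not Solr's actual rules):

import re

text = "Visit http://example.com, it's great!"

# whitespace tokenization: punctuation stays glued to the words
print(text.split())
# ['Visit', 'http://example.com,', "it's", 'great!']

# crude approximation of standard tokenization: word runs, punctuation dropped
print(re.findall(r"\w+(?:['.]\w+)*", text))
# ['Visit', 'http', 'example.com', "it's", 'great']
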
11 votes, 3 answers

Boost::Split using whole string as delimiter

I would like to know if there is a method using boost::split to split a string using whole strings as a delimiter. For example: str = "xxaxxxxabcxxxxbxxxcxxx" is there a method to split this string using "abc" as a delimiter? Therefore…
andre
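
boost::split with boost::is_any_of treats the delimiter string as a set of single characters, which is why "abc" does not act as a unit; in C++ the usual escape hatch is boost::algorithm::split_regex. The semantics the asker wants are easiest to show with Python's str.split, which takes whole substrings (purely an illustration of the desired behavior, not the Boost API):

s = "xxaxxxxabcxxxxbxxxcxxx"
# split on the whole substring "abc", not on the characters a, b, c
print(s.split("abc"))
# ['xxaxxxx', 'xxxxbxxxcxxx']
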
11 votes, 1 answer

Class hierarchy of tokens and checking their type in the parser

I'm attempting to write a reusable parsing library (for fun). I wrote a Lexer class which generates a sequence of Tokens. Token is a base class for a hierarchy of subclasses, each representing a different token type with its own specific properties.…
SasQ
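
A common alternative to a deep subclass hierarchy plus type checks is a flat Token carrying a kind tag, which keeps the parser's dispatch simple. A minimal sketch in Python (the names are illustrative, not taken from the question's code):

from dataclasses import dataclass
from enum import Enum, auto

class Kind(Enum):
    NUMBER = auto()
    IDENT = auto()
    OP = auto()

@dataclass
class Token:
    kind: Kind
    text: str

def expect(token, kind):
    # the parser checks the tag instead of the concrete class
    if token.kind is not kind:
        raise SyntaxError(f"expected {kind.name}, got {token.text!r}")
    return token

expect(Token(Kind.NUMBER, "42"), Kind.NUMBER)   # passes
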
11 votes, 3 answers

How to reconstruct text entities with Hugging Face's transformers pipelines without IOB tags?

I've been looking to use Hugging Face's Pipelines for NER (named entity recognition). However, it returns the entity labels in inside-outside-beginning (IOB) format but without the IOB labels, so I'm not able to map the output of the pipeline…
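
The pipeline can merge the sub-token pieces itself: passing aggregation_strategy="simple" (grouped_entities=True in older transformers releases) returns whole entities with character offsets instead of raw fragments. A sketch, assuming the pipeline's default NER checkpoint:

from transformers import pipeline

# aggregation_strategy="simple" groups B-/I- pieces into whole entities
ner = pipeline("ner", aggregation_strategy="simple")

for entity in ner("Hugging Face is based in New York City"):
    print(entity["entity_group"], entity["word"], entity["start"], entity["end"])
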
11 votes, 1 answer

How to find "num_words" or vocabulary size of Keras tokenizer when one is not assigned?

So if I don't pass the num_words argument when initializing Tokenizer(), how do I find the vocabulary size after it has been used to tokenize the training dataset? The reason I ask: I don't want to limit the tokenizer's vocabulary size, so that I can see how well my…
karthiks
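
After fit_on_texts, the fitted vocabulary lives in tokenizer.word_index, so the effective size is its length plus one (index 0 is reserved for padding). A short sketch with made-up training text:

from tensorflow.keras.preprocessing.text import Tokenizer

tokenizer = Tokenizer()                      # no num_words cap
tokenizer.fit_on_texts(["the quick brown fox", "the lazy dog"])

vocab_size = len(tokenizer.word_index) + 1   # +1 for the reserved index 0
print(vocab_size)
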
11 votes, 2 answers

Division/RegExp conflict while tokenizing JavaScript

I'm writing a simple JavaScript tokenizer which detects basic types: Word, Number, String, RegExp, Operator, Comment and Newline. Everything is going fine, but I can't understand how to detect whether the current character is a RegExp delimiter or a division…
Orme
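
The character itself cannot tell you; the standard trick is to look at the previous significant token. A / begins a regex literal when the previous token cannot end an expression (an operator, an opening bracket, a keyword like return), and is division otherwise. The rule sketched in Python (token-type names are illustrative):

# token types after which "/" must be division, because an expression just ended
EXPRESSION_ENDERS = {"IDENT", "NUMBER", "STRING", "RPAREN", "RBRACKET"}

def slash_starts_regex(prev_token_type):
    # at start of input (None) or after "=", "(", an operator, etc. => regex;
    # keywords such as "return" need special-casing despite looking like idents
    return prev_token_type not in EXPRESSION_ENDERS

print(slash_starts_regex("NUMBER"))   # False: 10 / 2 is division
print(slash_starts_regex("OP"))       # True:  x = /ab+c/ is a regex
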
11 votes, 2 answers

Using multiple tokenizers in Solr

What I want to be able to do is perform a query and get results back that are not case sensitive and that match partial words from the index. I have a Solr schema set up at the moment that has been modified so that I can query and return results no…
Matt Dell
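
A Solr analyzer chain allows exactly one tokenizer; case-insensitive, partial-word matching comes from stacking token filters (e.g. LowerCaseFilterFactory and EdgeNGramFilterFactory) after it. The pipeline idea, sketched conceptually in Python rather than schema XML (all names here are stand-ins):

def analyze(text, tokenize, filters):
    # one tokenizer, then each filter transforms the token stream in turn
    tokens = tokenize(text)
    for f in filters:
        tokens = f(tokens)
    return list(tokens)

lowercase = lambda toks: (t.lower() for t in toks)
edge_ngrams = lambda toks: (t[:n] for t in toks for n in range(2, len(t) + 1))

print(analyze("Hello World", str.split, [lowercase, edge_ngrams]))
# ['he', 'hel', 'hell', 'hello', 'wo', 'wor', 'worl', 'world']
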
11 votes, 3 answers

get indices of original text from nltk word_tokenize

I am tokenizing a text using nltk.word_tokenize and I would like to also get the index in the original raw text of the first character of every token, i.e. import nltk x = 'hello world' tokens = nltk.word_tokenize(x) >>> ['hello', 'world'] How can…
genekogan
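
One workable approach is to walk the original string with str.find, since word_tokenize does not report offsets itself. A sketch (assumes NLTK's punkt data is installed; note word_tokenize rewrites some tokens, e.g. quotes, so those need special handling):

import nltk

def token_spans(text):
    offset = 0
    for token in nltk.word_tokenize(text):
        # find the token in the raw text, starting after the previous one
        offset = text.find(token, offset)
        yield token, offset, offset + len(token)
        offset += len(token)

print(list(token_spans("hello world")))
# [('hello', 0, 5), ('world', 6, 11)]
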
11 votes, 5 answers

How to convert CSV to a table in Oracle

How can I make a package that returns results in table format when passed CSV values? select * from table(schema.mypackage.myfunction('one, two, three')) should return one, two, three as rows. I tried something from Ask Tom, but that only works with sql…
Mehur
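
The standard Oracle answer is a PL/SQL pipelined table function that loops over the string and PIPEs one row per value, which select * from table(...) can then consume. The row-at-a-time idea is sketched below in Python, with a generator standing in for the pipelined function (illustration only, not Oracle syntax):

def csv_to_rows(csv_string):
    # yield one "row" per value, the way a pipelined function PIPEs rows
    for value in csv_string.split(","):
        yield value.strip()

print(list(csv_to_rows("one, two, three")))
# ['one', 'two', 'three']
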
11 votes, 4 answers

How to build a parse tree of a mathematical expression?

I'm learning how to write tokenizers and parsers, and as an exercise I'm writing a calculator in JavaScript. I'm using a parse tree approach (I hope I got this term right) to build my calculator. I'm building a tree of tokens based on operator…
bodacydo
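
The usual shape is a recursive-descent parser with one function per precedence level, each returning a subtree. A compact sketch in Python (the question's calculator is in JavaScript, but the structure carries over directly; trees are nested tuples here for brevity):

import re

def parse(expr):
    tokens = re.findall(r"\d+|[()+\-*/]", expr)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take():
        nonlocal pos
        pos += 1
        return tokens[pos - 1]

    def expression():              # expression := term (("+" | "-") term)*
        node = term()
        while peek() in ("+", "-"):
            node = (take(), node, term())
        return node

    def term():                    # term := factor (("*" | "/") factor)*
        node = factor()
        while peek() in ("*", "/"):
            node = (take(), node, factor())
        return node

    def factor():                  # factor := NUMBER | "(" expression ")"
        if peek() == "(":
            take()                 # consume "("
            node = expression()
            take()                 # consume ")"
            return node
        return int(take())

    return expression()

print(parse("1+2*(3-4)"))
# ('+', 1, ('*', 2, ('-', 3, 4)))
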
11 votes, 4 answers

Google-like search query tokenization & string splitting

I'm looking to tokenize a search query similar to how Google does it. For instance, if I have the following search query: the quick "brown fox" jumps over the "lazy dog" I would like to have a string array with the following…
jamesaharvey
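
Python's shlex does exactly this quoted-phrase splitting out of the box; in other languages the same behavior falls out of a regex that alternates quoted runs and bare words. A sketch using the query from the question:

import shlex

query = 'the quick "brown fox" jumps over the "lazy dog"'
print(shlex.split(query))
# ['the', 'quick', 'brown fox', 'jumps', 'over', 'the', 'lazy dog']
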
11 votes, 4 answers

Using boost::tokenizer with string delimiters

I've been looking at boost::tokenizer, and I've found that the documentation is very thin. Is it possible to make it tokenize a string such as "dolphin--monkey--baboon" and make every word a token, as well as every double dash a token? From the…
Martin
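
With boost::tokenizer this is awkward, but the desired output (the words and each -- separator as tokens) is easy to state with a splitting regex whose capturing group keeps the delimiter. Illustrated in Python, purely to pin down the expected result:

import re

s = "dolphin--monkey--baboon"
# the capturing group makes re.split keep each delimiter as a token
print(re.split(r"(--)", s))
# ['dolphin', '--', 'monkey', '--', 'baboon']
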
11 votes, 1 answer

Is there a way to boost the original term more while using Solr synonyms?

For example, I have the synonyms laptop,netbook,notebook in index_synonyms.txt. When a user searches for netbook, I want to boost the original term more than the ones expanded by synonyms. Is there a way to specify this in SynonymFilterFactory? For example, use the original term…
yura
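
SynonymFilterFactory itself has no per-term boost, so a usual workaround is query-time expansion: leave the query unexpanded by the filter and build one where the user's original term carries a higher boost than its synonyms. A sketch of that query construction in Python (the synonym map and boost value are illustrative):

SYNONYMS = {"netbook": ["laptop", "notebook"]}

def expand_query(term, boost=2.0):
    # original term boosted, synonyms added at default weight
    return " OR ".join([f"{term}^{boost}"] + SYNONYMS.get(term, []))

print(expand_query("netbook"))
# netbook^2.0 OR laptop OR notebook
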
10 votes, 2 answers

Boost::tokenizer comma separated (C++)

Should be an easy one for you guys..... I'm playing around with tokenizers using Boost and I want to create a token that is comma separated. Here is my code: string s = "this is, , , a test"; boost::char_delimiters_separator
Lexicon
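
In Boost, the comma-separated behavior (including the empty tokens between adjacent commas) comes from boost::char_separator with keep_empty_tokens; char_delimiters_separator is the older, deprecated interface. The intended result, shown in Python for comparison with the question's sample string:

s = "this is, , , a test"
tokens = [t.strip() for t in s.split(",")]
print(tokens)
# ['this is', '', '', 'a test']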