Questions tagged [tokenize]

Tokenizing is the act of splitting a stream of text into discrete elements called tokens using a delimiter present in the stream. These tokens can then be processed further, for example to search for a value or assign to an array for looping.

Example (VBA):

Dim tokens As Variant
Dim sampleString As String
Dim i As Long

sampleString = "The quick brown fox jumps over the lazy dog."

' tokenize string based on space delimiter
tokens = Split(sampleString, " ")

' list tokens
For i = LBound(tokens) To UBound(tokens)
  MsgBox tokens(i)
Next i

2964 questions
17 votes · 1 answer

How to avoid NLTK's sentence tokenizer splitting on abbreviations?

I'm currently using NLTK for language processing, but I have encountered a problem of sentence tokenizing. Here's the problem: Assume I have a sentence: "Fig. 2 shows a U.S.A. map." When I use punkt tokenizer, my code looks like this: from…
joe wong · 453 · 2 · 9 · 24
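NLTK's punkt tokenizer can be told about abbreviations (via `PunktParameters().abbrev_types`). As a library-free illustration of the same idea, here is a stdlib sketch that protects an assumed abbreviation list before splitting; the list and placeholder scheme are illustrative, not NLTK's internals:

```python
import re

# Not NLTK itself: protect the dots inside known abbreviations with a
# placeholder, split on sentence-final punctuation, then restore them.
ABBREVIATIONS = ["U.S.A.", "Fig.", "Dr."]  # illustrative, not exhaustive

def split_sentences(text):
    for i, abbr in enumerate(ABBREVIATIONS):
        text = text.replace(abbr, abbr.replace(".", f"\x00{i}\x00"))
    sentences = re.split(r"(?<=[.!?])\s+", text)
    restored = []
    for s in sentences:
        for i, abbr in enumerate(ABBREVIATIONS):
            s = s.replace(abbr.replace(".", f"\x00{i}\x00"), abbr)
        restored.append(s)
    return restored
```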
17 votes · 7 answers

Tokenizer for full-text

This should be an ideal case of not re-inventing the wheel, but so far my search has been in vain. Instead of writing one myself, I would like to use an existing C++ tokenizer. The tokens are to be used in an index for full text searching. …
Rabbit · 1,741 · 2 · 18 · 27
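Whatever the implementation language, the usual full-text indexing recipe is: lowercase, extract word runs, drop stop words. A Python sketch of that pipeline (the question asks for C++; the stop-word list here is illustrative):

```python
import re

STOP_WORDS = {"the", "a", "an", "of", "to"}  # illustrative list

def index_tokens(text):
    # Lowercase, pull out alphanumeric runs, filter stop words.
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [w for w in words if w not in STOP_WORDS]
```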
17 votes · 5 answers

tokenize a string keeping delimiters in Python

Is there any equivalent to str.split in Python that also returns the delimiters? I need to preserve the whitespace layout for my output after processing some of the tokens. Example: >>> s="\tthis is an example" >>> print s.split() ['this', 'is',…
fortran · 74,053 · 25 · 135 · 175
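In Python, `re.split` keeps the delimiters whenever the pattern contains a capturing group, which answers this directly:

```python
import re

# A capturing group makes re.split emit the delimiters too, so the
# original string can be reassembled after processing the tokens.
s = "\tthis is an example"
parts = re.split(r"(\s+)", s)
assert "".join(parts) == s  # whitespace layout is preserved
```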
16 votes · 10 answers

Have you ever effectively used lexer/parser in real world application?

Recently, I've started learning ANTLR. I know that lexers/parsers together can be used to construct programming languages. Other than DSLs or programming languages, have you ever directly or indirectly used lexer/parser tools (and knowledge) to…
16 votes · 4 answers

Is there a bi gram or tri gram feature in Spacy?

The below code breaks the sentence into individual tokens and the output is as below "cloud" "computing" "is" "benefiting" " major" "manufacturing" "companies" import en_core_web_sm nlp = en_core_web_sm.load() doc = nlp("Cloud computing is…
venkatttaknev · 669 · 1 · 7 · 21
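spaCy's core pipeline does not expose an n-gram iterator; bigrams and trigrams are typically built from its tokens with a sliding window (helper libraries such as textacy or NLTK also offer this). A stdlib sketch:

```python
def ngrams(tokens, n):
    # Zip each token with the n-1 tokens that follow it.
    return list(zip(*(tokens[i:] for i in range(n))))

tokens = ["cloud", "computing", "is", "benefiting"]
bigrams = ngrams(tokens, 2)
```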
16 votes · 1 answer

Spacy custom tokenizer to include only hyphen words as tokens using Infix regex

I want to include hyphenated words for example: long-term, self-esteem, etc. as a single token in Spacy. After looking at some similar posts on StackOverflow, Github, its documentation and elsewhere, I also wrote a custom tokenizer as below: import…
Vishal · 227 · 1 · 2 · 8
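In spaCy the fix is to rebuild the `Tokenizer` with an infix pattern that no longer splits on hyphens between letters; outside spaCy, the intended result can be sketched with a plain regex:

```python
import re

def tokenize(text):
    # A hyphenated run of word characters counts as one token;
    # otherwise fall back to plain word runs.
    return re.findall(r"\w+(?:-\w+)+|\w+", text)
```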
16 votes · 9 answers

How can I split a string of a mathematical expressions in python?

I made a program which converts infix to postfix in Python. The problem is when I introduce the arguments. If I introduce something like this: (this will be a string) ( ( 73 + ( ( 34 - 72 ) / ( 33 - 3 ) ) ) + ( 56 + ( 95 - 28 ) ) ) it will split it…
Fernaku · 173 · 1 · 1 · 5
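Rather than requiring spaces between symbols, a single regex pass can split numbers, parentheses, and operators directly:

```python
import re

def tokenize_expr(expr):
    # A number is one token; every parenthesis or operator is its own token.
    return re.findall(r"\d+|[()+\-*/]", expr)
```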
16 votes · 3 answers

Pass tokens to CountVectorizer

I have a text classification problem where I have two types of features: features which are n-grams (extracted by CountVectorizer) and other textual features (e.g. presence of a word from a given lexicon). These features are different from n-grams…
Yonanam · 321 · 1 · 3 · 6
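`CountVectorizer` accepts a callable for its `analyzer` parameter, so pre-built token lists can be passed through untouched (e.g. `CountVectorizer(analyzer=lambda doc: doc)`). A stdlib sketch of that identity-analyzer idea, without scikit-learn:

```python
from collections import Counter

def count_features(pretokenized_docs):
    # Each document is already a list of tokens; just count them,
    # skipping any re-tokenization step.
    return [Counter(tokens) for tokens in pretokenized_docs]
```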
16 votes · 1 answer

How do I parse basic arithmetic (eg "5+5") using a simple recursive descent parser in C++?

This has been on my mind for a while. I'm intrigued by recursive descent parsers, and would like to know how to implement one. What I want is a simple parser that will understand simple arithmetic such as "5+5", or "(5+5)*3". I figure the first step…
user377628
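Sketched in Python rather than C++, a minimal recursive descent evaluator for this grammar (the classic expr/term/factor layering, with no error handling) looks like:

```python
import re

# Grammar:
#   expr   := term (('+'|'-') term)*
#   term   := factor (('*'|'/') factor)*
#   factor := NUMBER | '(' expr ')'
def evaluate(src):
    tokens = re.findall(r"\d+|[()+\-*/]", src)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def eat():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        return tok

    def expr():
        value = term()
        while peek() in ("+", "-"):
            if eat() == "+":
                value += term()
            else:
                value -= term()
        return value

    def term():
        value = factor()
        while peek() in ("*", "/"):
            if eat() == "*":
                value *= factor()
            else:
                value /= factor()
        return value

    def factor():
        if peek() == "(":
            eat()               # consume '('
            value = expr()
            eat()               # consume ')'
            return value
        return int(eat())

    return expr()
```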
15 votes · 7 answers

Split a string into an array in C++

Possible Duplicate: How to split a string in C++? I have an input file of data and each line is an entry. In each line, each "field" is separated by a white space " ", so I need to split the line by space. Other languages have a function called…
Ahoura Ghotbi · 2,866 · 12 · 36 · 65
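In C++ the idiomatic answer is `std::istringstream` with `>>` (or `std::getline` with a delimiter); in Python the same job is a one-liner, since `str.split()` with no argument splits on any whitespace run and drops empty fields:

```python
# Split on runs of whitespace; consecutive spaces do not produce
# empty fields when no separator argument is given.
line = "field1  field2 field3"
fields = line.split()
```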
15 votes · 1 answer

Why is n+++n valid while n++++n is not?

In Java, the expression: n+++n Appears to evaluate as equivalent to: n++ + n Despite the fact that +n is a valid unary operator with higher precedence than the arithmetic + operator in n + n. So the compiler appears to be assuming that the…
Trevor Freeman · 7,112 · 2 · 21 · 40
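The behaviour comes from the lexer's "maximal munch" rule: it always grabs the longest token it can, so `+++` lexes as `++ +` but `++++` lexes as `++ ++`, and `(n++)++` is not a valid assignment target. A toy lexer (illustrative, not Java's real one) showing the rule:

```python
import re

def lex(src):
    # '++' is listed before '+' in the pattern, so the longer
    # operator wins at each position (maximal munch).
    return re.findall(r"\+\+|\+|\w+", src)
```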
15 votes · 4 answers

Int tokenizer

I know there are string tokenizers but is there an "int tokenizer"? For example, I want to split the string "12 34 46" and have: list[0]=12 list[1]=34 list[2]=46 In particular, I'm wondering if Boost::Tokenizer does this. Although I couldn't find…
Steve · 11,831 · 14 · 51 · 63
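Boost.Tokenizer itself yields strings, which then need a conversion such as `boost::lexical_cast<int>`; in Python the split and the conversion collapse into one comprehension:

```python
def int_tokens(s):
    # int() raises ValueError on malformed input instead of
    # silently skipping it.
    return [int(tok) for tok in s.split()]
```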
15 votes · 6 answers

Java StringTokenizer.nextToken() skips over empty fields

I am using a tab (\t) as delimiter and I know there are some empty fields in my data, e.g.: one->two->->three where -> equals the tab. As you can see, an empty field is still correctly surrounded by tabs. Data is collected using a loop: while…
FireFox · 472 · 2 · 4 · 14
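Unlike `java.util.StringTokenizer`, Java's `String.split` — and `str.split` with an explicit separator in Python — keeps the empty fields between consecutive delimiters:

```python
# With an explicit separator, consecutive tabs produce an empty
# string for the empty field rather than being collapsed.
fields = "one\ttwo\t\tthree".split("\t")
```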
15 votes · 3 answers

shlex alternative for Java

Is there a shlex alternative for Java? I'd like to be able to split quote-delimited strings like the shell would process them. For example, if I'd send: one two "three four" and perform a split, I'd like to receive the tokens "one", "two" and "three four".
Geo · 93,257 · 117 · 344 · 520
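Python's own `shlex.split` is exactly the behaviour being asked for on the Java side — quoted substrings survive as single tokens:

```python
import shlex

# Shell-style splitting: whitespace separates tokens, but quotes
# group words into one token.
tokens = shlex.split('one two "three four"')
```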
14 votes · 3 answers

Replacing all tokens based on properties file with ANT

I'm pretty sure this is a simple question to answer and I've seen it asked before, just with no solid answers. I have several properties files that are used for different environments, i.e. xxxx-dev, xxxx-test, xxxx-live. The properties files contain…
Grofit · 17,693 · 24 · 96 · 176
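What ANT's `<replacetokens>` filter does can be sketched in a few lines of Python: read `key=value` properties, then substitute `@key@` markers in a template. The property names here are made up for illustration:

```python
import re

def load_properties(text):
    # Minimal .properties reader: skip blanks and '#' comments,
    # split each remaining line on the first '='.
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#") and "=" in line:
            key, _, value = line.partition("=")
            props[key.strip()] = value.strip()
    return props

def replace_tokens(template, props):
    # Substitute @key@ markers; unknown tokens are left untouched.
    return re.sub(r"@(\w+)@",
                  lambda m: props.get(m.group(1), m.group(0)),
                  template)
```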