Questions tagged [tokenize]

Tokenizing is the act of splitting a stream of text into discrete elements called tokens using a delimiter present in the stream. These tokens can then be processed further, for example to search for a value or assign to an array for looping.

Example (VBA):

Dim tokens As Variant
Dim sampleString As String
Dim i As Long

sampleString = "The quick brown fox jumps over the lazy dog."

' tokenize string based on space delimiter
tokens = Split(sampleString, " ")

' list tokens
For i = LBound(tokens) To UBound(tokens)
  MsgBox tokens(i)
Next i

2964 questions
17 votes · 1 answer

How to avoid NLTK's sentence tokenizer splitting on abbreviations?

I'm currently using NLTK for language processing, but I have encountered a problem of sentence tokenizing. Here's the problem: Assume I have a sentence: "Fig. 2 shows a U.S.A. map." When I use punkt tokenizer, my code looks like this: from…
joe wong · 453 · 2 · 9 · 24
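NLTK's punkt tokenizer can be told about abbreviations (via `PunktParameters().abbrev_types`). As a library-free illustration of the same idea, here is a stdlib sketch that protects an assumed abbreviation list before splitting; the list and placeholder scheme are illustrative, not NLTK's internals:

```python
import re

# Not NLTK itself: protect the dots inside known abbreviations with a
# placeholder, split on sentence-final punctuation, then restore them.
ABBREVIATIONS = ["U.S.A.", "Fig.", "Dr."]  # illustrative, not exhaustive

def split_sentences(text):
    for i, abbr in enumerate(ABBREVIATIONS):
        text = text.replace(abbr, abbr.replace(".", f"\x00{i}\x00"))
    sentences = re.split(r"(?<=[.!?])\s+", text)
    restored = []
    for s in sentences:
        for i, abbr in enumerate(ABBREVIATIONS):
            s = s.replace(abbr.replace(".", f"\x00{i}\x00"), abbr)
        restored.append(s)
    return restored
```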
17 votes · 7 answers

Tokenizer for full-text

This should be an ideal case of not re-inventing the wheel, but so far my search has been in vain. Instead of writing one myself, I would like to use an existing C++ tokenizer. The tokens are to be used in an index for full text searching. …
Rabbit · 1,741 · 2 · 18 · 27
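Whatever the implementation language, the usual full-text indexing recipe is: lowercase, extract word runs, drop stop words. A Python sketch of that pipeline (the question asks for C++; the stop-word list here is illustrative):

```python
import re

STOP_WORDS = {"the", "a", "an", "of", "to"}  # illustrative list

def index_tokens(text):
    # Lowercase, pull out alphanumeric runs, filter stop words.
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [w for w in words if w not in STOP_WORDS]
```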
17 votes · 5 answers

tokenize a string keeping delimiters in Python

Is there any equivalent to str.split in Python that also returns the delimiters? I need to preserve the whitespace layout for my output after processing some of the tokens. Example: >>> s="\tthis is an example" >>> print s.split() ['this', 'is',…
fortran · 74,053 · 25 · 135 · 175
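In Python, `re.split` keeps the delimiters whenever the pattern contains a capturing group, which answers this directly:

```python
import re

# A capturing group makes re.split emit the delimiters too, so the
# original string can be reassembled after processing the tokens.
s = "\tthis is an example"
parts = re.split(r"(\s+)", s)
assert "".join(parts) == s  # whitespace layout is preserved
```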
16 votes · 10 answers

Have you ever effectively used lexer/parser in real world application?

Recently, I've started learning ANTLR. I know that lexers/parsers together can be used to construct programming languages. Other than DSLs or programming languages, have you ever directly or indirectly used lexer/parser tools (and knowledge) to…
16 votes · 4 answers

Is there a bi gram or tri gram feature in Spacy?

The below code breaks the sentence into individual tokens and the output is as below "cloud" "computing" "is" "benefiting" " major" "manufacturing" "companies" import en_core_web_sm nlp = en_core_web_sm.load() doc = nlp("Cloud computing is…
venkatttaknev · 669 · 1 · 7 · 21
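spaCy's core pipeline does not expose an n-gram iterator; bigrams and trigrams are typically built from its tokens with a sliding window (helper libraries such as textacy or NLTK also offer this). A stdlib sketch:

```python
def ngrams(tokens, n):
    # Zip each token with the n-1 tokens that follow it.
    return list(zip(*(tokens[i:] for i in range(n))))

tokens = ["cloud", "computing", "is", "benefiting"]
bigrams = ngrams(tokens, 2)
```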
16 votes · 1 answer

Spacy custom tokenizer to include only hyphen words as tokens using Infix regex

I want to include hyphenated words for example: long-term, self-esteem, etc. as a single token in Spacy. After looking at some similar posts on StackOverflow, Github, its documentation and elsewhere, I also wrote a custom tokenizer as below: import…
Vishal · 227 · 1 · 2 · 8
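In spaCy the fix is to rebuild the `Tokenizer` with an infix pattern that no longer splits on hyphens between letters; outside spaCy, the intended result can be sketched with a plain regex:

```python
import re

def tokenize(text):
    # A hyphenated run of word characters counts as one token;
    # otherwise fall back to plain word runs.
    return re.findall(r"\w+(?:-\w+)+|\w+", text)
```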
16 votes · 9 answers

How can I split a string of a mathematical expressions in python?

I made a program which converts infix to postfix in Python. The problem is when I introduce the arguments. If I introduce something like this: (this will be a string) ( ( 73 + ( ( 34 - 72 ) / ( 33 - 3 ) ) ) + ( 56 + ( 95 - 28 ) ) ) it will split it…
Fernaku · 173 · 1 · 1 · 5
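Rather than requiring spaces between symbols, a single regex pass can split numbers, parentheses, and operators directly:

```python
import re

def tokenize_expr(expr):
    # A number is one token; every parenthesis or operator is its own token.
    return re.findall(r"\d+|[()+\-*/]", expr)
```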
16 votes · 3 answers

Pass tokens to CountVectorizer

I have a text classification problem where I have two types of features: features which are n-grams (extracted by CountVectorizer) and other textual features (e.g. presence of a word from a given lexicon). These features are different from n-grams…
Yonanam · 321 · 1 · 3 · 6
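`CountVectorizer` accepts a callable for its `analyzer` parameter, so pre-built token lists can be passed through untouched (e.g. `CountVectorizer(analyzer=lambda doc: doc)`). A stdlib sketch of that identity-analyzer idea, without scikit-learn:

```python
from collections import Counter

def count_features(pretokenized_docs):
    # Each document is already a list of tokens; just count them,
    # skipping any re-tokenization step.
    return [Counter(tokens) for tokens in pretokenized_docs]
```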
16 votes · 1 answer

How do I parse basic arithmetic (eg "5+5") using a simple recursive descent parser in C++?

This has been on my mind for a while. I'm intrigued by recursive descent parsers, and would like to know how to implement one. What I want is a simple parser that will understand simple arithmetic such as "5+5", or "(5+5)*3". I figure the first step…
user377628
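Sketched in Python rather than C++, a minimal recursive descent evaluator for this grammar (the classic expr/term/factor layering, with no error handling) looks like:

```python
import re

# Grammar:
#   expr   := term (('+'|'-') term)*
#   term   := factor (('*'|'/') factor)*
#   factor := NUMBER | '(' expr ')'
def evaluate(src):
    tokens = re.findall(r"\d+|[()+\-*/]", src)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def eat():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        return tok

    def expr():
        value = term()
        while peek() in ("+", "-"):
            if eat() == "+":
                value += term()
            else:
                value -= term()
        return value

    def term():
        value = factor()
        while peek() in ("*", "/"):
            if eat() == "*":
                value *= factor()
            else:
                value /= factor()
        return value

    def factor():
        if peek() == "(":
            eat()               # consume '('
            value = expr()
            eat()               # consume ')'
            return value
        return int(eat())

    return expr()
```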
15 votes · 7 answers

Split a string into an array in C++

Possible Duplicate: How to split a string in C++? I have an input file of data and each line is an entry. In each line, each "field" is separated by a white space " ", so I need to split the line by space. Other languages have a function called…
Ahoura Ghotbi · 2,866 · 12 · 36 · 65
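In C++ the idiomatic answer is `std::istringstream` with `>>` (or `std::getline` with a delimiter); in Python the same job is a one-liner, since `str.split()` with no argument splits on any whitespace run and drops empty fields:

```python
# Split on runs of whitespace; consecutive spaces do not produce
# empty fields when no separator argument is given.
line = "field1  field2 field3"
fields = line.split()
```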
15 votes · 1 answer

Why is n+++n valid while n++++n is not?

In Java, the expression: n+++n Appears to evaluate as equivalent to: n++ + n Despite the fact that +n is a valid unary operator with higher precedence than the arithmetic + operator in n + n. So the compiler appears to be assuming that the…
Trevor Freeman · 7,112 · 2 · 21 · 40
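The behaviour comes from the lexer's "maximal munch" rule: it always grabs the longest token it can, so `+++` lexes as `++ +` but `++++` lexes as `++ ++`, and `(n++)++` is not a valid assignment target. A toy lexer (illustrative, not Java's real one) showing the rule:

```python
import re

def lex(src):
    # '++' is listed before '+' in the pattern, so the longer
    # operator wins at each position (maximal munch).
    return re.findall(r"\+\+|\+|\w+", src)
```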
15 votes · 4 answers

Int tokenizer

I know there are string tokenizers but is there an "int tokenizer"? For example, I want to split the string "12 34 46" and have: list[0]=12 list[1]=34 list[2]=46 In particular, I'm wondering if Boost::Tokenizer does this. Although I couldn't find…
Steve · 11,831 · 14 · 51 · 63
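Boost.Tokenizer itself yields strings, which then need a conversion such as `boost::lexical_cast<int>`; in Python the split and the conversion collapse into one comprehension:

```python
def int_tokens(s):
    # int() raises ValueError on malformed input instead of
    # silently skipping it.
    return [int(tok) for tok in s.split()]
```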
15 votes · 6 answers

Java StringTokenizer.nextToken() skips over empty fields

I am using a tab (\t) as delimiter and I know there are some empty fields in my data, e.g.: one->two->->three where -> equals the tab. As you can see, an empty field is still correctly surrounded by tabs. Data is collected using a loop: while…
FireFox · 472 · 2 · 4 · 14
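Unlike `java.util.StringTokenizer`, Java's `String.split` — and `str.split` with an explicit separator in Python — keeps the empty fields between consecutive delimiters:

```python
# With an explicit separator, consecutive tabs produce an empty
# string for the empty field rather than being collapsed.
fields = "one\ttwo\t\tthree".split("\t")
```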
15 votes · 3 answers

shlex alternative for Java

Is there a shlex alternative for Java? I'd like to be able to split quote-delimited strings like the shell would process them. For example, if I'd send: one two "three four" and perform a split, I'd like to receive the tokens "one", "two" and "three four".
Geo · 93,257 · 117 · 344 · 520
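Python's own `shlex.split` is exactly the behaviour being asked for on the Java side — quoted substrings survive as single tokens:

```python
import shlex

# Shell-style splitting: whitespace separates tokens, but quotes
# group words into one token.
tokens = shlex.split('one two "three four"')
```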
14 votes · 3 answers

Replacing all tokens based on properties file with ANT

I'm pretty sure this is a simple question to answer and I've seen it asked before, just with no solid answers. I have several properties files that are used for different environments, i.e. xxxx-dev, xxxx-test, xxxx-live. The properties files contain…
Grofit · 17,693 · 24 · 96 · 176
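What ANT's `<replacetokens>` filter does can be sketched in a few lines of Python: read `key=value` properties, then substitute `@key@` markers in a template. The property names here are made up for illustration:

```python
import re

def load_properties(text):
    # Minimal .properties reader: skip blanks and '#' comments,
    # split each remaining line on the first '='.
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#") and "=" in line:
            key, _, value = line.partition("=")
            props[key.strip()] = value.strip()
    return props

def replace_tokens(template, props):
    # Substitute @key@ markers; unknown tokens are left untouched.
    return re.sub(r"@(\w+)@",
                  lambda m: props.get(m.group(1), m.group(0)),
                  template)
```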