Questions tagged [tokenize]

Tokenizing is the act of splitting a stream of text into discrete elements, called tokens, using a delimiter present in the stream. The tokens can then be processed further, for example to search for a value or to assign them to an array for looping.

Example (VBA):

Dim tokens As Variant
Dim sampleString As String
Dim i As Long

sampleString = "The quick brown fox jumps over the lazy dog."

' tokenize string based on space delimiter
tokens = Split(sampleString, " ")

' list tokens
For i = LBound(tokens) To UBound(tokens)
  MsgBox tokens(i)
Next i

2964 questions
7
votes
1 answer

Lucene standard analyzer split on period

How do I make Lucene's Standard Analyzer tokenize on the '.' char? For example, on querying for "B" I need it to return the B in "A.B.C" as the result. I need to treat numbers the way the standard analyzer treats them, and hence the Simple analyzer is not…
Nacha
  • 107
  • 1
  • 5
7
votes
3 answers

Indexing and Querying URLS in Solr

I have a database of URLs that I would like to search. Because URLs are not always written the same way (they may or may not have www), I am looking for the correct way to index and query URLs. I've tried a few things, and I think I'm close but not sure why…
KidA78
  • 81
  • 2
  • 3
7
votes
1 answer

boost::split pushes an empty string to the vector even with token_compress_on

When the input string is blank, boost::split returns a vector with one empty string in it. Is it possible to have boost::split return an empty vector instead? MCVE: #include #include #include int…
rustyx
  • 80,671
  • 25
  • 200
  • 267
7
votes
2 answers

Tokenizing texts in both Chinese and English improperly splits English words into letters

When tokenizing texts that contain both Chinese and English, the result will split English words into letters, which is not what I want. Consider the following code: from nltk.tokenize.stanford_segmenter import StanfordSegmenter segmenter =…
yhylord
  • 430
  • 4
  • 13
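
The excerpt above is truncated before the segmenter call, so as a rough illustration of the desired behaviour (English words kept whole rather than split into letters), here is a naive regex-based sketch in Python. It is a stand-in for demonstration only, not the Stanford segmenter: it simply emits each CJK character as its own token, and the sample sentence is invented.

import re

# Keep runs of ASCII letters/digits together; emit each CJK character separately.
# A naive illustration of the expected output, not the Stanford segmenter.
MIXED_TOKEN = re.compile(r"[A-Za-z0-9]+|[\u4e00-\u9fff]")

def tokenize_mixed(text):
    return MIXED_TOKEN.findall(text)

print(tokenize_mixed("我喜欢 machine learning 和 Python"))
# ['我', '喜', '欢', 'machine', 'learning', '和', 'Python']
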
7
votes
2 answers

Tokenize by using regular expressions (parenthesis)

I have the following text: I don't like to eat Cici's food (it is true) I need to tokenize it to ['i', 'don't', 'like', 'to', 'eat', 'Cici's', 'food', '(', 'it', 'is', 'true', ')'] I have found that the following regular expression…
Jürgen K.
  • 3,427
  • 9
  • 30
  • 66
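
For the regex-tokenization question above, a minimal sketch of one possible pattern using Python's re module; the pattern is an assumption (the expression in the excerpt is truncated), chosen to keep apostrophes inside words and to emit parentheses as separate tokens.

import re

text = "I don't like to eat Cici's food (it is true)"

# Words may contain internal apostrophes; '(' and ')' become their own tokens.
tokens = re.findall(r"\w+(?:'\w+)*|[()]", text)
print(tokens)
# ['I', "don't", 'like', 'to', 'eat', "Cici's", 'food', '(', 'it', 'is', 'true', ')']
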
7
votes
2 answers

which tokenizer is better to be used with nltk

I have started learning nltk and am following this tutorial. First we use the built-in sent_tokenize, and later we use PunktSentenceTokenizer. The tutorial mentions that PunktSentenceTokenizer is capable of unsupervised machine…
Riken Shah
  • 3,022
  • 5
  • 29
  • 56
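
For the comparison above, a small sketch showing both tokenizers side by side; the sentences are invented, and sent_tokenize assumes NLTK's 'punkt' data package has been downloaded.

from nltk.tokenize import sent_tokenize, PunktSentenceTokenizer

train_text = "Dr. Smith went home. He was tired. The meeting ended at 5 p.m. sharp."
sample_text = "Mr. Jones arrived late. Everyone else had left."

# Built-in tokenizer backed by the pre-trained Punkt model.
print(sent_tokenize(sample_text))

# PunktSentenceTokenizer trained (unsupervised) on our own raw text.
custom_tokenizer = PunktSentenceTokenizer(train_text)
print(custom_tokenizer.tokenize(sample_text))
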
7
votes
1 answer

sqlite-fts3: custom tokenizer?

Does anyone here have experience with writing custom FTS3 (the full-text-search extension) tokenizers? I'm looking for a tokenizer that will ignore HTML tags. Thanks.
noamtm
  • 12,435
  • 15
  • 71
  • 107
7
votes
5 answers

String tokenizer for CPP String?

I want to use a string tokenizer for C++ std::string, but all I could find works on char*. Is there anything similar for C++ strings?
Scarlet
  • 271
  • 2
  • 12
7
votes
4 answers

c++ what is the advantage of lex and bison to a selfmade tokenizer / parser

I would like to do some parsing and tokenizing in C++ for learning purposes. I have often come across bison/yacc and lex when reading about this subject online. Would there be any major benefit of using those over, for instance, a…
moka
  • 4,353
  • 2
  • 37
  • 63
7
votes
2 answers

splitting a string but keeping empty tokens c++

I am trying to split a string and put it into a vector; however, I also want to keep an empty token whenever there are consecutive delimiters. For example: string mystring = "::aa;;bb;cc;;c" I would like to tokenize this string on the :; delimiters but…
XDProgrammer
  • 853
  • 2
  • 14
  • 31
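
The question above is about C++, but for comparison, the behaviour it asks for (keeping empty tokens between consecutive delimiters) is what Python's re.split produces; a short illustration:

import re

mystring = "::aa;;bb;cc;;c"

# Empty strings appear wherever two delimiters are adjacent (or at the start).
print(re.split(r"[:;]", mystring))
# ['', '', 'aa', '', 'bb', 'cc', '', 'c']
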
7
votes
3 answers

Splitting chinese document into sentences

I have to split Chinese text into multiple sentences. I tried the Stanford DocumentPreProcessor. It worked quite well for English but not for Chinese. Can you please let me know of any good sentence splitters for Chinese, preferably in Java or Python?
pjesudhas
  • 399
  • 4
  • 13
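
For the Chinese sentence-splitting question above, a rough regex-based sketch in Python; it is not the Stanford DocumentPreprocessor, it simply breaks after the full-width terminators 。！？…, and it needs Python 3.7+ because the split pattern matches an empty string. The sample text is invented.

import re

def split_chinese_sentences(text):
    # Split immediately after a full-width sentence terminator, keeping it attached.
    pieces = re.split(r"(?<=[。！？…])", text)
    return [p.strip() for p in pieces if p.strip()]

print(split_chinese_sentences("今天天气很好。我们去公园吧！你想一起来吗？"))
# ['今天天气很好。', '我们去公园吧！', '你想一起来吗？']
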
7
votes
6 answers

Tokenizing Twitter Posts in Lucene

My question in a nutshell: Does anyone know of a TwitterAnalyzer or TwitterTokenizer for Lucene? More detailed version: I want to index a number of tweets in Lucene and keep the terms like @user or #hashtag intact. StandardTokenizer does not work…
Ruggiero Spearman
  • 6,735
  • 5
  • 26
  • 37
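
The question above asks specifically about Lucene (Java); purely as an illustration of the tokenization behaviour wanted, NLTK's TweetTokenizer in Python keeps @user and #hashtag terms intact:

from nltk.tokenize import TweetTokenizer

tknzr = TweetTokenizer()

# @mentions, #hashtags and URLs survive as single tokens.
print(tknzr.tokenize("@user check out #lucene and #search at https://example.com"))
# e.g. ['@user', 'check', 'out', '#lucene', 'and', '#search', 'at', 'https://example.com']
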
7
votes
3 answers

Python 2 newline tokens in tokenize module

I am using the tokenize module in Python and wonder why there are two different newline tokens: NEWLINE = 4 and NL = 54. Any examples of code that would produce both tokens would be appreciated.
baallezx
  • 471
  • 3
  • 14
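
For the question above, a small sketch that produces both token types; it is written for Python 3, but the NEWLINE/NL distinction is the same in Python 2: NEWLINE ends a logical line of code, while NL is emitted for blank lines and for newlines that do not end a logical line (e.g. inside brackets).

import io
import tokenize
from token import tok_name

source = "x = 1\n\ny = (2 +\n     3)\n"

for tok in tokenize.generate_tokens(io.StringIO(source).readline):
    if tok_name[tok.type] in ("NEWLINE", "NL"):
        print(tok_name[tok.type], repr(tok.line))

# NEWLINE 'x = 1\n'      -- end of a logical line
# NL      '\n'           -- blank line
# NL      'y = (2 +\n'   -- newline inside parentheses
# NEWLINE '     3)\n'    -- end of the logical line spanning two physical lines
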
7
votes
6 answers

Splitting strings in python

I have a string which is like this: this is [bracket test] "and quotes test " I'm trying to write something in Python to split it up on spaces while ignoring spaces within square brackets and quotes. The result I'm looking for is: ['this','is','bracket…
user31256
  • 71
  • 1
  • 2
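
For the question above, a sketch of one regex-based approach; the expected output in the excerpt is truncated, so the exact result shape (brackets and quotes stripped from the grouped tokens) is an assumption.

import re

text = 'this is [bracket test] "and quotes test "'

# Match a whole [...] group, a whole "..." group, or any run of non-space chars.
raw = re.findall(r'\[[^\]]*\]|"[^"]*"|\S+', text)

# Strip the surrounding brackets/quotes from the grouped tokens.
tokens = [t[1:-1] if t[:1] in '["' else t for t in raw]
print(tokens)
# ['this', 'is', 'bracket test', 'and quotes test ']
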
7
votes
1 answer

Stemming unstructured text in NLTK

I tried the regex stemmer, but I get hundreds of unrelated tokens. I'm just interested in the "play" stem. Here is the code I'm working with: import nltk from nltk.book import * f = open('tupac_original.txt', 'rU') text = f.read() text1 =…
user2221429
  • 71
  • 1
  • 4
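
For the stemming question above, the tupac_original.txt file is not available, so here is a minimal sketch on an invented sentence; it uses NLTK's PorterStemmer rather than the regex stemmer from the excerpt, and keeps only the words whose stem is "play". word_tokenize assumes the 'punkt' data package is installed.

import nltk
from nltk.stem import PorterStemmer

text = "He plays guitar, she played drums, and they love playing together."

stemmer = PorterStemmer()
words = nltk.word_tokenize(text)
play_words = [w for w in words if stemmer.stem(w.lower()) == "play"]
print(play_words)
# ['plays', 'played', 'playing']
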