Questions tagged [tokenize]

Tokenizing is the act of splitting a stream of text into discrete elements, called tokens, using a delimiter present in the stream. The tokens can then be processed further, for example to search for a value or to assign them to an array for looping.

Example (VBA):

Dim tokens As Variant
Dim sampleString As String
Dim i As Long

sampleString = "The quick brown fox jumps over the lazy dog."

' tokenize string based on space delimiter
tokens = Split(sampleString, " ")

' list tokens
For i = LBound(tokens) To UBound(tokens)
  MsgBox tokens(i)
Next i

2964 questions
7
votes
1 answer

Lucene standard analyzer split on period

How do I make Lucene's Standard Analyzer tokenize on the '.' char? For example, on querying for "B" I need it to return the B in "A.B.C" as the result. I need to treat numbers the way the standard analyzer treats them, and hence the Simple analyzer is not…
Nacha
  • 107
  • 1
  • 5
7
votes
3 answers

Indexing and Querying URLS in Solr

I have a database of URLs that I would like to search. Because URLs are not always written the same way (they may or may not have www), I am looking for the correct way to index and query URLs. I've tried a few things, and I think I'm close but not sure why…
KidA78
  • 81
  • 2
  • 3
7
votes
1 answer

boost::split pushes an empty string to the vector even with token_compress_on

When the input string is blank, boost::split returns a vector with one empty string in it. Is it possible to have boost::split return an empty vector instead? MCVE: #include #include #include int…
rustyx
  • 80,671
  • 25
  • 200
  • 267
7
votes
2 answers

Tokenizing texts in both Chinese and English improperly splits English words into letters

When tokenizing texts that contain both Chinese and English, the result will split English words into letters, which is not what I want. Consider the following code: from nltk.tokenize.stanford_segmenter import StanfordSegmenter segmenter =…
yhylord
  • 430
  • 4
  • 13
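
The excerpt above is truncated before the segmenter call, so as a rough illustration of the desired behaviour (English words kept whole rather than split into letters), here is a naive regex-based sketch in Python. It is a stand-in for demonstration only, not the Stanford segmenter: it simply emits each CJK character as its own token, and the sample sentence is invented.

import re

# Keep runs of ASCII letters/digits together; emit each CJK character separately.
# A naive illustration of the expected output, not the Stanford segmenter.
MIXED_TOKEN = re.compile(r"[A-Za-z0-9]+|[\u4e00-\u9fff]")

def tokenize_mixed(text):
    return MIXED_TOKEN.findall(text)

print(tokenize_mixed("我喜欢 machine learning 和 Python"))
# ['我', '喜', '欢', 'machine', 'learning', '和', 'Python']
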
7
votes
2 answers

Tokenize by using regular expressions (parenthesis)

I have the following text: I don't like to eat Cici's food (it is true) I need to tokenize it to ['i', 'don't', 'like', 'to', 'eat', 'Cici's', 'food', '(', 'it', 'is', 'true', ')'] I have found that the following regular expression…
Jürgen K.
  • 3,427
  • 9
  • 30
  • 66
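
For the regex-tokenization question above, a minimal sketch of one possible pattern using Python's re module; the pattern is an assumption (the expression in the excerpt is truncated), chosen to keep apostrophes inside words and to emit parentheses as separate tokens.

import re

text = "I don't like to eat Cici's food (it is true)"

# Words may contain internal apostrophes; '(' and ')' become their own tokens.
tokens = re.findall(r"\w+(?:'\w+)*|[()]", text)
print(tokens)
# ['I', "don't", 'like', 'to', 'eat', "Cici's", 'food', '(', 'it', 'is', 'true', ')']
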
7
votes
2 answers

which tokenizer is better to be used with nltk

I have started learning nltk and am following this tutorial. First we use the built-in sent_tokenize, and later we use PunktSentenceTokenizer. The tutorial mentions that PunktSentenceTokenizer is capable of unsupervised machine…
Riken Shah
  • 3,022
  • 5
  • 29
  • 56
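
For the comparison above, a small sketch showing both tokenizers side by side; the sentences are invented, and sent_tokenize assumes NLTK's 'punkt' data package has been downloaded.

from nltk.tokenize import sent_tokenize, PunktSentenceTokenizer

train_text = "Dr. Smith went home. He was tired. The meeting ended at 5 p.m. sharp."
sample_text = "Mr. Jones arrived late. Everyone else had left."

# Built-in tokenizer backed by the pre-trained Punkt model.
print(sent_tokenize(sample_text))

# PunktSentenceTokenizer trained (unsupervised) on our own raw text.
custom_tokenizer = PunktSentenceTokenizer(train_text)
print(custom_tokenizer.tokenize(sample_text))
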
7
votes
1 answer

sqlite-fts3: custom tokenizer?

Does anyone here have experience with writing custom FTS3 (the full-text-search extension) tokenizers? I'm looking for a tokenizer that will ignore HTML tags. Thanks.
noamtm
  • 12,435
  • 15
  • 71
  • 107
7
votes
5 answers

String tokenizer for CPP String?

I want to use a string tokenizer for C++ std::string, but all I could find works on char*. Is there anything similar for C++ strings?
Scarlet
  • 271
  • 2
  • 12
7
votes
4 answers

c++ what is the advantage of lex and bison to a selfmade tokenizer / parser

I would like to do some parsing and tokenizing in C++ for learning purposes. I have often come across bison/yacc and lex when reading about this subject online. Would there be any major benefit of using those over, for instance, a…
moka
  • 4,353
  • 2
  • 37
  • 63
7
votes
2 answers

splitting a string but keeping empty tokens c++

I am trying to split a string and put it into a vector; however, I also want to keep an empty token whenever there are consecutive delimiters. For example: string mystring = "::aa;;bb;cc;;c" I would like to tokenize this string on the :; delimiters but…
XDProgrammer
  • 853
  • 2
  • 14
  • 31
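
The question above is about C++, but for comparison, the behaviour it asks for (keeping empty tokens between consecutive delimiters) is what Python's re.split produces; a short illustration:

import re

mystring = "::aa;;bb;cc;;c"

# Empty strings appear wherever two delimiters are adjacent (or at the start).
print(re.split(r"[:;]", mystring))
# ['', '', 'aa', '', 'bb', 'cc', '', 'c']
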
7
votes
3 answers

Splitting chinese document into sentences

I have to split Chinese text into multiple sentences. I tried the Stanford DocumentPreProcessor. It worked quite well for English but not for Chinese. Can you please let me know of any good sentence splitters for Chinese, preferably in Java or Python?
pjesudhas
  • 399
  • 4
  • 13
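
For the Chinese sentence-splitting question above, a rough regex-based sketch in Python; it is not the Stanford DocumentPreprocessor, it simply breaks after the full-width terminators 。！？…, and it needs Python 3.7+ because the split pattern matches an empty string. The sample text is invented.

import re

def split_chinese_sentences(text):
    # Split immediately after a full-width sentence terminator, keeping it attached.
    pieces = re.split(r"(?<=[。！？…])", text)
    return [p.strip() for p in pieces if p.strip()]

print(split_chinese_sentences("今天天气很好。我们去公园吧！你想一起来吗？"))
# ['今天天气很好。', '我们去公园吧！', '你想一起来吗？']
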
7
votes
6 answers

Tokenizing Twitter Posts in Lucene

My question in a nutshell: Does anyone know of a TwitterAnalyzer or TwitterTokenizer for Lucene? More detailed version: I want to index a number of tweets in Lucene and keep the terms like @user or #hashtag intact. StandardTokenizer does not work…
Ruggiero Spearman
  • 6,735
  • 5
  • 26
  • 37
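
The question above asks specifically about Lucene (Java); purely as an illustration of the tokenization behaviour wanted, NLTK's TweetTokenizer in Python keeps @user and #hashtag terms intact:

from nltk.tokenize import TweetTokenizer

tknzr = TweetTokenizer()

# @mentions, #hashtags and URLs survive as single tokens.
print(tknzr.tokenize("@user check out #lucene and #search at https://example.com"))
# e.g. ['@user', 'check', 'out', '#lucene', 'and', '#search', 'at', 'https://example.com']
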
7
votes
3 answers

Python 2 newline tokens in tokenize module

I am using the tokenize module in Python and wonder why there are two different newline tokens: NEWLINE = 4 and NL = 54. Any examples of code that would produce both tokens would be appreciated.
baallezx
  • 471
  • 3
  • 14
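
For the question above, a small sketch that produces both token types; it is written for Python 3, but the NEWLINE/NL distinction is the same in Python 2: NEWLINE ends a logical line of code, while NL is emitted for blank lines and for newlines that do not end a logical line (e.g. inside brackets).

import io
import tokenize
from token import tok_name

source = "x = 1\n\ny = (2 +\n     3)\n"

for tok in tokenize.generate_tokens(io.StringIO(source).readline):
    if tok_name[tok.type] in ("NEWLINE", "NL"):
        print(tok_name[tok.type], repr(tok.line))

# NEWLINE 'x = 1\n'      -- end of a logical line
# NL      '\n'           -- blank line
# NL      'y = (2 +\n'   -- newline inside parentheses
# NEWLINE '     3)\n'    -- end of the logical line spanning two physical lines
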
7
votes
6 answers

Splitting strings in python

I have a string which is like this: this is [bracket test] "and quotes test " I'm trying to write something in Python to split it up on spaces while ignoring spaces within square brackets and quotes. The result I'm looking for is: ['this','is','bracket…
user31256
  • 71
  • 1
  • 2
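
For the question above, a sketch of one regex-based approach; the expected output in the excerpt is truncated, so the exact result shape (brackets and quotes stripped from the grouped tokens) is an assumption.

import re

text = 'this is [bracket test] "and quotes test "'

# Match a whole [...] group, a whole "..." group, or any run of non-space chars.
raw = re.findall(r'\[[^\]]*\]|"[^"]*"|\S+', text)

# Strip the surrounding brackets/quotes from the grouped tokens.
tokens = [t[1:-1] if t[:1] in '["' else t for t in raw]
print(tokens)
# ['this', 'is', 'bracket test', 'and quotes test ']
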
7
votes
1 answer

Stemming unstructured text in NLTK

I tried the regex stemmer, but I get hundreds of unrelated tokens. I'm just interested in the "play" stem. Here is the code I'm working with: import nltk from nltk.book import * f = open('tupac_original.txt', 'rU') text = f.read() text1 =…
user2221429
  • 71
  • 1
  • 4
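
For the stemming question above, the tupac_original.txt file is not available, so here is a minimal sketch on an invented sentence; it uses NLTK's PorterStemmer rather than the regex stemmer from the excerpt, and keeps only the words whose stem is "play". word_tokenize assumes the 'punkt' data package is installed.

import nltk
from nltk.stem import PorterStemmer

text = "He plays guitar, she played drums, and they love playing together."

stemmer = PorterStemmer()
words = nltk.word_tokenize(text)
play_words = [w for w in words if stemmer.stem(w.lower()) == "play"]
print(play_words)
# ['plays', 'played', 'playing']
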