Questions tagged [tokenize]

Tokenizing is the act of splitting a stream of text into discrete elements called tokens, usually by means of a delimiter present in the stream. The tokens can then be processed further, for example to search for a value or to assign them to an array for looping.

Example (VBA):

Dim tokens As Variant
Dim sampleString As String
Dim i As Long

sampleString = "The quick brown fox jumps over the lazy dog."

' tokenize string based on space delimiter
tokens = Split(sampleString, " ")

' list tokens
For i = LBound(tokens) To UBound(tokens)
  MsgBox tokens(i)
Next i

2964 questions
20
votes
4 answers

How to change GWT Place URL from the default ":" to "/"?

By default, a GWT Place URL consists of the Place's simple class name (like "HelloPlace") followed by a colon (:) and the token returned by the PlaceTokenizer. My question is: how can I change the ":" to "/"?
Joly
  • 3,218
  • 14
  • 44
  • 70
20
votes
1 answer

Is it better to run Keras fit_on_texts on the entire x_data or just the train_data?

I have a dataframe with text columns. I separated them into x_train and x_test. My question is whether it is better to run Keras's Tokenizer.fit_on_texts() on the entire x data set or just on x_train. Like this: tokenizer =…
The Dodo
  • 711
  • 5
  • 14
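
For the Keras question above, the usual advice is to fit the tokenizer on the training split only, so the vocabulary never leaks information from the test set; unseen test words then map to an out-of-vocabulary index. A minimal sketch with tensorflow.keras (the sample strings are made up):

from tensorflow.keras.preprocessing.text import Tokenizer

x_train = ["the quick brown fox", "jumps over the lazy dog"]
x_test = ["an unseen sentence"]

tokenizer = Tokenizer(oov_token="<OOV>")  # reserve an index for unseen words
tokenizer.fit_on_texts(x_train)           # build the vocabulary from x_train only

train_seqs = tokenizer.texts_to_sequences(x_train)
test_seqs = tokenizer.texts_to_sequences(x_test)  # unseen words become <OOV>
print(train_seqs, test_seqs)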
20
votes
1 answer

nltk sentence tokenizer, consider new lines as sentence boundary

I am using nltk's PunktSentenceTokenizer to tokenize a text into a set of sentences. However, the tokenizer doesn't seem to treat a new paragraph or newlines as a sentence boundary. >>> from nltk.tokenize.punkt import PunktSentenceTokenizer >>> tokenizer…
CentAu
  • 10,660
  • 15
  • 59
  • 85
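
Punkt only looks at sentence-final punctuation, so it will not break on newlines by itself. A common workaround, sketched here in Python (sample text made up; requires the NLTK "punkt" data package), is to split on blank lines first and then tokenize each paragraph:

from nltk.tokenize import sent_tokenize

text = "First sentence. Second one.\n\nA new paragraph without a final period"

sentences = []
for paragraph in text.split("\n\n"):  # treat blank lines as hard boundaries
    sentences.extend(sent_tokenize(paragraph))

print(sentences)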
20
votes
2 answers

JavaScript: avoiding empty strings with String.split, and regular expression precedence

I am creating a syntax highlighter, and I am using String.split to create tokens from an input string. The first issue is that String.split creates a huge number of empty strings, which makes everything quite a bit slower than it could otherwise…
user2503048
  • 1,021
  • 1
  • 10
  • 22
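
The question is about JavaScript, but the same behaviour is easy to reproduce with Python's re.split when the pattern uses a capturing group: adjacent delimiters and delimiters at the ends of the string produce empty strings, and filtering them out afterwards is the usual fix. An illustrative Python sketch (the delimiter set is made up):

import re

source = "if(x==1){y=2;}"
parts = re.split(r"([(){};])", source)  # the capturing group keeps delimiters
tokens = [p for p in parts if p != ""]  # drop the empty strings

print(parts)   # ['if', '(', 'x==1', ')', '', '{', 'y=2', ';', '', '}', '']
print(tokens)  # ['if', '(', 'x==1', ')', '{', 'y=2', ';', '}']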
20
votes
4 answers

Tokenizer, Stop Word Removal, Stemming in Java

I am looking for a class or method that takes a long string of many hundreds of words and tokenizes it, removes the stop words, and stems it for use in an IR system. For example: "The big fat cat, said 'your funniest guy i know' to the kangaroo..." the…
Phil
  • 665
  • 5
  • 9
  • 14
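
The question asks for Java (Lucene's analyzers are a common answer there), but the pipeline itself, tokenize, remove stop words, stem, is easy to see in a short NLTK sketch, shown in Python purely for illustration (requires the "punkt" and "stopwords" NLTK data packages):

from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

text = "The big fat cat, said 'your funniest guy i know' to the kangaroo..."

stop = set(stopwords.words("english"))
stemmer = PorterStemmer()

tokens = word_tokenize(text.lower())           # tokenize
tokens = [t for t in tokens if t.isalpha()]    # drop punctuation tokens
tokens = [t for t in tokens if t not in stop]  # remove stop words
stems = [stemmer.stem(t) for t in tokens]      # stem what remains

print(stems)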
20
votes
5 answers

Tokenize a string and include delimiters in C++

I'm tokenizing with the following, but unsure how to include the delimiters with it. void Tokenize(const string str, vector<string>& tokens, const string& delimiters) { int startpos = 0; int pos = str.find_first_of(delimiters, startpos); …
Jeremiah
  • 751
  • 9
  • 21
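
The question is C++, but the idea of keeping the delimiters is language-independent: match either a run of non-delimiter characters or a single delimiter, so nothing is discarded. A Python sketch with an illustrative delimiter set:

import re

text = "alpha,beta;gamma,delta"
tokens = re.findall(r"[^,;]+|[,;]", text)  # a token or a single delimiter

print(tokens)  # ['alpha', ',', 'beta', ';', 'gamma', ',', 'delta']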
19
votes
1 answer

C++ Templates Angle Brackets Pitfall - What is the C++11 fix?

In C++11, this is now valid syntax: vector<vector<int>> MyMatrix; whereas previously, it had to be written like this (notice the space): vector<vector<int> > MyMatrix; My question is what is the fix that the standard uses to allow the first…
Norswap
  • 11,740
  • 12
  • 47
  • 60
18
votes
1 answer

How to have a "custom split()" in a list with strtk?

I have read http://www.codeproject.com/KB/recipes/Tokenizer.aspx and I want to use the last example (at the end, just before all the graphs), "Extending Delimiter Predicates", in my main, but I don't get the same output tokens as the author of the…
bdelmas
  • 922
  • 2
  • 12
  • 20
18
votes
2 answers

Word break in languages without spaces between words (e.g., Asian)?

I'd like to make MySQL full text search work with Japanese and Chinese text, as well as any other language. The problem is that these languages and probably others do not normally have white space between words. Search is not useful when you must…
Joe Langeway
  • 300
  • 2
  • 8
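
For languages written without spaces, word break is usually done with a dictionary- or model-based segmenter rather than a delimiter. A sketch with the third-party jieba package for Chinese (pip install jieba); this only illustrates the segmentation step, not the MySQL full-text integration the question is ultimately about:

import jieba

text = "我来到北京清华大学"
print(jieba.lcut(text))  # e.g. ['我', '来到', '北京', '清华大学']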
18
votes
2 answers

Writing a tokenizer in Python

I want to design a custom tokenizer module in Python that lets users specify what tokenizer(s) to use for the input. For instance, consider the following input: Q: What is a good way to achieve this? A: I am not so sure. I think I will use…
Legend
  • 113,822
  • 119
  • 272
  • 400
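
One common pattern for a small hand-rolled tokenizer is a master regex of named groups walked with re.finditer. A minimal sketch (the token names and sample input are invented for illustration):

import re

TOKEN_SPEC = [
    ("QUESTION", r"Q:"),
    ("ANSWER",   r"A:"),
    ("WORD",     r"[A-Za-z]+"),
    ("PUNCT",    r"[.?!,]"),
    ("SKIP",     r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def tokenize(text):
    for match in MASTER.finditer(text):
        if match.lastgroup != "SKIP":  # ignore whitespace
            yield match.lastgroup, match.group()

print(list(tokenize("Q: What is a good way to achieve this?")))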
17
votes
1 answer

Difference between StandardTokenizerFactory and KeywordTokenizerFactory in Solr?

I am new to Solr. I want to know when to use StandardTokenizerFactory and when to use KeywordTokenizerFactory. I read the docs on the Apache wiki, but I am not getting it. Can anybody explain the difference between StandardTokenizerFactory and…
ravidev
  • 2,708
  • 6
  • 26
  • 42
17
votes
2 answers

Tokenizing using Pandas and spaCy

I'm working on my first Python project and have a reasonably large dataset (tens of thousands of rows). I need to do some NLP (clustering, classification) on 5 text columns (multiple sentences of text per 'cell') and have been using pandas to…
LMGagne
  • 1,636
  • 6
  • 24
  • 47
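
For tokenizing a pandas text column, spaCy's nlp.pipe batches the rows and is considerably faster than calling nlp() row by row via df.apply. A sketch assuming the en_core_web_sm model is installed (the column name and data are made up):

import pandas as pd
import spacy

df = pd.DataFrame({"text": ["The quick brown fox.", "Jumps over the lazy dog."]})

nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])  # tokenizer only
df["tokens"] = [[tok.text for tok in doc] for doc in nlp.pipe(df["text"])]

print(df["tokens"])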
17
votes
2 answers

Nested strtok function problem in C

I have a string like this: a;b;c;d;e f;g;h;i;j 1;2;3;4;5 and I want to parse it element by element. I used nested strtok calls, but it just splits the first line and then sets the token pointer to NULL. How can I overcome this? Here is the code: token =…
mausmust
  • 173
  • 1
  • 1
  • 4
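
The underlying problem is that strtok keeps a single hidden state, so nested calls clobber each other; in C the usual fix is the reentrant strtok_r, which takes an explicit state pointer. The intended result is just a nested split, sketched here in Python:

raw = "a;b;c;d;e\nf;g;h;i;j\n1;2;3;4;5"

# outer split on newlines, inner split on semicolons
rows = [line.split(";") for line in raw.split("\n")]

print(rows)  # [['a', 'b', 'c', 'd', 'e'], ['f', 'g', 'h', 'i', 'j'], ...]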
17
votes
3 answers

Get bigrams and trigrams in word2vec Gensim

I am currently using uni-grams in my word2vec model as follows. def review_to_sentences( review, tokenizer, remove_stopwords=False ): #Returns a list of sentences, where each sentence is a list of words # #NLTK tokenizer to split the…
user8566323
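
The usual route is gensim's Phrases model: learn frequent bigrams from the unigram sentences, then feed the transformed sentences to word2vec (applying Phrases a second time yields trigrams). A sketch on a made-up toy corpus; the very low threshold is only so the tiny corpus promotes any bigram at all:

from gensim.models.phrases import Phrases, Phraser

sentences = [
    ["machine", "learning", "is", "fun"],
    ["machine", "learning", "models", "need", "data"],
] * 20  # Phrases needs repeated co-occurrences to score a bigram

bigram = Phraser(Phrases(sentences, min_count=5, threshold=0.1))
bigram_sentences = [bigram[s] for s in sentences]

print(bigram_sentences[0])  # bigrams such as 'machine_learning' are now single tokens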
17
votes
1 answer

Python re.split() vs nltk word_tokenize and sent_tokenize

I was going through this question. I am just wondering whether NLTK would be faster than regex in word/sentence tokenization.
lobjc
  • 2,751
  • 5
  • 24
  • 30
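
A plain regex split is typically faster than word_tokenize, because the NLTK tokenizer applies a series of rules (punctuation, contractions, and so on); the trade-off is tokenization quality rather than raw speed. A quick timing sketch (numbers will vary by machine; requires the NLTK "punkt" data package):

import re
import timeit

from nltk.tokenize import word_tokenize

text = "Don't split contractions naively; word_tokenize handles them. " * 100

t_re = timeit.timeit(lambda: re.split(r"\s+", text), number=100)
t_nltk = timeit.timeit(lambda: word_tokenize(text), number=100)

print(f"re.split: {t_re:.3f}s  word_tokenize: {t_nltk:.3f}s")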