Questions tagged [tokenize]

Tokenizing is the act of splitting a stream of text into discrete elements called tokens, usually by means of a delimiter present in the stream. The tokens can then be processed further, for example to search for a value or to assign them to an array for looping.

Example (VBA):

Dim tokens As Variant
Dim sampleString As String
Dim i As Long

sampleString = "The quick brown fox jumps over the lazy dog."

' tokenize string based on space delimiter
tokens = Split(sampleString, " ")

' list tokens
For i = LBound(tokens) To UBound(tokens)
  MsgBox tokens(i)
Next i

2964 questions
20
votes
4 answers

How to change GWT Place URL from the default ":" to "/"?

By default, a GWT Place URL consists of the Place's simple class name (like "HelloPlace") followed by a colon (:) and the token returned by the PlaceTokenizer. My question is: how can I change the ":" to "/"?
Joly
  • 3,218
  • 14
  • 44
  • 70
20
votes
1 answer

Is it better to run Keras fit_on_texts on the entire x_data or just the train_data?

I have a dataframe with text columns. I separated them into x_train and x_test. My question is whether it is better to run Keras's Tokenizer.fit_on_texts() on the entire x data set or just on x_train. Like this: tokenizer =…
The Dodo
  • 711
  • 5
  • 14
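
For the Keras question above, the usual advice is to fit the tokenizer on the training split only, so the vocabulary never leaks information from the test set; unseen test words then map to an out-of-vocabulary index. A minimal sketch with tensorflow.keras (the sample strings are made up):

from tensorflow.keras.preprocessing.text import Tokenizer

x_train = ["the quick brown fox", "jumps over the lazy dog"]
x_test = ["an unseen sentence"]

tokenizer = Tokenizer(oov_token="<OOV>")  # reserve an index for unseen words
tokenizer.fit_on_texts(x_train)           # build the vocabulary from x_train only

train_seqs = tokenizer.texts_to_sequences(x_train)
test_seqs = tokenizer.texts_to_sequences(x_test)  # unseen words become <OOV>
print(train_seqs, test_seqs)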
20
votes
1 answer

nltk sentence tokenizer, consider new lines as sentence boundary

I am using nltk's PunktSentenceTokenizer to tokenize a text into a set of sentences. However, the tokenizer doesn't seem to treat a new paragraph or newlines as a sentence boundary. >>> from nltk.tokenize.punkt import PunktSentenceTokenizer >>> tokenizer…
CentAu
  • 10,660
  • 15
  • 59
  • 85
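
Punkt only looks at sentence-final punctuation, so it will not break on newlines by itself. A common workaround, sketched here in Python (sample text made up; requires the NLTK "punkt" data package), is to split on blank lines first and then tokenize each paragraph:

from nltk.tokenize import sent_tokenize

text = "First sentence. Second one.\n\nA new paragraph without a final period"

sentences = []
for paragraph in text.split("\n\n"):  # treat blank lines as hard boundaries
    sentences.extend(sent_tokenize(paragraph))

print(sentences)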
20
votes
2 answers

JavaScript: avoiding empty strings with String.split, and regular expression precedence

I am creating a syntax highlighter, and I am using String.split to create tokens from an input string. The first issue is that String.split creates a huge number of empty strings, which makes everything quite a bit slower than it could otherwise…
user2503048
  • 1,021
  • 1
  • 10
  • 22
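
The question is about JavaScript, but the same behaviour is easy to reproduce with Python's re.split when the pattern uses a capturing group: adjacent delimiters and delimiters at the ends of the string produce empty strings, and filtering them out afterwards is the usual fix. An illustrative Python sketch (the delimiter set is made up):

import re

source = "if(x==1){y=2;}"
parts = re.split(r"([(){};])", source)  # the capturing group keeps delimiters
tokens = [p for p in parts if p != ""]  # drop the empty strings

print(parts)   # ['if', '(', 'x==1', ')', '', '{', 'y=2', ';', '', '}', '']
print(tokens)  # ['if', '(', 'x==1', ')', '{', 'y=2', ';', '}']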
20
votes
4 answers

Tokenizer, Stop Word Removal, Stemming in Java

I am looking for a class or method that takes a long string of many hundreds of words and tokenizes it, removes the stop words, and stems it for use in an IR system. For example: "The big fat cat, said 'your funniest guy i know' to the kangaroo..." the…
Phil
  • 665
  • 5
  • 9
  • 14
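
The question asks for Java (Lucene's analyzers are a common answer there), but the pipeline itself, tokenize, remove stop words, stem, is easy to see in a short NLTK sketch, shown in Python purely for illustration (requires the "punkt" and "stopwords" NLTK data packages):

from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

text = "The big fat cat, said 'your funniest guy i know' to the kangaroo..."

stop = set(stopwords.words("english"))
stemmer = PorterStemmer()

tokens = word_tokenize(text.lower())           # tokenize
tokens = [t for t in tokens if t.isalpha()]    # drop punctuation tokens
tokens = [t for t in tokens if t not in stop]  # remove stop words
stems = [stemmer.stem(t) for t in tokens]      # stem what remains

print(stems)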
20
votes
5 answers

Tokenize a string and include delimiters in C++

I'm tokenizing with the following, but unsure how to include the delimiters with it. void Tokenize(const string str, vector<string>& tokens, const string& delimiters) { int startpos = 0; int pos = str.find_first_of(delimiters, startpos); …
Jeremiah
  • 751
  • 9
  • 21
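
The question is C++, but the idea of keeping the delimiters is language-independent: match either a run of non-delimiter characters or a single delimiter, so nothing is discarded. A Python sketch with an illustrative delimiter set:

import re

text = "alpha,beta;gamma,delta"
tokens = re.findall(r"[^,;]+|[,;]", text)  # a token or a single delimiter

print(tokens)  # ['alpha', ',', 'beta', ';', 'gamma', ',', 'delta']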
19
votes
1 answer

C++ Templates Angle Brackets Pitfall - What is the C++11 fix?

In C++11, this is now valid syntax: vector<vector<int>> MyMatrix; whereas previously, it had to be written like this (notice the space): vector<vector<int> > MyMatrix; My question is what is the fix that the standard uses to allow the first…
Norswap
  • 11,740
  • 12
  • 47
  • 60
18
votes
1 answer

How to have a "custom split()" in a list with strtk?

I have read http://www.codeproject.com/KB/recipes/Tokenizer.aspx and I want to use the last example (at the end, just before all the graphs), "Extending Delimiter Predicates", in my main, but I don't get the same output tokens as the author of the…
bdelmas
  • 922
  • 2
  • 12
  • 20
18
votes
2 answers

Word break in languages without spaces between words (e.g., Asian)?

I'd like to make MySQL full text search work with Japanese and Chinese text, as well as any other language. The problem is that these languages and probably others do not normally have white space between words. Search is not useful when you must…
Joe Langeway
  • 300
  • 2
  • 8
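
For languages written without spaces, word break is usually done with a dictionary- or model-based segmenter rather than a delimiter. A sketch with the third-party jieba package for Chinese (pip install jieba); this only illustrates the segmentation step, not the MySQL full-text integration the question is ultimately about:

import jieba

text = "我来到北京清华大学"
print(jieba.lcut(text))  # e.g. ['我', '来到', '北京', '清华大学']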
18
votes
2 answers

Writing a tokenizer in Python

I want to design a custom tokenizer module in Python that lets users specify what tokenizer(s) to use for the input. For instance, consider the following input: Q: What is a good way to achieve this? A: I am not so sure. I think I will use…
Legend
  • 113,822
  • 119
  • 272
  • 400
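
One common pattern for a small hand-rolled tokenizer is a master regex of named groups walked with re.finditer. A minimal sketch (the token names and sample input are invented for illustration):

import re

TOKEN_SPEC = [
    ("QUESTION", r"Q:"),
    ("ANSWER",   r"A:"),
    ("WORD",     r"[A-Za-z]+"),
    ("PUNCT",    r"[.?!,]"),
    ("SKIP",     r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def tokenize(text):
    for match in MASTER.finditer(text):
        if match.lastgroup != "SKIP":  # ignore whitespace
            yield match.lastgroup, match.group()

print(list(tokenize("Q: What is a good way to achieve this?")))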
17
votes
1 answer

Difference between StandardTokenizerFactory and KeywordTokenizerFactory in Solr?

I am new to Solr. I want to know when to use StandardTokenizerFactory and when to use KeywordTokenizerFactory. I read the docs on the Apache wiki, but I am not getting it. Can anybody explain the difference between StandardTokenizerFactory and…
ravidev
  • 2,708
  • 6
  • 26
  • 42
17
votes
2 answers

Tokenizing using Pandas and spaCy

I'm working on my first Python project and have a reasonably large dataset (tens of thousands of rows). I need to do some NLP (clustering, classification) on 5 text columns (multiple sentences of text per 'cell') and have been using pandas to…
LMGagne
  • 1,636
  • 6
  • 24
  • 47
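
For tokenizing a pandas text column, spaCy's nlp.pipe batches the rows and is considerably faster than calling nlp() row by row via df.apply. A sketch assuming the en_core_web_sm model is installed (the column name and data are made up):

import pandas as pd
import spacy

df = pd.DataFrame({"text": ["The quick brown fox.", "Jumps over the lazy dog."]})

nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])  # tokenizer only
df["tokens"] = [[tok.text for tok in doc] for doc in nlp.pipe(df["text"])]

print(df["tokens"])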
17
votes
2 answers

Nested strtok function problem in C

I have a string like this: a;b;c;d;e f;g;h;i;j 1;2;3;4;5 and I want to parse it element by element. I used nested strtok calls, but it just splits the first line and then sets the token pointer to NULL. How can I overcome this? Here is the code: token =…
mausmust
  • 173
  • 1
  • 1
  • 4
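
The underlying problem is that strtok keeps a single hidden state, so nested calls clobber each other; in C the usual fix is the reentrant strtok_r, which takes an explicit state pointer. The intended result is just a nested split, sketched here in Python:

raw = "a;b;c;d;e\nf;g;h;i;j\n1;2;3;4;5"

# outer split on newlines, inner split on semicolons
rows = [line.split(";") for line in raw.split("\n")]

print(rows)  # [['a', 'b', 'c', 'd', 'e'], ['f', 'g', 'h', 'i', 'j'], ...]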
17
votes
3 answers

Get bigrams and trigrams in word2vec Gensim

I am currently using uni-grams in my word2vec model as follows. def review_to_sentences( review, tokenizer, remove_stopwords=False ): #Returns a list of sentences, where each sentence is a list of words # #NLTK tokenizer to split the…
user8566323
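
The usual route is gensim's Phrases model: learn frequent bigrams from the unigram sentences, then feed the transformed sentences to word2vec (applying Phrases a second time yields trigrams). A sketch on a made-up toy corpus; the very low threshold is only so the tiny corpus promotes any bigram at all:

from gensim.models.phrases import Phrases, Phraser

sentences = [
    ["machine", "learning", "is", "fun"],
    ["machine", "learning", "models", "need", "data"],
] * 20  # Phrases needs repeated co-occurrences to score a bigram

bigram = Phraser(Phrases(sentences, min_count=5, threshold=0.1))
bigram_sentences = [bigram[s] for s in sentences]

print(bigram_sentences[0])  # bigrams such as 'machine_learning' are now single tokens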
17
votes
1 answer

Python re.split() vs nltk word_tokenize and sent_tokenize

I was going through this question. I am just wondering whether NLTK would be faster than regex in word/sentence tokenization.
lobjc
  • 2,751
  • 5
  • 24
  • 30
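
A plain regex split is typically faster than word_tokenize, because the NLTK tokenizer applies a series of rules (punctuation, contractions, and so on); the trade-off is tokenization quality rather than raw speed. A quick timing sketch (numbers will vary by machine; requires the NLTK "punkt" data package):

import re
import timeit

from nltk.tokenize import word_tokenize

text = "Don't split contractions naively; word_tokenize handles them. " * 100

t_re = timeit.timeit(lambda: re.split(r"\s+", text), number=100)
t_nltk = timeit.timeit(lambda: word_tokenize(text), number=100)

print(f"re.split: {t_re:.3f}s  word_tokenize: {t_nltk:.3f}s")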