Questions tagged [tokenize]

Tokenizing is the act of splitting a stream of text into discrete elements called tokens, using a delimiter present in the stream. These tokens can then be processed further, for example to search for a value or to assign them to an array for looping.

Example (VBA):

Dim tokens As Variant
Dim sampleString As String
Dim i As Long

sampleString = "The quick brown fox jumps over the lazy dog."

' tokenize string based on space delimiter
tokens = Split(sampleString, " ")

' list tokens
For i = LBound(tokens) To UBound(tokens)
  MsgBox tokens(i)
Next i

2964 questions
6 votes • 1 answer

How to use NGramTokenizerFactory or NGramFilterFactory?

Recently, I have been studying how to store and index using Solr. I want to do a facet.prefix search. With the whitespace tokenizer, "Where are you" will be split into three words and indexed. If I search facet.prefix="where are", no result will be returned. I…
user572485 • 103 • 2 • 5
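
The question above ultimately comes down to Solr schema configuration, but what an n-gram filter emits is easy to illustrate in plain Python. The helper below is only a sketch of the kind of sub-tokens a filter like NGramFilterFactory produces for a single term (minGramSize/maxGramSize being the Solr-side settings); it is not Solr's implementation:

def ngrams(term, min_n=2, max_n=3):
    # Emit all character n-grams of the term, grouped by length, shortest first.
    grams = []
    for n in range(min_n, max_n + 1):
        grams.extend(term[i:i + n] for i in range(len(term) - n + 1))
    return grams

print(ngrams("where"))  # ['wh', 'he', 'er', 're', 'whe', 'her', 'ere']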
6 votes • 1 answer

Is there a way to highlight function calls using Pygments (or another library)?

I was quite disappointed to discover that function calls were not highlighted by Pygments. See it online (I tested it with all available styles). Built-in functions are highlighted, but not mine. I looked at the tokens list, but there is no…
Delgan • 18,571 • 11 • 90 • 141
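
Pygments lexes ordinary identifiers as plain Name tokens, so a call site is not distinguished from any other name. A common workaround is to post-process the token stream and promote a Name that is immediately followed by an opening parenthesis; the helper below is a minimal sketch of that idea (promote_calls is a hypothetical name, not a Pygments API), and the rewritten stream can then be handed to a formatter:

from pygments.lexers import PythonLexer
from pygments.token import Name

def promote_calls(token_stream):
    # Re-tag plain Name tokens that are directly followed by "(" as Name.Function.
    tokens = list(token_stream)
    for i, (ttype, value) in enumerate(tokens):
        if ttype is Name and i + 1 < len(tokens) and tokens[i + 1][1] == "(":
            yield Name.Function, value
        else:
            yield ttype, value

code = "result = compute(x) + len(x)"
for ttype, value in promote_calls(PythonLexer().get_tokens(code)):
    print(ttype, repr(value))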
6 votes • 2 answers

Tokenize() in nltk.TweetTokenizer returning integers by splitting

Tokenize() in nltk.TweetTokenizer is returning 32-bit integers split into groups of digits. It only happens with certain numbers, and I don't see any reason why. >>> from nltk.tokenize import TweetTokenizer >>> tw = TweetTokenizer() >>>…
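
A minimal way to investigate the reported behaviour is to tokenize a long number directly and inspect what comes back; the number below is an arbitrary example, and the exact output depends on the installed nltk version, so nothing about it is asserted here:

from nltk.tokenize import TweetTokenizer

tw = TweetTokenizer()
# Ordinary text tokenizes as expected ...
print(tw.tokenize("the quick brown fox"))
# ... while the question reports that some large integers come back split
# into several digit groups; compare the output on your nltk version.
print(tw.tokenize("1495229024"))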
6 votes • 2 answers

Splitting text into sentences and sentences into words: BreakIterator vs regular expressions

I accidentally answered a question where the original problem involved splitting a sentence into separate words. The author suggested using BreakIterator to tokenize the input strings, and some people liked this idea. I just don't get that madness: how…
Roman • 64,384 • 92 • 238 • 332
6 votes • 5 answers

How could spacy tokenize a hashtag as a whole?

In text containing hashtags, such as a tweet, spacy's tokenizer splits each hashtag into two tokens: import spacy nlp = spacy.load('en') doc = nlp(u'This is a #sentence.') [t for t in doc] output: [This, is, a, #, sentence, .] I'd like to have…
jmague • 572 • 1 • 6 • 13
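
One way to keep hashtags intact is to merge each '#' with the token that follows it after tokenization. The snippet below sketches that with doc.retokenize(); it assumes spaCy 2.x or later with the en_core_web_sm model installed, rather than the older spacy.load('en') shown in the excerpt:

import spacy

nlp = spacy.load("en_core_web_sm")  # assumes the small English model is installed
doc = nlp("This is a #sentence about #tokenization.")

# Merge each '#' with the following token so hashtags survive as single tokens.
with doc.retokenize() as retokenizer:
    for token in doc[:-1]:
        if token.text == "#":
            retokenizer.merge(doc[token.i : token.i + 2])

print([t.text for t in doc])
# e.g. ['This', 'is', 'a', '#sentence', 'about', '#tokenization', '.']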
6 votes • 5 answers

Split string by a substring

I have the following string: char str[] = "A/USING=B)"; I want to split it to get the separate A and B values, with /USING= as the delimiter. How can I do it? I know strtok(), but it only splits on single-character delimiters.
Ryo • 995 • 2 • 25 • 41
6 votes • 4 answers

Syntax-aware substring replacement

I have a string containing a valid Clojure form. I want to replace a part of it, just like with assoc-in, but processing the whole string as tokens. => (assoc-in [:a [:b :c]] [1 0] :new) [:a [:new :c]] => (assoc-in [:a [:b,, :c]]…
Adam Schmideg • 10,590 • 10 • 53 • 83
6 votes • 1 answer

What characters does the standard tokenizer delimit on?

I was wondering which characters Elasticsearch's standard tokenizer uses to delimit a string.
David Carek • 1,103 • 1 • 12 • 26
6 votes • 1 answer

How to use sklearn's CountVectorizer() to get ngrams that include any punctuation as separate tokens?

I use sklearn.feature_extraction.text.CountVectorizer to compute n-grams. Example: import sklearn.feature_extraction.text # FYI http://scikit-learn.org/stable/install.html ngram_size = 4 string = ["I really like python, it's pretty awesome."] vect =…
Franck Dernoncourt • 77,520 • 72 • 342 • 501
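
One way to make punctuation show up as separate tokens is to override token_pattern so that words and individual punctuation marks both match. A sketch under that assumption (get_feature_names_out is the accessor in recent scikit-learn releases; older ones use get_feature_names):

from sklearn.feature_extraction.text import CountVectorizer

docs = ["I really like python, it's pretty awesome."]

vect = CountVectorizer(
    ngram_range=(4, 4),            # 4-grams, matching ngram_size = 4 in the excerpt
    token_pattern=r"\w+|[^\w\s]",  # words, plus each punctuation mark as its own token
)
vect.fit(docs)
print(vect.get_feature_names_out())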
6 votes • 1 answer

Sentence tokenization for texts that contain quotes

Code: from nltk.tokenize import sent_tokenize pprint(sent_tokenize(unidecode(text))) Output: [After Du died of suffocation, her boyfriend posted a heartbreaking message online: "Losing consciousness in my arms, your breath and heartbeat…
Abhishek Bhatia • 9,404 • 26 • 87 • 142
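
The excerpt already shows the relevant call; for reference, this is all that is needed to run nltk's sentence splitter over quoted text and inspect where it breaks. The sample sentence here is invented, and the punkt model data must have been downloaded beforehand:

from nltk.tokenize import sent_tokenize  # needs the punkt data: nltk.download('punkt')

text = ('She paused at the door and said: '
        '"I will be back before dark, do not wait up." '
        'Then she left without another word.')
for sentence in sent_tokenize(text):
    print(repr(sentence))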
6 votes • 1 answer

PYTHON: How to pass tokenizer with keyword arguments to scikit's CountVectorizer?

I have a custom tokenizer function with some keyword arguments: def tokenizer(text, stem=True, lemmatize=False, char_lower_limit=2, char_upper_limit=30): do things... return tokens Now, how can I pass this tokenizer with all its…
JRun • 669 • 1 • 10 • 17
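
A common way to hand a keyword-configured callable to CountVectorizer is functools.partial (a lambda that closes over the arguments works equally well). The tokenizer body below is a simplified stand-in for the one in the question:

from functools import partial
from sklearn.feature_extraction.text import CountVectorizer

def tokenizer(text, stem=True, lemmatize=False, char_lower_limit=2, char_upper_limit=30):
    # Simplified stand-in: keep whitespace-separated tokens within the length limits.
    return [t for t in text.split() if char_lower_limit <= len(t) <= char_upper_limit]

# Bind the keyword arguments up front, then pass the resulting callable to the vectorizer.
vect = CountVectorizer(tokenizer=partial(tokenizer, stem=False, lemmatize=True))
vect.fit(["a custom tokenizer configured through functools.partial"])
print(vect.get_feature_names_out())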
6 votes • 5 answers

Tokenizing numbers for a parser

I am writing my first parser and have a few questions concerning the tokenizer. Basically, my tokenizer exposes a nextToken() function that is supposed to return the next token. These tokens are distinguished by a token type. I think it would make…
René Nyffenegger • 39,402 • 33 • 158 • 293
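
For the general question of how to tag numbers with a token type, a regex-driven tokenizer is a compact reference point. The sketch below distinguishes integer and floating-point tokens; it is purely illustrative and not tied to the asker's parser:

import re

# Ordered alternatives: FLOAT must come before INT so "3.14" is not read as 3, ".", 14.
TOKEN_SPEC = [
    ("FLOAT", r"\d+\.\d+"),
    ("INT",   r"\d+"),
    ("IDENT", r"[A-Za-z_]\w*"),
    ("OP",    r"[+\-*/()=]"),
    ("SKIP",  r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))

def next_tokens(source):
    # Yield (token_type, lexeme) pairs, skipping whitespace.
    for match in MASTER.finditer(source):
        if match.lastgroup != "SKIP":
            yield match.lastgroup, match.group()

print(list(next_tokens("x = 3.14 + 42")))
# [('IDENT', 'x'), ('OP', '='), ('FLOAT', '3.14'), ('OP', '+'), ('INT', '42')]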
6 votes • 1 answer

Elasticsearch "pattern_replace", replacing whitespaces while analyzing

Basically I want to remove all whitespace and tokenize the whole string as a single token. (I will use nGram on top of that later on.) These are my index settings: "settings": { "index": { "analysis": { "filter": { "whitespace_remove":…
6 votes • 5 answers

split char string with multi-character delimiter in C

I want to split a char *string based on a multi-character delimiter. I know that strtok() is used to split a string, but it works with single-character delimiters. I want to split a char *string based on a substring such as "abc" or any other…
Sadia Bashir • 75 • 1 • 1 • 7
6 votes • 3 answers

Parsing a pipe-delimited string into columns?

I have a column with pipe-separated values such as: '23|12.1| 450|30|9|78|82.5|92.1|120|185|52|11' I want to parse this column to fill a table with 12 corresponding columns: month1, month2, month3...month12. So month1 will have the value 23, month2…
DMS • 105 • 2 • 2 • 5