Questions tagged [tokenize]

Tokenizing is the act of splitting a stream of text into discrete elements called tokens, using a delimiter present in the stream. These tokens can then be processed further, for example to search for a value or to assign them to an array for looping.

Example (VBA):

Dim tokens As Variant
Dim sampleString As String
Dim i As Long

sampleString = "The quick brown fox jumps over the lazy dog."

' tokenize string based on space delimiter
tokens = Split(sampleString, " ")

' list tokens
For i = LBound(tokens) To UBound(tokens)
  MsgBox tokens(i)
Next i

2964 questions
6 votes • 1 answer

How to use NGramTokenizerFactory or NGramFilterFactory?

Recently, I have been studying how to store and index using Solr. I want to do a facet.prefix search. With the whitespace tokenizer, "Where are you" will be split into three words and indexed. If I search facet.prefix="where are", no result will be returned. I…
user572485 • 103 • 2 • 5
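
The question above ultimately comes down to Solr schema configuration, but what an n-gram filter emits is easy to illustrate in plain Python. The helper below is only a sketch of the kind of sub-tokens a filter like NGramFilterFactory produces for a single term (minGramSize/maxGramSize being the Solr-side settings); it is not Solr's implementation:

def ngrams(term, min_n=2, max_n=3):
    # Emit all character n-grams of the term, grouped by length, shortest first.
    grams = []
    for n in range(min_n, max_n + 1):
        grams.extend(term[i:i + n] for i in range(len(term) - n + 1))
    return grams

print(ngrams("where"))  # ['wh', 'he', 'er', 're', 'whe', 'her', 'ere']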
6 votes • 1 answer

Is there a way to highlight function calls using Pygments (or another library)?

I was quite disappointed to discover that function calls were not highlighted by Pygments. See it online (I tested it with all available styles). Built-in functions are highlighted, but not mine. I looked at the tokens list, but there is no…
Delgan • 18,571 • 11 • 90 • 141
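
Pygments lexes ordinary identifiers as plain Name tokens, so a call site is not distinguished from any other name. A common workaround is to post-process the token stream and promote a Name that is immediately followed by an opening parenthesis; the helper below is a minimal sketch of that idea (promote_calls is a hypothetical name, not a Pygments API), and the rewritten stream can then be handed to a formatter:

from pygments.lexers import PythonLexer
from pygments.token import Name

def promote_calls(token_stream):
    # Re-tag plain Name tokens that are directly followed by "(" as Name.Function.
    tokens = list(token_stream)
    for i, (ttype, value) in enumerate(tokens):
        if ttype is Name and i + 1 < len(tokens) and tokens[i + 1][1] == "(":
            yield Name.Function, value
        else:
            yield ttype, value

code = "result = compute(x) + len(x)"
for ttype, value in promote_calls(PythonLexer().get_tokens(code)):
    print(ttype, repr(value))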
6 votes • 2 answers

Tokenize() in nltk.TweetTokenizer returning integers by splitting

Tokenize() in nltk.TweetTokenizer is returning 32-bit integers split into groups of digits. It only happens with certain numbers, and I don't see any reason why. >>> from nltk.tokenize import TweetTokenizer >>> tw = TweetTokenizer() >>>…
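
A minimal way to investigate the reported behaviour is to tokenize a long number directly and inspect what comes back; the number below is an arbitrary example, and the exact output depends on the installed nltk version, so nothing about it is asserted here:

from nltk.tokenize import TweetTokenizer

tw = TweetTokenizer()
# Ordinary text tokenizes as expected ...
print(tw.tokenize("the quick brown fox"))
# ... while the question reports that some large integers come back split
# into several digit groups; compare the output on your nltk version.
print(tw.tokenize("1495229024"))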
6 votes • 2 answers

Splitting text into sentences and sentences into words: BreakIterator vs regular expressions

I accidentally answered a question where the original problem involved splitting a sentence into separate words. The author suggested using BreakIterator to tokenize the input strings, and some people liked this idea. I just don't get that madness: how…
Roman • 64,384 • 92 • 238 • 332
6 votes • 5 answers

How could spacy tokenize a hashtag as a whole?

In text containing hashtags, such as a tweet, spacy's tokenizer splits each hashtag into two tokens: import spacy nlp = spacy.load('en') doc = nlp(u'This is a #sentence.') [t for t in doc] output: [This, is, a, #, sentence, .] I'd like to have…
jmague • 572 • 1 • 6 • 13
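
One way to keep hashtags intact is to merge each '#' with the token that follows it after tokenization. The snippet below sketches that with doc.retokenize(); it assumes spaCy 2.x or later with the en_core_web_sm model installed, rather than the older spacy.load('en') shown in the excerpt:

import spacy

nlp = spacy.load("en_core_web_sm")  # assumes the small English model is installed
doc = nlp("This is a #sentence about #tokenization.")

# Merge each '#' with the following token so hashtags survive as single tokens.
with doc.retokenize() as retokenizer:
    for token in doc[:-1]:
        if token.text == "#":
            retokenizer.merge(doc[token.i : token.i + 2])

print([t.text for t in doc])
# e.g. ['This', 'is', 'a', '#sentence', 'about', '#tokenization', '.']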
6 votes • 5 answers

Split string by a substring

I have the following string: char str[] = "A/USING=B)"; I want to split it to get the separate A and B values, with /USING= as the delimiter. How can I do it? I know strtok(), but it only splits on single-character delimiters.
Ryo • 995 • 2 • 25 • 41
6 votes • 4 answers

Syntax-aware substring replacement

I have a string containing a valid Clojure form. I want to replace a part of it, just like with assoc-in, but processing the whole string as tokens. => (assoc-in [:a [:b :c]] [1 0] :new) [:a [:new :c]] => (assoc-in [:a [:b,, :c]]…
Adam Schmideg • 10,590 • 10 • 53 • 83
6 votes • 1 answer

What characters does the standard tokenizer delimit on?

I was wondering which characters Elasticsearch's standard tokenizer uses to delimit a string.
David Carek • 1,103 • 1 • 12 • 26
6 votes • 1 answer

How to use sklearn's CountVectorizer() to get ngrams that include any punctuation as separate tokens?

I use sklearn.feature_extraction.text.CountVectorizer to compute n-grams. Example: import sklearn.feature_extraction.text # FYI http://scikit-learn.org/stable/install.html ngram_size = 4 string = ["I really like python, it's pretty awesome."] vect =…
Franck Dernoncourt • 77,520 • 72 • 342 • 501
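
One way to make punctuation show up as separate tokens is to override token_pattern so that words and individual punctuation marks both match. A sketch under that assumption (get_feature_names_out is the accessor in recent scikit-learn releases; older ones use get_feature_names):

from sklearn.feature_extraction.text import CountVectorizer

docs = ["I really like python, it's pretty awesome."]

vect = CountVectorizer(
    ngram_range=(4, 4),            # 4-grams, matching ngram_size = 4 in the excerpt
    token_pattern=r"\w+|[^\w\s]",  # words, plus each punctuation mark as its own token
)
vect.fit(docs)
print(vect.get_feature_names_out())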
6 votes • 1 answer

Sentence tokenization for texts that contain quotes

Code: from nltk.tokenize import sent_tokenize pprint(sent_tokenize(unidecode(text))) Output: [After Du died of suffocation, her boyfriend posted a heartbreaking message online: "Losing consciousness in my arms, your breath and heartbeat…
Abhishek Bhatia • 9,404 • 26 • 87 • 142
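
The excerpt already shows the relevant call; for reference, this is all that is needed to run nltk's sentence splitter over quoted text and inspect where it breaks. The sample sentence here is invented, and the punkt model data must have been downloaded beforehand:

from nltk.tokenize import sent_tokenize  # needs the punkt data: nltk.download('punkt')

text = ('She paused at the door and said: '
        '"I will be back before dark, do not wait up." '
        'Then she left without another word.')
for sentence in sent_tokenize(text):
    print(repr(sentence))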
6 votes • 1 answer

PYTHON: How to pass tokenizer with keyword arguments to scikit's CountVectorizer?

I have a custom tokenizer function with some keyword arguments: def tokenizer(text, stem=True, lemmatize=False, char_lower_limit=2, char_upper_limit=30): do things... return tokens Now, how can I pass this tokenizer with all its…
JRun • 669 • 1 • 10 • 17
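
A common way to hand a keyword-configured callable to CountVectorizer is functools.partial (a lambda that closes over the arguments works equally well). The tokenizer body below is a simplified stand-in for the one in the question:

from functools import partial
from sklearn.feature_extraction.text import CountVectorizer

def tokenizer(text, stem=True, lemmatize=False, char_lower_limit=2, char_upper_limit=30):
    # Simplified stand-in: keep whitespace-separated tokens within the length limits.
    return [t for t in text.split() if char_lower_limit <= len(t) <= char_upper_limit]

# Bind the keyword arguments up front, then pass the resulting callable to the vectorizer.
vect = CountVectorizer(tokenizer=partial(tokenizer, stem=False, lemmatize=True))
vect.fit(["a custom tokenizer configured through functools.partial"])
print(vect.get_feature_names_out())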
6 votes • 5 answers

Tokenizing numbers for a parser

I am writing my first parser and have a few questions concerning the tokenizer. Basically, my tokenizer exposes a nextToken() function that is supposed to return the next token. These tokens are distinguished by a token type. I think it would make…
René Nyffenegger • 39,402 • 33 • 158 • 293
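
For the general question of how to tag numbers with a token type, a regex-driven tokenizer is a compact reference point. The sketch below distinguishes integer and floating-point tokens; it is purely illustrative and not tied to the asker's parser:

import re

# Ordered alternatives: FLOAT must come before INT so "3.14" is not read as 3, ".", 14.
TOKEN_SPEC = [
    ("FLOAT", r"\d+\.\d+"),
    ("INT",   r"\d+"),
    ("IDENT", r"[A-Za-z_]\w*"),
    ("OP",    r"[+\-*/()=]"),
    ("SKIP",  r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))

def next_tokens(source):
    # Yield (token_type, lexeme) pairs, skipping whitespace.
    for match in MASTER.finditer(source):
        if match.lastgroup != "SKIP":
            yield match.lastgroup, match.group()

print(list(next_tokens("x = 3.14 + 42")))
# [('IDENT', 'x'), ('OP', '='), ('FLOAT', '3.14'), ('OP', '+'), ('INT', '42')]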
6 votes • 1 answer

Elasticsearch "pattern_replace", replacing whitespaces while analyzing

Basically I want to remove all whitespace and tokenize the whole string as a single token. (I will use nGram on top of that later on.) These are my index settings: "settings": { "index": { "analysis": { "filter": { "whitespace_remove":…
6 votes • 5 answers

split char string with multi-character delimiter in C

I want to split a char *string based on a multi-character delimiter. I know that strtok() is used to split a string, but it works with single-character delimiters. I want to split a char *string based on a substring such as "abc" or any other…
Sadia Bashir • 75 • 1 • 1 • 7
6 votes • 3 answers

Parsing a pipe-delimited string into columns?

I have a column with pipe-separated values such as: '23|12.1| 450|30|9|78|82.5|92.1|120|185|52|11' I want to parse this column to fill a table with 12 corresponding columns: month1, month2, month3...month12. So month1 will have the value 23, month2…
DMS • 105 • 2 • 2 • 5