Questions tagged [tokenize]

Tokenizing is the act of splitting a stream of text into discrete elements called tokens, using a delimiter present in the stream. These tokens can then be processed further, for example to search for a value or to assign them to an array for looping.

Example (VBA):

Dim tokens As Variant
Dim sampleString As String
Dim i As Long

sampleString = "The quick brown fox jumps over the lazy dog."

' tokenize string based on space delimiter
tokens = Split(sampleString, " ")

' list tokens
For i = LBound(tokens) To UBound(tokens)
  MsgBox tokens(i)
Next i


2964 questions
7
votes
1 answer

How to tokenize continuous words with no whitespace delimiters?

I'm using Python with nltk. I need to process some English text that contains no whitespace, but nltk's word_tokenize function can't deal with input like this. How can I tokenize text without any whitespace? Is there any tool for this in Python?
VcamX
  • 81
  • 1
  • 4
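When there is no delimiter at all, the standard approach is dictionary-based segmentation: find a way to cover the string with known words. A minimal dynamic-programming sketch (the tiny WORDS set stands in for a real lexicon):

```python
# Minimal dictionary-based word segmentation via dynamic programming.
# WORDS is a stand-in for a real lexicon such as nltk's word list.
WORDS = {"the", "quick", "brown", "fox"}

def segment(text):
    """Split `text` into dictionary words, or return None if impossible."""
    n = len(text)
    best = [None] * (n + 1)   # best[i] = a segmentation of text[:i], or None
    best[0] = []
    for i in range(1, n + 1):
        for j in range(i):
            if best[j] is not None and text[j:i] in WORDS:
                best[i] = best[j] + [text[j:i]]
                break
    return best[n]

print(segment("thequickbrownfox"))  # ['the', 'quick', 'brown', 'fox']
```

With a real lexicon, ties between segmentations are usually broken by word frequency rather than first match.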
7
votes
3 answers

Tokenize, remove stop words using Lucene with Java

I am trying to tokenize and remove stop words from a .txt file with Lucene. I have this: public String removeStopWords(String string) throws IOException { Set stopWords = new HashSet(); stopWords.add("a"); …
whyname
  • 93
  • 1
  • 1
  • 4
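Lucene performs this inside its analysis chain (StandardTokenizer feeding a StopFilter), but the operation itself is just tokenize-then-filter. A plain-Python sketch, with a tiny illustrative stop list rather than a real one:

```python
STOP_WORDS = {"a", "an", "the", "over"}  # illustrative subset, not a full list

def remove_stop_words(text):
    """Lowercase, split on whitespace, and drop stop words."""
    tokens = text.lower().split()
    return [t for t in tokens if t not in STOP_WORDS]

print(remove_stop_words("The quick brown fox jumps over a lazy dog"))
# ['quick', 'brown', 'fox', 'jumps', 'lazy', 'dog']
```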
7
votes
2 answers

Java/clojure: Multiple character delimiter, and keep the delimiter

I'm working on a project in clojure, which can interop with any java classes, so the answer to my question could be for either java or clojure. Basically I need to be able to split a string into components based on a given delimiter (which will be…
Mediocre Gopher
  • 2,274
  • 1
  • 22
  • 39
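The trick, in whatever language, is to make the delimiter part of the match rather than discard it. In Python terms (the same idea carries over to Java's String.split with lookarounds), a capturing group in re.split keeps the delimiter as its own element, and a zero-width lookahead keeps it attached to the following piece:

```python
import re

# A capturing group in the pattern makes re.split keep the delimiters.
parts = re.split(r"(::)", "a::b::c")
print(parts)   # ['a', '::', 'b', '::', 'c']

# A lookahead splits *before* the delimiter, keeping it attached
# to the following component (Python 3.7+ allows zero-width splits).
parts2 = re.split(r"(?=::)", "a::b::c")
print(parts2)  # ['a', '::b', '::c']
```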
7
votes
1 answer

How to build a tokenizer in PHP?

I'm building a site for learning basic programming. I'm going to use a pseudolanguage in which users can submit their code, and I need to interpret it. However, I'm not sure how to build a tokenizer in PHP. Given a snippet such as this one: a = 1 b = 2 c…
lisovaccaro
  • 32,502
  • 98
  • 258
  • 410
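One common tokenizer shape, regardless of host language, is a single alternation regex with named groups scanned left to right. A minimal sketch in Python (the token names and toy grammar are invented for illustration):

```python
import re

# One regex with named alternatives; finditer yields (type, value) pairs.
TOKEN_RE = re.compile(r"""
    (?P<NUMBER>\d+)
  | (?P<IDENT>[A-Za-z_]\w*)
  | (?P<OP>[=+\-*/])
  | (?P<WS>\s+)
""", re.VERBOSE)

def tokenize(source):
    tokens = []
    for m in TOKEN_RE.finditer(source):
        if m.lastgroup != "WS":          # skip whitespace tokens
            tokens.append((m.lastgroup, m.group()))
    return tokens

print(tokenize("a = 1"))  # [('IDENT', 'a'), ('OP', '='), ('NUMBER', '1')]
```

Note that finditer silently skips characters no alternative matches; a production tokenizer would add a catch-all group and raise an error on it.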
7
votes
3 answers

Parsing Classes, Functions and Arguments in PHP

I want to create a function which receives a single argument that holds the path to a PHP file and then parses the given file and returns something like this: class NameOfTheClass function Method1($arg1, $arg2, $arg2) private function…
Alix Axel
  • 151,645
  • 95
  • 393
  • 500
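In PHP the usual starting point is the built-in token_get_all() lexer. To show the shape of the output such a function would produce, here is the analogous walk in Python using its ast module (the class in the source string is hypothetical):

```python
import ast

source = """
class Greeter:
    def greet(self, name, punctuation="!"):
        return "hi " + name + punctuation
"""

tree = ast.parse(source)
for node in ast.walk(tree):
    if isinstance(node, ast.ClassDef):
        print("class", node.name)
    elif isinstance(node, ast.FunctionDef):
        args = [a.arg for a in node.args.args]
        print("  def", node.name, args)
```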
7
votes
3 answers

Parsing URL string in Ruby

I have a pretty simple string I want to parse in Ruby, and I'm trying to find the most elegant solution. The string has the format /xyz/mov/exdaf/daeed.mov?arg1=blabla&arg2=3bla3bla What I would like to have is: string1: /xyz/mov/exdaf/daeed.mov string2:…
Arash Saff
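In Ruby this is typically URI.parse / CGI.parse territory; the same split, sketched in Python for concreteness:

```python
from urllib.parse import urlsplit, parse_qs

url = "/xyz/mov/exdaf/daeed.mov?arg1=blabla&arg2=3bla3bla"
parts = urlsplit(url)             # splits path from query at the '?'
print(parts.path)                 # /xyz/mov/exdaf/daeed.mov
print(parts.query)                # arg1=blabla&arg2=3bla3bla
print(parse_qs(parts.query))      # {'arg1': ['blabla'], 'arg2': ['3bla3bla']}
```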
6
votes
1 answer

How to stop Solr returning a result when the phrase contains a stopword?

I have a problem when searching Solr for a phrase that contains stopwords: Solr returns results matching the stopword, which is not the output I expect. I added the word "test" to the stopwords.txt file. In the schema.xml file, I have the field like
Sriram M
  • 482
  • 3
  • 12
6
votes
3 answers

strsep() usage and its alternative

#include <stdio.h> #include <string.h> int main() { char *slogan = "together{kaliya} [namak]"; char *slow_gun = strdup(slogan); char *token = strsep(&slow_gun, "{"); printf ("\n slow_gun: %s\n token: %s\n", slow_gun, token); return 0; } when I…
hari
  • 9,439
  • 27
  • 76
  • 110
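For readers comparing across languages: one strsep() step returns the text before the first delimiter and advances the source pointer past it. That single step can be mirrored in Python with str.partition:

```python
slogan = "together{kaliya} [namak]"

# str.partition splits at the first occurrence of the separator and keeps
# all three pieces -- roughly one strsep() call's worth of work in C.
token, sep, rest = slogan.partition("{")
print(token)  # together
print(rest)   # kaliya} [namak]
```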
6
votes
1 answer

Order of precedence for token matching in Flex

My apologies if the title of this thread is a little confusing. What I'm asking about is how does Flex (the lexical analyzer) handle issues of precedence? For example, let's say I have two tokens with similar regular expressions, written in the…
Casey Patton
  • 4,021
  • 9
  • 41
  • 54
6
votes
3 answers

How to stop BERT from breaking apart specific words into word-piece

I am using a pre-trained BERT model to tokenize text into meaningful tokens. However, the text contains many domain-specific words, and I don't want the BERT model to break them into word pieces. Is there any solution for this? For example: tokenizer =…
parvaneh shayegh
  • 507
  • 5
  • 13
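Hugging Face's BertTokenizer exposes a never_split argument and a tokenizer.add_tokens(...) method for this. The underlying idea — shield listed words from the sub-word pass — in a library-free sketch, with a toy word-piece function standing in for the real model:

```python
PROTECTED = {"pseudolanguage"}  # words to keep intact (assumed list)

def toy_wordpiece(word):
    # Stand-in for a real word-piece model: chop into 4-char pieces.
    return [word[:4]] + ["##" + word[i:i + 4] for i in range(4, len(word), 4)]

def tokenize(text):
    out = []
    for word in text.split():
        if word in PROTECTED:
            out.append(word)            # shielded: emit as a single token
        else:
            out.extend(toy_wordpiece(word))
    return out

print(tokenize("parse pseudolanguage"))
# ['pars', '##e', 'pseudolanguage']
```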
6
votes
3 answers

Processing before or after the train/test split

I am using this excellent article to learn machine learning: https://stackabuse.com/python-for-nlp-multi-label-text-classification-with-keras/ The author has tokenized the X and y data after splitting them up. X_train, X_test, y_train, y_test =…
shantanuo
  • 31,689
  • 78
  • 245
  • 403
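The usual guidance is to fit any tokenizer or vocabulary on the training split only, then apply it to both splits, so that nothing about the test set leaks into preprocessing. A library-free sketch of that discipline:

```python
def fit_vocab(texts):
    """Build a word->index vocabulary from the TRAINING texts only."""
    vocab = {}
    for text in texts:
        for word in text.lower().split():
            vocab.setdefault(word, len(vocab) + 1)  # 0 reserved for unknown
    return vocab

def encode(texts, vocab):
    return [[vocab.get(w, 0) for w in t.lower().split()] for t in texts]

train = ["the cat sat", "the dog ran"]
test = ["the cat ran fast"]          # "fast" never seen in training

vocab = fit_vocab(train)             # fit on train only
print(encode(test, vocab))           # [[1, 2, 5, 0]] -- unseen word -> 0
```

A keras Tokenizer follows the same pattern: call fit_on_texts on the training texts, then texts_to_sequences on both splits.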
6
votes
1 answer

Is it possible to change the token split rules for a Spacy tokenizer?

The (German) spacy tokenizer does not split on slashes, underscores, or asterisks by default, which is just what I need (so "der/die" results in a single token). However it does split on parentheses so "dies(und)das" gets split into 5 tokens. Is…
jpp1
  • 2,019
  • 3
  • 22
  • 43
6
votes
1 answer

Does keras-tokenizer perform the task of lemmatization and stemming?

Does the keras tokenizer provide functions such as stemming and lemmatization? If it does, how is it done? I need an intuitive understanding. Also, what does text_to_sequence do in that?
ASingh
  • 133
  • 1
  • 4
6
votes
2 answers

How to prevent splitting specific words or phrases and numbers in NLTK?

I have a problem with text matching when tokenizing, because the tokenizer splits specific words, dates and numbers. How can I prevent phrases like "run in my family", "30 minute walk" or "4x a day" from being split when tokenizing words in NLTK? They…
mm7
  • 63
  • 1
  • 5
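nltk ships MWETokenizer for exactly this: re-merging known multi-word expressions after an ordinary tokenization pass. The core idea, sketched without the library (the phrase list is taken from the question):

```python
PHRASES = [("30", "minute", "walk"), ("4x", "a", "day")]  # phrases to protect

def merge_phrases(tokens):
    """Re-join known multi-word phrases into single tokens."""
    out, i = [], 0
    while i < len(tokens):
        for phrase in PHRASES:
            if tuple(tokens[i:i + len(phrase)]) == phrase:
                out.append(" ".join(phrase))
                i += len(phrase)
                break
        else:
            out.append(tokens[i])
            i += 1
    return out

print(merge_phrases("a 30 minute walk helps".split()))
# ['a', '30 minute walk', 'helps']
```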
6
votes
4 answers

Recursive Descent Parser for something simple?

I'm writing a parser for a templating language which compiles into JS (if that's relevant). I started out with a few simple regexes, which seemed to work, but regexes are very fragile, so I decided to write a parser instead. I started by writing a…
ltimer
  • 61
  • 1
  • 2
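For context on how little code a recursive-descent parser needs for a simple grammar: a minimal arithmetic-expression evaluator in Python, with a regex tokenizer feeding mutually recursive rule functions (the grammar is a toy, chosen only to show the structure):

```python
import re

def tokenize(src):
    # Numbers and the four structural characters; whitespace is ignored.
    return re.findall(r"\d+|[+*()]", src)

def parse(tokens):
    """expr := term ('+' term)* ; term := factor ('*' factor)*"""
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def eat():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        return tok

    def factor():
        if peek() == "(":
            eat()                 # consume '('
            value = expr()
            eat()                 # consume ')'
            return value
        return int(eat())         # a number

    def term():
        value = factor()
        while peek() == "*":
            eat()
            value *= factor()
        return value

    def expr():
        value = term()
        while peek() == "+":
            eat()
            value += term()
        return value

    return expr()

print(parse(tokenize("2+3*(4+1)")))  # 17
```

Each nonterminal in the grammar becomes one function, which is why recursive descent stays readable even as the grammar grows.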