Questions tagged [tokenize]

Tokenizing is the act of splitting a stream of text into discrete elements called tokens, using a delimiter present in the stream. These tokens can then be processed further, for example to search for a value or to assign them to an array for looping.

Example (VBA):

Dim tokens As Variant
Dim sampleString As String
Dim i As Long

sampleString = "The quick brown fox jumps over the lazy dog."

' tokenize string based on space delimiter
tokens = Split(sampleString, " ")

' list tokens
For i = LBound(tokens) To UBound(tokens)
  MsgBox tokens(i)
Next i


2964 questions
7
votes
1 answer

How to tokenize continuous words with no whitespace delimiters?

I'm using Python with nltk. I need to process some English text that contains no whitespace, but nltk's word_tokenize function can't deal with input like this. How can I tokenize text without any whitespace? Is there any tool for this in Python?
VcamX
  • 81
  • 1
  • 4
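When there is no delimiter at all, the standard approach is dictionary-based segmentation: find a way to cover the string with known words. A minimal dynamic-programming sketch (the tiny WORDS set stands in for a real lexicon):

```python
# Minimal dictionary-based word segmentation via dynamic programming.
# WORDS is a stand-in for a real lexicon such as nltk's word list.
WORDS = {"the", "quick", "brown", "fox"}

def segment(text):
    """Split `text` into dictionary words, or return None if impossible."""
    n = len(text)
    best = [None] * (n + 1)   # best[i] = a segmentation of text[:i], or None
    best[0] = []
    for i in range(1, n + 1):
        for j in range(i):
            if best[j] is not None and text[j:i] in WORDS:
                best[i] = best[j] + [text[j:i]]
                break
    return best[n]

print(segment("thequickbrownfox"))  # ['the', 'quick', 'brown', 'fox']
```

With a real lexicon, ties between segmentations are usually broken by word frequency rather than first match.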
7
votes
3 answers

Tokenize, remove stop words using Lucene with Java

I am trying to tokenize and remove stop words from a .txt file with Lucene. I have this: public String removeStopWords(String string) throws IOException { Set stopWords = new HashSet(); stopWords.add("a"); …
whyname
  • 93
  • 1
  • 1
  • 4
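Lucene performs this inside its analysis chain (StandardTokenizer feeding a StopFilter), but the operation itself is just tokenize-then-filter. A plain-Python sketch, with a tiny illustrative stop list rather than a real one:

```python
STOP_WORDS = {"a", "an", "the", "over"}  # illustrative subset, not a full list

def remove_stop_words(text):
    """Lowercase, split on whitespace, and drop stop words."""
    tokens = text.lower().split()
    return [t for t in tokens if t not in STOP_WORDS]

print(remove_stop_words("The quick brown fox jumps over a lazy dog"))
# ['quick', 'brown', 'fox', 'jumps', 'lazy', 'dog']
```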
7
votes
2 answers

Java/clojure: Multiple character delimiter, and keep the delimiter

I'm working on a project in clojure, which can interop with any java classes, so the answer to my question could be for either java or clojure. Basically I need to be able to split a string into components based on a given delimiter (which will be…
Mediocre Gopher
  • 2,274
  • 1
  • 22
  • 39
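The trick, in whatever language, is to make the delimiter part of the match rather than discard it. In Python terms (the same idea carries over to Java's String.split with lookarounds), a capturing group in re.split keeps the delimiter as its own element, and a zero-width lookahead keeps it attached to the following piece:

```python
import re

# A capturing group in the pattern makes re.split keep the delimiters.
parts = re.split(r"(::)", "a::b::c")
print(parts)   # ['a', '::', 'b', '::', 'c']

# A lookahead splits *before* the delimiter, keeping it attached
# to the following component (Python 3.7+ allows zero-width splits).
parts2 = re.split(r"(?=::)", "a::b::c")
print(parts2)  # ['a', '::b', '::c']
```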
7
votes
1 answer

How to build a tokenizer in PHP?

I'm building a site for learning basic programming. I'm going to use a pseudolanguage in which users can submit their code, and I need to interpret it. However, I'm not sure how to build a tokenizer in PHP. Given a snippet such as this one: a = 1 b = 2 c…
lisovaccaro
  • 32,502
  • 98
  • 258
  • 410
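One common tokenizer shape, regardless of host language, is a single alternation regex with named groups scanned left to right. A minimal sketch in Python (the token names and toy grammar are invented for illustration):

```python
import re

# One regex with named alternatives; finditer yields (type, value) pairs.
TOKEN_RE = re.compile(r"""
    (?P<NUMBER>\d+)
  | (?P<IDENT>[A-Za-z_]\w*)
  | (?P<OP>[=+\-*/])
  | (?P<WS>\s+)
""", re.VERBOSE)

def tokenize(source):
    tokens = []
    for m in TOKEN_RE.finditer(source):
        if m.lastgroup != "WS":          # skip whitespace tokens
            tokens.append((m.lastgroup, m.group()))
    return tokens

print(tokenize("a = 1"))  # [('IDENT', 'a'), ('OP', '='), ('NUMBER', '1')]
```

Note that finditer silently skips characters no alternative matches; a production tokenizer would add a catch-all group and raise an error on it.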
7
votes
3 answers

Parsing Classes, Functions and Arguments in PHP

I want to create a function which receives a single argument that holds the path to a PHP file and then parses the given file and returns something like this: class NameOfTheClass function Method1($arg1, $arg2, $arg2) private function…
Alix Axel
  • 151,645
  • 95
  • 393
  • 500
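In PHP the usual starting point is the built-in token_get_all() lexer. To show the shape of the output such a function would produce, here is the analogous walk in Python using its ast module (the class in the source string is hypothetical):

```python
import ast

source = """
class Greeter:
    def greet(self, name, punctuation="!"):
        return "hi " + name + punctuation
"""

tree = ast.parse(source)
for node in ast.walk(tree):
    if isinstance(node, ast.ClassDef):
        print("class", node.name)
    elif isinstance(node, ast.FunctionDef):
        args = [a.arg for a in node.args.args]
        print("  def", node.name, args)
```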
7
votes
3 answers

Parsing URL string in Ruby

I have a pretty simple string I want to parse in Ruby, and I'm trying to find the most elegant solution. The string has the format /xyz/mov/exdaf/daeed.mov?arg1=blabla&arg2=3bla3bla What I would like to have is: string1: /xyz/mov/exdaf/daeed.mov string2:…
Arash Saff
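In Ruby this is typically URI.parse / CGI.parse territory; the same split, sketched in Python for concreteness:

```python
from urllib.parse import urlsplit, parse_qs

url = "/xyz/mov/exdaf/daeed.mov?arg1=blabla&arg2=3bla3bla"
parts = urlsplit(url)             # splits path from query at the '?'
print(parts.path)                 # /xyz/mov/exdaf/daeed.mov
print(parts.query)                # arg1=blabla&arg2=3bla3bla
print(parse_qs(parts.query))      # {'arg1': ['blabla'], 'arg2': ['3bla3bla']}
```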
6
votes
1 answer

How to stop Solr returning a result when the phrase contains a stopword?

I have a problem when searching Solr for a phrase that contains stopwords: Solr returns results matching the stopword, which is not the output I expect. I added the word "test" to the stopwords.txt file. In the schema.xml file, I have the field like
Sriram M
  • 482
  • 3
  • 12
6
votes
3 answers

strsep() usage and its alternative

#include <stdio.h> #include <string.h> int main() { char *slogan = "together{kaliya} [namak]"; char *slow_gun = strdup(slogan); char *token = strsep(&slow_gun, "{"); printf ("\n slow_gun: %s\n token: %s\n", slow_gun, token); return 0; } when I…
hari
  • 9,439
  • 27
  • 76
  • 110
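For readers comparing across languages: one strsep() step returns the text before the first delimiter and advances the source pointer past it. That single step can be mirrored in Python with str.partition:

```python
slogan = "together{kaliya} [namak]"

# str.partition splits at the first occurrence of the separator and keeps
# all three pieces -- roughly one strsep() call's worth of work in C.
token, sep, rest = slogan.partition("{")
print(token)  # together
print(rest)   # kaliya} [namak]
```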
6
votes
1 answer

Order of precedence for token matching in Flex

My apologies if the title of this thread is a little confusing. What I'm asking about is how does Flex (the lexical analyzer) handle issues of precedence? For example, let's say I have two tokens with similar regular expressions, written in the…
Casey Patton
  • 4,021
  • 9
  • 41
  • 54
6
votes
3 answers

How to stop BERT from breaking apart specific words into word-piece

I am using a pre-trained BERT model to tokenize text into meaningful tokens. However, the text contains many domain-specific words, and I don't want the BERT model to break them into word pieces. Is there any solution for this? For example: tokenizer =…
parvaneh shayegh
  • 507
  • 5
  • 13
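Hugging Face's BertTokenizer exposes a never_split argument and a tokenizer.add_tokens(...) method for this. The underlying idea — shield listed words from the sub-word pass — in a library-free sketch, with a toy word-piece function standing in for the real model:

```python
PROTECTED = {"pseudolanguage"}  # words to keep intact (assumed list)

def toy_wordpiece(word):
    # Stand-in for a real word-piece model: chop into 4-char pieces.
    return [word[:4]] + ["##" + word[i:i + 4] for i in range(4, len(word), 4)]

def tokenize(text):
    out = []
    for word in text.split():
        if word in PROTECTED:
            out.append(word)            # shielded: emit as a single token
        else:
            out.extend(toy_wordpiece(word))
    return out

print(tokenize("parse pseudolanguage"))
# ['pars', '##e', 'pseudolanguage']
```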
6
votes
3 answers

Processing before or after the train/test split

I am using this excellent article to learn machine learning: https://stackabuse.com/python-for-nlp-multi-label-text-classification-with-keras/ The author has tokenized the X and y data after splitting them up. X_train, X_test, y_train, y_test =…
shantanuo
  • 31,689
  • 78
  • 245
  • 403
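The usual guidance is to fit any tokenizer or vocabulary on the training split only, then apply it to both splits, so that nothing about the test set leaks into preprocessing. A library-free sketch of that discipline:

```python
def fit_vocab(texts):
    """Build a word->index vocabulary from the TRAINING texts only."""
    vocab = {}
    for text in texts:
        for word in text.lower().split():
            vocab.setdefault(word, len(vocab) + 1)  # 0 reserved for unknown
    return vocab

def encode(texts, vocab):
    return [[vocab.get(w, 0) for w in t.lower().split()] for t in texts]

train = ["the cat sat", "the dog ran"]
test = ["the cat ran fast"]          # "fast" never seen in training

vocab = fit_vocab(train)             # fit on train only
print(encode(test, vocab))           # [[1, 2, 5, 0]] -- unseen word -> 0
```

A keras Tokenizer follows the same pattern: call fit_on_texts on the training texts, then texts_to_sequences on both splits.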
6
votes
1 answer

Is it possible to change the token split rules for a Spacy tokenizer?

The (German) spacy tokenizer does not split on slashes, underscores, or asterisks by default, which is just what I need (so "der/die" results in a single token). However it does split on parentheses so "dies(und)das" gets split into 5 tokens. Is…
jpp1
  • 2,019
  • 3
  • 22
  • 43
6
votes
1 answer

Does keras-tokenizer perform the task of lemmatization and stemming?

Does the keras tokenizer provide functions such as stemming and lemmatization? If it does, how is it done? I need an intuitive understanding. Also, what does text_to_sequence do in that?
ASingh
  • 133
  • 1
  • 4
6
votes
2 answers

How to prevent splitting specific words or phrases and numbers in NLTK?

I have a problem with text matching when tokenizing, because the tokenizer splits specific words, dates and numbers. How can I prevent phrases like "run in my family", "30 minute walk" or "4x a day" from being split when tokenizing words in NLTK? They…
mm7
  • 63
  • 1
  • 5
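nltk ships MWETokenizer for exactly this: re-merging known multi-word expressions after an ordinary tokenization pass. The core idea, sketched without the library (the phrase list is taken from the question):

```python
PHRASES = [("30", "minute", "walk"), ("4x", "a", "day")]  # phrases to protect

def merge_phrases(tokens):
    """Re-join known multi-word phrases into single tokens."""
    out, i = [], 0
    while i < len(tokens):
        for phrase in PHRASES:
            if tuple(tokens[i:i + len(phrase)]) == phrase:
                out.append(" ".join(phrase))
                i += len(phrase)
                break
        else:
            out.append(tokens[i])
            i += 1
    return out

print(merge_phrases("a 30 minute walk helps".split()))
# ['a', '30 minute walk', 'helps']
```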
6
votes
4 answers

Recursive Descent Parser for something simple?

I'm writing a parser for a templating language which compiles into JS (if that's relevant). I started out with a few simple regexes, which seemed to work, but regexes are very fragile, so I decided to write a parser instead. I started by writing a…
ltimer
  • 61
  • 1
  • 2
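For context on how little code a recursive-descent parser needs for a simple grammar: a minimal arithmetic-expression evaluator in Python, with a regex tokenizer feeding mutually recursive rule functions (the grammar is a toy, chosen only to show the structure):

```python
import re

def tokenize(src):
    # Numbers and the four structural characters; whitespace is ignored.
    return re.findall(r"\d+|[+*()]", src)

def parse(tokens):
    """expr := term ('+' term)* ; term := factor ('*' factor)*"""
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def eat():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        return tok

    def factor():
        if peek() == "(":
            eat()                 # consume '('
            value = expr()
            eat()                 # consume ')'
            return value
        return int(eat())         # a number

    def term():
        value = factor()
        while peek() == "*":
            eat()
            value *= factor()
        return value

    def expr():
        value = term()
        while peek() == "+":
            eat()
            value += term()
        return value

    return expr()

print(parse(tokenize("2+3*(4+1)")))  # 17
```

Each nonterminal in the grammar becomes one function, which is why recursive descent stays readable even as the grammar grows.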