Questions tagged [tokenize]

Tokenizing is the act of splitting a string into discrete elements called tokens.

Tokenizing is the act of splitting a stream of text into discrete elements called tokens using a delimiter present in the stream. These tokens can then be processed further, for example to search for a value or assign to an array for looping.

Example (VBA):

Sub TokenizeExample()
    Dim tokens As Variant
    Dim sampleString As String
    Dim i As Long

    sampleString = "The quick brown fox jumps over the lazy dog."

    ' tokenize string based on space delimiter
    tokens = Split(sampleString, " ")

    ' list tokens, one message box per token
    For i = LBound(tokens) To UBound(tokens)
        MsgBox tokens(i)
    Next i
End Sub

2964 questions
1 vote, 1 answer

C++ tokenizer separator not compiling

I want to split by comma, and I have the following class which is instantiated with a comma-separated line. The class is as follows: #include #include #include #include #include #include…
David Villasmil
  • 395
  • 2
  • 19
1 vote, 4 answers

Strtok removes first character in token C

I have this little problem when trying to tokenize a string from an HTTP request directed at my "home made" HTTP server. Basically I am using these lines of code to tokenize: token = strtok(bufptr, "\n"); while(token != NULL){ …
1 vote, 1 answer

SWI-Prolog tokenize_atom/2 replacement?

What I need to do is break an atom into tokens, e.g.: tokenize_string('Hello, World!', L). would unify L=['Hello',',','World','!']. Exactly as tokenize_atom/2 does. But when I try to use tokenize_atom/2 with non-Latin letters it fails. Is there any…
evgeniuz
  • 2,599
  • 5
  • 30
  • 36
1 vote, 2 answers

Solr: Combining PatternTokenizerFactory and PathHierarchyTokenizerFactory?

In short: In schema.xml I want to declare an analyzer that breaks apart a field with the PatternTokenizer, and then I want those values to be processed by PathHierarchyTokenizer. (Path Tokenizer breaks up paths like "a/b/c" into [a, a/b,…
Mark Bennett
  • 1,446
  • 2
  • 19
  • 37
1 vote, 2 answers

How to tokenize cpp source?

For example, there is a cpp source code like: #include using namespace std; int main() { int counta=0; int countb=0; while(cin.get()!='*') count++; cout<
Jialin
  • 2,415
  • 2
  • 14
  • 10
1 vote, 2 answers

Tokenizing a String to Pass as char * into execve()

My knowledge of C is very limited. I'm trying to tokenize a string passed to a server from a client, because I want to pass the arguments to execve. The arguments passed via buffer need to be copied to *argv and tokenized such that buffer's tokens can…
rice2007
  • 115
  • 12
1 vote, 1 answer

Tokenizing japanese string and converting to hiragana

I am using string tokenizer and transform APIs to convert kanji characters to hiragana. The code in query (What is the replacement for Language Analysis framework's Morpheme analysis deprecated APIs) converts most of kanji characters to hiragana but…
Nitesh
  • 2,681
  • 4
  • 27
  • 45
1 vote, 1 answer

Creating a syntax tree from tokens

I'm trying to create a tiny interpreter for TI-BASIC syntax. This is a snippet of TI-BASIC I'm trying to interpret A->(2+(3*3)) I've tokenized the code above into this sequence of tokens: Token{type=VARIABLE, content='A'} Token{type=ASSIGN,…
August
  • 12,410
  • 3
  • 35
  • 51
1 vote, 1 answer

Error in Keyword extraction using lucene

I'm completely new to the text extraction concept. When I was searching for an example I found one that is implemented using Lucene. I just tried to run it in Eclipse but it gave an error. This is the error I'm getting: (TokenStream contract…
1 vote, 2 answers

feed treetagger in R with text in string rather than text in file

I use TreeTagger from R, through the Korpus package. Calling the treetag function requires me to indicate a filename, which contains the text to be processed. However, I would like to provide a string rather than a filename, because I have to do some…
Marc G.
  • 141
  • 1
  • 9
1 vote, 2 answers

Recognition of first and last name as one entity

I am interested in natural language processing. I am wondering if there is a well-known algorithm that can identify a first and last name in a text as one entity. For example, if we have this: Last week John Wayne went to Europe. I want to have a…
TJ1
  • 7,578
  • 19
  • 76
  • 119
1 vote, 1 answer

How to tokenize only certain words in Lucene

I'm using Lucene for my project and I need a custom Analyzer. Code is: public class MyCommentAnalyzer extends Analyzer { @Override protected TokenStreamComponents createComponents( String fieldName, Reader reader ) { Tokenizer source =…
1 vote, 1 answer

Tokenizing a string in Clojure

I am trying to tokenize a string using Clojure. The basic tokenization rules require the string to be split into separate symbols as follows: String literals of the form "hello world" are a single token. Every word that is not part of a string…
1 vote, 2 answers

XSLT: Splitting element contents on a text delimiter, keeping elements

I'm trying to parse element contents and split them on a delimiter, while keeping all elements in the parent. I don't need -- don't want -- to find the delimiter inside the child elements. Some text more text;…
Keith Davies
  • 215
  • 2
  • 11
1 vote, 1 answer

Tokenizing a string with strtok() causes crash in c

I'm trying to create a function that tokenizes a given string with given delimiters, puts the tokens in a 2D char array and returns it. The code is shown below: char** stringTokenizer(const char* str, const char* delims){ char** tokens; …
Chris Kon
  • 27
  • 5