Questions tagged [tokenize]

Tokenizing is the act of splitting a string into discrete elements called tokens.

Tokenizing is the act of splitting a stream of text into discrete elements called tokens, typically using a delimiter present in the stream. These tokens can then be processed further, for example to search for a value or to assign them to an array for looping.

Example (VBA):

Dim tokens As Variant
Dim sampleString As String
Dim i As Long

sampleString = "The quick brown fox jumps over the lazy dog."

' tokenize string based on space delimiter
tokens = Split(sampleString, " ")

' list tokens
For i = LBound(tokens) To UBound(tokens)
  MsgBox tokens(i)
Next i
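
For comparison, the same tokenization in Python using str.split with an explicit space delimiter:

# tokenize the sample string on the space delimiter
tokens = "The quick brown fox jumps over the lazy dog.".split(" ")

# list tokens
for token in tokens:
    print(token)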

2964 questions
6 votes, 1 answer

Parsing/Tokenizing a String Containing a SQL Command

Are there any open source libraries (any language, python/PHP preferred) that will tokenize/parse an ANSI SQL string into its various components? That is, if I had the following string SELECT a.foo, b.baz, a.bar FROM TABLE_A a LEFT JOIN TABLE_B…
Alana Storm · 164,128
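
One option for this in Python is the sqlparse library. A minimal sketch, assuming sqlparse is installed; the SQL string is adapted from the truncated excerpt above, and the join condition (a.id = b.id) is made up for illustration:

import sqlparse

sql = "SELECT a.foo, b.baz, a.bar FROM TABLE_A a LEFT JOIN TABLE_B b ON a.id = b.id"

# parse() returns one Statement object per SQL statement in the input
statement = sqlparse.parse(sql)[0]

# each statement exposes its tokens (keywords, identifiers, whitespace, ...)
for token in statement.tokens:
    print(token.ttype, repr(token.value))
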
6 votes, 4 answers

Add multiValued field to a SolrInputDocument

We are using an embedded Solr instance with Java SolrJ. I want to add a multivalued field to a document. The multivalued field is a comma-separated String. In Java I want to do: solrInputDocument.addField(Field1, "value1,value2,value3"); The…
Sal81 · 101
6 votes, 1 answer

Amazon like search with Solr

We have an online store where we use Solr for searching products. The basic setup works fine, but currently it's lacking some features. I looked up some online shops like Amazon, and I liked the features they are offering. So I thought, how could I…
23tux · 14,104
6 votes, 1 answer

Incorrect Tokenization with Marpa

I have a rather large Marpa grammar (for parsing XPath), and I ran into a problem with tokenization. I created a minimal breaking example below: use strict; use warnings; use Marpa::R2; my $grammar = Marpa::R2::Scanless::G->new( { …
Nate Glenn · 6,455
6 votes, 3 answers

C++/Boost split a string on more than one character

This is probably really simple once I see an example, but how do I generalize boost::tokenizer or boost::split to deal with separators consisting of more than one character? For example, with "__", neither of these standard splitting solutions seems…
daj · 6,962
6 votes, 1 answer

korean language tokenizer

What is the best tokenizer for processing the Korean language? I have tried CJKTokenizer in Solr 4.0. It does the tokenization, but accuracy is very low.
gangatharan · 781
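
Outside Solr, one commonly used open-source option is KoNLPy's Okt morphological analyzer. A minimal Python sketch, assuming the konlpy package (and the JVM it requires) is installed; the sample sentence is only illustrative:

from konlpy.tag import Okt

okt = Okt()
# morphs() splits a Korean sentence into morpheme-level tokens
print(okt.morphs("아버지가 방에 들어가신다"))
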
6 votes, 1 answer

stanford nlp tokenizer

How can I tokenize a string in a Java class using the Stanford parser? I am only able to find examples of DocumentPreprocessor and PTBTokenizer taking text from an external file. DocumentPreprocessor dp = new DocumentPreprocessor("hello.txt"); for (List…
Naveen · 773
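
The question is about the Java API, but Stanford also publishes a Python package, stanza, whose pipeline can tokenize a plain string directly. A sketch, assuming stanza is installed and the English models have been downloaded; the sample text is made up:

import stanza

# stanza.download("en")  # one-time model download
nlp = stanza.Pipeline(lang="en", processors="tokenize")

doc = nlp("This is a sentence. And here is another one.")
for sentence in doc.sentences:
    print([token.text for token in sentence.tokens])
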
6 votes, 2 answers

Tokenize byte array

I have an array of raw bytes which I need to tokenize into a list of byte arrays in Java. This is explained better by the following method declaration: public static List splitMessage(byte[] rawByte, String tokenDelimiter) Example runs. Example Run 1:…
user813063 · 63
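
The question targets Java, but the underlying idea, splitting a raw byte buffer on a delimiter into a list of smaller byte arrays, looks like this in Python (the data and the "|" delimiter are hypothetical):

raw = b"msg1|msg2|msg3"

# bytes.split returns a list of byte strings, one per token
parts = raw.split(b"|")
print(parts)   # [b'msg1', b'msg2', b'msg3']
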
6 votes, 3 answers

How to tokenize Chinese language document

I will be getting documents written in Chinese which I have to tokenize and store in a database table. I was trying the CJKBigramFilter of Lucene, but all it does is join two characters together, for which the meaning is different than…
Pradeep · 99
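
For dictionary-based word segmentation (rather than Lucene's bigram approach), one widely used Python option is jieba. A minimal sketch, assuming jieba is installed; the sentence is the library's own example:

import jieba

# lcut() returns a list of segmented words instead of an iterator
print(jieba.lcut("我来到北京清华大学"))
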
6 votes, 1 answer

ElasticSearch Stemming

I am using ElasticSearch and I want to set up basic stemming for English. So basically, fighter returns fight or any word that contains the fight root. I am a little confused about how to implement this. I was reading through the analyzers, tokenizers and…
Gabbar · 4,006
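
A typical approach is a custom analyzer that chains the standard tokenizer with a lowercase filter and an English stemmer token filter. A sketch using the official Python client; the index name is hypothetical and the exact client call varies by Elasticsearch version:

from elasticsearch import Elasticsearch

# analysis settings: standard tokenizer + lowercase + English stemmer
body = {
    "settings": {
        "analysis": {
            "filter": {
                "english_stemmer": {"type": "stemmer", "language": "english"}
            },
            "analyzer": {
                "english_stemmed": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["lowercase", "english_stemmer"]
                }
            }
        }
    }
}

es = Elasticsearch("http://localhost:9200")
es.indices.create(index="products", body=body)
# text fields should then be mapped to the "english_stemmed" analyzer
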
6 votes, 2 answers

Tokenization, and indexing with Lucene, how to handle external tokenize and part-of-speech?

I would like to build my own - I am not sure which one - tokenizer (from the Lucene point of view) or my own analyzer. I have already written code that tokenizes my documents into words (as a List<String> or a List<Word>, where Word is a class with only…
user1340802 · 1,157
6 votes, 4 answers

bash parse filename

Is there any way in bash to parse this filename : $file = dos1-20120514104538.csv.3310686 into variables like $date = 2012-05-14 10:45:38 and $id = 3310686 ? Thank you
pufos · 2,890
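
The question asks for bash, but the extraction itself is just pattern matching on the embedded timestamp and the trailing id (in bash the same can be done with the =~ regex operator or parameter expansion). The equivalent in Python, for reference:

import re

filename = "dos1-20120514104538.csv.3310686"

m = re.match(r".*-(\d{4})(\d{2})(\d{2})(\d{2})(\d{2})(\d{2})\.csv\.(\d+)$", filename)
if m:
    year, month, day, hour, minute, second, file_id = m.groups()
    date = f"{year}-{month}-{day} {hour}:{minute}:{second}"
    print(date, file_id)   # 2012-05-14 10:45:38 3310686
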
5 votes, 2 answers

Tokenizing large (>70MB) TXT file using Python NLTK. Concatenation & write data to stream errors

First of all, I am new to python/nltk so my apologies if the question is too basic. I have a large file that I am trying to tokenize; I get memory errors. One solution I've read about is to read the file one line at a time, which makes sense,…
Luis Miguel · 5,057
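
The line-at-a-time approach mentioned in the excerpt looks roughly like this; the file name is hypothetical, and it assumes nltk and its punkt tokenizer models are installed:

import nltk
# nltk.download("punkt")  # one-time download of the tokenizer models

tokens = []
with open("large_corpus.txt", encoding="utf-8") as handle:
    for line in handle:                      # stream the file instead of reading it whole
        tokens.extend(nltk.word_tokenize(line))

print(len(tokens))
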
5 votes, 6 answers

Tokenizing strings using regular expression in Javascript

Suppose I have a long string containing newlines and tabs: var x = "This is a long string.\n\t This is another one on next line."; How can we split this string into tokens using a regular expression? I don't want to use .split(' ') because I…
Nawaz · 353,942
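
The question is about JavaScript, but the core idea, splitting on a whitespace character class rather than a single literal space, is the same in any regex engine; in Python, for comparison:

import re

x = "This is a long string.\n\t This is another one on next line."

# \s+ matches runs of spaces, tabs and newlines, so empty tokens are avoided
tokens = re.split(r"\s+", x.strip())
print(tokens)
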
5 votes, 1 answer

Solr(Lucene) is indexing only the first document after adding a custom TokenFilter

I created a custom token filter which concatenates all the tokens in the stream. This is my incrementToken() function public boolean incrementToken() throws IOException { if (finished) { …
Jithin · 1,108