Questions tagged [tokenize]

Tokenizing is the act of splitting a string into discrete elements called tokens.

Tokenizing is the act of splitting a stream of text into discrete elements called tokens, typically using a delimiter present in the stream. These tokens can then be processed further, for example to search for a value or to assign them to an array for looping.

Example (VBA):

Dim tokens As Variant
Dim sampleString As String
Dim i As Long

sampleString = "The quick brown fox jumps over the lazy dog."

' tokenize string based on space delimiter
tokens = Split(sampleString, " ")

' list tokens
For i = LBound(tokens) To UBound(tokens)
  MsgBox tokens(i)
Next i
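
For comparison, the same tokenization in Python using str.split with an explicit space delimiter:

# tokenize the sample string on the space delimiter
tokens = "The quick brown fox jumps over the lazy dog.".split(" ")

# list tokens
for token in tokens:
    print(token)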

2964 questions
6 votes, 1 answer

Parsing/Tokenizing a String Containing a SQL Command

Are there any open source libraries (any language, python/PHP preferred) that will tokenize/parse an ANSI SQL string into its various components? That is, if I had the following string SELECT a.foo, b.baz, a.bar FROM TABLE_A a LEFT JOIN TABLE_B…
Alana Storm · 164,128
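
One option for this in Python is the sqlparse library. A minimal sketch, assuming sqlparse is installed; the SQL string is adapted from the truncated excerpt above, and the join condition (a.id = b.id) is made up for illustration:

import sqlparse

sql = "SELECT a.foo, b.baz, a.bar FROM TABLE_A a LEFT JOIN TABLE_B b ON a.id = b.id"

# parse() returns one Statement object per SQL statement in the input
statement = sqlparse.parse(sql)[0]

# each statement exposes its tokens (keywords, identifiers, whitespace, ...)
for token in statement.tokens:
    print(token.ttype, repr(token.value))
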
6 votes, 4 answers

Add multiValued field to a SolrInputDocument

We are using an embedded Solr instance with Java SolrJ. I want to add a multivalued field to a document. The multivalued field is a comma-separated String. In Java I want to do: solrInputDocument.addField(Field1, "value1,value2,value3"); The…
Sal81 · 101
6 votes, 1 answer

Amazon like search with Solr

We have an online store where we use Solr for searching products. The basic setup works fine, but currently it's lacking some features. I looked up some online shops like Amazon, and I liked the features they are offering. So I thought, how could I…
23tux · 14,104
6 votes, 1 answer

Incorrect Tokenization with Marpa

I have a rather large Marpa grammar (for parsing XPath), and I ran into a problem with tokenization. I created a minimal breaking example below: use strict; use warnings; use Marpa::R2; my $grammar = Marpa::R2::Scanless::G->new( { …
Nate Glenn · 6,455
6 votes, 3 answers

C++/Boost split a string on more than one character

This is probably really simple once I see an example, but how do I generalize boost::tokenizer or boost::split to deal with separators consisting of more than one character? For example, with "__", neither of these standard splitting solutions seems…
daj · 6,962
6 votes, 1 answer

korean language tokenizer

What is the best tokenizer for processing the Korean language? I have tried CJKTokenizer in Solr 4.0. It does the tokenization, but accuracy is very low.
gangatharan · 781
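
Outside Solr, one commonly used open-source option is KoNLPy's Okt morphological analyzer. A minimal Python sketch, assuming the konlpy package (and the JVM it requires) is installed; the sample sentence is only illustrative:

from konlpy.tag import Okt

okt = Okt()
# morphs() splits a Korean sentence into morpheme-level tokens
print(okt.morphs("아버지가 방에 들어가신다"))
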
6 votes, 1 answer

stanford nlp tokenizer

How can I tokenize a string in a Java class using the Stanford parser? I am only able to find examples of DocumentPreprocessor and PTBTokenizer taking text from an external file. DocumentPreprocessor dp = new DocumentPreprocessor("hello.txt"); for (List…
Naveen · 773
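
The question is about the Java API, but Stanford also publishes a Python package, stanza, whose pipeline can tokenize a plain string directly. A sketch, assuming stanza is installed and the English models have been downloaded; the sample text is made up:

import stanza

# stanza.download("en")  # one-time model download
nlp = stanza.Pipeline(lang="en", processors="tokenize")

doc = nlp("This is a sentence. And here is another one.")
for sentence in doc.sentences:
    print([token.text for token in sentence.tokens])
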
6 votes, 2 answers

Tokenize byte array

I have an array of raw bytes which I need to tokenize into a list of byte arrays in Java. This is explained better by the following method declaration: public static List splitMessage(byte[] rawByte, String tokenDelimiter) Example runs. Example Run 1:…
user813063 · 63
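
The question targets Java, but the underlying idea, splitting a raw byte buffer on a delimiter into a list of smaller byte arrays, looks like this in Python (the data and the "|" delimiter are hypothetical):

raw = b"msg1|msg2|msg3"

# bytes.split returns a list of byte strings, one per token
parts = raw.split(b"|")
print(parts)   # [b'msg1', b'msg2', b'msg3']
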
6 votes, 3 answers

How to tokenize Chinese language document

I will be getting documents written in Chinese which I have to tokenize and store in a database table. I was trying the CJKBigramFilter of Lucene, but all it does is join two characters together, for which the meaning is different than…
Pradeep · 99
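
For dictionary-based word segmentation (rather than Lucene's bigram approach), one widely used Python option is jieba. A minimal sketch, assuming jieba is installed; the sentence is the library's own example:

import jieba

# lcut() returns a list of segmented words instead of an iterator
print(jieba.lcut("我来到北京清华大学"))
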
6 votes, 1 answer

ElasticSearch Stemming

I am using ElasticSearch and I want to set up basic stemming for English. So basically, fighter returns fight or any word that contains the fight root. I am a little confused about how to implement this. I was reading through the analyzers, tokenizers and…
Gabbar · 4,006
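
A typical approach is a custom analyzer that chains the standard tokenizer with a lowercase filter and an English stemmer token filter. A sketch using the official Python client; the index name is hypothetical and the exact client call varies by Elasticsearch version:

from elasticsearch import Elasticsearch

# analysis settings: standard tokenizer + lowercase + English stemmer
body = {
    "settings": {
        "analysis": {
            "filter": {
                "english_stemmer": {"type": "stemmer", "language": "english"}
            },
            "analyzer": {
                "english_stemmed": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["lowercase", "english_stemmer"]
                }
            }
        }
    }
}

es = Elasticsearch("http://localhost:9200")
es.indices.create(index="products", body=body)
# text fields should then be mapped to the "english_stemmed" analyzer
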
6 votes, 2 answers

Tokenization, and indexing with Lucene, how to handle external tokenize and part-of-speech?

I would like to build my own - I am not sure which one - tokenizer (from the Lucene point of view) or my own analyzer. I have already written code that tokenizes my documents into words (as a List<String> or a List<Word>, where Word is a class with only…
user1340802 · 1,157
6 votes, 4 answers

bash parse filename

Is there any way in bash to parse this filename : $file = dos1-20120514104538.csv.3310686 into variables like $date = 2012-05-14 10:45:38 and $id = 3310686 ? Thank you
pufos · 2,890
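
The question asks for bash, but the extraction itself is just pattern matching on the embedded timestamp and the trailing id (in bash the same can be done with the =~ regex operator or parameter expansion). The equivalent in Python, for reference:

import re

filename = "dos1-20120514104538.csv.3310686"

m = re.match(r".*-(\d{4})(\d{2})(\d{2})(\d{2})(\d{2})(\d{2})\.csv\.(\d+)$", filename)
if m:
    year, month, day, hour, minute, second, file_id = m.groups()
    date = f"{year}-{month}-{day} {hour}:{minute}:{second}"
    print(date, file_id)   # 2012-05-14 10:45:38 3310686
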
5 votes, 2 answers

Tokenizing large (>70MB) TXT file using Python NLTK. Concatenation & write data to stream errors

First of all, I am new to python/nltk so my apologies if the question is too basic. I have a large file that I am trying to tokenize; I get memory errors. One solution I've read about is to read the file one line at a time, which makes sense,…
Luis Miguel · 5,057
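
The line-at-a-time approach mentioned in the excerpt looks roughly like this; the file name is hypothetical, and it assumes nltk and its punkt tokenizer models are installed:

import nltk
# nltk.download("punkt")  # one-time download of the tokenizer models

tokens = []
with open("large_corpus.txt", encoding="utf-8") as handle:
    for line in handle:                      # stream the file instead of reading it whole
        tokens.extend(nltk.word_tokenize(line))

print(len(tokens))
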
5 votes, 6 answers

Tokenizing strings using regular expression in Javascript

Suppose I have a long string containing newlines and tabs: var x = "This is a long string.\n\t This is another one on next line."; How can we split this string into tokens using a regular expression? I don't want to use .split(' ') because I…
Nawaz · 353,942
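
The question is about JavaScript, but the core idea, splitting on a whitespace character class rather than a single literal space, is the same in any regex engine; in Python, for comparison:

import re

x = "This is a long string.\n\t This is another one on next line."

# \s+ matches runs of spaces, tabs and newlines, so empty tokens are avoided
tokens = re.split(r"\s+", x.strip())
print(tokens)
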
5 votes, 1 answer

Solr(Lucene) is indexing only the first document after adding a custom TokenFilter

I created a custom token filter which concatenates all the tokens in the stream. This is my incrementToken() function public boolean incrementToken() throws IOException { if (finished) { …
Jithin · 1,108