Questions tagged [tokenize]

Tokenizing is the act of splitting a stream of text into discrete elements, called tokens, using a delimiter present in the stream. The tokens can then be processed further, for example to search for a value or to assign them to an array for looping.

Example (VBA):

Dim tokens As Variant
Dim sampleString As String
Dim i As Long

sampleString = "The quick brown fox jumps over the lazy dog."

' tokenize string based on space delimiter
tokens = Split(sampleString, " ")

' list tokens
For i = LBound(tokens) To UBound(tokens)
  MsgBox tokens(i)
Next i
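
The same tokenization can be sketched in Python using the built-in str.split (a minimal equivalent of the VBA example above, not part of the original wiki):

```python
# Tokenize a string on the space delimiter, mirroring VBA's Split().
sample_string = "The quick brown fox jumps over the lazy dog."
tokens = sample_string.split(" ")

# List the tokens, one per line.
for token in tokens:
    print(token)
```

Note that, as in the VBA version, punctuation stays attached to the adjacent word ("dog."), since only the space character acts as a delimiter.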


2964 questions
1 vote · 1 answer

Tokenize an NSString for filtering data (search)

I'm trying to implement search filtering for a data source that is used to populate a UITableView. Basically, I'm trying to allow people to put in multiple words and split that one string into tokens and then iterate through each object in the…
mbm29414 • 11,558
1 vote · 1 answer

Regex and NLTK for latin1

I want to tokenize some texts in Portuguese. I think I'm doing almost everything right, but I can't figure out what is wrong. I'm trying this code: text = '''Família S.A. dispõe de $12.400 milhões para concorrência. A…
Marcelo • 438
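
As a rough illustration of the task (not the asker's NLTK code; the pattern below is a naive assumption), a stdlib-only Python sketch that keeps currency amounts like $12.400 together while tokenizing the sample sentence from the excerpt:

```python
import re

text = "Família S.A. dispõe de $12.400 milhões para concorrência."

# Match currency amounts first, then runs of word characters
# (Unicode-aware in Python 3, so accented letters work), then
# any single non-space symbol.
tokens = re.findall(r"\$?\d[\d.,]*|\w+|[^\w\s]", text)
print(tokens)
```

The alternation order matters: putting the currency pattern first prevents "$12.400" from being broken apart at the period.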
1 vote · 1 answer

Tokenize string and store result in boost::iterator_range

I need to tokenize a text (with ' ', '\n', '\t' as delimiters) with something like std::string text = "foo bar"; boost::iterator_range r = some_func_i_dont_know(text); Later I want to get the output with: for (auto i: result) …
user1587451 • 978
1 vote · 1 answer

Index field with not_analyzed in Elasticsearch (Java)

I am indexing city names (e.g. "New York") in Elasticsearch, which obviously cannot be whitespace-tokenized. How do I index such terms using the Java API? Currently I have code as below.. bulkRequest.add(client.prepareIndex("myIndex", "collection", if) …
user3549576 • 99
1 vote · 2 answers

SQL query to translate a list of numbers matched against several ranges, to a list of values

I need to convert a list of numbers that fall within certain ranges into a list of values, ordered by a priority column. The table has the following values: | YEAR | R_MIN | R_MAX | VAL | PRIO | ------------------------------------ 2010 18000 …
Claes Mogren • 2,126
1 vote · 2 answers

String tokenizing in C

I have strings like "− · · · −" (Morse code) in an array, and want to tokenize each string to get each individual dot(.) and dash(−). A part of my code is given below: char *code, *token; char x; char ch[4096]; code = &ch[0]; …
user3033194 • 1,775
1 vote · 2 answers

R stringr and str_extract_all: capturing contractions

I am doing a bit of NLP with R and am using the stringr package to tokenize some text. I would like to be able to capture contractions, for example won't, so that it is tokenized into "wo" and "n't". Here is a sample of what I've…
buruzaemon • 3,847
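
The Penn-Treebank-style split the asker describes can be sketched in Python (a hypothetical regex, not the asker's stringr code): match a word lazily up to a following "n't", or "n't" itself, or a plain word.

```python
import re

# "won't" -> "wo" + "n't"; ordinary words pass through unchanged.
pattern = r"\w+?(?=n't)|n't|\w+"
tokens = re.findall(pattern, "I won't go")
print(tokens)  # ['I', 'wo', "n't", 'go']
```

The lazy `\w+?` with a lookahead stops the word match just before the contraction suffix, so the suffix is picked up as its own token on the next pass.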
1 vote · 1 answer

ANTLR lexer mismatches tokens

I have a simple ANTLR grammar, which I have stripped down to its bare essentials to demonstrate this problem I'm having. I am using ANTLRworks 1.3.1. grammar sample; assignment : IDENT ':=' NUM ';' ; IDENT : ('a'..'z')+ ; NUM : …
Barry Brown • 20,233
1 vote · 1 answer

I want to refactor, parse the xml values and compare them in xsl

Some of the elements in have an associated element in a . I was able to compare the object and title and get the output, but I couldn't produce the desired output when there is a…
1 vote · 1 answer

Are there any sentence tokenizers in NLTK other than the Punkt tokenizer?

I am using NLTK to tokenize articles from Wikipedia into sentences. But the Punkt tokenizer is not giving very good results: sometimes sentences get split incorrectly when "etc." appears, or problems occur when double…
Amrith Krishna • 2,768
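
The abbreviation problem the asker hits can be illustrated with a stdlib-only Python sketch (a naive splitter with a hypothetical abbreviation list, not NLTK code): split after sentence-ending punctuation, except after "etc.".

```python
import re

text = "He bought apples, pears, etc. Then he left. Done."

# Split on whitespace that follows '.', '!' or '?', unless the
# preceding token is the abbreviation "etc." (extend the negative
# lookbehind for more abbreviations).
sentences = re.split(r"(?<!\betc\.)(?<=[.!?])\s+", text)
print(sentences)
```

A hand-rolled list like this is brittle, which is why trainable tokenizers such as Punkt learn abbreviations from a corpus instead.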
1 vote · 2 answers

Tokenize JSON data with Javascript

I have one JSON string like: {\"A\":\"1.354534634,\",\"B\":\"-0.432335,\",\"C\":\"0.234123423,\"} I need to tokenize this with JavaScript and assign values like this: Accel_X = value of A, i.e. 1.354534634 Accel_Y = value of B, i.e.…
ninja.stop • 410
1 vote · 1 answer

NSString:componentsSeparatedByCharactersInSet inclusive

NSString *infix = @"4+23-54/543*23"; NSCharacterSet *operatorSet = [NSCharacterSet characterSetWithCharactersInString:@"+-*/"]; NSArray *tokens = [infix componentsSeparatedByCharactersInSet:operatorSet]; tokens returns: [@"4", @"23", @"54",…
Daniel Node.js • 6,734
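
The "inclusive" behaviour the asker wants (keeping the operator tokens in the result) can be illustrated in Python rather than Objective-C: re.split keeps any delimiter that is wrapped in a capturing group (a sketch, not the asker's code).

```python
import re

infix = "4+23-54/543*23"

# The capturing group around the operator class makes re.split
# emit the matched delimiters alongside the operands.
tokens = re.split(r"([+*/-])", infix)
print(tokens)  # ['4', '+', '23', '-', '54', '/', '543', '*', '23']
```

Without the parentheses, re.split would drop the operators, which is the same behaviour the asker observed with componentsSeparatedByCharactersInSet.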
1 vote · 1 answer

Optimize NLTK Code To Make Predictions From Text

I am trying to build a model to predict whether the salary in a job description is above or below the 75th percentile (above = 1, below = 0). My data has about 250,000 rows and it's very hard to tokenize all the text from the job descriptions. My code seems to…
1 vote · 1 answer

Collapsing whitespace in ANTLR4

In my grammar, I have a whitespace token that is sent to the HIDDEN channel: SP : [ \u00A0\u000B\t\r\n] -> channel(HIDDEN); I know that I can get the text of a parsed rule, including hidden tokens, with TokenStream#getText(Context). I'd like to…
NickAldwin • 11,584
1 vote · 1 answer

Scanner Regex Delimiter Issue

I set a scanner's delimiter like: scanner.useDelimiter("(\\s*?)(#.*?\n)(\\s*?)"); The goal is to ignore comments of the form #comment \n Thus: Hello#inline comment world. becomes: Hello world. By setting the delimiter as I did, I would…
AaronF • 2,841
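
The intent (drop each #comment up to the end of its line, then tokenize on whitespace) can be sketched in Python rather than java.util.Scanner (an illustrative assumption, not the asker's code):

```python
import re

text = "Hello#inline comment\nworld."

# Replace each comment (from '#' to end of line) with a space,
# then tokenize on runs of whitespace.
without_comments = re.sub(r"#[^\n]*", " ", text)
tokens = without_comments.split()
print(tokens)  # ['Hello', 'world.']
```

Doing the removal as a substitution pass before tokenizing sidesteps the tricky interaction between the comment pattern and the whitespace delimiter that the Scanner approach runs into.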