Questions tagged [tokenize]

Tokenizing is the act of splitting a stream of text into discrete elements, called tokens, using a delimiter present in the stream. The tokens can then be processed further, for example to search for a value or to assign them to an array for looping.

Example (VBA):

Dim tokens As Variant
Dim sampleString As String
Dim i As Long

sampleString = "The quick brown fox jumps over the lazy dog."

' tokenize string based on space delimiter
tokens = Split(sampleString, " ")

' list tokens
For i = LBound(tokens) To UBound(tokens)
  MsgBox tokens(i)
Next i
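
The same tokenization can be sketched in Python using the built-in str.split (a minimal equivalent of the VBA example above, not part of the original wiki):

```python
# Tokenize a string on the space delimiter, mirroring VBA's Split().
sample_string = "The quick brown fox jumps over the lazy dog."
tokens = sample_string.split(" ")

# List the tokens, one per line.
for token in tokens:
    print(token)
```

Note that, as in the VBA version, punctuation stays attached to the adjacent word ("dog."), since only the space character acts as a delimiter.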


2964 questions
1 vote · 1 answer

Tokenize an NSString for filtering data (search)

I'm trying to implement search filtering for a data source that is used to populate a UITableView. Basically, I'm trying to allow people to put in multiple words and split that one string into tokens and then iterate through each object in the…
mbm29414 • 11,558
1 vote · 1 answer

Regex and NLTK for latin1

I want to tokenize some texts in Portuguese. I think I'm doing almost everything right, but I can't figure out what is wrong. I'm trying this code: text = '''Família S.A. dispõe de $12.400 milhões para concorrência. A…
Marcelo • 438
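
As a rough illustration of the task (not the asker's NLTK code; the pattern below is a naive assumption), a stdlib-only Python sketch that keeps currency amounts like $12.400 together while tokenizing the sample sentence from the excerpt:

```python
import re

text = "Família S.A. dispõe de $12.400 milhões para concorrência."

# Match currency amounts first, then runs of word characters
# (Unicode-aware in Python 3, so accented letters work), then
# any single non-space symbol.
tokens = re.findall(r"\$?\d[\d.,]*|\w+|[^\w\s]", text)
print(tokens)
```

The alternation order matters: putting the currency pattern first prevents "$12.400" from being broken apart at the period.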
1 vote · 1 answer

Tokenize string and store result in boost::iterator_range

I need to tokenize a text (with ' ', '\n', '\t' as delimiters) with something like std::string text = "foo bar"; boost::iterator_range r = some_func_i_dont_know(text); Later I want to get the output with: for (auto i: result) …
user1587451 • 978
1 vote · 1 answer

Index field with not_analyzed in Elasticsearch (Java)

I am indexing city names (e.g. "New York") in Elasticsearch, which obviously cannot be whitespace-tokenized. How do I index such terms using the Java API? Currently I have code as below.. bulkRequest.add(client.prepareIndex("myIndex", "collection", if) …
user3549576 • 99
1 vote · 2 answers

SQL query to translate a list of numbers matched against several ranges, to a list of values

I need to convert a list of numbers that fall within certain ranges into a list of values, ordered by a priority column. The table has the following values: | YEAR | R_MIN | R_MAX | VAL | PRIO | ------------------------------------ 2010 18000 …
Claes Mogren • 2,126
1 vote · 2 answers

String tokenizing in C

I have strings like "− · · · −" (Morse code) in an array, and want to tokenize each string to get each individual dot(.) and dash(−). A part of my code is given below: char *code, *token; char x; char ch[4096]; code = &ch[0]; …
user3033194 • 1,775
1 vote · 2 answers

R stringr and str_extract_all: capturing contractions

I am doing a bit of NLP with R and am using the stringr package to tokenize some text. I would like to be able to capture contractions, for example won't, so that it is tokenized into "wo" and "n't". Here is a sample of what I've…
buruzaemon • 3,847
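
The Penn-Treebank-style split the asker describes can be sketched in Python (a hypothetical regex, not the asker's stringr code): match a word lazily up to a following "n't", or "n't" itself, or a plain word.

```python
import re

# "won't" -> "wo" + "n't"; ordinary words pass through unchanged.
pattern = r"\w+?(?=n't)|n't|\w+"
tokens = re.findall(pattern, "I won't go")
print(tokens)  # ['I', 'wo', "n't", 'go']
```

The lazy `\w+?` with a lookahead stops the word match just before the contraction suffix, so the suffix is picked up as its own token on the next pass.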
1 vote · 1 answer

ANTLR lexer mismatches tokens

I have a simple ANTLR grammar, which I have stripped down to its bare essentials to demonstrate this problem I'm having. I am using ANTLRworks 1.3.1. grammar sample; assignment : IDENT ':=' NUM ';' ; IDENT : ('a'..'z')+ ; NUM : …
Barry Brown • 20,233
1 vote · 1 answer

I want to refactor, parse the xml values and compare them in xsl

Some of the elements in have an associated element in a . I was able to compare the object and title and get the output, but I couldn't produce the desired output when there is a…
1 vote · 1 answer

Are there any sentence tokenizers in NLTK other than the Punkt tokenizer?

I am using NLTK to tokenize articles from Wikipedia into sentences. But the Punkt tokenizer is not giving very good results: sometimes sentences get split incorrectly when "etc." appears, or problems occur when double…
Amrith Krishna • 2,768
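
The abbreviation problem the asker hits can be illustrated with a stdlib-only Python sketch (a naive splitter with a hypothetical abbreviation list, not NLTK code): split after sentence-ending punctuation, except after "etc.".

```python
import re

text = "He bought apples, pears, etc. Then he left. Done."

# Split on whitespace that follows '.', '!' or '?', unless the
# preceding token is the abbreviation "etc." (extend the negative
# lookbehind for more abbreviations).
sentences = re.split(r"(?<!\betc\.)(?<=[.!?])\s+", text)
print(sentences)
```

A hand-rolled list like this is brittle, which is why trainable tokenizers such as Punkt learn abbreviations from a corpus instead.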
1 vote · 2 answers

Tokenize JSON data with Javascript

I have one JSON string like: {\"A\":\"1.354534634,\",\"B\":\"-0.432335,\",\"C\":\"0.234123423,\"} I need to tokenize this with JavaScript and assign values like this: Accel_X = value of A, i.e. 1.354534634 Accel_Y = value of B, i.e.…
ninja.stop • 410
1 vote · 1 answer

NSString:componentsSeparatedByCharactersInSet inclusive

NSString *infix = @"4+23-54/543*23"; NSCharacterSet *operatorSet = [NSCharacterSet characterSetWithCharactersInString:@"+-*/"]; NSArray *tokens = [infix componentsSeparatedByCharactersInSet:operatorSet]; tokens returns: [@"4", @"23", @"54",…
Daniel Node.js • 6,734
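
The "inclusive" behaviour the asker wants (keeping the operator tokens in the result) can be illustrated in Python rather than Objective-C: re.split keeps any delimiter that is wrapped in a capturing group (a sketch, not the asker's code).

```python
import re

infix = "4+23-54/543*23"

# The capturing group around the operator class makes re.split
# emit the matched delimiters alongside the operands.
tokens = re.split(r"([+*/-])", infix)
print(tokens)  # ['4', '+', '23', '-', '54', '/', '543', '*', '23']
```

Without the parentheses, re.split would drop the operators, which is the same behaviour the asker observed with componentsSeparatedByCharactersInSet.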
1 vote · 1 answer

Optimize NLTK Code To Make Predictions From Text

I am trying to build a model to predict whether the salary in a job description is above or below the 75th percentile (above = 1, below = 0). My data has about 250,000 rows and it's very hard to tokenize all the text from the job descriptions. My code seems to…
1 vote · 1 answer

Collapsing whitespace in ANTLR4

In my grammar, I have a whitespace token that is sent to the HIDDEN channel: SP : [ \u00A0\u000B\t\r\n] -> channel(HIDDEN); I know that I can get the text of a parsed rule, including hidden tokens, with TokenStream#getText(Context). I'd like to…
NickAldwin • 11,584
1 vote · 1 answer

Scanner Regex Delimiter Issue

I set a scanner's delimiter like: scanner.useDelimiter("(\\s*?)(#.*?\n)(\\s*?)"); The goal is to ignore comments of the form #comment \n Thus: Hello#inline comment world. becomes: Hello world. By setting the delimiter as I did, I would…
AaronF • 2,841
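
The intent (drop each #comment up to the end of its line, then tokenize on whitespace) can be sketched in Python rather than java.util.Scanner (an illustrative assumption, not the asker's code):

```python
import re

text = "Hello#inline comment\nworld."

# Replace each comment (from '#' to end of line) with a space,
# then tokenize on runs of whitespace.
without_comments = re.sub(r"#[^\n]*", " ", text)
tokens = without_comments.split()
print(tokens)  # ['Hello', 'world.']
```

Doing the removal as a substitution pass before tokenizing sidesteps the tricky interaction between the comment pattern and the whitespace delimiter that the Scanner approach runs into.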