Questions tagged [tokenize]

Tokenizing is the act of splitting a stream of text into discrete elements called tokens using a delimiter present in the stream. These tokens can then be processed further, for example by searching them for a value or storing them in an array for looping.

Example (VBA):

Dim tokens As Variant
Dim sampleString As String
Dim i As Long

sampleString = "The quick brown fox jumps over the lazy dog."

' tokenize string based on space delimiter
tokens = Split(sampleString, " ")

' list tokens
For i = LBound(tokens) To UBound(tokens)
  MsgBox tokens(i)
Next i
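
For comparison, here is a minimal sketch of the same idea in Python rather than VBA, covering both uses mentioned above: searching the tokens for a value and looping over them. The helper name contains_token is hypothetical, and splitting on a single literal space is an assumption carried over from the VBA example.

```python
def contains_token(text, target, delimiter=" "):
    """Return True if target is one of the delimiter-separated tokens of text."""
    # contains_token is a hypothetical helper, not part of any library.
    return target in text.split(delimiter)


sample_string = "The quick brown fox jumps over the lazy dog."

# Tokenize on a literal space, as the VBA Split call does.
tokens = sample_string.split(" ")

# Search for a value among the tokens.
found_fox = contains_token(sample_string, "fox")

# Loop over the tokens by position.
for index, token in enumerate(tokens):
    print(index, token)
```

Because only the space is treated as a delimiter, the final token is "dog." with the period still attached; a search for "dog" alone would not match.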

2964 questions

60 votes, 4 answers

Tokenizing Error: java.util.regex.PatternSyntaxException, dangling metacharacter '*'

I am using split() to tokenize a String separated with * following this format: name*lastName*ID*school*age % name*lastName*ID*school*age % name*lastName*ID*school*age I'm reading this from a file named "entrada.al" using this code: static void…
andandandand • 21,946 • 60 • 170 • 271

60 votes, 12 answers

Is there a function to split a string in Oracle PL/SQL?

I need to write a procedure to normalize a record that has multiple tokens concatenated by one char. I need to obtain these tokens by splitting the string and insert each one as a new record in a table. Does Oracle have something like a "split"…
Sam • 6,437 • 6 • 33 • 41

57 votes, 9 answers

How do I read input character-by-character in Java?

I am used to the C-style getchar(), but it seems like there is nothing comparable for Java. I am building a lexical analyzer, and I need to read in the input character by character. I know I can use the scanner to scan in a token or line and parse…
Jamison Dance • 19,896 • 25 • 97 • 99

51 votes, 17 answers

Convert comma separated string to array in PL/SQL

How do I convert a comma-separated string to an array? I have the input '1,2,3', and I need to convert it into an array.
Suvonkar • 2,440 • 12 • 34 • 44

46 votes, 5 answers

How does a parser (for example, HTML) work?

For argument's sake, let's assume an HTML parser. I've read that it tokenizes everything first, and then parses it. What does tokenize mean? Does the parser read each character, building up a multidimensional array to store the structure? For…
alex • 479,566 • 201 • 878 • 984

40 votes, 3 answers

Retrieve analyzed tokens from ElasticSearch documents

Trying to access the analyzed/tokenized text in my ElasticSearch documents. I know you can use the Analyze API to analyze arbitrary text according to your analysis modules. So I could copy and paste data from my documents into the Analyze API to see…
Clay Wardell • 14,846 • 13 • 44 • 65

39 votes, 4 answers

How to use a Lucene Analyzer to tokenize a String?

Is there a simple way I could use any subclass of Lucene's Analyzer to parse/tokenize a String? Something like: String to_be_parsed = "car window seven"; Analyzer analyzer = new StandardAnalyzer(...); List tokenized_string =…
Felipe Hummel • 4,674 • 5 • 32 • 35

38 votes, 5 answers

ValueError: TextEncodeInput must be Union[TextInputSequence, Tuple[InputSequence, InputSequence]] - Tokenizing BERT / Distilbert Error

def split_data(path):
    df = pd.read_csv(path)
    return train_test_split(df, test_size=0.1, random_state=100)

train, test = split_data(DATA_DIR)
train_texts, train_labels = train['text'].to_list(), train['sentiment'].to_list()
test_texts,…

37 votes, 3 answers

Is it a Lexer's Job to Parse Numbers and Strings?

Is it a lexer's job to parse numbers and strings? This may or may not sound dumb, given the fact that I'm asking whether a lexer should parse input. However, I'm not sure whether that's in fact the lexer's job or the parser's job, because in order…
user541686 • 205,094 • 128 • 528 • 886

36 votes, 6 answers

What is more efficient a switch case or an std::map

I'm thinking about the tokenizer here. Each token calls a different function inside the parser. What is more efficient: a map of std::functions/boost::functions, or a switch case?
the_drow • 18,571 • 25 • 126 • 193

36 votes, 1 answer

How do you extract only the date from a python datetime?

I have a dataframe in Python. One of its columns is labelled time, which is a timestamp. Using the following code, I have converted the timestamp to datetime: milestone['datetime'] = milestone.apply(lambda x:…
SZA • 463 • 1 • 4 • 5

32 votes, 1 answer

PHP namespace removal / mapping and rewriting identifiers

I'm attempting to automate the removal of namespaces from a PHP class collection to make them PHP 5.2 compatible. (Shared hosting providers do not fancy rogue PHP 5.3 installations. No idea why. Also the code in question doesn't use any 5.3 feature…
mario • 144,265 • 20 • 237 • 291

32 votes, 2 answers

Tokenizer vs token filters

I'm trying to implement autocomplete using Elasticsearch thinking that I understand how to do it... I'm trying to build multi-word (phrase) suggestions by using ES's edge_n_grams while indexing crawled data. What is the difference between a…
user3125823 • 1,846 • 2 • 18 • 46

31 votes, 6 answers

Securing my API to only work with my front-end

I'm building a Node/Express backend. I want to create an API that only works with my reactjs frontend (a private API). Imagine this is an e-commerce website: my users will browse products and will then choose what to buy, and at the time of order…
Amin F • 331 • 1 • 3 • 4

31 votes, 6 answers

how to get data between quotes in java?

I have these lines of text; the number of quotes could change, like: Here just one "comillas" But I also could have more "mas" values in "comillas" and that "is" the "trick" I was thinking of a method that returns "a" list of "words" that "are"…
atomsfat • 2,863 • 6 • 34 • 36