Questions tagged [tokenize]

Tokenizing is the act of splitting a stream of text into discrete elements called tokens, using a delimiter present in the stream. These tokens can then be processed further, for example to search for a value or to assign them to an array for looping.

Example (VBA):

Dim tokens As Variant
Dim sampleString As String
Dim i As Long

sampleString = "The quick brown fox jumps over the lazy dog."

' tokenize string based on space delimiter
tokens = Split(sampleString, " ")

' list tokens
For i = LBound(tokens) To UBound(tokens)
  MsgBox tokens(i)
Next i

2964 questions
30
votes
10 answers

Python - RegEx for splitting text into sentences (sentence-tokenizing)

I want to make a list of sentences from a string and then print them out. I don't want to use NLTK to do this. So it needs to split on a period at the end of a sentence, and not at decimals, abbreviations, a title before a name, or if the sentence…
user3590149
  • 1,525
  • 7
  • 22
  • 25
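For the NLTK-free sentence splitting asked about above, a minimal regex sketch (assumptions: English text and sentences that start with a capital letter; the lookbehinds skip decimals and short title abbreviations like "Dr.", but a production splitter needs a fuller abbreviation list):

```python
import re

# Split on ., ! or ? followed by whitespace and a capital letter.
# (?<!\w\.\w.)      skips boundaries inside dotted sequences (e.g. decimals, i.e.)
# (?<![A-Z][a-z]\.) skips two-letter title abbreviations such as "Dr." or "Mr."
SENT_BOUNDARY = re.compile(
    r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=[.!?])\s+(?=[A-Z])'
)

def split_sentences(text):
    return SENT_BOUNDARY.split(text)

print(split_sentences("Dr. Smith paid 3.50 dollars. He left. Was it enough?"))
```

Because the pattern has no capture groups, `re.split` drops the matched whitespace and returns only the sentences.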
29
votes
4 answers

Keras Tokenizer num_words doesn't seem to work

>>> t = Tokenizer(num_words=3) >>> l = ["Hello, World! This is so&#$ fantastic!", "There is no other world like this one"] >>> t.fit_on_texts(l) >>> t.word_index {'fantastic': 6, 'like': 10, 'no': 8, 'this': 2, 'is': 3, 'there': 7, 'one': 11,…
max_max_mir
  • 1,494
  • 3
  • 20
  • 36
29
votes
1 answer

ElasticSearch Analyzer and Tokenizer for Emails

I could not find a perfect solution, either in Google or ES, for the following situation, and I hope someone can help. Suppose there are five email addresses stored under the field "email": 1. {"email": "john.doe@gmail.com"} 2. {"email":…
LYu
  • 2,316
  • 4
  • 21
  • 38
29
votes
11 answers

How to split a file into words on the Unix command line?

I'm doing quick tests for a naive boolean information retrieval system, and I would like to use awk, grep, egrep, sed or something similar, with pipes, to split a text file into words and save them to another file with one word per line. Example my file…
jaundavid
  • 385
  • 1
  • 5
  • 16
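The classic shell answer here is a `tr` one-liner along the lines of `tr -cs '[:alpha:]' '\n' < input > output` (GNU tr). The same tokenizing step sketched in Python, assuming plain ASCII words:

```python
import re

def words_per_line(text):
    # Keep only alphabetic runs, one per line
    # (roughly equivalent to: tr -cs '[:alpha:]' '\n')
    return "\n".join(re.findall(r"[A-Za-z]+", text))

print(words_per_line("Hello, world! foo-bar"))
```

The `-c` (complement) and `-s` (squeeze) flags in the `tr` version collapse every run of non-letters into a single newline, which is exactly what the regex `[A-Za-z]+` plus the join achieves.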
27
votes
6 answers

How do you parse a filename in bash?

I have a filename in a format like: system-source-yyyymmdd.dat I'd like to be able to parse out the different bits of the filename using "-" as the delimiter.
Nick Pierpoint
  • 17,641
  • 9
  • 46
  • 74
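In bash this is usually done with parameter expansion or `IFS=- read`. As an illustration of the same split-on-"-" idea, a Python sketch (the field names are assumptions based on the filename pattern in the question):

```python
def parse_filename(name):
    # "system-source-yyyymmdd.dat" -> its parts, using "-" as the delimiter
    stem, _dot, ext = name.rpartition(".")
    system, source, datestamp = stem.split("-")
    return system, source, datestamp, ext

print(parse_filename("system-source-20240101.dat"))
```

Splitting the extension off first (via `rpartition`) keeps the "-" split from being confused by dots in the date or source fields.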
26
votes
8 answers

Tokenizing strings in C

I have been trying to tokenize a string using SPACE as the delimiter, but it doesn't work. Does anyone have a suggestion as to why it doesn't work? Edit: tokenizing using: strtok(string, " "); The code is like the following pch = strtok (str," "); while…
kombo
  • 279
  • 1
  • 3
  • 4
25
votes
4 answers

Tokenization of Arabic words using NLTK

I'm using NLTK word_tokenizer to split a sentence into words. I want to tokenize this sentence: في_بيتنا كل شي لما تحتاجه يضيع ...ادور على شاحن فجأة يختفي ..لدرجة اني اسوي نفسي ادور شيء The code I'm writing is: import re import nltk lex = u"…
Hady Elsahar
  • 2,121
  • 4
  • 29
  • 47
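The Arabic question above dates from the Python 2 era, where the trouble was likely byte-string versus unicode handling; in Python 3, plain `str.split` (and NLTK's tokenizers) handle Arabic text like any other Unicode text. A minimal sketch using the question's own sample:

```python
# str.split tokenizes on whitespace and is Unicode-aware in Python 3,
# so Arabic text needs no special treatment.
text = "في_بيتنا كل شي لما تحتاجه يضيع"
tokens = text.split()
print(len(tokens))
```
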
24
votes
8 answers

Splitting comma separated string in a PL/SQL stored proc

I have a CSV string, 100.01,200.02,300.03, which I need to pass to a PL/SQL stored procedure in Oracle. Inside the proc, I need to insert these values into a Number column in the table. For this, I found a working approach here: How to best split csv…
Jimmy
  • 2,106
  • 12
  • 39
  • 53
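On the Oracle side this is typically handled with REGEXP_SUBSTR in a loop or a pipelined table function (stated as a general idiom, not the asker's exact solution). The underlying tokenize-and-convert step, sketched in Python for illustration:

```python
def csv_to_numbers(csv_string):
    # Tokenize on "," and convert each token to a number
    return [float(tok) for tok in csv_string.split(",")]

print(csv_to_numbers("100.01,200.02,300.03"))
```
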
24
votes
3 answers

How do I implement a custom UITextInputTokenizer?

I have a UITextView and am using its tokenizer to check which words the user taps on. My goal is to change what the tokenizer thinks of as a word. Currently it seems to define words as consecutive alphanumeric characters; I want a word to be defined…
Joshua Burr
  • 241
  • 1
  • 6
24
votes
5 answers

Is there a better (more modern) tool than lex/flex for generating a tokenizer for C++?

I recently added source file parsing to an existing tool that generated output files from complex command line arguments. The command line arguments got to be so complex that we started allowing them to be supplied as a file that was parsed as if it…
John Knoeller
  • 33,512
  • 4
  • 61
  • 92
24
votes
9 answers

Split string by a character?

How can I split a string such as "102:330:3133:76531:451:000:12:44412" by the ":" character, and put all of the numbers into an int array (the number sequence will always be 8 elements long)? Preferably without using an external library such as…
user2705775
  • 461
  • 1
  • 7
  • 14
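The question is language-agnostic, but the pattern is the same everywhere: split on ":" and convert each token to an integer. A Python sketch including the fixed-length check the question implies:

```python
def split_to_ints(s, expected_len=8):
    # Tokenize on ":" and convert each token to int
    parts = [int(tok) for tok in s.split(":")]
    if len(parts) != expected_len:
        raise ValueError(f"expected {expected_len} fields, got {len(parts)}")
    return parts

print(split_to_ints("102:330:3133:76531:451:000:12:44412"))
```

Note that `int("000")` parses leading zeros without complaint, so the "000" field simply becomes 0.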
21
votes
3 answers

NLTK tokenize - faster way?

I have a method that takes in a String parameter, and uses NLTK to break the String down to sentences, then into words. Afterwards, it converts each word into lowercase, and finally creates a dictionary of the frequency of each word. import…
user3280193
  • 450
  • 1
  • 6
  • 13
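When full NLTK tokenization is the bottleneck and careful punctuation handling isn't essential, plain whitespace splitting plus `collections.Counter` is usually much faster. A sketch of that shortcut (a simplification, not a drop-in replacement for `word_tokenize`):

```python
from collections import Counter

def word_frequencies(text):
    # Lowercase, split on whitespace, count each word
    return Counter(text.lower().split())

print(word_frequencies("The dog saw the cat"))
```
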
21
votes
4 answers

How to parse / tokenize an SQL statement in Node.js

I'm looking for a way to parse / tokenize an SQL statement within a Node.js application, in order to: Tokenize all the "basic" SQL keywords defined in the ISO/IEC 9075 standard or here. Validate the SQL syntax. Find out what the query is going to do…
Yves M.
  • 29,855
  • 23
  • 108
  • 144
20
votes
3 answers

Tokenizing unicode using nltk

I have text files that use utf-8 encoding and contain characters like 'ö', 'ü', etc. I would like to parse the text from these files, but I can't get the tokenizer to work properly. If I use the standard nltk tokenizer: f = open('C:\Python26\text.txt',…
root
  • 76,608
  • 25
  • 108
  • 120
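In Python 3 the usual fix is to open the file with an explicit encoding (`open(path, encoding='utf-8')`) and rely on the fact that `\w` is Unicode-aware by default. A sketch, using `io.StringIO` in place of a real file:

```python
import io
import re

def tokenize_text(stream):
    # \w+ is Unicode-aware in Python 3, so 'ö' and 'ü' stay inside tokens
    return re.findall(r"\w+", stream.read())

# io.StringIO stands in for open('text.txt', encoding='utf-8')
sample = io.StringIO("Über schöne Grüße")
print(tokenize_text(sample))
```
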
20
votes
2 answers

Tokens to Words mapping in the tokenizer decode step huggingface?

Is there a way to know the mapping from the tokens back to the original words in the tokenizer.decode() function? For example: from transformers.tokenization_roberta import RobertaTokenizer tokenizer =…
DsCpp
  • 2,259
  • 3
  • 18
  • 46