Questions tagged [tokenize]

Tokenizing is the act of splitting a stream of text into discrete elements called tokens, using a delimiter present in the stream. These tokens can then be processed further, for example to search for a value or to assign them to an array for looping.

Example (VBA):

Dim tokens As Variant
Dim sampleString As String
Dim i As Long

sampleString = "The quick brown fox jumps over the lazy dog."

' tokenize string based on space delimiter
tokens = Split(sampleString, " ")

' list tokens
For i = LBound(tokens) To UBound(tokens)
  MsgBox tokens(i)
Next i

2964 questions
30
votes
10 answers

Python - RegEx for splitting text into sentences (sentence-tokenizing)

I want to make a list of sentences from a string and then print them out. I don't want to use NLTK to do this. So it needs to split on a period at the end of a sentence, and not at decimals, abbreviations, a title before a name, or if the sentence…
user3590149
  • 1,525
  • 7
  • 22
  • 25
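For the NLTK-free sentence splitting asked about above, a minimal regex sketch (assumptions: English text and sentences that start with a capital letter; the lookbehinds skip decimals and short title abbreviations like "Dr.", but a production splitter needs a fuller abbreviation list):

```python
import re

# Split on ., ! or ? followed by whitespace and a capital letter.
# (?<!\w\.\w.)      skips boundaries inside dotted sequences (e.g. decimals, i.e.)
# (?<![A-Z][a-z]\.) skips two-letter title abbreviations such as "Dr." or "Mr."
SENT_BOUNDARY = re.compile(
    r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=[.!?])\s+(?=[A-Z])'
)

def split_sentences(text):
    return SENT_BOUNDARY.split(text)

print(split_sentences("Dr. Smith paid 3.50 dollars. He left. Was it enough?"))
```

Because the pattern has no capture groups, `re.split` drops the matched whitespace and returns only the sentences.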
29
votes
4 answers

Keras Tokenizer num_words doesn't seem to work

>>> t = Tokenizer(num_words=3) >>> l = ["Hello, World! This is so&#$ fantastic!", "There is no other world like this one"] >>> t.fit_on_texts(l) >>> t.word_index {'fantastic': 6, 'like': 10, 'no': 8, 'this': 2, 'is': 3, 'there': 7, 'one': 11,…
max_max_mir
  • 1,494
  • 3
  • 20
  • 36
29
votes
1 answer

ElasticSearch Analyzer and Tokenizer for Emails

I could not find a perfect solution, either in Google or ES, for the following situation, and I hope someone can help. Suppose there are five email addresses stored under the field "email": 1. {"email": "john.doe@gmail.com"} 2. {"email":…
LYu
  • 2,316
  • 4
  • 21
  • 38
29
votes
11 answers

How to split a file into words on the Unix command line?

I'm doing quick tests for a naive boolean information retrieval system, and I would like to use awk, grep, egrep, sed or something similar, with pipes, to split a text file into words and save them to another file with one word per line. Example my file…
jaundavid
  • 385
  • 1
  • 5
  • 16
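The classic shell answer here is a `tr` one-liner along the lines of `tr -cs '[:alpha:]' '\n' < input > output` (GNU tr). The same tokenizing step sketched in Python, assuming plain ASCII words:

```python
import re

def words_per_line(text):
    # Keep only alphabetic runs, one per line
    # (roughly equivalent to: tr -cs '[:alpha:]' '\n')
    return "\n".join(re.findall(r"[A-Za-z]+", text))

print(words_per_line("Hello, world! foo-bar"))
```

The `-c` (complement) and `-s` (squeeze) flags in the `tr` version collapse every run of non-letters into a single newline, which is exactly what the regex `[A-Za-z]+` plus the join achieves.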
27
votes
6 answers

How do you parse a filename in bash?

I have a filename in a format like: system-source-yyyymmdd.dat I'd like to be able to parse out the different bits of the filename using "-" as the delimiter.
Nick Pierpoint
  • 17,641
  • 9
  • 46
  • 74
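In bash this is usually done with parameter expansion or `IFS=- read`. As an illustration of the same split-on-"-" idea, a Python sketch (the field names are assumptions based on the filename pattern in the question):

```python
def parse_filename(name):
    # "system-source-yyyymmdd.dat" -> its parts, using "-" as the delimiter
    stem, _dot, ext = name.rpartition(".")
    system, source, datestamp = stem.split("-")
    return system, source, datestamp, ext

print(parse_filename("system-source-20240101.dat"))
```

Splitting the extension off first (via `rpartition`) keeps the "-" split from being confused by dots in the date or source fields.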
26
votes
8 answers

Tokenizing strings in C

I have been trying to tokenize a string using SPACE as the delimiter, but it doesn't work. Does anyone have a suggestion as to why it doesn't work? Edit: tokenizing using: strtok(string, " "); The code is like the following pch = strtok (str," "); while…
kombo
  • 279
  • 1
  • 3
  • 4
25
votes
4 answers

Tokenization of Arabic words using NLTK

I'm using NLTK word_tokenizer to split a sentence into words. I want to tokenize this sentence: في_بيتنا كل شي لما تحتاجه يضيع ...ادور على شاحن فجأة يختفي ..لدرجة اني اسوي نفسي ادور شيء The code I'm writing is: import re import nltk lex = u"…
Hady Elsahar
  • 2,121
  • 4
  • 29
  • 47
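The Arabic question above dates from the Python 2 era, where the trouble was likely byte-string versus unicode handling; in Python 3, plain `str.split` (and NLTK's tokenizers) handle Arabic text like any other Unicode text. A minimal sketch using the question's own sample:

```python
# str.split tokenizes on whitespace and is Unicode-aware in Python 3,
# so Arabic text needs no special treatment.
text = "في_بيتنا كل شي لما تحتاجه يضيع"
tokens = text.split()
print(len(tokens))
```
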
24
votes
8 answers

Splitting comma separated string in a PL/SQL stored proc

I have a CSV string, 100.01,200.02,300.03, which I need to pass to a PL/SQL stored procedure in Oracle. Inside the proc, I need to insert these values into a Number column in the table. For this, I found a working approach here: How to best split csv…
Jimmy
  • 2,106
  • 12
  • 39
  • 53
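On the Oracle side this is typically handled with REGEXP_SUBSTR in a loop or a pipelined table function (stated as a general idiom, not the asker's exact solution). The underlying tokenize-and-convert step, sketched in Python for illustration:

```python
def csv_to_numbers(csv_string):
    # Tokenize on "," and convert each token to a number
    return [float(tok) for tok in csv_string.split(",")]

print(csv_to_numbers("100.01,200.02,300.03"))
```
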
24
votes
3 answers

How do I implement a custom UITextInputTokenizer?

I have a UITextView and am using its tokenizer to check which words the user taps on. My goal is to change what the tokenizer thinks of as a word. Currently it seems to define words as consecutive alphanumeric characters; I want a word to be defined…
Joshua Burr
  • 241
  • 1
  • 6
24
votes
5 answers

Is there a better (more modern) tool than lex/flex for generating a tokenizer for C++?

I recently added source file parsing to an existing tool that generated output files from complex command line arguments. The command line arguments got to be so complex that we started allowing them to be supplied as a file that was parsed as if it…
John Knoeller
  • 33,512
  • 4
  • 61
  • 92
24
votes
9 answers

Split string by a character?

How can I split a string such as "102:330:3133:76531:451:000:12:44412" by the ":" character, and put all of the numbers into an int array (the number sequence will always be 8 elements long)? Preferably without using an external library such as…
user2705775
  • 461
  • 1
  • 7
  • 14
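The question is language-agnostic, but the pattern is the same everywhere: split on ":" and convert each token to an integer. A Python sketch including the fixed-length check the question implies:

```python
def split_to_ints(s, expected_len=8):
    # Tokenize on ":" and convert each token to int
    parts = [int(tok) for tok in s.split(":")]
    if len(parts) != expected_len:
        raise ValueError(f"expected {expected_len} fields, got {len(parts)}")
    return parts

print(split_to_ints("102:330:3133:76531:451:000:12:44412"))
```

Note that `int("000")` parses leading zeros without complaint, so the "000" field simply becomes 0.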
21
votes
3 answers

NLTK tokenize - faster way?

I have a method that takes in a String parameter, and uses NLTK to break the String down to sentences, then into words. Afterwards, it converts each word into lowercase, and finally creates a dictionary of the frequency of each word. import…
user3280193
  • 450
  • 1
  • 6
  • 13
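When full NLTK tokenization is the bottleneck and careful punctuation handling isn't essential, plain whitespace splitting plus `collections.Counter` is usually much faster. A sketch of that shortcut (a simplification, not a drop-in replacement for `word_tokenize`):

```python
from collections import Counter

def word_frequencies(text):
    # Lowercase, split on whitespace, count each word
    return Counter(text.lower().split())

print(word_frequencies("The dog saw the cat"))
```
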
21
votes
4 answers

How to parse / tokenize an SQL statement in Node.js

I'm looking for a way to parse / tokenize an SQL statement within a Node.js application, in order to: Tokenize all the "basic" SQL keywords defined in the ISO/IEC 9075 standard or here. Validate the SQL syntax. Find out what the query is going to do…
Yves M.
  • 29,855
  • 23
  • 108
  • 144
20
votes
3 answers

Tokenizing unicode using nltk

I have text files that use utf-8 encoding and contain characters like 'ö', 'ü', etc. I would like to parse the text from these files, but I can't get the tokenizer to work properly. If I use the standard nltk tokenizer: f = open('C:\Python26\text.txt',…
root
  • 76,608
  • 25
  • 108
  • 120
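In Python 3 the usual fix is to open the file with an explicit encoding (`open(path, encoding='utf-8')`) and rely on the fact that `\w` is Unicode-aware by default. A sketch, using `io.StringIO` in place of a real file:

```python
import io
import re

def tokenize_text(stream):
    # \w+ is Unicode-aware in Python 3, so 'ö' and 'ü' stay inside tokens
    return re.findall(r"\w+", stream.read())

# io.StringIO stands in for open('text.txt', encoding='utf-8')
sample = io.StringIO("Über schöne Grüße")
print(tokenize_text(sample))
```
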
20
votes
2 answers

Tokens to Words mapping in the tokenizer decode step huggingface?

Is there a way to know the mapping from the tokens back to the original words in the tokenizer.decode() function? For example: from transformers.tokenization_roberta import RobertaTokenizer tokenizer =…
DsCpp
  • 2,259
  • 3
  • 18
  • 46