Questions tagged [tokenize]

Tokenizing is the act of splitting a stream of text into discrete elements called tokens, using a delimiter present in the stream. These tokens can then be processed further, for example to search for a value or to assign them to an array for looping.

Example (VBA):

Dim tokens As Variant
Dim sampleString As String
Dim i As Long

sampleString = "The quick brown fox jumps over the lazy dog."

' tokenize string based on space delimiter
tokens = Split(sampleString, " ")

' list tokens
For i = LBound(tokens) To UBound(tokens)
  MsgBox tokens(i)
Next i
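
For comparison, the same tokenization in Python, where the built-in str.split does the work (a minimal sketch of the VBA example above):

# Tokenize a string on the space delimiter, then list the tokens.
sample_string = "The quick brown fox jumps over the lazy dog."
tokens = sample_string.split(" ")

for token in tokens:
    print(token)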

2964 questions
5 votes, 1 answer

Solr Tokenizer Injection

As an example I have a text field that might contain the following string: "d7199^^==^^81^^==^^A sentence or two!!" I want to tokenize this data but have each token contain the first part of the string. So, I'd like the tokens to look like this for…
Jason Palmer
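
Not a Solr TokenizerFactory, but a minimal Python sketch of the transformation the question describes, assuming "^^==^^" is the delimiter and that each word token should carry the leading identifier (the exact output format is cut off in the excerpt, so the prefix style here is a guess):

raw = "d7199^^==^^81^^==^^A sentence or two!!"

# Split the stored value into its three delimited parts.
doc_id, num, text = raw.split("^^==^^")

# Emit one token per word, each prefixed with the leading identifier.
tokens = [doc_id + "_" + word for word in text.split()]
print(tokens)  # ['d7199_A', 'd7199_sentence', 'd7199_or', 'd7199_two!!']
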
5 votes, 3 answers

TorchText Vocab TypeError: Vocab.__init__() got an unexpected keyword argument 'min_freq'

I am working on a CNN Sentiment analysis machine learning model which uses the IMDb dataset provided by the Torchtext library. On one of my lines of code vocab = Vocab(counter, min_freq = 1, specials=('<unk>', '<pad>', '<bos>', '<eos>')) I…
James B
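
This error matches the torchtext 0.12+ API change: torchtext.vocab.Vocab is no longer built directly from a Counter with keyword arguments. A hedged sketch of the replacement pattern, using the vocab() factory function (the corpus and special token here are illustrative):

from collections import Counter, OrderedDict
from torchtext.vocab import vocab  # factory function, torchtext >= 0.12

counter = Counter("the quick brown fox jumps over the lazy dog the".split())

# Build the Vocab via the factory instead of Vocab(counter, min_freq=...).
v = vocab(OrderedDict(counter.most_common()), min_freq=1)

# Specials are now inserted explicitly rather than passed as a keyword.
v.insert_token('<unk>', 0)
v.set_default_index(v['<unk>'])
print(v['fox'], v['not-in-vocab'])
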
5 votes, 1 answer

Tokenize .htaccess files

Bet you didn't see this coming? ;) So, a project of mine requires that I specifically read and make sense out of .htaccess files. Sadly, searching on Google only yields the infinite woes of people trying to get their own .htaccess to work (sorry,…
Christian
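
There is no stock Python parser for this, but a line-oriented sketch gets surprisingly far: treat each non-comment line as a directive name followed by shlex-split arguments (container sections like <IfModule> would need extra handling; the sample content is invented):

import shlex

sample = """
# comment
RewriteEngine On
RewriteRule ^old$ /new [R=301,L]
ErrorDocument 404 "/errors/not found.html"
"""

for line in sample.splitlines():
    line = line.strip()
    if not line or line.startswith("#"):
        continue  # skip blank lines and comments
    directive, *args = shlex.split(line)
    print(directive, args)
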
5 votes, 2 answers

How to untokenize BERT tokens?

I have a sentence and I need to return the text corresponding to N BERT tokens to the left and right of a specific word. from transformers import BertTokenizer tz = BertTokenizer.from_pretrained("bert-base-cased") sentence = "The Natural Science…
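
For reference, the transformers tokenizer can map tokens back to text itself; a minimal sketch continuing the excerpt's setup (the sentence is truncated above, so a stand-in is used; selecting N tokens around a word is then just a list slice before converting back):

from transformers import BertTokenizer

tz = BertTokenizer.from_pretrained("bert-base-cased")
tokens = tz.tokenize("The Natural Science collections are large")  # stand-in

# convert_tokens_to_string merges WordPiece pieces ("##...") back into words.
print(tokens)
print(tz.convert_tokens_to_string(tokens[:4]))
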
5 votes, 0 answers

How to slice string depending on length of tokens

When I use (with a long test_text and short question): from transformers import BertTokenizer import torch from transformers import BertForQuestionAnswering tokenizer =…
user12975267
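
The code is cut off above, but the usual approach to this problem is to measure and cut at the token level rather than by characters, or to let the tokenizer truncate the pair to the model's limit; a hedged sketch (model name and texts are placeholders):

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
question = "What is this passage about?"
test_text = "a very long passage " * 300

# Truncate only the passage so question + passage fit in 512 tokens.
enc = tokenizer(question, test_text, truncation="only_second", max_length=512)
print(len(enc["input_ids"]))  # <= 512
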
5 votes, 1 answer

PHP, Tokenizer, find all the arguments of the function

Help me find all the arguments of the function "funcname" using token_get_all() on the source code. It sounds simple, but there are many special cases, such as arrays as parameters or static method calls as parameters. Maybe there's a…
Anton
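
The PHP answer hinges on token_get_all() plus bracket counting, but the shape of the solution is easier to see against a real syntax tree; a rough Python analogue (not the PHP code) that collects the argument nodes of every call to a given name:

import ast

source = "funcname(1, [2, 3], Bar.baz(x))"

tree = ast.parse(source)
for node in ast.walk(tree):
    # Match calls whose callee is the bare name "funcname".
    if isinstance(node, ast.Call) and isinstance(node.func, ast.Name) \
            and node.func.id == "funcname":
        # node.args holds one AST node per argument, however nested.
        print([ast.dump(arg) for arg in node.args])
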
5 votes, 1 answer

Catching errors thrown by token_get_all (Tokenizer)

PHP's token_get_all() function (which converts PHP source code into tokens) can throw two errors: one if an unterminated multiline comment is encountered, the other if an unexpected character is found. I would like to catch those errors and throw…
NikiC
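
For comparison, Python's standard-library tokenizer signals the equivalent situations with a catchable exception, which is the behaviour the question wants from PHP; a sketch:

import io
import tokenize

broken = "foo = (1, 2"  # EOF while a bracket is still open

try:
    for tok in tokenize.generate_tokens(io.StringIO(broken).readline):
        print(tok.type, tok.string)
except tokenize.TokenError as err:
    # Raised for EOF inside an open bracket or unterminated triple quote.
    print("tokenizer error:", err)
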
5 votes, 1 answer

What does merge.txt file mean in BERT-based models in HuggingFace library?

I am trying to understand what the merge.txt file means in tokenizers for the RoBERTa model in the HuggingFace library. However, nothing is said about it on their website. Any help is appreciated.
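
In short, RoBERTa's tokenizer is a byte-level BPE: vocab.json maps tokens to ids, while merges.txt lists the learned BPE merge rules, one per line, in the order they were learned (earlier lines are applied first). A small sketch showing the two files in action (the example word is arbitrary):

from transformers import RobertaTokenizer

# from_pretrained fetches vocab.json and merges.txt for the model.
tok = RobertaTokenizer.from_pretrained("roberta-base")

# The merge rules decide how the word is assembled from subword pieces.
print(tok.tokenize("tokenization"))
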
5 votes, 2 answers

How to tokenize (words) classifying punctuation as space

Based on this question, which was closed rather quickly: Trying to create a program to read a user's input then break the array into separate words; are my pointers all valid? Rather than closing, I think some extra work could have gone into helping the…
Martin York
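
The C++ discussion aside, the approach itself fits in a few lines; a Python sketch that maps every punctuation character to a space and then splits on whitespace:

import string

text = "Hello, world! It's a test..."

# Translate each punctuation character to a space, then split.
table = str.maketrans(string.punctuation, " " * len(string.punctuation))
print(text.translate(table).split())
# ['Hello', 'world', 'It', 's', 'a', 'test']
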
5 votes, 2 answers

BERT training with character embeddings

Does it make sense to change the tokenization paradigm in the BERT model, to something else? Maybe just a simple word tokenization or character level tokenization?
user3741951
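
Whether retraining BERT this way pays off is an empirical question, but the two alternatives named above are simple to state concretely (sketch):

sentence = "tokenization matters"

word_tokens = sentence.split()  # simple word-level tokenization
char_tokens = list(sentence)    # character-level tokenization
print(word_tokens)
print(char_tokens[:6])
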
5 votes, 1 answer

Is my usage of fgets() and strtok() incorrect for parsing a multi-line input?

I'm writing an implementation of the Moore Voting algorithm for finding the majority element (i.e. the element which occurs more than size/2 times) in an array. The code should return the majority element if it exists or else it should return -1.…
user10648668
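
Judging the fgets()/strtok() usage needs the actual C code, but the algorithm being implemented is short enough to state as a reference; a Python sketch of Boyer-Moore majority vote with the question's return-minus-one convention:

def majority_element(arr):
    # Phase 1: find a candidate in one pass.
    candidate, count = None, 0
    for x in arr:
        if count == 0:
            candidate = x
        count += 1 if x == candidate else -1
    # Phase 2: confirm it occurs more than len(arr)/2 times.
    return candidate if arr.count(candidate) > len(arr) // 2 else -1

print(majority_element([2, 2, 1, 2, 3]))  # 2
print(majority_element([1, 2, 3]))        # -1
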
5 votes, 1 answer

StandardTokenizer behaviour

Given that the code is running under Lucene 3.0.1: import java.io.*; import org.apache.lucene.analysis.*; import org.apache.lucene.util.Version; public class MyAnalyzer extends Analyzer { public TokenStream tokenStream(String fieldName, Reader…
mindas
5 votes, 2 answers

How to avoid tokenize words with underscore?

I am trying to tokenize my texts by using "nltk.word_tokenize()" function, but it would split words connected by "_". For example, the text "A,_B_C! is a movie!" would be split into: ['a', ',', '_b_c', '!', 'is','a','movie','!'] The result I want…
Sirui Li
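
The desired output is cut off in the excerpt, but assuming the underscore-joined chunk should survive as one token, the bluntest fix is to split on whitespace only instead of using word_tokenize; an nltk sketch:

from nltk.tokenize import WhitespaceTokenizer

text = "A,_B_C! is a movie!".lower()

# Split on whitespace only, so the underscore-joined chunk stays intact.
print(WhitespaceTokenizer().tokenize(text))
# ['a,_b_c!', 'is', 'a', 'movie!']
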
5 votes, 2 answers

How to apply tokenization to a TensorFlow Dataset?

I am working with the cnn_dailymail dataset which is part of the TensorFlow Datasets. My goal is to tokenize the dataset after applying some text preprocessing steps to it. I access and preprocess the dataset as follows: !pip install…
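
The preprocessing code is truncated above, but one common pattern is to adapt a TextVectorization layer and map it over the dataset; a sketch with a stand-in corpus (with TFDS this would be the cnn_dailymail 'article' field):

import tensorflow as tf

texts = tf.data.Dataset.from_tensor_slices(
    ["a first example article", "a second example article"])

vectorize = tf.keras.layers.TextVectorization(max_tokens=1000)
vectorize.adapt(texts)            # build the vocabulary from the corpus

tokenized = texts.map(vectorize)  # tokenize every element of the dataset
for ids in tokenized.take(1):
    print(ids)
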
5 votes, 1 answer

How to treat a phrase containing stopwords as a single token with Python nltk.tokenize

A string can be tokenized by removing some unnecessary stopwords using nltk.tokenize. But how can I tokenize a phrase containing stopwords as a single token, while removing other stopwords? For example: Input: Trump is the President of the United…
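
nltk ships a tokenizer for exactly this case: MWETokenizer re-merges listed multi-word expressions into single tokens after ordinary word tokenization (sketch; filtering the remaining stopwords would be the usual stopwords-corpus pass afterwards):

from nltk.tokenize import MWETokenizer, word_tokenize  # needs nltk punkt data

text = "Trump is the President of the United States"

# Re-merge the listed phrase into one token after word tokenization.
mwe = MWETokenizer([("United", "States")], separator=" ")
print(mwe.tokenize(word_tokenize(text)))
# ['Trump', 'is', 'the', 'President', 'of', 'the', 'United States']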