Questions tagged [tokenize]

Tokenizing is the act of splitting a stream of text into discrete elements called tokens, using a delimiter present in the stream. These tokens can then be processed further, for example to search for a value or to assign them to an array for looping.

Example (VBA):

Dim tokens As Variant
Dim sampleString As String
Dim i As Long

sampleString = "The quick brown fox jumps over the lazy dog."

' tokenize string based on space delimiter
tokens = Split(sampleString, " ")

' list tokens
For i = LBound(tokens) To UBound(tokens)
  MsgBox tokens(i)
Next i
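
For comparison, the same tokenization in Python, where the built-in str.split does the work (a minimal sketch of the VBA example above):

# Tokenize a string on the space delimiter, then list the tokens.
sample_string = "The quick brown fox jumps over the lazy dog."
tokens = sample_string.split(" ")

for token in tokens:
    print(token)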

2964 questions
5 votes, 1 answer

Solr Tokenizer Injection

As an example I have a text field that might contain the following string: "d7199^^==^^81^^==^^A sentence or two!!" I want to tokenize this data but have each token contain the first part of the string. So, I'd like the tokens to look like this for…
Jason Palmer
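
Not a Solr TokenizerFactory, but a minimal Python sketch of the transformation the question describes, assuming "^^==^^" is the delimiter and that each word token should carry the leading identifier (the exact output format is cut off in the excerpt, so the prefix style here is a guess):

raw = "d7199^^==^^81^^==^^A sentence or two!!"

# Split the stored value into its three delimited parts.
doc_id, num, text = raw.split("^^==^^")

# Emit one token per word, each prefixed with the leading identifier.
tokens = [doc_id + "_" + word for word in text.split()]
print(tokens)  # ['d7199_A', 'd7199_sentence', 'd7199_or', 'd7199_two!!']
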
5 votes, 3 answers

TorchText Vocab TypeError: Vocab.__init__() got an unexpected keyword argument 'min_freq'

I am working on a CNN Sentiment analysis machine learning model which uses the IMDb dataset provided by the Torchtext library. On one of my lines of code vocab = Vocab(counter, min_freq = 1, specials=('<unk>', '<pad>', '<bos>', '<eos>')) I…
James B
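
This error matches the torchtext 0.12+ API change: torchtext.vocab.Vocab is no longer built directly from a Counter with keyword arguments. A hedged sketch of the replacement pattern, using the vocab() factory function (the corpus and special token here are illustrative):

from collections import Counter, OrderedDict
from torchtext.vocab import vocab  # factory function, torchtext >= 0.12

counter = Counter("the quick brown fox jumps over the lazy dog the".split())

# Build the Vocab via the factory instead of Vocab(counter, min_freq=...).
v = vocab(OrderedDict(counter.most_common()), min_freq=1)

# Specials are now inserted explicitly rather than passed as a keyword.
v.insert_token('<unk>', 0)
v.set_default_index(v['<unk>'])
print(v['fox'], v['not-in-vocab'])
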
5 votes, 1 answer

Tokenize .htaccess files

Bet you didn't see this coming? ;) So, a project of mine requires that I specifically read and make sense out of .htaccess files. Sadly, searching on Google only yields the infinite woes of people trying to get their own .htaccess to work (sorry,…
Christian
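
There is no stock Python parser for this, but a line-oriented sketch gets surprisingly far: treat each non-comment line as a directive name followed by shlex-split arguments (container sections like <IfModule> would need extra handling; the sample content is invented):

import shlex

sample = """
# comment
RewriteEngine On
RewriteRule ^old$ /new [R=301,L]
ErrorDocument 404 "/errors/not found.html"
"""

for line in sample.splitlines():
    line = line.strip()
    if not line or line.startswith("#"):
        continue  # skip blank lines and comments
    directive, *args = shlex.split(line)
    print(directive, args)
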
5 votes, 2 answers

How to untokenize BERT tokens?

I have a sentence and I need to return the text corresponding to N BERT tokens to the left and right of a specific word. from transformers import BertTokenizer tz = BertTokenizer.from_pretrained("bert-base-cased") sentence = "The Natural Science…
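
For reference, the transformers tokenizer can map tokens back to text itself; a minimal sketch continuing the excerpt's setup (the sentence is truncated above, so a stand-in is used; selecting N tokens around a word is then just a list slice before converting back):

from transformers import BertTokenizer

tz = BertTokenizer.from_pretrained("bert-base-cased")
tokens = tz.tokenize("The Natural Science collections are large")  # stand-in

# convert_tokens_to_string merges WordPiece pieces ("##...") back into words.
print(tokens)
print(tz.convert_tokens_to_string(tokens[:4]))
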
5 votes, 0 answers

How to slice string depending on length of tokens

When I use (with a long test_text and short question): from transformers import BertTokenizer import torch from transformers import BertForQuestionAnswering tokenizer =…
user12975267
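
The code is cut off above, but the usual approach to this problem is to measure and cut at the token level rather than by characters, or to let the tokenizer truncate the pair to the model's limit; a hedged sketch (model name and texts are placeholders):

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
question = "What is this passage about?"
test_text = "a very long passage " * 300

# Truncate only the passage so question + passage fit in 512 tokens.
enc = tokenizer(question, test_text, truncation="only_second", max_length=512)
print(len(enc["input_ids"]))  # <= 512
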
5 votes, 1 answer

PHP, Tokenizer, find all the arguments of the function

Help me find all the arguments of the function "funcname" using token_get_all() on the source code. It sounds simple, but there are many special cases, such as arrays as parameters or static method calls as parameters. Maybe there's a…
Anton
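
The PHP answer hinges on token_get_all() plus bracket counting, but the shape of the solution is easier to see against a real syntax tree; a rough Python analogue (not the PHP code) that collects the argument nodes of every call to a given name:

import ast

source = "funcname(1, [2, 3], Bar.baz(x))"

tree = ast.parse(source)
for node in ast.walk(tree):
    # Match calls whose callee is the bare name "funcname".
    if isinstance(node, ast.Call) and isinstance(node.func, ast.Name) \
            and node.func.id == "funcname":
        # node.args holds one AST node per argument, however nested.
        print([ast.dump(arg) for arg in node.args])
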
5 votes, 1 answer

Catching errors thrown by token_get_all (Tokenizer)

PHP's token_get_all() function (which converts PHP source code into tokens) can throw two errors: one if an unterminated multiline comment is encountered, the other if an unexpected character is found. I would like to catch those errors and throw…
NikiC
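
For comparison, Python's standard-library tokenizer signals the equivalent situations with a catchable exception, which is the behaviour the question wants from PHP; a sketch:

import io
import tokenize

broken = "foo = (1, 2"  # EOF while a bracket is still open

try:
    for tok in tokenize.generate_tokens(io.StringIO(broken).readline):
        print(tok.type, tok.string)
except tokenize.TokenError as err:
    # Raised for EOF inside an open bracket or unterminated triple quote.
    print("tokenizer error:", err)
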
5 votes, 1 answer

What does merge.txt file mean in BERT-based models in HuggingFace library?

I am trying to understand what the merge.txt file means in tokenizers for the RoBERTa model in the HuggingFace library. However, nothing is said about it on their website. Any help is appreciated.
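
In short, RoBERTa's tokenizer is a byte-level BPE: vocab.json maps tokens to ids, while merges.txt lists the learned BPE merge rules, one per line, in the order they were learned (earlier lines are applied first). A small sketch showing the two files in action (the example word is arbitrary):

from transformers import RobertaTokenizer

# from_pretrained fetches vocab.json and merges.txt for the model.
tok = RobertaTokenizer.from_pretrained("roberta-base")

# The merge rules decide how the word is assembled from subword pieces.
print(tok.tokenize("tokenization"))
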
5 votes, 2 answers

How to tokenize (words) classifying punctuation as space

Based on this question, which was closed rather quickly: Trying to create a program to read a user's input then break the array into separate words; are my pointers all valid? Rather than closing, I think some extra work could have gone into helping the…
Martin York
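
The C++ discussion aside, the approach itself fits in a few lines; a Python sketch that maps every punctuation character to a space and then splits on whitespace:

import string

text = "Hello, world! It's a test..."

# Translate each punctuation character to a space, then split.
table = str.maketrans(string.punctuation, " " * len(string.punctuation))
print(text.translate(table).split())
# ['Hello', 'world', 'It', 's', 'a', 'test']
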
5 votes, 2 answers

BERT training with character embeddings

Does it make sense to change the tokenization paradigm in the BERT model, to something else? Maybe just a simple word tokenization or character level tokenization?
user3741951
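
Whether retraining BERT this way pays off is an empirical question, but the two alternatives named above are simple to state concretely (sketch):

sentence = "tokenization matters"

word_tokens = sentence.split()  # simple word-level tokenization
char_tokens = list(sentence)    # character-level tokenization
print(word_tokens)
print(char_tokens[:6])
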
5 votes, 1 answer

Is my usage of fgets() and strtok() incorrect for parsing a multi-line input?

I'm writing an implementation of the Moore Voting algorithm for finding the majority element (i.e. the element which occurs more than size/2 times) in an array. The code should return the majority element if it exists or else it should return -1.…
user10648668
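
Judging the fgets()/strtok() usage needs the actual C code, but the algorithm being implemented is short enough to state as a reference; a Python sketch of Boyer-Moore majority vote with the question's return-minus-one convention:

def majority_element(arr):
    # Phase 1: find a candidate in one pass.
    candidate, count = None, 0
    for x in arr:
        if count == 0:
            candidate = x
        count += 1 if x == candidate else -1
    # Phase 2: confirm it occurs more than len(arr)/2 times.
    return candidate if arr.count(candidate) > len(arr) // 2 else -1

print(majority_element([2, 2, 1, 2, 3]))  # 2
print(majority_element([1, 2, 3]))        # -1
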
5 votes, 1 answer

StandardTokenizer behaviour

Given that the code is running under Lucene 3.0.1: import java.io.*; import org.apache.lucene.analysis.*; import org.apache.lucene.util.Version; public class MyAnalyzer extends Analyzer { public TokenStream tokenStream(String fieldName, Reader…
mindas
5 votes, 2 answers

How to avoid tokenize words with underscore?

I am trying to tokenize my texts by using "nltk.word_tokenize()" function, but it would split words connected by "_". For example, the text "A,_B_C! is a movie!" would be split into: ['a', ',', '_b_c', '!', 'is','a','movie','!'] The result I want…
Sirui Li
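
The desired output is cut off in the excerpt, but assuming the underscore-joined chunk should survive as one token, the bluntest fix is to split on whitespace only instead of using word_tokenize; an nltk sketch:

from nltk.tokenize import WhitespaceTokenizer

text = "A,_B_C! is a movie!".lower()

# Split on whitespace only, so the underscore-joined chunk stays intact.
print(WhitespaceTokenizer().tokenize(text))
# ['a,_b_c!', 'is', 'a', 'movie!']
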
5 votes, 2 answers

How to apply tokenization to a TensorFlow Dataset?

I am working with the cnn_dailymail dataset which is part of the TensorFlow Datasets. My goal is to tokenize the dataset after applying some text preprocessing steps to it. I access and preprocess the dataset as follows: !pip install…
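
The preprocessing code is truncated above, but one common pattern is to adapt a TextVectorization layer and map it over the dataset; a sketch with a stand-in corpus (with TFDS this would be the cnn_dailymail 'article' field):

import tensorflow as tf

texts = tf.data.Dataset.from_tensor_slices(
    ["a first example article", "a second example article"])

vectorize = tf.keras.layers.TextVectorization(max_tokens=1000)
vectorize.adapt(texts)            # build the vocabulary from the corpus

tokenized = texts.map(vectorize)  # tokenize every element of the dataset
for ids in tokenized.take(1):
    print(ids)
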
5 votes, 1 answer

How to treat a phrase containing stopwords as a single token with Python nltk.tokenize

A string can be tokenized by removing some unnecessary stopwords using nltk.tokenize. But how can I tokenize a phrase containing stopwords as a single token, while removing other stopwords? For example: Input: Trump is the President of the United…
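
nltk ships a tokenizer for exactly this case: MWETokenizer re-merges listed multi-word expressions into single tokens after ordinary word tokenization (sketch; filtering the remaining stopwords would be the usual stopwords-corpus pass afterwards):

from nltk.tokenize import MWETokenizer, word_tokenize  # needs nltk punkt data

text = "Trump is the President of the United States"

# Re-merge the listed phrase into one token after word tokenization.
mwe = MWETokenizer([("United", "States")], separator=" ")
print(mwe.tokenize(word_tokenize(text)))
# ['Trump', 'is', 'the', 'President', 'of', 'the', 'United States']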