Questions tagged [tokenize]

Tokenizing is the act of splitting a string into discrete elements called tokens.

More broadly, tokenizing splits a stream of text into discrete elements (tokens) using one or more delimiters present in the stream. These tokens can then be processed further, for example to search for a value or to assign them to an array for looping.

Example (VBA):

Dim tokens As Variant
Dim sampleString As String
Dim i As Long

sampleString = "The quick brown fox jumps over the lazy dog."

' tokenize string based on space delimiter
tokens = Split(sampleString, " ")

' list tokens
For i = LBound(tokens) To UBound(tokens)
  MsgBox tokens(i)
Next i


2964 questions
10 votes • 2 answers

The size of tensor a (707) must match the size of tensor b (512) at non-singleton dimension 1

I am trying to do text classification using a pretrained BERT model. I trained the model on my dataset, and in the testing phase I know that BERT can only take up to 512 tokens, so I wrote an if condition to check the length of the test sentence in my…
Mee • 1,413 • 5 • 24 • 40
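
A common fix is to let the tokenizer do the truncating instead of checking lengths by hand. A minimal sketch (Python), assuming the Hugging Face transformers package and the bert-base-uncased checkpoint:

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
long_text = "word " * 1000  # far longer than BERT's 512-token limit

# truncation=True caps the encoding at max_length, which avoids the
# 707-vs-512 tensor size mismatch at inference time
encoded = tokenizer(long_text, truncation=True, max_length=512, return_tensors="pt")
print(encoded["input_ids"].shape)  # torch.Size([1, 512])
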
10 votes • 5 answers

What JavaScript constructs does JsLex incorrectly lex?

JsLex is a JavaScript lexer I've written in Python. It does a good job for a day's work (or so), but I'm sure there are cases it gets wrong. In particular, it doesn't understand anything about semicolon insertion, and there are probably ways…
Ned Batchelder • 364,293 • 75 • 561 • 662
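
A classic trap for any JavaScript lexer is that / can begin a regex literal or be the division operator depending on context. A toy illustration (Python) of how a deliberately naive rule misfires:

import re

# naive rule: anything between two slashes is a regex literal
NAIVE_REGEX = re.compile(r"/[^/\n]+/")

samples = [
    "x = a / b / c;",       # both slashes are division, yet the naive
                            # rule matches "/ b /" as a regex literal
    "x = /ab+c/.test(s);",  # a genuine regex literal
]
for s in samples:
    print(s, "->", NAIVE_REGEX.findall(s))
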
10 votes • 3 answers

Python: Tokenizing with phrases

I have blocks of text I want to tokenize, but I don't want to tokenize on whitespace and punctuation, as seems to be the standard with tools like NLTK. There are particular phrases that I want to be tokenized as a single token, instead of the…
yavoh • 2,645 • 5 • 24 • 21
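
NLTK's MWETokenizer covers exactly this case: it retokenizes a word list so that listed multi-word expressions come out as single tokens. A minimal sketch, assuming NLTK is installed:

from nltk.tokenize import MWETokenizer

# the phrases that must survive as single tokens
tokenizer = MWETokenizer([("New", "York"), ("machine", "learning")], separator="_")

words = "I studied machine learning in New York".split()
print(tokenizer.tokenize(words))
# ['I', 'studied', 'machine_learning', 'in', 'New_York']
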
10 votes • 3 answers

How can I fix "Error tokenizing data" on pandas csv reader?

I'm trying to read a csv file with pandas. This file actually has only one row, but it causes an error whenever I try to read it. Something seems to be going wrong at line 8, but I can hardly find an 8th line since there's clearly only one row on…
user9191983 • 505 • 1 • 4 • 20
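
This error usually means some row has more fields than the parser expects, often from a stray delimiter or a quoting problem. A hedged sketch, assuming pandas 1.3+ and a hypothetical data.csv:

import pandas as pd

# the python engine is more tolerant, and on_bad_lines reports the
# offending rows instead of aborting (pandas >= 1.3; older versions
# used error_bad_lines/warn_bad_lines instead)
df = pd.read_csv("data.csv", engine="python", on_bad_lines="warn")
print(df.shape)
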
10 votes • 1 answer

Custom sentence segmentation using spaCy

I am new to spaCy and NLP. I'm facing the issue below while doing sentence segmentation with spaCy. The text I am trying to tokenize into sentences contains numbered lists (with a space between the numbering and the actual text), like below. import spacy nlp…
Satheesh K • 501 • 1 • 3 • 16
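
In spaCy v3 a custom component added before the parser can pre-set sentence boundaries. A sketch of the idea, assuming the en_core_web_sm model is installed and using a simplistic "number followed by a period starts a list item" heuristic:

import spacy
from spacy.language import Language

@Language.component("list_item_boundaries")
def list_item_boundaries(doc):
    for token in doc[:-1]:
        # a bare number followed by "." opens a new list item
        if token.text.isdigit() and doc[token.i + 1].text == ".":
            token.is_sent_start = True
    return doc

nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("list_item_boundaries", before="parser")

doc = nlp("Steps: 1. Install spaCy. 2. Load a model. 3. Segment the text.")
for sent in doc.sents:
    print(repr(sent.text))
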
10 votes • 2 answers

Explain bpe (Byte Pair Encoding) with examples?

Can somebody help explain the basic concept behind the BPE model? Apart from that paper, there are not many explanations of it yet. What I have learned so far is that it enables NMT models to translate with an open vocabulary by encoding rare and…
lifang • 1,485 • 3 • 16 • 23
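
The core of BPE is simple: start from characters, repeatedly count adjacent symbol pairs across the corpus, and merge the most frequent pair into a new symbol. A toy version of the algorithm from the Sennrich et al. paper, using its example vocabulary:

from collections import Counter

def pair_counts(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for pair in zip(symbols, symbols[1:]):
            pairs[pair] += freq
    return pairs

def merge(pair, vocab):
    """Fuse every occurrence of the pair into one new symbol."""
    old, new = " ".join(pair), "".join(pair)
    return {w.replace(old, new): f for w, f in vocab.items()}

# words pre-split into characters, with an end-of-word marker
vocab = {"l o w </w>": 5, "l o w e r </w>": 2,
         "n e w e s t </w>": 6, "w i d e s t </w>": 3}

for step in range(3):
    pairs = pair_counts(vocab)
    best = max(pairs, key=pairs.get)
    vocab = merge(best, vocab)
    print("merge", step + 1, "->", best)  # ('e','s'), then ('es','t'), ...
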
10 votes • 2 answers

Using Keras Tokenizer to generate n-grams

Is it possible to use n-grams in Keras? E.g., sentences are contained in the X_train dataframe's "sentences" column. I use the tokenizer from Keras in the following manner: tokenizer = Tokenizer(lower=True, split='…
Simplex • 1,723 • 2 • 17 • 26
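
Keras's Tokenizer only splits on a single separator, so one workaround is to append joined n-grams to each text before fitting (sklearn's CountVectorizer(ngram_range=...) is the more direct tool). A sketch with made-up texts:

from tensorflow.keras.preprocessing.text import Tokenizer

def with_bigrams(texts):
    # append "word_word" bigram pseudo-tokens so the Tokenizer
    # indexes them alongside the unigrams
    out = []
    for t in texts:
        words = t.lower().split()
        bigrams = ["_".join(p) for p in zip(words, words[1:])]
        out.append(" ".join(words + bigrams))
    return out

texts = ["the quick brown fox", "the lazy dog"]
tok = Tokenizer()
tok.fit_on_texts(with_bigrams(texts))
print(tok.texts_to_sequences(with_bigrams(texts)))
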
10 votes • 3 answers

What is the difference between fit_transform and transform in sklearn CountVectorizer?

I was recently working through a bag-of-words introduction on Kaggle, and I want to clear up a few things: using vectorizer.fit_transform(…) on the list of cleaned reviews… Now when we were preparing the bag-of-words array on the train reviews we used…
Anurag Pandey • 373 • 2 • 5 • 21
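
The short version: fit_transform learns the vocabulary from the data and encodes it in one step; transform reuses the already-learned vocabulary. A small sketch:

from sklearn.feature_extraction.text import CountVectorizer

train = ["the quick brown fox", "the lazy dog"]
test = ["the quick dog jumps"]

vec = CountVectorizer()
X_train = vec.fit_transform(train)  # learns the vocabulary AND encodes train
X_test = vec.transform(test)        # encodes test with the SAME vocabulary

# sklearn >= 1.0; older versions used get_feature_names()
print(vec.get_feature_names_out())
print(X_test.toarray())  # "jumps" is absent: unseen words are dropped
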
10 votes • 4 answers

Implicit Declaration of Function ‘strtok_r’ Despite Including

I have the following code to tokenize a string containing lines separated by \n, where each line has integers separated by \t: void string_to_int_array(char file_contents[BUFFER_SIZE << 5], int array[200][51]) { char *saveptr1, *saveptr2; char…
jobin • 2,600 • 7 • 32 • 59
10 votes • 3 answers

How to tokenize markdown using Node.js?

I'm building an iOS app that has a view whose source will be markdown. My idea is to parse markdown stored in MongoDB into a JSON object that looks something like: { "h1": "This is the heading", "p" : "Here's the…
bobmoff • 2,415 • 3 • 25 • 32
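
In Node the marked package exposes a lexer for exactly this. For illustration in Python, markdown-it-py exposes a comparable token stream; a sketch, assuming that package is installed:

from markdown_it import MarkdownIt

md = MarkdownIt()
tokens = md.parse("# This is the heading\n\nFirst paragraph.")
for t in tokens:
    # token types like heading_open / inline / paragraph_open map
    # naturally onto a JSON structure
    print(t.type, t.tag, repr(t.content))
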
10 votes • 3 answers

listunagg function?

Is there such a thing in Oracle as a "listunagg" function? For example, if I have data like: user_id degree_fi degree_en degree_sv 3601464 3700 1600 2200 1020 100 0 0 3600520 100,3200,400 1300, 800, 3000 1400, 600,…
Jaanna • 1,620 • 9 • 26 • 46
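
Oracle-side answers typically split the list with regexp_substr plus CONNECT BY LEVEL. As a language-neutral illustration of "un-aggregating" a delimited column into rows, a pandas sketch with made-up data:

import pandas as pd

df = pd.DataFrame({"user_id": [3600520],
                   "degree_fi": ["100,3200,400"]})

# split the delimited string into a list, then emit one row per value
out = (df.assign(degree_fi=df["degree_fi"].str.split(","))
         .explode("degree_fi"))
print(out)
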
10 votes • 3 answers

C - Determining which delimiter used - strtok()

Let's say I'm using strtok() like this: char *token = strtok(input, ";-/"); Is there a way to figure out which delimiter actually gets used? For instance, if the input was something like: Hello there; How are you? / I'm good - End Can I figure out…
Andrew Backes • 1,884 • 4 • 21 • 37
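
strtok() overwrites the delimiter with '\0' and never reports which one it hit, so the usual workaround is to peek at the character before strtok clobbers it, or to split with something that captures the delimiter. The capturing-split idea, sketched in Python:

import re

text = "Hello there; How are you? / I'm good - End"

# a capturing group in the pattern keeps each matched delimiter
parts = re.split(r"\s*([;/-])\s*", text)
tokens, delims = parts[0::2], parts[1::2]
for tok, d in zip(tokens, delims + [None]):
    print(repr(tok), "ended by", repr(d))
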
10 votes • 3 answers

How to prevent Facet Terms from tokenizing

I am using Facet Terms to get all the unique values and their counts for a field, but I am getting wrong results: term: web Count: 1191979 term: misc Count: 1191979 term: passwd Count: 1191979 term: etc Count: 1191979 While the actual…
jmnwong • 1,577 • 6 • 22 • 33
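
Symptoms like this usually mean the analyzer split the field into terms at index time, so the facet counts per word rather than per value. The standard fix is a non-analyzed (keyword) mapping; a sketch of what that mapping might look like, with a hypothetical field name:

# modern Elasticsearch uses "type": "keyword"; old 0.x/1.x versions
# used {"type": "string", "index": "not_analyzed"} instead
mapping = {
    "mappings": {
        "properties": {
            "url_path": {"type": "keyword"}  # hypothetical field
        }
    }
}
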
9 votes • 7 answers

Trim string to length ignoring HTML

This problem is a challenging one. Our application allows users to post news on the homepage. That news is input via a rich text editor which allows HTML. On the homepage we want to only display a truncated summary of the news item. For example,…
steve_c • 6,235 • 4 • 32 • 42
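
One way to do this is a small stateful HTML walker that counts only visible characters and closes whatever tags are still open when the limit is hit. A simplified Python sketch (it ignores attributes and void elements):

from html.parser import HTMLParser

class HTMLTruncator(HTMLParser):
    """Truncate visible text to `limit` chars, keeping tags balanced."""
    def __init__(self, limit):
        super().__init__()
        self.remaining = limit
        self.parts = []
        self.stack = []

    def handle_starttag(self, tag, attrs):
        if self.remaining > 0:
            self.parts.append(f"<{tag}>")
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            self.stack.remove(tag)
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        if self.remaining > 0:
            text = data[: self.remaining]
            self.remaining -= len(text)
            self.parts.append(text)

    def result(self):
        # close anything still open after truncation
        return "".join(self.parts) + "".join(f"</{t}>" for t in reversed(self.stack))

t = HTMLTruncator(12)
t.feed("<p>Some <b>bold news</b> story here</p>")
print(t.result())  # <p>Some <b>bold ne</b></p>
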
9 votes • 1 answer

How is nltk.TweetTokenizer different from nltk.word_tokenize?

I am unable to understand the difference between the two. I have come to know that word_tokenize uses Penn Treebank conventions for tokenization, but nothing on TweetTokenizer is available. For which sort of data should I be using TweetTokenizer…
Mehul Gupta • 1,829 • 3 • 17 • 33
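
The practical difference shows up on tweet-like text: word_tokenize (Penn Treebank conventions) breaks up handles, hashtags, and emoticons, while TweetTokenizer keeps them intact and can normalize elongated words. A quick comparison:

from nltk.tokenize import TweetTokenizer, word_tokenize

# word_tokenize needs the 'punkt' models: nltk.download('punkt')
text = "@user OMG this is soooo cool!!! :-) #nlp"

print(word_tokenize(text))
# splits '@user' into '@', 'user' and tears the ':-)' emoticon apart

tweet_tok = TweetTokenizer(reduce_len=True)
print(tweet_tok.tokenize(text))
# keeps '@user', '#nlp' and ':-)' whole; reduce_len trims 'soooo' to 'sooo'
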