Questions tagged [tokenize]

Tokenizing is the act of splitting a string into discrete elements called tokens.

More broadly, tokenizing splits a stream of text into discrete elements (tokens) using one or more delimiters present in the stream. These tokens can then be processed further, for example to search for a value or to assign them to an array for looping.

Example (VBA):

Dim tokens As Variant
Dim sampleString As String
Dim i As Long

sampleString = "The quick brown fox jumps over the lazy dog."

' tokenize string based on space delimiter
tokens = Split(sampleString, " ")

' list tokens
For i = LBound(tokens) To UBound(tokens)
  MsgBox tokens(i)
Next i


2964 questions
10 votes • 2 answers

The size of tensor a (707) must match the size of tensor b (512) at non-singleton dimension 1

I am trying to do text classification using a pretrained BERT model. I trained the model on my dataset, and in the testing phase I know that BERT can only take up to 512 tokens, so I wrote an if condition to check the length of the test sentence in my…
Mee • 1,413 • 5 • 24 • 40
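
A common fix is to let the tokenizer do the truncating instead of checking lengths by hand. A minimal sketch (Python), assuming the Hugging Face transformers package and the bert-base-uncased checkpoint:

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
long_text = "word " * 1000  # far longer than BERT's 512-token limit

# truncation=True caps the encoding at max_length, which avoids the
# 707-vs-512 tensor size mismatch at inference time
encoded = tokenizer(long_text, truncation=True, max_length=512, return_tensors="pt")
print(encoded["input_ids"].shape)  # torch.Size([1, 512])
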
10 votes • 5 answers

What JavaScript constructs does JsLex incorrectly lex?

JsLex is a JavaScript lexer I've written in Python. It does a good job for a day's work (or so), but I'm sure there are cases it gets wrong. In particular, it doesn't understand anything about semicolon insertion, and there are probably ways…
Ned Batchelder • 364,293 • 75 • 561 • 662
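
A classic trap for any JavaScript lexer is that / can begin a regex literal or be the division operator depending on context. A toy illustration (Python) of how a deliberately naive rule misfires:

import re

# naive rule: anything between two slashes is a regex literal
NAIVE_REGEX = re.compile(r"/[^/\n]+/")

samples = [
    "x = a / b / c;",       # both slashes are division, yet the naive
                            # rule matches "/ b /" as a regex literal
    "x = /ab+c/.test(s);",  # a genuine regex literal
]
for s in samples:
    print(s, "->", NAIVE_REGEX.findall(s))
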
10 votes • 3 answers

Python: Tokenizing with phrases

I have blocks of text I want to tokenize, but I don't want to tokenize on whitespace and punctuation, as seems to be the standard with tools like NLTK. There are particular phrases that I want to be tokenized as a single token, instead of the…
yavoh • 2,645 • 5 • 24 • 21
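
NLTK's MWETokenizer covers exactly this case: it retokenizes a word list so that listed multi-word expressions come out as single tokens. A minimal sketch, assuming NLTK is installed:

from nltk.tokenize import MWETokenizer

# the phrases that must survive as single tokens
tokenizer = MWETokenizer([("New", "York"), ("machine", "learning")], separator="_")

words = "I studied machine learning in New York".split()
print(tokenizer.tokenize(words))
# ['I', 'studied', 'machine_learning', 'in', 'New_York']
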
10 votes • 3 answers

How can I fix "Error tokenizing data" on pandas csv reader?

I'm trying to read a csv file with pandas. This file actually has only one row, but it causes an error whenever I try to read it. Something seems to be going wrong at line 8, but I can hardly find an 8th line since there's clearly only one row on…
user9191983 • 505 • 1 • 4 • 20
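
This error usually means some row has more fields than the parser expects, often from a stray delimiter or a quoting problem. A hedged sketch, assuming pandas 1.3+ and a hypothetical data.csv:

import pandas as pd

# the python engine is more tolerant, and on_bad_lines reports the
# offending rows instead of aborting (pandas >= 1.3; older versions
# used error_bad_lines/warn_bad_lines instead)
df = pd.read_csv("data.csv", engine="python", on_bad_lines="warn")
print(df.shape)
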
10 votes • 1 answer

Custom sentence segmentation using spaCy

I am new to spaCy and NLP. I'm facing the issue below while doing sentence segmentation with spaCy. The text I am trying to tokenize into sentences contains numbered lists (with a space between the numbering and the actual text), like below. import spacy nlp…
Satheesh K • 501 • 1 • 3 • 16
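
In spaCy v3 a custom component added before the parser can pre-set sentence boundaries. A sketch of the idea, assuming the en_core_web_sm model is installed and using a simplistic "number followed by a period starts a list item" heuristic:

import spacy
from spacy.language import Language

@Language.component("list_item_boundaries")
def list_item_boundaries(doc):
    for token in doc[:-1]:
        # a bare number followed by "." opens a new list item
        if token.text.isdigit() and doc[token.i + 1].text == ".":
            token.is_sent_start = True
    return doc

nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("list_item_boundaries", before="parser")

doc = nlp("Steps: 1. Install spaCy. 2. Load a model. 3. Segment the text.")
for sent in doc.sents:
    print(repr(sent.text))
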
10 votes • 2 answers

Explain bpe (Byte Pair Encoding) with examples?

Can somebody help explain the basic concept behind the BPE model? Apart from that paper, there are not many explanations of it yet. What I have learned so far is that it enables NMT models to translate with an open vocabulary by encoding rare and…
lifang • 1,485 • 3 • 16 • 23
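
The core of BPE is simple: start from characters, repeatedly count adjacent symbol pairs across the corpus, and merge the most frequent pair into a new symbol. A toy version of the algorithm from the Sennrich et al. paper, using its example vocabulary:

from collections import Counter

def pair_counts(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for pair in zip(symbols, symbols[1:]):
            pairs[pair] += freq
    return pairs

def merge(pair, vocab):
    """Fuse every occurrence of the pair into one new symbol."""
    old, new = " ".join(pair), "".join(pair)
    return {w.replace(old, new): f for w, f in vocab.items()}

# words pre-split into characters, with an end-of-word marker
vocab = {"l o w </w>": 5, "l o w e r </w>": 2,
         "n e w e s t </w>": 6, "w i d e s t </w>": 3}

for step in range(3):
    pairs = pair_counts(vocab)
    best = max(pairs, key=pairs.get)
    vocab = merge(best, vocab)
    print("merge", step + 1, "->", best)  # ('e','s'), then ('es','t'), ...
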
10 votes • 2 answers

Using Keras Tokenizer to generate n-grams

Is it possible to use n-grams in Keras? E.g., sentences are contained in the X_train dataframe's "sentences" column. I use the tokenizer from Keras in the following manner: tokenizer = Tokenizer(lower=True, split='…
Simplex • 1,723 • 2 • 17 • 26
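
Keras's Tokenizer only splits on a single separator, so one workaround is to append joined n-grams to each text before fitting (sklearn's CountVectorizer(ngram_range=...) is the more direct tool). A sketch with made-up texts:

from tensorflow.keras.preprocessing.text import Tokenizer

def with_bigrams(texts):
    # append "word_word" bigram pseudo-tokens so the Tokenizer
    # indexes them alongside the unigrams
    out = []
    for t in texts:
        words = t.lower().split()
        bigrams = ["_".join(p) for p in zip(words, words[1:])]
        out.append(" ".join(words + bigrams))
    return out

texts = ["the quick brown fox", "the lazy dog"]
tok = Tokenizer()
tok.fit_on_texts(with_bigrams(texts))
print(tok.texts_to_sequences(with_bigrams(texts)))
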
10 votes • 3 answers

What is the difference between fit_transform and transform in sklearn CountVectorizer?

I was recently working through a bag-of-words introduction on Kaggle, and I want to clear up a few things: using vectorizer.fit_transform(…) on the list of cleaned reviews… Now when we were preparing the bag-of-words array on the train reviews we used…
Anurag Pandey • 373 • 2 • 5 • 21
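
The short version: fit_transform learns the vocabulary from the data and encodes it in one step; transform reuses the already-learned vocabulary. A small sketch:

from sklearn.feature_extraction.text import CountVectorizer

train = ["the quick brown fox", "the lazy dog"]
test = ["the quick dog jumps"]

vec = CountVectorizer()
X_train = vec.fit_transform(train)  # learns the vocabulary AND encodes train
X_test = vec.transform(test)        # encodes test with the SAME vocabulary

# sklearn >= 1.0; older versions used get_feature_names()
print(vec.get_feature_names_out())
print(X_test.toarray())  # "jumps" is absent: unseen words are dropped
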
10 votes • 4 answers

Implicit Declaration of Function ‘strtok_r’ Despite Including

I have the following code to tokenize a string containing lines separated by \n, where each line has integers separated by \t: void string_to_int_array(char file_contents[BUFFER_SIZE << 5], int array[200][51]) { char *saveptr1, *saveptr2; char…
jobin • 2,600 • 7 • 32 • 59
10 votes • 3 answers

How to tokenize markdown using Node.js?

I'm building an iOS app that has a view whose source will be markdown. My idea is to parse markdown stored in MongoDB into a JSON object that looks something like: { "h1": "This is the heading", "p" : "Here's the…
bobmoff • 2,415 • 3 • 25 • 32
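
In Node the marked package exposes a lexer for exactly this. For illustration in Python, markdown-it-py exposes a comparable token stream; a sketch, assuming that package is installed:

from markdown_it import MarkdownIt

md = MarkdownIt()
tokens = md.parse("# This is the heading\n\nFirst paragraph.")
for t in tokens:
    # token types like heading_open / inline / paragraph_open map
    # naturally onto a JSON structure
    print(t.type, t.tag, repr(t.content))
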
10 votes • 3 answers

listunagg function?

Is there such a thing in Oracle as a "listunagg" function? For example, if I have data like: user_id degree_fi degree_en degree_sv 3601464 3700 1600 2200 1020 100 0 0 3600520 100,3200,400 1300, 800, 3000 1400, 600,…
Jaanna • 1,620 • 9 • 26 • 46
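
Oracle-side answers typically split the list with regexp_substr plus CONNECT BY LEVEL. As a language-neutral illustration of "un-aggregating" a delimited column into rows, a pandas sketch with made-up data:

import pandas as pd

df = pd.DataFrame({"user_id": [3600520],
                   "degree_fi": ["100,3200,400"]})

# split the delimited string into a list, then emit one row per value
out = (df.assign(degree_fi=df["degree_fi"].str.split(","))
         .explode("degree_fi"))
print(out)
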
10 votes • 3 answers

C - Determining which delimiter used - strtok()

Let's say I'm using strtok() like this: char *token = strtok(input, ";-/"); Is there a way to figure out which delimiter actually gets used? For instance, if the input was something like: Hello there; How are you? / I'm good - End Can I figure out…
Andrew Backes • 1,884 • 4 • 21 • 37
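
strtok() overwrites the delimiter with '\0' and never reports which one it hit, so the usual workaround is to peek at the character before strtok clobbers it, or to split with something that captures the delimiter. The capturing-split idea, sketched in Python:

import re

text = "Hello there; How are you? / I'm good - End"

# a capturing group in the pattern keeps each matched delimiter
parts = re.split(r"\s*([;/-])\s*", text)
tokens, delims = parts[0::2], parts[1::2]
for tok, d in zip(tokens, delims + [None]):
    print(repr(tok), "ended by", repr(d))
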
10 votes • 3 answers

How to prevent Facet Terms from tokenizing

I am using Facet Terms to get all the unique values and their counts for a field, but I am getting wrong results: term: web Count: 1191979 term: misc Count: 1191979 term: passwd Count: 1191979 term: etc Count: 1191979 While the actual…
jmnwong • 1,577 • 6 • 22 • 33
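
Symptoms like this usually mean the analyzer split the field into terms at index time, so the facet counts per word rather than per value. The standard fix is a non-analyzed (keyword) mapping; a sketch of what that mapping might look like, with a hypothetical field name:

# modern Elasticsearch uses "type": "keyword"; old 0.x/1.x versions
# used {"type": "string", "index": "not_analyzed"} instead
mapping = {
    "mappings": {
        "properties": {
            "url_path": {"type": "keyword"}  # hypothetical field
        }
    }
}
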
9 votes • 7 answers

Trim string to length ignoring HTML

This problem is a challenging one. Our application allows users to post news on the homepage. That news is input via a rich text editor which allows HTML. On the homepage we want to only display a truncated summary of the news item. For example,…
steve_c • 6,235 • 4 • 32 • 42
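
One way to do this is a small stateful HTML walker that counts only visible characters and closes whatever tags are still open when the limit is hit. A simplified Python sketch (it ignores attributes and void elements):

from html.parser import HTMLParser

class HTMLTruncator(HTMLParser):
    """Truncate visible text to `limit` chars, keeping tags balanced."""
    def __init__(self, limit):
        super().__init__()
        self.remaining = limit
        self.parts = []
        self.stack = []

    def handle_starttag(self, tag, attrs):
        if self.remaining > 0:
            self.parts.append(f"<{tag}>")
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            self.stack.remove(tag)
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        if self.remaining > 0:
            text = data[: self.remaining]
            self.remaining -= len(text)
            self.parts.append(text)

    def result(self):
        # close anything still open after truncation
        return "".join(self.parts) + "".join(f"</{t}>" for t in reversed(self.stack))

t = HTMLTruncator(12)
t.feed("<p>Some <b>bold news</b> story here</p>")
print(t.result())  # <p>Some <b>bold ne</b></p>
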
9 votes • 1 answer

How is nltk.TweetTokenizer different from nltk.word_tokenize?

I am unable to understand the difference between the two. I have come to know that word_tokenize uses Penn Treebank conventions for tokenization, but nothing on TweetTokenizer is available. For which sort of data should I be using TweetTokenizer…
Mehul Gupta • 1,829 • 3 • 17 • 33
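
The practical difference shows up on tweet-like text: word_tokenize (Penn Treebank conventions) breaks up handles, hashtags, and emoticons, while TweetTokenizer keeps them intact and can normalize elongated words. A quick comparison:

from nltk.tokenize import TweetTokenizer, word_tokenize

# word_tokenize needs the 'punkt' models: nltk.download('punkt')
text = "@user OMG this is soooo cool!!! :-) #nlp"

print(word_tokenize(text))
# splits '@user' into '@', 'user' and tears the ':-)' emoticon apart

tweet_tok = TweetTokenizer(reduce_len=True)
print(tweet_tok.tokenize(text))
# keeps '@user', '#nlp' and ':-)' whole; reduce_len trims 'soooo' to 'sooo'
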