Questions tagged [tokenize]

Tokenizing is the act of splitting a string into discrete elements called tokens.

Tokenizing is the act of splitting a stream of text into discrete elements called tokens, using a delimiter present in the stream. These tokens can then be processed further, for example searched for a value or assigned to an array for looping.

Example (VBA):

Dim tokens As Variant
Dim sampleString As String
Dim i As Long

sampleString = "The quick brown fox jumps over the lazy dog."

' tokenize string based on space delimiter
tokens = Split(sampleString, " ")

' list tokens
For i = LBound(tokens) To UBound(tokens)
  MsgBox tokens(i)
Next i

2964 questions
9
votes
3 answers

BertTokenizer - when encoding and decoding sequences extra spaces appear

When using Transformers from HuggingFace I am facing a problem with the encoding and decoding methods. I have the following string: test_string = 'text with percentage%' Then I am running the following code: import torch from transformers import…
Henryk Borzymowski
  • 988
  • 1
  • 10
  • 22
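
A minimal Python sketch that reproduces the behaviour described above; the bert-base-uncased checkpoint is an assumption, not taken from the question:

import torch
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

test_string = "text with percentage%"
ids = tokenizer.encode(test_string)                        # adds [CLS] ... [SEP]
decoded = tokenizer.decode(ids, skip_special_tokens=True)
print(decoded)  # 'text with percentage %' -- note the extra space before '%'

The space appears because decode() joins the WordPiece tokens ('percentage', '%') with spaces, and the built-in clean_up_tokenization_spaces step only re-attaches a fixed set of punctuation marks such as '.' and ',', not '%'.
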
9
votes
1 answer

Split string every n characters but without splitting a word

Let's suppose I have this in Python: orig_string = 'I am a string in python'. If I want to split this string every 10 characters, but without splitting a word, then I want to have this: strings = ['I am a ', 'string in ',…
Outcast
  • 4,967
  • 5
  • 44
  • 99
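
For the question above, Python's standard library already does word-aware wrapping; a short sketch (note that textwrap drops the trailing spaces that the desired output keeps):

import textwrap

orig_string = 'I am a string in python'
strings = textwrap.wrap(orig_string, width=10)
print(strings)  # ['I am a', 'string in', 'python']
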
9
votes
6 answers

tokenizer.texts_to_sequences Keras Tokenizer gives almost all zeros

I am working on text classification code but I am having problems encoding documents using the tokenizer. 1) I started by fitting a tokenizer on my documents, as here: vocabulary_size = 20000 tokenizer = Tokenizer(num_words=…
Wanderer
  • 1,065
  • 5
  • 18
  • 40
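
One common cause of the all-zeros symptom is passing a single string where the Tokenizer expects a list of texts; a minimal sketch of the usual fit/transform pattern (the example documents are made up):

from tensorflow.keras.preprocessing.text import Tokenizer

docs = ['the quick brown fox', 'the lazy dog']   # a list of documents, not one string

vocabulary_size = 20000
tokenizer = Tokenizer(num_words=vocabulary_size)
tokenizer.fit_on_texts(docs)

# Passing a bare string here would be iterated character by character,
# and unknown "words" are silently dropped, yielding mostly empty sequences.
sequences = tokenizer.texts_to_sequences(docs)
print(sequences)  # [[1, 2, 3, 4], [1, 5, 6]]
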
9
votes
4 answers

Generating PHP code (from Parser Tokens)

Is there any available solution for (re-)generating PHP code from the Parser Tokens returned by token_get_all? Other solutions for generating PHP code are welcome as well, preferably with the associated lexer/parser (if any).
wen
  • 3,782
  • 9
  • 34
  • 54
9
votes
2 answers

Does spacy take as input a list of tokens?

I would like to use spacy's POS tagging, NER, and dependency parsing without using word tokenization. Indeed, my input is a list of tokens representing a sentence, and I would like to respect the user's tokenization. Is this possible at all, either…
dada
  • 1,390
  • 2
  • 17
  • 40
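
Pre-tokenized input can be wrapped in a Doc and handed to the pipeline components directly. A minimal sketch, assuming the en_core_web_sm model is installed (the example tokens are made up):

import spacy
from spacy.tokens import Doc

nlp = spacy.load('en_core_web_sm')

words = ['I', 'work', 'in', 'New', 'York']      # the user's own tokenization
doc = Doc(nlp.vocab, words=words)               # bypasses spaCy's tokenizer

# run the remaining components (tagger, parser, NER, ...) on the pre-built Doc
for _, component in nlp.pipeline:
    doc = component(doc)

print([(t.text, t.pos_, t.dep_) for t in doc])
print(doc.ents)
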
9
votes
4 answers

what does regular in regex/"regular expression" mean?

What does the "regular" in the phrase "regular expression" mean? I have heard that regexes were regular at one time, but no more
barlop
  • 12,887
  • 8
  • 80
  • 109
9
votes
1 answer

How to improve NLTK sentence segmentation?

Given the paragraph from Wikipedia: An ambitious campus expansion plan was proposed by Fr. Vernon F. Gallagher in 1952. Assumption Hall, the first student dormitory, was opened in 1954, and Rockwell Hall was dedicated in November 1958, housing the…
Abdulrahman Bres
  • 2,603
  • 1
  • 20
  • 39
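
One way to handle abbreviations such as "Fr." is to give Punkt an explicit abbreviation list; a minimal sketch (the abbreviation set here is an assumption based on the quoted paragraph):

from nltk.tokenize.punkt import PunktParameters, PunktSentenceTokenizer

punkt_params = PunktParameters()
punkt_params.abbrev_types = {'fr'}   # lowercase, without the trailing period

sent_tokenizer = PunktSentenceTokenizer(punkt_params)
text = ('An ambitious campus expansion plan was proposed by Fr. Vernon F. Gallagher '
        'in 1952. Assumption Hall, the first student dormitory, was opened in 1954.')
for sentence in sent_tokenizer.tokenize(text):
    print(sentence)
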
9
votes
1 answer

How to define special "untokenizable" words for nltk.word_tokenize

I'm using nltk.word_tokenize for tokenizing some sentences which contain programming languages, frameworks, etc., which get incorrectly tokenized. For example: >>> tokenize.word_tokenize("I work with C#.") ['I', 'work', 'with', 'C', '#', '.'] Is…
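
One option is to post-process the output of word_tokenize with NLTK's multi-word-expression tokenizer so the split pieces are merged back together; a minimal sketch:

from nltk.tokenize import MWETokenizer, word_tokenize

mwe = MWETokenizer([('C', '#'), ('F', '#')], separator='')
tokens = mwe.tokenize(word_tokenize('I work with C#.'))
print(tokens)  # ['I', 'work', 'with', 'C#', '.']
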
9
votes
3 answers

tokenizing a string twice in c with strtok()

I'm using strtok() in c to parse a csv string. First I tokenize it to just find out how many tokens there are so I can allocate a string of the correct size. Then I go through using the same variable I used last time for tokenization. Every time I…
SummerCodin
  • 263
  • 1
  • 6
  • 10
9
votes
2 answers

XML / Java: Precise line and character positions whilst parsing tags and attributes?

I’m trying to find a way to precisely determine the line number and character position of both tags and attributes whilst parsing an XML document. I want to do this so that I can report accurately to the author of the XML document (via a web…
Paul
  • 3,009
  • 16
  • 33
9
votes
1 answer

Parser vs. lexer and XML

I'm reading about compiler and parser architecture now and I wonder about one thing... When you have XML, XHTML, HTML or any SGML-based language, what would be the role of a lexer here and what would be the tokens? I've read that tokens are like…
SasQ
  • 14,009
  • 7
  • 43
  • 43
9
votes
4 answers

Is there anything like PPI or Perl::Critic for C?

PPI and Perl::Critic allow programmers to detect certain things in the syntax of their Perl programs. Is there anything like it that will tokenize/parse C and give you a chance to write a script to do something with that information?
Jake
  • 211
  • 3
  • 5
9
votes
11 answers

Tokenize a string with a space in java

I want to tokenize a string like this: String line = "a=b c='123 456' d=777 e='uij yyy'"; I cannot simply split on spaces like this: String[] words = line.split(" "); Any idea how I can split so that I get the tokens a=b, c='123 456', d=777 and e='uij yyy'?
kal
  • 28,545
  • 49
  • 129
  • 149
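
A pattern that matches either a quoted key=value pair or a plain run of non-space characters does the job; sketched here in Python for brevity (the same regular expression should work with java.util.regex.Pattern):

import re

line = "a=b c='123 456' d=777 e='uij yyy'"
tokens = re.findall(r"\S+='[^']*'|\S+", line)
print(tokens)  # ['a=b', "c='123 456'", 'd=777', "e='uij yyy'"]
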
9
votes
3 answers

Matlab split string multiple delimiters

I have a cell array of strings like this: cellArr = 'folderName_fileName_no.jpg', 'folderName2_fileName2_no2.jpg' and I want to get it like this: {folderName, fileName, no}, {folderName2, fileName2, no2}. How do I do this in MATLAB? I know I…
user570593
  • 3,420
  • 12
  • 56
  • 91
9
votes
2 answers

Split tokens on string using Regex in c#

I have some "tokenized" templates, for example (I call tokens the part between double braces): var template1 = "{{TOKEN1}} is a {{TOKEN2}} and it has some {{TOKEN3}}"; I want to extract an array from this sentence, in order to have something…
tyron
  • 3,715
  • 1
  • 22
  • 36
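
Splitting on a capturing group keeps the tokens in the result, so the literal text and the {{...}} tokens come back interleaved in order; sketched here in Python (Regex.Split in .NET behaves the same way when the pattern contains a capture group):

import re

template1 = '{{TOKEN1}} is a {{TOKEN2}} and it has some {{TOKEN3}}'
parts = re.split(r'(\{\{.*?\}\})', template1)
print(parts)
# ['', '{{TOKEN1}}', ' is a ', '{{TOKEN2}}', ' and it has some ', '{{TOKEN3}}', '']
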