Questions tagged [tokenize]

Tokenizing is the act of splitting a string into discrete elements called tokens.

Tokenizing is the act of splitting a stream of text into discrete elements called tokens, using a delimiter present in the stream. These tokens can then be processed further, for example searched for a value or assigned to an array for looping.

Example (VBA):

Dim tokens As Variant
Dim sampleString As String
Dim i As Long

sampleString = "The quick brown fox jumps over the lazy dog."

' tokenize string based on space delimiter
tokens = Split(sampleString, " ")

' list tokens
For i = LBound(tokens) To UBound(tokens)
  MsgBox tokens(i)
Next i

2964 questions
9
votes
3 answers

BertTokenizer - when encoding and decoding sequences extra spaces appear

When using Transformers from HuggingFace I am facing a problem with the encoding and decoding methods. I have the following string: test_string = 'text with percentage%' Then I am running the following code: import torch from transformers import…
Henryk Borzymowski
  • 988
  • 1
  • 10
  • 22
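
A minimal Python sketch that reproduces the behaviour described above; the bert-base-uncased checkpoint is an assumption, not taken from the question:

import torch
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

test_string = "text with percentage%"
ids = tokenizer.encode(test_string)                        # adds [CLS] ... [SEP]
decoded = tokenizer.decode(ids, skip_special_tokens=True)
print(decoded)  # 'text with percentage %' -- note the extra space before '%'

The space appears because decode() joins the WordPiece tokens ('percentage', '%') with spaces, and the built-in clean_up_tokenization_spaces step only re-attaches a fixed set of punctuation marks such as '.' and ',', not '%'.
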
9
votes
1 answer

Split string every n characters but without splitting a word

Let's suppose I have this in Python: orig_string = 'I am a string in python'. If I want to split this string every 10 characters, but without splitting a word, then I want to have this: strings = ['I am a ', 'string in ',…
Outcast
  • 4,967
  • 5
  • 44
  • 99
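
For the question above, Python's standard library already does word-aware wrapping; a short sketch (note that textwrap drops the trailing spaces that the desired output keeps):

import textwrap

orig_string = 'I am a string in python'
strings = textwrap.wrap(orig_string, width=10)
print(strings)  # ['I am a', 'string in', 'python']
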
9
votes
6 answers

tokenizer.texts_to_sequences Keras Tokenizer gives almost all zeros

I am working on text classification code but I am having problems encoding documents using the tokenizer. 1) I started by fitting a tokenizer on my documents, as here: vocabulary_size = 20000 tokenizer = Tokenizer(num_words=…
Wanderer
  • 1,065
  • 5
  • 18
  • 40
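
One common cause of the all-zeros symptom is passing a single string where the Tokenizer expects a list of texts; a minimal sketch of the usual fit/transform pattern (the example documents are made up):

from tensorflow.keras.preprocessing.text import Tokenizer

docs = ['the quick brown fox', 'the lazy dog']   # a list of documents, not one string

vocabulary_size = 20000
tokenizer = Tokenizer(num_words=vocabulary_size)
tokenizer.fit_on_texts(docs)

# Passing a bare string here would be iterated character by character,
# and unknown "words" are silently dropped, yielding mostly empty sequences.
sequences = tokenizer.texts_to_sequences(docs)
print(sequences)  # [[1, 2, 3, 4], [1, 5, 6]]
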
9
votes
4 answers

Generating PHP code (from Parser Tokens)

Is there any available solution for (re-)generating PHP code from the Parser Tokens returned by token_get_all? Other solutions for generating PHP code are welcome as well, preferably with the associated lexer/parser (if any).
wen
  • 3,782
  • 9
  • 34
  • 54
9
votes
2 answers

Does spacy take as input a list of tokens?

I would like to use spacy's POS tagging, NER, and dependency parsing without using word tokenization. Indeed, my input is a list of tokens representing a sentence, and I would like to respect the user's tokenization. Is this possible at all, either…
dada
  • 1,390
  • 2
  • 17
  • 40
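
Pre-tokenized input can be wrapped in a Doc and handed to the pipeline components directly. A minimal sketch, assuming the en_core_web_sm model is installed (the example tokens are made up):

import spacy
from spacy.tokens import Doc

nlp = spacy.load('en_core_web_sm')

words = ['I', 'work', 'in', 'New', 'York']      # the user's own tokenization
doc = Doc(nlp.vocab, words=words)               # bypasses spaCy's tokenizer

# run the remaining components (tagger, parser, NER, ...) on the pre-built Doc
for _, component in nlp.pipeline:
    doc = component(doc)

print([(t.text, t.pos_, t.dep_) for t in doc])
print(doc.ents)
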
9
votes
4 answers

what does regular in regex/"regular expression" mean?

What does the "regular" in the phrase "regular expression" mean? I have heard that regexes were regular at one time, but no more
barlop
  • 12,887
  • 8
  • 80
  • 109
9
votes
1 answer

How to improve NLTK sentence segmentation?

Given the paragraph from Wikipedia: An ambitious campus expansion plan was proposed by Fr. Vernon F. Gallagher in 1952. Assumption Hall, the first student dormitory, was opened in 1954, and Rockwell Hall was dedicated in November 1958, housing the…
Abdulrahman Bres
  • 2,603
  • 1
  • 20
  • 39
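
One way to handle abbreviations such as "Fr." is to give Punkt an explicit abbreviation list; a minimal sketch (the abbreviation set here is an assumption based on the quoted paragraph):

from nltk.tokenize.punkt import PunktParameters, PunktSentenceTokenizer

punkt_params = PunktParameters()
punkt_params.abbrev_types = {'fr'}   # lowercase, without the trailing period

sent_tokenizer = PunktSentenceTokenizer(punkt_params)
text = ('An ambitious campus expansion plan was proposed by Fr. Vernon F. Gallagher '
        'in 1952. Assumption Hall, the first student dormitory, was opened in 1954.')
for sentence in sent_tokenizer.tokenize(text):
    print(sentence)
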
9
votes
1 answer

How to define special "untokenizable" words for nltk.word_tokenize

I'm using nltk.word_tokenize for tokenizing some sentences which contain programming languages, frameworks, etc., which get incorrectly tokenized. For example: >>> tokenize.word_tokenize("I work with C#.") ['I', 'work', 'with', 'C', '#', '.'] Is…
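
One option is to post-process the output of word_tokenize with NLTK's multi-word-expression tokenizer so the split pieces are merged back together; a minimal sketch:

from nltk.tokenize import MWETokenizer, word_tokenize

mwe = MWETokenizer([('C', '#'), ('F', '#')], separator='')
tokens = mwe.tokenize(word_tokenize('I work with C#.'))
print(tokens)  # ['I', 'work', 'with', 'C#', '.']
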
9
votes
3 answers

tokenizing a string twice in c with strtok()

I'm using strtok() in c to parse a csv string. First I tokenize it to just find out how many tokens there are so I can allocate a string of the correct size. Then I go through using the same variable I used last time for tokenization. Every time I…
SummerCodin
  • 263
  • 1
  • 6
  • 10
9
votes
2 answers

XML / Java: Precise line and character positions whilst parsing tags and attributes?

I’m trying to find a way to precisely determine the line number and character position of both tags and attributes whilst parsing an XML document. I want to do this so that I can report accurately to the author of the XML document (via a web…
Paul
  • 3,009
  • 16
  • 33
9
votes
1 answer

Parser vs. lexer and XML

I'm reading about compiler and parser architecture now and I wonder about one thing... When you have XML, XHTML, HTML or any SGML-based language, what would be the role of a lexer here and what would be the tokens? I've read that tokens are like…
SasQ
  • 14,009
  • 7
  • 43
  • 43
9
votes
4 answers

Is there anything like PPI or Perl::Critic for C?

PPI and Perl::Critic allow programmers to detect certain things in the syntax of their Perl programs. Is there anything like it that will tokenize/parse C and give you a chance to write a script to do something with that information?
Jake
  • 211
  • 3
  • 5
9
votes
11 answers

Tokenize a string with a space in java

I want to tokenize a string like this: String line = "a=b c='123 456' d=777 e='uij yyy'"; I cannot simply split on spaces like this: String[] words = line.split(" "); Any idea how I can split so that I get the tokens a=b, c='123 456', d=777 and e='uij yyy'?
kal
  • 28,545
  • 49
  • 129
  • 149
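
A pattern that matches either a quoted key=value pair or a plain run of non-space characters does the job; sketched here in Python for brevity (the same regular expression should work with java.util.regex.Pattern):

import re

line = "a=b c='123 456' d=777 e='uij yyy'"
tokens = re.findall(r"\S+='[^']*'|\S+", line)
print(tokens)  # ['a=b', "c='123 456'", 'd=777', "e='uij yyy'"]
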
9
votes
3 answers

Matlab split string multiple delimiters

I have a cell array of strings like this: cellArr = 'folderName_fileName_no.jpg', 'folderName2_fileName2_no2.jpg' and I want to get it like this: {folderName, fileName, no}, {folderName2, fileName2, no2}. How do I do this in MATLAB? I know I…
user570593
  • 3,420
  • 12
  • 56
  • 91
9
votes
2 answers

Split tokens on string using Regex in c#

I have some "tokenized" templates, for example (I call tokens the part between double braces): var template1 = "{{TOKEN1}} is a {{TOKEN2}} and it has some {{TOKEN3}}"; I want to extract an array from this sentence, in order to have something…
tyron
  • 3,715
  • 1
  • 22
  • 36
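
Splitting on a capturing group keeps the tokens in the result, so the literal text and the {{...}} tokens come back interleaved in order; sketched here in Python (Regex.Split in .NET behaves the same way when the pattern contains a capture group):

import re

template1 = '{{TOKEN1}} is a {{TOKEN2}} and it has some {{TOKEN3}}'
parts = re.split(r'(\{\{.*?\}\})', template1)
print(parts)
# ['', '{{TOKEN1}}', ' is a ', '{{TOKEN2}}', ' and it has some ', '{{TOKEN3}}', '']
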