Questions tagged [tokenize]

Tokenizing is the act of splitting a stream of text into discrete elements, called tokens, using a delimiter present in the stream. These tokens can then be processed further, for example to search for a value or to assign them to an array for looping.

Example (VBA):

Dim tokens As Variant
Dim sampleString As String
Dim i As Long

sampleString = "The quick brown fox jumps over the lazy dog."

' tokenize string based on space delimiter
tokens = Split(sampleString, " ")

' list tokens
For i = LBound(tokens) To UBound(tokens)
  MsgBox tokens(i)
Next i
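
For comparison, the same space-delimited tokenization in Python, a minimal sketch using the built-in `str.split`:

```python
# Tokenize a string on a space delimiter, mirroring the VBA example above.
sample_string = "The quick brown fox jumps over the lazy dog."
tokens = sample_string.split(" ")

# List the tokens.
for token in tokens:
    print(token)
```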

2964 questions
8
votes
2 answers

Lucene - Exact string matching

I'm trying to create a Lucene 4.10 index. I just want to save in the index the exact strings that I put into the document, without tokenization. I'm using the StandardAnalyzer. Directory dir = FSDirectory.open(new File("myDire")); Analyzer…
LucaT
  • 173
  • 1
  • 2
  • 6
8
votes
4 answers

Elasticsearch wildcard search on not_analyzed field

I have an index like following settings and mapping; { "settings":{ "index":{ "analysis":{ "analyzer":{ "analyzer_keyword":{ "tokenizer":"keyword", "filter":"lowercase" …
Hüseyin BABAL
  • 15,400
  • 4
  • 51
  • 73
8
votes
1 answer

Postgresql full text search tokenizer

Just ran into an issue. I'm trying to set up full text search on localized content (Russian in particular). The problem is that the default configuration (as well as my custom one) does not deal with letter cases. Example: SELECT * from…
Tommi
  • 3,199
  • 1
  • 24
  • 38
8
votes
3 answers

how does the String.Split method determine separator precedence when passed multiple multi-character separators?

If you have this code: "......".Split(new String[]{"...", ".."}, StringSplitOptions.None); The resulting array elements are: 1. "" 2. "" 3. "" Now if you reverse the order of the separators, "......".Split(new String[]{"..", "..."},…
John Smith
  • 4,416
  • 7
  • 41
  • 56
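
The order-dependence in that question can be illustrated in Python with `re.split`, where alternation tries patterns left to right at each position, so the order of the separators changes the result (a sketch of the general phenomenon, not of the .NET implementation):

```python
import re

# With regex alternation, candidates are tried left to right at each
# position, so the order of the separators changes the result.
print(re.split(r"\.\.\.|\.\.", "......"))  # '...' matches first: 3 fields
print(re.split(r"\.\.|\.\.\.", "......"))  # '..' matches first: 4 fields
```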
8
votes
4 answers

How to get each character from a word with special encoding

I need to get an array with all the characters from a word, but the word has letters with special encoding like á, when I execute the following code: $word = 'withá'; $word_arr = array(); for ($i=0;$i…
leticia
  • 2,390
  • 5
  • 30
  • 41
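
The underlying issue there is indexing a multibyte UTF-8 string byte by byte. In Python 3, where strings are sequences of code points, per-character iteration is direct (a sketch of the same task, not the PHP fix; in PHP one would reach for the `mb_*` functions or `preg_split('//u', …)`):

```python
# Splitting a word into characters; a Python 3 str is Unicode, so
# indexing and iteration work per code point, not per byte.
word = "withá"
chars = list(word)
print(chars)  # ['w', 'i', 't', 'h', 'á']

# The byte-level view shows why naive byte indexing breaks:
# 'á' is two bytes in UTF-8, so byte length != character length.
print(len(word), len(word.encode("utf-8")))  # 5 6
```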
8
votes
1 answer

Control order of token filters in ElasticSearch

Trying to control the order in which token filters are applied in ElasticSearch. I know from the docs that the tokenizer is applied first, then the token filters, but they do not mention how the order of the token filters is determined. Here's a YAML…
Clay Wardell
  • 14,846
  • 13
  • 44
  • 65
8
votes
5 answers

How to best split csv strings in oracle 9i

I want to be able to split csv strings in Oracle 9i. I've read the following article http://www.oappssurd.com/2009/03/string-split-in-oracle.html but I didn't understand how to make this work. Here are some of my questions pertaining to it. Would…
Joyce
  • 1,431
  • 2
  • 18
  • 33
7
votes
4 answers

Solr: exact phrase query with a EdgeNGramFilterFactory

In Solr (3.3), is it possible to make a field letter-by-letter searchable through an EdgeNGramFilterFactory and also sensitive to phrase queries? For example, I'm looking for a field that, if containing "contrat informatique", will be found if the…
Xavier Portebois
  • 3,354
  • 6
  • 33
  • 53
7
votes
0 answers

Converting Hugging Face Transformer Text Embeddings Back to Text

Is there a method for converting Hugging Face Transformer embeddings back to text? Suppose that I have text embeddings created using Hugging Face's ClipTextModel using the following method: import torch from transformers import CLIPTokenizer,…
john_mon
  • 487
  • 1
  • 3
  • 13
7
votes
3 answers

Is SQLite on Android built with the ICU tokenizer enabled for FTS?

Like the title says: can we use ...USING fts3(tokenizer icu th_TH, ...). If we can, does anyone know what locales are supported, and whether it varies by platform version?
Ted Hopp
  • 232,168
  • 48
  • 399
  • 521
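
Whether the ICU tokenizer is available depends on how the SQLite library on the device was compiled. A quick way to inspect a given build's compile options from Python's stdlib `sqlite3` (a sketch; on Android the same pragma can be run through the platform's database API, and the options listed vary by build):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# PRAGMA compile_options lists the flags this SQLite build was compiled
# with; an ICU-enabled build reports ENABLE_ICU in the list.
options = [row[0] for row in conn.execute("PRAGMA compile_options")]
print("ICU enabled:", "ENABLE_ICU" in options)
conn.close()
```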
7
votes
3 answers

TRANSFORMERS: Asking to pad but the tokenizer does not have a padding token

I'm trying to evaluate several transformers models sequentially with the same dataset to check which one performs better. The list of models is this one: MODELS = [ ('xlm-mlm-enfr-1024' ,"XLMModel"), ('distilbert-base-cased',…
7
votes
1 answer

Token indices sequence length error when using encode_plus method

I got a strange error when trying to encode question-answer pairs for BERT using the encode_plus method provided in the Transformers library. I am using data from this Kaggle competition. Given a question title, question body and answer, the model…
Niels
  • 1,023
  • 1
  • 16
  • 13
7
votes
3 answers

ParserError: Error tokenizing data. C error: Expected 7 fields in line 4, saw 10 error in reading csv file pandas

I am trying to read a csv file using pandas df1 = pd.read_csv('panda_error.csv', header=None, sep=',') But I am getting this error: ParserError: Error tokenizing data. C error: Expected 7 fields in line 4, saw 10 For reproducibility, here is the…
Atia Amin
  • 379
  • 1
  • 4
  • 10
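
That error means some row has more comma-separated fields than the header implies. A stdlib sketch for locating such ragged rows before loading (the file contents here are invented for illustration; in pandas itself, `read_csv(..., on_bad_lines="skip")` drops them in versions 1.3+):

```python
import csv
import io

# Hypothetical CSV where line 4 has 10 fields instead of the expected 7.
raw = (
    "a,b,c,d,e,f,g\n"
    "1,2,3,4,5,6,7\n"
    "1,2,3,4,5,6,7\n"
    "1,2,3,4,5,6,7,8,9,10\n"
)

expected = 7
for lineno, row in enumerate(csv.reader(io.StringIO(raw)), start=1):
    if len(row) != expected:
        print(f"line {lineno}: expected {expected} fields, saw {len(row)}")
```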
7
votes
1 answer

Wordpiece tokenization versus conventional lemmatization?

I'm looking at NLP preprocessing. At some point I want to implement a context-sensitive word embedding, as a way of discerning word sense, and I was thinking about using the output from BERT to do so. I noticed BERT uses WordPiece tokenization (for…
Keshinko
  • 318
  • 1
  • 11
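
Unlike lemmatization, WordPiece splits out-of-vocabulary words into the longest vocabulary pieces it can find, left to right. A toy sketch of that greedy longest-match loop (the vocabulary here is invented for illustration; real BERT vocabularies have roughly 30k entries):

```python
def wordpiece_tokenize(word, vocab):
    """Greedy longest-match-first subword split, WordPiece-style."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Shrink the candidate substring until it appears in the vocab.
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation pieces get the ## prefix
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return ["[UNK]"]  # no piece matched: whole word is unknown
        tokens.append(match)
        start = end
    return tokens

vocab = {"un", "##aff", "##able", "aff", "able"}
print(wordpiece_tokenize("unaffable", vocab))  # ['un', '##aff', '##able']
```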
7
votes
6 answers

Best way to parse Space Separated Text

I have a string like this /c SomeText\MoreText "Some Text\More Text\Lol" SomeText I want to tokenize it, however I can't just split on the spaces. I've come up with a somewhat ugly parser that works, but I'm wondering if anyone has a more elegant…
FlySwat
  • 172,459
  • 74
  • 246
  • 311
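
For the common case of whitespace-separated tokens with quoted groups, Python's stdlib `shlex` handles this without a hand-rolled parser (a sketch; `posix=False` keeps the quote characters and backslashes intact instead of treating them as shell escapes):

```python
import shlex

line = r'/c SomeText\MoreText "Some Text\More Text\Lol" SomeText'

# Split on whitespace but keep quoted groups together; non-POSIX mode
# leaves quote characters and backslashes in the tokens untouched.
tokens = shlex.split(line, posix=False)
print(tokens)
```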