Questions tagged [tokenize]

Tokenizing is the act of splitting a stream of text into discrete elements, called tokens, using a delimiter present in the stream. These tokens can then be processed further, for example to search for a value or to assign them to an array for looping.

Example (VBA):

Dim tokens As Variant
Dim sampleString As String
Dim i As Long

sampleString = "The quick brown fox jumps over the lazy dog."

' tokenize string based on space delimiter
tokens = Split(sampleString, " ")

' list tokens
For i = LBound(tokens) To UBound(tokens)
  MsgBox tokens(i)
Next i
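
For comparison, the same space-delimited tokenization in Python, a minimal sketch using the built-in `str.split`:

```python
# Tokenize a string on a space delimiter, mirroring the VBA example above.
sample_string = "The quick brown fox jumps over the lazy dog."
tokens = sample_string.split(" ")

# List the tokens.
for token in tokens:
    print(token)
```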

2964 questions
8
votes
2 answers

Lucene - Exact string matching

I'm trying to create a Lucene 4.10 index. I just want to save in the index the exact strings that I put into the document, without tokenization. I'm using the StandardAnalyzer. Directory dir = FSDirectory.open(new File("myDire")); Analyzer…
LucaT
  • 173
  • 1
  • 2
  • 6
8
votes
4 answers

Elasticsearch wildcard search on not_analyzed field

I have an index like following settings and mapping; { "settings":{ "index":{ "analysis":{ "analyzer":{ "analyzer_keyword":{ "tokenizer":"keyword", "filter":"lowercase" …
Hüseyin BABAL
  • 15,400
  • 4
  • 51
  • 73
8
votes
1 answer

Postgresql full text search tokenizer

Just ran into an issue. I'm trying to set up full text search on localized content (Russian in particular). The problem is that the default configuration (as well as my custom one) does not deal with letter cases. Example: SELECT * from…
Tommi
  • 3,199
  • 1
  • 24
  • 38
8
votes
3 answers

how does the String.Split method determine separator precedence when passed multiple multi-character separators?

If you have this code: "......".Split(new String[]{"...", ".."}, StringSplitOptions.None); The resulting array elements are: 1. "" 2. "" 3. "" Now if you reverse the order of the separators, "......".Split(new String[]{"..", "..."},…
John Smith
  • 4,416
  • 7
  • 41
  • 56
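
The order-dependence in that question can be illustrated in Python with `re.split`, where alternation tries patterns left to right at each position, so the order of the separators changes the result (a sketch of the general phenomenon, not of the .NET implementation):

```python
import re

# With regex alternation, candidates are tried left to right at each
# position, so the order of the separators changes the result.
print(re.split(r"\.\.\.|\.\.", "......"))  # '...' matches first: 3 fields
print(re.split(r"\.\.|\.\.\.", "......"))  # '..' matches first: 4 fields
```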
8
votes
4 answers

How to get each character from a word with special encoding

I need to get an array with all the characters from a word, but the word has letters with special encoding like á, when I execute the following code: $word = 'withá'; $word_arr = array(); for ($i=0;$i…
leticia
  • 2,390
  • 5
  • 30
  • 41
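
The underlying issue there is indexing a multibyte UTF-8 string byte by byte. In Python 3, where strings are sequences of code points, per-character iteration is direct (a sketch of the same task, not the PHP fix; in PHP one would reach for the `mb_*` functions or `preg_split('//u', …)`):

```python
# Splitting a word into characters; a Python 3 str is Unicode, so
# indexing and iteration work per code point, not per byte.
word = "withá"
chars = list(word)
print(chars)  # ['w', 'i', 't', 'h', 'á']

# The byte-level view shows why naive byte indexing breaks:
# 'á' is two bytes in UTF-8, so byte length != character length.
print(len(word), len(word.encode("utf-8")))  # 5 6
```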
8
votes
1 answer

Control order of token filters in ElasticSearch

Trying to control the order in which token filters are applied in ElasticSearch. I know from the docs that the tokenizer is applied first, then the token filters, but they do not mention how the order of the token filters is determined. Here's a YAML…
Clay Wardell
  • 14,846
  • 13
  • 44
  • 65
8
votes
5 answers

How to best split csv strings in oracle 9i

I want to be able to split csv strings in Oracle 9i. I've read the following article http://www.oappssurd.com/2009/03/string-split-in-oracle.html but I didn't understand how to make this work. Here are some of my questions pertaining to it. Would…
Joyce
  • 1,431
  • 2
  • 18
  • 33
7
votes
4 answers

Solr: exact phrase query with a EdgeNGramFilterFactory

In Solr (3.3), is it possible to make a field letter-by-letter searchable through an EdgeNGramFilterFactory and also sensitive to phrase queries? For example, I'm looking for a field that, if containing "contrat informatique", will be found if the…
Xavier Portebois
  • 3,354
  • 6
  • 33
  • 53
7
votes
0 answers

Converting Hugging Face Transformer Text Embeddings Back to Text

Is there a method for converting Hugging Face Transformer embeddings back to text? Suppose that I have text embeddings created using Hugging Face's ClipTextModel using the following method: import torch from transformers import CLIPTokenizer,…
john_mon
  • 487
  • 1
  • 3
  • 13
7
votes
3 answers

Is SQLite on Android built with the ICU tokenizer enabled for FTS?

Like the title says: can we use ...USING fts3(tokenizer icu th_TH, ...). If we can, does anyone know what locales are supported, and whether it varies by platform version?
Ted Hopp
  • 232,168
  • 48
  • 399
  • 521
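
Whether the ICU tokenizer is available depends on how the SQLite library on the device was compiled. A quick way to inspect a given build's compile options from Python's stdlib `sqlite3` (a sketch; on Android the same pragma can be run through the platform's database API, and the options listed vary by build):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# PRAGMA compile_options lists the flags this SQLite build was compiled
# with; an ICU-enabled build reports ENABLE_ICU in the list.
options = [row[0] for row in conn.execute("PRAGMA compile_options")]
print("ICU enabled:", "ENABLE_ICU" in options)
conn.close()
```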
7
votes
3 answers

TRANSFORMERS: Asking to pad but the tokenizer does not have a padding token

I'm trying to evaluate several transformers models sequentially with the same dataset to check which one performs better. The list of models is this one: MODELS = [ ('xlm-mlm-enfr-1024' ,"XLMModel"), ('distilbert-base-cased',…
7
votes
1 answer

Token indices sequence length error when using encode_plus method

I got a strange error when trying to encode question-answer pairs for BERT using the encode_plus method provided in the Transformers library. I am using data from this Kaggle competition. Given a question title, question body and answer, the model…
Niels
  • 1,023
  • 1
  • 16
  • 13
7
votes
3 answers

ParserError: Error tokenizing data. C error: Expected 7 fields in line 4, saw 10 error in reading csv file pandas

I am trying to read a csv file using pandas df1 = pd.read_csv('panda_error.csv', header=None, sep=',') But I am getting this error: ParserError: Error tokenizing data. C error: Expected 7 fields in line 4, saw 10 For reproducibility, here is the…
Atia Amin
  • 379
  • 1
  • 4
  • 10
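
That error means some row has more comma-separated fields than the header implies. A stdlib sketch for locating such ragged rows before loading (the file contents here are invented for illustration; in pandas itself, `read_csv(..., on_bad_lines="skip")` drops them in versions 1.3+):

```python
import csv
import io

# Hypothetical CSV where line 4 has 10 fields instead of the expected 7.
raw = (
    "a,b,c,d,e,f,g\n"
    "1,2,3,4,5,6,7\n"
    "1,2,3,4,5,6,7\n"
    "1,2,3,4,5,6,7,8,9,10\n"
)

expected = 7
for lineno, row in enumerate(csv.reader(io.StringIO(raw)), start=1):
    if len(row) != expected:
        print(f"line {lineno}: expected {expected} fields, saw {len(row)}")
```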
7
votes
1 answer

Wordpiece tokenization versus conventional lemmatization?

I'm looking at NLP preprocessing. At some point I want to implement a context-sensitive word embedding, as a way of discerning word sense, and I was thinking about using the output from BERT to do so. I noticed BERT uses WordPiece tokenization (for…
Keshinko
  • 318
  • 1
  • 11
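
Unlike lemmatization, WordPiece splits out-of-vocabulary words into the longest vocabulary pieces it can find, left to right. A toy sketch of that greedy longest-match loop (the vocabulary here is invented for illustration; real BERT vocabularies have roughly 30k entries):

```python
def wordpiece_tokenize(word, vocab):
    """Greedy longest-match-first subword split, WordPiece-style."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Shrink the candidate substring until it appears in the vocab.
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation pieces get the ## prefix
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return ["[UNK]"]  # no piece matched: whole word is unknown
        tokens.append(match)
        start = end
    return tokens

vocab = {"un", "##aff", "##able", "aff", "able"}
print(wordpiece_tokenize("unaffable", vocab))  # ['un', '##aff', '##able']
```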
7
votes
6 answers

Best way to parse Space Separated Text

I have a string like this /c SomeText\MoreText "Some Text\More Text\Lol" SomeText I want to tokenize it, however I can't just split on the spaces. I've come up with a somewhat ugly parser that works, but I'm wondering if anyone has a more elegant…
FlySwat
  • 172,459
  • 74
  • 246
  • 311
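
For the common case of whitespace-separated tokens with quoted groups, Python's stdlib `shlex` handles this without a hand-rolled parser (a sketch; `posix=False` keeps the quote characters and backslashes intact instead of treating them as shell escapes):

```python
import shlex

line = r'/c SomeText\MoreText "Some Text\More Text\Lol" SomeText'

# Split on whitespace but keep quoted groups together; non-POSIX mode
# leaves quote characters and backslashes in the tokens untouched.
tokens = shlex.split(line, posix=False)
print(tokens)
```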