Questions tagged [tokenize]

Tokenizing is the act of splitting a stream of text into discrete elements called tokens, using a delimiter present in the stream. These tokens can then be processed further, for example to search for a value or to assign them to an array for looping.

Example (VBA):

Dim tokens As Variant
Dim sampleString As String
Dim i As Long

sampleString = "The quick brown fox jumps over the lazy dog."

' tokenize string based on space delimiter
tokens = Split(sampleString, " ")

' list tokens
For i = LBound(tokens) To UBound(tokens)
  MsgBox tokens(i)
Next i
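
For comparison, a minimal sketch of the same tokenization in Python, where str.split plays the role of VBA's Split:

Example (Python):

sample_string = "The quick brown fox jumps over the lazy dog."

# tokenize string based on space delimiter
tokens = sample_string.split(" ")

# list tokens
for token in tokens:
    print(token)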

2964 questions
1
vote
1 answer

Using Schematron QuickFixes to tag individual words in mixed content elements

I have an XML file that looks like this (simplified): Pure text Mixed content, because there is also another: element inside and more. Text nodes within elements other than def are…
Tench
  • 485
  • 3
  • 18
1
vote
1 answer

How to parse a sequence of command lines in the same way bash would?

Input: I have the following example input (each of these is a bash executable command): client-properties create mode "publisher" "version" "mode" client-properties set "publisher" "version" "mode" "prop1" "value value value" client-properties set…
Madara's Ghost
  • 172,118
  • 50
  • 264
  • 308
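
For questions like this one, Python's shlex module tokenizes text using shell-style quoting rules, so double-quoted phrases survive as single tokens. A minimal sketch, using one of the command lines above:

import shlex

line = 'client-properties set "publisher" "version" "mode" "prop1" "value value value"'
tokens = shlex.split(line)  # honors bash-style double quoting
print(tokens)
# ['client-properties', 'set', 'publisher', 'version', 'mode', 'prop1', 'value value value']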
1
vote
2 answers

TextBlob word tokenization into an array

from textblob import TextBlob import nltk array=("i have a bunch of grapes","i like to eat apple","this is a laptop") array2=[] for i in array: c=TextBlob(i) array2.append(c.words) print array2 the result printed out will…
kent chan
  • 13
  • 3
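
For reference, TextBlob's .words property returns a WordList; converting each word to str gives ordinary Python lists. A minimal sketch, assuming textblob and its NLTK data are installed:

from textblob import TextBlob

sentences = ("i have a bunch of grapes", "i like to eat apple", "this is a laptop")
tokenized = []
for sentence in sentences:
    blob = TextBlob(sentence)
    # str() turns each Word back into a plain string
    tokenized.append([str(word) for word in blob.words])
print(tokenized)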
1
vote
1 answer

What's the first element in my trigrams?

Using a trigram tokenizer from the RWeka package > TriGramTokenizer <- function(x){NGramTokenizer(x, Weka_control(min=3, max=3))} I tokenized a corpus. Inspection shows that the trigrams look like this: > inspect(tdm_trigram[1:10, 1:3]) A…
TMOTTM
  • 3,286
  • 6
  • 32
  • 63
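
For comparison, the same trigram windows can be produced in Python with nltk.util.ngrams (a swap-in for the RWeka tokenizer, not the questioner's setup); the first element of each trigram is simply the first token of its three-token window:

from nltk.util import ngrams

tokens = "the quick brown fox jumps".split()
trigrams = list(ngrams(tokens, 3))
print(trigrams[0])  # ('the', 'quick', 'brown')
print(trigrams)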
1
vote
1 answer

Including currency symbols in Solr / Lucene indexes

Is it possible to index a text field considering currency symbols as separate tokens? For example, in a text field I have this: "16 €" and I need to build an index with these entries: 16 € in order to search for "€" and find the document. Now I'm…
Zac
  • 2,180
  • 2
  • 23
  • 36
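
One way to keep a standalone "€" as a token is to avoid the standard tokenizer, which drops symbol-only tokens, and tokenize on whitespace instead. A minimal Solr schema sketch; the field and type names here are illustrative:

<fieldType name="text_ws" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- splits on whitespace only, so "16 €" yields the tokens "16" and "€" -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  </analyzer>
</fieldType>
<field name="price_text" type="text_ws" indexed="true" stored="true"/>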
1
vote
0 answers

How do I sandbox the evaluation of user supplied patterns?

I want users to be able to provide relatively simple patterns for matching (for now) different kinds of IDs. I need to evaluate those patterns server side, potentially against a large number of short strings. The obvious solution is to use regular…
brightbyte
  • 971
  • 4
  • 10
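
One common approach is to accept only a restricted glob-style syntax and translate it to a regex yourself instead of evaluating raw user regexes; Python's fnmatch.translate produces patterns containing only literals and simple wildcards, avoiding the catastrophic-backtracking risk of arbitrary user patterns. A minimal sketch with an illustrative ID pattern:

import fnmatch
import re

user_pattern = "ISBN-??-*"  # only glob wildcards, no user-controlled regex features
compiled = re.compile(fnmatch.translate(user_pattern))

ids = ["ISBN-97-8014", "DOI-10-1000"]
print([s for s in ids if compiled.match(s)])  # ['ISBN-97-8014']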
1
vote
1 answer

How can I add a delete button to delete a token

I am trying to create a control that will accept tag-style functionality like the tags we use on Stack Overflow. I am trying to customize RichTextBox to achieve this functionality. I have referred to the link below as…
1
vote
2 answers

Need help executing a Perl tokenizing script

I'm a Perl amateur. Recently I was given a Perl script that takes a text file and removes all formatting except for the individual words followed by a space. The problem is that it is unclear from the script how to input a file location. I've set up some…
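
The usual convention for such scripts is to pass the file location as a command-line argument (perl script.pl input.txt, read via @ARGV). A minimal sketch of the same word-extraction idea, written in Python rather than Perl; the file name is illustrative:

import sys

# usage: python strip_format.py input.txt
with open(sys.argv[1]) as handle:
    text = handle.read()

# collapse all formatting so only words separated by single spaces remain
print(" ".join(text.split()))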
1
vote
4 answers

Tokenizing a phone number in C

I'm trying to tokenize a phone number and split it into two arrays. It starts out as a string in the form "(515) 555-5555". I'm looking to tokenize the area code, the first 3 digits, and the last 4 digits. The area code I would store in one…
David Tamrazov
  • 567
  • 1
  • 5
  • 16
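
In Python, the same three-way split can be sketched with a regular expression whose groups capture the area code, prefix, and line number (in C the equivalent is typically done with sscanf or strtok):

import re

phone = "(515) 555-5555"
match = re.match(r"\((\d{3})\) (\d{3})-(\d{4})", phone)
if match:
    area_code, prefix, line_number = match.groups()
    print(area_code, prefix, line_number)  # 515 555 5555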
1
vote
3 answers

How to remove a custom word pattern from a text using NLTK with Python

I am currently working on a project analyzing the quality of examination paper questions. Here I am using Python 3.4 with NLTK. First I want to take out each question separately from the text. The question paper format is given below. (Q1).…
Punuth
  • 417
  • 3
  • 6
  • 19
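
Assuming each question is introduced by a marker like "(Q1)." (an assumption based on the excerpt), one approach is to split the paper on that marker with a regular expression before further NLTK processing. A minimal sketch:

import re

paper = "(Q1). Define tokenization. (Q2). Explain stemming with an example."
# split on markers of the form (Q<number>). and drop empty chunks
questions = [q.strip() for q in re.split(r"\(Q\d+\)\.", paper) if q.strip()]
print(questions)  # ['Define tokenization.', 'Explain stemming with an example.']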
1
vote
2 answers

Tokenizer not working

I am trying to tokenize a string to give an array of strings but it seems like my code is wrong. Here is my code: asmInstruction *tokenizeLine(char *charLine) { int words = countTokens(charLine); char *tokens = (char*)…
Nubcake
  • 449
  • 1
  • 8
  • 21
1
vote
1 answer

Lucene Analyzer tokenizer for substring search

I need a Lucene Tokenizer that can do the following. Given the string "wines bottle caps", the following queries should succeed: wine, bott, cap, ottl, aps, wine bottl. Here is what I have so far. How might I modify it to work? No query less than three…
Katedral Pillon
  • 14,534
  • 25
  • 99
  • 199
1
vote
1 answer

Lucene TextField not tokenized

I am saving the following title to the index: doc.add(new TextField(TITLE, "Button", Field.Store.YES)); Then when I search for it with, say, "butto", nothing returns. I must search for "button" to get anything back. What do I have to do so that any…
Katedral Pillon
  • 14,534
  • 25
  • 99
  • 199
1
vote
1 answer

Chinese tokenizer with Stanford CoreNLP

Can somebody help me use Stanford CoreNLP to tokenize Chinese text in Java? This is my code so far: File file = new File("example.txt"); file.createNewFile(); FileWriter fileWriter = new FileWriter(file); fileWriter.write("这是很好"); …
1
vote
1 answer

token_get_all and mathematical operators

I played around with token_get_all() once again and came across something "special": Given the following line of PHP code: array(3) { [0]=> …
TiMESPLiNTER
  • 5,741
  • 2
  • 28
  • 64