Questions tagged [tokenize]

Tokenizing is the act of splitting a string into discrete elements called tokens.

Tokenizing is the act of splitting a stream of text into discrete elements called tokens, using a delimiter present in the stream. These tokens can then be processed further, for example by searching them for a value or by storing them in an array to loop over.

Example (VBA):

Dim tokens As Variant
Dim sampleString As String
Dim i As Long

sampleString = "The quick brown fox jumps over the lazy dog."

' tokenize string based on space delimiter
tokens = Split(sampleString, " ")

' list tokens
For i = LBound(tokens) To UBound(tokens)
  MsgBox tokens(i)
Next i
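
For comparison, here is a minimal Python sketch of the same delimiter-based approach, followed by the two follow-up steps mentioned above (looping over the tokens and searching them for a value). The helper name `tokenize` and its `keep_empty` parameter are hypothetical, introduced only for illustration. One difference from the VBA example is that this sketch drops empty tokens by default, whereas a plain split keeps an empty string between two consecutive delimiters.

```python
def tokenize(text, delimiter=" ", keep_empty=False):
    """Split text on a delimiter (hypothetical helper, for illustration only)."""
    parts = text.split(delimiter)
    if keep_empty:
        # Keep empty strings produced by consecutive delimiters.
        return parts
    # Drop empty strings produced by consecutive delimiters.
    return [part for part in parts if part]


sample_string = "The quick brown fox jumps over the lazy dog."

# tokenize string based on space delimiter
tokens = tokenize(sample_string)

# list tokens
for token in tokens:
    print(token)

# search the tokens for a value
has_fox = "fox" in tokens
```

Whether empty tokens should be kept depends on the data: for delimited records with optional fields, calling `tokenize(line, ",", keep_empty=True)` preserves the field positions.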

2964 questions
1
vote
2 answers

Preserve punctuation characters when using Lucene's StandardTokenizer

I'm thinking of leveraging Lucene's StandardTokenizer for word tokenization in a non-IR context. I understand that this tokenizer removes punctuation characters. Would anybody know (or happen to have experience with) making it output punctuation…
sam
  • 1,406
  • 2
  • 15
  • 25
1
vote
5 answers

How to count white spaces in a given argument?

I find it strange that spaceCount doesn't add up when the expression is "12 + 1". I get an output of 0 for spaceCount even though it should be 2. Any insight would be appreciated! public int countSpaces(String expr) { String tok = expr; int…
apebeast
  • 368
  • 3
  • 15
1
vote
4 answers

parsing a string of ascii text into separate variables

I have a piece of text that gets handed to me like: here is line one\n\nhere is line two\n\nhere is line three What I would like to do is break this string up into three separate variables. I'm not quite sure how one would go about accomplishing…
jml
  • 1,745
  • 6
  • 29
  • 55
1
vote
1 answer

How to add phrase as a stopword while using lucene analyzer?

I am using the Lucene 4.6.1 libraries. I am trying to add the phrase "hip hop" to my stopword exclusion list. I can exclude it if it's written as "hiphop" (one word), but when it's written as "hip hop" (with a space in between) I cannot exclude it. Below…
VP10
  • 127
  • 1
  • 2
  • 13
1
vote
0 answers

getline stops reading files after text file has multiple spaces

Hello, I am new to Stack Overflow, so please pardon any newbie mistakes I make. I have a program I am trying to build in C++ and I am running into some problems. This program is supposed to let the user input a file name and then read the file and…
ejs
  • 11
  • 2
1
vote
1 answer

Regex / "token_pattern" for scikit-learn text Vectorizer

I'm using sklearn to do some NLP vectorizing with a tf-idf Vectorizer object. This object can be constructed with a keyword, "token_pattern". I want to avoid hashtags (#foobar), numerics (and strings that begin with a number(s), i.e. 10mg), any line…
wbg
  • 866
  • 3
  • 14
  • 34
1
vote
2 answers

bad zip file error in POS tagging in NLTK in python

I am new to Python and NLTK. I want to do word tokenization and POS tagging with it. I installed NLTK 3.0 on Ubuntu 14.04, which has Python 2.7.6 by default. First I tried to tokenize a simple sentence, but I am getting an error saying that…
PRINCY
  • 13
  • 3
1
vote
1 answer

Finding token probabilities in a text in NLP

I came across the TokenizerME class in the OpenNLP documentation (http://opennlp.apache.org/documentation/manual/opennlp.html). I don't understand how it calculates the probabilities. I tested it with different inputs and still don't understand. Can…
Akash
  • 85
  • 1
  • 12
1
vote
1 answer

Balanced regular expression

So I got my hands on regular expressions and tried to match the outer {% tag xyz %}{% endtag %} tags of the following text: {% tag xyz %} {% tag abc %} {% endtag %} {% endtag %} My regular expression looks as follows…
techworker
  • 13
  • 2
1
vote
2 answers

C++ Tokenizing mathematical expression using classes

I'm attempting to re-learn bits about C++ inheritance as well as write a program that evaluates simple mathematical expressions (as strings) from scratch for practice, but I'm running into a lot of problems. My only prior experience with lexing…
chetlin
  • 23
  • 7
1
vote
2 answers

Tokenize c++ statements

I'm working on software for the formal verification of programs, where the user defines an algorithm written in C++ to be verified. Without going too much into the details of the subject matter, I will try to express as clearly as possible what I and my…
1
vote
2 answers

How to tokenize using regular expression such that regex for "everything else" does not match regex for "special tokens"?

I have the following kind of text that I want to tokenize. Text: Text1 Text2 I want to tokenize it into three kinds of tokens, COMMENT_START, COMMENT_END and OTHER. For example, for the above text, I want the…
Lone Learner
  • 18,088
  • 20
  • 102
  • 200
1
vote
1 answer

Validate the contents of uploaded files

I'm developing a "plug 'n play" system in which individual components can be registered and associated with an uploaded file using the application GUI. But to be really "plug 'n play", the application must recognize the component, and since each…
user753531
1
vote
2 answers

How to identify the end of a sentence

String x=" i am going to the party at 6.00 in the evening. are you coming with me?"; If I have the above string, I need it to be broken into sentences using sentence boundary punctuation (like . and ?), but it should not split the sentence at 6…
Chirath
  • 57
  • 1
  • 10
1
vote
4 answers

Tokenizer skipping blank values before the split - Java

I used Tokenizer to split a text file which was separated like so: FIRST NAME, MIDDLE NAME, LAST NAME harry, rob, park tom,,do while (File.hasNext()) { StringTokenizer strTokenizer = new StringTokenizer(File.nextLine(), ","); …
Axelotl
  • 13
  • 2