Questions tagged [tokenize]

Tokenizing is the act of splitting a stream of text into discrete elements called tokens, typically using a delimiter present in the stream. These tokens can then be processed further, for example to search for a value or to assign them to an array for looping.

Example (VBA):

Dim tokens As Variant
Dim sampleString As String
Dim i As Long

sampleString = "The quick brown fox jumps over the lazy dog."

' tokenize string based on space delimiter
tokens = Split(sampleString, " ")

' list tokens
For i = LBound(tokens) To UBound(tokens)
  MsgBox tokens(i)
Next i

2964 questions
14 votes, 1 answer

How to apply NLTK word_tokenize library on a Pandas dataframe for Twitter data?

This is the code that I am using for semantic analysis of Twitter: import pandas as pd import datetime import numpy as np import re from nltk.tokenize import word_tokenize from nltk.corpus import stopwords from nltk.stem.wordnet import…
Vic13
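
A common approach is to apply word_tokenize to the text column with DataFrame.apply, so each cell becomes a list of tokens. A minimal sketch, assuming a hypothetical column named "text" and that the NLTK tokenizer models have been downloaded:

import pandas as pd
from nltk.tokenize import word_tokenize

# Assumes nltk.download('punkt') has already been run.
df = pd.DataFrame({"text": ["Just landed in NYC!", "Coffee first, tweets later."]})

# Apply the tokenizer row by row; each cell becomes a list of tokens.
df["tokens"] = df["text"].apply(word_tokenize)
print(df[["text", "tokens"]])
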
14 votes, 1 answer

How to use StandardTokenizer from Lucene 5.x.x

There are a lot of examples that show how to use the StandardTokenizer like this: TokenStream tokenStream = new StandardTokenizer( Version.LUCENE_36, new StringReader(input)); But in newer Lucene versions this constructor is…
samy
14 votes, 4 answers

How do I use NLTK's default tokenizer to get spans instead of strings?

NLTK's default tokenizer, nltk.word_tokenize, chains two tokenizers: a sentence tokenizer and then a word tokenizer that operates on sentences. It does a pretty good job out of the box. >>> nltk.word_tokenize("(Dr. Edwards is my friend.)") ['(',…
W.P. McNeill
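
word_tokenize itself returns only strings, but NLTK tokenizers that implement span_tokenize can yield character offsets instead. A minimal sketch using WhitespaceTokenizer (not the default tokenizer, but it shows the span API):

from nltk.tokenize import WhitespaceTokenizer

text = "(Dr. Edwards is my friend.)"
tokenizer = WhitespaceTokenizer()

# span_tokenize yields (start, end) character offsets rather than strings.
for start, end in tokenizer.span_tokenize(text):
    print((start, end), text[start:end])
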
14 votes, 3 answers

Search for a name (text) with spaces in Elasticsearch

Searching for names (text) with spaces in them is causing me problems. I have a mapping similar to "{"user":{"properties":{"name":{"type":"string"}}}}". Ideally it should return and rank results as follows: 1) bring to the top names that exactly match…
maaz
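
One common approach is a multi-field mapping: keep the name field analyzed for full-text search and add an unanalyzed keyword sub-field for exact matches that can be boosted at query time. A sketch of the idea for recent Elasticsearch versions, written as Python dicts; the field and sub-field names are illustrative:

# "name" is analyzed for full-text search; "name.raw" stores the
# value verbatim so exact matches can be ranked first.
mapping = {
    "mappings": {
        "properties": {
            "name": {
                "type": "text",
                "fields": {"raw": {"type": "keyword"}},
            }
        }
    }
}

# Query sketch: boost exact hits on "name.raw" above plain matches.
query = {
    "query": {
        "bool": {
            "should": [
                {"term": {"name.raw": {"value": "Bill Gates", "boost": 10}}},
                {"match": {"name": "Bill Gates"}},
            ]
        }
    }
}
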
13 votes, 7 answers

Split a string using whitespace in JavaScript?

I need a tokenizer that, given a string with arbitrary whitespace among the words, will create an array of words without empty substrings. For example, given the string: " I dont know what you mean by glory Alice said." I use: str2.split(" ") This also…
dokondr
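
The usual fix is to split on a run of whitespace rather than on a single space, trimming the string first so no empty substrings appear at the ends; in JavaScript that is str2.trim().split(/\s+/). The same idea sketched in Python:

import re

s = "  I  dont know what   you mean by  glory  Alice said. "

# Split on one-or-more whitespace characters after trimming the ends,
# so the result contains no empty substrings.
words = re.split(r"\s+", s.strip())
print(words)
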
13 votes, 2 answers

Read an input file line by line, tokenize using strtok(), and write the output into an output file

What I am trying to do is read a file LINE BY LINE, tokenize each line, and write the output into an output file. What I have been able to do is read the first line of the file, but my problem is that I am unable to read the next line to tokenize, so that it could…
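
The usual structure is one loop that reads a line at a time (fgets in C), tokenizes that line (strtok), and writes the tokens out before reading the next line. A conceptual sketch of the same flow in Python, with hypothetical file names:

# Hypothetical file names, for illustration only.
with open("input.txt") as src, open("output.txt", "w") as dst:
    for line in src:                         # read the input line by line
        tokens = line.split()                # tokenize the current line
        dst.write(" ".join(tokens) + "\n")   # write this line's tokens out
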
13 votes, 8 answers

Split a string with multiple delimiters using only String methods

I want to split a string into tokens. I ripped off another Stack Overflow question - Equivalent to StringTokenizer with multiple characters delimiters - but I want to know if this can be done with only string methods (.equals(), .startsWith(),…
Aditya Ramkumar
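
Without split() or regexes, the standard approach is a manual scan: walk the string character by character, test each character against the delimiter set, and close off a token whenever a delimiter is hit. A sketch of that scanning loop in Python; the same logic maps directly onto Java's charAt():

def tokenize(text, delimiters):
    # Manual scan with no split() or regex, only character tests.
    tokens, current = [], []
    for ch in text:
        if ch in delimiters:
            if current:                       # close the token in progress
                tokens.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:                               # flush the final token
        tokens.append("".join(current))
    return tokens

print(tokenize("a,b;;c d", ",; "))            # ['a', 'b', 'c', 'd']
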
13 votes, 4 answers

String tokenizer with multiple delimiters, including the delimiters, without Boost

I need to create a string parser in C++. I tried using vector<string> Tokenize(const string& strInput, const string& strDelims) { vector<string> vS; string strOne = strInput; string delimiters = strDelims; int startpos = 0; int pos =…
user2473015
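
When the delimiters themselves must appear in the output, one regex-based idea is to split with a capturing group, which keeps every delimiter as its own element; in C++ the usual non-Boost equivalent is a loop over string::find_first_of. The capturing-split idea sketched in Python:

import re

s = "foo+bar-baz*qux"

# The capturing group makes re.split keep the delimiters; drop the
# empty strings that appear when delimiters are adjacent.
parts = [p for p in re.split(r"([+\-*])", s) if p]
print(parts)   # ['foo', '+', 'bar', '-', 'baz', '*', 'qux']
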
13 votes, 3 answers

How can I split a string into tokens?

If I have a string 'x+13.5*10x-4e1' how can I split it into the following list of tokens? ['x', '+', '13', '.', '5', '*', '10', 'x', '-', '4', 'e', '1'] Currently I'm using the shlex module: str = 'x+13.5*10x-4e1' lexer =…
Martin Thetford
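
A regex alternation that matches runs of digits, runs of letters, or any single punctuation character produces exactly that token list, with no shlex needed. A minimal sketch:

import re

expr = "x+13.5*10x-4e1"

# Digit runs, letter runs, or any single character that is neither
# alphanumeric nor whitespace (the operators and the dot).
tokens = re.findall(r"\d+|[A-Za-z]+|[^\w\s]", expr)
print(tokens)   # ['x', '+', '13', '.', '5', '*', '10', 'x', '-', '4', 'e', '1']
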
13 votes, 4 answers

Java Lucene NGramTokenizer

I am trying to tokenize strings into n-grams. Strangely, in the documentation for the NGramTokenizer I do not see a method that will return the individual n-grams that were tokenized. In fact, I only see two methods in the NGramTokenizer class that return…
CodeKingPlusPlus
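
Lucene tokenizers expose their output through the TokenStream attribute API rather than through a method that returns a list of n-grams. Setting the Java API aside, the n-gram tokenization itself is just a sliding window over the string, sketched here in Python:

def char_ngrams(text, n):
    # Slide a window of width n across the string, one position at a time.
    return [text[i:i + n] for i in range(len(text) - n + 1)]

print(char_ngrams("tokenize", 3))
# ['tok', 'oke', 'ken', 'eni', 'niz', 'ize']
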
12 votes, 4 answers

C++ tokenize a string using a regular expression

I'm trying to teach myself some C++ from scratch at the moment. I'm well-versed in Python, Perl, and JavaScript, but have only encountered C++ briefly, in a classroom setting in the past. Please excuse the naivete of my question. I would like to…
Anonymous
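
In modern C++ this is usually done with std::sregex_token_iterator from <regex>. The pattern-driven idea is easiest to see sketched in Python, where the regex describes the separators:

import re

line = "alpha, beta;gamma  delta"

# Treat commas, semicolons, and whitespace runs as separators;
# the filter removes the empty string a leading separator would produce.
tokens = [t for t in re.split(r"[,;\s]+", line) if t]
print(tokens)   # ['alpha', 'beta', 'gamma', 'delta']
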
12 votes, 3 answers

How do I tokenize this string in Ruby?

I have this string: %{Children^10 Health "sanitation management"^5} and I want to tokenize it into an array of hashes: [{:keywords=>"children", :boost=>10}, {:keywords=>"health", :boost=>nil}, {:keywords=>"sanitation management",…
Radamanthus
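
A single pattern with an alternation handles both the quoted phrases and the bare words, each with an optional ^boost suffix; in Ruby this would be a String#scan, sketched here in Python with re.findall:

import re

s = 'Children^10 Health "sanitation management"^5'

# Either a quoted phrase or a bare word, optionally followed by ^<digits>.
pattern = r'(?:"([^"]+)"|([^\s^"]+))(?:\^(\d+))?'

result = [
    {"keywords": (phrase or word).lower(),
     "boost": int(boost) if boost else None}
    for phrase, word, boost in re.findall(pattern, s)
]
print(result)
# [{'keywords': 'children', 'boost': 10},
#  {'keywords': 'health', 'boost': None},
#  {'keywords': 'sanitation management', 'boost': 5}]
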
12 votes, 12 answers

Pythonic way to implement a tokenizer

I'm going to implement a tokenizer in Python and I was wondering if you could offer some style advice? I've implemented a tokenizer before in C and in Java so I'm fine with the theory, I'd just like to ensure I'm following pythonic styles and best…
Peter
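
One widely used pattern is a generator driven by a single compiled regex with named groups, so each match carries its own token type; this is essentially the tokenizer recipe from the re module documentation. A minimal sketch:

import re

TOKEN_SPEC = [
    ("NUMBER", r"\d+(?:\.\d+)?"),
    ("IDENT",  r"[A-Za-z_]\w*"),
    ("OP",     r"[+\-*/=]"),
    ("SKIP",   r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{rx})" for name, rx in TOKEN_SPEC))

def tokenize(text):
    # Yield (type, value) pairs lazily, skipping whitespace.
    for match in MASTER.finditer(text):
        if match.lastgroup != "SKIP":
            yield match.lastgroup, match.group()

print(list(tokenize("rate = rate * 1.05")))
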
12 votes, 2 answers

What are all the Japanese whitespace characters?

I need to split a string and extract words separated by whitespace characters. The source may be in English or Japanese. English whitespace characters include tab and space, and Japanese text uses these too. (IIRC, all widely-used Japanese character…
Mason
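
Besides ASCII space and tab, Japanese text commonly uses U+3000, the full-width ideographic space. In Python 3, str.split() with no arguments already splits on all Unicode whitespace, U+3000 included, as does the regex class \s:

# U+3000 is the ideographic (full-width) space used in Japanese text.
s = "これは\u3000テスト です\tdone"

# With no argument, str.split() treats any Unicode whitespace,
# including U+3000, as a separator and drops empty strings.
print(s.split())   # ['これは', 'テスト', 'です', 'done']
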
12 votes, 4 answers

Getting rid of stop words and document tokenization using NLTK

I'm having difficulty eliminating stop words and tokenizing a .text file using NLTK. I keep getting the following AttributeError: 'list' object has no attribute 'lower'. I just can't figure out what I'm doing wrong, although it's my first time doing…
Tiger1
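
That AttributeError almost always means .lower() was called on the token list itself rather than on each token; lower-casing belongs inside the loop or comprehension. A minimal sketch of the usual order of operations, with a hypothetical file name:

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Assumes nltk.download('punkt') and nltk.download('stopwords') have run.
stop_words = set(stopwords.words("english"))

with open("document.txt") as f:     # hypothetical file name
    text = f.read()

tokens = word_tokenize(text)

# Lower-case each token individually; tokens.lower() would raise
# AttributeError: 'list' object has no attribute 'lower'.
filtered = [t for t in tokens if t.lower() not in stop_words]
print(filtered[:20])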