Questions tagged [tokenize]

Tokenizing is the act of splitting a stream of text into discrete elements called tokens, typically using a delimiter present in the stream. These tokens can then be processed further, for example to search for a value or to assign them to an array for looping.

Example (VBA):

Dim tokens As Variant
Dim sampleString As String
Dim i As Long

sampleString = "The quick brown fox jumps over the lazy dog."

' tokenize string based on space delimiter
tokens = Split(sampleString, " ")

' list tokens
For i = LBound(tokens) To UBound(tokens)
  MsgBox tokens(i)
Next i

2964 questions
14 votes, 1 answer

How to apply NLTK word_tokenize library on a Pandas dataframe for Twitter data?

This is the code that I am using for semantic analysis of Twitter: import pandas as pd import datetime import numpy as np import re from nltk.tokenize import word_tokenize from nltk.corpus import stopwords from nltk.stem.wordnet import…
Vic13
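
A common approach is to apply word_tokenize to the text column with DataFrame.apply, so each cell becomes a list of tokens. A minimal sketch, assuming a hypothetical column named "text" and that the NLTK tokenizer models have been downloaded:

import pandas as pd
from nltk.tokenize import word_tokenize

# Assumes nltk.download('punkt') has already been run.
df = pd.DataFrame({"text": ["Just landed in NYC!", "Coffee first, tweets later."]})

# Apply the tokenizer row by row; each cell becomes a list of tokens.
df["tokens"] = df["text"].apply(word_tokenize)
print(df[["text", "tokens"]])
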
14 votes, 1 answer

How to use StandardTokenizer from Lucene 5.x.x

There are a lot of examples that show how to use the StandardTokenizer like this: TokenStream tokenStream = new StandardTokenizer( Version.LUCENE_36, new StringReader(input)); But in newer Lucene versions this constructor is…
samy
14 votes, 4 answers

How do I use NLTK's default tokenizer to get spans instead of strings?

NLTK's default tokenizer, nltk.word_tokenize, chains two tokenizers: a sentence tokenizer and then a word tokenizer that operates on sentences. It does a pretty good job out of the box. >>> nltk.word_tokenize("(Dr. Edwards is my friend.)") ['(',…
W.P. McNeill
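
word_tokenize itself returns only strings, but NLTK tokenizers that implement span_tokenize can yield character offsets instead. A minimal sketch using WhitespaceTokenizer (not the default tokenizer, but it shows the span API):

from nltk.tokenize import WhitespaceTokenizer

text = "(Dr. Edwards is my friend.)"
tokenizer = WhitespaceTokenizer()

# span_tokenize yields (start, end) character offsets rather than strings.
for start, end in tokenizer.span_tokenize(text):
    print((start, end), text[start:end])
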
14 votes, 3 answers

Search for a name (text) with spaces in Elasticsearch

Searching for names (text) with spaces in them is causing me problems. I have a mapping similar to "{"user":{"properties":{"name":{"type":"string"}}}}". Ideally it should return and rank results as follows: 1) bring to the top names that exactly match…
maaz
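
One common approach is a multi-field mapping: keep the name field analyzed for full-text search and add an unanalyzed keyword sub-field for exact matches that can be boosted at query time. A sketch of the idea for recent Elasticsearch versions, written as Python dicts; the field and sub-field names are illustrative:

# "name" is analyzed for full-text search; "name.raw" stores the
# value verbatim so exact matches can be ranked first.
mapping = {
    "mappings": {
        "properties": {
            "name": {
                "type": "text",
                "fields": {"raw": {"type": "keyword"}},
            }
        }
    }
}

# Query sketch: boost exact hits on "name.raw" above plain matches.
query = {
    "query": {
        "bool": {
            "should": [
                {"term": {"name.raw": {"value": "Bill Gates", "boost": 10}}},
                {"match": {"name": "Bill Gates"}},
            ]
        }
    }
}
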
13 votes, 7 answers

Split a string using whitespace in JavaScript?

I need a tokenizer that, given a string with arbitrary whitespace among the words, will create an array of words without empty substrings. For example, given the string: " I dont know what you mean by glory Alice said." I use: str2.split(" ") This also…
dokondr
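
The usual fix is to split on a run of whitespace rather than on a single space, trimming the string first so no empty substrings appear at the ends; in JavaScript that is str2.trim().split(/\s+/). The same idea sketched in Python:

import re

s = "  I  dont know what   you mean by  glory  Alice said. "

# Split on one-or-more whitespace characters after trimming the ends,
# so the result contains no empty substrings.
words = re.split(r"\s+", s.strip())
print(words)
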
13 votes, 2 answers

Read an input file line by line, tokenize using strtok(), and write the output into an output file

What I am trying to do is read a file LINE BY LINE, tokenize each line, and write the output into an output file. What I have been able to do is read the first line of the file, but my problem is that I am unable to read the next line to tokenize, so that it could…
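
The usual structure is one loop that reads a line at a time (fgets in C), tokenizes that line (strtok), and writes the tokens out before reading the next line. A conceptual sketch of the same flow in Python, with hypothetical file names:

# Hypothetical file names, for illustration only.
with open("input.txt") as src, open("output.txt", "w") as dst:
    for line in src:                         # read the input line by line
        tokens = line.split()                # tokenize the current line
        dst.write(" ".join(tokens) + "\n")   # write this line's tokens out
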
13 votes, 8 answers

Split a string with multiple delimiters using only String methods

I want to split a string into tokens. I ripped off another Stack Overflow question - Equivalent to StringTokenizer with multiple characters delimiters - but I want to know if this can be done with only string methods (.equals(), .startsWith(),…
Aditya Ramkumar
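
Without split() or regexes, the standard approach is a manual scan: walk the string character by character, test each character against the delimiter set, and close off a token whenever a delimiter is hit. A sketch of that scanning loop in Python; the same logic maps directly onto Java's charAt():

def tokenize(text, delimiters):
    # Manual scan with no split() or regex, only character tests.
    tokens, current = [], []
    for ch in text:
        if ch in delimiters:
            if current:                       # close the token in progress
                tokens.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:                               # flush the final token
        tokens.append("".join(current))
    return tokens

print(tokenize("a,b;;c d", ",; "))            # ['a', 'b', 'c', 'd']
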
13 votes, 4 answers

String tokenizer with multiple delimiters, including the delimiters, without Boost

I need to create a string parser in C++. I tried using vector<string> Tokenize(const string& strInput, const string& strDelims) { vector<string> vS; string strOne = strInput; string delimiters = strDelims; int startpos = 0; int pos =…
user2473015
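
When the delimiters themselves must appear in the output, one regex-based idea is to split with a capturing group, which keeps every delimiter as its own element; in C++ the usual non-Boost equivalent is a loop over string::find_first_of. The capturing-split idea sketched in Python:

import re

s = "foo+bar-baz*qux"

# The capturing group makes re.split keep the delimiters; drop the
# empty strings that appear when delimiters are adjacent.
parts = [p for p in re.split(r"([+\-*])", s) if p]
print(parts)   # ['foo', '+', 'bar', '-', 'baz', '*', 'qux']
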
13 votes, 3 answers

How can I split a string into tokens?

If I have a string 'x+13.5*10x-4e1' how can I split it into the following list of tokens? ['x', '+', '13', '.', '5', '*', '10', 'x', '-', '4', 'e', '1'] Currently I'm using the shlex module: str = 'x+13.5*10x-4e1' lexer =…
Martin Thetford
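
A regex alternation that matches runs of digits, runs of letters, or any single punctuation character produces exactly that token list, with no shlex needed. A minimal sketch:

import re

expr = "x+13.5*10x-4e1"

# Digit runs, letter runs, or any single character that is neither
# alphanumeric nor whitespace (the operators and the dot).
tokens = re.findall(r"\d+|[A-Za-z]+|[^\w\s]", expr)
print(tokens)   # ['x', '+', '13', '.', '5', '*', '10', 'x', '-', '4', 'e', '1']
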
13 votes, 4 answers

Java Lucene NGramTokenizer

I am trying to tokenize strings into n-grams. Strangely, in the documentation for the NGramTokenizer I do not see a method that will return the individual n-grams that were tokenized. In fact, I only see two methods in the NGramTokenizer class that return…
CodeKingPlusPlus
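
Lucene tokenizers expose their output through the TokenStream attribute API rather than through a method that returns a list of n-grams. Setting the Java API aside, the n-gram tokenization itself is just a sliding window over the string, sketched here in Python:

def char_ngrams(text, n):
    # Slide a window of width n across the string, one position at a time.
    return [text[i:i + n] for i in range(len(text) - n + 1)]

print(char_ngrams("tokenize", 3))
# ['tok', 'oke', 'ken', 'eni', 'niz', 'ize']
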
12 votes, 4 answers

C++ tokenize a string using a regular expression

I'm trying to teach myself some C++ from scratch at the moment. I'm well-versed in Python, Perl, and JavaScript, but have only encountered C++ briefly, in a classroom setting in the past. Please excuse the naivete of my question. I would like to…
Anonymous
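
In modern C++ this is usually done with std::sregex_token_iterator from <regex>. The pattern-driven idea is easiest to see sketched in Python, where the regex describes the separators:

import re

line = "alpha, beta;gamma  delta"

# Treat commas, semicolons, and whitespace runs as separators;
# the filter removes the empty string a leading separator would produce.
tokens = [t for t in re.split(r"[,;\s]+", line) if t]
print(tokens)   # ['alpha', 'beta', 'gamma', 'delta']
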
12 votes, 3 answers

How do I tokenize this string in Ruby?

I have this string: %{Children^10 Health "sanitation management"^5} and I want to tokenize it into an array of hashes: [{:keywords=>"children", :boost=>10}, {:keywords=>"health", :boost=>nil}, {:keywords=>"sanitation management",…
Radamanthus
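
A single pattern with an alternation handles both the quoted phrases and the bare words, each with an optional ^boost suffix; in Ruby this would be a String#scan, sketched here in Python with re.findall:

import re

s = 'Children^10 Health "sanitation management"^5'

# Either a quoted phrase or a bare word, optionally followed by ^<digits>.
pattern = r'(?:"([^"]+)"|([^\s^"]+))(?:\^(\d+))?'

result = [
    {"keywords": (phrase or word).lower(),
     "boost": int(boost) if boost else None}
    for phrase, word, boost in re.findall(pattern, s)
]
print(result)
# [{'keywords': 'children', 'boost': 10},
#  {'keywords': 'health', 'boost': None},
#  {'keywords': 'sanitation management', 'boost': 5}]
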
12 votes, 12 answers

Pythonic way to implement a tokenizer

I'm going to implement a tokenizer in Python and I was wondering if you could offer some style advice? I've implemented a tokenizer before in C and in Java so I'm fine with the theory, I'd just like to ensure I'm following pythonic styles and best…
Peter
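
One widely used pattern is a generator driven by a single compiled regex with named groups, so each match carries its own token type; this is essentially the tokenizer recipe from the re module documentation. A minimal sketch:

import re

TOKEN_SPEC = [
    ("NUMBER", r"\d+(?:\.\d+)?"),
    ("IDENT",  r"[A-Za-z_]\w*"),
    ("OP",     r"[+\-*/=]"),
    ("SKIP",   r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{rx})" for name, rx in TOKEN_SPEC))

def tokenize(text):
    # Yield (type, value) pairs lazily, skipping whitespace.
    for match in MASTER.finditer(text):
        if match.lastgroup != "SKIP":
            yield match.lastgroup, match.group()

print(list(tokenize("rate = rate * 1.05")))
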
12 votes, 2 answers

What are all the Japanese whitespace characters?

I need to split a string and extract words separated by whitespace characters. The source may be in English or Japanese. English whitespace characters include tab and space, and Japanese text uses these too. (IIRC, all widely-used Japanese character…
Mason
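
Besides ASCII space and tab, Japanese text commonly uses U+3000, the full-width ideographic space. In Python 3, str.split() with no arguments already splits on all Unicode whitespace, U+3000 included, as does the regex class \s:

# U+3000 is the ideographic (full-width) space used in Japanese text.
s = "これは\u3000テスト です\tdone"

# With no argument, str.split() treats any Unicode whitespace,
# including U+3000, as a separator and drops empty strings.
print(s.split())   # ['これは', 'テスト', 'です', 'done']
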
12 votes, 4 answers

Getting rid of stop words and document tokenization using NLTK

I'm having difficulty eliminating stop words and tokenizing a .text file using NLTK. I keep getting the following AttributeError: 'list' object has no attribute 'lower'. I just can't figure out what I'm doing wrong, although it's my first time doing…
Tiger1
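
That AttributeError almost always means .lower() was called on the token list itself rather than on each token; lower-casing belongs inside the loop or comprehension. A minimal sketch of the usual order of operations, with a hypothetical file name:

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Assumes nltk.download('punkt') and nltk.download('stopwords') have run.
stop_words = set(stopwords.words("english"))

with open("document.txt") as f:     # hypothetical file name
    text = f.read()

tokens = word_tokenize(text)

# Lower-case each token individually; tokens.lower() would raise
# AttributeError: 'list' object has no attribute 'lower'.
filtered = [t for t in tokens if t.lower() not in stop_words]
print(filtered[:20])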