Questions tagged [tokenize]

Tokenizing is the act of splitting a stream of text into discrete elements called tokens, using a delimiter present in the stream. These tokens can then be processed further, for example to search for a value or to assign them to an array for looping.

Example (VBA):

Dim tokens As Variant
Dim sampleString As String
Dim i As Long

sampleString = "The quick brown fox jumps over the lazy dog."

' tokenize string based on space delimiter
tokens = Split(sampleString, " ")

' list tokens
For i = LBound(tokens) To UBound(tokens)
  MsgBox tokens(i)
Next i
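
Example (Python), equivalent to the above, using the built-in str.split:

sampleString = "The quick brown fox jumps over the lazy dog."

# tokenize string based on space delimiter
tokens = sampleString.split(" ")

# list tokens
for token in tokens:
    print(token)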

2964 questions
1 vote, 2 answers

Is it legitimate for a tokenizer to have a stack?

I have designed a new language for which I want to write a reasonable lexer and parser. For the sake of brevity, I have reduced this language to a minimum so that my questions are still open. The language has implicit and explicit strings, arrays…
Henning • 579 • 6 • 17
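
A mode stack inside a tokenizer is a common pattern when lexical contexts nest (strings containing interpolated code, here-docs and the like). A minimal Python sketch of the idea, with hypothetical token names and string syntax not taken from the asker's language:

def tokenize(text):
    modes = ["code"]                      # the tokenizer's stack
    tokens = []
    for c in text:
        if modes[-1] == "code":
            if c == '"':
                modes.append("string")    # enter nested string mode
                tokens.append(("STR_BEGIN", c))
            elif not c.isspace():
                tokens.append(("CHAR", c))
        else:                             # string mode
            if c == '"':
                modes.pop()               # leave string mode
                tokens.append(("STR_END", c))
            else:
                tokens.append(("STR_CHAR", c))
    return tokens

print(tokenize('say "hi"'))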
1 vote, 1 answer

Tokenizing tweets in Python

enneg3clear.txt is a file with Tweets without punctuation and stopwords on every line. import re, string import sys #this code tokenizes input_file = 'enneg3clear.txt' with open(input_file) as f: lines = f.readlines() results = [] texts =…
mvh • 189 • 1 • 2 • 20
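
A minimal sketch of the setup the excerpt describes, assuming one tweet per line and plain whitespace tokenization (the asker's downstream code is truncated above):

# read tweets, one per line, then tokenize each on whitespace
input_file = 'enneg3clear.txt'
with open(input_file) as f:
    lines = f.readlines()

texts = [line.split() for line in lines]   # one token list per tweet
print(texts[:2])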
1 vote, 1 answer

Hadoop Map Reduce Query

I was trying to use Hadoop MapReduce to calculate the sum of the weights of all incoming edges for each node in a graph. The input is in .tsv format and it looks like: src tgt weight X 102 1 X 200 1 X 123 5 Y 245 1 Y 101 1 Z 99 2 X …
Gautam • 11 • 3
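
Hadoop specifics aside, the map/reduce logic the excerpt asks for can be sketched in plain Python: the map step emits (tgt, weight) for every "src tgt weight" line, and the reduce step sums the weights per target node. The file name here is hypothetical:

from collections import defaultdict

sums = defaultdict(int)
with open("graph.tsv") as f:   # hypothetical input file
    next(f)                    # skip the "src tgt weight" header row
    for line in f:
        src, tgt, weight = line.split()
        sums[tgt] += int(weight)   # reduce: sum per target node

for node, total in sums.items():
    print(node, total)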
1 vote, 1 answer

Using python rdflib parsers without the graph object

Loading RDF data in Python looks like this: from rdflib import Graph g = Graph() g.parse("demo.nt", format="nt") But what about using the format parsers standalone as streaming parsers, getting a stream of parsed tokens? Can someone give me a…
chiborg • 26,978 • 14 • 97 • 115
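
For N-Triples specifically, older rdflib versions let you drive the parser with a custom sink instead of a Graph, which yields a stream of parsed triples. A sketch, assuming the NTriplesParser import below exists in the installed version:

from rdflib.plugins.parsers.ntriples import NTriplesParser

class PrintSink:
    # rdflib calls triple(s, p, o) for each statement it parses
    def triple(self, s, p, o):
        print(s, p, o)

parser = NTriplesParser(PrintSink())
with open("demo.nt", "rb") as f:
    parser.parse(f)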
1 vote, 1 answer

Parsing an expression with tokens

I want to write a recursive descent parser for the following grammar: term ---> FINAL | FUNCTION_A (term, term) | FUNCTION_B (term, term) Currently I am struggling with the FUNCTION part, since I don't know how to handle cases where a command…
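
For that grammar a recursive descent parser needs one function for the term nonterminal, and the FUNCTION alternatives are picked by dispatching on the lookahead token. A Python sketch over a pre-tokenized input, with token names taken from the grammar above:

class Parser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos]

    def expect(self, kind):
        if self.peek() != kind:
            raise SyntaxError(f"expected {kind}, got {self.peek()}")
        self.pos += 1

    # term ---> FINAL | FUNCTION_A (term, term) | FUNCTION_B (term, term)
    def term(self):
        tok = self.peek()
        if tok == "FINAL":
            self.expect("FINAL")
            return ("FINAL",)
        if tok in ("FUNCTION_A", "FUNCTION_B"):
            self.expect(tok)
            self.expect("(")
            left = self.term()
            self.expect(",")
            right = self.term()
            self.expect(")")
            return (tok, left, right)
        raise SyntaxError(f"unexpected token {tok}")

print(Parser(["FUNCTION_A", "(", "FINAL", ",", "FINAL", ")"]).term())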
1 vote, 1 answer

StanfordNLP Tokenizer

I use StanfordNLP to tokenize a set of messages written on smartphones. These texts have a lot of typos and do not respect the punctuation rules. Very often the blank spaces are missing, affecting the tokenization. For instance, the following…
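
Independent of StanfordNLP, one common workaround is to reinsert the missing spaces with a regular expression before tokenizing; a sketch of that pre-processing step:

import re

text = "hello,world.how are you?fine"
# insert a space after sentence punctuation glued to the next word
fixed = re.sub(r'([.,!?])(?=\S)', r'\1 ', text)
print(fixed)   # hello, world. how are you? fine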
1 vote, 1 answer

R - Tokenization - single and two letter words in a TermDocumentMatrix

I am currently trying to do a little bit of text processing and I would like to get the one- and two-letter words in a TermDocumentMatrix. The issue is that it seems to include only words of three letters or more. library(tm) library(RWeka) …
Robert • 13 • 4
1 vote, 3 answers

Reversed offset tokenizer

I have a string to tokenize. Its form is HHmmssff, where H, m, s, f are digits. It's supposed to be tokenized into four 2-digit numbers, but I need it to also accept short-hand forms, like sff, so it interprets it as 00000sff. I wanted to use…
macbirdie • 16,086 • 6 • 47 • 54
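
Because the short forms are anchored at the right, left-padding the string to eight digits and slicing fixed-width pairs is enough; a Python sketch of that approach:

def tokenize_time(s):
    s = s.rjust(8, "0")                        # "sff" -> "00000sff"
    return [s[i:i + 2] for i in range(0, 8, 2)]

print(tokenize_time("12345678"))   # ['12', '34', '56', '78']
print(tokenize_time("578"))        # ['00', '00', '05', '78']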
1 vote, 4 answers

Tokenize method: Split string into array

I've been really struggling with a programming assignment. Basically, we have to write a program that translates a sentence in English into one in Pig Latin. The first method we need is one to tokenize the string, and we are not allowed to use the…
IH9522 • 11 • 2
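
The manual tokenizer such assignments expect is usually a character scan that accumulates letters and flushes a word at each space. A Python sketch of the algorithm (the assignment itself may be in another language, but the logic carries over):

def tokenize(sentence):
    tokens, word = [], ""
    for ch in sentence:
        if ch == " ":
            if word:              # flush the word collected so far
                tokens.append(word)
                word = ""
        else:
            word += ch
    if word:                      # last word has no trailing space
        tokens.append(word)
    return tokens

print(tokenize("the quick brown fox"))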
1 vote, 1 answer

Ragel - how to return one token at a time

I want to build a one-token-per-call Ragel grammar / thing. I'm relatively new to Ragel (but not new to compilers, etc.). I've written a grammar for a JSON-like notation (three levels deep). It emits C code. My input comes in complete strings (no…
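
Setting Ragel's generated C aside, the general one-token-per-call shape is a resumable scanner that keeps its position between calls; in Python the natural expression of that pattern is a generator. A sketch of the pattern, not of Ragel itself:

def tokens(s):
    # yields one token per resume; the scan position survives
    # between calls inside the generator
    i = 0
    while i < len(s):
        j = i
        while j < len(s) and not s[j].isspace():
            j += 1
        if j > i:
            yield s[i:j]
        i = j + 1

t = tokens("a bb ccc")
print(next(t))   # a
print(next(t))   # bb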
1 vote, 1 answer

How to fix token pattern in scikit-learn?

I am using TfidfVectorizer from scikit-learn to extract features, And the settings are: def tokenize(text): tokens = nltk.word_tokenize(text) stems = [] for token in tokens: token = re.sub("[^a-zA-Z]","", token) …
James • 153 • 3 • 15
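
The tokenize in the excerpt strips non-letters, which leaves empty strings (for digits-only tokens, for instance) that end up as bogus features. Filtering them out before returning, and passing the function through TfidfVectorizer's tokenizer parameter, is the usual fix; a sketch (requires NLTK's punkt data):

import re
import nltk
from sklearn.feature_extraction.text import TfidfVectorizer

def tokenize(text):
    tokens = nltk.word_tokenize(text)      # needs nltk.download('punkt')
    stems = [re.sub("[^a-zA-Z]", "", t) for t in tokens]
    return [s for s in stems if s]         # drop emptied tokens

vec = TfidfVectorizer(tokenizer=tokenize)
X = vec.fit_transform(["Sample text, 123 tokens!"])
print(vec.get_feature_names_out())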
1 vote, 1 answer

How to use Start States in ML-Lex?

I am creating a tokeniser in ML-Lex, a part of the definition of which is datatype lexresult = STRING | STRINGOP | EOF val error = fn x => TextIO.output(TextIO.stdOut,x ^ "\n") val eof = fn () =>…
Chandan • 166 • 2 • 18
1 vote, 0 answers

Regular Expression for Validating & Tokenizing String

I am trying to develop a regular expression for a special case: it should accept strings separated by a '.' (dot). The first four parts are always STG.FTG.ADG.Common. The fifth part can be one or more alphanumeric words separated by a '.', so it could be…
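
A sketch of such a pattern in Python: anchor the fixed STG.FTG.ADG.Common prefix, require one or more dot-separated alphanumeric parts after it, and capture the tail so it can be tokenized. The sample value is hypothetical:

import re

pattern = re.compile(r'^STG\.FTG\.ADG\.Common((?:\.[A-Za-z0-9]+)+)$')

m = pattern.match("STG.FTG.ADG.Common.Foo.Bar42")
if m:
    parts = m.group(1).lstrip(".").split(".")   # tokenize the tail
    print(parts)                                # ['Foo', 'Bar42']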
1 vote, 1 answer

Tokenized output of C source code

I want to look at the tokenized output of my C source code. The preprocessor first processes the cpp directives, and then the C source code is tokenized. Then this tokenized output is parsed. After that the assembler does its job and the process continues. I…
Rishit Sanmukhani • 2,159 • 16 • 26
1 vote, 1 answer

Why is my vector empty?

I want to create a simple inverted index. I have a file with docIds and the keywords that are in each document. So the first step is to try and read the file and tokenize the text file. I found a tokenize function online that was supposed to work…
captain • 1,747 • 5 • 20 • 32
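
Language aside (the excerpt is about C++), the inverted-index construction itself is straightforward once each line is tokenized: the first token is the docId and the rest are keywords. A Python sketch with a hypothetical input format:

from collections import defaultdict

index = defaultdict(set)
lines = ["d1 apple banana", "d2 banana cherry"]   # "docId kw kw ..." per line
for line in lines:
    doc_id, *keywords = line.split()
    for kw in keywords:
        index[kw].add(doc_id)          # keyword -> set of docIds

print(sorted(index["banana"]))         # ['d1', 'd2']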