Questions tagged [tokenize]

Tokenizing is the act of splitting a stream of text into discrete elements called tokens, using a delimiter present in the stream. These tokens can then be processed further, for example to search for a value or to assign them to an array for looping.

Example (VBA):

Dim tokens As Variant
Dim sampleString As String
Dim i As Long

sampleString = "The quick brown fox jumps over the lazy dog."

' tokenize string based on space delimiter
tokens = Split(sampleString, " ")

' list tokens
For i = LBound(tokens) To UBound(tokens)
  MsgBox tokens(i)
Next i
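
Example (Python), equivalent to the above, using the built-in str.split:

sampleString = "The quick brown fox jumps over the lazy dog."

# tokenize string based on space delimiter
tokens = sampleString.split(" ")

# list tokens
for token in tokens:
    print(token)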

2964 questions
1 vote, 2 answers

Is it legitimate for a tokenizer to have a stack?

I have designed a new language for which I want to write a reasonable lexer and parser. For the sake of brevity, I have reduced this language to a minimum so that my questions are still open. The language has implicit and explicit strings, arrays…
Henning • 579 • 6 • 17
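
A mode stack inside a tokenizer is a common pattern when lexical contexts nest (strings containing interpolated code, here-docs and the like). A minimal Python sketch of the idea, with hypothetical token names and string syntax not taken from the asker's language:

def tokenize(text):
    modes = ["code"]                      # the tokenizer's stack
    tokens = []
    for c in text:
        if modes[-1] == "code":
            if c == '"':
                modes.append("string")    # enter nested string mode
                tokens.append(("STR_BEGIN", c))
            elif not c.isspace():
                tokens.append(("CHAR", c))
        else:                             # string mode
            if c == '"':
                modes.pop()               # leave string mode
                tokens.append(("STR_END", c))
            else:
                tokens.append(("STR_CHAR", c))
    return tokens

print(tokenize('say "hi"'))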
1 vote, 1 answer

Tokenizing tweets in Python

enneg3clear.txt is a file with Tweets without punctuation and stopwords on every line. import re, string import sys #this code tokenizes input_file = 'enneg3clear.txt' with open(input_file) as f: lines = f.readlines() results = [] texts =…
mvh • 189 • 1 • 2 • 20
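
A minimal sketch of the setup the excerpt describes, assuming one tweet per line and plain whitespace tokenization (the asker's downstream code is truncated above):

# read tweets, one per line, then tokenize each on whitespace
input_file = 'enneg3clear.txt'
with open(input_file) as f:
    lines = f.readlines()

texts = [line.split() for line in lines]   # one token list per tweet
print(texts[:2])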
1 vote, 1 answer

Hadoop Map Reduce Query

I was trying to use Hadoop MapReduce to calculate the sum of the weights of all incoming edges for each node in a graph. The input is in .tsv format and it looks like: src tgt weight X 102 1 X 200 1 X 123 5 Y 245 1 Y 101 1 Z 99 2 X …
Gautam • 11 • 3
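
Hadoop specifics aside, the map/reduce logic the excerpt asks for can be sketched in plain Python: the map step emits (tgt, weight) for every "src tgt weight" line, and the reduce step sums the weights per target node. The file name here is hypothetical:

from collections import defaultdict

sums = defaultdict(int)
with open("graph.tsv") as f:   # hypothetical input file
    next(f)                    # skip the "src tgt weight" header row
    for line in f:
        src, tgt, weight = line.split()
        sums[tgt] += int(weight)   # reduce: sum per target node

for node, total in sums.items():
    print(node, total)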
1 vote, 1 answer

Using python rdflib parsers without the graph object

Loading RDF data in Python looks like this: from rdflib import Graph g = Graph() g.parse("demo.nt", format="nt") But what about using the format parsers standalone as streaming parsers, getting a stream of parsed tokens? Can someone give me a…
chiborg • 26,978 • 14 • 97 • 115
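
For N-Triples specifically, older rdflib versions let you drive the parser with a custom sink instead of a Graph, which yields a stream of parsed triples. A sketch, assuming the NTriplesParser import below exists in the installed version:

from rdflib.plugins.parsers.ntriples import NTriplesParser

class PrintSink:
    # rdflib calls triple(s, p, o) for each statement it parses
    def triple(self, s, p, o):
        print(s, p, o)

parser = NTriplesParser(PrintSink())
with open("demo.nt", "rb") as f:
    parser.parse(f)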
1 vote, 1 answer

Parsing an expression with tokens

I want to write a recursive descent parser for the following grammar: term ---> FINAL | FUNCTION_A (term, term) | FUNCTION_B (term, term) Currently I am struggling with the FUNCTION part, since I don't know how to handle cases where a command…
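
For that grammar a recursive descent parser needs one function for the term nonterminal, and the FUNCTION alternatives are picked by dispatching on the lookahead token. A Python sketch over a pre-tokenized input, with token names taken from the grammar above:

class Parser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos]

    def expect(self, kind):
        if self.peek() != kind:
            raise SyntaxError(f"expected {kind}, got {self.peek()}")
        self.pos += 1

    # term ---> FINAL | FUNCTION_A (term, term) | FUNCTION_B (term, term)
    def term(self):
        tok = self.peek()
        if tok == "FINAL":
            self.expect("FINAL")
            return ("FINAL",)
        if tok in ("FUNCTION_A", "FUNCTION_B"):
            self.expect(tok)
            self.expect("(")
            left = self.term()
            self.expect(",")
            right = self.term()
            self.expect(")")
            return (tok, left, right)
        raise SyntaxError(f"unexpected token {tok}")

print(Parser(["FUNCTION_A", "(", "FINAL", ",", "FINAL", ")"]).term())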
1 vote, 1 answer

StanfordNLP Tokenizer

I use StanfordNLP to tokenize a set of messages written on smartphones. These texts have a lot of typos and do not respect the punctuation rules. Very often the blank spaces are missing, affecting the tokenization. For instance, the following…
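
Independent of StanfordNLP, one common workaround is to reinsert the missing spaces with a regular expression before tokenizing; a sketch of that pre-processing step:

import re

text = "hello,world.how are you?fine"
# insert a space after sentence punctuation glued to the next word
fixed = re.sub(r'([.,!?])(?=\S)', r'\1 ', text)
print(fixed)   # hello, world. how are you? fine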
1 vote, 1 answer

R - Tokenization - single and two letter words in a TermDocumentMatrix

I am currently trying to do a little bit of text processing and I would like to get the one- and two-letter words in a TermDocumentMatrix. The issue is that it seems to include only words of three letters or more. library(tm) library(RWeka) …
Robert • 13 • 4
1 vote, 3 answers

Reversed offset tokenizer

I have a string to tokenize. Its form is HHmmssff, where H, m, s, f are digits. It's supposed to be tokenized into four 2-digit numbers, but I need it to also accept short-hand forms, like sff, so it interprets it as 00000sff. I wanted to use…
macbirdie • 16,086 • 6 • 47 • 54
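
Because the short forms are anchored at the right, left-padding the string to eight digits and slicing fixed-width pairs is enough; a Python sketch of that approach:

def tokenize_time(s):
    s = s.rjust(8, "0")                        # "sff" -> "00000sff"
    return [s[i:i + 2] for i in range(0, 8, 2)]

print(tokenize_time("12345678"))   # ['12', '34', '56', '78']
print(tokenize_time("578"))        # ['00', '00', '05', '78']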
1 vote, 4 answers

Tokenize method: Split string into array

I've been really struggling with a programming assignment. Basically, we have to write a program that translates a sentence in English into one in Pig Latin. The first method we need is one to tokenize the string, and we are not allowed to use the…
IH9522 • 11 • 2
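
The manual tokenizer such assignments expect is usually a character scan that accumulates letters and flushes a word at each space. A Python sketch of the algorithm (the assignment itself may be in another language, but the logic carries over):

def tokenize(sentence):
    tokens, word = [], ""
    for ch in sentence:
        if ch == " ":
            if word:              # flush the word collected so far
                tokens.append(word)
                word = ""
        else:
            word += ch
    if word:                      # last word has no trailing space
        tokens.append(word)
    return tokens

print(tokenize("the quick brown fox"))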
1 vote, 1 answer

Ragel - how to return one token at a time

I want to build a one-token-per-call Ragel grammar / thing. I'm relatively new to Ragel (but not new to compilers, etc.). I've written a grammar for a JSON-like notation (three levels deep). It emits C code. My input comes in complete strings (no…
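
Setting Ragel's generated C aside, the general one-token-per-call shape is a resumable scanner that keeps its position between calls; in Python the natural expression of that pattern is a generator. A sketch of the pattern, not of Ragel itself:

def tokens(s):
    # yields one token per resume; the scan position survives
    # between calls inside the generator
    i = 0
    while i < len(s):
        j = i
        while j < len(s) and not s[j].isspace():
            j += 1
        if j > i:
            yield s[i:j]
        i = j + 1

t = tokens("a bb ccc")
print(next(t))   # a
print(next(t))   # bb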
1 vote, 1 answer

How to fix token pattern in scikit-learn?

I am using TfidfVectorizer from scikit-learn to extract features, And the settings are: def tokenize(text): tokens = nltk.word_tokenize(text) stems = [] for token in tokens: token = re.sub("[^a-zA-Z]","", token) …
James • 153 • 3 • 15
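
The tokenize in the excerpt strips non-letters, which leaves empty strings (for digits-only tokens, for instance) that end up as bogus features. Filtering them out before returning, and passing the function through TfidfVectorizer's tokenizer parameter, is the usual fix; a sketch (requires NLTK's punkt data):

import re
import nltk
from sklearn.feature_extraction.text import TfidfVectorizer

def tokenize(text):
    tokens = nltk.word_tokenize(text)      # needs nltk.download('punkt')
    stems = [re.sub("[^a-zA-Z]", "", t) for t in tokens]
    return [s for s in stems if s]         # drop emptied tokens

vec = TfidfVectorizer(tokenizer=tokenize)
X = vec.fit_transform(["Sample text, 123 tokens!"])
print(vec.get_feature_names_out())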
1 vote, 1 answer

How to use Start States in ML-Lex?

I am creating a tokeniser in ML-Lex, a part of the definition of which is datatype lexresult = STRING | STRINGOP | EOF val error = fn x => TextIO.output(TextIO.stdOut,x ^ "\n") val eof = fn () =>…
Chandan • 166 • 2 • 18
1 vote, 0 answers

Regular Expression for Validating & Tokenizing String

I am trying to develop a regular expression for a special case: it should accept strings separated by a '.' (dot). The first four parts are always STG.FTG.ADG.Common. The fifth part can be one or more alphanumeric words separated by a '.', so it could be…
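
A sketch of such a pattern in Python: anchor the fixed STG.FTG.ADG.Common prefix, require one or more dot-separated alphanumeric parts after it, and capture the tail so it can be tokenized. The sample value is hypothetical:

import re

pattern = re.compile(r'^STG\.FTG\.ADG\.Common((?:\.[A-Za-z0-9]+)+)$')

m = pattern.match("STG.FTG.ADG.Common.Foo.Bar42")
if m:
    parts = m.group(1).lstrip(".").split(".")   # tokenize the tail
    print(parts)                                # ['Foo', 'Bar42']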
1 vote, 1 answer

Tokenized output of C source code

I want to look at the tokenized output of my C source code. The preprocessor first processes the cpp directives, and then the C source code is tokenized. Then this tokenized output is parsed. After that the assembler does its job and the process continues. I…
Rishit Sanmukhani • 2,159 • 16 • 26
1 vote, 1 answer

Why is my vector empty?

I want to create a simple inverted index. I have a file with docIds and the keywords that are in each document. So the first step is to try and read the file and tokenize the text file. I found a tokenize function online that was supposed to work…
captain • 1,747 • 5 • 20 • 32
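
Language aside (the excerpt is about C++), the inverted-index construction itself is straightforward once each line is tokenized: the first token is the docId and the rest are keywords. A Python sketch with a hypothetical input format:

from collections import defaultdict

index = defaultdict(set)
lines = ["d1 apple banana", "d2 banana cherry"]   # "docId kw kw ..." per line
for line in lines:
    doc_id, *keywords = line.split()
    for kw in keywords:
        index[kw].add(doc_id)          # keyword -> set of docIds

print(sorted(index["banana"]))         # ['d1', 'd2']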