Questions tagged [text-processing]

Mechanizing the creation or manipulation of electronic text.

Text processing covers basic jobs such as filtering, tokenizing, or normalizing text. It is often used as a pre-processing step.

1959 questions
12
votes
1 answer

Negation handling in NLP

I'm currently working on a project, where I want to extract emotion from text. As I'm using conceptnet5 (a semantic network), I can't however simply prefix words in a sentence that contains a negation-word, as those words would simply not show up in…
Tim Daubenschütz
  • 2,053
  • 6
  • 23
  • 39
12
votes
7 answers

Python: Best Way to remove duplicate character from string

How can I remove duplicate characters from a string using Python? For example, let's say I have a string: foo = "SSYYNNOOPPSSIISS" How can I make the string: foo = SYNOPSIS I'm new to Python, and what I have tried is working. I knew there is…
Rahul Patil
  • 1,014
  • 3
  • 14
  • 30
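A minimal sketch for the example in this question, assuming that "duplicate" means a run of identical adjacent characters (which is what turns SSYYNNOOPPSSIISS into SYNOPSIS). The helper name collapse_repeats is made up for illustration; itertools.groupby is from the standard library.

```python
from itertools import groupby


def collapse_repeats(text):
    """Collapse each run of identical adjacent characters to one character."""
    # groupby yields one (key, group) pair per run of equal adjacent items,
    # so keeping only the keys drops the repeats inside each run.
    return "".join(char for char, _run in groupby(text))


foo = "SSYYNNOOPPSSIISS"
print(collapse_repeats(foo))  # SYNOPSIS
```

Non-adjacent repeats are kept on purpose: the second and third S in SYNOPSIS survive because they belong to separate runs.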
11
votes
2 answers

How To Use Backreference in Grep

I have a regular expression with a backreference. How can I use it in a bash script? For example, I want to print what matches (.*) in: grep -E "CONSTRAINT \`(.*)\` FOREIGN KEY" temp.txt If I apply it to CONSTRAINT `fk_dm` FOREIGN KEY I want to…
metdos
  • 13,411
  • 17
  • 77
  • 120
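As an illustration of pulling out just the captured group, here is a hedged sketch in Python rather than grep. It reuses the pattern from the question; the function name constraint_names and the sample lines are assumptions, not part of the original post.

```python
import re

# Same pattern as in the question; group 1 is the text between the backticks.
CONSTRAINT_RE = re.compile(r"CONSTRAINT `(.*)` FOREIGN KEY")


def constraint_names(lines):
    """Return the captured group for every line that matches the pattern."""
    names = []
    for line in lines:
        match = CONSTRAINT_RE.search(line)
        if match:
            names.append(match.group(1))
    return names


sample = ["CONSTRAINT `fk_dm` FOREIGN KEY", "no constraint here"]
print(constraint_names(sample))  # ['fk_dm']
```

To run it against a file such as temp.txt, the open file object can be passed in directly, since iterating over it yields lines.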
11
votes
4 answers

Java text classification problem

I have a set of Book objects. The class Book is defined as follows: class Book { String title; ArrayList taglist; } where title is the title of the book, for example: Javascript for dummies. and taglist is a list of tags for our example :…
Youssef
  • 1,310
  • 1
  • 14
  • 24
11
votes
2 answers

Nltk stanford pos tagger error : Java command failed

I'm trying to use the nltk.tag.stanford module for tagging a sentence (first as in the wiki's example) but I keep getting the following error: Traceback (most recent call last): File "test.py", line 28, in print st.tag(word_tokenize('What is…
Mazdak
  • 105,000
  • 18
  • 159
  • 188
11
votes
5 answers

Extract words surrounding a search word

I have this script that does a word search in text. The search works well and the results are as expected. What I'm trying to achieve is to extract the n words close to the match. For example: The world is a small place, we should try to take care of…
PepperoniPizza
  • 8,842
  • 9
  • 58
  • 100
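One possible sketch of the "n words around a match" idea, assuming whitespace tokenization and a case-insensitive comparison that ignores surrounding punctuation. The function name words_around is hypothetical, and the sample text is the truncated sentence from the question, used as-is.

```python
def words_around(text, target, n):
    """Return (before, after) word lists for each occurrence of target."""
    words = text.split()
    contexts = []
    for i, word in enumerate(words):
        # Compare without surrounding punctuation so "place," matches "place".
        if word.strip(".,;:!?\"'").lower() == target.lower():
            start = max(0, i - n)  # clamp at the beginning of the text
            contexts.append((words[start:i], words[i + 1:i + 1 + n]))
    return contexts


text = "The world is a small place, we should try to take care of"
print(words_around(text, "place", 3))
# [(['is', 'a', 'small'], ['we', 'should', 'try'])]
```

Near the start or end of the text the slices simply come back shorter than n, which avoids special-casing the boundaries.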
11
votes
1 answer

Python: PyEnchant and 64 bit Python

I am doing text processing. I need the PyEnchant library for verifying if a particular word in the text is a valid English word. However, it's only available for the 32 bit installation of Python. I need the 64 bit Python for handling memory issues…
user1839897
  • 425
  • 1
  • 10
  • 14
11
votes
1 answer

Effects of Stemming on the term frequency?

How are the term frequencies (TF), and inverse document frequency (IDF), affected by stop-word removal and stemming? Thanks!
Ataman
  • 2,530
  • 3
  • 22
  • 34
10
votes
2 answers

Using Keras Tokenizer to generate n-grams

Is it possible to use n-grams in Keras? E.g., the sentences are contained in the X_train dataframe, in the "sentences" column. I use the tokenizer from Keras in the following manner: tokenizer = Tokenizer(lower=True, split='…
Simplex
  • 1,723
  • 2
  • 17
  • 26
10
votes
3 answers

What is the difference between fit_transform and transform in sklearn countvectorizer?

I was recently working through the Kaggle bag of words introduction and I want to clear up a few things: using vectorizer.fit_transform( " * on the list of *cleaned* reviews* " ) Now, when we were preparing the bag of words array on the train reviews, we used…
Anurag Pandey
  • 373
  • 2
  • 5
  • 21
10
votes
1 answer

Using Stanford NER for extracting Address from a text document?

I was looking at Stanford NER and thinking of using its Java APIs to extract a postal address from a text document. The document may be any document that has a postal address section, e.g. utility bills or electricity bills. So what I am thinking as…
yadab
  • 2,063
  • 1
  • 16
  • 24
10
votes
1 answer

Extract emoticons from a text

I need to extract text emoticons from a text using Python and I've been looking for some solutions to do this but most of them like this or this only cover simple emoticons. I need to parse all of them. Currently I'm using a list of emoticons that I…
David Moreno García
  • 4,423
  • 8
  • 49
  • 82
10
votes
1 answer

Given a document, select a relevant snippet

When I ask a question here, the tool tips for the questions returned by the auto search give the first little bit of the question, but a decent percentage of them don't give any text that is any more useful for understanding the question than the…
BCS
  • 75,627
  • 68
  • 187
  • 294
10
votes
1 answer

Which function should I use to read unstructured text file into R?

This is my first ever question here and I'm new to R, trying to figure out my first step in how to do data processing, please keep it easy : ) I'm wondering what would be the best function and a useful data structure in R to load unstructured text…
user2942656
  • 117
  • 1
  • 1
  • 6
10
votes
10 answers

Finding dictionary words

I have a lot of compound strings that are a combination of two or three English words, e.g. "Spicejet" is a combination of the words "spice" and "jet". I need to separate these individual English words from such compound strings. My dictionary…
Manas
  • 589
  • 8
  • 18
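A rough sketch of one way to split such compounds, assuming the dictionary is available as a Python set of lowercase words. The tiny word set below is a stand-in for illustration only, and split_compound is a hypothetical helper. It tries every prefix that is a dictionary word and recurses on the remainder, allowing at most three parts as described in the question.

```python
def split_compound(compound, dictionary, max_parts=3):
    """Return every split of compound into 1..max_parts dictionary words."""
    compound = compound.lower()
    results = []

    def search(start, parts):
        if start == len(compound):
            # The whole string has been consumed: record this split.
            results.append(list(parts))
            return
        if len(parts) == max_parts:
            return  # out of parts, but characters remain
        for end in range(start + 1, len(compound) + 1):
            piece = compound[start:end]
            if piece in dictionary:
                parts.append(piece)
                search(end, parts)
                parts.pop()  # backtrack and try a longer piece

    search(0, [])
    return results


words = {"spice", "jet", "ice"}  # stand-in dictionary (assumption)
print(split_compound("Spicejet", words))  # [['spice', 'jet']]
```

With a real dictionary, some strings will have several valid splits, so the function returns all of them and leaves the choice (for example, preferring fewer or longer words) to the caller.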