Questions tagged [text-processing]

Mechanizing the creation or manipulation of electronic text.

Text processing covers basic jobs such as filtering, tokenizing, or normalizing text. It is often a pre-processing step for later analysis.

1959 questions
26
votes
4 answers

Eliminate partially duplicate lines by column and keep the last one

I have a file that looks like this:
2011-03-21 name001 line1
2011-03-21 name002 line2
2011-03-21 name003 line3
2011-03-22 name002 line4
2011-03-22 name001 line5
For each name, I only want its last appearance. So, I expect the result to…
Dagang
  • 24,586
  • 26
  • 88
  • 133
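
A minimal sketch of one way to approach the question above, assuming the fields are whitespace-separated, the name is the second column, and the surviving lines should appear in the order of their last occurrence (the expected output is cut off in the excerpt, so that ordering is an assumption). The helper name `keep_last_by_key` is hypothetical.

```python
def keep_last_by_key(lines, key_index=1):
    """Return only the last line seen for each key, in order of that last appearance."""
    last_seen = {}
    for position, line in enumerate(lines):
        fields = line.split()
        if len(fields) <= key_index:
            continue  # skip blank or too-short lines
        # A later line with the same key overwrites the earlier entry.
        last_seen[fields[key_index]] = (position, line)
    # Sort by original position so the output follows the input order.
    return [line for _, line in sorted(last_seen.values())]


sample = [
    "2011-03-21 name001 line1",
    "2011-03-21 name002 line2",
    "2011-03-21 name003 line3",
    "2011-03-22 name002 line4",
    "2011-03-22 name001 line5",
]

result = keep_last_by_key(sample)
for line in result:
    print(line)
```

On the sample data this keeps line3, line4, and line5, one line per name.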
26
votes
2 answers

paste text with a newline/return in formatted text

I want to create a column that is formatted for use as a mailing address, and I cannot get the newline/carriage return or <br> to work when making a new column. name = c("John Smith", "Patty Smith", "Sam Smith") address = c("111 Main St.", "222 Main…
Spruce Island
  • 425
  • 1
  • 4
  • 10
25
votes
3 answers

How to proceed with NLP task for recognizing intent and slots

I wanted to write a program for asking questions about weather. What are the algorithms and techniques I should start looking at? For example: Will it be sunny this weekend in Chicago? I wanted to know the intent = weather query, date = this weekend,…
24
votes
2 answers

How to check if string only contains set of characters in Rust?

What is the idiomatic way in Rust to check if a string only contains a certain set of characters?
Aart Stuurman
  • 3,188
  • 4
  • 26
  • 44
24
votes
4 answers

How to display /proc/meminfo in Megabytes?

I want to thank you for helping me with my related issue. I know that if I do a cat /proc/meminfo it will only display in kB. How can I display it in MB? I really want to use cat or awk for this, please.
javanoob17
  • 243
  • 1
  • 3
  • 10
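
The question asks for cat or awk; the following is only a Python sketch of the same conversion, assuming each line has the form `Name: value kB`. Lines without a `kB` unit (such as `HugePages_Total`) are skipped. The function name `meminfo_in_mb` and the sample text are made up for illustration; on a Linux system the text would come from reading /proc/meminfo instead.

```python
def meminfo_in_mb(text):
    """Convert 'Name: <value> kB' lines into a dict of name -> value in MB."""
    result = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        # Only convert entries that actually carry a kB unit.
        if len(parts) == 2 and parts[1] == "kB":
            result[name.strip()] = int(parts[0]) / 1024
    return result


# Illustrative sample; replace with open("/proc/meminfo").read() on Linux.
sample = (
    "MemTotal:       16384000 kB\n"
    "MemFree:         8192000 kB\n"
    "HugePages_Total:       0\n"
)

converted = meminfo_in_mb(sample)
for name, megabytes in converted.items():
    print(f"{name}: {megabytes:.1f} MB")
```

Dividing by 1024 treats the reported kB as units of 1024 bytes, which is how the kernel reports them.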
23
votes
5 answers

Using SQL to determine word count stats of a text field

I've recently been working on some database search functionality and wanted to get some information like the average words per document (e.g. text field in the database). The only thing I have found so far (without processing in language of choice…
Rob
  • 7,377
  • 7
  • 36
  • 38
23
votes
3 answers

How to find out if a sentence is a question (interrogative)?

Is there an open-source Java library/algorithm for finding if a particular piece of text is a question or not? I am working on a question answering system that needs to analyze if the text input by the user is a question. I think the problem can…
nabeelmukhtar
  • 1,371
  • 15
  • 24
22
votes
7 answers

BufferedReader: read multiple lines into a single string

I'm reading numbers from a txt file using BufferedReader for analysis. The way I'm going about this now is: reading a line using .readLine, then splitting this string into an array of strings using .split public InputFile () { fileIn = null; …
S_Wheelan
  • 267
  • 2
  • 5
  • 8
21
votes
3 answers

What is the preferred way to implement 'yield' in Scala?

I am writing code for PhD research and starting to use Scala. I often have to do text processing. I am used to Python, whose 'yield' statement is extremely useful for implementing complex iterators over large, often irregularly structured…
Urban Vagabond
  • 7,282
  • 3
  • 28
  • 31
21
votes
4 answers

How to strip trailing whitespace in CMake variable?

We are trying to improve the makefiles produced by CMake. For Clang, GCC and ICC, we want to add -march=native. The block to do so looks like: # -march=native for GCC, Clang and ICC on i386, i486, i586, i686 and x86_64. message(STATUS,…
jww
  • 97,681
  • 90
  • 411
  • 885
21
votes
4 answers

How to strip headers/footers from Project Gutenberg texts?

I've tried various methods to strip the license from Project Gutenberg texts, for use as a corpus for a language learning project, but I can't seem to come up with an unsupervised, reliable approach. The best heuristic I've come up with so far is…
heartpunk
  • 2,235
  • 1
  • 21
  • 26
20
votes
4 answers

User Warning: Your stop_words may be inconsistent with your preprocessing

I am following this document clustering tutorial. As input I give a txt file which can be downloaded here. It's a combined file of 3 other txt files separated by \n. After creating a tf-idf matrix I received this warning: "UserWarning:…
20
votes
7 answers

Algorithm for Negating Sentences

I was wondering if anyone was familiar with any attempts at algorithmic sentence negation. For example, given a sentence like "This book is good" provide any number of alternative sentences meaning the opposite like "This book is not good" or even…
Kevin Dolan
  • 421
  • 1
  • 4
  • 7
20
votes
9 answers

Output text file with line breaks in PHP

I'm trying to open a text file and output its contents with the code below. The text file includes line breaks, but when I echo the file it's unformatted. How do I fix this? Thanks. $fh = fopen("filename.txt",…
usertest
  • 27,132
  • 30
  • 72
  • 94
20
votes
5 answers

How to get bag of words from textual data?

I am working on a prediction problem using a large textual dataset. I am implementing a Bag of Words model. What would be the best way to get the bag of words? Right now, I have the tf-idf of the various words, and the number of words is too large to use it…
hshed
  • 657
  • 2
  • 8
  • 21
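
As a rough illustration of the bag-of-words idea in the question above, here is a small sketch that caps the vocabulary at the most frequent words, which is one common way to keep the feature count manageable. The tokenizer, the helper names, and the `max_features` parameter are assumptions chosen for this example, not something taken from the question.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def build_vocabulary(documents, max_features):
    """Pick the max_features most frequent words across all documents."""
    totals = Counter()
    for doc in documents:
        totals.update(tokenize(doc))
    # Sort by count (descending), then alphabetically so ties are deterministic.
    ranked = sorted(totals.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ranked[:max_features]]


def bag_of_words(doc, vocabulary):
    """Count how often each vocabulary word occurs in one document."""
    counts = Counter(tokenize(doc))
    return [counts[word] for word in vocabulary]


docs = ["the cat sat on the mat", "the dog sat", "a cat and a dog"]
vocab = build_vocabulary(docs, max_features=4)
vectors = [bag_of_words(doc, vocab) for doc in docs]

print(vocab)
for vector in vectors:
    print(vector)
```

Each document becomes a fixed-length count vector over the capped vocabulary; words outside the vocabulary are simply ignored.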