Questions tagged [text-processing]

Mechanizing the creation or manipulation of electronic text.

Text processing includes basic processing jobs that apply filtering, tokenization, or normalization methods to text. It often serves as a pre-processing step.

1959 questions
5
votes
3 answers

How can I remove lines in one file that exist in another?

I have a file I get every day that has 10,000 records in it, 99% of which were in the last day's file. How can I use the macOS command line to remove the lines in the newer file that exist in the previous day's file? remove_duplicates newfile…
Chuck
  • 4,662
  • 2
  • 33
  • 55
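
One way to approach the question above, shown as a minimal Python sketch rather than a shell command: load the previous day's lines into a set, then keep only the lines of the newer file that are not in it. The function name `new_lines_only` and the sample data are hypothetical, and the sketch assumes lines must match exactly to count as duplicates.

```python
from typing import Iterable, Iterator


def new_lines_only(old_lines: Iterable[str], new_lines: Iterable[str]) -> Iterator[str]:
    """Yield lines from new_lines that do not appear in old_lines.

    Lines are compared exactly, after stripping the trailing newline.
    """
    # A set gives constant-time membership checks, so 10,000 records are cheap.
    seen = {line.rstrip("\n") for line in old_lines}
    for line in new_lines:
        stripped = line.rstrip("\n")
        if stripped not in seen:
            yield stripped


# Hypothetical sample data standing in for the two daily files.
yesterday = ["alpha\n", "beta\n"]
today = ["alpha\n", "gamma\n", "beta\n", "delta\n"]
print(list(new_lines_only(yesterday, today)))  # ['gamma', 'delta']
```

Open file objects iterate line by line, so they can be passed in place of the lists.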
5
votes
2 answers

C# Regex performance pure relative JS

I have had good experiences with the speed of regex in JS, so I decided to make a small comparison. I ran the following code: var str = "A regular expression is a pattern that the regular expression engine attempts to match in input text."; var re = new…
dovid
  • 6,354
  • 3
  • 33
  • 73
5
votes
5 answers

How to remove the last line from a variable in bash or sh?

I have a variable that contains a few lines. I would like to remove the last line from the contents of the variable. I searched the internet, but all the links talk about removing the last line from a file. Here is the content of my variable $echo…
Alex Raj Kaliamoorthy
  • 2,035
  • 3
  • 29
  • 46
5
votes
0 answers

How to predict next word in sentence using ngram model in R

I have pre-processed text data into a corpus. I would now like to build a prediction model based on the previous 2 words (so I think a 3-gram model?). Based on my understanding of the articles I have read, here is how I am thinking of doing it: step…
heyydrien
  • 971
  • 1
  • 11
  • 28
5
votes
4 answers

Estimating the word count of a file without reading the full file

I have a program that processes very large files. Now I need a progress bar to show the progress of the processing. The program works at the word level, reading one line at a time, splitting it into words, and processing the words one by one. So while…
Abhinav Sarkar
  • 23,534
  • 11
  • 81
  • 97
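
One possible way to get such an estimate, sketched in Python under the assumption that word density is roughly uniform throughout the file: count the words in a leading sample and scale by the total file size. The names `estimate_word_count` and `sample_bytes` are hypothetical, and the result is only an approximation.

```python
import os


def estimate_word_count(path: str, sample_bytes: int = 1 << 20) -> int:
    """Estimate the number of words in a file from a leading sample.

    Assumes the sample is representative of the whole file.
    """
    total_size = os.path.getsize(path)
    if total_size == 0:
        return 0
    with open(path, "rb") as f:
        sample = f.read(sample_bytes)
    # bytes.split() with no argument splits on runs of ASCII whitespace.
    words_in_sample = len(sample.split())
    # Scale the sample's word density up to the full file size.
    return round(words_in_sample * total_size / len(sample))
```

A progress bar that tracks bytes consumed instead of words avoids the estimate altogether, at the cost of no longer reporting progress in words.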
5
votes
2 answers

Remove characters from a string after a certain word - excel

I've got a list of imported data that is formatted as follows in an Excel / Google spreadsheet. In column A I have the full data, and in B I'm trying to strip out the data to the left of the word ON. FULL DATA | STRIPPED…
sam
  • 9,486
  • 36
  • 109
  • 160
5
votes
1 answer

What's the difference between indicative summarization and informative summarization?

I have trouble distinguishing between indicative summarization and informative summarization. Can you give me a clear example that shows the difference between them? Thanks in advance!
Chelsea_cole
  • 1,055
  • 3
  • 15
  • 21
5
votes
1 answer

Exploding UpperCasedCamelCase to Upper Cased Camel Case in PHP

Right now, I am implementing this with a split, slice, and implosion: $exploded = implode(' ',array_slice(preg_split('/(?=[A-Z])/','ThisIsATest'),1)); //$exploded = "This Is A Test" Prettier version: $capital_split =…
Austin Hyde
  • 26,347
  • 28
  • 96
  • 129
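
For comparison, a Python analogue of the split described above (the question itself is about PHP). This is a sketch only: it inserts a space at every zero-width position that precedes an uppercase letter, except at the start of the string. The function name `split_camel_case` is hypothetical.

```python
import re


def split_camel_case(name: str) -> str:
    """Insert a space before every uppercase letter except the first character."""
    # (?=[A-Z]) matches the empty position before an uppercase letter;
    # (?<!^) excludes the position at the very start of the string.
    return re.sub(r"(?<!^)(?=[A-Z])", " ", name)


print(split_camel_case("ThisIsATest"))  # This Is A Test
```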
5
votes
4 answers

How can I parse an email header with python?

Here's an example email header, header = """ From: Media Temple user (mt.kb.user@gmail.com) Subject: article: A sample header Date: January 25, 2011 3:30:58 PM PDT To: user@example.com Return-Path: Envelope-To:…
All Іѕ Vаиітy
  • 24,861
  • 16
  • 87
  • 111
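
A minimal sketch of one way to read such a header with Python's standard `email` package. The sample string reuses only the complete fields visible in the excerpt above, and `parse_header` is a hypothetical helper name. The parser returns field values as plain strings, so the non-standard date format is not interpreted.

```python
from email import message_from_string


def parse_header(raw: str) -> dict:
    """Parse RFC 822 style 'Name: value' header lines into a dict."""
    msg = message_from_string(raw)
    # items() yields (name, value) pairs in their original order.
    return dict(msg.items())


# Sample built from the fields shown in the question excerpt.
raw_header = (
    "From: Media Temple user (mt.kb.user@gmail.com)\n"
    "Subject: article: A sample header\n"
    "Date: January 25, 2011 3:30:58 PM PDT\n"
    "To: user@example.com\n"
)

fields = parse_header(raw_header)
print(fields["Subject"])  # article: A sample header
```

Note that `dict()` keeps only one value per field name, so repeated fields such as `Received` would need `msg.get_all(name)` instead.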
5
votes
5 answers

What's the fastest way to strip and replace a document of high unicode characters using Python?

I am looking to replace from a large document all high unicode characters, such as accented Es, left and right quotes, etc., with "normal" counterparts in the low range, such as a regular 'E', and straight quotes. I need to perform this on a very…
Rhubarb
  • 3,893
  • 6
  • 41
  • 55
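
A sketch of one common approach, not necessarily the fastest: map punctuation that has no ASCII decomposition through a translation table, then use `unicodedata.normalize` to split accented letters into a base letter plus combining marks and drop the marks. The table below is a small illustrative sample, and `to_ascii` is a hypothetical name.

```python
import unicodedata

# Characters that do not decompose to ASCII need an explicit mapping.
# This table is an illustrative sample, not a complete list.
PUNCTUATION_MAP = str.maketrans({
    "\u2018": "'",   # left single quotation mark
    "\u2019": "'",   # right single quotation mark
    "\u201c": '"',   # left double quotation mark
    "\u201d": '"',   # right double quotation mark
    "\u2013": "-",   # en dash
    "\u2014": "-",   # em dash
})


def to_ascii(text: str) -> str:
    """Replace curly quotes and dashes, then strip accents from letters."""
    text = text.translate(PUNCTUATION_MAP)
    # NFKD splits an accented letter into the base letter and combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    # Encoding with errors="ignore" drops everything that is still non-ASCII.
    return decomposed.encode("ascii", "ignore").decode("ascii")


print(to_ascii("\u00c9cole \u201ccaf\u00e9\u201d"))  # Ecole "cafe"
```

Characters that neither appear in the table nor decompose to ASCII are silently removed, which may or may not be acceptable for a given document.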
5
votes
2 answers

tm custom removePunctuation except hashtag

I have a corpus of tweets from Twitter. I clean this corpus (removeWords, tolower, delete URLs) and finally also want to remove punctuation. Here is my code: tweetCorpus <- tm_map(tweetCorpus, removePunctuation, preserve_intra_word_dashes =…
feder80
  • 1,195
  • 3
  • 13
  • 34
5
votes
3 answers

tf-idf: am I understanding it right?

I am interested in doing some document clustering, and right now I am considering using TF-IDF for this. If I am not wrong, TF-IDF is particularly used for evaluating the relevance of a document given a query. If I do not have a particular query,…
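
To make the weighting concrete, here is a minimal sketch that builds a TF-IDF vector for each document without any query, which is the form a clustering step would consume. It assumes whitespace tokenization, term frequency normalized by document length, and the plain `log(N / df)` form of IDF; other variants exist. The function name `tfidf_vectors` is hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List


def tfidf_vectors(docs: List[str]) -> List[Dict[str, float]]:
    """Return one {term: tf-idf weight} mapping per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: the number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors


vectors = tfidf_vectors(["apple banana", "apple cherry"])
# "apple" occurs in every document, so its weight is 0 in both vectors.
print(vectors[0])
```

Terms shared by all documents receive a weight of zero, so the vectors emphasize the terms that distinguish documents from one another.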
5
votes
6 answers

Sed script to edit csv file Or Python

In our project we need to import a CSV file into Postgres. There are multiple types of files, meaning the length of the file changes: some files have fewer columns and some have all of them. We need a fast way to import this file into Postgres. I…
Sujit
  • 2,403
  • 4
  • 30
  • 36
5
votes
2 answers

Unicode Strings in Ruby 1.9

I've written a Ruby script that reads a file (File.read()) containing Unicode characters, and it works fine from the command line. However, when I try to put it into an Automator Workflow (Mac OS X), I get this error: 2009-12-23 17:55:15…
Jeffrey Aylesworth
  • 8,242
  • 9
  • 40
  • 57
5
votes
2 answers

Clustering algorithm appropriate for very small clusters

I am trying to find duplicates in a list of about 5000 records. Each record is a person's name and address, but all typed inconsistently into one field, so I'm trying a fuzzy matching approach. My methodology (using RapidMiner) is to do some…
aquavitae
  • 17,414
  • 11
  • 63
  • 106