Questions tagged [text-processing]

Mechanizing the creation or manipulation of electronic text.

Text processing covers basic jobs such as filtering, tokenizing, or normalizing text. It can serve as a pre-processing step.

1959 questions
8
votes
3 answers

Find the most common line in a file in bash

I have a file of strings: string-string-123 string-string-123 string-string-123 string-string-12345 string-string-12345 string-string-12345-123 How do I retrieve the most common line in bash (string-string-123)?
Alex
  • 83
  • 1
  • 5
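
For the question above, one possible approach is to count each distinct line and keep the most frequent one. The sketch below is in Python rather than bash, and the function name `most_common_line` is a hypothetical helper, not something from the question.

```python
from collections import Counter
from typing import Iterable, Optional


def most_common_line(lines: Iterable[str]) -> Optional[str]:
    """Return the line that occurs most often, or None for empty input."""
    # Strip the trailing newline so "abc\n" and "abc" count as the same line.
    counts = Counter(line.rstrip("\n") for line in lines)
    if not counts:
        return None
    # most_common(1) returns a list holding one (line, count) pair.
    return counts.most_common(1)[0][0]


sample = [
    "string-string-123\n",
    "string-string-123\n",
    "string-string-123\n",
    "string-string-12345\n",
    "string-string-12345\n",
    "string-string-12345-123\n",
]
print(most_common_line(sample))
```

The same function accepts an open file object, since iterating over a file yields its lines.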
8
votes
7 answers

How to select multiple lines from a file or from pipe in a script?

I'd like to have a script, called lines.sh that I can pipe data to to select a series of lines. For example, if I had the following file: test.txt a b c d Then I could run: cat test.txt | lines 2,4 and it would output b d I'm using zsh, but…
Brad Parks
  • 66,836
  • 64
  • 257
  • 336
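
A minimal sketch of the line selection described above, written in Python rather than zsh. It assumes the argument is a comma-separated list of 1-based line numbers such as `2,4`; the names `parse_spec` and `select_lines` are hypothetical.

```python
from typing import Iterable, Iterator, Set


def parse_spec(spec: str) -> Set[int]:
    """Turn a spec such as "2,4" into the set {2, 4}."""
    return {int(part) for part in spec.split(",") if part.strip()}


def select_lines(lines: Iterable[str], wanted: Set[int]) -> Iterator[str]:
    """Yield only the lines whose 1-based position is in `wanted`."""
    for number, line in enumerate(lines, start=1):
        if number in wanted:
            yield line


data = ["a\n", "b\n", "c\n", "d\n"]
print("".join(select_lines(data, parse_spec("2,4"))), end="")
```

To use this in a pipe, a script could pass `sys.stdin` as `lines` and `sys.argv[1]` as the spec.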
8
votes
2 answers

How to improve text processing performance in Clojure?

I'm writing a simple desktop search engine in Clojure as a way to learn more about the language. So far, the performance during the text processing phase of my program is really bad. During text processing I have to: Clean up unwanted…
luisgabriel
  • 2,483
  • 1
  • 15
  • 10
8
votes
3 answers

Classifying Text Based on Groups of Keywords?

I have a list of requirements for a software project, assembled from the remains of its predecessor. Each requirement should map to one or more categories. Each of the categories consists of a group of keywords. What I'm trying to do is find an…
technomalogical
  • 2,982
  • 2
  • 26
  • 43
8
votes
3 answers

How to get nth column with regexp delimiter

Basically, I get a line from the ls -la command: -rw-r--r-- 13 ondrejodchazel staff 442 Dec 10 16:23 some_file and want to get the size of the file (442). I have tried the cut and sed commands, but was unsuccessful. Using just basic UNIX tools (cut, sed, awk...),…
Ondra
  • 3,100
  • 5
  • 37
  • 44
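
One way to read the question above is as splitting a line on a regular-expression delimiter and picking a field by position. The sketch below shows this in Python rather than with cut, sed, or awk. The helper `nth_column` is hypothetical, and it assumes columns are separated by runs of whitespace.

```python
import re


def nth_column(line: str, n: int, delimiter: str = r"\s+") -> str:
    """Return the n-th (1-based) field of `line`, split on a regex delimiter."""
    # strip() avoids an empty leading field when the line starts with spaces.
    fields = re.split(delimiter, line.strip())
    return fields[n - 1]


listing = "-rw-r--r-- 13 ondrejodchazel staff 442 Dec 10 16:23 some_file"
print(nth_column(listing, 5))
```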
7
votes
6 answers

BLEU score implementation for sentence similarity detection

I need to calculate a BLEU score to identify whether two sentences are similar or not. I have read some articles, which are mostly about using the BLEU score to measure machine translation accuracy. But I need a BLEU score to find out similarity…
KNsiva
  • 377
  • 2
  • 8
  • 19
7
votes
3 answers

String splitting data.table column produces NAs

This is my first question on SO so let me know if it can be improved. I am working on a natural language processing project in R and am trying to build a data.table that contains test cases. Here, I build a much simplified example: texts.dt <-…
7
votes
3 answers

Reading text values into matlab variables from ASCII files

Consider the following file var1 var2 variable3 1 2 3 11 22 33 I would like to load the numbers into a matrix, and the column titles into a variable that would be equivalent to: variable_names = char('var1', 'var2', 'variable3'); I…
Boris Gorelik
  • 29,945
  • 39
  • 128
  • 170
7
votes
3 answers

Loading text data in Octave with specific format

I have a data set that I would like to store and be able to load in Octave 18.0 8 307.0 130.0 3504. 12.0 70 1 "chevrolet chevelle malibu" 15.0 8 350.0 165.0 3693. 11.5 70 1 "buick skylark 320" 18.0 …
user317706
  • 2,077
  • 3
  • 19
  • 18
7
votes
10 answers

How to remove all attributes from html?

I have raw html with some css classes inside for various tags. Example: Input:

Lorem ipsum dolor sit amet, consectetur adipisicing elit. Neque molestias natus iste labore a accusamus dolorum vel.

and I…
Pavel Binar
  • 2,096
  • 5
  • 16
  • 26
7
votes
7 answers

Can you really build a fast word processor with GoF Design Patterns?

The Gang of Four's Design Patterns uses a word processor as an example for at least a few of their patterns, particularly Composite and Flyweight. Other than by using C or C++, could you really use those patterns and the object-oriented overhead…
Mark Cidade
  • 98,437
  • 31
  • 224
  • 236
7
votes
4 answers

Is there a python module for regex matching in zip files

I have over a million text files compressed into 40 zip files. I also have a list of about 500 model names of phones. I want to find out the number of times a particular model was mentioned in the text files. Is there any python module which can do…
cnu
  • 36,135
  • 23
  • 65
  • 63
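
The task described above could be sketched with the standard-library `zipfile` and `re` modules, reading each archive member without extracting it to disk. The function name `count_mentions` is hypothetical. The sketch assumes the members are UTF-8 text and that matching should be exact and case-sensitive.

```python
import re
import zipfile
from collections import Counter
from typing import Sequence


def count_mentions(archive, names: Sequence[str]) -> Counter:
    """Count occurrences of each name across all files in a zip archive.

    `archive` may be a path or a binary file-like object.
    """
    counts: Counter = Counter()
    if not names:
        return counts
    # Longest names first, so "X 10 Pro" is preferred over "X 10".
    ordered = sorted(names, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(name) for name in ordered))
    with zipfile.ZipFile(archive) as zf:
        for info in zf.infolist():
            if info.is_dir():
                continue
            # Stream the member line by line instead of loading it whole.
            with zf.open(info) as member:
                for raw in member:
                    text = raw.decode("utf-8", errors="replace")
                    for match in pattern.finditer(text):
                        counts[match.group(0)] += 1
    return counts
```

Building one alternation from all names means each line is scanned once rather than once per name, which may matter with about 500 names and over a million files.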
7
votes
5 answers

Extract text between two strings repeatedly using sed or awk?

I have a file called 'plainlinks' that looks like this: 13080. ftp://ftp3.ncdc.noaa.gov/pub/data/noaa/999999-94092-2012.gz 13081. ftp://ftp3.ncdc.noaa.gov/pub/data/noaa/999999-94094-2012.gz 13082.…
Mike Furlender
  • 3,869
  • 5
  • 47
  • 75
6
votes
3 answers

How to flip text horizontally?

I need to write a function that will flip all the characters of a string left-to-right, e.g.: Thė quiçk ḇrown fox jumṕềᶁ ovểr thë lⱥzy ȡog. should become .goȡ yzⱥl ëht rểvo ᶁềṕmuj xof nworḇ kçiuq ėhT I can limit the question to UTF-16 (which…
Ian Boyd
  • 246,734
  • 253
  • 869
  • 1,219
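
A naive reversal of code points would detach combining marks from their base characters. The Python sketch below keeps each base character together with the combining marks that follow it, then reverses those groups. It is a simplification and not full Unicode grapheme cluster segmentation, and the name `flip_text` is hypothetical.

```python
import unicodedata


def flip_text(text: str) -> str:
    """Reverse text while keeping combining marks attached to their base."""
    clusters = []
    for ch in text:
        # A non-zero combining class marks a character that modifies the
        # preceding base character, so it joins the current cluster.
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return "".join(reversed(clusters))


# "e" followed by U+0301 (combining acute accent), then "x".
print(flip_text("e\u0301x") == "xe\u0301")
```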
6
votes
4 answers

How to extract data from a text file using R or PowerShell?

I have a text file containing data like this: This is just text ------------------------------- Username: SOMETHI C: [Text] Account: DFAG Finish time: 1-JAN-2011 00:31:58.91 Process…
jrara
  • 16,239
  • 33
  • 89
  • 120