Questions tagged [text-processing]

Mechanizing the creation or manipulation of electronic text.

Text processing includes basic jobs that use filtering, tokenization, or normalization methods to process text. It is often a pre-processing step for later work.
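
As a rough illustration of the three kinds of jobs named in the description (normalization, tokenization, filtering), here is a minimal Python sketch. The function names, the stop-word list, and the token pattern are all assumptions made for this example; they do not come from any question on this page, and real pipelines usually need language-specific rules.

```python
import re
import unicodedata

# Hypothetical, deliberately tiny stop-word list used only for this sketch.
STOP_WORDS = {"a", "an", "the", "and", "or", "of", "to"}


def normalize(text):
    """Apply Unicode NFKC normalization, lowercase, and collapse whitespace."""
    text = unicodedata.normalize("NFKC", text).lower()
    return re.sub(r"\s+", " ", text).strip()


def tokenize(text):
    """Split normalized text into ASCII word tokens, keeping simple apostrophes."""
    return re.findall(r"[a-z0-9]+(?:'[a-z0-9]+)?", text)


def filter_tokens(tokens, stop_words=STOP_WORDS):
    """Drop tokens that appear in the stop-word set."""
    return [t for t in tokens if t not in stop_words]


def preprocess(text):
    """Run normalization, tokenization, and filtering in that order."""
    return filter_tokens(tokenize(normalize(text)))


if __name__ == "__main__":
    print(preprocess("The  Quick brown fox, and the lazy dog's bone."))
```

Keeping each stage as its own function makes it easy to swap one out, for example replacing the regular-expression tokenizer with a library tokenizer, without touching the others.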

1959 questions
6
votes
4 answers

Low performance with BufferedReader

I am processing a number of text files line by line using BufferedReader.readLine(). Two files have the same size (130 MB), but one takes 40 sec to process while the other takes 75 sec. I noticed one file has 1.8 million lines while the other has 2.1…
samarth
  • 3,866
  • 7
  • 45
  • 60
6
votes
0 answers

What is the purpose of the Configurations2 directory inside of an ODF-Document?

The directory is not mentioned in the OASIS-Specification of ODF. Does anyone know the purpose of this directory? Its structure is as…
AlexTheBird
  • 677
  • 4
  • 16
6
votes
1 answer

Parallel Computation for Create_Matrix 'RTextTools' package

I am creating a DocumentTermMatrix using create_matrix() from RTextTools and create container and model based on that. It is for extremely large datasets. I do this for each category (factor levels). So for each category it has to run matrix,…
6
votes
1 answer

I cannot understand the skipgrams() function in keras

I am trying to understand the skipgrams() function in keras by using the following code from keras.preprocessing.text import * from keras.preprocessing.sequence import skipgrams text = "I love money" #My test sentence tokenizer =…
Raven Cheuk
  • 2,903
  • 4
  • 27
  • 54
6
votes
8 answers

"Absolute" string metric

I have a huge (but finite) set of natural language strings. I need a way to convert each string to a numeric value. For any given string the value must be the same every time. The more "different" two given strings are, the more different two…
Alexander Gladysh
  • 39,865
  • 32
  • 103
  • 160
6
votes
3 answers

How to efficiently parse large text files in Ruby

I'm writing an import script that processes a file that has potentially hundreds of thousands of lines (log file). Using a very simple approach (below) took enough time and memory that I felt like it would take out my MBP at any moment, so I killed…
localshred
  • 2,244
  • 1
  • 21
  • 33
6
votes
2 answers

Can "perl -a" somehow re-join @F using the original whitespace?

My input has a mix of tabs and spaces for readability. I want to modify a field using perl -a, then print out the line in its original form. (The data is from findup, showing me a count of duplicate files and the space they waste.) Input is: 2 *…
piojo
  • 6,351
  • 1
  • 26
  • 36
6
votes
4 answers

Big text file processing

I need to implement lazy loading in Mathematica. I have a 600 MB CSV text file which I need to process. This file contains a lot of duplicated records: 1;0;0;13;6 1;0;0;13;6 .......... 2;0;0;13;6 2;0;0;13;6 .......... etc. So instead of loading…
Max
  • 19,654
  • 13
  • 84
  • 122
6
votes
8 answers

Read line by line and print matches line by line

I am new to shell scripting, and it would be great if I could get some help with the question below. I want to read a text file line by line, and print all matched patterns in that line to a line in a new text file. For example: $ cat input.txt SYSTEM…
Dinesh Kumar
  • 105
  • 1
  • 8
6
votes
10 answers

How can I loop through blocks of lines in a file?

I have a text file that looks like this, with blocks of lines separated by blank lines: ID: 1 Name: X FamilyN: Y Age: 20 ID: 2 Name: H FamilyN: F Age: 23 ID: 3 Name: S FamilyN: Y Age: 13 ID: 4 Name: M FamilyN: Z Age: 25 How can I loop through…
Adia
  • 1,171
  • 5
  • 16
  • 33
6
votes
4 answers

Parse string into a tree structure?

I'm trying to figure out how to parse a string in this format into a tree-like data structure of arbitrary depth. "{{Hello big|Hi|Hey} {world|earth}|{Goodbye|farewell} {planet|rock|globe{.|!}}}" [[["Hello big" "Hi" "Hey"] ["world" "earth"]] …
erikcw
  • 10,787
  • 15
  • 58
  • 75
6
votes
8 answers

Efficiently parsing a large text file in C#

I need to read a large space-separated text file and count the number of instances of each code in the file. Essentially, these are the results of running some experiments hundreds of thousands of times. The system spits out a text file that looks…
ChrisCa
  • 10,876
  • 22
  • 81
  • 118
6
votes
3 answers

Randomizing text between delimiters

I have this simple input: "I have {red;green;orange} fruit and cup of {tea;coffee;juice}". I use Perl to identify patterns between two external brace delimiters { and }, and randomize the fields inside with the internal delimiter ;. I'm getting this…
kempinski
  • 63
  • 3
6
votes
6 answers

Fast Text Preprocessing

In my project I work with text in general. I found that preprocessing can be very slow. So I would like to ask you if you know how to optimize my code. The flow is like this: get HTML page -> (To plain text -> stemming -> remove stop words) ->…
Ventus
  • 2,482
  • 4
  • 35
  • 41
6
votes
1 answer

Count word frequencies in list-of-lists-of-words

I have this large corpus data in dataframe res (dataframe) text.1 1 …
KRU
  • 291
  • 4
  • 18