Questions tagged [text-processing]

Mechanizing the creation or manipulation of electronic text.

Text processing covers basic jobs that filter, tokenize, or normalize text. It can also serve as a pre-processing step.

1959 questions
5
votes
1 answer

How can I read 2 consecutive lines of a text file and save them as temporary variables

I have files with an ID, model, and date. The files have a format similar to 10000_9999-99_10-01-2011.zip (where 10000 is the ID, 9999-99 is the model, and of course 10-01-2011 is the date). I would like to modify the dates of these files, but…
Jeff K
  • 243
  • 3
  • 10
5
votes
2 answers

Text features input format for classification algorithms in scikit-learn

I'm starting to use the scikit-learn to do some NLP. I've already used some classifiers from NLTK and now I want to try the ones implemented in scikit-learn. My data is basically sentences, and I extract features from some words of those sentences…
5
votes
3 answers

Given upper case names transform to Proper Case, handling "O'Hara", "McDonald", "van der Sloot" etc

I am provided a list of names in upper case. For the purpose of a salutation in an email I would like them to be Proper Cased. Easy enough to do using PHP's ucwords. But I feel I need some regex function to handle common exceptions, such…
AllInOne
  • 1,450
  • 2
  • 14
  • 32
4
votes
2 answers

Replace Long list Words in a big Text File

I need a fast method for working with a big text file. I have 2 files: a big text file (~20 GB) and another text file that contains a list of ~12 million combo words. I want to find all combo words in the first text file and replace each one with another Combo…
4
votes
2 answers

Save a specific part of a huge text file (over 2GB)

I have large log files which contain a timestamp every second. What I need is to cut a user-defined part from this huge file and save it in another text file. I am confused since the fstream class can deal with a max file size of 2GB and reading…
user1096252
  • 127
  • 1
  • 2
  • 8
4
votes
1 answer

Match trigrams, bigrams, and unigrams to a text; if unigram or bigram a substring of already matched trigram, pass; python

main_text is a list of lists containing sentences that've been part-of-speech tagged: main_text = [[('the', 'DT'), ('mad', 'JJ'), ('hatter', 'NN'), ('likes','VB'), ('tea','NN'), ('and','CC'), ('hats', 'NN')], [('the', 'DT'),…
Renklauf
  • 971
  • 1
  • 12
  • 27
4
votes
2 answers

Conditional Splitting in Perl

I have the following sentences: my $sent = 'D. discoideum and D. purpureum developmental programs revealed'; Is there a way I can split the lines so that two consecutive words that have '.' (dot) in between will be treated as one word? Hence we…
neversaint
  • 60,904
  • 137
  • 310
  • 477
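
For readers browsing this tag, the splitting idea in the question above can be sketched in Python rather than Perl. This is only an illustration, not an answer from the thread: the function name split_keep_dotted is hypothetical, and the expected result is assumed, because the excerpt is cut off before the asker states it. The pattern tries "word, dot, whitespace, word" first and falls back to any run of non-whitespace characters.

```python
import re

# Assumption: a token followed by "." and whitespace is an abbreviation
# that should stay attached to the next word (e.g. "D. discoideum").
# This would also join a sentence-final word to the word after it,
# so the input is assumed to be a single sentence.
TOKEN = re.compile(r"\w+\.\s+\w+|\S+")

def split_keep_dotted(sentence):
    """Split on whitespace, keeping 'X. word' pairs as one token."""
    return TOKEN.findall(sentence)

sent = "D. discoideum and D. purpureum developmental programs revealed"
tokens = split_keep_dotted(sent)
print(tokens)
```

Because the first alternative is tried before the fallback, the abbreviation pairs win whenever both could match at the same position.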
4
votes
1 answer

How to deal with unicode character encoding issues while converting documents from PDF to Text

I am trying to extract text from a PDF. The PDF contains text in Hindi (Unicode). The utility for extraction I am using is Apache PDFBox (http://pdfbox.apache.org/). The extractor extracts the text, but the text is not recognizable. I tried…
4
votes
2 answers

Regex doesn't match

I'm having trouble with a regexp in this situation: I need to extract (and replace) all dots from a construction like this: any_symbols->white_space->x.(or xx. or Xx. or xX. or xy. or yy. etc.)->white_space->any_symbol_not_upper_case_and_not_a_digit for…
stemm
  • 5,960
  • 2
  • 34
  • 64
4
votes
1 answer

Efficient Way to Slice Strings in Pandas

I have a dataset that has over 100 million rows that I am trying to manipulate in pandas. I am trying to slice the string in column a based on the values in columns b and c as the start and end points respectively. I can do this with list comprehension like…
Kyle
  • 2,543
  • 2
  • 16
  • 31
4
votes
2 answers

Find similar words by pronunciation - algorithms, approaches, libraries

Given 'table', it should find 'cable', 'tabular', etc. For example, you type a word into a dictionary and it says maybe you wanted word1 or word2, which are close in spelling to the one you typed. What are the names of the algorithms and approaches used? Any…
msorc
  • 907
  • 1
  • 7
  • 20
4
votes
1 answer

Bidirectional text in unity

I have been using Unity3D for about 2 years now, and one thing that I can't figure out is how to have bidirectional text. In my programs I write in Hebrew and English. The problem is that TextMesh Pro doesn't support it at all. It flips the Hebrew…
SagiZiv
  • 932
  • 1
  • 16
  • 38
4
votes
3 answers

Anomaly in text

Let me explain with an example. We have the following text: "Comme Il Faut was founded in 1927. The tobacco company is most well known for its reputation of producing customized private label brands for its partners worldwide". This is normal…
user348173
  • 8,818
  • 18
  • 66
  • 102
4
votes
4 answers

perl: how to remove particular word or pattern in between two patterns

I want to remove some words between two patterns using Perl. The following is my text: .......... QWWK jhjh kljdfh jklh jskdhf jkh PQXY lhj ah jh sdlkjh PQXY jha slkdjh PQXY jh alkjh ljk kjhaksj dkjhsd KWWQ hahs dkj h PQXY ......... Now I want to…
Santhosh
  • 9,965
  • 20
  • 103
  • 243
4
votes
6 answers

Surround every line with single quote except empty lines

My goal is to wrap every line in the file in single quotes and skip empty lines. file.txt: Quote1 Quote2 Quote3 So far I have used sed: sed -e "s/\(.*\)/'\1'/" which does the job but also adds apostrophes to empty…
AndroidFreak
  • 866
  • 1
  • 10
  • 31
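
As a rough illustration of the goal described in the question above, here is a minimal Python sketch rather than a sed fix. It is not taken from the thread's answers: the function name quote_nonempty_lines is hypothetical, and "empty" is assumed to mean a line with no characters at all, so whitespace-only lines are still quoted.

```python
def quote_nonempty_lines(text):
    """Wrap each non-empty line in single quotes; leave empty lines alone."""
    # Assumption: lines are separated by "\n"; a zero-length line is
    # considered empty and is passed through unchanged.
    return "\n".join(
        f"'{line}'" if line else line
        for line in text.split("\n")
    )

sample = "Quote1\n\nQuote2\n\nQuote3\n"
print(quote_nonempty_lines(sample))
```

Splitting on "\n" and joining again keeps the original line breaks, including a trailing newline, so the output has the same number of lines as the input.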