Questions tagged [text-processing]

Mechanizing the creation or manipulation of electronic text.

Text processing covers basic jobs such as filtering, tokenizing, or normalizing text. It can also serve as a pre-processing step for later stages.

1959 questions
6
votes
4 answers

Regex for finding an unterminated string

I need to search for lines in a CSV file that end in an unterminated, double-quoted string. For example, 1,2,a,b,"dog","rabbit would match, whereas 1,2,a,b,"dog","rabbit","cat bird" and 1,2,a,b,"dog",rabbit would not. I have very limited experience…
Austin Hyde
  • 26,347
  • 28
  • 96
  • 129
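A minimal sketch of one way to approach the question above, written in Python. It rests on an assumption not stated in the excerpt: quoted fields never span lines, and quotes inside a field are escaped CSV-style by doubling them, so a line ends inside an open string exactly when it contains an odd number of double quotes. The names ODD_QUOTES and ends_unterminated are hypothetical.

```python
import re

# Assumption: a line has an unterminated string exactly when its count of
# double quotes is odd. The pattern consumes quotes in pairs, then requires
# one extra quote followed only by non-quote characters up to the end.
ODD_QUOTES = re.compile(r'[^"]*(?:"[^"]*"[^"]*)*"[^"]*')


def ends_unterminated(line: str) -> bool:
    """Return True if the line ends inside an unclosed double-quoted string."""
    return ODD_QUOTES.fullmatch(line) is not None
```

Applied to the three sample lines from the question, only the first one is reported as unterminated.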
6
votes
3 answers

Linux join utility complains about input file not being sorted

I have two files. file1 has the format field1;field2;field3;field4 and is initially unsorted; file2 has the format field1 and is sorted. I run the following two commands: sort -t\; -k1 file1 -o file1 # to sort file 1 join -t\; -1 1 -2 1 -o…
Razvan
  • 9,925
  • 6
  • 38
  • 51
6
votes
4 answers

How do I read information from text files?

I have hundreds of text files with the following information in each file: *****Auto-Corelation Results****** 1 .09 -.19 .18 non-Significant *****STATISTICS FOR MANN-KENDELL TEST****** S= 609 VAR(S)= 162409.70 Z= …
Geekuna Matata
  • 1,349
  • 5
  • 19
  • 38
6
votes
5 answers

Replacing all GUIDs in a file with new GUIDs from the command line

I have a file containing a large number of occurrences of the string Guid="GUID HERE" (where GUID HERE is a unique GUID at each occurrence) and I want to replace every existing GUID with a new unique GUID. This is on a Windows development machine,…
user197015
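A hedged sketch of the replacement described above, in Python rather than a native Windows command-line tool. It assumes the attribute always appears literally as Guid="..." with no embedded double quotes, and that lowercase hyphenated GUIDs (the default string form of uuid.uuid4()) are acceptable. The names GUID_ATTR and replace_guids are hypothetical.

```python
import re
import uuid

# Assumption: every GUID to replace sits inside a Guid="..." attribute.
GUID_ATTR = re.compile(r'Guid="[^"]*"')


def replace_guids(text: str) -> str:
    """Replace each Guid="..." value with a freshly generated GUID."""
    # The callable runs once per match, so every occurrence gets its own GUID.
    return GUID_ATTR.sub(lambda match: f'Guid="{uuid.uuid4()}"', text)
```

Passing a callable to sub() is what keeps the new values unique; a fixed replacement string would insert the same GUID everywhere.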
6
votes
2 answers

How to perform Paragraph boundary detection in NLP frameworks?

I am working on extracting names of people from various ads appearing in English newspapers. However, I have noticed that I need to identify the boundary of an ad before extracting the names present in it, since I need only the first occurring…
kiran
  • 339
  • 4
  • 18
6
votes
7 answers

Does an empty string contain an empty string in C++?

I just had an interesting argument in the comments on one of my questions. My opponent claims that the statement "" does not contain "" is wrong. My reasoning is that if "" contained another "", that one would also contain "" and so on. Who is…
Oleksiy
  • 37,477
  • 22
  • 74
  • 122
6
votes
1 answer

Split text on paragraphs where paragraph delimiters are non-standard

If I have text with standard paragraph formatting (a blank line followed by an indent) such as text 1 it's easy enough to extract the paragraphs using text.split("\n\n"). Text 1: Lorem ipsum dolor sit amet, consectetur adipiscing elit.…
Renklauf
  • 971
  • 1
  • 12
  • 27
6
votes
1 answer

How to ignore certain characters while doing diff in google-diff-match-patch?

I'm using google-diff-match-patch to compare plain text in natural languages. How can I make google-diff-match-patch ignore certain characters? (These are tiny differences that I don't care about.) For example, given text1: give me a cup of bean-milk.…
weakish
  • 28,682
  • 5
  • 48
  • 60
6
votes
1 answer

Use the :g command in vim with multiple actions

How can I use something like this? :g/^$/kJ Here kJ is two commands instead of just one (like 'd'). My concrete example: I have multiple lines looking like this queryBuilder .append("xyz"); and I want to make them look like…
kadrian
  • 4,761
  • 8
  • 39
  • 61
5
votes
1 answer

Segmenting words, and grouping hyphenated and apostrophe words from text

I need to segment words from a text. Sometimes hyphenated words are written without hyphens, and words with apostrophes are written without the apostrophe. There are also similar issues, such as different spellings of the same word (e.g. color, colour), or…
phoxis
  • 60,131
  • 14
  • 81
  • 117
5
votes
5 answers

How can I get "grep -zoP" to display every match separately?

I have a file in this form: X/this is the first match/blabla X-this is the second match- and here we have some fluff. I want to extract everything that appears after "X" and between the same markers. So if I have "X+match+", I want to get…
fedorqui
  • 275,237
  • 103
  • 548
  • 598
5
votes
2 answers

perl - split string into 2-character groups

Possible Duplicate: How can I split a string into chunks of two characters each in Perl? I wanted to split a string into an array grouping it by 2-character pieces: $input = "DEADBEEF"; @output = split(/(..)/,$input); This approach produces…
SF.
  • 13,549
  • 14
  • 71
  • 107
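For comparison, the same grouping can be sketched in Python instead of the Perl used in the question. This is only an illustration of the chunking idea, not the fix for the Perl split call; the function name chunk_pairs is hypothetical, and keeping a trailing single character for odd-length input is an assumption.

```python
def chunk_pairs(text: str) -> list[str]:
    """Split text into consecutive 2-character pieces.

    Assumption: an odd-length input keeps its final single character
    as the last piece instead of being dropped.
    """
    return [text[i:i + 2] for i in range(0, len(text), 2)]


print(chunk_pairs("DEADBEEF"))  # ['DE', 'AD', 'BE', 'EF']
```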
5
votes
3 answers

Split lines with multiple words in Python

I have a (very ugly) txt output from an SQL query, which is performed by an external system that I can't change. Here is the output example: FruitName Owner OwnerPhone ============= ================= ============ Red Apple Sr…
randms26
  • 137
  • 1
  • 6
  • 16
5
votes
5 answers

Remove unmatched parentheses from a string

I want to remove "un-partnered" parentheses from a string. I.e., all ('s should be removed unless they're followed by a ) somewhere in the string. Likewise, all )'s not preceded by a ( somewhere in the string should be removed. Ideally the algorithm…
Tom Lehman
  • 85,973
  • 71
  • 200
  • 272
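One possible reading of the requirement above, sketched in Python. It assumes strict pairing, where each ")" can partner with only one earlier "(", which is somewhat stricter than the literal wording of the excerpt (under the literal wording, "(()" would be left unchanged). The function name remove_unmatched_parens is hypothetical.

```python
def remove_unmatched_parens(text: str) -> str:
    """Drop every '(' or ')' that has no partner; keep all other characters."""
    open_positions = []  # indices of '(' still waiting for a ')'
    drop = set()         # indices of parentheses to remove

    for index, char in enumerate(text):
        if char == "(":
            open_positions.append(index)
        elif char == ")":
            if open_positions:
                open_positions.pop()  # pairs with the most recent open '('
            else:
                drop.add(index)       # ')' with no '(' before it

    drop.update(open_positions)       # '(' that never found a ')'
    return "".join(c for i, c in enumerate(text) if i not in drop)
```

The scan is a single left-to-right pass with a stack of open positions, so it runs in linear time and preserves the order of everything that is kept.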
5
votes
3 answers

Remove code between #if 0 and #endif when exporting a C file to a new one

I want to remove all comments in a toy.c file. From Remove comments from C/C++ code I see that I could use gcc -E -fpreprocessed -P -dD toy.c But some of my code (say, deprecated functions that I don't want to compile) is wrapped between #if 0…
Zheyuan Li
  • 71,365
  • 17
  • 180
  • 248