Questions tagged [text-processing]

Mechanizing the creation or manipulation of electronic text.

Text processing covers basic jobs that use filtering, tokenization, or normalization methods to process text. This could be a pre-processing step for .

1959 questions
19
votes
8 answers

How do I join pairs of consecutive lines in a large file (1 million lines) using vim, sed, or another similar tool?

I need to move the contents of every second line up to the line above so that line2's data sits alongside line1's; either comma or space separation works. Input (one item per line): line1, line2, line3, line4. Output (two lines): "line1 line2" and "line3 line4". I've been doing it in vim with…
janeruthh
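
A minimal sketch of one way to approach the question above, written in Python rather than vim or sed. The function name `join_pairs` and the `sep` parameter are illustrative assumptions, not part of the question. It streams the input, so it should cope with a file of a million lines without loading it all into memory.

```python
from itertools import zip_longest


def join_pairs(lines, sep=" "):
    """Yield each odd-numbered line joined with the line that follows it."""
    stripped = (line.rstrip("\n") for line in lines)
    # Passing the same iterator twice makes zip_longest consume two lines per step.
    for first, second in zip_longest(stripped, stripped):
        # A trailing unpaired line is emitted on its own.
        yield first if second is None else f"{first}{sep}{second}"


if __name__ == "__main__":
    sample = ["line1\n", "line2\n", "line3\n", "line4\n"]
    for joined in join_pairs(sample):
        print(joined)
```

For a real file, passing an open file object (for example `join_pairs(open(path))`, where `path` is whatever file you have) works because file objects iterate line by line.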
18
votes
1 answer

Apache Tika and character limit when parsing documents

Could anybody please help me sort this out? It can be done like this: Tika tika = new Tika(); tika.setMaxStringLength(10*1024*1024); But if you don't use Tika directly, like this: ContentHandler textHandler = new…
lisak
18
votes
5 answers

Is there any way to convert Wikitext to Markdown in Python?

Is there a Python library which takes wikitext (as used in MediaWiki) as input and converts it to Markdown?
Ed L
18
votes
3 answers

Skip file lines until a match is found, then output the rest

I can write a trivial script to do this, but in my ongoing quest to get more familiar with Unix I'd like to learn efficient methods using built-in commands instead. I need to deal with very large files that have a variable number of header lines.…
monty
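
For comparison with the shell-based approaches the question asks about, here is a hedged Python sketch of the same idea. The name `from_first_match` is hypothetical, and the sketch assumes the matching line itself should be kept in the output; the excerpt does not say either way.

```python
import re
from itertools import dropwhile


def from_first_match(lines, pattern):
    """Lazily skip lines until one matches `pattern`, then yield it and the rest."""
    regex = re.compile(pattern)
    # dropwhile stops testing after the first line for which the predicate is false,
    # so later non-matching lines are passed through untouched.
    return dropwhile(lambda line: not regex.search(line), lines)


if __name__ == "__main__":
    sample = ["header 1\n", "header 2\n", "DATA\n", "1\n", "2\n"]
    print("".join(from_first_match(sample, r"^DATA")), end="")
```

Because `dropwhile` is lazy, the same call works on an open file object without reading the whole file first.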
17
votes
10 answers

Code for identifying programming language in a text file

I'm supposed to write code which, when given a text file (source code) as input, will output which programming language it is. This is the most basic definition of the problem. More constraints follow: I must write this in C++. A wide variety of…
PeterK
17
votes
5 answers

How do I read and parse a text file with numbers, fast (in C)?

Latest update: my classmate uses fread() to read about one third of the whole file into a string, which avoids running out of memory. Then process this string, splitting it into your data structure. Notice, you need to care about one…
beasone
17
votes
5 answers

How can I sum values in column based on the value in another column?

I have a text file which contains: ABC 50 DEF 70 XYZ 20 DEF 100 MNP 60 ABC 30 I want an output which sums up the individual values and shows them as a result. For example, the total of all ABC values in the file is (50 + 30 = 80) and the total for DEF is (100 + 70 = 170). So…
Sam
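
The grouping described in the question can be sketched as follows. This assumes each line holds a key and an integer value separated by whitespace; the function name `sum_by_key` is an illustrative choice.

```python
from collections import defaultdict


def sum_by_key(lines):
    """Sum the second column of each line, grouped by the first column."""
    totals = defaultdict(int)
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        totals[fields[0]] += int(fields[1])
    return dict(totals)


if __name__ == "__main__":
    sample = ["ABC 50", "DEF 70", "XYZ 20", "DEF 100", "MNP 60", "ABC 30"]
    for key, total in sum_by_key(sample).items():
        print(key, total)
```

With the sample data from the question, ABC totals 80 and DEF totals 170, matching the figures given there.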
16
votes
6 answers

Balanced word wrap (Minimum raggedness) in PHP

I'm going to make a word wrap algorithm in PHP. I want to split small chunks of text (short phrases) into n lines of at most m characters (n is not given, so there will be as many lines as needed). The peculiarity is that the line lengths (in characters)…
lorenzo-s
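
A rough sketch of minimum-raggedness wrapping, in Python rather than the PHP the question asks for. It uses the usual dynamic-programming formulation with the squared unused width of each line as the cost. That cost function, the choice to penalize the last line too, and the name `balanced_wrap` are all assumptions; the excerpt is cut off before it states the exact balancing criterion.

```python
def balanced_wrap(text, width):
    """Wrap `text` to at most `width` columns, minimizing the sum of squared slack."""
    words = text.split()
    n = len(words)
    if n == 0:
        return []

    # best[i] is the minimal cost of wrapping words[i:];
    # split[i] is the index one past the last word on the first line of that wrapping.
    best = [0.0] * (n + 1)
    split = [n] * (n + 1)

    for i in range(n - 1, -1, -1):
        best[i] = float("inf")
        length = -1  # cancels the space counted before the first word
        for j in range(i, n):
            length += len(words[j]) + 1
            if length > width and j > i:
                break  # line is full; a single overlong word is still allowed
            slack = max(width - length, 0)
            cost = slack ** 2 + best[j + 1]
            if cost < best[i]:
                best[i] = cost
                split[i] = j + 1

    # Follow the recorded split points to rebuild the lines.
    lines = []
    i = 0
    while i < n:
        j = split[i]
        lines.append(" ".join(words[i:j]))
        i = j
    return lines


if __name__ == "__main__":
    for line in balanced_wrap("aaa bb cc ddddd", 6):
        print(line)
```

A greedy wrapper would put "aaa bb" on the first line and leave "cc" alone on the second; the sketch instead produces "aaa", "bb cc", "ddddd", which has a lower total squared slack.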
16
votes
3 answers

How to configure 'less' to show formatted markdown files?

I would like to have less display *.md markdown files with some formatting -- like I know less can, for manpages, etc. I am running Ubuntu 12.04. I am as far as putting a user defined filter into .lessfilter: #!/bin/sh case "$1" in *.md) …
towi
16
votes
2 answers

How to compute the number of times a word appears in a file or in some range

Sometimes I want to see how many times a certain function is called in a file or a code block. How do you do that? I am using Vim 7.2. I presume you have to use !wc or some such.
vehomzzz
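
Outside of Vim, the same count can be sketched in a few lines of Python. The function name `count_word` and its 1-based, inclusive `start`/`end` range are assumptions made for illustration; whole-word matching is also an assumption, chosen so that a function name is not counted inside longer identifiers.

```python
import re


def count_word(lines, word, start=1, end=None):
    """Count whole-word occurrences of `word` in lines start..end (1-based, inclusive)."""
    pattern = re.compile(rf"\b{re.escape(word)}\b")
    selected = lines[start - 1:end]  # end=None means "through the last line"
    return sum(len(pattern.findall(line)) for line in selected)


if __name__ == "__main__":
    source = ["foo(1)", "bar(foo(2))", "foobar()", "foo(); foo()"]
    print(count_word(source, "foo"))
    print(count_word(source, "foo", start=2, end=3))
```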
15
votes
8 answers

Perl or Python: Convert date from dd/mm/yyyy to yyyy-mm-dd

I have lots of dates in a column in a CSV file that I need to convert from dd/mm/yyyy to yyyy-mm-dd format. For example 17/01/2010 should be converted to 2010-01-17. How can I do this in Perl or Python?
FunLovinCoder
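
A short Python sketch of the conversion asked about above, using the standard library's `datetime` parsing so that invalid dates raise an error instead of being silently rearranged. The function name `to_iso` is an illustrative assumption.

```python
from datetime import datetime


def to_iso(date_text):
    """Convert a dd/mm/yyyy string to yyyy-mm-dd, raising ValueError on bad input."""
    return datetime.strptime(date_text, "%d/%m/%Y").date().isoformat()


if __name__ == "__main__":
    print(to_iso("17/01/2010"))
```

For a whole CSV column, the same function could be applied to the relevant field of each row read with the `csv` module.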
14
votes
1 answer

Efficient text preprocessing using PySpark (clean, tokenize, stopwords, stemming, filter)

Recently, I began to learn Spark from the book "Learning Spark". In theory, everything is clear; in practice, I found that I first need to preprocess the text, but there were no actual tips on this topic. The first thing that I…
14
votes
5 answers

convert a `find` like output to a `tree` like output

This question is a generalized version of the "Output of ZipArchive() in tree format" question. Before I waste time writing this (*nix command line) utility, it would be a good idea to find out whether someone has already written it. I would like a…
Chen Levy
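
One possible shape for such a utility, sketched in Python. It builds a nested dictionary from slash-separated paths and prints it with plain indentation rather than the box-drawing characters that `tree` uses. The function names `paths_to_tree` and `render` are hypothetical.

```python
def paths_to_tree(paths):
    """Build a nested dict from slash-separated paths such as `find` prints."""
    root = {}
    for path in paths:
        node = root
        for part in path.strip("/").split("/"):
            if part and part != ".":  # ignore empty parts and the leading "."
                node = node.setdefault(part, {})
    return root


def render(tree, indent=0):
    """Return the tree as a list of lines, children indented under their parent."""
    lines = []
    for name in sorted(tree):
        lines.append("    " * indent + name)
        lines.extend(render(tree[name], indent + 1))
    return lines


if __name__ == "__main__":
    found = ["./a", "./a/b.txt", "./a/c/d.txt", "./e.txt"]
    print("\n".join(render(paths_to_tree(found))))
```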
13
votes
1 answer

Determining frequency of an array in Python

I have a sample file filled with floating point numbers as follows: -0.02 3.04 3.04 3.02 3.02 3.06 3.04 3.02 3.04 3.02 3.04 3.02 3.04 3.02 3.04 3.04 3.04 3.02 3.04 3.02 3.04 3.02 3.04 3.02 3.06 3.02 3.04 3.02 …
y33t
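
Counting how often each value occurs can be sketched with `collections.Counter`. The sample string below is a shortened, illustrative stand-in for the file in the question, and `value_frequencies` is a hypothetical name.

```python
from collections import Counter


def value_frequencies(text):
    """Count occurrences of each whitespace-separated number in `text`."""
    return Counter(float(token) for token in text.split())


if __name__ == "__main__":
    sample = "-0.02 3.04 3.04 3.02 3.02 3.06 3.04"
    for value, count in value_frequencies(sample).most_common():
        print(value, count)
```

Parsing to `float` means that tokens such as `3.04` and `3.040` are counted together; counting the raw strings instead would keep them separate.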
13
votes
7 answers

Converting a \u escaped Unicode string to ASCII

After reading all about iconv and Encoding, I am still confused. I am scraping the source of a web page and I have a string that looks like this: 'pretty\u003D\u003Ebig' (displayed in the R console as 'pretty\\\u003D\\\u003Ebig'). I want to convert this…
seancarmody
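
The question is about R, but the underlying idea can be sketched in Python: find each literal `\uXXXX` sequence and replace it with the character it names. The name `decode_u_escapes` is an assumption, and this sketch does not combine surrogate pairs for characters outside the Basic Multilingual Plane.

```python
import re

# A literal backslash, the letter u, then exactly four hex digits.
_ESCAPE = re.compile(r"\\u([0-9a-fA-F]{4})")


def decode_u_escapes(text):
    """Replace literal \\uXXXX sequences in `text` with the characters they encode."""
    return _ESCAPE.sub(lambda match: chr(int(match.group(1), 16)), text)


if __name__ == "__main__":
    print(decode_u_escapes(r"pretty\u003D\u003Ebig"))
```

Applied to the string from the question, U+003D and U+003E decode to `=` and `>`.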