Questions tagged [text-processing]

Mechanizing the creation or manipulation of electronic text.

Text processing covers basic jobs that use filtering, tokenization, or normalization methods to transform text. It can serve as a pre-processing step.

1959 questions
0
votes
1 answer

Linux text file processing - join broken lines

I have a problem with processing files. I expect files which contain 8 columns separated by a pipe delimiter. The problem is that sometimes I get files with broken lines, example below. Every line should be: tst1|tst2|tst3|tst4|tst5|tst6|tst7|tst8 …
anton1009
  • 33
  • 2
  • 7
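A minimal sketch of one way to approach the broken-line question above, assuming that a break is a stray newline inside a record and that no field contains a pipe itself. The name `join_broken_lines` and its parameters are hypothetical, not taken from the question. It keeps appending physical lines to a buffer until the buffer holds the expected number of fields.

```python
def join_broken_lines(lines, expected_fields=8, delimiter="|"):
    """Yield complete records, gluing together lines that were split mid-record.

    Assumption: a record is complete once it contains expected_fields - 1
    delimiters. A break inside the LAST field cannot be detected this way.
    """
    buffer = ""
    for raw in lines:
        # Drop the line ending and append the fragment to the pending record.
        buffer += raw.rstrip("\r\n")
        if buffer.count(delimiter) >= expected_fields - 1:
            yield buffer
            buffer = ""
    if buffer:
        # Emit a trailing partial record as-is so nothing is silently lost.
        yield buffer


broken = [
    "tst1|tst2|tst3\n",
    "|tst4|tst5|tst6|tst7|tst8\n",
    "a|b|c|d|e|f|g|h\n",
]
repaired = list(join_broken_lines(broken))
```

Because the function accepts any iterable of lines, an open file object can be passed directly, so the file never has to be loaded into memory at once.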
0
votes
1 answer

Notepad++ Column Editing Fit Multiline Uneven Width Selection In a Single Character Column Selection

I couldn't find a question that answers my exact problem, even though there are many questions with similar titles. When I use Notepad++ in column editing mode (alt + shift + drag to select + c to copy) and I select multiple lines with…
0
votes
0 answers

Counting frequency of words from predefined dictionary

I am performing text analysis for a document using mostly Pandas, NLTK, and TextBlob. I want to obtain the frequencies of only predefined terms. The rows in the document are reviews, and there is a predefined list of associations between words that…
S420L
  • 117
  • 10
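A rough sketch of counting only predefined terms, using just the standard library rather than the Pandas, NLTK, or TextBlob setup the question mentions. The function name `count_terms`, the sample review, and the simple regex tokenizer are assumptions for illustration. It handles single-word terms only; the word associations from the question are not modeled here.

```python
import re
from collections import Counter


def count_terms(text, terms):
    """Count occurrences of the given single-word terms, case-insensitively."""
    wanted = {term.lower() for term in terms}
    # Naive tokenizer: runs of letters, digits, and apostrophes.
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return Counter(token for token in tokens if token in wanted)


review = "Good food, good service. Bad parking."
counts = count_terms(review, ["good", "bad", "slow"])
```

Terms that never occur are simply absent from the result, and looking them up (`counts["slow"]`) returns 0 because the result is a `Counter`.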
0
votes
2 answers

What is faster/better practice between a for loop for grepping a file & grepping a file with a file query?

I used to have a script like the following for i in $(cat list.txt) do grep $i sales.txt done Where cat list.txt tomatoes peppers onions And cat sales.txt Price Products $8.88 bread $6.75 tomatoes $3.34 fish $5.57 peppers $0.95 beans $4.56…
MikeKatz45
  • 545
  • 5
  • 16
0
votes
1 answer

Perl record separator -

I'm stuck on a seemingly trivial problem but not sure what it is that I'm missing. Need help. I have a file that is delimited by the standard field separator (0x1f) and record separator (0x1e) characters.…
prabhu
  • 919
  • 2
  • 12
  • 28
0
votes
1 answer

Starting nested loop from current element position to the end of the list

I have a text file with the following structure: name1: sentence. [sentence. ...] # can be one or more name2: sentence. [sentence. ...] EDIT Input sample: Djohn: Hello. I am Djohn I am Djohn. Bot: Lorem ipsum dolor sit amet, consectetur adipiscing…
stackoverflower
  • 545
  • 1
  • 5
  • 21
0
votes
1 answer

sed replace line with multiline file or variable

I'm retrieving a section from a file and want to replace a line in another file with this multi-line data. Currently I'm outputting to a file but would prefer to use a variable. For instance R 0x00007d04 0x70040000 [OVERWRITE_1] C "- Starting…
gsmith
  • 43
  • 3
0
votes
3 answers

Replace everything except text between specific delimiters with whitespaces

I have the following text file (the file may contain up to a few hundred lines): <% some important text %> something <% important stuff %> not important stuff <% some important text %> Basically I need to replace anything that is…
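One possible reading of the question above, sketched in Python: keep every `<% ... %>` block (delimiters included) and overwrite all other characters with spaces, leaving newlines alone so line numbers and column positions stay the same. The excerpt is truncated, so whether the delimiters and newlines should survive is an assumption, and `blank_outside_blocks` is a hypothetical name.

```python
import re

# Non-greedy match so each <% ... %> block is found separately;
# DOTALL lets a block span several lines.
BLOCK = re.compile(r"<%.*?%>", re.DOTALL)


def blank_outside_blocks(text):
    """Replace everything outside <% ... %> blocks with spaces, keeping newlines."""
    pieces = []
    position = 0
    for match in BLOCK.finditer(text):
        # Blank the gap before this block, then copy the block unchanged.
        pieces.append(re.sub(r"[^\n]", " ", text[position:match.start()]))
        pieces.append(match.group(0))
        position = match.end()
    # Blank whatever follows the final block.
    pieces.append(re.sub(r"[^\n]", " ", text[position:]))
    return "".join(pieces)


sample = "<% a %> xx <% b %>\nno\n<% c %>"
blanked = blank_outside_blocks(sample)
```

The output always has the same length as the input, which makes it easy to check that only the unimportant text was touched.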
0
votes
3 answers

Removing columns in SQL file

I have a big SQL file (~ 200MB) with lots of INSERT instructions: insert into `films_genres` (`id`,`film_id`,`genre_id`,`num`) values (1,1,1,1), (2,1,17,2), (3,2,1,1), ... How could I remove or ignore columns id, num in…
akrisanov
  • 3,212
  • 6
  • 33
  • 56
0
votes
2 answers

How to iterate through very large text file separated by semicolons?

If I want to iterate through a text file line-by-line, here is how I do it: for curr_line in open('my_file.txt', 'r').readlines() print '|' + curr_line + '|' If I want to iterate through a text based on semi-colon separators, here is how I do…
Saqib Ali
  • 11,931
  • 41
  • 133
  • 272
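For the semicolon-separated case above, one memory-friendly option is to read the file in fixed-size chunks and split each chunk on the separator, carrying any incomplete tail over to the next chunk. This is a sketch under assumptions: `iter_records` and its `chunk_size` default are hypothetical, and the stream is assumed to be opened in text mode.

```python
import io


def iter_records(stream, sep=";", chunk_size=64 * 1024):
    """Yield sep-delimited records from a text stream without loading it whole."""
    tail = ""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        parts = (tail + chunk).split(sep)
        # The last piece may be cut off mid-record; hold it for the next chunk.
        tail = parts.pop()
        yield from parts
    if tail:
        # Final record when the data does not end with a separator.
        yield tail


# Tiny chunk size to show that records survive chunk boundaries.
demo = io.StringIO("alpha;beta;gamma")
records = list(iter_records(demo, chunk_size=4))
print(records)  # ['alpha', 'beta', 'gamma']
```

With a real file, the same generator can be driven by `open('my_file.txt')` in a `with` block. Empty records between two adjacent separators are yielded as empty strings, while a single trailing separator does not produce an extra empty record.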
0
votes
1 answer

Sequentially print data output as a formatted table in Python

I have written a Python script to execute data like My script : import os import os.path import re import smtplib from email.mime.text import MIMEText infile = r"D:\i2Build\i2SchedulerReport.txt" if os.path.isfile(infile) and os.access(infile,…
0
votes
1 answer

vlookup function between 2 files and append matches at EOL

Need to vlookup from two different files having multiple entries: cat file1.csv …
Adriano S.
  • 25
  • 5
0
votes
1 answer

Remove Part of Speech Tags after chunking

How to remove part of speech tags from the results of chunking? I am using NLTK to do this. Currently I can only iterate over the chunks using this code: for i in sent_list: tagged = nltk.pos_tag(i) ChunkGram = r"""Chunk:…
Cua
  • 129
  • 9
0
votes
0 answers

Tesseract 4.0 OCR results inconsistent

We are trying to perform OCR on an image with 2 characters and the tesseract command returns incorrect output. Clearly the expected result should be TV but we are getting AY. The result should have been S7 Ep7, but we are getting [Sa aes]. Which as you can see…