Questions tagged [text-processing]

Mechanizing the creation or manipulation of electronic text.

Text processing covers basic jobs that filter, tokenize, or normalize text. This is often a pre-processing step for further analysis.

1959 questions
0
votes
2 answers

Get the n-th range by pattern

My input is like this: start content A end garbage start content B end. I want to extract the second (or first, or third, ...) start .. end block. With sed -ne '/start/,/end/p' I can filter out the garbage, but how do I get just "start content B…
phihag
  • 278,196
  • 72
  • 453
  • 469
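
A minimal sketch of one way to pick the n-th block, written in Python rather than sed. The function name `nth_block` and the default markers are assumptions for illustration. Like the sed range `/start/,/end/`, it looks for the end marker only on lines after the start line. Unlike sed, it returns nothing if the n-th block is never closed.

```python
def nth_block(lines, n, start="start", end="end"):
    """Return the n-th (1-based) start..end block, inclusive, or [] if absent."""
    count = 0
    block = None  # None while we are outside any block
    for line in lines:
        if block is None:
            if start in line:
                count += 1
                block = [line]
        else:
            block.append(line)
            if end in line:
                if count == n:
                    return block
                block = None  # closed a block we do not want
    return []


sample = "start\ncontent A\nend\ngarbage\nstart\ncontent B\nend".splitlines()
second = nth_block(sample, 2)
```

Here `second` holds the three lines of the second block. Matching is by plain substring, so a regular expression would be needed for anything stricter.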
0
votes
1 answer

Keep single blank lines in an SRT file by removing Ctrl-M characters and doubled empty lines

We process a lot of SRT files on Linux to generate derivatives, but some of them contain Ctrl-M characters because they were generated on Windows. Right now I use two commands to check for and remove the hidden characters: tr -d '\015' <${file}.srt…
Calvin
  • 407
  • 1
  • 5
  • 21
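
A rough sketch of doing both steps in one pass, in Python instead of the tr pipeline from the question. The helper name `clean_srt` is hypothetical, and it assumes "empty" means lines with no characters at all (lines holding only spaces are left alone).

```python
import re


def clean_srt(text):
    # Drop carriage returns (Ctrl-M, octal 015), like tr -d '\015'.
    text = text.replace("\r", "")
    # Collapse runs of two or more empty lines into one empty line.
    return re.sub(r"\n{3,}", "\n\n", text)


raw = "1\r\n00:00:01,000 --> 00:00:02,000\r\nHi\r\n\r\n\r\n2\r\n"
cleaned = clean_srt(raw)
```

After the call, `cleaned` has Unix line endings and a single blank line between cues.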
0
votes
2 answers

deleting words containing number(s)

Coming to the sed part of an assignment, I am having difficulty with a regex (on Ubuntu) that deletes every word containing one or more digits. Here is the expression I have so far: echo sed /\w.*[0-9]+.*\w/g text > text Sample: asdkbasdnas…
user9610829
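
One possible approach, sketched in Python rather than sed. It assumes a "word" is a run of letters, digits, and underscores, and it also collapses the leftover whitespace to single spaces, which the question does not ask for. The name `drop_words_with_digits` is made up for this example.

```python
import re

# A whole word (bounded by \b) that contains at least one digit.
_WORD_WITH_DIGIT = re.compile(r"\b\w*\d\w*\b")


def drop_words_with_digits(line):
    without = _WORD_WITH_DIGIT.sub("", line)
    # Tidy up the gaps left behind by the removed words.
    return " ".join(without.split())


result = drop_words_with_digits("foo b4r baz 123 qux9")
```

Anchoring on word boundaries is what keeps the pattern from swallowing neighbouring words, which a pattern with `.*` between two `\w` would do.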
0
votes
4 answers

Move numbers at the beginning of the line to the end of the line

I have output from the Unix uniq -c command, which prints the number of occurrences of a string at the beginning of each line. The string represents two authors separated by a pipe (e.g., Aabdel-Wahab S|Abdel-Hafeez EH). 1 Aabdel-Wahab…
Andrej
  • 3,719
  • 11
  • 44
  • 73
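
A small sketch of the move in Python. The excerpt does not show the desired output format, so joining the count with `|` is an assumption here, exposed as the `sep` parameter. The function name is hypothetical, and lines are expected without a trailing newline.

```python
import re

# Optional leading spaces, the count, whitespace, then the rest of the line.
_LEADING_COUNT = re.compile(r"^\s*(\d+)\s+(.*)$")


def move_count_to_end(line, sep="|"):
    m = _LEADING_COUNT.match(line)
    if m is None:
        return line  # no leading count, leave the line untouched
    count, rest = m.groups()
    return f"{rest}{sep}{count}"


moved = move_count_to_end("      1 Aabdel-Wahab S|Abdel-Hafeez EH")
```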
0
votes
1 answer

Text processing using awk when separator is part of word?

I have a CSV file with 11 columns and content similar to this: SE Australia|PRM|2017-09-07T16:11:33|2641|-5537383165259899960|2017-09-07T16:12:17|"AU en2|networking-locator"|-|SC7_Electricians_Installer (only provides labor)|p-0715125|1 I am…
pm1359
  • 622
  • 1
  • 10
  • 31
0
votes
2 answers

Extract value based on column header from Comma separated file using bash

I want to extract the first value from a CSV file for a specific column name using bash. For example, I want to extract the first value of column "bb". Columns can be in any order. aa,bb,cc 1,2,3 4,5,6 The output should be 2.
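
The question asks for bash. As a point of comparison, here is a minimal sketch in Python using the standard `csv` module, which looks the column up by header name so the column order does not matter. The function name `first_value` is an assumption for illustration.

```python
import csv
import io


def first_value(csv_text, column):
    """Return the first data value under `column`, or None if there are no rows."""
    reader = csv.DictReader(io.StringIO(csv_text))
    for row in reader:
        return row[column]  # raises KeyError if the header is missing
    return None


value = first_value("aa,bb,cc\n1,2,3\n4,5,6\n", "bb")
```

For the sample in the question, `value` is the string "2".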
0
votes
1 answer

How to get all opened chromium tabs list for Linux in CLI?

I tried this: strings ~/'.config/chromium/Default/Current Session' | grep 'https?:' but I get only one match. What's going on? The output is newline (\n) delimited. I'm only able to 'grep' with awk: strings ~/'.config/chromium/Default/Current…
0
votes
3 answers

pandas: extract specific text before or after hyphen, that ends in given substrings

I am very new to pandas and have a data frame similar to the one below: import pandas as pd df = pd.DataFrame({'id': ["1", "2", "3","4","5"], 'mill': ["Company A Palm Oil Mill – Special Company A of CC Ltd", …
Funkeh-Monkeh
  • 649
  • 6
  • 17
0
votes
0 answers

Machine Learning using Multiple Features - Text Processing

I have data like the following: col1 col2 col3 2 14 text, text, some text I went through http://scikit-learn.org/stable/modules/preprocessing.html#preprocessing but I could only find information to vectorize col3 and pass it on for…
0
votes
2 answers

What is the best approach or pre-built web service for words-to-numbers conversion (for USD)?

For example: converting the string “three hundred dollars and twelve cents” to “$300.12”. Siri does this well, but I can’t find open tools to do this. I'd like to parse numerical and monetary values from textual input returned from a…
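
For the simple case in the example, a hand-rolled parser may be enough. The sketch below is not a pre-built service and makes several assumptions: English number words only, scales up to "million", and amounts phrased with the words "dollars" and "cents". All names in it are hypothetical.

```python
UNITS = {w: i for i, w in enumerate(
    "zero one two three four five six seven eight nine ten eleven twelve "
    "thirteen fourteen fifteen sixteen seventeen eighteen nineteen".split())}
TENS = {w: 10 * i for i, w in enumerate(
    "twenty thirty forty fifty sixty seventy eighty ninety".split(), start=2)}
SCALES = {"thousand": 1_000, "million": 1_000_000}


def words_to_int(words):
    total, current = 0, 0
    for w in words:
        if w in UNITS:
            current += UNITS[w]
        elif w in TENS:
            current += TENS[w]
        elif w == "hundred":
            current = max(current, 1) * 100
        elif w in SCALES:
            total += max(current, 1) * SCALES[w]
            current = 0
        elif w != "and":
            raise ValueError(f"unrecognized word: {w!r}")
    return total + current


def words_to_usd(text):
    tokens = text.lower().replace("-", " ").replace(",", " ").split()
    dollars, cents, pending = 0, 0, []
    for tok in tokens:
        if tok in ("dollar", "dollars"):
            dollars, pending = words_to_int(pending), []
        elif tok in ("cent", "cents"):
            cents, pending = words_to_int(pending), []
        else:
            pending.append(tok)
    if pending:
        raise ValueError("number words without a currency unit")
    return f"${dollars:,}.{cents:02d}"


amount = words_to_usd("three hundred dollars and twelve cents")
```

With the example sentence, `amount` comes out as "$300.12". The sketch does not validate odd input such as more than 99 cents or misordered number words.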
0
votes
0 answers

How to visualize bad predictions in keras?

I'm working on a text classification problem, and I'd like to know exactly for which inputs I got a wrong prediction at validation. Is there a way to do that in Keras? I imagine something like a column bar graph for bad predictions. (X sentence length, Y…
LagSurfer
  • 387
  • 4
  • 19
0
votes
2 answers

Perl Regex to Process Text Input

I currently have a Perl script that imports HTML and converts it to plain text. I am using HTML::TagFilter to remove all the HTML tags and it is working almost perfectly except we've run into one issue. When the HTML contains non-standard HTML tags…
Russell C.
  • 1,649
  • 6
  • 33
  • 55
0
votes
1 answer

Iterating through a folder that's passed in as a parameter to a Bash script

I'm trying to iterate over a folder, running a grep on each file, and putting them into separate files, tagged with a .res extension. Here's what I have so far.... #!/bin/bash directory=$(pwd) searchterms="searchterms.txt" extension=".end" usage()…
Dycey
  • 4,767
  • 5
  • 47
  • 86
0
votes
0 answers

Text processing in bash - extracting information between multiple HTML tags and outputting it into CSV format

I have multiple .xml files. I need the information between tags, which I can easily get with grep, but I can't seem to be able to do anything with it after. grep -oP '(.*)|(.*)|(.*)'…
ajax23
  • 33
  • 4
0
votes
1 answer

Unix Text Processing - how to remove part of a file name from the results?

I'm searching through text files using grep and sed commands and I also want the file names displayed before my results. However, I'm trying to remove part of the file name when it is displayed. The file names are formatted like this:…
Y. Gf
  • 15
  • 4