Questions tagged [text-processing]

Mechanizing the creation or manipulation of electronic text.

Text processing includes basic jobs that filter, tokenize, or normalize text. It can also serve as a pre-processing step.

1959 questions
0
votes
1 answer

Python + dataframe : AttributeError: 'float' object has no attribute 'replace'

I am trying to write a function to do some text processing on the specified columns (description, event_name) of a Pandas dataframe. I wrote this code: #removal of unreadable chars, unwanted spaces, words of at most length two from 'description'…
Debbie
  • 911
  • 3
  • 20
  • 45
0
votes
0 answers

How to join two different lines into the same line in Linux

From Nagios I downloaded its HTML file using the wget command, then converted that HTML file to a text file using the following command: html2text -width 180 file.html >1.txt Alert Summary Report …
indra
  • 19
  • 6
0
votes
1 answer

How do I extract column from CSV with quoted commas, using the shell?

I have a CSV file, but unlike in related questions, it has some columns containing double-quoted strings with commas, e.g. foo,bar,baz,quux 11,"first line, second column",13.0,6 210,"second column of second line",23.1,5 (of course it's longer, and…
einpoklum
  • 118,144
  • 57
  • 340
  • 684
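One way to handle quoted commas is to let a real CSV parser split the rows instead of splitting on every comma. The sketch below swaps the shell for Python's standard `csv` module; the helper name `extract_column` and the zero-based `index` parameter are assumptions made for illustration, and the sample rows are the ones quoted in the excerpt.

```python
import csv
import io


def extract_column(lines, index):
    """Yield the field at zero-based position `index` from each CSV row."""
    # csv.reader understands double-quoted fields, so a comma inside
    # quotes stays part of the field instead of starting a new column.
    for row in csv.reader(lines):
        if index < len(row):  # skip rows that are too short
            yield row[index]


# Sample data taken from the question excerpt.
sample = io.StringIO(
    'foo,bar,baz,quux\n'
    '11,"first line, second column",13.0,6\n'
    '210,"second column of second line",23.1,5\n'
)

second_column = list(extract_column(sample, 1))
print(second_column)
```

From a shell, the same function could be fed `sys.stdin` so it works in a pipeline, assuming a Python interpreter is available.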
0
votes
1 answer

Writing each item in a list into a separate txt file with auto-assigned filename (python=3.6)

I'm using textract to get plain text from PDF files. For the plain text of each PDF file in the directory, I append it to the list filetext_list. I want to write each item of the list to a separate txt file with an auto-assigned filename like…
0
votes
2 answers

Spacy - preprocessing & lemmatization taking long time

I am working on a text classification problem and I have tried WordNetLemmatizer followed by TF-IDF and CountVectorizer. Now, I am trying to clean up the text using Spacy before feeding it to TF-IDF. The input file has around 20,000 records with each…
Chetan Ambi
  • 159
  • 3
  • 9
0
votes
2 answers

Bash Shellscript Column Check Error Handling

I am writing a Bash Shellscript. I need to check a file for if $value1 contains $value2. $value1 is the column number (1, 4, 5 as an example) and $value2 ($value2 can be '03', '04' , '09' etc) is the String I am looking for. If the column contains…
Defcon
  • 807
  • 3
  • 15
  • 36
0
votes
2 answers

Calling replace on replacement String in replaceAll() method

It seems like this should work, but it doesn't... I'm trying to call the replace() method on the replacement String passed to the replaceAll() method. For example I tried to get rid of any commas inside double quotes with this code: String string =…
spectrum
  • 379
  • 4
  • 11
0
votes
3 answers

Optimise looping through file contents

I have two files, file1 and file2. I need to check if all the contents in file1 are present in file2. The contents of file1 will be as follows: ABC1234 BFD7890 And the contents of file2 will be as…
screenslaver
  • 563
  • 1
  • 8
  • 17
0
votes
1 answer

How to add items to a JComboBox from an External File?

Please, I need some help adding items to a JComboBox in Java from an external file. Here is my code so far: //Loading the Names: File Names_File = new File("Data" + File.separator + "Names.txt"); FileInputStream fis = null; …
CompilingCyborg
  • 4,760
  • 13
  • 44
  • 61
0
votes
2 answers

Batch script that replaces static string in file with filename

I have 3000 files in c:\data\, and I need to replace a static string in each of them with the name of the file. For example, in the file 12345678.txt there will be some records along with the string 99999999, and I want to replace 99999999 with the…
Kumar
0
votes
2 answers

Identifying the differences between two files in Unix

I have 2 files rec1.txt and rec2.txt. [gpadmin@subh ~]$cat ret1.txt emcas_fin_bi=324 emcas_fin_drr=3294 emcas_fin_exp=887 emcas_fin_optics=0 emcas_gbo_gs=3077 and [gpadmin@subh ~]$ cat…
0
votes
1 answer

Need help improving text processing program (Python 3)

I have written a Python program to loop through a list of X files, open each one, read it line by line, and write (append) to an output file. Because these files are several GB each, it is taking a very long time. I am looking for suggestions to improve…
0
votes
2 answers

Extracting utterance per line in python

I have text data containing one utterance per line. I want to extract it so I have a list containing all the utterances with the same length of the line. Here is an example of my data input.txt I am very happy today. Are you angry with me...? No? Oh…
sugab
  • 183
  • 15
0
votes
2 answers

Python: How to remove punctuation in a text corpus, but not remove it in special words (e.g. c++, c#, .net, etc)

I have a big pandas dataset with job descriptions. I want to tokenize it, but before this I should remove stopwords and punctuation. I have no problems with stopwords. If I use a regex to remove punctuation, I could lose very important words…
Carlo Pazolini
  • 315
  • 1
  • 8
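A possible approach to the problem described above is to match the protected terms first and fall back to plain word characters for everything else, so punctuation is dropped except where it belongs to a known term. This is a minimal sketch in plain Python, not the asker's code: the `PROTECTED` tuple, the `tokenize` name, and the lowercasing step are all assumptions chosen for illustration.

```python
import re

# Hypothetical whitelist of terms whose punctuation must survive.
PROTECTED = ("c++", "c#", ".net")

# Protected terms are tried first (longest first, so a longer term wins
# over a shorter one sharing its prefix); anything else must be a plain
# run of word characters. The lookarounds stop a protected term from
# matching inside a longer word.
_protected_alternatives = "|".join(
    re.escape(term) for term in sorted(PROTECTED, key=len, reverse=True)
)
TOKEN_PATTERN = re.compile(
    r"(?<!\w)(?:" + _protected_alternatives + r")(?!\w)|\w+"
)


def tokenize(text):
    """Return lowercase tokens, dropping punctuation outside protected terms."""
    return TOKEN_PATTERN.findall(text.lower())


tokens = tokenize("I know C++, C# and .NET. Also Python/SQL!")
print(tokens)
```

With a pandas column of job descriptions, the same function could be applied row by row (for example through `Series.apply`), assuming missing values are filled with empty strings first.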
0
votes
1 answer

Use LaF and grepl together

I would like to read in a possibly large text file and filter the relevant lines on the fly based on a regular expression. My first approach was using the package LaF which supports chunkwise reading and then grepl to filter. However, this seems not…
Karsten W.
  • 17,826
  • 11
  • 69
  • 103