Questions tagged [text-processing]

Mechanizing the creation or manipulation of electronic text.

Text processing covers basic jobs that use filtering, tokenization, or normalization methods to process text. It is often a pre-processing step for later analysis.

1959 questions
9
votes
2 answers

Deleting the last line of a file with Java

I have a .txt file, which I want to process in Java. I want to delete its last line. I need ideas on how to achieve this without having to copy the entire content into another file and ignoring the last line. Any suggestions?
Sergiu
  • 2,502
  • 6
  • 35
  • 57
9
votes
1 answer

AssertionError: Some objects had attributes which were not restored

I was training a basic LSTM on text prediction by following the official TensorFlow site here. I had managed to train my model up to 40 epochs on a GTX 1050ti and had saved the checkPoint file in a separate folder. However, when I now try to restore…
neel g
  • 1,138
  • 1
  • 11
  • 25
9
votes
6 answers

How do I identify language of a text document in Java?

Is there an existing Java library that could tell me whether a String contains English language text or not (e.g. I need to be able to distinguish French or Italian text -- the function needs to return false for French and Italian, and true for…
anonymous coward
9
votes
5 answers

How to remove YAML frontmatter from markdown files?

I have markdown files that contain YAML frontmatter metadata, like this: --- title: Something Somethingelse author: Somebody Sometheson --- But the YAML is of varying widths. Can I use a POSIX command like sed to remove that frontmatter when it's…
Jonathan
  • 10,571
  • 13
  • 67
  • 103
9
votes
2 answers

Getting the basic form of the English word

I am trying to get the base form of an English word that has been modified from it. This question has been asked here, but I didn't see a proper answer, so I am trying to put it this way. I tried 2 stemmers and one lemmatizer from…
Gunjan
  • 2,775
  • 27
  • 30
9
votes
6 answers

Optimize a list of text additions and deletions

I've got a list containing positions of text additions and deletions, like this: Type Position Text/Length 1. + 2 ab // 'ab' was added at position 2 2. + 1 cde // 'cde' was added at position…
Harmen
  • 22,092
  • 4
  • 54
  • 76
9
votes
2 answers

Identifying verb tenses in python

How can I use Python + NLTK to identify whether a sentence refers to the past/present/future? Can I do this using only POS tagging? That seems a bit inaccurate; it seems to me that I need to consider the sentence context and not only the words…
JohnTortugo
  • 6,356
  • 7
  • 36
  • 69
9
votes
2 answers

TFIDF calculating confusion

I found the following code on the internet for calculating TFIDF: https://github.com/timtrueman/tf-idf/blob/master/tf-idf.py I added "1+" in the function def idf(word, documentList) so I won't get a division-by-zero error: return…
badc0re
  • 3,333
  • 6
  • 30
  • 46
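
A minimal sketch of the kind of smoothing the excerpt describes, not the code from the linked script. The excerpt is cut off before the actual return statement, so the placement of the "1+" in the denominator is an assumption, as are the helper names `tf` and `tfidf` and the whitespace tokenization.

```python
import math


def tf(word, document):
    """Term frequency: share of whitespace-separated tokens equal to `word`."""
    tokens = document.lower().split()
    if not tokens:
        return 0.0
    return tokens.count(word.lower()) / len(tokens)


def idf(word, document_list):
    """Smoothed inverse document frequency.

    Assumption: the "1+" sits in the denominator, so a word that appears
    in no document does not cause a division by zero.
    """
    containing = sum(
        1 for doc in document_list if word.lower() in doc.lower().split()
    )
    return math.log(len(document_list) / (1 + containing))


def tfidf(word, document, document_list):
    """Product of term frequency and smoothed inverse document frequency."""
    return tf(word, document) * idf(word, document_list)
```

One side effect of this form of smoothing: a word that occurs in every document gets a slightly negative IDF, because the denominator exceeds the number of documents.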
9
votes
7 answers

sed how to delete first 17 lines and last 8 lines in a file

I have a big 150GB CSV file and I would like to remove the first 17 lines and the last 8 lines. I have tried the following, but it doesn't seem to work right: sed -i -n -e :a -e '1,8!{P;N;D;};N;ba' and sed -i '1,17d' I wonder if someone can…
Deano
  • 11,582
  • 18
  • 69
  • 119
8
votes
4 answers

phonetic spelling in Python and Java

I am trying to build a system that accepts text and outputs the phonetic spelling of the words of this text. Any ideas on what libraries can be used in Python and Java?
pacodelumberg
  • 2,214
  • 4
  • 25
  • 32
8
votes
1 answer

How to calculate tag-wise precision and recall for POS tagger?

I am using some rule-based and statistical POS taggers to tag a corpus (of around 5000 sentences) with Parts of Speech (POS). Following is a snippet of my test corpus, where each word is separated from its respective POS tag by '/'. No/RB ,/, it/PRP…
stressed_geek
  • 2,118
  • 8
  • 33
  • 45
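
A rough sketch of tag-wise precision and recall for the `word/TAG` format shown in the excerpt. It assumes the gold and predicted corpora are tokenized identically and that every token contains a `/`; the function names `parse_tagged` and `per_tag_scores` are made up for illustration.

```python
from collections import Counter


def parse_tagged(line):
    """Split 'word/TAG' tokens; rsplit keeps any slash inside the word intact."""
    return [tuple(token.rsplit("/", 1)) for token in line.split()]


def per_tag_scores(gold_line, predicted_line):
    """Return {tag: (precision, recall)} comparing predicted tags to gold tags."""
    gold = parse_tagged(gold_line)
    pred = parse_tagged(predicted_line)
    if len(gold) != len(pred):
        raise ValueError("gold and predicted token counts differ")

    true_pos, gold_count, pred_count = Counter(), Counter(), Counter()
    for (_, gold_tag), (_, pred_tag) in zip(gold, pred):
        gold_count[gold_tag] += 1
        pred_count[pred_tag] += 1
        if gold_tag == pred_tag:
            true_pos[gold_tag] += 1

    scores = {}
    for tag in set(gold_count) | set(pred_count):
        # Precision: correct predictions of this tag / all predictions of it.
        precision = true_pos[tag] / pred_count[tag] if pred_count[tag] else 0.0
        # Recall: correct predictions of this tag / all gold occurrences of it.
        recall = true_pos[tag] / gold_count[tag] if gold_count[tag] else 0.0
        scores[tag] = (precision, recall)
    return scores
```

Calling `per_tag_scores(gold_text, tagger_output)` on whole-corpus strings gives one `(precision, recall)` pair per tag; a tag that was never predicted (or never occurs in the gold data) is reported as 0.0 here rather than left undefined, which is a design choice of this sketch.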
8
votes
3 answers

How to replace only last match in a line with sed?

With sed, I can replace the first match in a line using sed 's/pattern/replacement/' And all matches using sed 's/pattern/replacement/g' How do I replace only the last match, regardless of how many matches there are before it?
Jan Warchoł
  • 1,063
  • 1
  • 9
  • 22
8
votes
4 answers

difference between similar() and concordance in nltk

I have read about text1.similar("monstrous") and text1.concordance("monstrous") from this, but I couldn't get a satisfactory answer on the difference between text1.concordance('monstrous') and text1.similar('monstrous') of natural language…
dex
  • 121
  • 2
  • 5
8
votes
3 answers

Classify words with the same meaning

I have 50,000 subject lines from emails and I want to classify the words in them based on synonyms or words that can be used instead of others. For example: Top sales! Best sales I want them to be in the same group. I built the following function…
dapo
  • 697
  • 1
  • 12
  • 22
8
votes
0 answers

How to filter a text I/O stream in Python

Given a text I/O stream (e.g. from open() or StringIO()), how do I create another stream that filters out lines that match a certain pattern, without reading the entire input stream first? I know that I can easily get an iterable with (line for line…
Uri Granta
  • 1,814
  • 14
  • 25
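
One possible approach to the question above, as a sketch only: wrap the source in a small `io.TextIOBase` subclass that pulls one line at a time and skips matches. The class name `FilteredTextStream` is hypothetical, and the wrapper is read-only and assumes the source supports `readline()`.

```python
import io
import re


class FilteredTextStream(io.TextIOBase):
    """Read-only text stream that lazily drops lines matching `pattern`."""

    def __init__(self, source, pattern):
        self._source = source
        self._pattern = re.compile(pattern)
        self._buffer = ""  # unread remainder of the current kept line

    def readable(self):
        return True

    def _next_line(self):
        # Pull lines from the source until one does not match, or EOF ("").
        while True:
            line = self._source.readline()
            if not line or not self._pattern.search(line):
                return line

    def readline(self, size=-1):
        if not self._buffer:
            self._buffer = self._next_line()
        if size is None or size < 0:
            line, self._buffer = self._buffer, ""
        else:
            line, self._buffer = self._buffer[:size], self._buffer[size:]
        return line

    def read(self, size=-1):
        remaining = size if size is not None and size >= 0 else None
        chunks = []
        while remaining is None or remaining > 0:
            piece = self.readline(-1 if remaining is None else remaining)
            if not piece:
                break
            chunks.append(piece)
            if remaining is not None:
                remaining -= len(piece)
        return "".join(chunks)
```

Because iteration on `io` base classes goes through `readline()`, `for line in FilteredTextStream(open(path), r"^#")` reads the underlying file one line at a time instead of loading it whole.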