I am reading a text file whose contents were copied from a CSV file. When I read the file in Python, I get a lot of unnecessary repeated lines, as seen below. How can I strip away the three unwanted lines, as well as the \cf0 at the beginning and the \cell \row at the end of each text?

Or should I read the text directly from the CSV file itself? The text is in just one of the columns of the CSV file.

\itap1\trowd \taflags1 \trgaph108\trleft-108 \trbrdrl\brdrnil \trbrdrr\brdrnil 

\clvertalc \clshdrawnil \clbrdrt\brdrs\brdrw20\brdrcf2 \clbrdrl\brdrs\brdrw20\brdrcf2 \clbrdrb\brdrs\brdrw20\brdrcf2 \clbrdrr\brdrs\brdrw20\brdrcf2 \clpadl100 \clpadr100 \gaph\cellx8640

\pard\intbl\itap1\pardeftab720

\cf0 i have been using your product and it has been helping me a lot to solve business problem,\cell \row



\itap1\trowd \taflags1 \trgaph108\trleft-108 \trbrdrl\brdrnil \trbrdrr\brdrnil 

\clvertalc \clshdrawnil \clbrdrt\brdrs\brdrw20\brdrcf2 \clbrdrl\brdrs\brdrw20\brdrcf2 \clbrdrb\brdrs\brdrw20\brdrcf2 \clbrdrr\brdrs\brdrw20\brdrcf2 \clpadl100 \clpadr100 \gaph\cellx8640

\pard\intbl\itap1\pardeftab720

\cf0 I am very happy with your products. Very easy to use.\cell \row



\itap1\trowd \taflags1 \trgaph108\trleft-108 \trbrdrl\brdrnil \trbrdrr\brdrnil 

\clvertalc \clshdrawnil \clbrdrt\brdrs\brdrw20\brdrcf2 \clbrdrl\brdrs\brdrw20\brdrcf2 \clbrdrb\brdrs\brdrw20\brdrcf2 \clbrdrr\brdrs\brdrw20\brdrcf2 \clpadl100 \clpadr100 \gaph\cellx8640

\pard\intbl\itap1\pardeftab720

\cf0 Many improvements with income tracker, and other time saving elements.  Newer look, easier navigation.  I believe there definitely is a time savings from past versions.\cell \row

Here is a snippet of the CSV file:

page_url       Review_title   Product_id  Rating Publish_date  Review_Description
www.blabla.com  Great!         777777       5        01/01/14    Excellent upgrade! Was not disappointed!

I copied only the text from the Review_Description column and pasted it all into a text file.

Here is my Python code, which just reads the file:

text_file=open("my_text.txt", "r")
lines=text_file.readlines()
print lines
jxn
  • Yes, it's probably better to skip rows and columns in the CSV than to parse the CSV into some other form and then try to recover the original structure you threw away so you can skip parts of it. Can you show us a fragment of the CSV, and your current parsing code? – abarnert Jan 09 '14 at 23:15
  • Have included snippet of the csv file and my simple read file code. – jxn Jan 09 '14 at 23:28

1 Answer

Your real problem here appears to be that you pasted the CSV into an RTF file, not a text file. Pasting into WordPad on Windows or TextEdit on Mac (especially if you copied from, say, Excel or Numbers) and saving without explicitly choosing "save as plain text" or "convert to plain text" will generally "help" you this way automatically.

While you could try to parse the RTF to recover the original text, you're much better off just using the original text if possible. Parsing CSV files in Python is very easy, whether you use pandas or the stdlib's csv module.
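If the original CSV really is unavailable, the RTF route could look roughly like the sketch below. It is not a real RTF parser: it assumes every review sits on a line shaped like the ones in the question (text after `\cf0 ` and before `\cell \row`), and it does not decode RTF escapes such as `\'92`. The names `CELL_TEXT` and `extract_reviews` are made up for illustration.

```python
import re

# Assumption: each review is wrapped as "\cf0 <text>\cell \row",
# exactly as in the sample pasted in the question.
CELL_TEXT = re.compile(r"\\cf0 (.*?)\\cell\s*\\row", re.DOTALL)

def extract_reviews(rtf_text):
    """Return the text found between each \\cf0 and \\cell \\row pair."""
    return [match.strip() for match in CELL_TEXT.findall(rtf_text)]

# Small stand-in for the contents of my_text.txt.
sample = (
    "\\pard\\intbl\\itap1\\pardeftab720\n"
    "\n"
    "\\cf0 I am very happy with your products. Very easy to use.\\cell \\row\n"
)
print(extract_reviews(sample))
```

With the real file, you would pass the result of reading `my_text.txt` as one string instead of `sample`. Anything the pattern does not anticipate (other colour codes, nested tables, escaped characters) will be missed or left raw, which is why reading the CSV directly is the safer choice.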

For example, your file appears to use tabs as delimiters, and no other non-default features. If so:

import csv
with open('my_csv.csv', 'rb') as f:
    reader = csv.DictReader(f, delimiter='\t')
    reviews = [row['Review_Description'] for row in reader]

Now you have a list of all the reviews, and can do anything you want with them. If you just want to print them out, it's even simpler:

import csv
with open('my_csv.csv', 'rb') as f:
    reader = csv.DictReader(f, delimiter='\t')
    for row in reader:
        print row['Review_Description']
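The snippets above are written for Python 2 (binary mode `'rb'` and the `print` statement). On Python 3 the csv module expects a file opened in text mode with `newline=''`, and `print` is a function. A minimal sketch of the same idea for Python 3 follows, still assuming a tab-delimited file with a `Review_Description` header; `read_reviews` is a hypothetical helper name, and the in-memory sample stands in for the real file.

```python
import csv
import io

def read_reviews(f, column="Review_Description"):
    """Collect one column from a tab-delimited file object opened in text mode."""
    reader = csv.DictReader(f, delimiter="\t")
    return [row[column] for row in reader]

# In-memory stand-in for the real file, built from the snippet in the question.
# With a file on disk, use: open("my_csv.csv", newline="")
sample = io.StringIO(
    "page_url\tReview_title\tProduct_id\tRating\tPublish_date\tReview_Description\n"
    "www.blabla.com\tGreat!\t777777\t5\t01/01/14\tExcellent upgrade! Was not disappointed!\n"
)
reviews = read_reviews(sample)
print(reviews)
```

If the file turns out to be comma-separated rather than tab-separated, dropping the `delimiter` argument falls back to the csv module's default.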
abarnert