Python read text file with newline and paragraph separated elements

Question

I am trying to read a text file to a nested list in Python. That is, I would like to have the output as:

[[$5.79, Breyers Ice Cream, Homemade Vanilla, 48 oz], [$6.39, Haagen-dazs, Vanilla Bean Ice Cream, 1 pt], etc...]]

The ultimate goal is to read the information into a pandas DataFrame for some exploratory analysis.

The Data (in a .txt file)

$5.79  
Breyers Ice Cream  
Homemade Vanilla  
48 oz

$6.39  
Haagen-dazs  
Vanilla Bean Ice Cream  
1 pt

$6.89  
So Delicious  
Dairy Free Coconutmilk No Sugar Added Dipped Vanilla Bars  
4 x 2.3 oz

$5.79  
Popsicle Fruit Pops Mango  
12 ct

What I've Tried

with open('sample.txt') as f:
    creams = f.read()

creams = creams.split("\n\n")

However, this returns:

['$5.79\nBreyers Ice Cream\nHomemade Vanilla\n48 oz', '$6.39\nHaagen-dazs\nVanilla Bean Ice Cream\n1 pt',

I have also tried utilizing list comprehension methods that look cleaner than the above code, but these attempts handle the newlines, not the paragraphs or returns. For example:

[x for x in open('<file_name>.txt').read().splitlines()]  
#Gives
['$5.79', 'Breyers Ice Cream', 'Homemade Vanilla', '48 oz', '', '$6.39', 'Haagen-dazs', 'Vanilla Bean Ice Cream', '1 pt', '', '

I know I would need to nest a list within the list comprehension, but I'm unsure how to perform the split.

Note: This is my first posted question, sorry for the length or lack of brevity. Seeking help because there are similar questions but not with the outcome I desire.

jeremy_rutman · Accepted Answer · 2020-01-04T20:18:31.970

4

You are nearly there once you have the four-line groups separated. All that's left is to split the groups again by a single newline.

with open('creams.txt','r') as f:
    creams = f.read()

creams = creams.split("\n\n")
creams = [lines.split('\n') for lines in creams]
print(creams)
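
The data as posted has trailing spaces on most lines, and a file often ends with a newline or a whitespace-only line, so a plain split can leave stray spaces or empty strings in the result. Below is a minimal sketch of one way to guard against that, assuming the same blank-line-separated layout. The helper name `parse_records` and the inline `sample` string are only for illustration and are not part of the original answer.

```python
import re

def parse_records(text):
    """Split text into records on blank lines, then into stripped lines."""
    # A "blank" line may still contain spaces, so split on a newline,
    # optional whitespace, and another newline instead of a literal "\n\n".
    blocks = re.split(r"\n\s*\n", text.strip())
    # Strip each line and drop any that end up empty.
    return [[line.strip() for line in block.splitlines() if line.strip()]
            for block in blocks if block.strip()]

# Illustrative sample with trailing spaces and a whitespace-only last line.
sample = (
    "$5.79  \nBreyers Ice Cream  \nHomemade Vanilla  \n48 oz\n"
    "\n"
    "$6.39  \nHaagen-dazs  \nVanilla Bean Ice Cream  \n1 pt\n \n"
)
records = parse_records(sample)
print(records)
```

With a real file, the text would come from `f.read()` as in the code above.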

edited Jan 04 '20 at 20:18

answered Jan 04 '20 at 19:54

jeremy_rutman


Thanks, I don't have enough reputation to upvote but if I did I would! – ExpertPomegranate Jan 05 '20 at 17:46

score 0 · Answer 2 · answered Jan 04 '20 at 20:03

You just have to split it again.

with open('sample.txt','r') as file:
    creams = file.read()

creams = creams.split("\n\n")
creams = [lines.split('\n') for lines in creams]

print(creams)
#[['$5.79  ', 'Breyers Ice Cream  ', 'Homemade Vanilla  ', '48 oz'], ['$6.39  ', 'Haagen-dazs  ', 'Vanilla Bean Ice Cream  ', '1 pt'], ['$6.89  ', 'So Delicious  ', 'Dairy Free Coconutmilk No Sugar Added Dipped Vanilla Bars  ', '4 x 2.3 oz'], ['$5.79  ', 'Popsicle Fruit Pops Mango', '-', '12 ct']]

# Convert to a DataFrame
import pandas as pd
df = pd.DataFrame(creams, columns=['Amnt', 'Brand', 'Flavor', 'Qty'])

      Amnt                      Brand  \
0  $5.79          Breyers Ice Cream     
1  $6.39                Haagen-dazs     
2  $6.89               So Delicious     
3  $5.79    Popsicle Fruit Pops Mango   

                                              Flavor         Qty  
0                                 Homemade Vanilla         48 oz  
1                           Vanilla Bean Ice Cream          1 pt  
2  Dairy Free Coconutmilk No Sugar Added Dipped V...  4 x 2.3 oz  
3                                                  -       12 ct

Note: I have added - in the last row for the Flavor column, as it was empty. If your original dataset has missing fields like this, you must take them into consideration before performing any analysis.
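
Instead of editing the data by hand, short records could be padded in code. This is only a sketch under a stated assumption: the first line of each record is the price and the last line is the quantity, so a missing field belongs somewhere in between. The helper name `pad_record` and its parameters are hypothetical.

```python
def pad_record(record, width=4, fill=None):
    """Return a copy of record padded to `width` items."""
    # Assumption: the last item is always the quantity, so the filler
    # is inserted just before it (here, in the Flavor position).
    padded = list(record)
    while len(padded) < width:
        padded.insert(len(padded) - 1, fill)
    return padded

creams = [
    ['$5.79', 'Breyers Ice Cream', 'Homemade Vanilla', '48 oz'],
    ['$5.79', 'Popsicle Fruit Pops Mango', '12 ct'],
]
rows = [pad_record(r) for r in creams]
print(rows)
```

Every padded row has four items, so `rows` can be passed to `pd.DataFrame` with the four column names used above. If the assumption about which field is missing does not hold for your data, the filler will land in the wrong column.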

This worked great. I had a ValueError initially, "4 columns passed, passed data had 5 columns." The last nested list contains a ' ' element. I'll look into ways for marking the end of the text. Added a Misc column which I can drop. Cheers. — ExpertPomegranate, Jan 05 '20 at 17:42
