Pandas: ignore new lines as separators in read_csv

Question

I have an input string that uses $$$Field$$$ as its delimiter. The string spans several lines. I need to return a list of all the items in the string, separated by $$$Field$$$ only.

In the example below I should receive ['Food', 'Fried\nChicken', 'Banana'] as output. However, it seems that read_csv is interpreting the newlines as separators as well, so instead of a list I am getting a table. How can I ignore those newlines so that I just get a list back?

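import pandas as pd
from io import StringIO  # pandas.compat.StringIO is no longer available in current pandas

temp = u"""Food$$$Field$$$Fried
Chicken$$$Field$$$Banana"""
# Raw string so the escaped dollar signs reach the regex engine unchanged
df = pd.read_csv(StringIO(temp), sep=r'\$\$\$Field\$\$\$', engine='python')
print(df)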

The only reason I am using pandas is that this string is actually a huge .csv file. I cannot read all of it into memory at once, but streaming processing would be acceptable.

Remove the undesired `\n` from the input itself, using `temp = "".join(temp.split("\n"))` — ZdaR, Mar 01 '17 at 14:48
We want to keep all the \n, but as a part of the string, like "Fried\nChicken" in the example. — , Mar 01 '17 at 14:54
What does your desired DataFrame look like? New line characters are a default line delimiter for tabular files so there needs to be a way to distinguish if it is a line delimiter or to be kept in the string. — victorlin, Mar 01 '17 at 15:12
the DataFrame should look like `['Food', 'Fried\nChicken', 'Banana']`. We don't want to use line delimiters, all new lines should be kept in the string — , Mar 01 '17 at 15:20

score 1 · Accepted Answer · edited May 23 '17 at 12:17

Since you are not looking to store your information in a tabular format, I don't think a DataFrame is necessary. Instead, read the input in chunks and yield the buffered text each time '$$$Field$$$' is encountered.

Adapted from https://stackoverflow.com/a/16260159/4410590:

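def myreadlines(f, newline):
    """Yield pieces of file-like object f, split on the string newline."""
    buf = ""
    while True:
        # Emit every complete piece that is already in the buffer
        while newline in buf:
            pos = buf.index(newline)
            yield buf[:pos]
            buf = buf[pos + len(newline):]
        chunk = f.read(4096)
        if not chunk:
            # End of input: the remaining text is the last piece
            yield buf
            break
        # A delimiter cut off by the previous read is completed here
        buf += chunk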

Then call the function:

for x in myreadlines(StringIO(temp), '$$$Field$$$'):
    print(repr(x))

This prints (on Python 3):

'Food'
'Fried\nChicken'
'Banana'
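
For the huge-file case mentioned in the question, the same idea can be pointed at an open file handle instead of a StringIO. The following is only a sketch under assumptions: the names `iter_fields` and `fields_from_text`, the `chunk_size` parameter, and the temporary file are illustrative and not part of the original answer. A deliberately tiny chunk size is used to show that a delimiter straddling two reads is still found, because the unfinished tail stays in the buffer until the next read completes it.

```python
import os
import tempfile


def iter_fields(f, delimiter, chunk_size=4096):
    """Yield the pieces of text file `f` that are separated by `delimiter`.

    Only the current unfinished field plus one chunk is buffered, so
    newlines inside a field are preserved and the file is read lazily.
    """
    buf = ""
    while True:
        # Emit every complete field currently in the buffer.
        while delimiter in buf:
            pos = buf.index(delimiter)
            yield buf[:pos]
            buf = buf[pos + len(delimiter):]
        chunk = f.read(chunk_size)
        if not chunk:
            # End of input: whatever is left is the last field.
            yield buf
            break
        buf += chunk


def fields_from_text(text, delimiter, chunk_size):
    """Hypothetical helper: write `text` to a temp file and stream it back."""
    with tempfile.NamedTemporaryFile(
        "w", suffix=".csv", delete=False, newline=""
    ) as tmp:
        tmp.write(text)
        path = tmp.name
    try:
        # newline="" keeps "\n" untranslated on every platform.
        with open(path, newline="") as f:
            return list(iter_fields(f, delimiter, chunk_size))
    finally:
        os.remove(path)


sample = "Food$$$Field$$$Fried\nChicken$$$Field$$$Banana"
print(fields_from_text(sample, "$$$Field$$$", chunk_size=5))
# ['Food', 'Fried\nChicken', 'Banana']
```

In real use the helper would be unnecessary: open the actual file once and iterate over `iter_fields(f, '$$$Field$$$')` directly, processing each field as it arrives.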

answered Mar 01 '17 at 15:47 by victorlin · edited May 23 '17 at 12:17 by Community

Nice. What if the chunk of bytes read cuts off the delimiter? E.g. the last bytes in the chunk could be "....Field$" – parsethis Mar 01 '17 at 15:50
@putonspectacles good catch. Updated with a better function to handle that. – victorlin Mar 01 '17 at 16:02

parsethis · Answer 2 · 2017-03-01T15:29:59.647

0

Well, this should do what you want; just scale it to multiple lines:

df = pd.DataFrame("""Food$$$Field$$$Fried
Chicken$$$Field$$$Banana""".split("$$$Field$$$")).T

print(df)

Depending on where (and how) your text is stored, you can just do the splitting in a list comprehension:

df = pd.DataFrame([line.split("$$$Field$$$") for line in lines]).T
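
The list-comprehension idea can be checked without pandas. In this sketch `records` is a made-up list of record strings (an assumption, since the answer does not say how `lines` is produced), and each record is split on the delimiter only, so embedded newlines stay inside their field. The result is a plain list of lists, which is the kind of nested list `pd.DataFrame` accepts.

```python
DELIMITER = "$$$Field$$$"

# Hypothetical in-memory records; each may contain newlines inside a field.
records = [
    "Food$$$Field$$$Fried\nChicken$$$Field$$$Banana",
    "Drink$$$Field$$$Iced\nTea$$$Field$$$Water",
]

# str.split treats the delimiter literally, so "$" needs no regex escaping.
rows = [record.split(DELIMITER) for record in records]

for row in rows:
    print(row)
```

Note that this approach still needs every record in memory at once, so it does not address the streaming requirement from the question.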

answered Mar 01 '17 at 15:24 by parsethis · edited Mar 01 '17 at 15:29

Sounds good. And what if, instead of a string, it was a HUGE .csv file that contains that text? I just used a string to make the problem easier to understand. – Mar 01 '17 at 15:33
what is the delimiter in the csv file, a comma? – parsethis Mar 01 '17 at 15:36
If OP is trying to avoid reading the string into memory, you cannot call `split` on the entire string without reading it all into memory. – victorlin Mar 01 '17 at 15:36
@victor where is that a requirement? Edit: sorry, I didn't see that part of the comment – parsethis Mar 01 '17 at 15:37
the delimiter in the csv file is "$$$Field$$$" – Mar 01 '17 at 15:38
