
I have an input string that uses $$$Field$$$ as a delimiter. The string spans several lines. I need to return a list of all the items in the string, separated by $$$Field$$$ only.

In the example below, I should receive ['Food', 'Fried\nChicken', 'Banana'] as output. However, it seems that pandas is interpreting the newlines as separators as well, so instead of a list I am getting a table. How can I ignore those newlines so that I just get a list back?

import pandas as pd
from io import StringIO  # pandas.compat.StringIO is not available in newer pandas versions

temp = u"""Food$$$Field$$$Fried
Chicken$$$Field$$$Banana"""
df = pd.read_csv(StringIO(temp), sep=r'\$\$\$Field\$\$\$', engine='python')
print(df)

The only reason I am using pandas is that this string is actually a huge .csv file. I cannot read it all into memory at once, but streaming processing would be acceptable.

  • Remove the undesired `\n` from the input itself, using `temp = "".join(temp.split("\n"))` – ZdaR Mar 01 '17 at 14:48
  • We want to keep all the \n, but as a part of the string, like "Fried\nChicken" in the example. –  Mar 01 '17 at 14:54
  • What does your desired DataFrame look like? New line characters are a default line delimiter for tabular files so there needs to be a way to distinguish if it is a line delimiter or to be kept in the string. – victorlin Mar 01 '17 at 15:12
  • the DataFrame should look like `['Food', 'Fried\nChicken', 'Banana']`. We don't want to use line delimiters, all new lines should be kept in the string –  Mar 01 '17 at 15:20

2 Answers


Since you are not looking to store your information in a tabular format, I don't think a DataFrame is necessary. Instead, read your input in chunks and yield the buffered text each time '$$$Field$$$' is encountered.

Adapted from https://stackoverflow.com/a/16260159/4410590:

def myreadlines(f, newline):
    buf = ""
    while True:
        while newline in buf:
            pos = buf.index(newline)
            yield buf[:pos]
            buf = buf[pos + len(newline):]
        chunk = f.read(4096)
        if not chunk:
            yield buf
            break
        buf += chunk

Then call the function:

for x in myreadlines(StringIO(temp), '$$$Field$$$'):
    print(repr(x))

'Food'
'Fried\nChicken'
'Banana'
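
Since the real input is a large file rather than a string, the same idea can be pointed at a file handle. The following is a minimal sketch, not a definitive implementation: `iter_fields` is a hypothetical name for the generator above (restated so the block runs on its own), the `chunk_size` parameter is an added assumption, and a small temporary file stands in for the actual .csv.

```python
import os
import tempfile

DELIMITER = "$$$Field$$$"


def iter_fields(f, delimiter, chunk_size=4096):
    """Yield delimiter-separated fields from file object f, reading chunk_size characters at a time."""
    buf = ""
    while True:
        # Emit every complete field currently held in the buffer.
        while delimiter in buf:
            pos = buf.index(delimiter)
            yield buf[:pos]
            buf = buf[pos + len(delimiter):]
        chunk = f.read(chunk_size)
        if not chunk:
            # End of input: whatever remains is the last field.
            yield buf
            break
        buf += chunk


# Write the sample data to a temporary file (a stand-in for the real, large file).
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False, newline="") as tmp:
    tmp.write("Food$$$Field$$$Fried\nChicken$$$Field$$$Banana")
    path = tmp.name

# newline="" turns off newline translation, so embedded line breaks are kept as-is.
# A tiny chunk_size is used here only to exercise delimiters that straddle chunks.
with open(path, newline="") as f:
    # list() is for demonstration; iterate directly to keep memory use low.
    fields = list(iter_fields(f, DELIMITER, chunk_size=8))

os.remove(path)
print(fields)
```

Memory use is bounded by the longest single field plus one chunk, because the buffer only grows until the next delimiter is found.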
victorlin

This should do what you want; just scale it to multiple lines:

df = pd.DataFrame("""Food$$$Field$$$Fried
Chicken$$$Field$$$Banana""".split("$$$Field$$$")).T

print(df)

Depending on where (and how) your text is stored, you can do the splitting in a list comprehension:

df = pd.DataFrame([line.split("$$$Field$$$") for line in lines]).T
parsethis
  • Sounds good. And if instead of a string it was a HUGE .csv file with that contains that text? I just used a string to make the problem easier to understand. –  Mar 01 '17 at 15:33
  • what is the delimiter in the csv file, a comma? – parsethis Mar 01 '17 at 15:36
  • If OP is trying to avoid reading the string into memory, you cannot call `split` on the entire string without reading it all into memory. – victorlin Mar 01 '17 at 15:36
  • @victor where is that a requirement? edit: sorry I didnt see that part of the comment – parsethis Mar 01 '17 at 15:37
  • the delimiter in the csv file is "$$$Field$$$" –  Mar 01 '17 at 15:38