Reading chunked file into dataframe

Question

I am fairly new to pandas/r, and I'm not quite sure how to go about reading this data into pandas or r for analysis.

Currently, I was thinking I could use readr's read_chunkwise, or pandas's chunksize, but this may not be what I need. Is this really something that is easily solved with a for loop or using purr to iterate over all the elements?

Data:

wine/name: 1981 Ch&#226;teau de Beaucastel Ch&#226;teauneuf-du-Pape
wine/wineId: 18856 
wine/variant: Red Rhone Blend 
wine/year: 1981 
review/points: 96   
review/time: 1160179200   
review/userId: 1 
review/userName: Eric 
review/text: Olive, horse sweat, dirty saddle, and smoke. This actually got quite a bit more spicy and expressive with significant aeration. This was a little dry on the palate first but filled out considerably in time, lovely, loaded with tapenade, leather, dry and powerful, very black olive, meaty. This improved considerably the longer it was open. A terrific bottle of 1981, 96+ and improving. This may well be my favorite vintage of Beau except for perhaps the 1990.

wine/name: 1995 Ch&#226;teau Pichon-Longueville Baron 
wine/wineId: 3495 wine/variant: Red Bordeaux Blend 
wine/year: 1995 
review/points: 93 
review/time: 1063929600 
review/userId: 1 
review/userName: Eric 
review/text: A remarkably floral nose with violet and chambord. On the palate this is super sweet and pure with a long, somewhat searing finish. My notes are very terse, but this was a lovely wine.

Currently, this is what I have as a function, but I am running into an error:

>

 convertchunkfile <- function(df){   for(i in 1:length(df)){
>     #While the length of any line is not 0, process it with the following loop
>     while(nchar(df[[i]]) != 0){
>       case_when(
>         
>         #When data at x index == wine/name, then extract the data after that clause
>         #Wine Name parsing
>         cleandf$WineName[[i]] <- df[i] == str_sub(df[1],0, 10) ~ str_trim(substr(df[1], 11, nchar(df[1]))),
>         #Wine ID parsing
>         cleandf$WineID[[i]] <- df[i] == str_sub(df[2],0,11) ~ str_trim(substr(df[2], 13, nchar(df[1])))
>         #same format for other attributes
>       )
>     }   
>    }
>  } 

Error in cleandf$BeerName[[i]] <- df[i] == str_sub(df[1], 0, 10) ~ str_trim(substr(df[1],  : 
  more elements supplied than there are to replace

EDIT:

After working through some issues, I think this is probably the best solution, drawing from @hereismyname's solution:

#Use Bash's iconv to force convert the file in OS X
iconv -c -t UTF-8 cellartracker-clean.txt > cellartracker-iconv.txt

#Check number of lines within the file
wc -l cellartracker-iconv.txt
20259950 cellartracker-iconv.txt

#Verify new encoding of the file
file -I cellartracker-clean.txt


ReadEmAndWeep <- function(file, chunk_size) {
  f <- function(chunk, pos) {
    data_frame(text = chunk) %>%
      filter(text != "") %>%
      separate(text, c("var", "value"), ":", extra = "merge") %>%
      mutate(
        chunk_id = rep(1:(nrow(.) / 9), each = 9),
        value = trimws(value)
      ) %>%
      spread(var, value)
  }

  read_lines_chunked(file, DataFrameCallback$new(f), chunk_size = chunk_size)
}

#Final Function call to read in the file
dataframe <- ReadEmAndWeep(file, chunk_size = 100000)

You seem to have some lines with multiple variables (eg: `wine/year: 1981 review/points: 96 `). Is that correct? — A5C1D2H2I1M1N2O1R2T1, Mar 27 '17 at 06:03
thanks for pointing that out, it was actually misformatted. I have update the data in the question. Thanks, — petergensler, Mar 27 '17 at 18:00

score 2 · Accepted Answer · answered Mar 09 '17 at 02:18

2

Here's a method that's fairly idiomatic in R:

library(readr)
library(tidyr)
library(dplyr)

out <- data_frame(text = read_lines(the_text)) %>%
  filter(text != "") %>% 
  separate(text, c("var", "value"), ":", extra = "merge") %>% 
  mutate(
    chunk_id = rep(1:(nrow(.) / 9), each = 9),
    value    = trimws(value)
  ) %>% 
  spread(var, value)

answered Mar 09 '17 at 02:18

Andrew

152
8

Can you comment on what the chunk id line is doing in this? I'm not sure I understand what is going on with this line of code. – petergensler Mar 27 '17 at 18:01
@petergensler the `chunk_id` variable is just there to create a unique record number for each review. Each review comes in as a set of 9 lines of text, but the code converts each line into a column. – Andrew Mar 28 '17 at 19:10
Why use the rep() / 9 ? That part is what is throwing me off. – petergensler Mar 28 '17 at 19:24

score 1 · Answer 2 · answered Mar 07 '17 at 20:46

Here is some code to read those records into a pandas.DataFrame. These records are structured like a yaml record so this code uses that fact to its advantage. A blank line is used as a record separator.

import pandas as pd
import collections
import yaml

def read_records(lines):
    # keep track of the columns in an ordered set
    columns = collections.OrderedDict()

    record = []
    records = []
    for line in lines:
        if line:
            # gather each line of text until a blank line
            record.append(line)

            # keep track of the columns seen in an ordered set
            columns[line.split(':')[0].strip()] = None

        # if the line is empty and we have a record, then convert it 
        elif record:

            # use yaml to convert the lines into a dict
            records.append(yaml.load('\n'.join(record)))
            record = []

    # record last record
    if record:
        records.append(yaml.load('\n'.join(record)))

    # return a pandas dataframe from the list of dicts
    return pd.DataFrame(records, columns=list(columns.keys()))

Test Code:

print(read_records(data))

Results:

                                           wine/name  wine/wineId  \
0  1981 Ch&#226;teau de Beaucastel Ch&#226;teaune...        18856   
1         1995 Ch&#226;teau Pichon-Longueville Baron         3495   

         wine/variant  wine/year  review/points  review/time  review/userId  \
0     Red Rhone Blend       1981             96   1160179200              1   
1  Red Bordeaux Blend       1995             93   1063929600              1   

  review/userName                                        review/text  
0            Eric  Olive, horse sweat, dirty saddle, and smoke. T...  
1            Eric  A remarkably floral nose with violet and chamb...

Test data:

data = [x.strip() for x in """
    wine/name: 1981 Ch&#226;teau de Beaucastel Ch&#226;teauneuf-du-Pape
    wine/wineId: 18856
    wine/variant: Red Rhone Blend
    wine/year: 1981
    review/points: 96
    review/time: 1160179200
    review/userId: 1
    review/userName: Eric
    review/text: Olive, horse sweat, dirty saddle, and smoke. This actually got quite a bit more spicy and expressive with significant aeration. This was a little dry on the palate first but filled out considerably in time, lovely, loaded with tapenade, leather, dry and powerful, very black olive, meaty. This improved considerably the longer it was open. A terrific bottle of 1981, 96+ and improving. This may well be my favorite vintage of Beau except for perhaps the 1990.

    wine/name: 1995 Ch&#226;teau Pichon-Longueville Baron
    wine/wineId: 3495
    wine/variant: Red Bordeaux Blend
    wine/year: 1995
    review/points: 93
    review/time: 1063929600
    review/userId: 1
    review/userName: Eric
    review/text: A remarkably floral nose with violet and chambord. On the palate this is super sweet and pure with a long, somewhat searing finish. My notes are very terse, but this was a lovely wine.
""".split('\n')[1:-1]]

Thanks for the attempt! I was trying this in r earlier with my attempt listed above, but I'm starting to wonder if pandas is truly the better tool for this task...let me know what you think of my code — petergensler, Mar 08 '17 at 03:16
Sadly, I have never delved into `R`, so I will be of no value there. Cheers. — Stephen Rauch, Mar 08 '17 at 03:27
No worries, I'm thinking I may need to look into pandas, as this R code is pretty complicated already....Does pandas have any options to read in chunks of data formatted like this? I've seen this with other data sources, and thought there would be an easier way of handling this. — petergensler, Mar 08 '17 at 03:29

A5C1D2H2I1M1N2O1R2T1 · Answer 3 · 2017-03-27T12:37:30.107

Here's the approach I would suggest:

y <- readLines("your_file")
y <- unlist(strsplit(gsub("(wine\\/|review\\/)", "~~~\\1", y), "~~~", TRUE))

library(data.table)
dcast(fread(paste0(y[y != ""], collapse = "\n"), header = FALSE)[
  , rn := cumsum(V1 == "wine/name")], rn ~ V1, value.var = "V2")

The only assumption made is that the first line for each new wine starts with wine/name. Blank lines and so on don't matter.

Here are two datasets to try this out with.

Substitute "your_file" in the first line of code with either url1 or url2 to try it out.

url1 <- "https://gist.githubusercontent.com/mrdwab/3db1f2d6bf75e9212d9e933ad18d2865/raw/7376ae59b201d57095f849cab079782efb8ac827/wines1.txt"

url2 <- "https://gist.githubusercontent.com/mrdwab/3db1f2d6bf75e9212d9e933ad18d2865/raw/7376ae59b201d57095f849cab079782efb8ac827/wines2.txt"

Note that the second dataset is missing a value for wine/variant: for the first wine.

It would probably be even better to do the gsub in awk or something like that and fread directly on that.

Reading chunked file into dataframe

3 Answers3