14

The Go code below reads in a 10,000-record CSV (of timestamps and float values), runs some operations on the data, and then writes the original values to another CSV along with an additional column for the score. However, it is terribly slow (i.e. hours, but most of that is calculateStuff()), and I'm curious whether there are any inefficiencies in the CSV reading/writing I can take care of.

package main

import (
  "encoding/csv"
  "log"
  "os"
  "strconv"
)

func ReadCSV(filepath string) ([][]string, error) {
  csvfile, err := os.Open(filepath)

  if err != nil {
    return nil, err
  }
  defer csvfile.Close()

  reader := csv.NewReader(csvfile)
  fields, err := reader.ReadAll()

  return fields, err
}

func main() {
  // load data csv
  records, err := ReadCSV("./path/to/datafile.csv")
  if err != nil {
    log.Fatal(err)
  }

  // write results to a new csv
  outfile, err := os.Create("./where/to/write/resultsfile.csv")
  if err != nil {
    log.Fatal("Unable to open output")
  }
  defer outfile.Close()
  writer := csv.NewWriter(outfile)

  for i, record := range records {
    time := record[0]
    value := record[1]

    // skip header row
    if i == 0 {
      writer.Write([]string{time, value, "score"})
      continue
    }

    // get float values
    floatValue, err := strconv.ParseFloat(value, 64)
    if err != nil {
      log.Fatal("Record: %v, Error: %v", floatValue, err)
    }

    // calculate scores; THIS EXTERNAL METHOD CANNOT BE CHANGED
    score := calculateStuff(floatValue)

    valueString := strconv.FormatFloat(floatValue, 'f', 8, 64)
    scoreString := strconv.FormatFloat(score, 'f', 8, 64)
    //fmt.Printf("Result: %v\n", []string{time, valueString, scoreString})

    writer.Write([]string{time, valueString, scoreString})
  }

  writer.Flush()
}

I'm looking for help making this CSV read/write template code as fast as possible. For the scope of this question we need not worry about the calculateStuff method.

BoltzmannBrain
  • How slow is 'terribly slow'? What is the file size of the csv file you are testing? Because the word 'slow' is subjective... – Rosdi Kasim Aug 15 '15 at 17:58
  • Assuming the calculation depends only on the current record (e.g. you don't need to total all values of a column, or sort, or some such), don't pull the whole file into memory; instead operate on each record as it's read, then write it out. Avoid things like `ioutil.ReadAll` when you can operate on a stream instead. In any case, unless you're swapping (due to using too much memory) you're likely IO bound. – Dave C Aug 15 '15 at 18:04
  • Also, make sure to check for errors!! E.g. https://play.golang.org/p/CBzz4lImW2 – Dave C Aug 15 '15 at 18:17
  • Also, if this is the entire program (i.e. if it's not part of a larger program) it makes more sense not to muck with file names but to just read from `os.Stdin` and write to `os.Stdout` and let the shell handle opening files (or piping output/input from/to other programs!) for you; see the sketch below. – Dave C Aug 15 '15 at 18:25
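A minimal sketch of that last suggestion: the processing step is just a placeholder, and the program simply streams CSV records from os.Stdin to os.Stdout so the shell handles opening the files.

package main

import (
  "encoding/csv"
  "io"
  "log"
  "os"
)

func main() {
  r := csv.NewReader(os.Stdin)
  w := csv.NewWriter(os.Stdout)
  for {
    rec, err := r.Read()
    if err == io.EOF {
      break
    }
    if err != nil {
      log.Fatal(err)
    }
    // placeholder: transform or score rec here before writing it out
    if err := w.Write(rec); err != nil {
      log.Fatal(err)
    }
  }
  w.Flush()
  if err := w.Error(); err != nil {
    log.Fatal(err)
  }
}

It would then be run as something like `go run main.go < datafile.csv > resultsfile.csv`.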

3 Answers

22

You're loading the whole file into memory first and then processing it; that can be slow with a big file.

Instead, loop calling .Read and process one record at a time:

func processCSV(rc io.Reader) (ch chan []string) {
    ch = make(chan []string, 10)
    go func() {
        r := csv.NewReader(rc)
        if _, err := r.Read(); err != nil { // read and discard the header row
            log.Fatal(err)
        }
        defer close(ch)
        for {
            rec, err := r.Read()
            if err != nil {
                if err == io.EOF {
                    break
                }
                log.Fatal(err)
            }
            ch <- rec
        }
    }()
    return
}

playground

(Note: this is roughly based on Dave C's comment above.)
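For completeness, a sketch of how a caller might consume that channel; it assumes the processCSV function above plus the question's external calculateStuff is in the same package, and note that the header row is discarded rather than echoed to the output:

package main

import (
    "encoding/csv"
    "log"
    "os"
    "strconv"
)

// Assumes processCSV (above) and the external calculateStuff are also in this package.
func main() {
    in, err := os.Open("./path/to/datafile.csv")
    if err != nil {
        log.Fatal(err)
    }
    defer in.Close()

    out, err := os.Create("./where/to/write/resultsfile.csv")
    if err != nil {
        log.Fatal(err)
    }
    defer out.Close()

    w := csv.NewWriter(out)
    for rec := range processCSV(in) {
        floatValue, err := strconv.ParseFloat(rec[1], 64)
        if err != nil {
            log.Fatal(err)
        }
        score := calculateStuff(floatValue)
        if err := w.Write(append(rec, strconv.FormatFloat(score, 'f', 8, 64))); err != nil {
            log.Fatal(err)
        }
    }
    w.Flush()
    if err := w.Error(); err != nil {
        log.Fatal(err)
    }
}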

OneOfOne
7

This is essentially Dave C's answer from the comments section:

package main

import (
  "encoding/csv"
  "io"
  "log"
  "os"
  "strconv"
)

func main() {
  // setup reader
  csvIn, err := os.Open("./path/to/datafile.csv")
  if err != nil {
    log.Fatal(err)
  }
  r := csv.NewReader(csvIn)

  // setup writer
  csvOut, err := os.Create("./where/to/write/resultsfile.csv")
  if err != nil {
    log.Fatal("Unable to open output")
  }
  w := csv.NewWriter(csvOut)
  defer csvOut.Close()

  // handle header
  rec, err := r.Read()
  if err != nil {
    log.Fatal(err)
  }
  rec = append(rec, "score")
  if err = w.Write(rec); err != nil {
    log.Fatal(err)
  }

  for {
    rec, err = r.Read()
    if err != nil {
      if err == io.EOF {
        break
      }
      log.Fatal(err)
    }

    // get float value
    value := rec[1]
    floatValue, err := strconv.ParseFloat(value, 64)
    if err != nil {
      log.Fatal("Record, error: %v, %v", value, err)
    }

    // calculate scores; THIS EXTERNAL METHOD CANNOT BE CHANGED
    score := calculateStuff(floatValue)

    scoreString := strconv.FormatFloat(score, 'f', 8, 64)
    rec = append(rec, scoreString)

    if err = w.Write(rec); err != nil {
      log.Fatal(err)
    }
  }
  w.Flush()
  if err := w.Error(); err != nil {
    log.Fatal(err)
  }
}

Note of course that the logic is all jammed into main(); it would be better to split it into several functions, but that's beyond the scope of this question.

BoltzmannBrain
1

encoding/csv is indeed very slow on big files, as it performs a lot of allocations. Since your format is so simple, I recommend using strings.Split instead, which is much faster.

If even that is not fast enough, you can consider implementing the parsing yourself using strings.IndexByte, which is implemented in assembly: http://golang.org/src/strings/strings_decl.go?s=274:310#L1

Having said that, you should also reconsider using ReadAll if the file is larger than your memory.
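A rough sketch of the strings.Split idea, assuming the question's simple two-column format with no quoted fields or embedded commas (exactly the cases encoding/csv's extra work exists to handle), and streaming line by line instead of reading everything up front:

package main

import (
  "bufio"
  "fmt"
  "log"
  "os"
  "strconv"
  "strings"
)

func main() {
  f, err := os.Open("./path/to/datafile.csv")
  if err != nil {
    log.Fatal(err)
  }
  defer f.Close()

  scanner := bufio.NewScanner(f)
  scanner.Scan() // skip the header row
  for scanner.Scan() {
    fields := strings.Split(scanner.Text(), ",")
    if len(fields) != 2 {
      log.Fatalf("unexpected record: %q", scanner.Text())
    }
    value, err := strconv.ParseFloat(fields[1], 64)
    if err != nil {
      log.Fatal(err)
    }
    // placeholder: score the value and write the row out as in the question
    fmt.Println(fields[0], value)
  }
  if err := scanner.Err(); err != nil {
    log.Fatal(err)
  }
}

Whether this is worth the lost robustness depends on how confident you are about the input format.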

Nick Keets
  • A faster implementation using `bufio.Scanner` and `bytes.LastIndex` might look something like [this](https://play.golang.org/p/SYG9dQV-ir). It's ~3x faster than `encoding/csv`, but that's at the expense of a lot of the error checking and flexibility that `encoding/csv` provides. – Dave C Aug 15 '15 at 19:33