14

All I need to do is get the headers from a CSV file.

file.csv is:

"A", "B", "C"  
"1", "2", "3"

My code is:

table = CSV.open("file.csv", :headers => true)

puts table.headers

table.each do |row|
  puts row 
end

Which gives me:

true
"1", "2", "3"

I've been looking at Ruby CSV documentation for hours and this is driving me crazy. I am convinced that there must be a simple one-liner that can return the headers to me. Any ideas?

the Tin Man
Anthony To

3 Answers

23

In my opinion the best way to do this is:

headers = CSV.foreach('file.csv').first

Note that it is very tempting to use CSV.read('file.csv', headers: true).headers, but the catch is that CSV.read loads the complete file into memory, which increases your memory footprint and makes it very slow for bigger files. Whenever possible, use CSV.foreach. Below are the benchmarks for a 20 MB file:

Ruby version: ruby 2.4.1p111 
File size: 20M  
****************
Time and memory usage with CSV.foreach:
Time: 0.0 seconds
Memory: 0.04 MB
****************
Time and memory usage with CSV.read:
Time: 5.88 seconds
Memory: 314.25 MB

A 20 MB file increased the memory footprint by 314 MB with CSV.read; imagine what a 1 GB file will do to your system. In short, please do not use CSV.read. I did, and the system went down for a 300 MB file.

Further reading: if you want to read more about this, here is a very good article on handling big files.

Below is the script I used for benchmarking CSV.foreach and CSV.read:

require 'benchmark'
require 'csv'
def print_memory_usage
  memory_before = `ps -o rss= -p #{Process.pid}`.to_i
  yield
  memory_after = `ps -o rss= -p #{Process.pid}`.to_i
  puts "Memory: #{((memory_after - memory_before) / 1024.0).round(2)} MB"
end

def print_time_spent
  time = Benchmark.realtime do
    yield
  end
  puts "Time: #{time.round(2)} seconds"
end

file_path = '{path_to_csv_file}'
puts 'Ruby version: ' + `ruby -v`
puts 'File size:' + `du -h #{file_path}`
puts 'Time and memory usage with CSV.foreach: '
print_memory_usage do
  print_time_spent do
    headers = CSV.foreach(file_path, headers: false).first
  end
end
puts 'Time and memory usage with CSV.read:'
print_memory_usage do
  print_time_spent do
    headers = CSV.read(file_path, headers: true).headers
  end
end
Sahil Dhankhar
  • Normally `CSV::foreach` is called with a block, in which case Ruby closes the file before exiting the block. When called without a block (in which case an enumerator is returned) do you know when the file is closed? I've wondered about that for `IO::foreach`, which is more-or-less the same issue. – Cary Swoveland May 04 '20 at 06:06
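One way to sidestep the question in the comment above is to keep the block form and leave it early with `break`. This is only a sketch, not part of the original answer: the temporary file and its contents are made up for illustration, and it makes no claim about when the enumerator form closes the file.

```ruby
require 'csv'
require 'tempfile'

# Hypothetical sample data, written to a temp file so the sketch is self-contained.
sample = Tempfile.new(['sample', '.csv'])
sample.write(%("A","B","C"\n"1","2","3"\n))
sample.close

# Block form: `break` leaves CSV.foreach after the first row, and the
# value passed to `break` becomes the return value of the call.
headers = CSV.foreach(sample.path) { |row| break row }

puts headers.inspect
```

Because only the first row is parsed before `break`, this should stay as cheap as the `.first` version while using the familiar block call.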
21

It looks like CSV.read will give you access to a headers method:

headers = CSV.read("file.csv", headers: true).headers
# => ["A", "B", "C"]

The above is really just a shortcut for CSV.open("file.csv", headers: true).read.headers. You could have gotten to it using CSV.open as you tried, but since CSV.open doesn't actually read the file when you call the method, there is no way for it to know what the headers are until it's actually read some data. This is why it just returns true in your example. After reading some data, it would finally return the headers:

  table = CSV.open("file.csv", :headers => true)
  table.headers
  # => true
  table.read
  # => #<CSV::Table mode:col_or_row row_count:2>
  table.headers
  # => ["A", "B", "C"]
Dylan Markow
  • You can use `table.shift` (or `table.readline`) to read a single line before checking the headers, to avoid loading the entire file, if you have a very large file and only need the headers. If you want to start reading from the top of the file again after that, just use `table.rewind` to go back to the top. – mltsy May 27 '20 at 20:12
  • This solution is not good, the OP wants the headers only, and eagerly loading/parsing the whole file is a waste of polar bears. – akim Oct 06 '20 at 09:27
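The suggestion in the first comment can be sketched roughly as follows. The temporary file and the variable names are assumptions made for illustration; the calls used (`shift`, `headers`, `rewind`) are the ones named in the comment.

```ruby
require 'csv'
require 'tempfile'

# Hypothetical sample data, written to a temp file so the sketch is self-contained.
sample = Tempfile.new(['sample', '.csv'])
sample.write(%("A","B","C"\n"1","2","3"\n))
sample.close

table = CSV.open(sample.path, headers: true)
first_row = table.shift   # reads the header line and the first data row only
headers = table.headers   # now an Array of header names rather than `true`

table.rewind              # go back to the top of the file
rows = table.map(&:to_h)  # iterate over the data rows again from the start
table.close

puts headers.inspect
```

Only one data row is parsed before the headers become available, so the rest of a large file is not loaded unless you go on to iterate over it.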
3

If you want a shorter answer, you can try:

headers = CSV.open("file.csv", &:readline)
# => ["A", "B", "C"]