0

I'm trying to count the number of lines in a file in HDFS/Hive. In some cases I want the number of lines in an entire Hive table, and in others I want the number of lines in just one file.

I've tried things like !hadoop fs -count /<path to file(s)>/, but this only reports FILE_COUNT and CONTENT_SIZE (along with the directory count and path name), not the number of lines.

How do I get the number of lines?

makansij

2 Answers

2

If you want to know the total number of lines, you could check the 'Map Input Records' counter of a MapReduce job that reads the data. It gives the total number of records read across the whole input, that is, all the files in the input directory; for plain text input, each record is one line.

If you need the number of lines in a given file (I still don't get why you'd need that), you need to read the same counter for the mapper that read that file. This can get a bit tricky, since a large file may be split across several mappers, but it's doable.

If you're running Hadoop on YARN, I'd advise you to use YARN's REST API. It's really easy to use and very convenient for this kind of quick query over parts of the M/R processing.
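
As a minimal sketch of the counter lookup described above: the function below pulls MAP_INPUT_RECORDS out of a job-counters JSON document. The field names (jobCounters, counterGroup, counter, totalCounterValue) are assumed to match the job counters resource of the MapReduce REST API and should be checked against your Hadoop version. The function name and the sample payload are hypothetical and hand-written, not output from a real cluster.

```python
import json


def map_input_records(counters_payload):
    """Return the MAP_INPUT_RECORDS total from a job-counters payload.

    Assumes the layout jobCounters -> counterGroup[] -> counter[],
    where each counter has "name" and "totalCounterValue" fields.
    """
    for group in counters_payload["jobCounters"]["counterGroup"]:
        for counter in group["counter"]:
            if counter["name"] == "MAP_INPUT_RECORDS":
                return counter["totalCounterValue"]
    raise KeyError("MAP_INPUT_RECORDS not found in payload")


# Hand-written sample in the assumed layout (hypothetical job id and values).
SAMPLE = json.loads("""
{
  "jobCounters": {
    "id": "job_0000000000000_0001",
    "counterGroup": [
      {
        "counterGroupName": "org.apache.hadoop.mapreduce.TaskCounter",
        "counter": [
          {"name": "MAP_INPUT_RECORDS", "totalCounterValue": 1200,
           "mapCounterValue": 1200, "reduceCounterValue": 0},
          {"name": "MAP_OUTPUT_RECORDS", "totalCounterValue": 1200,
           "mapCounterValue": 1200, "reduceCounterValue": 0}
        ]
      }
    ]
  }
}
""")

if __name__ == "__main__":
    print(map_input_records(SAMPLE))
```

In practice the payload would come from an HTTP GET against the cluster's REST endpoint rather than from an inline string; the parsing step stays the same.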

Marc

1

Hive won't let you create a table on top of a single file. Remember, when you create a table in Hive, you create it on top of a folder (which is what allows more files to be added later).

There is only one way to read a single file into a Hive table:

load data [local] inpath '/input_folder/input_file.txt' into table dest_table;

To count the number of lines in that table:

select count(*) from dest_table;
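
One way to cross-check the count(*) result against the source file is to stream the file and count its lines directly. The sketch below assumes the hdfs command is on the PATH and that the file is plain newline-delimited text. The function names are hypothetical. An unterminated last line is counted as a line, on the assumption that a text-backed table would also treat it as a row.

```python
import io
import subprocess


def count_lines(stream, chunk_size=1 << 20):
    """Count lines in a binary stream, including an unterminated last line."""
    lines = 0
    last_byte = b"\n"  # an empty stream counts as ending on a line boundary
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        lines += chunk.count(b"\n")
        last_byte = chunk[-1:]
    if last_byte != b"\n":
        lines += 1  # final line had no trailing newline
    return lines


def count_hdfs_lines(path):
    """Stream an HDFS file through `hdfs dfs -cat` and count its lines."""
    proc = subprocess.Popen(["hdfs", "dfs", "-cat", path],
                            stdout=subprocess.PIPE)
    try:
        total = count_lines(proc.stdout)
    finally:
        proc.stdout.close()
    if proc.wait() != 0:
        raise RuntimeError("hdfs dfs -cat failed for " + path)
    return total


if __name__ == "__main__":
    # In-memory demo only; no cluster access is needed for this part.
    print(count_lines(io.BytesIO(b"row1\nrow2\nrow3")))
```

This reads the whole file through a single client, so it suits spot checks on individual files rather than counting very large inputs.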

The link below has some useful information:

How to load a text file into a Hive table stored as sequence files

hadooper
  • I'm not a big fan of this approach, as it requires creating a Hive table just to count lines, which is a pretty big side effect. I'd lean towards using Pig as per the best answer at http://stackoverflow.com/questions/32612867/how-to-count-lines-in-a-file-on-hdfs-command, as this doesn't require any temporary storage to be created or the data to be copied anywhere. – Ben Watson Nov 12 '15 at 11:29
  • Agreed, Ben, but he wants to use only one file, so I suggested that approach. – hadooper Nov 12 '15 at 16:34