How does querying an external table in Shark located on the local filesystem compare to using data located on HDFS in terms of query performance? I plan to use a single high-end server for running Shark queries and was wondering if it's absolutely necessary to install Hadoop/HDFS.

ssedano
DaTaBomB

1 Answer

Generally, if you already intend to run on a single high-end server, there's no need to set up HDFS. You should actually get somewhat better performance than with HDFS installed on a single machine, since you avoid the extra round-trips to localhost just to fetch file metadata, as well as the indirection of HDFS mapping files onto a series of opaque blocks that are themselves files on your local filesystem.

Note that Shark will still go through Hadoop's RawLocalFileSystem (the default "Hadoop filesystem" loaded when HDFS is not explicitly set up), so it effectively thinks it's using an HDFS equivalent. This means that if you later need to run on a distributed cluster, it should be a simple matter of changing fs.default.name, and everything else will work the same way as on your single-machine setup.
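
As a rough sketch of what that later change might look like, here is a core-site.xml fragment. The NameNode host and port are placeholders I made up for illustration, not values from your setup. If the property is left unset, Hadoop's built-in default is file:///, which is the local-filesystem case described above.

```xml
<!-- core-site.xml: illustrative sketch only -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <!-- Single machine: leave this property unset (default is file:///). -->
    <!-- Cluster: point it at the NameNode. Host and port are hypothetical. -->
    <value>hdfs://namenode.example.com:8020</value>
  </property>
</configuration>
```

On newer Hadoop releases the same setting is named fs.defaultFS; fs.default.name is the older, deprecated name for it.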

Dennis Huo