I've been researching the pros and cons of using Avro, Parquet, and other data formats for a project. If I am receiving input data from other groups who do not use Hadoop, will they be able to provide that input data in Avro/Parquet format? My reading on these formats so far has only been within the sphere of the Hadoop infrastructure, so I am wondering how difficult it would be for people who just use Oracle/SQL to provide data this way.

Jonathan Myers
  • Typically, such datasets don't accumulate on a single machine. Therefore, Hadoop with a query layer (or Athena/BigQuery) is used for the actual analytics – OneCricketeer Jun 26 '19 at 04:24

1 Answer


It is possible to use these formats without Hadoop, but the ease of doing so depends on the language binding.

For example, reading and writing Parquet files on a standalone machine can be very cumbersome with the Java language binding (tellingly named parquet-mr, where "mr" stands for MapReduce), because it builds heavily on Hadoop classes. Those classes are typically provided on the classpath of a Hadoop cluster, but they are less readily available on standalone machines.

(While parquet-mr is mainly a Java library, it also contains some tools that users may want to run on their local machines. To work around the Hadoop dependency, the parquet-tools module of parquet-mr provides a compilation profile called local that bundles the Hadoop dependencies alongside the compiled tool. However, this applies only to parquet-tools, and you have to compile it yourself to produce a local build.)

The Python language bindings, on the other hand, are very easy to set up and work fine on standalone machines as well. You can either use the high-level pandas interface or the underlying implementations, pyarrow and fastparquet, directly.
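To make this concrete, here is a minimal sketch of writing and reading a Parquet file on a plain machine with no Hadoop installation at all. It assumes pyarrow has been installed (e.g. pip install pyarrow); the file and column names are invented for the example:

    import pandas as pd
    import pyarrow.parquet as pq

    # Write a small DataFrame to a Parquet file -- no Hadoop classes involved.
    # pandas delegates to pyarrow (or fastparquet) under the hood.
    df = pd.DataFrame({"id": [1, 2, 3], "name": ["a", "b", "c"]})
    df.to_parquet("example.parquet")

    # Read it back through the high-level pandas interface ...
    df2 = pd.read_parquet("example.parquet")

    # ... or go through pyarrow directly for lower-level access,
    # e.g. to inspect the schema without loading everything into pandas.
    table = pq.read_table("example.parquet")
    print(table.schema)

So a group that only uses Oracle/SQL could, for instance, export query results into a DataFrame and hand over Parquet files produced this way, without touching Hadoop.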

Zoltan