I am trying to read a Parquet file stored in Azure Blob Storage without downloading it first. Below is my code:

public static void readParquetFile(String containerName) throws IOException {
    String token = ""; // SAS token for the container
    // blobServiceClient and fileName are defined outside this method
    BlobContainerClient containerClient = blobServiceClient.getBlobContainerClient(containerName);
    BlobClient blobClient = containerClient.getBlobClient(fileName);

    Configuration config = new Configuration();
    config.set("fs.azure", "org.apache.hadoop.fs.azure.NativeAzureFileSystem");
    config.set("fs.azure.sas.testcontainer.javacodevalidationtest.blob.core.windows.net", token);

    // Address the blob through the wasbs:// scheme so Hadoop reads it remotely
    Path path = new Path("wasbs://testcontainer@javacodevalidationtest.blob.core.windows.net/"
            + blobClient.getBlobName());
    HadoopInputFile file = HadoopInputFile.fromPath(path, config);

    // try-with-resources closes the reader even if reading fails
    try (ParquetReader<GenericRecord> reader = AvroParquetReader.<GenericRecord>builder(file).build()) {
        GenericRecord record;
        while ((record = reader.read()) != null) {
            System.out.println(record);
        }
    }
}
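
The SAS configuration key and the wasbs:// URI in the method above are hard-coded for one container and storage account. A minimal sketch of deriving both from their parts, using only java.net.URI, might look like the following. The names WasbLocation, sasConfigKey, and blobUri are hypothetical helpers introduced for illustration; they are not part of the Azure SDK or Hadoop, and the example blob name is made up.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical helper (not an Azure or Hadoop API): derives the strings
// that the method above hard-codes for one container and account.
final class WasbLocation {
    private static final String ENDPOINT_SUFFIX = ".blob.core.windows.net";

    private WasbLocation() {
    }

    // Configuration key under which the container's SAS token is stored.
    static String sasConfigKey(String container, String account) {
        return "fs.azure.sas." + container + "." + account + ENDPOINT_SUFFIX;
    }

    // wasbs:// URI for a blob. The multi-argument URI constructor
    // percent-encodes characters that are illegal in a path, such as spaces.
    static URI blobUri(String container, String account, String blobName) {
        String path = blobName.startsWith("/") ? blobName : "/" + blobName;
        try {
            return new URI("wasbs", container + "@" + account + ENDPOINT_SUFFIX, path, null, null);
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException("Invalid blob location: " + blobName, e);
        }
    }

    public static void main(String[] args) {
        // Example values taken from the question; the blob name is made up.
        System.out.println(sasConfigKey("testcontainer", "javacodevalidationtest"));
        System.out.println(blobUri("testcontainer", "javacodevalidationtest", "data/sample.parquet"));
    }
}
```

The resulting URI can be converted with toString() and passed to new Path(...), and the key can be passed to config.set(...) together with the token, so the container and account names appear in only one place.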

When I run it, the build fails with the following compiler error:

java: cannot access org.apache.parquet.io.InputFile
  class file for org.apache.parquet.io.InputFile not found

POM dependencies:

<dependency>
    <groupId>com.azure</groupId>
    <artifactId>azure-storage-blob</artifactId>
    <version>12.15.0</version>
</dependency>
<dependency>
    <groupId>org.apache.commons</groupId>
    <artifactId>commons-lang3</artifactId>
    <version>3.11</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.parquet/parquet-avro -->
<dependency>
    <groupId>org.apache.parquet</groupId>
    <artifactId>parquet-avro</artifactId>
    <version>1.8.1</version>
</dependency>
<dependency>
    <groupId>org.apache.parquet</groupId>
    <artifactId>parquet-hadoop</artifactId>
    <version>1.12.0</version>
</dependency>

<dependency>
    <groupId>org.apache.beam</groupId>
    <artifactId>beam-sdks-java-io-parquet</artifactId>
    <version>2.37.0</version>
</dependency>

Imports:

package com.container;

import java.io.IOException;
import com.azure.storage.blob.BlobClient;
import com.azure.storage.blob.BlobContainerClient;
import com.azure.storage.blob.BlobServiceClient;
import com.azure.storage.blob.BlobServiceClientBuilder;
import com.azure.storage.blob.models.BlobItem;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.conf.Configuration;
import org.apache.commons.httpclient.URI;
import org.apache.commons.lang3.StringUtils;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetReader;
import org.apache.parquet.hadoop.ParquetReader;
import org.apache.parquet.hadoop.util.HadoopInputFile;

Note:

  1. I followed this answer to perform the read: Read parquet data from Azure Blob container without downloading it locally

  2. I have checked the related answers on Stack Overflow, but could not find a solution.

  3. My general use case is: read the contents of a Parquet file in Azure from Java, without downloading it.
