I am trying to read a Parquet file stored in Azure Blob Storage without downloading it. Below is my code:
public static void readParquetFile(String containerName) throws IOException {
    String token = "";
    BlobContainerClient containerClient = blobServiceClient.getBlobContainerClient(containerName);
    BlobContainerClient containerClient = blobServiceClient.createBlobContainer(containerName);
    BlobClient blobClient = containerClient.getBlobClient(fileName);
    Configuration config = new Configuration();
    config.set("fs.azure", "org.apache.hadoop.fs.azure.NativeAzureFileSystem");
    config.set("fs.azure.sas.testcontainer.javacodevalidationtest.blob.core.windows.net", token);
    URI uri = new URI("wasbs://testcontainer@javacodevalidationtest.blob.core.windows.net/" + blobClient.getBlobName());
    HadoopInputFile file = HadoopInputFile.fromPath(new Path(String.valueOf(uri)), config);
    Path path = (Path) Paths.get(String.valueOf(file));
    ParquetReader reader = AvroParquetReader.<GenericRecord>builder(path).build();
    GenericRecord record;
    while ((record = (GenericRecord) reader.read()) != null) {
        System.out.println(record);
    }
    reader.close();
}
When I run it, the build fails with the following error:
java: cannot access org.apache.parquet.io.InputFile
  class file for org.apache.parquet.io.InputFile not found
POM dependencies:
<dependency>
    <groupId>com.azure</groupId>
    <artifactId>azure-storage-blob</artifactId>
    <version>12.15.0</version>
</dependency>
<dependency>
    <groupId>org.apache.commons</groupId>
    <artifactId>commons-lang3</artifactId>
    <version>3.11</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.parquet/parquet-avro -->
<dependency>
    <groupId>org.apache.parquet</groupId>
    <artifactId>parquet-avro</artifactId>
    <version>1.8.1</version>
</dependency>
<dependency>
    <groupId>org.apache.parquet</groupId>
    <artifactId>parquet-hadoop</artifactId>
    <version>1.12.0</version>
</dependency>
<dependency>
    <groupId>org.apache.beam</groupId>
    <artifactId>beam-sdks-java-io-parquet</artifactId>
    <version>2.37.0</version>
</dependency>
Imports:
package com.container;
import com.azure.storage.blob.BlobClient;
import com.azure.storage.blob.BlobContainerClient;
import com.azure.storage.blob.BlobServiceClient;
import com.azure.storage.blob.BlobServiceClientBuilder;
import com.azure.storage.blob.models.BlobItem;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.conf.Configuration;
import org.apache.commons.httpclient.URI;
import org.apache.commons.lang3.StringUtils;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetReader;
import org.apache.parquet.hadoop.ParquetReader;
import org.apache.parquet.hadoop.util.HadoopInputFile;
Note:
I was following this answer to perform the read operation: "Read parquet data from Azure Blob container without downloading it locally".
I have checked the related answers on Stack Overflow, but I could not find a solution.
My general use case is: read the contents of a Parquet file from Azure in Java, without downloading the file.
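For reference, the two strings hard-coded in the code above (the fs.azure.sas configuration key and the wasbs:// URI) follow the same container/account pattern. Below is a minimal sketch of building both with only java.net.URI instead of string concatenation. The class WasbLocation, its method names, and the blob name data/sample.parquet are hypothetical and introduced only for illustration. The sketch only shows how the two strings are formed and is not a fix for the compilation error.

```java
import java.net.URI;
import java.net.URISyntaxException;

/** Hypothetical helper: derives the wasbs:// URI and the SAS config key for a blob. */
final class WasbLocation {
    // Endpoint suffix taken from the URI used in the question.
    private static final String ENDPOINT_SUFFIX = "blob.core.windows.net";

    private WasbLocation() {}

    /** Host shared by the URI and the config key: <account>.blob.core.windows.net */
    static String host(String account) {
        return account + "." + ENDPOINT_SUFFIX;
    }

    /** Config key in the form used in the question: fs.azure.sas.<container>.<host> */
    static String sasConfigKey(String container, String account) {
        return "fs.azure.sas." + container + "." + host(account);
    }

    /** Builds wasbs://<container>@<account>.blob.core.windows.net/<blobName> */
    static URI blobUri(String container, String account, String blobName) {
        String path = blobName.startsWith("/") ? blobName : "/" + blobName;
        try {
            // The multi-argument constructor quotes illegal characters in the path.
            return new URI("wasbs", container, host(account), -1, path, null, null);
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException("Invalid blob location: " + blobName, e);
        }
    }

    public static void main(String[] args) {
        // Container and account names are the ones from the question.
        System.out.println(sasConfigKey("testcontainer", "javacodevalidationtest"));
        System.out.println(blobUri("testcontainer", "javacodevalidationtest", "data/sample.parquet"));
    }
}
```

The resulting key would be passed to config.set(...) together with the SAS token, and the URI's string form to new Path(...), as in the code above.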