We have a Parquet file (500 MB) stored in Azure Blob Storage. How can I read the file directly from the blob and load it into memory in C#, for example into a DataTable?

I am able to read a Parquet file that is physically located in a local folder using the code below.

public void ReadParqueFile()
{
    using (Stream fileStream = System.IO.File.OpenRead("D:/../userdata1.parquet"))
    {
        using (var parquetReader = new ParquetReader(fileStream))
        {
            DataField[] dataFields = parquetReader.Schema.GetDataFields();

            // Read every row group in the file, one at a time.
            for (int i = 0; i < parquetReader.RowGroupCount; i++)
            {
                using (ParquetRowGroupReader groupReader = parquetReader.OpenRowGroupReader(i))
                {
                    // Read all columns of this row group.
                    DataColumn[] columns = dataFields.Select(groupReader.ReadColumn).ToArray();

                    DataColumn firstColumn = columns[0];

                    Array data = firstColumn.Data;
                    //int[] ids = (int[])data;
                }
            }
        }
    }
}

(I am able to read a CSV file directly from the blob using a source stream.) Please suggest the fastest way to read the Parquet file directly from the blob.

Rijitha T.J

1 Answer

In my experience, the way to read the Parquet file directly from the blob is to first generate the blob URL with a SAS token, then get a stream from that URL with HttpClient, and finally read the HTTP response stream via ParquetReader.

First, refer to the sample code below, taken from the section "Create a service SAS for a blob" of the official document "Create a service SAS for a container or blob with .NET", which uses the Azure Blob Storage SDK for .NET Core.

private static string GetBlobSasUri(CloudBlobContainer container, string blobName, string policyName = null)
{
    string sasBlobToken;

    // Get a reference to a blob within the container.
    // Note that the blob may not exist yet, but a SAS can still be created for it.
    CloudBlockBlob blob = container.GetBlockBlobReference(blobName);

    if (policyName == null)
    {
        // Create a new access policy and define its constraints.
        // Note that the SharedAccessBlobPolicy class is used both to define the parameters of an ad hoc SAS, and
        // to construct a shared access policy that is saved to the container's shared access policies.
        SharedAccessBlobPolicy adHocSAS = new SharedAccessBlobPolicy()
        {
            // When the start time for the SAS is omitted, the start time is assumed to be the time when the storage service receives the request.
            // Omitting the start time for a SAS that is effective immediately helps to avoid clock skew.
            SharedAccessExpiryTime = DateTime.UtcNow.AddHours(24),
            Permissions = SharedAccessBlobPermissions.Read | SharedAccessBlobPermissions.Write | SharedAccessBlobPermissions.Create
        };

        // Generate the shared access signature on the blob, setting the constraints directly on the signature.
        sasBlobToken = blob.GetSharedAccessSignature(adHocSAS);

        Console.WriteLine("SAS for blob (ad hoc): {0}", sasBlobToken);
        Console.WriteLine();
    }
    else
    {
        // Generate the shared access signature on the blob. In this case, all of the constraints for the
        // shared access signature are specified on the container's stored access policy.
        sasBlobToken = blob.GetSharedAccessSignature(null, policyName);

        Console.WriteLine("SAS for blob (stored access policy): {0}", sasBlobToken);
        Console.WriteLine();
    }

    // Return the URI string for the blob, including the SAS token.
    return blob.Uri + sasBlobToken;
}

Then get the HTTP response stream from the URL with the SAS token using HttpClient.

var blobUrlWithSAS = GetBlobSasUri(container, blobName);
var client = new HttpClient();
var stream = await client.GetStreamAsync(blobUrlWithSAS);
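
The stream returned by GetStreamAsync is a forward-only network stream, while a Parquet file keeps its metadata footer at the end of the file, so a reader has to be able to seek. The comments under this answer report a "not a Parquet file" error with the direct approach and point to the Parquet.Net note about network streams. Below is a minimal sketch of one possible workaround, assuming the whole file fits in memory: copy the response into a MemoryStream first. The class SeekableDownload and its method names are hypothetical and not part of any library.

```csharp
using System.IO;
using System.Net.Http;
using System.Threading.Tasks;

// Hypothetical helper, not part of Parquet.Net or the Azure SDK.
public static class SeekableDownload
{
    // One shared HttpClient instance for all downloads.
    private static readonly HttpClient Client = new HttpClient();

    // Copies any readable stream into memory and rewinds the copy,
    // so the caller gets a stream that supports Seek.
    public static async Task<MemoryStream> BufferAsync(Stream source)
    {
        var buffer = new MemoryStream();
        await source.CopyToAsync(buffer);
        buffer.Position = 0;
        return buffer;
    }

    // Downloads the blob behind the SAS URL and returns it as a seekable stream.
    // Assumption: the blob is small enough to hold in memory.
    public static async Task<MemoryStream> DownloadAsync(string blobUrlWithSas)
    {
        using (Stream network = await Client.GetStreamAsync(blobUrlWithSas))
        {
            return await BufferAsync(network);
        }
    }
}
```

With this helper, `var stream = await SeekableDownload.DownloadAsync(blobUrlWithSAS);` would yield a stream that can be handed to ParquetReader. The trade-off is that the full file is held in memory before any row group is read.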

Finally, read it via ParquetReader; the code comes from the "Reading Data" section of the GitHub repo aloneguid/parquet-dotnet.

var options = new ParquetOptions { TreatByteArrayAsString = true };
var reader = new ParquetReader(stream, options);
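
The question asks for the data to end up in a DataTable. As a rough sketch of that last step, the helper below builds a System.Data.DataTable from column names and the per-column arrays (such as the `Data` arrays read in the question's code). ColumnTableBuilder and ToDataTable are hypothetical names, and the sketch assumes every column array has the same length and holds simple element types.

```csharp
using System;
using System.Data;

// Hypothetical helper, not part of Parquet.Net.
public static class ColumnTableBuilder
{
    // names[i] is the column name for columns[i].
    public static DataTable ToDataTable(string[] names, Array[] columns)
    {
        if (names.Length != columns.Length)
            throw new ArgumentException("names and columns must have the same length");

        var table = new DataTable();
        for (int c = 0; c < names.Length; c++)
        {
            if (columns[c].Length != columns[0].Length)
                throw new ArgumentException("all columns must have the same row count");

            // DataTable columns cannot be typed as Nullable<T>,
            // so int?[] becomes an int column that allows DBNull.
            Type elementType = columns[c].GetType().GetElementType();
            Type columnType = Nullable.GetUnderlyingType(elementType) ?? elementType;
            table.Columns.Add(names[c], columnType);
        }

        int rowCount = columns.Length == 0 ? 0 : columns[0].Length;
        for (int r = 0; r < rowCount; r++)
        {
            DataRow row = table.NewRow();
            for (int c = 0; c < columns.Length; c++)
            {
                // Null values are stored as DBNull in a DataTable.
                row[c] = columns[c].GetValue(r) ?? DBNull.Value;
            }
            table.Rows.Add(row);
        }
        return table;
    }
}
```

In the question's loop, the inputs could come from something like `dataFields.Select(f => f.Name).ToArray()` and `columns.Select(c => c.Data).ToArray()`, called once per row group. Both Parquet.Net and System.Data define a type named DataColumn, so one of the two may need to be fully qualified.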
Peter Pan
  • Thanks Peter. I have tried var stream = await client.GetStreamAsync(blobUrl); but it times out. I was able to read a small CSV file using this method directly from the blob without downloading it locally. Actually, I have to read either a 1.7 GB CSV file or the corresponding Parquet file of around 500 MB directly from the blob – Rijitha T.J Jan 31 '20 at 08:12
  • Hi, I am getting the error below when using the same logic. However, I am able to print the file using StreamReader. Any idea? Thanks. "not a Parquet file(head is '')" – Kiran Feb 04 '21 at 09:33
  • @Subba I'm having the same error. Did you resolve it? – BlackShawarna Jun 22 '21 at 15:27
  • We get the same error - anyone found a working version? – Rodney Dec 15 '21 at 05:00
  • The Parquet.Net documentation mentions that files cannot be read from a network stream. See https://github.com/aloneguid/parquet-dotnet#reading-files – user3841460 Jan 08 '22 at 22:29