I have an Azure Function with an HTTP trigger. The function gets an email address from the incoming request, searches for that email address in a tab-delimited text file stored in Azure Blob Storage, and returns the entire matching row as JSON. The code works fine for a small file, but I'm getting a request timeout when processing a file of around 200 GB. I know it's a bad idea to download the file contents into a string variable, and that's where the timeout happens. Is there any other way to implement this?

Code:

using System;
using System.IO;
using System.Threading.Tasks;
using Microsoft.AspNetCore.Mvc;
using Microsoft.Azure.WebJobs;
using Microsoft.Azure.WebJobs.Extensions.Http;
using Microsoft.AspNetCore.Http;
using Microsoft.Extensions.Logging;
using Newtonsoft.Json;
using System.Linq;
using System.Collections.Generic;
using Microsoft.WindowsAzure.Storage;
using Microsoft.WindowsAzure.Storage.Blob;

namespace V012ProdFunctionApp
{
    public static class V012Consumer
    {
        [FunctionName("V012Consumer")]
        public static async Task<IActionResult> Run(
            [HttpTrigger(AuthorizationLevel.Function, "get", "post", Route = null)] HttpRequest req,
            ILogger log)
        {
            log.LogInformation("C# HTTP trigger function processed a request.");

            string email = req.Query["email"];

            string requestBody = await new StreamReader(req.Body).ReadToEndAsync();
            dynamic requestdata = JsonConvert.DeserializeObject(requestBody);
            email = email ?? requestdata?.email;

            string connectionString = "DefaultEndpointsProtocol=https;AccountName=storageaccountlearnazure;AccountKey=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx;BlobEndpoint=https://storageaccountlearnazure.blob.core.windows.net/;QueueEndpoint=https://storageaccountlearnazure.queue.core.windows.net/;TableEndpoint=https://storageaccountlearnazure.table.core.windows.net/;FileEndpoint=https://storageaccountlearnazure.file.core.windows.net/;";

            // Setup the connection to the storage account
            CloudStorageAccount storageAccount = CloudStorageAccount.Parse(connectionString);

            // Connect to the blob storage
            CloudBlobClient serviceClient = storageAccount.CreateCloudBlobClient();

            // Connect to the blob container
            CloudBlobContainer container = serviceClient.GetContainerReference("container-learn-azure");

            // Connect to the blob file
            CloudBlockBlob blob = container.GetBlockBlobReference("V12_ConsumerPlus_2020Q3_Sample.txt");
            //CloudBlockBlob blob = container.GetBlockBlobReference("V12_ConsumerPlus_2020Q3.txt");

            // Get the blob file as text
            string contents = blob.DownloadTextAsync().Result;

            var searchedLinesFromString = contents
                .Split(new string[] { Environment.NewLine }, StringSplitOptions.RemoveEmptyEntries)
                .Select((text, index) => new { text, lineNumber = index + 1 })
                .Where(x => x.text.Contains(email) || x.lineNumber == 1);

            List<DataObjects> objList = new List<DataObjects>();

            string[] headerColumns = null;
            foreach (var match in searchedLinesFromString)
            {
                if (match.lineNumber == 1)
                {
                    headerColumns = match.text.Split('\t');
                }
                else if (headerColumns != null)
                {
                    string missedProperties = string.Empty;
                    var data = match.text.Split('\t');
                    DataObjects obj = new DataObjects();
                    if (data.Any())
                    {
                        foreach (var prop in obj.GetType().GetProperties())
                        {
                            int valueIndex = Array.IndexOf(headerColumns, prop.Name);
                            if (valueIndex != -1)
                            {
                                var columnValue = data[valueIndex];
                                prop.SetValue(obj, columnValue);
                            }
                            else
                            {
                                missedProperties = missedProperties + ", " + prop.Name;
                            }
                        }
                        objList.Add(obj);

                        Console.WriteLine("{0}: {1}", match.lineNumber, match.text);
                    }
                }
            }
            var result = JsonConvert.SerializeObject(objList);

            return new OkObjectResult(await Task.FromResult(objList));

        }
    }
}

Error: [screenshot of the request timeout error]

Jay Desai
  • change `string contents = blob.DownloadTextAsync().Result;` to `string contents = await blob.DownloadTextAsync();` I don't know if that will fix anything, but that can and will cause a deadlock if you don't fix it. – Andy Feb 06 '21 at 19:56
  • Another side note: remove the last 2 lines and replace it with `return Ok(objList);` – Andy Feb 06 '21 at 19:58
  • If you want Newtonsoft to deserialize to `dynamic`, instead of doing this: `dynamic requestdata = JsonConvert.DeserializeObject(requestBody);`, do this: `var requestdata = JsonConvert.DeserializeObject(requestBody);` – Andy Feb 06 '21 at 20:00
  • @Andy, getting this error `System.Private.CoreLib: Exception while executing function: V012Consumer. Microsoft.WindowsAzure.Storage: Stream was too long. System.Private.CoreLib: Stream was too long.` – Jay Desai Feb 06 '21 at 20:09
  • 200 GB????? Don't you think that's too huge a file to process on every HTTP trigger? You should think of a design that creates smaller files based on some partition key. One approach would be to download the file in chunks instead of reading the whole file at once; search for the string in each chunk, and stop processing as soon as it's found. You may, however, need to handle the edge case where a chunk contains only part of the string. Check [this](https://stackoverflow.com/questions/44381097/download-and-split-large-file-into-100-mb-chunks-in-blob-storage) reference (a rough streaming sketch follows these comments). – user1672994 Feb 07 '21 at 10:34
  • There are too many problems with the basic design for this to work. An HTTP-triggered function times out after 230 seconds. You can't process a 200 GB file in ADLS line by line, search through it, and expect to return a result in that time. Post your original problem as a question. This design can't work. And yes, as @user1672994 said, loading 200 GB into memory is crazy and won't work. – Kashyap Feb 08 '21 at 17:59
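
Following up on the chunked/streaming suggestion in the comments above, here is a minimal sketch that reads the blob as a stream, line by line, instead of calling `DownloadTextAsync`, so the file is never held in memory all at once. It assumes the same `Microsoft.WindowsAzure.Storage` SDK and `CloudBlockBlob` reference as in the question; the method name `FindRowByEmailAsync` is made up for illustration, and error handling plus the `DataObjects` mapping are omitted.

// Sketch only: stream the blob and scan it line by line.
// Reuses the question's usings (System.IO, System.Threading.Tasks,
// Microsoft.WindowsAzure.Storage.Blob).
private static async Task<(string header, string row)> FindRowByEmailAsync(
    CloudBlockBlob blob, string email)
{
    // OpenReadAsync returns a stream that downloads the blob in ranges
    // as it is read, instead of materializing the whole file.
    using (Stream blobStream = await blob.OpenReadAsync())
    using (var reader = new StreamReader(blobStream))
    {
        // The first line holds the tab-delimited column names.
        string header = await reader.ReadLineAsync();

        string line;
        while ((line = await reader.ReadLineAsync()) != null)
        {
            if (line.Contains(email))
            {
                return (header, line);   // stop as soon as a match is found
            }
        }

        return (header, null);           // no match
    }
}

The existing function could call this as `var (header, row) = await FindRowByEmailAsync(blob, email);` and then split `row` on `'\t'` and map it against `header` as before. This only removes the memory pressure (and the "Stream was too long" error); as the comments point out, an HTTP-triggered function still has to respond within roughly 230 seconds, so for a 200 GB file a different design (pre-partitioned files or an index keyed by email) is probably still needed.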

0 Answers