
I want to read a file from Azure Blob Storage and, using an event-triggered function, convert this CSV file into Parquet format and upload it to another blob storage container. I tried to use ChoParquetWriter, but it takes a long time to convert the file to Parquet, and when large files come in, a few events are missed from the queue, so not all files get converted to Parquet. I am using C#.

Can anyone suggest a simpler approach to this problem? The code I am working with is:

    using (Stream blobStream = await blockBlob.OpenWriteAsync(accessCondition, null, null))
    {
        using (var reader = new StreamReader(longestFile.Open(), Encoding.UTF8))
        using (var r = ChoCSVReader.LoadText(reader.ReadToEnd()).WithMaxScanRows(2).WithDelimiter("\t"))
        {
            log.LogInformation($"Inside csv reader");
            using (var w = new ChoParquetWriter(blobStream))
            {
                log.LogInformation($"Inside parquet writer");
                w.Write(r);
                w.Serialize();
                w.Close();
            }
        }
    }
  • How big is the csv file? Maybe this helps: change `ChoCSVReader.LoadText(reader.ReadToEnd())` to `new ChoCSVReader(reader)`. – Cinchoo Dec 14 '20 at 14:07
  • Hey, thanks for your suggestion. The CSV files come in different sizes, from 20 KB up to 34 MB, and many of them arrive at the same time. I have made the change and it's working now. I am new to this ChoETL tool; can you please elaborate on how these two calls differ and how they work? – Amey Pimpley Dec 14 '20 at 22:09
  • `LoadText()` works off text (the entire file is loaded into memory by `ReadToEnd()`); the other works off a stream reader, which is a less memory-intensive operation. – Cinchoo Dec 14 '20 at 23:29
  • I have implemented this logic; now a few of the output files are corrupted Parquet files of around 256 B that can't be opened. Is there any specific reason, or does anything need to change to make the code more robust? – Amey Pimpley Dec 15 '20 at 18:21
  • @Cinchoo: Around 2000 files arrive at the source destination every day. Previously the count of corrupted or 0-byte files was high; after implementing this solution the count is low, but the 0-byte files and corrupted 256 B files are still there. The corrupted files follow no common pattern, and the files range from 3 MB to 128 MB. Can you please advise? – Amey Pimpley Dec 16 '20 at 18:43
  • I'm not sure about the corrupt files, but I found something in the sample code above: `w.Serialize()` and `w.Close()`. Please remove them and give it a try. – Cinchoo Dec 16 '20 at 20:51 (both fixes are folded into the sketch below these comments)
  • @Cinchoo: I tried this as well, but it's still not working. – Amey Pimpley Dec 16 '20 at 23:56
  • The 0-byte file issue is fixed; I pushed new version 1.0.1.0 to NuGet. As for the corrupt-file issue, I'll leave that to you to investigate, as I don't have enough information to help with it. – Cinchoo Dec 18 '20 at 00:55
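
For reference, a minimal sketch that folds the comment suggestions together: feed the StreamReader straight into ChoCSVReader instead of materializing the whole file with ReadToEnd(), and let the using blocks dispose the writer rather than calling Serialize()/Close(). This is an illustration under assumptions, not the asker's exact setup: a blob trigger binding stands in for the original event trigger, and the container names csv-in and parquet-out are placeholders.

    // Minimal sketch, assuming an in-process Azure Function with WebJobs blob bindings.
    // "csv-in" / "parquet-out" are placeholder container names, not the asker's.
    using System.IO;
    using System.Text;
    using ChoETL;
    using Microsoft.Azure.WebJobs;
    using Microsoft.Extensions.Logging;

    public static class CsvToParquetFunction
    {
        [FunctionName("CsvToParquet")]
        public static void Run(
            [BlobTrigger("csv-in/{name}")] Stream input,                          // fires once per uploaded CSV
            [Blob("parquet-out/{name}.parquet", FileAccess.Write)] Stream output, // destination blob opened for writing
            string name,
            ILogger log)
        {
            log.LogInformation($"Converting {name} to Parquet");

            using (var reader = new StreamReader(input, Encoding.UTF8))
            using (var csv = new ChoCSVReader(reader)   // streams rows instead of ReadToEnd()
                .WithMaxScanRows(2)
                .WithDelimiter("\t"))
            using (var parquet = new ChoParquetWriter(output))
            {
                parquet.Write(csv);                     // Dispose flushes the output; no Serialize()/Close()
            }
        }
    }

Because the CSV reader is consumed lazily, memory use stays flat regardless of file size, which is what the 20 KB to 34 MB spread mentioned in the comments calls for.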

0 Answers