I have a large data set coming from an external aggregating system. The part relevant to my question is an array of strings. Examples (not real ones, but quite illustrative):
- Model: TOYOTA COROLLA VIN: ABC123 Year: 2012 Color: Black
- White KIA RIO of 2013 year, transmission: 4AT
- Type:TruckModel:MANYear:2010VIN:QWE123Registration number:AZ12345
- 30 cows of Milky breed numbered #137
- 25 cows of Shello breed numbered #783
There are nearly 100M strings in total, and their main purpose is to be shown to users on the web site.
As you can see, all the strings either contain key-value pairs naturally or can be transformed into that form. When the aggregator takes this data from other systems, it somehow drops the delimiters. I have seen more than 20 such key-value pairs in a single string.
The first problem is how to restore the delimiters (\r\n) at the places where they have been dropped. The second problem is how to replace a comma with \r\n only where it is a real delimiter between key-value pairs and not part of a value. Commas inside values are not escaped.
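To make the two problems concrete, here is a minimal Python sketch of the kind of transformation I have in mind (the production code would be C#). It assumes the set of keys is already known; `KNOWN_KEYS` and `restore_delimiters` are hypothetical names, and the key list is just what appears in the examples above. It only breaks the string in front of a known key followed by a colon, so a comma is dropped only when it sits right before such a key.

```python
import re

# Hypothetical key list taken from the examples; the real list is unknown.
KNOWN_KEYS = ["Registration number", "Model", "Type", "Year", "VIN", "Color"]

# Longest keys first, so a longer key is preferred over a shorter key
# that happens to be its prefix.
_alternation = "|".join(
    re.escape(k) for k in sorted(KNOWN_KEYS, key=len, reverse=True)
)
# A known key, optional whitespace, then a colon (case-sensitive).
KEY_RE = re.compile(r"(?:%s)\s*:" % _alternation)


def restore_delimiters(text):
    """Put every known 'Key:' pair on its own CRLF-separated line."""
    starts = [m.start() for m in KEY_RE.finditer(text)]
    if not starts:
        return text  # no known key found, leave the string alone
    pieces = []
    if starts[0] > 0:
        pieces.append(text[:starts[0]])  # free text before the first key
    for begin, end in zip(starts, starts[1:] + [len(text)]):
        pieces.append(text[begin:end])
    # Trim spaces and commas that only separated one pair from the next.
    cleaned = [p.strip(" ,") for p in pieces]
    return "\r\n".join(p for p in cleaned if p)


if __name__ == "__main__":
    print(restore_delimiters(
        "Type:TruckModel:MANYear:2010VIN:QWE123Registration number:AZ12345"
    ))
```

The obvious weakness is that a value containing a known key followed by a colon would be split incorrectly, and strings whose keys are not in the list (such as the cows examples) pass through unchanged.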
These two problems lead to pattern extraction followed by replacement via regexes. At first I planned to extract the patterns by hand, but in my experience that is very time-consuming and misses some edge cases.
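As an illustration of what automated extraction could look like, here is a rough Python sketch that counts candidate keys across many strings. It rests on a strong assumption that may not hold for the real data: a key is either a capitalised word (optionally followed by lowercase words) or an all-caps token, and it is immediately followed by a colon. `CANDIDATE_RE` and `candidate_keys` are hypothetical names.

```python
import re
from collections import Counter

# Assumed key shape: "Model", "Registration number", or "VIN",
# followed by optional whitespace and a colon.
CANDIDATE_RE = re.compile(r"((?:[A-Z][a-z]+(?: [a-z]+)*|[A-Z]{2,}))\s*:")


def candidate_keys(strings, min_count=2):
    """Return (key, count) pairs seen at least min_count times."""
    counts = Counter()
    for s in strings:
        counts.update(m.group(1) for m in CANDIDATE_RE.finditer(s))
    return [(k, n) for k, n in counts.most_common() if n >= min_count]


if __name__ == "__main__":
    sample = [
        "Model: TOYOTA COROLLA VIN: ABC123 Year: 2012 Color: Black",
        "Type:TruckModel:MANYear:2010VIN:QWE123Registration number:AZ12345",
    ]
    for key, count in candidate_keys(sample):
        print(key, count)
```

Lowercase keys such as "transmission" are not caught by this pattern, and the frequency threshold would need tuning on the full data set.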
I am looking for programmatic solutions to these problems.
The strings are stored in an MSSQL table as part of a larger database. The data processing platform is written in C#.