What is the fastest and most accurate way to infer the datetime format from a large set of string values? This is for creating a tool to process data files and any metadata is missing, so the datetime format is undetermined beforehand.
Is there maybe an established or well-known algorithm for this? It seemed like a straight forward problem, but in practice with large datasets, many values and many possible formats it's more complicated. For example I know Python's pandas
library has a pretty fast and accurate guess_date_format
function, see source here. But how would you do something like this in C#?
This is the code I've come up with so far
class Program
{
internal class DateFormatInfer
{
public int Count { get; set; } = 0;
public DateTime MinDate { get; set; }
public DateTime MaxDate { get; set; }
}
static private string[] GetPossibleFormats()
{
return new[]
{
"dd/MM/yyyy", "MM/dd/yyyy", "yyyy/MM/dd", "dd-MM-yyyy", "MM-dd-yyyy",
"yyyy-MM-dd", "dd.MM.yyyy", "MM.dd.yyyy", "yyyy.MM.dd", "ddMMyyyy" // "dd-MM-yy", "yyyy-MM-dd HH:mm", "yyyy-MM-dd HH:mm:ss" etc.
};
}
static void Main(string[] args)
{
string filePath = @"C:\Users\bas_d\source\repos\WinFormsApp_DateTest\WinFormsApp_DateTest\date_values_test.csv";
Dictionary<string, DateFormatInfer> dateFormats = new Dictionary<string, DateFormatInfer>();
foreach (var format in GetPossibleFormats())
{
DateFormatInfer info = new DateFormatInfer();
dateFormats.Add(format, info);
}
// Read the file line by line
using (StreamReader reader = new StreamReader(filePath))
{
string line;
int linecount = 0;
while ((line = reader.ReadLine()) != null)
{
linecount++;
foreach (var dtformat in dateFormats)
{
if (DateTime.TryParseExact(line, dtformat.Key, CultureInfo.InvariantCulture, DateTimeStyles.None, out DateTime parsedDate))
{
// keep track of count and min/max
dtformat.Value.Count++;
if ( (parsedDate < dtformat.Value.MinDate) || (dtformat.Value.Count == 1) ) dtformat.Value.MinDate = parsedDate;
if ( (parsedDate > dtformat.Value.MaxDate) || (dtformat.Value.Count == 1) ) dtformat.Value.MaxDate = parsedDate;
}
}
}
Console.WriteLine($"TOTAL VALUES = {linecount}");
}
// display results
foreach (KeyValuePair<string, DateFormatInfer> kvp in dateFormats)
{
DateFormatInfer info = kvp.Value;
var strmin = (info.MinDate.Year == 1 ? "" : info.MinDate.ToString(kvp.Key));
var strmax = (info.MaxDate.Year == 1 ? "" : info.MaxDate.ToString(kvp.Key));
Console.WriteLine($"{kvp.Key} = {info.Count}: {strmin} ~ {strmax}");
}
Console.ReadLine();
}
}
As a test input I've got a file with values 04/09/2023
through 31/05/2023
, so date format MM/dd/yyyy
. The program works and gives the expected result for this test file:
TOTAL VALUES = 53
dd/MM/yyyy = 16: 05/01/2023 ~ 05/12/2023
MM/dd/yyyy = 53: 04/09/2023 ~ 05/31/2023
yyyy/MM/dd = 0: ~
dd-MM-yyyy = 0: ~
MM-dd-yyyy = 0: ~
yyyy-MM-dd = 0: ~
dd.MM.yyyy = 0: ~
MM.dd.yyyy = 0: ~
yyyy.MM.dd = 0: ~
ddMMyyyy = 0: ~
But there are a couple of practical problems with this
- It should work for large data files with potentially multiple datetime columns with unknown format
- This is a brute force method and going through all combinations for all of columns gives poor performance
- How to get an exhaustive/definitive list of all possible formats
- When the list of possible formats becomes larger (date, datetime, two digit years, milliseconds etc) it will cause further performance issues
Any thoughts on this, what is the best way to approach this?