-1

What is the fastest and most accurate way to infer the datetime format from a large set of string values? This is for creating a tool to process data files and any metadata is missing, so the datetime format is undetermined beforehand.

Is there maybe an established or well-known algorithm for this? It seemed like a straight forward problem, but in practice with large datasets, many values and many possible formats it's more complicated. For example I know Python's pandas library has a pretty fast and accurate guess_date_format function, see source here. But how would you do something like this in C#?

This is the code I've come up with so far

class Program
{
    internal class DateFormatInfer
    {
        public int Count { get; set; } = 0;
        public DateTime MinDate { get; set; }
        public DateTime MaxDate { get; set; }
    }

    static private string[] GetPossibleFormats()
    {
        return new[]
        {
            "dd/MM/yyyy", "MM/dd/yyyy", "yyyy/MM/dd", "dd-MM-yyyy", "MM-dd-yyyy",
            "yyyy-MM-dd", "dd.MM.yyyy", "MM.dd.yyyy", "yyyy.MM.dd", "ddMMyyyy" // "dd-MM-yy", "yyyy-MM-dd HH:mm", "yyyy-MM-dd HH:mm:ss" etc.
        };
    }

    static void Main(string[] args)
    {
        string filePath = @"C:\Users\bas_d\source\repos\WinFormsApp_DateTest\WinFormsApp_DateTest\date_values_test.csv";

        Dictionary<string, DateFormatInfer> dateFormats = new Dictionary<string, DateFormatInfer>();
        foreach (var format in GetPossibleFormats())
        {
            DateFormatInfer info = new DateFormatInfer();
            dateFormats.Add(format, info);
        }

        // Read the file line by line
        using (StreamReader reader = new StreamReader(filePath))
        {
            string line;
            int linecount = 0;
            while ((line = reader.ReadLine()) != null)
            {
                linecount++;
                foreach (var dtformat in dateFormats)
                {
                    if (DateTime.TryParseExact(line, dtformat.Key, CultureInfo.InvariantCulture, DateTimeStyles.None, out DateTime parsedDate))
                    {
                        // keep track of count and min/max
                        dtformat.Value.Count++;
                        if ( (parsedDate < dtformat.Value.MinDate) || (dtformat.Value.Count == 1) ) dtformat.Value.MinDate = parsedDate;
                        if ( (parsedDate > dtformat.Value.MaxDate) || (dtformat.Value.Count == 1) ) dtformat.Value.MaxDate = parsedDate;
                    }
                }
            }
            Console.WriteLine($"TOTAL VALUES = {linecount}");
        }

        // display results
        foreach (KeyValuePair<string, DateFormatInfer> kvp in dateFormats)
        {
            DateFormatInfer info = kvp.Value;
            var strmin = (info.MinDate.Year == 1 ? "" : info.MinDate.ToString(kvp.Key));
            var strmax = (info.MaxDate.Year == 1 ? "" : info.MaxDate.ToString(kvp.Key));
            Console.WriteLine($"{kvp.Key} = {info.Count}: {strmin} ~ {strmax}");
        }

        Console.ReadLine();
    }
}

As a test input I've got a file with values 04/09/2023 through 31/05/2023, so date format MM/dd/yyyy. The program works and gives the expected result for this test file:

TOTAL VALUES = 53
dd/MM/yyyy = 16: 05/01/2023 ~ 05/12/2023
MM/dd/yyyy = 53: 04/09/2023 ~ 05/31/2023
yyyy/MM/dd = 0:  ~
dd-MM-yyyy = 0:  ~
MM-dd-yyyy = 0:  ~
yyyy-MM-dd = 0:  ~
dd.MM.yyyy = 0:  ~
MM.dd.yyyy = 0:  ~
yyyy.MM.dd = 0:  ~
ddMMyyyy = 0:  ~

But there are a couple of practical problems with this

  • It should work for large data files with potentially multiple datetime columns with unknown format
  • This is a brute force method and going through all combinations for all of columns gives poor performance
  • How to get an exhaustive/definitive list of all possible formats
  • When the list of possible formats becomes larger (date, datetime, two digit years, milliseconds etc) it will cause further performance issues

Any thoughts on this, what is the best way to approach this?

BdR
  • 2,770
  • 2
  • 17
  • 36
  • 1
    Are the target files expected to stick to one format per file or could they even contain different formats in one file? – Fildor May 21 '23 at 15:02
  • A file can contain columns with different datetime formats, although in practise it's usually for example `yyyy-MM-dd` combined with `yyyy-MM-dd HH:mm:ss`. Or sometimes it's a separate date and time column so one column `yyyy-MM-dd` and another column with `HH:mm:ss` which technically are two different datetime formats. – BdR May 21 '23 at 20:09
  • If you believe that pandas can do this for you, why not just let pandas do it? Andno, it is not straightforward. If you have got many values in a column, it may help, in particular for distinguishing MM-dd-yyyy from dd-MM-yyyy, which can otherwise be impossible. – Ole V.V. May 22 '23 at 23:59
  • 2
    *`04/09/2023` through `31/05/2023`, so date format `MM/dd/yyyy`*. Are you very sure that your calendar has got 31 months? :-P – Ole V.V. May 23 '23 at 00:01
  • 1
    You want to support both "dd-MM-yyyy" and "MM-dd-yyyy". So, would "01-02-2024" be January 2nd, or February 1st? … Ditto for "dd/MM/yyyy" and "MM/dd/yyyy". – Basil Bourque May 23 '23 at 23:48

1 Answers1

0

I worked on a similar task where I needed to parse a String to Date if it fit any date format. I worked in Java but the idea could be easily ported. So, my idea was to create a property file that contains all date formats that I would like to support. And than I attempted to parse the incoming String against each format one after another until succeeded or until I ran out of formats. Note that having formats in property file meant I could update my format list without any code changes. Also format ordering is important as sometimes String could match more than one format for example US vs European standard 02.03.2023 could be interpreted as March 2d or Feb 3d. I wrote an article summarizing the idea: Java 8 java.time package: parsing any string to date

Michael Gantman
  • 7,315
  • 2
  • 19
  • 36