0

I have to convert a (Squid Web Proxy Server) log file to CSV file, so that it can be loaded into powerpivot for analysis of queries. So how should I start, any help would strongly be appreciated. I've to use C# language for this task, log looks like the following:

Format: Timestamp Elapsed Client Action/Code Size Method URI Ident Hierarchy/From Content

1473546438.145    917 5.45.107.68 TCP_DENIED/403 4114 GET http://atlantis.pennergame.de/pet/ - NONE/- text/html
1473546439.111      3 146.148.96.13 TCP_DENIED/403 4604 POST http://mobiuas.ebay.com/services/mobile/v1/UserAuthenticationService - NONE/- text/html
1473546439.865    358 212.83.168.7 TCP_DENIED/403 3955 GET http://www.theshadehouse.com/left-sidebar-post/ - NONE/- text/html
1473546439.985    218 185.5.97.68 TCP_DENIED/403 3600 GET http://www.google.pl/search? - NONE/- text/html
1473546440.341      2 146.148.96.13 TCP_DENIED/403 4604 POST http://mobiuas.ebay.com/services/mobile/v1/UserAuthenticationService - NONE/- text/html
1473546440.840    403 115.29.46.240 TCP_DENIED/403 4430 POST http://et.airchina.com.cn/fhx/consumeRecord/getCardConsumeRecordList.htm - NONE/- text/html
1473546441.486      2 52.41.27.39 TCP_DENIED/403 3813 POST http://www.deezer.com/ajax/action.php - NONE/- text/html
1473546441.596      2 146.148.96.13 TCP_DENIED/403 4604 POST http://mobiuas.ebay.com/services/mobile/v1/UserAuthenticationService - NONE/- text/html
Slai
  • 22,144
  • 5
  • 45
  • 53
Ahsan Alii
  • 153
  • 1
  • 16
  • This *is* a CSV, one where the field separator is ` ` instead of `,` – Panagiotis Kanavos Oct 20 '16 at 13:49
  • 1
    just create a loop that counts from 0 to 3 replacing two whitespaces with one and after the loop replace every whitespace with a semicolon. Then write all to a file – Radinator Oct 20 '16 at 13:55
  • 1
    This is __not__ a CSV. Just something close to a CSV. It has fixed-width in front and in the tail you can see some `" - "` separators. – H H Oct 20 '16 at 14:01
  • Didn't get your point, are you sure 3 loops would be enough for this ? – Ahsan Alii Oct 20 '16 at 14:03
  • 1
    @HenkHolterman `-` is data, not a separator. In this case the separator is "multiple spaces". Power BI probably can't handle this, but a simple regex is enough to clean it – Panagiotis Kanavos Oct 20 '16 at 14:06
  • There are few ways to do it in Excel without the conversion to CSV. For example the [`Splitter.SplitTextByWhitespace`](https://msdn.microsoft.com/en-us/library/mt260939.aspx) Power Query Formula – Slai Oct 20 '16 at 14:53
  • @AhsanAlii: did it with Notepad++, three times are enough (if you feel better, use more iterations :D) – Radinator Oct 20 '16 at 15:05

3 Answers3

3

It is already close to a CSV, so read it line by line and clean each line up a little:

...
line = line
  .Replace("   ", " ")  // compress 3 spaces to 1
  .Replace("  ", " ")   // compress 2 spaces to 1
  .Replace("  ", " ")   // compress 2 spaces to 1, again
  .Replace(" ", "|")    // replace space by '|'
  .Replace(" - ", "|"); // replace  -  by '|'

You may want to tweak this for the fields like TCP_DENIED/403 .

this gives you a '|' separated line. Easy to convert to any separator you need. Or split it up:

// write it out or process it further    
string[] parts = line.split('|');
H H
  • 263,252
  • 30
  • 330
  • 514
  • So you means I would not have to use regex for this operation ? – Ahsan Alii Oct 20 '16 at 14:08
  • A regex is the *simpler* option and uses a *LOT* less memory. This answer will create one temporary string per replace, per line and then 10 for each split. That's 5x memory for the replacements and roughly 1 more for the split. Per line. How large is your log file? – Panagiotis Kanavos Oct 20 '16 at 14:11
  • Regex might be better but the text is already quite structured and prepared (no spaces inside any fields). – H H Oct 20 '16 at 14:11
  • Which is why a regex is simpler in this case and far cheaper – Panagiotis Kanavos Oct 20 '16 at 14:11
  • @PanagiotisKanavos - they are extremely short lived strings, just what the GC is optimized for. Regex has its advantages but don't spout this memory nonsense. – H H Oct 20 '16 at 14:12
  • They are *not* short lived. They get abandoned quickly but they still take memory until the GC collects them. This causes a huge CPU hit for collection, *and* memory bloat. Which is why log parsing uses regular expressions. If you use a memory profiler or Visual Studio's diagnostics window you'll see the memory increase fast for quite some time, then drop with a big CPU spike repeatedly. – Panagiotis Kanavos Oct 20 '16 at 14:14
  • Try it with a large file. Then time both of them. Not only is memory usage and CPU far lower, processing also takes a fraction of the time – Panagiotis Kanavos Oct 20 '16 at 14:17
  • A `String.Split(," ",StringSplitOptions.None)` would do the same work as the multiple replacements using only double the memory. It would still create a lot of temporary strings – Panagiotis Kanavos Oct 20 '16 at 14:19
  • My log files is about 800 MB. – Ahsan Alii Oct 20 '16 at 19:46
  • _A `String.Split(," ",StringSplitOptions.None)` would do the same_ - not at all, at the very least you'll have to use RemoveEmptyEntries but even then it would seem feeble. – H H Oct 21 '16 at 07:20
  • @AhsanAlii - 800MB would crack a 32 bit application, you will need 4x800 MB RAM. Using x64 and ample RAM it would work. But you should never need more then 1 line at a time with ReadLines (note: not ReadAllLines). – H H Oct 21 '16 at 07:23
  • Tried, but it's still taking a hell lot of time, infact if I directly gives it a log file as input, the program crashes after 100 seconds. – Ahsan Alii Oct 21 '16 at 17:18
  • Look into it carefully and maybe ask a new question. But "... it crashes" is not enough, you'll have to post code and error messages. – H H Oct 21 '16 at 17:48
2
public static class SquidWebProxyServerCommaSeparatedWriter
{
    public static void WriteToCSV(string destination, IEnumerable<SquidWebProxyServerLogEntry> serverLogEntries)
    {
        var lines = serverLogEntries.Select(ConvertToLine);

        File.WriteAllLines(destination, lines);
    }

    private static string ConvertToLine(SquidWebProxyServerLogEntry serverLogEntry)
    {
        return string.Join(@",", serverLogEntry.Timestamp, serverLogEntry.Elapsed.ToString(),
            serverLogEntry.ClientIPAddress, serverLogEntry.ActionCode, serverLogEntry.Size.ToString(),
            serverLogEntry.Method.ToString(), serverLogEntry.Uri, serverLogEntry.Identity,
            serverLogEntry.HierarchyFrom, serverLogEntry.MimeType);
    }
}    

public static class SquidWebProxyServerLogParser
{
    public static IEnumerable<SquidWebProxyServerLogEntry> Parse(FileInfo fileInfo)
    {
        using (var streamReader = fileInfo.OpenText())
        {
            string row;

            while ((row = streamReader.ReadLine()) != null)
            {
                yield return ParseRow(row)
            }
        }
    }

    private static SquidWebProxyServerLogEntry ParseRow(string row)
    {
        var fields = row.Split(new[] {"\t", " "}, StringSplitOptions.None);

        return new SquidWebProxyServerLogEntry
        {
            Timestamp = fields[0],
            Elapsed = int.Parse(fields[1]),
            ClientIPAddress = fields[2],
            ActionCode = fields[3],
            Size = int.Parse(fields[4]),
            Method =
                (SquidWebProxyServerLogEntry.MethodType)
                Enum.Parse(typeof(SquidWebProxyServerLogEntry.MethodType), fields[5]),
            Uri = fields[6],
            Identity = fields[7],
            HierarchyFrom = fields[8],
            MimeType = fields[9]
        };
    }

    public static IEnumerable<SquidWebProxyServerLogEntry> Parse(IEnumerable<string> rows) => rows.Select(ParseRow);
}

public sealed class SquidWebProxyServerLogEntry
{
    public enum MethodType
    {
        Get = 0,
        Post = 1,
        Put = 2
    }

    public string Timestamp { get; set; }
    public int Elapsed { get; set; }
    public string ClientIPAddress { get; set; }
    public string ActionCode { get; set; }
    public int Size { get; set; }
    public MethodType Method { get; set; }
    public string Uri { get; set; }
    public string Identity { get; set; }
    public string HierarchyFrom { get; set; }
    public string MimeType { get; set; }
}
Timothy Stepanski
  • 1,186
  • 7
  • 21
  • 1
    If only I could reach around my neckbeard to pat myself on the back. – Timothy Stepanski Oct 20 '16 at 14:09
  • Besides *not* answering the question, this code won't work - the file has *spaces* not tabs – Panagiotis Kanavos Oct 20 '16 at 14:18
  • 1
    I read it as he was trying to parse the file, not modify it. That said, my string split hits both tabs and spaces. – Timothy Stepanski Oct 20 '16 at 14:20
  • I fixed it that it can now write them back out to a CSV. – Timothy Stepanski Oct 20 '16 at 14:26
  • Above code is not working for me, P.S: what's wrong if we just replace every multiple spaces with single space and in the end single space by , (comma). Would not it be completely converted to CSV and be able to be opened in powerpivot ? – Ahsan Alii Oct 20 '16 at 19:45
  • How to use above code ? I tried to understand but don't how will it work – Ahsan Alii Oct 22 '16 at 14:37
  • I believe your should just copy paste it somewhere then do : var MyLog = SquidWebProxyServerLogParser.ParseRaw(put your data string here); then do : SquidWebProxyServerCommaSeparatedWriter.ConvertToLine(MyLog); which gives you a CSV in string format. – Pedram Oct 22 '16 at 15:25
0

A CSV is a delimited file whose field delimiter is ,. Almost all programs allow you to specify different field and record delimiters, using , and \n as defaults.

Your file could be treated as delimited if it didn't contain multiple spaces for indentation. You can replace multiple spaces with a single one using the regex \s{2,}, eg:

var regex=new Regex(@"\s{2,}");
var original=File.ReadAllText(somePath);
var delimited=regex.Replace(original," ");
File.WriteAllText(somePath,delimited);

Power BI Desktop already allows you to use space as a delimiter. Even if it didn't, you could just replace all spaces with a comma by changing the pattern to \s+, ie:

var regex=new Regex(@"\s+");
...
var delimited=regex.Replace(original,",");
...

Log files are large, so it's a very good idea to reduce the amount of memory they use. You can avoid reading the entire file in memory if you use ReadLines to read one line at a time, make the replacement and write it out:

using(var writer=File.CreateText(targetPath))
{
    foreach(var line in File.ReadLines(somePath))
    {
        var newline=regex.Replace(line," ");
        writer.WriteLine(newline);
    }
}

Unlike ReadAllLines which loads all lines in an array, ReadLines is an iterator that reads and returns one line at a time.

Panagiotis Kanavos
  • 120,703
  • 13
  • 188
  • 236
  • `ReadAllText()` ... Now that is memory bloat. This approach is only suited for logfiles that fit into memory, twice. Apply the Regex on a line by line basis and you're very close to the memory footprint of my answer. – H H Oct 20 '16 at 14:18
  • Which can be very easily be fixes with a ReadLines, which returns an IEnumerable. But that wasn't the question asked – Panagiotis Kanavos Oct 20 '16 at 14:20