
I have lots of data dumps - a pretty huge amount of data - structured as follows:

Key1:.............. Value
Key2:.............. Other value
Key3:.............. Maybe another value yet

Key1:.............. Different value
Key3:.............. Invaluable
Key5:.............. Has no value at all

Which I would like to transform into something like:

Key1,Key2,Key3,Key5
Value,Other value,Maybe another value yet,
Different value,,Invaluable,Has no value at all

I mean:

  • Generate a collection of all the keys
  • Generate a header line with all the Keys
  • Map all the values to their correct "columns" (notice that in this example there is no "Key4", and Key3/Key5 are interchanged)
  • Possibly in Perl, since it would be easier to use in various environments.

But I am not sure if this format is unusual, or if there is a tool that already does this.

Rafareino
  • What exactly is your question? The best format for transformation would be whatever the "thing" that's going to read the data wants! – Matt Jacob Apr 06 '17 at 18:38
  • That looks like a simple transposal. Are the files fairly small? What problems have you had with coding a solution? – Borodin Apr 06 '17 at 18:41
  • Are the dots really in the data? – Borodin Apr 06 '17 at 18:49
  • Have you got the 'Invaluable' in the right column of the CSV output? If so, how is it right? (The double commas should be before, not after, 'Invaluable' — I think.) – Jonathan Leffler Apr 06 '17 at 19:25
  • When you say 'pretty huge amount of data' is it so huge that it won't all fit into the memory of the computer processing the file? It is a speed and ease vs use of memory issue... – dawg Apr 06 '17 at 23:39
  • Not that huge @dawg, I simply have about 3 or 4 dozen such reports, from which I would like to generate 3 or 4 dozen CSV dumps (which I am loading into a database to automate a few analyses, but that doesn't come into the question). Memory is not an issue at all (in fact my server has 20+ gigs of **free** mem); the server is some form of high-end one, designed for a big app that requires lots of IO but not that much RAM, but it seems to be packaged in a way that we couldn't buy only the few gigs we really needed, so instead we ended up with lots of unused RAM. – Rafareino Apr 07 '17 at 14:47
  • @Borodin Yes, the dots are there – Rafareino Apr 07 '17 at 14:50
  • @JonathanLeffler you are right, fixed – Rafareino Apr 07 '17 at 14:50

3 Answers


This is fairly easy using hashes and the Text::CSV_XS module:

use strict;
use warnings;

use Text::CSV_XS;

my @rows;
my %headers;

{
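    # paragraph mode: a record is everything up to a blank line,
    # so each <DATA> read returns one whole record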
    local $/ = "";

    while (<DATA>) {
        chomp;
        my %record;

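        # pull "Key" and "value" out of each "Key:...... value" line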
        for my $line (split(/\n/)) {
            next unless $line =~ /^([^:]+):\.+\s(.+)/;
            $record{$1} = $2;
            $headers{$1} = $1;
        }

        push(@rows, \%record);
    }
}

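# prepend the header hash as the first row: every key maps to itself,
# so printing it produces the header line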
unshift(@rows, \%headers);

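# sorted keys fix the column order; print_hr leaves fields empty
# for keys a record doesn't have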
my $csv = Text::CSV_XS->new({binary => 1, auto_diag => 1, eol => $/});
$csv->column_names(sort(keys(%headers)));

for my $row_ref (@rows) {
    $csv->print_hr(*STDOUT, $row_ref);
}

__DATA__
Key1:.............. Value
Key2:.............. Other value
Key3:.............. Maybe another value yet

Key1:.............. Different value
Key3:.............. Invaluable
Key5:.............. Has no value at all

Output:

Key1,Key2,Key3,Key5
Value,"Other value","Maybe another value yet",
"Different value",,Invaluable,"Has no value at all"
Matt Jacob

If your CSV format is 'complicated' - e.g. it contains commas, etc. - then use one of the Text::CSV modules. But if it isn't - and this is often the case - I tend to just work with split and join.
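
For example (a tiny standalone sketch, not part of the script below), an embedded comma is exactly the case where the module earns its keep:

use Text::CSV_XS;

my $csv = Text::CSV_XS->new({binary => 1});
$csv->combine('Smith, John', 'plain');
print $csv->string, "\n";    # prints: "Smith, John",plain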

What's useful in your scenario is that you can map the key-value pairs within a record quite easily using a regex. Then use a hash slice to output:

#!/usr/bin/env perl

use strict;
use warnings;

#set paragraph mode - records are blank line separated. 
local $/ = "";

my @rows;
my %seen_header;

#read STDIN or files on command line, just like sed/grep 
while ( <> ) {
   #multi - line pattern, that matches all the key-value pairs,
   #and then inserts them into a hash. 
   my %this_row = m/^(\w+):\.+ (.*)$/gm;
   push ( @rows, \%this_row ); 

   #add the keys we've seen to a hash, so we 'know' what we've seen. 
   $seen_header{$_}++ for keys %this_row; 
}

#extract the keys, make them unique and ordered. 
#could set this by hand if you prefer.    
my @header = sort keys %seen_header;

#print the header row
print join(",", @header), "\n";

#iterate the rows
foreach my $row ( @rows ) {
    #use a hash slice to select the values matching @header.
    #the map is so any undefined values (missing keys) don't report errors, they
    #just return blank fields. 
    print join(",", map { $_ // '' } @{$row}{@header}), "\n";
}

This, for your sample input, produces:

Key1,Key2,Key3,Key5
Value,Other value,Maybe another value yet,
Different value,,Invaluable,Has no value at all

If you want to be really clever, then most of that initial building of the loop can be done with:

my @rows = map { { m/^(\w+):\.+ (.*)$/gm } } <>;

The problem then is that you would still need to build up the 'headers' array, which means something a bit more complicated:

$seen_header{$_}++ for map { keys %$_ } @rows;

It works, but I don't think it's as clear about what's happening.
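
For reference, here is that condensed version in full (a sketch with the same assumptions about the input format as the script above):

#!/usr/bin/env perl

use strict;
use warnings;

local $/ = "";

#one pass: each record becomes an anonymous hash of key-value pairs
my @rows = map { { m/^(\w+):\.+ (.*)$/gm } } <>;

#collect every key seen in any record
my %seen_header;
$seen_header{$_}++ for map { keys %$_ } @rows;
my @header = sort keys %seen_header;

print join(",", @header), "\n";

for my $row ( @rows ) {
    print join(",", map { $_ // '' } @{$row}{@header}), "\n";
}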

However, the core of your problem may be the file size: you need to read the file twice - the first time to figure out which headings exist throughout the file, and the second time to iterate and print:

#!/usr/bin/env perl

use strict;
use warnings;

open ( my $input, '<', 'your_file.txt') or die $!;
local $/ = "";

my %seen_header;
while ( <$input> ) { 
    $seen_header{$_}++ for m/^(\w+):/gm; 
}  

my @header = sort keys %seen_header;

#print the header row before the second pass
print join(",", @header), "\n";

#return to the start of file:
seek ( $input, 0, 0 ); 

while ( <$input> )  {
   my %this_row = m/^(\w+):\.+ (.*)$/gm;
   print join(",", map { $_ // '' } @this_row{@header}), "\n";
}

This will be slightly slower, as it has to read the file twice. But its memory footprint will be much smaller, because it isn't holding the whole file in memory.

Unless you know all your keys in advance, and you can just define them, you'll have to read the file twice.
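
If you do know them all in advance, a single pass is enough - a minimal sketch, assuming the four keys from the sample data are the complete set:

#!/usr/bin/env perl

use strict;
use warnings;

local $/ = "";

#keys fixed up front, so no discovery pass is needed
my @header = qw(Key1 Key2 Key3 Key5);
print join(",", @header), "\n";

while ( <> ) {
   my %this_row = m/^(\w+):\.+ (.*)$/gm;
   print join(",", map { $_ // '' } @this_row{@header}), "\n";
}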

Sobrique
  • Totally agree, and I would not have problems either reading the file twice or using a lot of memory; my main problem was exactly needing to read the file first to get the keys and then crafting some custom code for each keyset. I was just a bit lost trying to do exactly that collection of keys upfront and then dumping them correctly. I am learning a lot by studying your code, thanks! – Rafareino Apr 07 '17 at 14:56

This seems to work with the data you've given:

use strict;
use warnings 'all';

my %data;

while ( <> ) {

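    # key is the \w+ before the colon; \W* skips the ":......" padding
    # and (.*\S) captures the value without trailing whitespace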
    next unless /^(\w+):\W*(.*\S)/;

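    # collect each key's values across all records, in input order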
    push @{ $data{$1} }, $2;
}

use Data::Dump;
dd \%data;

Output:

{
  Key1 => ["Value", "Different value"],
  Key2 => ["Other value"],
  Key3 => ["Maybe another value yet", "Invaluable"],
  Key5 => ["Has no value at all"],
}
Borodin
  • That doesn't look all that much like the desired output. Granted, with only 2 records, it isn't too hard to work out what's going on, but even with only a dozen records with varying lists of attributes, the format you show is not manageable. – Jonathan Leffler Apr 06 '17 at 19:54