
We are doing some security research, for which I need to extract all available package names, versions, descriptions, etc. from the Debian repositories.

I'm trying to parse the output of apt-cache dumpavail into CSV and organise the data in tabular form: name, version, description.

I'm not very good at AWK, but I guess it's the perfect tool for this? Feel free to recommend ways to build a good regex for AWK.

Gagan Pal

2 Answers


Here's some Perl. It requires Text::CSV from CPAN.

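apt-cache dumpavail | perl -MText::CSV -00 -ane '
    BEGIN {
        $csv = Text::CSV->new({eol=>"\n"});
        @wanted = qw/Package Version Architecture Description/;
        # Header row.
        $csv->print(STDOUT, \@wanted);
        # Anchor at line start so that e.g. "Python-Version:" is not taken for "Version:".
        $re = "^(" . join("|", @wanted) . "): (.+?)(?=\\Z|^[[:upper:]])";
    }
    # -00 reads one stanza per record; collect field => value pairs.
    %m = /$re/msg;
    # Flatten multi-line values onto one line.
    for $key (keys %m) {$m{$key} =~ s/\n//g}
    $csv->print(STDOUT, [@m{@wanted}]);
' > avail.csv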
glenn jackman

I think sed might be better suited, e.g. with GNU sed:

parse.sed

/^Package: /                { s///; h }
/^Version: |^Description: / { s///; H }
/^$/                        { x; s/\n/;/gp }

Explanation:

  • Locate lines starting with the desired prefixes, e.g. /^Package: /
  • Remove the prefix with s///, i.e. substitute the previously matched pattern with nothing (an empty regex reuses the last one matched)
  • Save the rest to hold-space with h or H; note that h overwrites hold-space, while H appends to it after a newline
  • When an inter-package empty line is encountered (/^$/), swap hold-space and pattern-space (x), substitute the newlines with the desired delimiter, here a semicolon (s/\n/;/g), and print the result (the p flag)

Run it like this:

apt-cache dumpavail | sed -nEf parse.sed

With head appended to the pipeline, the output is:

0ad;0.0.23-1+b1;Real-time strategy game of ancient warfare                
0ad-data;0.0.23-1;Real-time strategy game of ancient warfare (data files)
0ad-data-common;0.0.23-1;Real-time strategy game of ancient warfare (common data files)
0xffff;0.8-1;Open Free Fiasco Firmware Flasher
2048-qt;0.1.6-1+b1;mathematics based puzzle game
2ping;4.2-1;Ping utility to determine directional packet loss
2vcard;0.6-1;perl script to convert an addressbook to VCARD file format
fonts-3270;2.0.0-1;monospaced font based on IBM 3270 terminals
389-admin;1.1.46-2;389 Directory Administration Server
libds-admin-serv0;1.1.46-2;Libraries for the 389 Directory Administration Server
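
Since the question asked about AWK, the same per-stanza extraction could be sketched there as well. This is only a rough sketch under some assumptions: the function name avail_to_csv is made up for illustration, the fields are assumed to be spelled exactly "Package: ", "Version: " and "Description: ", and only the first line of each description is kept. Unlike the semicolon-joined output above, it quotes every field, so descriptions that contain the delimiter do not break the columns.

```shell
# Hypothetical helper (not part of apt): reads "apt-cache dumpavail"-style
# stanzas on stdin and writes Package,Version,Description as quoted CSV.
avail_to_csv() {
    awk '
        # Quote a field for CSV, doubling any embedded double quotes.
        function q(s) { gsub(/"/, "\"\"", s); return "\"" s "\"" }

        BEGIN { print "Package,Version,Description" }

        # Strip the field name and keep the value (first line only).
        /^Package: /     { pkg  = substr($0, 10) }
        /^Version: /     { ver  = substr($0, 10) }
        /^Description: / { desc = substr($0, 14) }

        # A blank line ends a stanza: emit the row and reset.
        /^$/ {
            if (pkg != "") print q(pkg) "," q(ver) "," q(desc)
            pkg = ver = desc = ""
        }

        # Flush the last stanza if the input has no trailing blank line.
        END { if (pkg != "") print q(pkg) "," q(ver) "," q(desc) }
    '
}
```

It would be run in the same way as the sed script, for example: apt-cache dumpavail | avail_to_csv > avail.csv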
Thor