I have a TSV file, foo.tsv, with the column names "a", "b", "c", "d". I want to read this file and load its contents into a PDL matrix. foo.tsv looks like this:

a   b   c   d
1   6   7   4
2   7   6   10
3   8   5   6
4   9   4   8
5   10  3   7

I used this code to read the file into a matrix and print it:

use PDL::Core qw(pdl);
use PDL::IO::CSV ':all';

# Header set to the first row following https://github.com/kmx/pdl-io-csv
# Sep_char set to the tab
my $data = rcsv2D('foo.tsv', {text2bad => 1, header => 1, sep_char => "\t"});

print $data;

The printed matrix is wrong: it lacks the first row of numbers after the header:

[
 [ 2  3  4  5]
 [ 7  8  9 10]
 [ 6  5  4  3]
 [10  6  8  7]
]

I changed the header value to 'auto', which should skip rows in which every column is non-numeric:

my $data = rcsv2D('foo.tsv', {text2bad => 1, header => 'auto', sep_char => "\t"});

Now I get a warning, but the matrix looks correct:

Argument "auto" isn't numeric in foreach loop entry at C:/sw/pdl/perl/vendor/lib/PDL/IO/CSV.pm line 335, <DATA> line 207.
[
 [ 1  2  3  4  5]
 [ 6  7  8  9 10]
 [ 7  6  5  4  3]
 [ 4 10  6  8  7]
]

I do not understand why the two matrices differ, or why skipping the header row with header => 1 gives a wrong result.

zubenel
  • @ikegami I looked at the `auto` option because the documentation says it skips lines with non-numeric values. `header => 0` does not give this result: the header line is not skipped but turned into BAD values. – zubenel Jan 20 '20 at 18:35
  • @ikegami You are right. I got confused by this sentence: _Parameters and items supported in options hash are the same as by "rcsv1D"_. – zubenel Jan 20 '20 at 18:46

2 Answers

It appears to be a bug that was fixed in 0.011.

0.011   2019/12/04
        - fix: header option eats extra line #2
        - fix: cpantesters failure on long-double perls

With 0.011, your code works fine.

use strict;
use warnings;

use PDL::IO::CSV ':all';

my $data = rcsv2D('foo.tsv', {text2bad => 1, header => 1, sep_char => "\t"});
print $data;

$ perl -e'
   CORE::say join "\t", @$_
      for
         [qw( a  b  c  d  )],
         #    -- -- -- --
         [qw(  1  6  7  4 )],
         [qw(  2  7  6 10 )],
         [qw(  3  8  5  6 )],
         [qw(  4  9  4  8 )],
         [qw(  5 10  3  7 )];
' >foo.tsv

$ perl a.pl

[
 [ 1  2  3  4  5]
 [ 6  7  8  9 10]
 [ 7  6  5  4  3]
 [ 4 10  6  8  7]
]

(Note that header=>'auto' is not supported by rcsv2D, and is being treated as header=>0 after issuing the warning you reported.)
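
A minimal core-Perl sketch of why 'auto' ends up acting like 0. The loop is only an assumed shape for what CSV.pm does at the line named in the warning, not code copied from the module, and `$header` and `$skipped` are illustrative names:

```perl
use strict;
use warnings;

# Collect warnings instead of printing them immediately.
my @warnings;
local $SIG{__WARN__} = sub { push @warnings, @_ };

my $header  = 'auto';   # the value passed as header => 'auto'
my $skipped = 0;

# Assumed shape of a header-skipping loop (NOT taken from CSV.pm).
# In a numeric range the string 'auto' is converted to 0,
# so the range 1 .. 0 is empty and the body never runs.
for my $line (1 .. $header) {
    $skipped++;
}

print "header lines skipped: $skipped\n";
print "warning: $_" for @warnings;
```

The count stays at 0 and Perl emits an "isn't numeric" warning of the same kind as the one in the question.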

ikegami
  • I checked the file I am loading: it has a header, and it opens fine in Excel because the entries are separated by tabs. I also checked the version of PDL::IO::CSV with this command: `perl -MPDL::IO::CSV -le "print $PDL::IO::CSV::VERSION"`. I have version `0.010`. – zubenel Jan 20 '20 at 19:17
  • Looking at the change log, it appears to be a bug fixed in 0.011. I've updated my answer. – ikegami Jan 20 '20 at 19:27

I found out that I have version 0.010 of PDL::IO::CSV. According to the Changes file, this version has a bug in which the header option eats an extra line. This was fixed in version 0.011:

0.011   2019/12/04
        - fix: header option eats extra line #2
        - fix: cpantesters failure on long-double perls
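
A small sketch of how the installed version could be checked against the fixed release from within Perl. The helper name `has_header_bug` is made up for illustration, and the 0.011 threshold is taken from the change log entry above:

```perl
use strict;
use warnings;
use version;

# Release whose change log lists "fix: header option eats extra line".
my $FIXED_IN = version->parse('0.011');

# Hypothetical helper: true if the given version string predates the fix.
sub has_header_bug {
    my ($installed) = @_;
    return version->parse($installed) < $FIXED_IN;
}

# Report on the installed module, if any. The eval keeps the script
# runnable even where PDL::IO::CSV is not installed.
if (eval { require PDL::IO::CSV; 1 }) {
    my $v = $PDL::IO::CSV::VERSION;
    print "PDL::IO::CSV $v: ",
        (has_header_bug($v) ? "upgrade needed\n" : "header => 1 should work\n");
}
else {
    print "PDL::IO::CSV is not installed\n";
}
```

Alternatively, `use PDL::IO::CSV 0.011 ':all';` makes Perl itself refuse to load an older version.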

EDIT: I found the solution independently, but ikegami's answer is more useful because it explains the behaviour of header => 'auto'.

zubenel