3

I have to use Perl on a Windows environment at work, and I need to be able to find out the number of rows that a large csv file contains (about 1.4Gb). Any idea how to do this with minimum waste of resources?

Thanks

PS This must be done within the Perl script and we're not allowed to install any new modules onto the system.

brian d foy
  • 129,424
  • 31
  • 207
  • 592
Alex Wong
  • 761
  • 3
  • 9
  • 15
  • 1
    "Not allowed to install new modules" is generally a conceptual myth. http://stackoverflow.com/questions/755168/perl-myths/755179#755179 – Kent Fredric Apr 16 '09 at 08:39

6 Answers6

14

Do you mean lines or rows? A cell may contain line breaks which would add lines to the file, but not rows. If you are guaranteed that no cells contain new lines, then just use the technique in the Perl FAQ. Otherwise, you will need a proper CSV parser like Text::xSV.

jiggy
  • 3,828
  • 1
  • 25
  • 40
  • 1
    You should amend your question, since every other comment is just doing line counting. – jiggy Apr 15 '09 at 15:58
  • +1, good point, but it's also worth mentioning that there is no "official" CSV format -- just a collection of loosely-defined, somewhat incompatible formats that disagree on things like how to quote commas and whether line breaks are allowed in cells. Many tools assume 1 row == 1 line. – j_random_hacker Apr 15 '09 at 17:11
9

Yes, don't use perl.

Instead use the simple utility for counting lines; wc.exe

It's part of a suite of windows utilities ported from unix originals.

http://unxutils.sourceforge.net/

For example;

PS D:\> wc test.pl
     12      26     271 test.pl
PS D:\>

Where 12 == number of lines, 26 == number of words, 271 == number of characters.

If you really have to use perl;

D:\>perl -lne "END{print $.;}" < test.pl
12
Ed Guiness
  • 34,602
  • 16
  • 110
  • 145
  • 1
    Sure, wc would be the way to go on *nix where it will already be installed -- but is it really worth downloading a separate executable to do something that takes short line of Perl? – j_random_hacker Apr 15 '09 at 15:31
  • Yes, because Cygwin is a must-have for any Windows dev environment. – KenE Apr 15 '09 at 15:33
  • 1
    This isn't Cygwin but still a must-have. Counting lines in a file is such a common activity that its definitely worth having this utility. – Ed Guiness Apr 15 '09 at 15:34
  • @KenE: Is that sarcasm? FTR UnxUtils are non-Cygwin-based. – j_random_hacker Apr 15 '09 at 15:35
  • @edg: I see you added a Perl suggestion, +1. – j_random_hacker Apr 15 '09 at 15:36
  • Just use "wc -l filename". This is good advice, a very simple c/parser may be faster that one written in perl. With wc you don't have to write the parser. I use the GnuWin32 tools every day, they are worth getting even if you decide to write a perl parser: http://gnuwin32.sourceforge.net – daotoad Apr 15 '09 at 15:36
  • FTR it seems slightly silly that the asker is "required" to use Perl for this... Also I'd expect wc to be marginally faster (definitely faster startup, but that doesn't matter much for huge files). – j_random_hacker Apr 15 '09 at 16:00
  • For me, installing MSYS (http://www.mingw.org/ - I used to use unxutils) AND perl are the first things I do on any Windows system I have to work on. – runrig Apr 15 '09 at 16:09
  • Thank god the OP is using Perl. Imagine him using Java and not being allowed to install anything else. – innaM Apr 15 '09 at 16:27
  • @Manni: Or imagine if he was only allowed to use COBOL on a locked-down abacus that's on fire. :-P – j_random_hacker Apr 15 '09 at 16:52
  • Not sarcasm - I mean Cygwin or some equivalent which gives you these tools. Cygwin happens to be what I'm familiar with. – KenE Apr 15 '09 at 17:12
4
perl -lne "END { print $. }" myfile.csv

This only reads one line at a time, so it doesn't waste any memory unless each line is enormously long.

j_random_hacker
  • 50,331
  • 10
  • 105
  • 169
  • Lines are not the same thing as CSV rows. Consider fields with embedded newlines, for instance. – brian d foy Apr 16 '09 at 16:36
  • @brian: That's true. But it's also true that working with CSV files containing fields with embedded newlines is destined to be painful, because there's no universal agreement across tools on whether or how such things should be encoded -- unfortunately, CSV is not a "standard." – j_random_hacker Apr 16 '09 at 17:44
3

This one-liner handles new lines within the rows:

  1. Considering lines with an odd number of quotes.
  2. Considering that doubled quotes is a way of indicating quotes within the field.
  3. It uses the awesome flip-flop operator.

    perl -ne 'BEGIN{$re=qr/^[^"]*(?:"[^"]*"[^"]*)*?"[^"]*$/;}END{print"Count: $t\n";}$t++ unless /$re/../$re/'
    

Consider:

  • wc is not going to work. It's awesome for counting lines, but not CSV rows
  • You should install--or fight to install--Text::CSV or some similar standard package for proper handling.
  • This may get you there, nonetheless.


EDIT: It slipped my mind that this was windows:
perl -ne "BEGIN{$re=qr/^[^\"]*(?:\"[^\"]*\"[^\"]*)*?\"[^\"]*$/;}END{print qq/Count: $t\n/;};$t++ unless $pq and $pq = /$re/../$re/;"

The weird thing is that The Broken OS' shell interprets && as the OS conditional exec and I couldn't do anything to change its mind!! If I escaped it, it would just pass it that way to perl.

Axeman
  • 29,660
  • 2
  • 47
  • 102
0

Upvote for edg's answer, another option is to install cygwin to get wc and a bunch of other handy utilities on Windows.

KenE
  • 1,805
  • 10
  • 6
  • IME, Cygwin adds too much complication unless you want to run a pseudo-unix environment. MinGW and MSYS provide a lighter weight system that works well for software development. For simple command line tools, GnuWin32 offers a good selection of tools with low impact, simple installers. – daotoad Apr 15 '09 at 16:26
  • Thanks for the tip - I'll give those a try sometime! – KenE Apr 15 '09 at 17:13
  • wc is not the answer since it counts lines, which is not the same as a CSV row. See Axeman's answer. – brian d foy Apr 16 '09 at 16:35
-1

I was being idiotic, the simple way to do it in the script is:

open $extract, "<${extractFileName}" or die ("Cannot read row count of $extractFileName");
$rowCount=0;    
while (<$extract>)
{
    $rowCount=$rowCount+1;
}

close($extract);
Alex Wong
  • 761
  • 3
  • 9
  • 15