4

I am processing a large directory every night. It accumulates around 1 million files each night, half of which are .txt files that I need to move to a different directory according to their contents.

Each .txt file is pipe-delimited and contains only 20 records. Record 6 is the one that contains the information I need to determine which directory to move the file to.

Example Record:

A|CHNL_ID|4

In this case the file would be moved to /out/4.

This script is processing at a rate of 80,000 files per hour.

Are there any recommendations on how I could speed this up?

use File::Copy qw(move);    # move() comes from File::Copy; it is not a builtin

opendir(DIR, $dir) or die "$!\n";
while ( defined( my $txtFile = readdir DIR ) ) {
    next if( $txtFile !~ /\.txt$/ );
    $cnt++;

    local $/;                                   # slurp mode: read the whole file in one go
    open my $fh, '<', $txtFile or die $!, $/;
    my $data  = <$fh>;
    my ($channel) =  $data =~ /A\|CHNL_ID\|(\d+)/i;
    close($fh);

    move ($txtFile, "$outDir/$channel") or die $!, $/;
}
closedir(DIR);
Borodin
DenairPete
  • You could move the regular expression out of the loop and pre-compile it with `qr`, but that won't save you much. You should check what the bottleneck is; it may be CPU, memory, or disk (hint... it's probably disk). – xxfelixxx Mar 17 '18 at 04:26
  • Do you have a directory with 80000 files in it? If so, that itself might be your issue, depending on your filesystem. I would advise breaking them up into a bunch of subdirectories with fewer files in each... if you have more than, say, 1000 files, that is probably already too many. – xxfelixxx Mar 17 '18 at 04:28
  • Also, if you split the incoming work over different disks (perhaps on different machines as well), you could run this program several times in parallel, to speed up processing. – xxfelixxx Mar 17 '18 at 04:31
  • What kind of file system do you have? You can look at `/etc/fstab` or run the `df` command to find out. – xxfelixxx Mar 17 '18 at 04:33
  • As others are implying, you have an I/O-bound task with performance entirely determined by the file system. If it turns out you have more than one disk channel, then you _might_ get a speedup by pipelining. E.g. use N copies of the script with an instance number I=0,1,...N-1 given as a command line argument. Sort the `readdir` result and process files I, I+N, I+2N, ... (a sketch of this follows the comments). Another possibility would be to use one script to determine where to move each file and pass it to another that does the move operation. There's no way to tell if these will result in a speedup except to try. – Gene Mar 17 '18 at 05:16
  • zdim, you should generate 1 million files (not 80k), as the original poster describes. There might be a performance issue with such a huge number of files on that particular file system. – Kjetil S. Mar 17 '18 at 06:37
  • `move` is not a Perl builtin function. Have you tried `rename` instead? If `move()` is something that spawns a sub-process, that could be the culprit. If you cannot for some reason use Perl's `rename`, collect more than one file and move them to the targets in groups. – Kjetil S. Mar 17 '18 at 06:46
  • With `local $/;` you read the whole file unnecessarily. You only need to read up to record #6. Are those large files? (How many records?) – Kjetil S. Mar 17 '18 at 06:48
  • Try putting the files on a RAMdisk to reduce i/o times. – Mark Setchell Mar 17 '18 at 11:00
  • [Crossposted to PerlMonks](https://www.perlmonks.org/?node_id=1211102). – haukex Mar 19 '18 at 06:38
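
A minimal sketch of the sharding idea from Gene's comment above, assuming the same $dir as in the question; process_file() is a hypothetical stand-in for the read-and-move logic:

# run N copies, e.g.:  perl mover.pl 0 4   perl mover.pl 1 4   ...   perl mover.pl 3 4
my ($instance, $workers) = @ARGV;           # instance number I (0..N-1) and total count N

opendir my $dh, $dir or die "Cannot open $dir: $!";
my @txt = sort grep { /\.txt$/ } readdir $dh;
closedir $dh;

# each instance handles files I, I+N, I+2N, ...
for (my $i = $instance; $i < @txt; $i += $workers) {
    process_file("$dir/$txt[$i]");          # hypothetical read-and-move routine
}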

4 Answers

5

You are being hurt by the sheer number of files in a single directory.

I created 80_000 files and ran your script, which completed in 5.2 seconds. This is on an older laptop with CentOS 7 and Perl v5.16. But with half a million files it takes nearly 7 minutes. So the problem is not the performance of your code per se (though that can also be tightened).

Then one solution is simple: run the script out of cron, say every hour, as files come in. While you move the .txt files, also move the others elsewhere; that way there will never be too many files, and the script will always run in seconds. In the end you can move those other files back, if needed.
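
For example, a crontab entry along these lines would run it hourly (the script and log paths here are only placeholders):

0 * * * *  /usr/bin/perl /path/to/move_txt.pl >> /var/log/move_txt.log 2>&1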

Another option is to store these files on a partition with a different filesystem, say ReiserFS. However, this doesn't at all address the main problem of having way too many files in a directory.

Another partial fix is to replace

while ( defined( my $txtFile = readdir DIR ) )

with

while ( my $path = <"$dir/*.txt"> )

which results in a 1m:12s run (as opposed to nearly 7 minutes). Don't forget to adjust the file naming, since <> above returns the full path to the file. Again, this doesn't really deal with the underlying problem.

If you had control over how the files are distributed, you would want a directory structure three (or so) levels deep, with directories named using the files' MD5 hashes, which would result in a very balanced distribution.
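
As a sketch of that idea, a hypothetical sharded_path() helper could derive three directory levels from the MD5 of the file name using Digest::MD5:

use Digest::MD5 qw(md5_hex);

# e.g. turns 's12345.txt' into '<base>/xx/yy/zz/s12345.txt',
# where xx, yy, zz are the first three hex-pairs of the digest
sub sharded_path {
    my ($base, $name) = @_;
    my @levels = unpack '(A2)3', md5_hex($name);
    return join '/', $base, @levels, $name;
}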


File names and their content were created as

perl -MPath::Tiny -wE'
    path("dir/s".$_.".txt")->spew("A|some_id|$_\n") for 1..500_000
'
zdim
  • it works on my laptop, but not on my Enterprise Linux box, which has Perl 5.10. How do I copy the Path::Tiny module manually and use it there? – stack0114106 Mar 21 '19 at 20:07
  • @stack0114106 ah, that -- looking at [Path::Tiny's code](https://metacpan.org/release/Path-Tiny/source/lib/Path/Tiny.pm) ... seems you can just save the file where you want it? I see no troubling dependencies and it's all one file :). Or you can of course just write a string to a file ... but `Path::Tiny` is indeed often super handy :) – zdim Mar 21 '19 at 21:11
3

This is the sort of task that I perform often. Some of these suggestions were already mentioned in various comments. None of them are specific to Perl, and the biggest wins will come from changing the environment rather than the language.

  • Segment files into separate directories to keep the directories small. Larger directories take longer to read (sometimes exponentially), so this segmentation should happen in whatever produces the files. The file path would be something like .../ab/cd/ef/filename.txt, where ab/cd/ef come from some function that is unlikely to collide. Or maybe it's like .../2018/04/01/filename.txt.

  • You probably don't have much control over the producer. I'd investigate making it add lines to a single file. Something else makes separate files out of that later.

  • Run more often and move processed files somewhere else (again, possibly with hashing).

  • Run continually and poll the directory periodically to check for new files.

  • Run the program in parallel. If you have a lot of idle cores, get them working on it too. You'd need something to decide who gets to work on what.

  • Instead of creating files, shove the data into a lightweight data store, such as Redis. Or maybe a heavyweight data store.

  • Don't actually read the file contents. Use a memory-map module such as File::Map instead (a short sketch follows this list). This is often a win for very large files, but I haven't played with it much on large collections of small files.

  • Get faster spinning disks, or maybe an SSD. I once had the misfortune of accidentally creating millions of files in a single directory on a slow disk.
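
For the memory-mapping point, a minimal sketch with File::Map from CPAN, reusing the file variable and regex from the question:

use File::Map qw(map_file);

# map the file instead of slurping it; the OS pages in only what the regex touches
map_file my $data, $txtFile;
my ($channel) = $data =~ /A\|CHNL_ID\|(\d+)/i;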

brian d foy
2

Try something like:

use File::Copy qw(move);                         # for move(); see comments above

print localtime()."\n";                          #to find where time is spent
opendir(DIR, $dir) or die "$!\n";
my @txtFiles = map "$dir/$_", grep /\.txt$/, readdir DIR;
closedir(DIR);

print localtime()."\n";
my %fileGroup;
for my $txtFile (@txtFiles){
    # local $/ = "\n";                           #\n or other record separator
    open my $fh, '<', $txtFile or die $!;
    local $_ = join("", map {<$fh>} 1..6);      #read 6 records, not whole file
    close($fh);
    push @{ $fileGroup{$1} }, $txtFile
      if /A\|CHNL_ID\|(\d+)/i or die "No channel found in $_";
}

for my $channel (sort keys %fileGroup){
  moveGroup( @{ $fileGroup{$channel} }, "$outDir/$channel" );
}
print localtime()." finito\n";

sub moveGroup {
  my $dir = pop @_;
  print localtime()." <- start $dir\n";
  move($_, $dir) for @_;  #or something else if each move spawns a sub process
  #rename($_, "$dir/".basename($_)) for @_;  #rename() needs the full target path (File::Basename)
}

This splits the job into three main parts, so you can time each part and find where most of the time is spent.

Kjetil S.
2

I don't think anyone has brought it up, but have you considered running a long-lived process that uses filesystem notifications as near-realtime events, instead of processing in batches? I'm sure CPAN has something for Perl 5; there is a built-in class in Perl 6 that illustrates what I mean: https://docs.perl6.org/type/IO::Notification. Perhaps someone else can chime in on what a good module to use in Perl 5 would be?
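
A rough sketch of that approach on Linux with Linux::Inotify2 (see the comment below); handle_file() is a hypothetical stand-in for the read-and-move logic from the question:

use Linux::Inotify2;

my $inotify = Linux::Inotify2->new or die "unable to create inotify object: $!";

# fire whenever a file is fully written to, or moved into, the watched directory
$inotify->watch($dir, IN_CLOSE_WRITE | IN_MOVED_TO, sub {
    my $event = shift;
    my $name  = $event->fullname;
    handle_file($name) if $name =~ /\.txt$/;
});

1 while $inotify->poll;    # block and dispatch events forever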

Matt Oates
  • [Linux::Inotify2](http://search.cpan.org/~mlehmann/Linux-Inotify2-1.22/Inotify2.pm) (see [this post](https://stackoverflow.com/a/41206434/4653379) for an example) – zdim Apr 25 '18 at 06:11