
I am working on a Perl script that opens a huge file with records in the format below. The script might run on Solaris 10 or HP-UX 11.0.

Filename1 , col1, col2
Filename1 , col1, col2
Filename2 , col1, col2
Filename3 , col1, col2

When I read the first field (the file name) from each line of the input file, I need to create a new file if it doesn't exist and print the rest of the fields to that file. There might be 13,000 unique file names in the input file. What is the maximum number of file handles that I can open on Solaris 10 or HP-UX 11? Will I be able to open 13,000 file handles? I am planning to use a hash to store the file handles for writing to the files and closing them. Also, how can I easily get the unique file names from the first field across the whole file? Is there an easier way to do it than reading each line of the file?
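Here is a minimal sketch of what I am planning (the input path 'input.txt' is just a placeholder):

use strict;
use warnings;

my %fh_for;    # first field => open output filehandle
open(my $in, '<', 'input.txt') or die "Can't open input.txt: $!";
while (my $line = <$in>) {
    chomp $line;
    my ($name, $rest) = split /\s*,\s*/, $line, 2;
    unless ($fh_for{$name}) {
        # create/open the output file on first sight of its name
        open($fh_for{$name}, '>>', $name) or die "Can't open $name: $!";
    }
    print { $fh_for{$name} } "$rest\n";
}
close $in;
close $_ for values %fh_for;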

Arav

3 Answers


The maximum number of file handles is OS-dependent (and it is configurable).

See ulimit (its manual page has the details).
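If you just want to see the limit that is actually in effect for your process, one rough way to check (a sketch, not production code) is to probe it from Perl:

use strict;
use warnings;

# Open handles on /dev/null until the OS refuses; the count approximates
# the per-process descriptor limit (minus stdin/stdout/stderr and friends).
my @handles;
while (open(my $fh, '<', '/dev/null')) {
    push @handles, $fh;
}
print "Opened ", scalar(@handles), " extra handles; open then failed with: $!\n";
close $_ for @handles;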

However, opening that many file handles at once is unreasonable. Rethink your algorithm.

Ed Heal
  • Thanks a lot for the info. How do I check the hard limits? I am unable to find the /etc/security/limits.conf file in Solaris 10, and there is no system.conf either. Not sure where the config file is. – Arav Oct 18 '12 at 23:23

No, there's no way to get all the unique filenames without reading the entire file. But you can build the list as you process the file: when you read a line, add the filename as a key of a hash. At the end, print the keys of the hash.
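For example (a minimal sketch; 'input.txt' stands in for your actual file):

use strict;
use warnings;

my %seen;
open(my $in, '<', 'input.txt') or die "Can't open input.txt: $!";
while (my $line = <$in>) {
    my ($name) = split /\s*,\s*/, $line;   # first comma-separated field
    $seen{$name} = 1;                      # duplicates simply collapse
}
close $in;
print "$_\n" for sort keys %seen;          # the unique filenames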

Barmar

I don't know what your system allows, but you can open more file handles than your system permits using the FileCache module. This is a core Perl module, so you shouldn't even need to install it.
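A sketch of how it could be used for your input (reading from STDIN; the comma-splitting mirrors your sample format):

use strict;
use warnings;
use FileCache;        # core module; closes and transparently re-opens handles
no strict 'refs';     # FileCache uses the path string as a symbolic filehandle

while (my $line = <STDIN>) {
    chomp $line;
    my ($path, $rest) = split /\s*,\s*/, $line, 2;
    cacheout $path;            # '>' on first use, '>>' on later re-opens
    print $path "$rest\n";     # write the remaining fields to that file
}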

There is no way to get the first column out of a text file without reading the whole file, because text files don't really have an internal structure of columns or even lines; they are just one long string of data. The only way to find each "line" is to go through the whole file and look for newline characters.

However, even huge files are generally processed quite quickly by Perl. This is unlikely to be a problem. Here is simple code to get the unique filenames (assuming your file is opened as FILE):

my %files;
while (<FILE>) {
    /^(\S+)/ and $files{$1}++;   # first run of non-space chars is the filename
}

This ends up with a count of how many times each filename occurs. It assumes that your filenames don't contain any spaces. I did a quick test of this with >30,000 lines, and it was instantaneous.

dan1111
  • I read the ulimit man page; it says you can't cross the hard limits, and only the root user can change them. Does the FileCache module set the soft limit? Also, does it work faster than the usual way of opening and closing files? In the code you pasted above, what is the purpose of matching one or more non-whitespace characters in the regular expression and adding to a hash? The file is a CSV file and I require the first field only. – Arav Oct 18 '12 at 23:25