4

I am reading a file with Java and jcifs on Windows. I need to determine the size of the file, which contains multi-byte as well as ASCII characters.

How can I do this efficiently? Is there an existing Java API for it?

Thanks,

Sach
  • You need to know the character encoding for the question to even make any sense. *Do* you know the encoding? – Jon Skeet Dec 21 '11 at 13:27
  • The file size itself? `new RandomAccessFile(...).getChannel().size()`? – fge Dec 21 '11 at 13:31
  • @fge, this is fine provided there are no multi-byte characters. – Peter Lawrey Dec 21 '11 at 13:58
  • Well, the OP asks for the file _size_, doesn't it? If it is the _length of the text in it_, it is another matter altogether. – fge Dec 21 '11 at 13:59
  • @Jon Skeet: OP wrote (before any edit, if any) *"... which contains multi-byte as well as ASCII characters"*, which makes it sound like the two are mutually exclusive. So it seems to imply two things: ASCII characters are stored in one byte, while non-ASCII characters need at least two bytes. At least that's how I'm interpreting OP's sentence. Given that sentence, if I had to bet on it I'd put money on UTF-8, for I don't see which other common encoding would guarantee that every ASCII character is stored in one byte while every non-ASCII character is stored in at least two bytes : ) – TacticalCoder Dec 21 '11 at 14:41
  • @user988052: My guess is that the OP is actually slightly confused about what ASCII really means and about character encodings. He needs to think carefully about what he's really trying to find. – Jon Skeet Dec 21 '11 at 14:43

2 Answers

3

Without a doubt, to get the exact number of characters you have to read the file using the proper encoding. The remaining question is how to read the file efficiently. Java NIO is the fastest way I know of to do that.

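FileChannel fChannel = new FileInputStream(f).getChannel();
byte[] barray = new byte[(int) f.length()];
ByteBuffer bb = ByteBuffer.wrap(barray);
// A single read() may return before the buffer is full, so loop until done.
while (bb.hasRemaining() && fChannel.read(bb) != -1) { }
fChannel.close();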

then

String str = new String(barray, charsetName);
int charCount = str.length(); // number of Java chars (UTF-16 code units)

Reading into the byte buffer runs at close to the maximum available speed (for me it was about 60 Mb/sec, while a disk speed test gives about 70-75 Mb/sec).
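To make the difference between file size and character count concrete, here is a minimal self-contained sketch. It is not part of the original answer: the class name `SizeVersusLength` and the helper `countChars` are hypothetical, and it assumes Java 7 or later (for `java.nio.file.Files`), UTF-8 content, and a file small enough to fit in memory.

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class SizeVersusLength {

    // Decodes the whole file in memory and returns the number of Java chars.
    // Only suitable for files that fit in the heap.
    static int countChars(Path file, Charset charset) throws IOException {
        byte[] bytes = Files.readAllBytes(file);
        return new String(bytes, charset).length();
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("sample", ".txt");
        try {
            // "héllo" is 5 characters, but 'é' takes 2 bytes in UTF-8.
            Files.write(file, "h\u00e9llo".getBytes(StandardCharsets.UTF_8));
            System.out.println("bytes: " + Files.size(file));                          // bytes: 6
            System.out.println("chars: " + countChars(file, StandardCharsets.UTF_8));  // chars: 5
        } finally {
            Files.deleteIfExists(file);
        }
    }
}
```

The byte size comes straight from the file system and needs no decoding, whereas the character count depends on the charset passed in.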

andrey
  • Isn't that going to be a bit of a memory explosion if you attempt to read a large file? – ewan.chalmers Dec 21 '11 at 14:11
  • Also, `new String(ByteBuffer, String)` does not compile. – ewan.chalmers Dec 21 '11 at 14:47
  • @sudocode Thank you for the comments. You are absolutely right. The algorithm will work only for files that fit in memory (which suits 99.99% of practical tasks), but this is a good point. About `new String(ByteBuffer, String)`, you are right again. I've corrected it to `new String(barray, String)`. Thanks! – andrey Dec 22 '11 at 07:10
1

To get the character count, you'll have to read the file. By specifying the correct file encoding, you ensure that Java correctly reads each character in your file.

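BufferedReader.read() returns the next char as an int in the range 0 to 65535, or -1 at the end of the stream. Each value is a UTF-16 code unit, so a character outside the Basic Multilingual Plane is counted as two. The simple way to count would be like this: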

int countCharsSimple(File f, String charsetName) throws IOException {
    BufferedReader reader = new BufferedReader(new InputStreamReader(new FileInputStream(f), charsetName));
    int charCount = 0;
    while(reader.read() > -1) {
        charCount++;
    }
    reader.close();
    return charCount;
}

You will get faster performance using Reader.read(char[]):

int countCharsBuffer(File f, String charsetName) throws IOException {
    BufferedReader reader = new BufferedReader(new InputStreamReader(new FileInputStream(f), charsetName));
    int charCount = 0;
    char[] cbuf = new char[1024];
    int read = 0;
    while((read = reader.read(cbuf)) > -1) {
        charCount += read;
    }
    reader.close();
    return charCount;
}

Out of interest, I benchmarked these two and the NIO version suggested in Andrey's answer. I found the second example above (countCharsBuffer) to be the fastest.

(Note that all these examples include line separator characters in their counts.)
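The counts above are in Java chars, so a character stored as a surrogate pair (an emoji, for instance) is counted twice. If the goal is the number of Unicode code points instead, a streaming variant could look roughly like the sketch below. The class and method names are hypothetical, and it assumes Java 7 or later and well-formed input. Note that `Files.newBufferedReader` throws a `MalformedInputException` on bytes that are invalid for the given charset, rather than silently replacing them.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class CodePointCounter {

    // Streams the file and counts Unicode code points instead of UTF-16 chars.
    static long countCodePoints(Path file, Charset charset) throws IOException {
        long count = 0;
        char[] buf = new char[8192];
        try (BufferedReader reader = Files.newBufferedReader(file, charset)) {
            int read;
            while ((read = reader.read(buf)) != -1) {
                for (int i = 0; i < read; i++) {
                    // A low surrogate is the second half of a pair whose high
                    // surrogate was already counted, so skip it. This also
                    // works when a pair is split across two buffer reads.
                    if (!Character.isLowSurrogate(buf[i])) {
                        count++;
                    }
                }
            }
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("codepoints", ".txt");
        try {
            // 'a' (1 byte), U+1F600 (4 bytes, 2 chars), 'é' (2 bytes) in UTF-8.
            Files.write(file, "a\uD83D\uDE00\u00e9".getBytes(StandardCharsets.UTF_8));
            System.out.println("bytes: " + Files.size(file));                                   // bytes: 7
            System.out.println("code points: " + countCodePoints(file, StandardCharsets.UTF_8)); // code points: 3
        } finally {
            Files.deleteIfExists(file);
        }
    }
}
```

Like the examples above, this reads the file in fixed-size chunks, so memory use stays constant regardless of file size.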

ewan.chalmers
  • @sudocode Thanks. I had written similar code, but I was doubtful, so I wanted to check other options. Your comments really helped. – Sach Dec 22 '11 at 10:45