How could I read the last few chars of a file with the most disk efficiency?

Lucas Noetzold
  • *"Is it possible?"* Yes! --- [Why is “Is it possible to…” a poorly worded question?](https://softwareengineering.meta.stackexchange.com/q/7273/202153) – Andreas Jul 15 '18 at 05:45
  • @Andreas: No, actually it is not possible. You can only read data sequentially, and this has nothing to do with Java but rather is an OS restriction. – Hovercraft Full Of Eels Jul 15 '18 at 05:46
  • @HovercraftFullOfEels Sure it is possible. The question is about *"doing so with more disk efficiency"*, and you can do it by random-access reading the last block and processing those bytes *(in reverse)*, then reading the second-last block and processing its bytes, and so on, until finally reading the first block and processing its bytes. A larger block size will improve efficiency. You can even make the characters available as a `Reader`, if you so choose. – Andreas Jul 15 '18 at 05:49
  • *random-access*, that's what I'll be googling for, thanks – Lucas Noetzold Jul 15 '18 at 05:51
  • Now, you can't make the reversed file content available as an `InputStream` and wrap that in a `Reader`, because the bytes are in reverse order, so any multi-byte encoding (e.g. UTF-8) would fail. – Andreas Jul 15 '18 at 05:52
  • Then if my intent is to read only the last 4 chars of the file, for a fairly small file (1-5 KB), would you **guess** that it would be faster to do this and reverse the bytes? I have tens of thousands of files to go. – Lucas Noetzold Jul 15 '18 at 05:56
  • See [RandomAccessFile](https://docs.oracle.com/javase/7/docs/api/java/io/RandomAccessFile.html) – lance-java Jul 15 '18 at 05:57
  • @LucasNoetzold If your text file is using a single-byte character set (e.g. not UTF-8), then "last 4 chars" means "last 4 bytes", so then yes, using `RandomAccessFile` to only read the last 4 bytes would *probably* be the fastest way. Only profiling can say for sure. – Andreas Jul 15 '18 at 06:00
  • With a character set like UTF-8, "last 4 chars" would mean between 4 and 16 bytes, so you could read last 16, then analyze those to find out how many are actually needed for last 4 chars. – Andreas Jul 15 '18 at 06:04
  • Yep, it's UTF-8. From what I read, the number of bytes each char uses can vary, which adds some complexity to the case. Thanks for the tips @Andreas. – Lucas Noetzold Jul 15 '18 at 06:06
  • For Unicode text, you should explore what you mean by "character". ["Grapheme cluster boundaries are important for collation, regular expressions, UI interactions, segmentation for vertical text, identification of boundaries for first-letter styling, and counting “character” positions within text."](http://unicode.org/reports/tr29/) See Java's `getCharacterInstance` in the [BreakIterator Class](https://docs.oracle.com/javase/tutorial/i18n/text/about.html) – Tom Blodget Jul 15 '18 at 13:57
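
To make the last comment concrete, here is a minimal sketch of counting user-perceived characters with `BreakIterator.getCharacterInstance`. It works on text that has already been decoded to a `String`; the class and method names (`TailGraphemes`, `lastGraphemes`) are invented for illustration. How many bytes must be read from the file to cover N such characters is not bounded the way it is for `char` values, and the treatment of emoji sequences may differ between JDK versions, so treat this as an assumption-laden example rather than a complete solution.

```java
import java.text.BreakIterator;
import java.util.Locale;

class TailGraphemes {

    /** Hypothetical helper: returns the last n user-perceived characters of text. */
    static String lastGraphemes(String text, int n) {
        BreakIterator it = BreakIterator.getCharacterInstance(Locale.ROOT);
        it.setText(text);
        int pos = it.last();              // boundary at the end of the text
        for (int i = 0; i < n; i++) {
            int prev = it.previous();     // step back one character boundary
            if (prev == BreakIterator.DONE) {
                break;                    // fewer than n characters available
            }
            pos = prev;
        }
        return text.substring(pos);
    }

    public static void main(String[] args) {
        // 'e' followed by U+0301 (combining acute accent) is one user-perceived
        // character stored as two char values.
        String tail = lastGraphemes("abce\u0301", 2);
        System.out.println(tail.length() + " chars in the last 2 characters");
    }
}
```

The point of the example is that the returned `String` can hold more `char` values than the number of characters requested, which is why "last 4 chars" needs a precise definition before choosing how many bytes to read.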

1 Answer

Here is a method for reading the last N characters from a UTF-8 encoded text file.

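import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;

/**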
 * Reads last {@code length} characters from UTF-8 encoded text file.
 * <p>
 * The returned string may be shorter than requested if file is too
 * short, if the leading character is a half surrogate-pair, or if
 * file has invalid UTF-8 byte sequences.
 * 
 * @param fileName Name of text file to read.
 * @param length Length of string to return.
 * @return String with up to {@code length} characters.
 * @throws IOException if an I/O error occurs.
 */
public static String readLastChars(String fileName, int length) throws IOException {
    // A char can only store characters in the Basic Multilingual Plane, which are
    // encoded in UTF-8 using up to 3 bytes each. A character from a Supplementary
    // Plane is encoded using 4 bytes, and is stored in Java as a surrogate pair,
    // i.e. 2 chars. Worst case (assuming valid UTF-8) is that the window starts with
    // a 4-byte sequence of which only one surrogate is needed, followed by length-1
    // 3-byte sequences, so we need to read 3 * length + 1 bytes.
    byte[] buf;
    try (RandomAccessFile file = new RandomAccessFile(fileName, "r")) {
        int bytesToRead = length * 3 + 1;
        buf = new byte[bytesToRead <= file.length() ? bytesToRead : (int) file.length()];
        file.seek(file.length() - buf.length);
        file.readFully(buf);
    }
    // Scan bytes backwards past 'length' characters
    int start = buf.length;
    for (int i = 0; i < length && start > 0; i++) {
        if (buf[--start] < 0) { // not ASCII
            // Locate start of UTF-8 byte sequence (at most 4 bytes)
            int minStart = (start > 3 ? start - 3 : 0);
            while (start > minStart && (buf[start] & 0xC0) == 0x80)
                start--; // Skip UTF-8 continuation byte
            if (start == minStart)
                i++; // 4-byte UTF-8 -> 2 surrogate chars
        }
    }
    // Create string from bytes, and skip first character if too long
    // (text starts with surrogate pair, assuming valid UTF-8)
    String text = new String(buf, start, buf.length - start, StandardCharsets.UTF_8);
    while (text.length() > length)
        text = text.substring(text.offsetByCodePoints(0, 1));
    return text;
}
Andreas
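
The method above counts `char` values (UTF-16 code units). As a variation that is not part of the answer, the following sketch returns the last N code points instead, using the approach suggested in the comments: over-read 4 bytes per code point, skip any continuation bytes at the start of the window, decode, and trim. The names `TailCodePoints` and `lastCodePoints` are invented for illustration, and the sketch assumes valid UTF-8 and a modest `n`.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class TailCodePoints {

    /** Hypothetical helper: returns up to n trailing code points of a UTF-8 file. */
    static String lastCodePoints(Path path, int n) throws IOException {
        byte[] buf;
        try (RandomAccessFile file = new RandomAccessFile(path.toFile(), "r")) {
            long size = file.length();
            // A UTF-8 code point occupies at most 4 bytes, so 4 * n bytes suffice.
            buf = new byte[(int) Math.min(size, 4L * n)];
            file.seek(size - buf.length);
            file.readFully(buf);
        }
        // The window may begin mid-sequence: skip continuation bytes (10xxxxxx).
        int start = 0;
        while (start < buf.length && (buf[start] & 0xC0) == 0x80) {
            start++;
        }
        String text = new String(buf, start, buf.length - start, StandardCharsets.UTF_8);
        int count = text.codePointCount(0, text.length());
        // Drop surplus code points from the front if the window held more than n.
        return count <= n ? text : text.substring(text.offsetByCodePoints(0, count - n));
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("tail", ".txt");
        try {
            // 'a' (1 byte), U+00E9 (2 bytes), U+20AC (3 bytes), U+1F600 (4 bytes)
            Files.write(tmp, "a\u00e9\u20ac\uD83D\uDE00".getBytes(StandardCharsets.UTF_8));
            String tail = lastCodePoints(tmp, 2);
            System.out.println(tail.codePointCount(0, tail.length()) + " code points, "
                    + tail.length() + " chars");
        } finally {
            Files.deleteIfExists(tmp);
        }
    }
}
```

Whether this is actually faster than reading a small file whole is something only profiling can settle, as noted in the comments.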
  • I suggest that you replace "character" with "codepoint", unless you mean that `length` is the number of `char` (UTF-16 code units)—which then gets into the general problem of indexing into Java's UTF-16 String. Also, if the file has a BOM, it should not be included in the returned text—since it's metadata, not text. Otherwise, great answer. – Tom Blodget Jul 15 '18 at 20:07
  • @TomBlodget The `length` parameter *is* the number of `char`, i.e. the desired `length()` of the returned `String`, which is why the javadoc says *"The returned string **may be shorter** than requested [...] if the leading character is a **half surrogate-pair**"*. E.g. if the file is all emojis, and you ask for 9 chars, it returns 8 chars (4 surrogate pairs), i.e. 4 code points, as designed. --- The Java Runtime Library doesn't support BOM for UTF-8, i.e. all Java methods will return the BOM as a regular character. --- No reason for this method to deviate from either Java standard. – Andreas Jul 16 '18 at 00:20
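
Both points in this exchange can be checked on plain strings. The sketch below is illustrative only: `trimToLength` mirrors the final trimming loop of the answer, while `stripBom` is a hypothetical helper for callers who do want the BOM removed. The class name `CharCountDemo` is likewise invented.

```java
import java.nio.charset.StandardCharsets;

class CharCountDemo {

    /** Mirrors the answer's trimming step: drop leading code points until the String fits. */
    static String trimToLength(String text, int length) {
        while (text.length() > length) {
            text = text.substring(text.offsetByCodePoints(0, 1));
        }
        return text;
    }

    /** Hypothetical helper: removes a leading U+FEFF, which the UTF-8 decoder leaves in place. */
    static String stripBom(String text) {
        return text.startsWith("\uFEFF") ? text.substring(1) : text;
    }

    public static void main(String[] args) {
        // Five emojis: 5 code points stored as 10 char values (5 surrogate pairs).
        String pair = "\uD83D\uDE00";
        String trimmed = trimToLength(pair + pair + pair + pair + pair, 9);
        // Asking for 9 chars would split a pair, so a whole pair is dropped.
        System.out.println(trimmed.length() + " chars, "
                + trimmed.codePointCount(0, trimmed.length()) + " code points");
        // prints "8 chars, 4 code points"

        // UTF-8 BOM (EF BB BF) followed by "hi": the decoder keeps it as U+FEFF.
        byte[] bytes = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'h', 'i'};
        String decoded = new String(bytes, StandardCharsets.UTF_8);
        System.out.println(decoded.length());   // prints 3
        System.out.println(stripBom(decoded));  // prints "hi"
    }
}
```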