
I'd like to iterate through a text file one line at a time, operate on the contents, and stream the result to a separate file. Textbook case for BufferedReader.readLine().

But: I need to glue my lines together with newlines, and what if the original file didn't have the "right" newlines for my platform (DOS files on Linux or vice versa)? I guess I could read ahead a bit in the stream and see what kind of line endings I find, even though that's really hacky.

But: suppose my input file doesn't have a trailing newline. I'd like to keep things how they were. Now I need to peek ahead to the next line ending before reading every line. At this point why am I using a class that gives me readLine() at all?

This seems like it should be a solved problem. Is there a library (or even better, core Java7 class!) that will just let me call a method similar to readLine() that returns one line of text from a stream, with the EOL character(s) intact?

Coderer

3 Answers


Here's an implementation that reads char by char until it finds a line terminator. The reader passed in must support mark(), so if yours doesn't, wrap it in a BufferedReader.

public static String readLineWithTerm(Reader reader) throws IOException {
    if (!reader.markSupported()) {
        throw new IllegalArgumentException("reader must support mark()");
    }

    int code;
    StringBuilder line = new StringBuilder();

    // Append characters one at a time; stop once a full terminator has been consumed.
    while ((code = reader.read()) != -1) {
        char ch = (char) code;
        line.append(ch);

        if (ch == '\n') {
            // Unix-style '\n': the line is complete.
            break;
        } else if (ch == '\r') {
            // Either a bare '\r' or the start of a DOS "\r\n".
            // Peek at the next character and push it back if it isn't '\n'.
            reader.mark(1);
            int next = reader.read();

            if (next == '\n') {
                line.append('\n');
            } else if (next != -1) {
                reader.reset();
            }

            break;
        }
    }

    // Mirror BufferedReader.readLine(): null signals end of stream.
    return (line.length() == 0 ? null : line.toString());
}
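
For what it's worth, here is a minimal usage sketch built on the method above (the file names and the method name copyKeepingEol are placeholders), copying a stream to a file while leaving each line's original terminator attached:

// Copy input to output one line at a time, preserving line terminators.
public static void copyKeepingEol() throws IOException {
    try (BufferedReader in = new BufferedReader(new FileReader("in.txt"));
         Writer out = new FileWriter("out.txt")) {
        String line;
        while ((line = readLineWithTerm(in)) != null) {
            // transform the line here if needed; its '\n', '\r' or "\r\n" is still attached
            out.write(line);
        }
    }
}
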
Jesse Merriman
  • I think this is more or less the same as the implementation I wound up having to build. Still mystified that nobody else seems to need this! – Coderer Mar 19 '15 at 12:46

Update:

But: I need to glue my lines together with newlines, and what if the original file didn't have the "right" newlines for my platform (DOS files on Linux or vice versa)? I guess I could read ahead a bit in the stream and see what kind of line endings I find, even though that's really hacky.

You can create a BufferedReader with a specific charset, so if the file uses an unusual encoding you'll have to supply that charset yourself: Files.newBufferedReader(Path p, Charset cs)
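
A minimal sketch, assuming (purely for illustration) that the file is ISO-8859-1; the path and charset are placeholders:

// Files/Paths live in java.nio.file, StandardCharsets in java.nio.charset
BufferedReader br = Files.newBufferedReader(
        Paths.get("file.txt"), StandardCharsets.ISO_8859_1);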

Is there a library (or even better, core Java7 class!) that will just let me call a method similar to readLine() that returns one line of text from a stream, with the EOL character(s) intact?

If you're going to read a file, you have to know what charset it is. If you know what charset it is, then you don't need the EOL character to be "intact" since you can just add it on yourself.


From BufferedReader.readLine:

Reads a line of text. A line is considered to be terminated by any one of a line feed ('\n'), a carriage return ('\r'), or a carriage return followed immediately by a linefeed.

Returns: A String containing the contents of the line, not including any line-termination characters, or null if the end of the stream has been reached

So BufferedReader.readLine does not return any line-termination characters. If you want to preserve these characters, you can use the read method instead.

int size = 1000; // buffer size (not necessarily the size of the file)

BufferedReader br = new BufferedReader(new FileReader("file.txt"));
char[] buf = new char[size];
int read = br.read(buf, 0, size); // number of chars actually read, or -1 at end of stream
br.close();
That is just a simple example, but if the file contains line terminators they will show up in the buffer.
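
If the file is longer than one buffer, the usual pattern is to keep calling read until it returns -1. A rough sketch (the buffer size and file name are arbitrary, and this version collects the whole file in memory):

StringBuilder contents = new StringBuilder();
char[] buf = new char[8192];
int n;
try (Reader r = new FileReader("file.txt")) {
    while ((n = r.read(buf, 0, buf.length)) != -1) {
        contents.append(buf, 0, n); // '\r' and '\n' are copied through unchanged
    }
}
// contents now holds the whole file, line terminators included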

ktm5124
  • Maybe I need to clarify the OP, but I understand that the method in BufferedReader won't do what I need. What I meant is, maybe there's an Apache Commons library or something in Guava that is more flexible? Ted is on the right track (below) but I don't think I can twist the StreamTokenizer into returning whole lines as a token (though of course I'd love to be proven wrong). – Coderer Feb 26 '14 at 08:44
  • You need to re-read my post more carefully. I gave you a method in BufferedReader which **will** do what you need. You **absolutely** don't need a third-party library to read every character from a file. This is a basic I/O operation which is implemented in every language. – ktm5124 Feb 26 '14 at 16:50
  • What I meant is that the `read()` method is only part of the picture. I can fill a buffer, of course, but then I have to find the line-ending in it, load more data... now what happens when I hit the end of the buffer? I need to load more... but what if a line is over 1000 characters? Etc etc etc. Now I'm basically on the road to re-implementing the whole `readLine` logic myself. It's not that it's *so hard* or anything, I just don't want to have to discover all the edge cases for myself. That's why I keep asking for a library... – Coderer Feb 26 '14 at 16:59
  • I updated my post to answer some of your questions. But I still don't understand your problem exactly. This is a trivial I/O operation and it shouldn't require a 3rd-party library. – ktm5124 Feb 26 '14 at 17:08
  • 1
    The line ending has nothing to do with the character set. Both DOS/Windows (`\r\n`) line endings and Linux/Unix (`\n`) line endings are perfectly valid ASCII / UTF-8 / whatever. And the point is that I don't know going in what the *platform* (DOS vs Windows) of the file will be, and I'd like to preserve it. – Coderer Feb 27 '14 at 13:55

You should be using StreamTokenizer to get more detailed control over input parsing.

http://docs.oracle.com/javase/7/docs/api/java/io/StreamTokenizer.html
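
Roughly, a StreamTokenizer can be told to report line ends as tokens. A sketch (the file name is a placeholder; note that TT_EOL does not tell you whether the terminator was '\n' or "\r\n", which is the limitation discussed in the comments below):

StreamTokenizer st = new StreamTokenizer(
        new BufferedReader(new FileReader("file.txt")));
st.eolIsSignificant(true); // report end-of-line as TT_EOL instead of silently skipping it

int tok;
while ((tok = st.nextToken()) != StreamTokenizer.TT_EOF) {
    if (tok == StreamTokenizer.TT_EOL) {
        // a line just ended; the original terminator characters are not preserved
    } else if (tok == StreamTokenizer.TT_WORD) {
        String word = st.sval; // one "word" of the current line
    }
}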

Ted Bigham
  • It looks like I'd have to iterate through each "word" of the line, which is almost as much of a pain as just reading the thing in chunks like @ktm5124 suggested. I really want an interface that just gives me one line at a time, including endings. It looks like I may have to build my own... – Coderer Feb 26 '14 at 08:39
  • I think last time i did what you're doing, i ended up reading the whole file as a string, then using a StringTokenizer on it (which supports returning the delimiters). – Ted Bigham Feb 26 '14 at 08:45
  • I don't have that option right now -- it's not a file, it's an InputStream handed to me by another framework. I could read the whole stream into memory but I have no guarantee that it won't be multiple GB. I really need to work stream-wise if at all possible :( – Coderer Feb 26 '14 at 09:07
  • I suppose you could read in a "chunk" at a time, and run each piece through the StringTokenizer. That would be a little messy, but probably not too bad. – Ted Bigham Feb 26 '14 at 09:17
  • I could, but then I'd have to handle the case of not fitting a whole token/line in the "chunk", at which point I'm basically writing the original logic I'm asking for in the first place :( – Coderer Feb 26 '14 at 09:37