Handling line separators in shared reader library

Question

I'm writing a PeekReader class that I plan to share as a publicly available library. The question is how to handle line separators correctly.

I plan to create a filter that converts the line separators into new lines.

/* Not actual code, just to let people visualize the problem, 
doesn't need to be corrected */
String seq = readEOL();
return switch(seq) {
  case "\n" -> "\n"; // standard new line
  case "\r" -> "\n"; // Carriage return
  case "\f" -> "\n"; // Form feed
  case "\u000B" -> "\n"; // Vertical tabulation (Java has no "\v" escape)
  case "\r\n" -> "\n"; // New line in Microsoft
  case "\n\r" -> "\n"; // New line in RISC OS
  case "\u0015" -> "\n"; // NL used by IBM mainframes
  case "\u001E" -> "\n"; // RS used by QNX as line separator
  case "\u0085" -> "\n"; // Unicode next line
  case "\u2028" -> "\n"; // Unicode line separator
  case "\u2029" -> "\n"; // Unicode paragraph separator
  default -> seq;  
};

The question is: how should I handle different line separators or similar sequences? Is it correct to filter them and turn them into new lines, or should I leave them as they are and just manage them when readLine() is called? I'm not asking for opinions. I need to know the correct practice to be compliant when subclassing the Reader class. I listed all the characters and sequences I know of to understand which ones I actually need to treat as new lines and which ones I should change to "\n" (if any at all).

I already have my own private version of this, so I don't need suggestions on how to write it. What I'm asking is how to write a function that meets the standards for sharing with the public.

Also, if you have some links about building libraries to share with the public, please feel free to post them. I'm really interested.

Addendum: PeekReader extends Reader and is intended to improve on or replace BufferedReader, especially when tokenizing or parsing streams. It provides a peek() function that lets you preview the characters in the buffer without consuming them.
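
To make the peek() idea concrete, here is a minimal sketch of one way to get a non-consuming look at the next character, built on the standard PushbackReader. The class name PeekSketch is hypothetical and this is not the asker's PeekReader; it only illustrates the read-then-unread approach under the assumption that a single character of lookahead is enough.

```java
import java.io.IOException;
import java.io.PushbackReader;
import java.io.Reader;

/** Hypothetical illustration only; not the asker's PeekReader. */
class PeekSketch extends PushbackReader {

    PeekSketch(Reader source) {
        super(source, 1); // one character of pushback is enough for a single-char peek
    }

    /** Returns the next character without consuming it, or -1 at end of stream. */
    int peek() throws IOException {
        int c = read();
        if (c != -1) {
            unread(c); // put the character back so the next read() sees it again
        }
        return c;
    }
}
```

With this sketch, calling peek() repeatedly keeps returning the same character until read() consumes it. A reader that keeps its own buffer, as the addendum describes, could answer peek() by indexing into that buffer instead of reading and pushing back.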

I can't answer your question but it would probably be worth looking at the source of `BufferedReader` if you haven't already. It doesn't attempt to be nearly so comprehensive about its line separators ;) — g00se, Sep 03 '22 at 14:43
I already have. BufferedReader doesn't qualify as an answer, because it only considers the single characters `'\n'` and `'\r'`. If you have a file that uses the sequence `"\n\r"`, BufferedReader will split it into two separate lines, or, at least, that's my understanding of that code. — Luca Scarcia, Sep 03 '22 at 14:46
That might be the case but then you could argue that file is corrupt. Will you argue that it's *not*? ;) — g00se, Sep 03 '22 at 14:55
@Abra: this is certainly the case. But what if you are importing files from other platforms? I actually have to work with old files written with MS-DOS that use `\r\n`, and they can become messy. That's why I'm writing something that can trap a lot of new line styles. — Luca Scarcia, Sep 03 '22 at 14:57
fwiw I'm not aware of a single line separator ever having been "\n\r" — g00se, Sep 03 '22 at 14:59
@g00se: neither was I before doing some research ;) But it's actually a thing in Risk OS, it seems. I didn't know the Unicode ones either. But they have been encoded, thus there's a need to handle them. The big problem is whether I should trap them or leave them exposed for the user to manage. Also, the solution must be something that the end user will expect. — Luca Scarcia, Sep 03 '22 at 15:04
Don't you just need to set the correct encoding for the file you are reading? — Abra, Sep 03 '22 at 15:20
@Abra perhaps that would work for non ASCII, non UNICODE systems. But line terminators shouldn't have anything to do with encoding. But it's worth looking into this a little more. I'll tell you what I found. — Luca Scarcia, Sep 03 '22 at 15:24
@Abra: it doesn't seem related as I supposed. New lines are a document format standard, not a character format standard. Perhaps, if I use a CharsetDecoder to get a document from a non ASCII/non UNICODE system (like Atari or Sinclair), it would convert them to \n. But, if you remain in the ASCII, UNICODE domain, you need to check manually :( — Luca Scarcia, Sep 03 '22 at 15:49
I'm a bit dim here as I'm not sure what `PeekReader` is and whether you wish to implement a super version of `BufferedReader.readLine`. Could you edit the question with a bit more detail to make it clear what you wish to achieve? As it is now, you've asked several questions but also say "I'm not asking opinions. I need to know ..." which might put off people from trying to form an answer to this question. — DuncG, Sep 04 '22 at 09:39
@DuncG: a PeekReader is a Reader aimed for parsing. It has a peek function (instead of unread from PushbackReader) and a couple more utilities. And, of course, it has readLine, and the version readUntilEOL that doesn't consume the EOL sequence. But, in general, you can consider it a more powerful version of BufferedReader, yes. — Luca Scarcia, Sep 04 '22 at 12:31
By "Risk OS" do you actually mean "RISC OS"? (Note spelling and capitalization differences!) — Stephen C, Sep 04 '22 at 12:43
Note that RISC OS >supports< "\n\r" as a line separator, but the default is "\n" ... according to the RISC OS 5.28 manual (page 491) (https://www.riscosopen.org/zipfiles/platform/common/UserGuide.5.28.pdf) — Stephen C, Sep 04 '22 at 12:55
Yes, @StephenC, I'll correct right now. Thanks for the info. — Luca Scarcia, Sep 04 '22 at 13:05
Erm ... I found evidence that the RISC OS was still using LFCR in some of the files that they shipped in their ROMs as late as 5.24. See https://www.riscosopen.org/content/downloads/summary-rom-5-28/iomd-5-28 — Stephen C, Sep 04 '22 at 13:09
Well, managing \n\r after managing \r\n is of little concern. The problem lies in whether I should filter those sequences and show them only as \n or if I should let the user decide how to handle them. — Luca Scarcia, Sep 04 '22 at 13:12
You’re explicitly asking for an opinion—there’s no particular way it “should” be written. If you do any conversion of newlines the onus is on the user to write out the correct newline, if you don’t the output won’t be compatible with other file processing pipelines. *My* preference would be to tell your code what a newline is, and a flag to indicate any conversion. — Dave Newton, Sep 04 '22 at 13:13
Well, you pointed out a problem that is not an opinion, but a real problem (the fact that there's already software using BufferedReader that expects a certain behaviour), meaning that there's a definite solution (not filtering them). Adding a flag to activate/deactivate the filtering is really simple. But I don't understand why I should specify the new line sequence. You could only know it if you knew beforehand what the file format is. And that's not the behaviour of BufferedReader either. — Luca Scarcia, Sep 04 '22 at 13:18
@DaveNewton: this also adds another interesting possibility. I could associate a filtering function while extracting the characters (not just the EOL sequences). But that would be a little too complex. — Luca Scarcia, Sep 04 '22 at 13:30
@LucaScarcia If you don’t know beforehand you already have a problem—there’s no automagic way to definitively differentiate between a “newline” and a legitimate byte sequence that doesn’t mean “newline” in one file format but may in another. You could guess, but it’s just that—a guess. — Dave Newton, Sep 04 '22 at 14:21
@DaveNewton You are making a good point. That's why I accepted your suggestion and I added the setFilterEOLSequences flag and the readUntilEOL function that doesn't consume the EOL like readLine does. Still, an EOL sequence is a Control Sequence in a text file. We're not talking about binary files here, but text files, and the meaning of a Control Sequence is usually precodified. For example, U+001E is the ISO control for Record Separator (RS), and that is why it was used as a line separator character. In either case it means that the line has ended. Still, disabling the filtering should be enough. — Luca Scarcia, Sep 04 '22 at 14:38
@LucaScarcia And since it’s precodified, you know what it is. Bear in mind that EOL can vary even within the same file: a carriage return is distinctly different from a line feed, and this is how games were played with printouts early on. A CR could be used to overstrike an already-printed line, and LFs were used to move the platen w/o homing the typehead (think dot matrix/ball head printers). General-purpose EOL detection is non-trivial and almost always requires foreknowledge. — Dave Newton, Sep 04 '22 at 14:48
Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/247775/discussion-between-luca-scarcia-and-dave-newton). — Luca Scarcia, Sep 04 '22 at 15:04
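
The flag-based approach discussed in the comments can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the asker's implementation: the names EolNormalizingReader, filterEol and EolDemo are hypothetical, and the set of sequences mapped to "\n" (CR, CR LF, U+0085, U+2028, U+2029) is only one possible choice. With the flag off, every character passes through untouched, so code layered on top sees the same input that a plain BufferedReader would.

```java
import java.io.BufferedReader;
import java.io.FilterReader;
import java.io.IOException;
import java.io.PushbackReader;
import java.io.Reader;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch, not the asker's PeekReader: optional EOL normalization over any Reader. */
class EolNormalizingReader extends FilterReader {
    private final boolean filterEol; // false = every character passes through untouched

    EolNormalizingReader(Reader source, boolean filterEol) {
        super(new PushbackReader(source, 1)); // one char of lookahead to recognize CR LF
        this.filterEol = filterEol;
    }

    @Override
    public int read() throws IOException {
        int c = in.read();
        if (!filterEol) {
            return c;
        }
        switch (c) {
            case '\r':   // CR alone, or CR LF collapsed into one separator
                dropFollowingLineFeed();
                return '\n';
            case 0x0085: // Unicode next line (NEL)
            case 0x2028: // Unicode line separator
            case 0x2029: // Unicode paragraph separator
                return '\n';
            default:     // includes '\n' itself and -1 (end of stream)
                return c;
        }
    }

    private void dropFollowingLineFeed() throws IOException {
        PushbackReader pushback = (PushbackReader) in;
        int next = pushback.read();
        if (next != -1 && next != '\n') {
            pushback.unread(next); // not part of a CR LF pair, so give it back
        }
    }

    @Override
    public int read(char[] buffer, int offset, int length) throws IOException {
        if (length == 0) {
            return 0;
        }
        int count = 0;
        while (count < length) { // simple per-character loop; reads until full or end of stream
            int c = read();
            if (c == -1) {
                break;
            }
            buffer[offset + count++] = (char) c;
        }
        return count == 0 ? -1 : count;
    }
}

/** Hypothetical helper that collects lines the way BufferedReader.readLine() sees them. */
class EolDemo {
    static List<String> lines(Reader reader) throws IOException {
        List<String> result = new ArrayList<>();
        BufferedReader buffered = new BufferedReader(reader);
        for (String line = buffered.readLine(); line != null; line = buffered.readLine()) {
            result.add(line);
        }
        return result;
    }
}
```

The sketch deliberately does not treat "\n\r" as a single separator, because whether that pair means one line break or two depends on foreknowledge of the file format. For the input "a\n\rb", BufferedReader.readLine() returns "a", an empty line, and "b", and the sketch leaves that outcome unchanged. It also does not override skip(), so skipping counts raw characters of the underlying stream rather than normalized ones.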

0 Answers