Detecting type of line breaks

Question

What would be the most efficient (fast and reliable enough) way in JavaScript to determine the type of line breaks used in a text: Unix or Windows?

In my Node app I have to read in large utf-8 text files and then process them based on whether they use Unix or Windows line breaks.

When the type of line breaks is uncertain, I want to decide based on which one is most likely.

UPDATE

The code I ended up using is in my own answer below.

What have you tried so far? what does your input look like? raw text? rich text? — Kyle Falconer, Jan 15 '16 at 21:40
Tried some sort of ugly head-on scanning so far. Files are large raw text ones. — vitaly-t, Jan 15 '16 at 21:42
@Sam-Graham Only if the only line break is at the end of the file. It will stop after finding the first instance. — Mike Cluck, Jan 15 '16 at 22:02
Unless there are not any '\r'. It would only happen with Unix line endings. But it could happen, and hang the app if the file is large enough. If you first search for '\n' and then see if there is a '\r' behind it, that should be the most efficient, right? — Sam-Graham, Jan 15 '16 at 22:13

score 7 · Answer 1 · edited Apr 13 '19 at 13:38

You would want to look first for an LF, like source.indexOf('\n'), and then check whether the character before it is a CR, like source[source.indexOf('\n') - 1] === '\r'. This way, you just find the first newline and classify the text by it. In summary:

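function whichLineEnding(source) {
     // Position of the first LF, or -1 if there is none.
     var temp = source.indexOf('\n');
     // A CR right before that LF means Windows line endings.
     if (source[temp - 1] === '\r')
         return 'CRLF';
     // Also the result when the text has no line break at all.
     return 'LF';
}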

There are two fairly popular npm modules that do this: node-newline and crlf-helper. The first does a split on the entire string, which is very inefficient in your case. The second uses a regex, which in your case would not be quick enough.

However, from your edit, if you want to determine which is more plentiful, then I would use the code from node-newline, as it does handle that case.

Thank you, but I was looking for something more reliable, so I ended with one based on statistics - see my own answer. — vitaly-t, May 14 '19 at 04:25

Mir-Ismaili · Answer 2 · 2022-03-15T11:09:00.827

Thanks, @Sam-Graham. I tried to produce an optimized version. Also, the output of the function is directly usable (see the example below):

function getLineBreakChar(string) {
    const indexOfLF = string.indexOf('\n', 1)  // No need to check first-character
    
    if (indexOfLF === -1) {
        if (string.indexOf('\r') !== -1) return '\r'
        
        return '\n'
    }
    
    if (string[indexOfLF - 1] === '\r') return '\r\n'
    
    return '\n'
}

Note 1: This assumes the string is healthy (contains only one type of line break).

Note 2: This assumes you want LF to be the default (when no line break is found).

Usage example:

fs.writeFileSync(filePath,
        string.substring(0, a) +
        getLineBreakChar(string) +
        string.substring(b)
);

This utility may be useful too:

const getLineBreakName = (lineBreakChar) =>
    lineBreakChar === '\n' ? 'LF' : lineBreakChar === '\r' ? 'CR' : 'CRLF'

In relation to the optimized approach, see the update I added to my answer. It does not get more efficient than that ;) — vitaly-t, Nov 09 '19 at 17:40

vitaly-t · Accepted Answer · 2021-01-23T01:44:57.840

In the end I used my own solution for this, based on simple statistics:

const {EOL} = require('os');

function getEOL(text) {
    const m = text.match(/\r\n|\n/g);
    const u = m && m.filter(a => a === '\n').length;
    const w = m && m.length - u;
    if (u === w) {
        return EOL; // use the OS default
    }
    return u > w ? '\n' : '\r\n';
}

When there are no line breaks, or when the two counts happen to be equal, it returns the OS's default EOL.
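
For very large inputs, the same counting idea can be written without building an array of matches. The following is only a sketch, not something measured against the regex version; countLineBreaks and getEOLByCount are hypothetical names, and the fallback parameter stands in for the OS default used above.

```javascript
// Hypothetical helper: counts LF-only and CRLF breaks in one pass.
function countLineBreaks(text) {
    let lf = 0;
    let crlf = 0;
    let i = text.indexOf('\n');
    while (i !== -1) {
        // An LF preceded by a CR is counted as one CRLF break.
        if (i > 0 && text[i - 1] === '\r') {
            crlf++;
        } else {
            lf++;
        }
        i = text.indexOf('\n', i + 1);
    }
    return {lf, crlf};
}

// Hypothetical helper: picks the more frequent style.
function getEOLByCount(text, fallback = '\n') {
    const {lf, crlf} = countLineBreaks(text);
    if (lf === crlf) {
        return fallback; // no line breaks at all, or a tie
    }
    return lf > crlf ? '\n' : '\r\n';
}
```

Passing require('os').EOL as the fallback would reproduce the tie-breaking behavior of getEOL.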

UPDATE

Later on I found out through further practice that if you want to process text in the same way, regardless of whether it uses Unix or Windows line breaks, then the most efficient approach is to simply replace every Windows line break with the Unix one, and not bother with any detection at all:

text = text.replace(/\r\n/g, '\n'); // replace every \r\n with \n
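
The replacement above leaves a lone \r (as used by some very old files) untouched. If that case matters, a slightly broader variant could look like the sketch below; normalizeEOL is a hypothetical name, and the optional second argument is an assumption added so the text can be converted back before writing.

```javascript
// Hypothetical helper: rewrites every line break to a single style.
function normalizeEOL(text, eol = '\n') {
    // '\r\n' is listed first so a CRLF pair is replaced once, not twice.
    return text.replace(/\r\n|\r|\n/g, eol);
}

const unixText = normalizeEOL('a\r\nb\rc\nd');   // 'a\nb\nc\nd'
const winText = normalizeEOL(unixText, '\r\n');  // 'a\r\nb\r\nc\r\nd'
```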

Gyandeep · Answer 4 · 2016-01-15T21:59:35.910

This is how we detect line endings in JavaScript files using an ESLint rule. Here, source means the actual file content.

Note: Sometimes you can also have files with mixed line endings.

https://github.com/eslint/eslint/blob/master/lib/rules/linebreak-style.js
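
In case the link goes stale, here is a rough sketch of the general idea of checking each line break against an expected style. It is not the code of the ESLint rule; findUnexpectedLineBreaks is a hypothetical helper written for illustration, and it also covers the mixed-endings case mentioned in the note.

```javascript
// Hypothetical helper, not taken from ESLint: returns the 1-based numbers
// of the lines whose terminating line break differs from the expected one.
function findUnexpectedLineBreaks(source, expected = '\n') {
    const pattern = /\r\n|\n/g;
    const offending = [];
    let line = 1;
    let match;
    while ((match = pattern.exec(source)) !== null) {
        if (match[0] !== expected) {
            offending.push(line);
        }
        line++;
    }
    return offending;
}
```

An empty result means every line break matches the expected style; a non-empty one means the file is mixed or uses the other style throughout.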

edited Jan 15 '16 at 21:59

answered Jan 15 '16 at 21:46

If the license is compatible, please copy the relevant code to the answer. If that link breaks, this answer would become relatively useless. Also, check your spelling :). – Heretic Monkey Jan 15 '16 at 21:58

nigelheap · Answer 5 · 2016-01-15T21:56:30.223

Try this:

if (/\r\n/.test(text)) {
   console.log('Windows');  // at least one CRLF pair was found
} else if (/\n/.test(text)) {
   console.log('Unix');     // LF found, but no CRLF
} else {
   console.log('No line breaks found');
}

edited Jan 15 '16 at 21:56

answered Jan 15 '16 at 21:54

When both are found, I want to conclude based on which ones found more. – vitaly-t Jan 15 '16 at 21:56
