8

What would be the most efficient (fast and reliable enough) way in JavaScript to determine the type of line breaks used in a text: Unix vs. Windows?

In my Node app I have to read in large utf-8 text files and then process them based on whether they use Unix or Windows line breaks.

When the type of line break is uncertain, I want to decide based on which one is most likely.

UPDATE

The code I ended up using is in my own answer below.

vitaly-t

5 Answers

7

You would want to look for the first LF, with source.indexOf('\n'), and then check whether the character before it is a CR, with source[source.indexOf('\n') - 1] === '\r'. This way, you find only the first line break and classify the text by it. In summary:

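function whichLineEnding(source) {
    // Find the first LF; a CR immediately before it means CRLF.
    var temp = source.indexOf('\n');
    if (temp > 0 && source[temp - 1] === '\r')
        return 'CRLF';
    // Also the default when the text contains no LF at all.
    return 'LF';
}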

There are two fairly popular npm modules that do this: node-newline and crlf-helper. The first splits the entire string, which is very inefficient in your case. The second uses a regex, which in your case would not be quick enough.

However, going by your edit, if you want to determine which type is more plentiful, then I would use the code from node-newline, as it does handle that case.
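For the "which is more plentiful" case, a single pass with indexOf avoids both the split and the regex. This is only a minimal sketch and does not reproduce node-newline's code; the names countLineBreaks and dominantLineBreak, and the LF fallback on a tie, are assumptions made for illustration.

```javascript
// Hypothetical helper: counts CRLF and lone-LF line breaks in one pass.
function countLineBreaks(text) {
    let crlf = 0;
    let lf = 0;
    let i = text.indexOf('\n');
    while (i !== -1) {
        // An LF preceded by a CR is a Windows break; otherwise it is a Unix break.
        if (i > 0 && text[i - 1] === '\r') crlf++;
        else lf++;
        i = text.indexOf('\n', i + 1);
    }
    return { crlf, lf };
}

// Hypothetical helper: returns the more frequent style.
// Assumption: falls back to LF when the counts are equal (including zero).
function dominantLineBreak(text, fallback = '\n') {
    const { crlf, lf } = countLineBreaks(text);
    if (crlf === lf) return fallback;
    return crlf > lf ? '\r\n' : '\n';
}
```

Unlike the first-match check, this reads the whole string, so it costs more on large files but is not fooled by a single stray line break near the start.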

Ashley Medway
Sam-Graham
  • Thank you, but I was looking for something more reliable, so I ended up with one based on statistics; see my own answer. – vitaly-t May 14 '19 at 04:25
5

Thanks, @Sam-Graham. I tried to produce an optimized version. The function's output is also directly usable (see the example below):

function getLineBreakChar(string) {
    const indexOfLF = string.indexOf('\n', 1)  // No need to check first-character
    
    if (indexOfLF === -1) {
        if (string.indexOf('\r') !== -1) return '\r'
        
        return '\n'
    }
    
    if (string[indexOfLF - 1] === '\r') return '\r\n'
    
    return '\n'
}

Note 1: This assumes the string is consistent (it contains only one type of line break).

Note 2: This assumes you want LF as the default (when no line break is found).


Usage example:

fs.writeFileSync(filePath,
        string.substring(0, a) +
        getLineBreakChar(string) +
        string.substring(b)
);

This utility may be useful too:

const getLineBreakName = (lineBreakChar) =>
    lineBreakChar === '\n' ? 'LF' : lineBreakChar === '\r' ? 'CR' : 'CRLF'
Mir-Ismaili
  • In relation to the optimized approach, see the update I added to my answer. It does not get more efficient than that ;) – vitaly-t Nov 09 '19 at 17:40
2

In the end I used my own solution for this, based on simple statistics:

const {EOL} = require('os');

function getEOL(text) {
    const m = text.match(/\r\n|\n/g);
    const u = m && m.filter(a => a === '\n').length;
    const w = m && m.length - u;
    if (u === w) {
        return EOL; // use the OS default
    }
    return u > w ? '\n' : '\r\n';
}

When there are no line breaks, or the two counts happen to be equal, it returns the OS's default EOL.

UPDATE

Later on, I found through further practice that if you want to process text the same way regardless of whether it uses Unix or Windows line breaks, the most efficient approach is simply to replace every Windows line break with the Unix one and not bother with any detection at all:

text = text.replace(/\r\n/g, '\n'); // replace every \r\n with \n
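As a small sketch of how that normalization might feed line-based processing: the helper name normalizeLineBreaks is hypothetical, and treating a lone \r as a line break too is an extra assumption that goes beyond the one-liner above.

```javascript
// Hypothetical helper: converts CRLF to LF.
// Assumption: a lone CR (old Mac style) is also treated as a line break.
function normalizeLineBreaks(text) {
    return text.replace(/\r\n?/g, '\n');
}

// After normalizing, a plain split on '\n' works for any input style.
const lines = normalizeLineBreaks('one\r\ntwo\nthree\rfour').split('\n');
console.log(lines); // [ 'one', 'two', 'three', 'four' ]
```

If lone CR characters should be preserved as data, the original /\r\n/g pattern is the one to keep.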
vitaly-t
1

This is how ESLint's linebreak-style rule detects line endings in JavaScript files. Here, source means the actual file content.

Note: Files can sometimes contain mixed line endings.

https://github.com/eslint/eslint/blob/master/lib/rules/linebreak-style.js
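To illustrate the note about mixed line endings, here is a minimal sketch of one way to check for them. It is not the code of the ESLint rule linked above; the function name hasMixedLineBreaks is hypothetical, and only \r\n and \n are considered.

```javascript
// Hypothetical helper: true when the text contains both CRLF and lone-LF breaks.
function hasMixedLineBreaks(source) {
    // \r\n is listed first so a Windows break is never counted as a lone LF.
    const breaks = source.match(/\r\n|\n/g) || [];
    return new Set(breaks).size > 1;
}
```

A check like this can decide whether a first-match detector is safe to trust or whether a count-based one is needed.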

Gyandeep
  • If the license is compatible, please copy the relevant code to the answer. If that link breaks, this answer would become relatively useless. Also, check your spelling :). – Heretic Monkey Jan 15 '16 at 21:58
1

Try this

// A lone \r is not a Windows line break, so test only for \r\n.
if (text.search(/\r\n/) > -1) {
   alert('Windows');
} else if (text.search(/\n/) > -1) {
   alert('Unix');
} else {
   alert('No line breaks found');
}
nigelheap