
I would like to determine the tab width used in source files indented with spaces. This is not hard for files with particularly regular indentation, where the leading spaces are used only for indentation, always in multiples of the tab width, and with indentation increasing one level at a time. But many files will have some departure from this sort of regular indentation, generally for some form of vertical alignment. I'm thus looking for a good heuristic to estimate what tab width was used, allowing some possibility for irregular indentation.

The motivation for this is writing an extension for the SubEthaEdit editor. SubEthaEdit unfortunately doesn't make the tab width available for scripting, so I'm going to guess at it based on the text.

A suitable heuristic should:

  • Perform well enough for interactive use. I don't imagine this will be a problem, and just a portion of the text can be used if need be.
  • Be language independent.
  • Return the largest suitable tab width. For example, any file with a tab width of four spaces could also be a file with two-space tabs, if every indentation were actually by twice as many levels. Clearly, four spaces would be the right choice.
  • Always get it right if the indentation is completely regular.

Some simplifying factors:

  • At least one line can be assumed to be indented.
  • The tab width can be assumed to be at least two spaces.
  • It's safe to assume that indentation is done with spaces only. It's not that I have anything against tabs---quite the contrary, I'll check first if there are any tabs used for indentation and handle it separately. This does mean that indentation mixing tabs and spaces might not be handled properly, but I don't consider it important.
  • It may be assumed that there are no lines containing only whitespace.
  • Not all languages need to be handled correctly. For example, success or failure with languages like Lisp and Go would be completely irrelevant, since they're not normally indented by hand.
  • Perfection is not required. The world isn't going to end if a few lines occasionally need to be manually adjusted.

What approach would you take, and what do you see as its advantages and disadvantages?

If you want to provide working code in your answer, the best approach is probably to use a shell script that reads the source file from stdin and writes the tab width to stdout. Pseudocode or a clear description in words would be just fine, too.

Some Results

To test the different strategies, I applied them to files in the standard libraries for language distributions, as these presumably follow the standard indentation for the language. I'll consider the Python 2.7 and Ruby 1.8 libraries (system framework installs on Mac OS X 10.7), which have expected tab widths of 4 and 2, respectively. Excluded are those files which have lines beginning with tab characters or which have no lines beginning with at least two spaces.

Python:

                     Right  None  Wrong
Mode:                 2523     1    102
First:                2169     1    456
No-long (12):         2529     9     88
No-long (8):          2535    16     75
LR (changes):         2509     1    116
LR (indent):          1533     1   1092
Doublecheck (10):     2480    15    130
Doublecheck (20):     2509    15    101

Ruby:

                     Right  None  Wrong
Mode:                  594    29     51
First:                 578     0     54
No-long (12):          595    29     50
No-long (8):           597    29     48
LR (changes):          585     0     47
LR (indent):           496     0    136
Doublecheck (10):      610     0     22
Doublecheck (20):      609     0     23

In these tables, "Right" should be taken as determination of the language-standard tab width, "Wrong" as a non-zero tab width not equal to the language-standard width, and "None" as a zero tab width or no answer. The strategies are:

  • "Mode": selecting the most frequently occurring change in indentation.
  • "First": taking the indentation of the first indented line.
  • "No-long": FastAl's strategy of excluding lines with large indentation and taking the mode, with the number indicating the maximum allowed indent change.
  • "LR": Patrick87's strategy based on linear regression, with variants based on the change in indentation between lines and on the absolute indentation of lines.
  • "Doublecheck" (couldn't resist the pun!): Mark's modification of FastAl's strategy, restricting the possible tab width and checking whether half the modal value also occurs frequently, with two different thresholds for selecting the smaller width.

JasonMArcher
Michael J. Barber

7 Answers

3

Okay, as you want a language-agnostic solution, we won't be able to use any syntactical hints. Although you said that you don't want a perfect solution, here is one that works very well with most languages.

I actually had to solve a similar issue in cryptography: determining the correct code word length in a polyalphabetic cipher. This kind of encryption is a basic Caesar cipher (each letter of the alphabet is shifted by n letters), where the cryptword is used to shift the letters differently (the nth letter of the cleartext is shifted by the (n mod length(cryptword))th letter of the cryptword). The weapon of choice is autocorrelation.

The algorithm would be like this:

  1. Strip all characters after the whitespace at the beginning of a line has ended, leaving the line-end markers intact.
  2. Remove lines with zero leading whitespace (they carry no indentation information).
  3. Count the whitespace width for each line and save this in an array lengths.
  4. Autocorrelation: loop an offset i from 1 up to some estimated maximum, which may be fairly high, like 32. For each offset, compare each entry of lengths with the entry i positions later, and count how many of these pairs are equal (distance 0); save the count in an array keyed by i.
  5. You now have an array of same-pair occurrences. Calculate the mean of this array, and delete all values near this mean, leaving the spikes of the autocorrelation. The spikes will be at multiples of the lowest one, which is the number of spaces used for indentation. (A sketch follows the list.)
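
Here is a minimal sketch of steps 3-5 in Python, reading the source from stdin (steps 1-2 are folded into the list comprehensions). As the comment thread below discusses, exactly how the spike offsets map back to a tab width is left open; this sketch simply reports the smallest spiking offset. The 1.5x spike threshold and the max_offset cap are my own assumptions, not part of the answer.

#!/usr/bin/env python

import sys

def autocorrelation_spikes(lengths, max_offset=32):
    # Step 4: for each offset i, count pairs of entries i apart that are equal.
    counts = {}
    for i in range(1, min(max_offset, len(lengths))):
        counts[i] = sum(1 for n in range(len(lengths) - i)
                        if lengths[n] == lengths[n + i])
    if not counts:
        return []
    # Step 5: discard counts near the mean, keeping only pronounced spikes.
    mean = sum(counts.values()) / float(len(counts))
    return sorted(i for i, c in counts.items() if c > 1.5 * mean)

indented = [line for line in sys.stdin if line.startswith(' ')]
lengths = [len(line) - len(line.lstrip(' ')) for line in indented]
spikes = autocorrelation_spikes(lengths)
if spikes:
    print(spikes[0])  # per step 5, the other spikes are multiples of this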

The autocorrelation is a very nice function, usable in every situation in which you want to detect repeating values in a stream of data. It is heavily used in signal processing and is very fast (depending on the estimated maximum distance of signal repeats).

And yes, back then I cracked the polyalphabetic ciphertext with autocorrelation. ;)

Lars
  • *Very* interesting approach. It's been a while since I've done any signal processing, but I think I can see how this works. You're essentially suggesting a way to do a cheap Fourier transform, based on the assumption that low frequencies dominate (i.e., the limit in step 4 is a form of low-pass filter). Step five throws out the values that contribute little in the frequency domain power spectrum. Does that sound about right? – Michael J. Barber Aug 23 '11 at 15:38
  • Implementing this, I'm not finding your step 4 very clear: what does *i* represent? It appears to be the difference between the index of lines being compared, but how does that get converted to a tab width at the end? Is it supposed to be a 2D autocorrelation, perhaps? – Michael J. Barber Aug 23 '11 at 18:31
  • @michael-j-barber sounds about right, but to be honest, signal processing is not my best area of knowledge. I've also read about the similarities to FFT. In the end, you try to amplify spikes by comparing a signal to itself with an offset. Imagine a sine wave, which you copy and iteratively offset, until the two waves match up again. This amplifies the signal noticeably and thus you can determine the wavelength by looking at the offset. The same works for ciphertext, if you take the letter's number in the alphabet as the value for the wave, but that really gets off topic now. ;) – Lars Aug 23 '11 at 19:19
  • @Michael-j-barber: i is the iteration or current offset being tested. Have a look at [Index of Coincidence](http://en.wikipedia.org/wiki/Index_of_coincidence) for a detailed explanation of solving a polyalphabetic cipher. Maybe this will clear it up better than I can in 500 chars. – Lars Aug 23 '11 at 19:23
  • I'll take a look at the article, hopefully it will clear things up. Right now, points 4 and 5 seem to be saying to count how many pairs of lines with offsets *i* have the same indentation, and select the offsets with high counts. But that would ignore the actual indentation, with no way to recover it. Don't forget you can edit your answer: the 500 character limit is not a problem! – Michael J. Barber Aug 23 '11 at 21:02
  • got a point there - I'll see if I can find some time to write some actual code on this one tomorrow. – Lars Aug 23 '11 at 22:10
  • Autocorrelation is overkill for this problem: you need to estimate or guess the highest offset, and it is quite slow for something this simple. – Jürgen Strobel Aug 24 '11 at 15:20
2
  • For each line in the file
    • If indented more than the previous, add the difference to a list
      • discard if > 12, it's probably a line continuation
  • Generate a frequency table of the #s in the list
  • #1 is likely your answer.

edit

I have VB.Net open (don't you? :-) Here's what I mean:

    Sub Main()
        Dim lines = IO.File.ReadAllLines("ProveGodExists.c")
        Dim previndent As Integer = 0
        Dim indent As Integer
        Dim diff As Integer
        Dim Diffs As New Dictionary(Of Integer, Integer)
        For Each line In lines
            previndent = indent
            indent = Len(line) - Len(LTrim(line))
            diff = indent - previndent
            If diff > 0 And diff < 13 Then
                If Diffs.ContainsKey(diff) Then
                    Diffs(diff) += 1
                Else
                    Diffs.Add(diff, 1)
                End If
            End If
        Next
        Dim freqtbl = From p In Diffs Order By p.Value Descending
        Console.WriteLine("Dump of frequency table:")
        For Each item In freqtbl
            Console.WriteLine(item.Key.ToString & " " & item.Value.ToString)
        Next
        Console.WriteLine("My wild guess at tab setting: " & freqtbl(0).Key.ToString)
        Console.ReadLine()
    End Sub

Results:

Dump of frequency table:
4 748
8 22
12 12
2 2
9 2
3 1
6 1
My wild guess at tab setting: 4

Hope that helps.

FastAl
  • Not bad, except this can't e.g. determine the tab width is 8, if 45% of the tab widths are 7 and 55% are 9. Interesting, though. – Patrick87 Aug 18 '11 at 20:45
  • @Patrick87 - if you sort the freq table it will, those #s will be in subsequent slots. But, I don't think the OP wanted that; I re-read the question and still I think he just wants the most likely candidate. – FastAl Aug 18 '11 at 21:39
  • @Patrick87 I wouldn't expect that a file in which indentation never changes by 8 would have a tab width of 8. The numbers you give seem like an exceptional case that one shouldn't worry about much. – Michael J. Barber Aug 19 '11 at 06:55
  • More specifically for this answer, it is much in line with what I'm looking for. In the end, if you can come up with a good rule for eliminating spurious indents, it should be possible to do very well with a simple selection strategy like the mode of the indentation changes. I'll implement this later and see whether "large indents" are a good test for spurious indents. – Michael J. Barber Aug 19 '11 at 06:59
  • Your choices are (realistically) 2,3,4,5,6,7,8. I'd scan the first 50-100 non-empty lines with this method and pick the highest. If the hit is 8, 6, or 4 I'd do a second check to see if 4, 3, or 2 were the second highest and pick that one instead. I'd pick a "rationalization" scheme for fixing the goofs, for each of your 7 possibilities. – Mark Aug 24 '11 at 15:55
  • @Mark I've awarded the bounty to this answer, as it is the best one so far and I have no further time to implement and test another before the deadline would have expired. But please post your comment as an answer, I'll implement it tomorrow and see if it does better. I'll wait on accepting an answer until I've tried yours out as well. – Michael J. Barber Aug 24 '11 at 16:31
1

Maybe do something like...

  1. get a list of all the tab widths in the file
  2. remove the 50% of the entries that are least frequent
  3. sort the remaining entries in ascending order
  4. compute a list of (a, b) pairs, where the b's are the tab widths from the list and each a is the rank of that tab width
  5. fit a best-fit line
  6. the slope of the best-fit line is the guess for the tab width; round it to the nearest integer

Example:

  1. list = [4, 4, 6, 8, 8, 4, 4, 4, 8, 8, 12, 5, 11, 13, 12, 12]
  2. list = [4, 4, 4, 4, 4, 8, 8, 8]
  3. already sorted
  4. [(1, 4), (1, 4), (1, 4), (1, 4), (1, 4), (2, 8), (2, 8), (2, 8)]
  5. the best-fit line is b = 4a + 0 (a perfect fit: the residuals are zero)
  6. slope is 4, so this is probably the tab width (see the sketch below)
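
For testing purposes, here is a minimal sketch of this recipe in Python, reading the source from stdin and following the clarification in the comments below that the data points are the total leading-space counts of the lines. The stable-sort tie-breaking in step 2 and the guards against degenerate input are my own choices, not part of the answer.

#!/usr/bin/env python

import sys
from collections import Counter

def lr_tab_width(widths):
    if len(widths) < 2:  # too few data points for a fit
        return widths[0] if widths else 0
    # Step 2: keep the most frequent half of the entries.
    freq = Counter(widths)
    kept = sorted(widths, key=lambda w: freq[w], reverse=True)[:len(widths) // 2]
    kept.sort()  # step 3
    # Step 4: pair each width with the rank of its distinct value.
    distinct = sorted(set(kept))
    pairs = [(distinct.index(w) + 1, w) for w in kept]
    # Steps 5-6: least-squares fit of b = m*a + c; the slope m is the guess.
    n = len(pairs)
    sa = sum(a for a, b in pairs)
    sb = sum(b for a, b in pairs)
    saa = sum(a * a for a, b in pairs)
    sab = sum(a * b for a, b in pairs)
    if n * saa == sa * sa:  # all ranks equal: slope undefined
        return distinct[0]
    m = float(n * sab - sa * sb) / (n * saa - sa * sa)
    return int(round(m))

widths = [len(l) - len(l.lstrip(' ')) for l in sys.stdin if l.startswith(' ')]
print(lr_tab_width(widths))

On the example above, step 2 keeps [4, 4, 4, 4, 4, 8, 8, 8] and the fit recovers the slope of 4.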
Patrick87
  • When you refer to tab width, do you mean the leading indentation of the lines or the change in indentation between successive lines? – Michael J. Barber Aug 17 '11 at 09:49
  • My method would approximate both: the leading indentation is the y-intercept, and the change in indentation would be the slope. Alternatively, this line would give a function of indentation spaces versus tab depth. – Patrick87 Aug 17 '11 at 12:54
  • OK, then I'll follow-up with questions and comments for both versions. For indentation width, it seems that the approach can get it wrong, even if all indentation changes are the same magnitude; this might not be a problem in practice, and may well be necessary to improve the overall results---to be determined empirically. I note that your example numbers have no zeros---was that deliberate? – Michael J. Barber Aug 17 '11 at 13:35
  • In the case of indentation changes, there seems to be an assumption that most changes are in even multiples of the tab width, which I'm not sure about---again, something which I'll address empirically. Your example numbers have neither zeros nor negatives. Is the intention to omit reductions in the indentation? To use the magnitudes of non-zero changes? – Michael J. Barber Aug 17 '11 at 13:40
  • No, you could add zeros. I'm not sure I follow about how this could go wrong. This is an empirical question, and fitting a curve to data - for indentation, one would assume a linear curve is most suitable - is standard practice. The only time I see this method failing spectacularly is when all indentation levels are the same... in that case, you tell me what indentation scheme the guy was using! – Patrick87 Aug 17 '11 at 13:40
  • Another way to say it is this: my method is the best guess you can make looking at the data... to do better, you'd need assumptions. Say somebody chooses to indent at the first tab level and third, with tab width 2. Then there are lots of 2s and 6s, and my method would say the tab width is 4. If that's not good enough, you need a psychic, not an algorithm. By the way, the data points are the total number of leading spaces on each line... Not some sort of line-to-line delta. – Patrick87 Aug 17 '11 at 13:45
  • Thanks for the additional comments, I'm sure I can fairly capture your intention for testing purposes now. BTW, using indentation width can go wrong when you have a function consisting of a bunch of lines indented one level, a line or two indented two levels, and then a bunch of lines indented three levels; the two-level indent gets thrown out and double the actual tab width is returned. This structure shows up in, e.g., numerical code where you're looping over both indices of a 2d matrix. – Michael J. Barber Aug 17 '11 at 14:13
1

Your choices are (realistically) 2,3,4,5,6,7,8.

I'd scan the first 50-100 lines or so using something like what @FastAl suggested. I'd probably lean toward just blindly pulling the spaces count from the front of any row with text and counting the length of the whitespace string. Left-trimming lines and running length twice seems like a waste if you have regex available. Also, I'd use System.Math.Abs(indent - previndent) so you get de-indent data. The regex would be this:

row.matches('^( +)[^ ]') # grab all the spaces from line start to non-space.

Once you've got a statistic for which of the 7 options has the highest count, run with it as the first guess. For 8, 6, or 4 you should check whether there is also a significant count (second place, or over 10%, or some other cheapo heuristic) for half that width (4, 3, or 2 respectively). If there are a lot of 12s (or 9s), that might hint that 4 (or 3) is a better choice than 8 (or 6) as well. Dropping or adding more than 2 levels at a time (usually collapsed ending brackets) is super rare.
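
A minimal sketch of this double-check in Python, reading from stdin: the cap of roughly the first 100 lines is from the answer, the 20% threshold follows the question author's comment below that it works a little better than 10%, and the function name is my own. Only the percentage test is implemented here, not the second-place or 12s/9s hints.

#!/usr/bin/env python

import re
import sys
from collections import Counter

INDENT = re.compile(r'^( +)[^ ]')  # spaces from line start to first non-space

def doublecheck_width(lines, threshold=0.20):
    counts = Counter()
    prev = 0
    for line in lines[:100]:  # roughly the first 100 lines
        if not line.strip():
            continue
        m = INDENT.match(line)
        indent = len(m.group(1)) if m else 0
        change = abs(indent - prev)  # abs() keeps the de-indent data too
        prev = indent
        if 2 <= change <= 8:  # the realistic choices
            counts[change] += 1
    if not counts:
        return 0
    best = counts.most_common(1)[0][0]
    # Second check: for 8, 6, or 4, prefer the half width if it is also common.
    half = best // 2
    if best in (4, 6, 8) and counts[half] >= threshold * sum(counts.values()):
        return half
    return best

print(doublecheck_width(sys.stdin.readlines()))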

Irrelevant mumbling

The one problem I see is that old .c code in particular has this nasty pattern going on in it:

code level 0
/* Fancy comments get weird spacing because there 
 * is an extra space beyond the *
 * looks like one space!
 */
  code indent (2 spaces)
  /* Fancy comments get weird spacing because there 
   * is an extra space beyond the *
   * looks like three spaces!
   */

code level 0
  code indent (2 spaces)
  /* comment at indent level 1
     With no stars you wind up with 2 spaces + 3 spaces.
  */

Yuck. I don't know how you deal with comment standards like that. For code that is C-like, you might have to handle comments specially in version 2.0... but I would just ignore them for now.

Your final issue is dealing with lines that don't match your assumptions. My suggestion would be to "tab" them to depth and then leave the extra spaces in place. If you have to correct, I'd do this: rowtabdepth = ceiling((rowspacecount - (tabwidth/2)) / tabwidth). For example, with a tab width of 4, a line with 10 leading spaces gives ceiling((10 - 2) / 4) = 2 levels.

Mark
  • That gives a nice improvement for the ruby standard library, but actually a tiny loss for python---it looks like more in absolute terms, but as a percentage, the gain for ruby outweighs the loss for python. Looking through where Python gets it wrong, there's just not many more files to get correct than "no-long8" does. Using a threshold of 20% seems to work a little better than your guess of 10%. I found your description a little unclear, reading like you're working with the absolute indentation but referring to FastAl's which is about the differences; perhaps some editing is in order. – Michael J. Barber Aug 25 '11 at 19:34
  • Elaborate layout like the C you mention is exactly why I stressed "not all languages, perfection not required." Even with the exact tab width, it would be hard to insert text to match the formatting: better to call `indent` or the like. – Michael J. Barber Aug 25 '11 at 19:39
  • You are correct, I blended two answers, poorly. :-/ I'll tweak the answer to push it toward relative tabbing like @FastAl's. – Mark Aug 25 '11 at 20:43
0

For each language you want to support, you'll need to do a bit of parsing:

  1. Exclude comments (either line-wise or block-wise, maybe also nested).
  2. Find openings of sub-blocks ({ in C-like languages, begin in Pascal, do in shell, etc.).

Then just see how much the number of spaces increases after a sub-block has been opened. Gather some simple statistics: the most frequent value, the maximum and minimum values, and the average. This way you can also see whether the indentation is regular or not, and by how much.
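
By way of illustration, here is a rough sketch of step 2 for C-like languages only, skipping the comment stripping of step 1; the regex and reporting just the most frequent increase are my own simplifications.

#!/usr/bin/env python

import re
import sys

OPENER = re.compile(r'\{\s*$')  # a sub-block opens where a line ends with '{'

def indent_of(line):
    return len(line) - len(line.lstrip(' '))

lines = [l for l in sys.stdin if l.strip()]
increases = [indent_of(nxt) - indent_of(cur)
             for cur, nxt in zip(lines, lines[1:])
             if OPENER.search(cur) and indent_of(nxt) > indent_of(cur)]
if increases:
    # Most frequent increase; min, max, and average could be reported similarly.
    print(max(set(increases), key=increases.count))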

Tomas
0

As a baseline, one could simply calculate all indentation increases, and take the most frequent increase as the tab width. As a shell script, written to have small actions per pipeline stage, it could look like this:

#!/bin/sh

grep -v -E '^[[:space:]]*$' | 
  sed 's/^\([[:space:]]*\).*/\1/' | 
    awk '{ print length($0) }' | 
      awk '$1 > prev { print $1 - prev } { prev = $1 }' | 
        sort | 
          uniq -c | 
            sort -k1nr | 
              awk '{ print $2 }' | 
                head -n 1

This implementation is O(n log(n)) where n is the number of lines in the file, but it could readily be done in O(n).
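
For illustration, here is one way the O(n) version might look as a single awk pass, tracking the modal increase without sorting. This is my own rewrite, not the version used for the measured results, and it assumes space-only indentation as stated in the question.

#!/bin/sh

awk '
    /[^[:space:]]/ {
        n = match($0, /[^ ]/) - 1          # number of leading spaces
        if (n > prev) {
            d = n - prev                   # indentation increase
            if (++count[d] > best) { best = count[d]; width = d }
        }
        prev = n
    }
    END { if (best) print width }
'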

Michael J. Barber
  • I like this for the sheer perversity of it. After spawning 9 processes, I don't think sort's non-linear O() behavior is a problem for typical source files. – Jürgen Strobel Aug 24 '11 at 15:24
  • @Jürgen This was intended as a step-by-step illustration with one action per stage of the pipeline, to act as a baseline that others could modify without much trouble---I wanted ideas more than an efficient implementation. That doesn't necessarily lead to a fast implementation (the two consecutive `awk` stages look particularly egregious, as does the sorting strategy to get the maximum). That said, it runs on a file with 10k lines with no noticeable time lag; fast enough for interactive use isn't much of a constraint! – Michael J. Barber Aug 24 '11 at 15:40
  • I fully understand that. My python script uses almost the same strategy. – Jürgen Strobel Aug 24 '11 at 15:53
0

Heuristic:

  1. Get a list of all indentation changes from each line to the next line that are > 0.
  2. Make a frequency table of all values in this list.
  3. Take the value with highest frequency.

Python script; it takes filenames or stdin and prints the best indent number:

#!/usr/bin/env python

import fileinput, collections

def leadingSpaceLen(line):
    return len(line) - len(line.lstrip())

def indentChange(line1, line2):
    return leadingSpaceLen(line2) - leadingSpaceLen(line1)

def indentChanges(lines):
    return [indentChange(line1, line2)
        for line1, line2 in zip(lines[:-1], lines[1:])]

def bestIndent(lines):
    f = collections.defaultdict(lambda: 0)
    for change in indentChanges(lines):
        if change > 0:
            f[change] += 1
    return max(f.items(), key=lambda x: x[1])[0]

if __name__ == '__main__':
    print bestIndent(tuple(fileinput.input()))
Jürgen Strobel