4

Is there a way to tell sloccount that some files are not in any of the languages it already supports, but in a new, different language (a DSL, or a language sloccount does not support, such as Scala, Go, or Rust)? I would like the detection to be based on file content rather than file extension: for example, files that contain specific keywords or a specific comment style. I could provide the tool with a complete list of tokens.

Is there a better (and simple) tool for this specific task?

Thanks in advance.

  • By the way, if anyone comes back to this question: I found out about a year (?) ago that GitHub had worked with ml@b at Berkeley to create a genuinely performant language classifier. There used to be a web page at https://lexicon.github.io/ but it now returns a 404. I repeatedly emailed GitHub to ask whether they would open-source this work (they don't have to, but if they aren't doing anything with the R&D, maybe someone else could); alas, it does not seem to be on the table. It seemed to work really well. Oh well, so long... – Touisteur EmporteUneVache Dec 13 '18 at 06:47

2 Answers

1

You can use find together with wc -l to get a result similar to sloccount's.

If you are in your project directory you can run the following to get the number of lines of code in the project:

find . -name '*.scala' -print0 | wc -l --files0-from=-

Note that this also counts empty lines. To skip them, filter with grep -v:

find . -name '*.scala' -exec grep -v -e '^[[:space:]]*$' {} \; | wc -l
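As a rough extension of the command above, single-line comments can be filtered out as well. The sketch below assumes a grep that supports -h (GNU and BSD grep do) and assumes comments start with // at the beginning of a line, which holds for Scala but not for every language. count_sloc is a hypothetical helper name, and multi-line /* ... */ comments are not handled.

```shell
# Hypothetical helper: count lines that are neither blank nor '//' comments
# in all files under a directory whose names match a glob pattern.
count_sloc() {
    dir=$1
    pattern=$2
    # -h suppresses file name prefixes when grep receives several files.
    find "$dir" -type f -name "$pattern" \
        -exec grep -h -v -e '^[[:space:]]*$' -e '^[[:space:]]*//' {} + |
        wc -l
}

# Usage: count_sloc . '*.scala'
```

This still counts a statement split over several lines as several lines, so it is only an approximation of what sloccount reports.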
spydon
  • Hi, I think sloccount does something a bit cleverer than a plain line count: it tries to count 'single' lines of code, so one mostly can't 'cheat' by splitting lines to inflate the SLOC count, and it also eliminates comments. – Touisteur EmporteUneVache Jul 28 '21 at 22:56
  • Hi, It does remove comments, but it does not remove lines from expressions that are split on several lines. To remove single line comments we'll just have to add a case to the grep, but multi-line comments will of course be more complicated. – spydon Jul 29 '21 at 08:44
-2

OP writes: Is there a better tool (simple) for the job for this specific task?

What you want is a tool that knows something about a wide variety of languages, can use the file extension as a hint and uses the file content as a sanity check or a classification if the extension isn't present.

Semantic Designs' (my company) File Inventory tool scans a large set of files and classifies them this way. File extensions hint at the content. When no file extension is present, a set of user-definable regexes is used to attempt a basic classification of the file type. Once the file content is guessed, a second pass using language-accurate lexical scanners confirms that the content is what it claims to be and provides confidence factors. (It works without the lexical scanners too; you just get the hinted type.)
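The extension-then-content idea can be approximated with standard tools. This is a minimal sketch and not how File Inventory is implemented: classify_file is a hypothetical function, and the extensions and regexes are illustrative assumptions that a real setup would replace with its own token list.

```shell
# Hypothetical classifier: use the file extension as a hint, then fall back
# to content regexes (illustrative only) when the extension is not known.
classify_file() {
    file=$1
    case "$file" in
        *.scala) echo scala; return ;;
        *.go)    echo go;    return ;;
        *.rs)    echo rust;  return ;;
    esac
    if grep -q -E '^[[:space:]]*package[[:space:]]+main[[:space:]]*$' "$file"; then
        echo go
    elif grep -q -E '^[[:space:]]*(pub[[:space:]]+)?fn[[:space:]]+[A-Za-z_]' "$file"; then
        echo rust
    elif grep -q -E '^[[:space:]]*(object|trait|case[[:space:]]+class)[[:space:]]' "$file"; then
        echo scala
    else
        echo unknown
    fi
}

# Usage: classify_file path/to/file
```

The resulting label could then decide which files are passed to a per-language line counter. Unlike the tool described above, this sketch does no second pass to confirm the guess.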

FileInventory doesn't compute source code metrics by itself. (It does compute file size and line counts for files that appear to contain text.) But it does manufacture project files for the classified files to drive our Source Code Search Engine (SCSE), a tool for searching large code bases in multiple languages. A side effect of SCSE scanning the code base to index it for fast access is the computation of basic metrics: lines, SLOC, comments, Halstead, and McCabe metrics (example output).

[We have a special lexical analyzer called "Ad Hoc Text". It tries to model the random programming language found in the zillion how-to computer books, so it knows about typical comments /* ... */ -- ... , various kinds of quoted strings "...." '....' ...., lots of numeric literal types (decimal, float), and typical keywords 'function' 'if' 'do' etc. Using this lexical analyzer, the SCSE can lex most randomly chosen programming languages only partially, but that's good enough to compute not-terribly-inaccurate metrics. That's really handy for all the uncategorized source code one often finds in big crufty source code bases.]

So the combination of FileInventory and Source Code Search Engine seems to do what you want, at scale. These tools are not what I would call simple in terms of how they are implemented internally (doing anything that knows details about programming languages is actually pretty complicated), but they are very simple to configure and run.

Ira Baxter
  • I find it amazing that, while my answer directly addresses the OP's key question and the OP marked it as the best answer, there are enough naysayers of commercial products at SO that this answer has been significantly downvoted. I was very clear about the fact that it is commercial, and that I have a relationship with the company that supplies it, as per SO policy guidelines. – Ira Baxter Jul 29 '21 at 12:56