There are all sorts of tools for counting lines of code in a source file or directory tree (e.g. cloc). There are also tools for counting words in a plain text file (wc).

How would I go about counting words or tokens in my code, though? Is this feasible without writing a full-fledged program of my own to do it, using some generic programming language parsing mechanism like tree-sitter? More specifically, can I do this with shell tools or a simple(ish) script?

Note: Only words/tokens outside of comments must be counted. For general word counting I'm sure there are other questions on SO...

Example: Suppose my code is in the C language, and my foo.c file contains

int /* this is
a multi-line
comment!
*/
foo(int x) { 
    /* comment 1 */
    return 123;  // comment 2
}

The exact number expected here would depend on whether we think of braces and semicolons as words/tokens to count. If we do, then this should be 11 tokens: int, foo, (, int, x, ), {, return, 123, ;, }. If we ignore them (which I would rather not, but it could still be a legitimate approach) then we have 6 words: int, foo, int, x, return, 123.

einpoklum
  • Can you elaborate and give a sample input and expected output? – Gilles Quénot Feb 25 '23 at 10:31
  • without a tokenizer there's no way to count the number of tokens correctly, because strings or comments can contain completely valid code inside, but they're not code – phuclv Feb 25 '23 at 10:49
  • @phuclv: Hence my question. – einpoklum Feb 25 '23 at 10:51
  • but the rules to tokenize depend on the language, so this is impossible to answer without knowing the language – phuclv Feb 25 '23 at 10:56
  • @phuclv: Of course they depend on the language. That's why I mentioned cloc and tree-sitter, which incorporate language-specific knowledge. You could answer suggesting a language-specific tool rather than a general tool, in which case - choose some language to demonstrate with, and I (or other readers) will need to adapt the solution to other languages. – einpoklum Feb 25 '23 at 10:56

2 Answers

Total Non-Comment Tokens Per-Line

Edit: my bad, I went off @Gilles' example and missed the comment part. Going by your example with C/C++ comments, and setting aside for now comments that span several lines, the per-line count of non-comment tokens can be obtained with awk using a counter tokens and a flag skip. Each field is checked for being exactly "//", "/*" or "*/", since your example shows whitespace around each delimiter. A simple awk script that counts the non-comment, whitespace-separated tokens on each line could be:

#!/bin/awk -f

{
  tokens = 0
  skip = 0
  for (i=1; i<=NF; i++) {
    if ($i == "//") {
      break
    }
    if ($i == "/*") {
      skip = 1
    }
    if (!skip) {
      tokens++
    }
    if ($i == "*/") {
      skip = 0
    }
  }
  printf "line %d: %d tokens\n", FNR, tokens
}

(Note: splitting C tokens that are not separated by whitespace, e.g. "foo(int", isn't addressed. If parsing at that level is needed, then reinventing the wheel with awk may not be your best choice. However, adding conditions to ignore fields consisting solely of (, {, [ or ], }, ) is easy to do.)

The single rule iterates over each field and checks for a comment delimiter. In the case of "//", the remainder of the line is ignored. In the case of "/*", the skip flag is set and no more tokens are counted until after a closing "*/" is encountered in that line.
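As for tokens that are not whitespace-separated, such as "foo(int", one possible workaround is to pad punctuation with spaces so that awk re-splits the record into fields. The following is only a sketch: the function name count_padded and the chosen punctuation set are assumptions made for illustration, and it does not handle comments at all.

```shell
# Hypothetical helper (not part of the script above): print, per line, the
# number of fields after padding some C punctuation with spaces.
# Comments are NOT handled here.
count_padded() {
  awk '{
    gsub(/[(){};,]/, " & ")   # surround each listed character with spaces
    print NF                  # modifying $0 makes awk re-split the fields
  }' "$@"
}
```

For the line "foo(int x) {" this prints 6 (foo, (, int, x, ), {), whereas plain whitespace splitting sees only 3 fields.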

Example Use/Output

Modified example file:

$ cat file
foo bar // base base
lorem ipsum doloris
qux /* aze */ qwe base

If you named your awk script noncmttokens.awk and made it executable with chmod +x noncmttokens.awk, then all you need to do is run it with file as the argument, e.g.

$ ./noncmttokens.awk file
line 1: 2 tokens
line 2: 3 tokens
line 3: 3 tokens

Sorry about overlooking the comment requirement in the question; I got off track using the example file from the other answer.


Adding Multi-line Comment Handling and Splitting on "("

To process your file into the tokens you desire, while still assuming that every comment opener and closer is whitespace-separated, and splitting tokens that are not whitespace-separated only on "(", you can do:

#!/bin/awk -f

BEGIN {
  tokens_in_file = 0    # initialize vars that are persistent across records
  skip = 0
}

{
  tokens_in_line = 0;   # per-record reset of variables
  ndx = 1
}

skip {  # if in multi-line comment
  for (ndx=1; ndx<=NF; ndx++) {   # iterate fields
    if ($ndx == "*/") {           # check for multi-line close
      skip = 0;                   # unset skip flag
      ndx++                       # increment field index
      break
    }
  }
  if (skip) {   # still in multi-line comment
    ndx = 1
    printf "line %d: %d tokens\n", FNR, tokens_in_line
    next
  }
}

{
  for (i=ndx; i<=NF; i++) {   # process fields from ndx to last
    if ($i ~/^[({})]$/) {     # ignore "(, {, }, )" fields
      continue
    }
    if ($i == "//") {         # C++ rest of line comment
      break
    }
    if ($i == "/*") {         # multi-line opening
      if (skip) {             # handle malformed multi-line error
        print "error: duplicate multi-line comment entry tokens"
      }
      skip = 1                # set skip flag
    }
    if (!skip) {              # if not skip, process toks, split on "("
      tokens_in_line += split ($i, tok_arr, "(")
    }
    if ($i == "*/") {         # check if last field multi-line close
      skip = 0
    }
  }
  # output per-line stats, add tokens_in_line to tokens_in_file
  printf "line %d: %d tokens\n", FNR, tokens_in_line
  tokens_in_file += tokens_in_line
}

END { # output file stats
  printf "\nidentified %d tokens in %d lines\n", tokens_in_file, FNR
}

Example Use/Output

With the sample file you provided saved as file2.c, e.g.

$ cat file2.c
int /* this is
a multi-line
comment!
*/
foo(int x) {
    /* comment 1 */
    return 123;  // comment 2
}

Providing that file as the argument to the expanded awk script you would get:

$ ./noncmttokens2.awk file2.c
line 1: 1 tokens
line 2: 0 tokens
line 3: 0 tokens
line 4: 0 tokens
line 5: 3 tokens
line 6: 0 tokens
line 7: 2 tokens
line 8: 0 tokens

identified 6 tokens in 8 lines

awk can handle just about anything you need to do in a highly efficient manner, but as mentioned in the comments, I suspect that as more detail is added it will become more of a job of reinventing what the compiler already does in one of its compilation phases. This splitting of tokens is rudimentary, and the number of corner cases that would need to be handled, e.g. for obfuscated C/C++ code, grows rapidly.
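If punctuation should count as tokens (the 11-token reading in the question), whitespace-separated fields are no longer enough and each line has to be scanned character by character. Below is a minimal sketch of that idea, wrapped in a hypothetical shell function count_c_tokens. It rests on several simplifying assumptions: every punctuation character is a token of its own (so == counts as two), a double-quoted string on a single line is one token, character literals and floating-point numbers get no special treatment, and the preprocessor is ignored.

```shell
# Hypothetical helper: count C tokens outside comments, punctuation included.
# A sketch under the assumptions stated above, not a real C lexer.
count_c_tokens() {
  awk '
  {
    line = $0
    while (line != "") {
      if (in_comment) {                            # inside a block comment
        p = index(line, "*/")
        if (p == 0) break                          # continues on next line
        line = substr(line, p + 2)
        in_comment = 0
      } else if (match(line, /^[ \t\r]+/)) {       # skip whitespace
        line = substr(line, RLENGTH + 1)
      } else if (substr(line, 1, 2) == "//") {     # rest-of-line comment
        break
      } else if (substr(line, 1, 2) == "/*") {     # block comment opens
        line = substr(line, 3)
        in_comment = 1
      } else if (match(line, /^"([^"\\]|\\.)*"/)) {  # string literal
        n++
        line = substr(line, RLENGTH + 1)
      } else if (match(line, /^[A-Za-z_0-9]+/)) {  # identifier or integer
        n++
        line = substr(line, RLENGTH + 1)
      } else {                                     # any other character
        n++
        line = substr(line, 2)
      }
    }
  }
  END { print n + 0 }
  ' "$@"
}
```

Called as count_c_tokens file2.c on the sample file, this prints 11, and it no longer requires whitespace around the comment delimiters. It is still the same kind of hand-rolled rule set as above, so the caveat about corner cases applies here too.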

Hopefully this provides what you need.

David C. Rankin
  • This doesn't support multi-line comments. Also - it doesn't provide a token count for the entire file. – einpoklum Feb 26 '23 at 21:03
  • @einpoklum - those are easy enough to add and I'll provide a way to handle both, but I have a nagging feeling that we are slowly reinventing something the compiler already does somewhere along the way in one of its compilation phases. – David C. Rankin Feb 26 '23 at 21:23
  • I explicitly wondered in the question whether I can do this without using a compiler frontend equivalent like treesitter. Maybe I'm not as good as I thought in communicating my questions... – einpoklum Feb 26 '23 at 23:14
  • Yes, the statement about increasing complexity reinventing what the compiler already does remains true. `awk` is a good choice for implementing a quick hack at it, or you could write the parser in C, etc.. Either way you are simply applying a set of parsing rules to separate and count tokens. From a utility approach, `awk` is ideal because `sed` can't count, and `awk` is more than reasonably fast at applying a large set of rules to each line. – David C. Rankin Feb 28 '23 at 03:53

File

$ cat file
foo bar base base
lorem ipsum doloris
qux aze qwe base

Consider this simple concise snippet:

$ perl -snE '$c += s/\bbase\b/$&/g;END{say $c}' file
3

With bash:

for word in $(< file); do
    [[ $word == base ]] && ((c++))
done
echo "$c"

With grep:

printf '%s\n' $(< file) | grep -wc base 

With awk:

tr ' ' $'\n' < file | awk '$1=="base"{c++}END{print c}'

Gilles Quénot
  • You're completely ignoring the fact that I need to count words in _code_, not in _comments_. Emphasized this in the question. – einpoklum Feb 25 '23 at 10:27
  • Your new requirements are far different than the vague question you asked first. It would have been interesting to know the details from the beginning. – Gilles Quénot Feb 25 '23 at 11:54
  • Those were my requirements from the beginning. I specifically talked about counting words of _code_, and mentioned cloc. If that had not been a requirement, the task would be rather trivial: applying a known tool (or extremely simple logic) to multiple files rather than just one. – einpoklum Feb 25 '23 at 12:39