Total Non-Comment Tokens Per-Line
Edit: my bad, I went off @Gilles' example and missed the comment part. Per your example using C/C++ comments, and ignoring multi-line comments between "/*" and "*/" that span more than one line (those are handled in the second script below), the per-line non-comment tokens can be obtained with awk using a counter (tokens) and a flag (skip), checking whether a field consists of "//", "/*" or "*/", since you show whitespace surrounding each. A simple awk script to process the file into non-comment, whitespace-separated tokens could be:
#!/bin/awk -f
{
    tokens = 0                      # non-comment tokens on this line
    skip = 0                        # true while inside /* ... */ on this line
    for (i = 1; i <= NF; i++) {
        if ($i == "//") {           # rest of line is a comment
            break
        }
        if ($i == "/*") {           # comment open -- stop counting
            skip = 1
        }
        if (!skip) {
            tokens++
        }
        if ($i == "*/") {           # comment close -- resume counting
            skip = 0
        }
    }
    printf "line %d: %d tokens\n", FNR, tokens
}
(Note: parsing individual tokens out of C constructs that aren't whitespace-separated, e.g. "foo(int", isn't addressed. If parsing at that level is needed, then reinventing the wheel with awk may not be your best choice. However, adding conditions to ignore fields consisting solely of "(", "{", "[" or "]", "}", ")" is easy to do, as sketched below.)
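For instance, here is a minimal sketch of that addition, reusing the loop from the script above with one extra condition (exactly which lone punctuation fields to ignore is an assumption on my part):

#!/bin/awk -f
{
    tokens = 0
    skip = 0
    for (i = 1; i <= NF; i++) {
        if ($i ~ /^[][(){}]$/) {    # assumed set: skip fields that are a lone bracket, brace or paren
            continue
        }
        if ($i == "//") {
            break
        }
        if ($i == "/*") {
            skip = 1
        }
        if (!skip) {
            tokens++
        }
        if ($i == "*/") {
            skip = 0
        }
    }
    printf "line %d: %d tokens\n", FNR, tokens
}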
The single rule iterates over each field and checks for a comment opening. In the case of "//", the remainder of the line is ignored. In the case of "/*", the skip flag is set and no more tokens are counted until a closing "*/" is encountered on that line. So for a line like "qux /* aze */ qwe base", the fields qux, qwe and base are counted while aze is skipped, giving 3 tokens.
Example Use/Output
Modified example file:
$ cat file
foo bar // base base
lorem ipsum doloris
qux /* aze */ qwe base
If you named your awk script noncmttokens.awk and made it executable with chmod +x noncmttokens.awk, then all you need to do is run it providing file as the argument, e.g.
$ ./noncmttokens.awk file
line 1: 2 tokens
line 2: 3 tokens
line 3: 3 tokens
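(If you prefer not to make the script executable, running it through awk's -f option works the same way, e.g. awk -f noncmttokens.awk file.)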
Sorry about overlooking the comment verbiage in the question; I got off track using the example file from the other answer -- happens...
Adding Multi-Line Comment Handling and Splitting on "("
To process your file into the tokens you desire, while relying on all comment open/close markers being whitespace-separated, and only splitting non-whitespace-separated tokens on "(", you can do:
#!/bin/awk -f
BEGIN {
    tokens_in_file = 0              # initialize variables persistent across records
    skip = 0                        # true while inside a multi-line comment
}
{
    tokens_in_line = 0              # per-record reset of variables
    ndx = 1
}
skip {                              # if in multi-line comment
    for (ndx = 1; ndx <= NF; ndx++) {   # iterate fields
        if ($ndx == "*/") {         # check for multi-line close
            skip = 0                # unset skip flag
            ndx++                   # increment field index past "*/"
            break
        }
    }
    if (skip) {                     # still in multi-line comment
        ndx = 1
        printf "line %d: %d tokens\n", FNR, tokens_in_line
        next
    }
}
{
    for (i = ndx; i <= NF; i++) {   # process fields from ndx to last
        if ($i ~ /^[({})]$/) {      # ignore fields that are a lone "(", "{", "}" or ")"
            continue
        }
        if ($i == "//") {           # C++ rest-of-line comment
            break
        }
        if ($i == "/*") {           # multi-line comment opening
            if (skip) {             # handle malformed multi-line comment
                print "error: duplicate multi-line comment entry tokens"
            }
            skip = 1                # set skip flag
        }
        if (!skip) {                # if not skipping, count tokens, splitting on "("
            tokens_in_line += split($i, tok_arr, "(")
        }
        if ($i == "*/") {           # multi-line comment close
            skip = 0
        }
    }
    # output per-line stats, add tokens_in_line to tokens_in_file
    printf "line %d: %d tokens\n", FNR, tokens_in_line
    tokens_in_file += tokens_in_line
}
END {                               # output file stats
    printf "\nidentified %d tokens in %d lines\n", tokens_in_file, FNR
}
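(The middle skip { ... } rule takes advantage of awk's pattern-action form: it only runs on records read while the multi-line flag is still set, consuming fields up to and including the closing "*/" so the final counting rule can resume at field ndx.)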
Example Use/Output
With the sample file you provided in file2.c, e.g.
$ cat file2.c
int /* this is
a multi-line
comment!
*/
foo(int x) {
/* comment 1 */
return 123; // comment 2
}
Providing that file as the argument to the expanded awk script, you would get:
$ ./noncmttokens2.awk file2.c
line 1: 1 tokens
line 2: 0 tokens
line 3: 0 tokens
line 4: 0 tokens
line 5: 3 tokens
line 6: 0 tokens
line 7: 2 tokens
line 8: 0 tokens

identified 6 tokens in 8 lines
awk can handle just about anything you need to do in a highly efficient manner, but as mentioned in the comments, I suspect that as more detail is added it becomes more a job of reinventing what the compiler does in one of its compilation phases. This splitting of tokens is rudimentary, and the number of corner cases that would need to be handled, e.g. for obfuscated C/C++ code, grows rapidly.
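For example, take a hypothetical one-line file, corner.c, where a comment marker sits inside a string literal; a purely field-based scan like this one is fooled:

$ cat corner.c
printf("a // b");

$ ./noncmttokens2.awk corner.c
line 1: 2 tokens

identified 2 tokens in 1 lines

The "//" is part of the string, but the script treats it as a comment opening and stops counting for the rest of the line.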
Hopefully this provides what you need.