Extract source comments from a Scala source file

Question

I would like to programmatically extract the code comments from a Scala source file.

I have access to both the source file and objects of the classes whose comments I am interested in. I am also open to writing the comments in the Scala source file in a specific form to facilitate extraction (though still following Scaladoc conventions).

Specifically, I am not looking for HTML or similar output.

However, a json object I can then traverse to get the comments for each field would be perfectly fine (though json is not a requirement). Ideally, I would like to get a class or class member comment given its "fully qualified" name or an object of the class.

How do I best do this? I am hoping for a solution that is maintainable (without too much effort) from Scala 2.11 to Scala 3.

Appreciate all help!

tjheslin1 · Answer 1 · 2021-10-14T10:42:03.763

I have access to both the source file

By this I assume you have the path to the file, which I'll represent in my code as:

val pathToFile: String = ???

TL;DR

import scala.io.Source

def comments(pathToFile: String): List[String] = {
  // A `def`, so every use re-opens the file and returns a fresh iterator of (line, index) pairs
  def lines: Iterator[(String, Int)] = Source.fromFile(pathToFile).getLines().zipWithIndex

  // Block comments that open and close on the same line, e.g. /*fullLineComment*/
  val singleLineJavaDocStartAndEnds = lines.filter {
    case (line, _) => line.contains("/*") && line.contains("*/")
  }.map { case (line, _) => line }

  // Lines that only open or only close a block comment, paired up as (start, end)
  val javaDocComments = lines.filter {
    case (line, _) =>
      (line.contains("/*") && !line.contains("*/")) ||
      (!line.contains("/*") && line.contains("*/"))
  }
  .grouped(2).map {
    case Seq((_, firstLineNumber), (_, secondLineNumber)) =>
      lines
        .map { case (line, _) => line }
        .slice(firstLineNumber, secondLineNumber+1)
        .mkString("\n")
  }

  // Any line containing `//`, wherever it appears on the line
  val slashSlashComments = lines
    .filter { case (line, _) => line.contains("//") }
    .map { case (line, _) => line }

  (singleLineJavaDocStartAndEnds ++ javaDocComments ++ slashSlashComments).toList
}

Full explanation

First thing to do is to read the contents of the file:

import scala.io.Source

def lines: Iterator[String] = Source.fromFile(pathToFile).getLines()

// here we preserve new lines; on Windows you may need to replace "\n" with "\r\n"
val content: String = lines.mkString("\n")
// where `content` is the whole file as a `String`

I have made lines a def to prevent unintended results when lines is used more than once. getLines() returns an Iterator, which can only be traversed once, so a val would be empty on its second use, whereas a def re-opens the file each time. Since you are reading source code files I think rereading the file is a safe operation to perform and won't lead to memory or performance issues.
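To make the val versus def point concrete, here is a minimal sketch. It assumes an in-memory string read through Source.fromString instead of a real file, and the object and method names are made up for illustration.

```scala
import scala.io.Source

object IteratorReuse {
  // Hypothetical two-line "file" held in memory
  val sample: String = "val x = 1 // first\nval y = 2 // second"

  // With a `val`, both traversals share one iterator, so the second finds it exhausted.
  def traverseTwiceWithVal(): (List[String], List[String]) = {
    val lines: Iterator[String] = Source.fromString(sample).getLines()
    (lines.toList, lines.toList)
  }

  // With a `def`, each use re-reads the source and yields a fresh iterator.
  def traverseTwiceWithDef(): (List[String], List[String]) = {
    def lines: Iterator[String] = Source.fromString(sample).getLines()
    (lines.toList, lines.toList)
  }
}
```

In the val version the second list comes back empty, while in the def version both lists hold the two lines.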

Now that we have the content of the file we can begin to filter out the lines we don't care about. Another way of viewing the problem is that we only want to keep - filter in - the lines that are comments.

Edit:

As @jwvh rightly pointed out, my earlier use of .trim.startsWith missed comments such as:

val x = 1 //mid-code-comments

/*fullLineComment*/

To address this I've replaced .trim.startsWith with .contains.

For single line comments this is simple:

val slashComments: Iterator[String] = lines.filter(line => line.contains("//"))

An earlier version of this answer called .trim before .startsWith, because comments are usually indented to match the surrounding code and trim removes the leading (and trailing) whitespace. The code now uses .contains instead, which catches a comment starting anywhere on the line.
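The difference between the two filters can be seen on a few sample lines. This is only a sketch: the sample lines, the example.com URL, and the names below are made up for illustration.

```scala
object SlashCommentFilters {
  // Hypothetical sample lines
  val sample: List[String] = List(
    "  // indented full-line comment",
    "val x = 1 //mid-code-comments",
    "val url = \"http://example.com\"",
    "val y = 2"
  )

  // Earlier approach: only lines whose first non-blank characters are `//`
  def byTrimStartsWith(lines: List[String]): List[String] =
    lines.filter(_.trim.startsWith("//"))

  // Current approach: any line containing `//`, wherever it appears
  def byContains(lines: List[String]): List[String] =
    lines.filter(_.contains("//"))
}
```

byTrimStartsWith keeps only the first sample line. byContains keeps the first three, so it picks up the mid-code comment but also the URL line, which is a false positive.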

Now we'll find multi-line comments, or JavaDoc; for example (the content is not important):

/**
 * Class String is special cased within the Serialization Stream Protocol.
 *
 * A String instance is written into an ObjectOutputStream according to
 * .....
 * .....
 */

The safest thing to do is to find the lines on which /* and */ appear and include all of the lines in between:

def lines: Iterator[(String, Int)] = Source.fromFile(pathToFile).getLines().zipWithIndex

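val javaDocStartAndEnds: Iterator[(String, Int)] = lines.filter {
  // Keep lines that only open or only close a comment. A one-line /* ... */ would
  // break the pairing into groups of 2; the TL;DR version collects those separately.
  case (line, _) =>
    (line.contains("/*") && !line.contains("*/")) ||
    (!line.contains("/*") && line.contains("*/"))
}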

.zipWithIndex gives us an incrementing number alongside each line. We can use these to represent the line numbers of the source file. At the moment this will give us a list of lines containing /* and */. We need to group these into groups of 2 as all of these kinds of comments will have a matching pair of /* and */. Once we have these groups we can select, using slice, all of the lines starting from the first index until the last. We want to include the last line so we do a +1 to it.

val javaDocComments = javaDocStartAndEnds.grouped(2).map {
  case Seq((_, firstLineNumber), (_, secondLineNumber)) =>
    lines // re-calling `def lines: Iterator[(String, Int)]`
      .map { case (line, _) => line } // here we only care about the `line`, not the `lineNumber`
      .slice(firstLineNumber, secondLineNumber+1)
      .mkString("\n")
}
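The pairing and slicing steps can be tried on lines held in memory. The sketch below uses the same start/end test as the TL;DR version and assumes every block comment opens and closes on different lines; the sample source and the names are made up for illustration.

```scala
object BlockCommentPairing {
  // Hypothetical source file, one element per line (indices 0 to 5)
  val source: Vector[String] = Vector(
    "/**",
    " * Adds one.",
    " */",
    "def inc(i: Int) = i + 1",
    "/* second",
    "   comment */"
  )

  def blockComments(lines: Vector[String]): List[String] = {
    // Lines that only open or only close a block comment, with their indices
    val delimiters = lines.zipWithIndex.filter { case (line, _) =>
      (line.contains("/*") && !line.contains("*/")) ||
      (!line.contains("/*") && line.contains("*/"))
    }
    // Pair them up as (start, end) and take every line from start to end inclusive;
    // `collect` silently skips a trailing unmatched delimiter
    delimiters.grouped(2).collect {
      case Seq((_, first), (_, last)) =>
        lines.slice(first, last + 1).mkString("\n")
    }.toList
  }
}
```

For the sample above the delimiters sit at indices 0, 2, 4 and 5, so the two groups become the slices 0 to 2 and 4 to 5.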

Finally we can combine slashComments and javaDocComments:

val comments: List[String] = (slashComments ++ javaDocComments).toList

Whichever order we join them in, the result is grouped by comment type rather than sorted by position in the file. An improvement would be to keep the lineNumber with each comment and sort by it at the end.
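One possible shape for that improvement is sketched below. It is not part of the code above: it works on lines already read into memory, pairs each comment with the index of the line it starts on, and sorts by that index. The names are made up for illustration, and the same corner cases apply (URLs containing //, one-line /* ... */ comments).

```scala
object OrderedComments {
  // Returns (lineIndex, commentText) pairs sorted by where each comment starts.
  def extract(lines: Vector[String]): List[(Int, String)] = {
    val indexed = lines.zipWithIndex

    // `//` comments, keyed by their own line index
    val slash = indexed.collect {
      case (line, n) if line.contains("//") => (n, line)
    }

    // Block comments, keyed by the index of the opening line
    val delimiters = indexed.filter { case (line, _) =>
      (line.contains("/*") && !line.contains("*/")) ||
      (!line.contains("/*") && line.contains("*/"))
    }
    val blocks = delimiters.grouped(2).collect {
      case Seq((_, first), (_, last)) =>
        (first, lines.slice(first, last + 1).mkString("\n"))
    }.toVector

    (slash ++ blocks).sortBy(_._1).toList
  }
}
```

Keeping the index in the result also lets a caller relate each comment to the declaration on the following line.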

I will include a "too long; didn't read" (TL;DR) version at the top so anyone can just copy the code in full without the step by step explanation.

How do I best do this? I am hoping for a solution that is maintainable (without too much effort) from Scala 2.11 to Scala 3.

I hope I have answered your question and provided a useful solution. You mentioned a JSON file as output. What I've provided is a List[String] in memory which you can process. If output to JSON is required I can update my answer with this.

All `val x = 1 //mid-code-comments` are missed, and this `/*fullLineComment*/` will throw off your multi-line calculation. — jwvh, Oct 12 '21 at 21:34
Hi @jwvh. That's very true! I'll update my answer to replace the `.trim.startsWith` with `contains`. — tjheslin1, Oct 13 '21 at 08:26
@skm Please let me know how you get on trying this solution :) — tjheslin1, Oct 13 '21 at 08:32
You really should test your code before posting. The current iteration fails for multiple code/comment combinations. Besides, it's not at all clear that the (now silent) OP wants a simple text parser. This business about _"given ... an object of the class"_ is confusing at best. — jwvh, Oct 13 '21 at 18:06
Thanks @jvwh. You're certainly right. I can see that the `/*fullLineComment*/` wasn't handled in the last version. Could you give examples of any other _"code/comment combinations"_ and I will update. I intend to keep the example as simple as possible, and so intend to only handle the most common cases. Some things slip through the net still, for example URLs containing a `//`, and will appear. — tjheslin1, Oct 14 '21 at 10:40
I'm not convinced that this filter-file-lines by comment-type can be made to work. There are just too many corner cases and special conditions. The current code can't even handle a simple 3 line comment: `/*line1\nline2\nline3*/` — jwvh, Oct 18 '21 at 06:05
Hi jwvh. As far as I have tested, this code can handle multi-line “/*” comments and preserves the new lines in the output. As I stated previously: _”I intend to keep the example as simple as possible, and so intend to only handle the most common cases.”_ There is certainly more that could be added and it might even be worth its own library. I’d be happy to review and defer to a solution of yours that’s more comprehensive. — tjheslin1, Oct 18 '21 at 07:28
