
I have a string containing some text. The text may or may not be code. Using GitHub's Linguist, I have been able to detect the likely programming language, but only if I give it a list of candidates.

# test_linguist_1.rb
#!/usr/bin/env ruby

require 'linguist'

s = "int main(){}"
candidates = [Linguist::Language["Python"], Linguist::Language["C"], Linguist::Language["Ruby"]]
b = Linguist::Blob.new('', s)
langs = Linguist::Classifier.call(b, candidates)
puts langs.inspect

Execution:

$ ./test_linguist_1.rb 
[#<Linguist::Language name=C>, #<Linguist::Language name=Python>, #<Linguist::Language name=Ruby>]

Notice that I gave it a list of candidates. How can I avoid having to define a list of candidates?

I tried the following:

# test_linguist_2.rb
#!/usr/bin/env ruby

require 'linguist'

s = "int main(){}"
candidates = Linguist::Language.all
# I also tried only Popular
# candidates = Linguist::Language.popular
b = Linguist::Blob.new('', s)
langs = Linguist::Classifier.call(b, candidates)
puts langs.inspect    

Execution:

$ ./test_linguist_2.rb 
/home/marvelez/.rvm/gems/ruby-2.2.1/gems/github-linguist-4.8.9/lib/linguist/classifier.rb:131:in `token_probability': undefined method `[]' for nil:NilClass (NoMethodError)
from /home/marvelez/.rvm/gems/ruby-2.2.1/gems/github-linguist-4.8.9/lib/linguist/classifier.rb:120:in `block in tokens_probability'
from /home/marvelez/.rvm/gems/ruby-2.2.1/gems/github-linguist-4.8.9/lib/linguist/classifier.rb:119:in `each'
from /home/marvelez/.rvm/gems/ruby-2.2.1/gems/github-linguist-4.8.9/lib/linguist/classifier.rb:119:in `inject'
from /home/marvelez/.rvm/gems/ruby-2.2.1/gems/github-linguist-4.8.9/lib/linguist/classifier.rb:119:in `tokens_probability'
from /home/marvelez/.rvm/gems/ruby-2.2.1/gems/github-linguist-4.8.9/lib/linguist/classifier.rb:105:in `block in classify'
from /home/marvelez/.rvm/gems/ruby-2.2.1/gems/github-linguist-4.8.9/lib/linguist/classifier.rb:104:in `each'
from /home/marvelez/.rvm/gems/ruby-2.2.1/gems/github-linguist-4.8.9/lib/linguist/classifier.rb:104:in `classify'
from /home/marvelez/.rvm/gems/ruby-2.2.1/gems/github-linguist-4.8.9/lib/linguist/classifier.rb:78:in `classify'
from /home/marvelez/.rvm/gems/ruby-2.2.1/gems/github-linguist-4.8.9/lib/linguist/classifier.rb:20:in `call'
from ./test_linguist.rb:21:in `block in <main>'
from ./test_linguist.rb:14:in `each'
from ./test_linguist.rb:14:in `<main>'

Additional:

  1. Is this the best way to use GitHub Linguist? FileBlob is an alternative to Blob, but it requires writing my string to a file. That is problematic for two reasons: (1) it is slow, and (2) the chosen file extension then guides Linguist, and I do not know the correct file extension.
  2. Are there better tools for this? GitHub Linguist perhaps works well over files but not over strings.
Martin Velez
2 Answers


Taking a quick look at the source code of Linguist, it appears to use a number of strategies to determine the language, and it calls each strategy in turn. Classifier is the last strategy to be called, by which time it has (hopefully) picked up language "candidates" (as you've discovered for yourself) from the prior strategies. So I think for the particular sample you've shared with us, you have to pass a filename of some kind, even if a file doesn't actually exist, or a list of language candidates. If neither is an option for you, this may not be a feasible solution for your problem.

$ ruby -r linguist -e 'p Linguist::Blob.new("foo.c", "int main(){}").language'
#<Linguist::Language name=C>

It returns nil without a filename, and #<Linguist::Language name=C++> with "foo.cc" and the same code sample.

The good news is that you picked a really bad sample to test with. :-) Other strategies look at modelines and shebangs, so more complex samples have a better chance at succeeding. Take a look at these:

$ ruby -r linguist -e 'p Linguist::Blob.new("", "#!/usr/bin/env perl
print q{Hello, world!};
").language'
#<Linguist::Language name=Perl>
$ ruby -r linguist -e 'p Linguist::Blob.new("", "# vim: ft=ruby
puts %q{Hello, world!}
").language'
#<Linguist::Language name=Ruby>
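
To make the "each strategy in turn" idea concrete, here is a minimal toy sketch of such a chain. It is not Linguist's code: the method names, the lookup tables, and the keyword hints are all made up for illustration, and the real strategies are far more thorough. It only shows why a classifier that runs last, and merely refines the candidates it is handed, has nothing to work with when no earlier strategy produced any.

```ruby
# Toy model of a detection chain. All names are hypothetical and the logic
# is deliberately simplified; this is not Linguist's actual implementation.

ALIASES = { "perl" => "Perl", "ruby" => "Ruby", "python" => "Python" }.freeze
EXTENSIONS = { ".c" => ["C"], ".cc" => ["C++"], ".h" => ["C", "C++"] }.freeze
# Made-up keyword hints used by the toy classifier below.
KEYWORDS = { "C" => %w[malloc printf], "C++" => %w[std:: template] }.freeze

# Modeline strategy: looks for "vim: ft=<alias>" on the first line.
def by_modeline(_name, data, _candidates)
  m = data.lines.first.to_s.match(/vim:\s*ft=(\w+)/)
  [m && ALIASES[m[1]]].compact
end

# Shebang strategy: handles "#!/usr/bin/env <alias>" and "#!/usr/bin/<alias>".
def by_shebang(_name, data, _candidates)
  first = data.lines.first.to_s
  return [] unless first.start_with?("#!")
  parts = first[2..-1].split
  interpreter = File.basename(parts.first.to_s)
  interpreter = parts[1].to_s if interpreter == "env"
  [ALIASES[interpreter]].compact
end

# Extension strategy: only useful when a file name is supplied.
def by_extension(name, _data, _candidates)
  EXTENSIONS.fetch(File.extname(name), [])
end

# Classifier strategy: ranks the candidates it is handed. With no
# candidates it has nothing to refine and returns nothing.
def by_classifier(_name, data, candidates)
  best = candidates.max_by do |lang|
    KEYWORDS.fetch(lang, []).count { |k| data.include?(k) }
  end
  [best].compact
end

STRATEGIES = %i[by_modeline by_shebang by_extension by_classifier].freeze

# Runs the strategies in order and stops once exactly one language is left.
def detect(name, data)
  candidates = []
  STRATEGIES.each do |strategy|
    found = send(strategy, name, data, candidates)
    candidates = found unless found.empty?
    return candidates.first if candidates.size == 1
  end
  nil
end

p detect("", "int main(){}")       # no file name, shebang, or modeline
p detect("foo.c", "int main(){}")  # the extension alone decides
```

In this toy, the first call prints nil and the second prints "C", which mirrors the behaviour described above: without a name, a shebang, or a modeline, the last step receives an empty candidate list.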

However, if there isn't a shebang or a modeline, we're still out of luck. It turns out that there's a training dataset that is computed and serialized to disk at install time, and automatically loaded during language detection. Unfortunately, I think there's a bug in the library that is preventing this training dataset from being used if there aren't any candidates by the time it gets to this step. Fixing the bug lets me do this:

$ ruby -Ilib -r linguist -e 'p Linguist::Blob.new("", "int main(){}").language'
#<Linguist::Language name=XC>

(I don't know what XC is, but adding some other tokens to the string such as #include <stdio.h> or int argc, char* argv[] gives C. I'm sure most of your samples will have more meat to analyze.)

It's a real simple fix and I've submitted a PR for it. You can use my fork of the Gem if you'd like in the meantime. Otherwise, we'll need to look into using Linguist::Classifier directly, as you've started exploring, but that has the potential to get messy.

To use my fork, add/modify your Gemfile to read as such:

gem 'github-linguist',
  require: 'linguist',
  git: 'https://github.com/mwpastore/linguist.git',
  branch: 'fix-no-candidates'

I'll try to come back and update this answer when the PR has been merged and a new version of the Gem has been released with the fix. If I have to do any force-pushes to meet the repository guidelines and/or make the maintainers happy, you may have to run bundle update to pick up the changes. Let me know if you have any questions.

mwp
  • Thank you for the answer. However, as I mention in the question, I do not know the file extension because I do not know the programming language. I am trying to detect the (likely) programming language. – Martin Velez Sep 09 '16 at 00:25
  • @MartinVelez Ah, I see that now. My mistake. Let me tinker with it a bit more. – mwp Sep 09 '16 at 00:26
  • @MartinVelez I think you found a bug in the gem. :-) I've submitted a PR and documented how you can use my fork. Please let me know your thoughts. – mwp Sep 09 '16 at 03:02
  • @MartinVelez: The file extension isn't really a reliable source for detecting the programming language. For instance, when writing executable scripts (not libraries) in Perl, Python, Ruby, and shell languages, it is quite common to not use any extension for the file name. – user1934428 Sep 09 '16 at 05:44
  • @user1934428, yes, I agree. But again, I do not want to write to a file because 1) it is slow and 2) I do not know the file extension. Extensions that I have tried have influenced the outcome. – Martin Velez Sep 09 '16 at 07:25
  • Thanks, @mwp! I also noticed that an empty `candidates` was being passed to the `Classifier` in `Linguist.detect` but I did not realize it was a bug. I simply skipped directly to using the Classifier. I'll accept your answer. – Martin Velez Sep 09 '16 at 07:38
  • xC is a language for parallel real-time embedded programming, integrating elements from Occam-π and C. The snippet you used happens to be a complete and valid xC program. It's also a complete and valid Cyclone program. And a complete and valid Objective-C++, Objective-C, C++, and D program. As you said, the longer the program is the more likely it is that the language will be unique. But still: Objective-C is a proper superset of C, thus *all* C programs are also Objective-C programs, for example. – Jörg W Mittag Sep 09 '16 at 09:16
  • `I think there's a bug in the library that is preventing this training dataset from being used if there aren't any candidates by the time it gets to this step` I'm gonna have to argue that this is not a bug, but a design choice. Neither the Bayesian classifier nor the heuristic rules were built to choose a language among all possible languages. They are only refinement strategies. If you try to use them with all languages as input, you'll most likely end up with very poor results. – pchaigno Aug 12 '17 at 20:46
  • @pchaigno I can't disagree with you because this is exactly what the maintainers of Linguist have told me. However, I would ask you to look at this line of code, and explain to me its purpose. Because it reads, "if no languages have been chosen as candidates at this point in the process, use all the possible languages in the database." https://github.com/github/linguist/blob/983ff20d3cee56a8d1625fcd6347372d765b8c57/lib/linguist/classifier.rb#L77 – mwp Aug 13 '17 at 07:21
  • Ah, I see that you are one of those maintainers. I don't understand why you've chosen to discuss this with me here instead of in the comments on my PR, but notwithstanding, these are (close to) my final thoughts on the matter. – mwp Aug 13 '17 at 07:26
  • I added a comment here because I wouldn't want people to use your fork and then think Linguist is doing a very poor job. If they use it knowing what to expect then it's fine by me :-) The line of code you cite dates from 2012. Linguist changed a lot since then. It's possible that, at first, the maintainers tried to use the Bayesian classifier to select a language among all, or it may just be a mistake. In any case, we should remove this line, even if it's never actually used. – pchaigno Aug 13 '17 at 07:32

Taking another quick look at Linguist source, Linguist::Language.all seems to be what you're looking for.

EDIT: Tried the Linguist::Language.all myself. The failure is due to yet another bug: some languages seem to have faulty data. For example, this also fails:

candidates = [Linguist::Language['ADA']]

This is apparently because tokens.ADA doesn't exist in lib/linguist/samples.json. It is not the only such language.

To avoid the bug, you can filter the languages:

non_buggy_languages = Linguist::Samples.cache['tokens'].keys
candidates = non_buggy_languages.map { |l| Linguist::Language[l] }
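
To illustrate the failure mode and the filter without depending on the gem, here is a small self-contained sketch. Everything in it is an assumption made for illustration: TOKENS is a hand-made stand-in for the token table in samples.json with invented counts, "Ada" is left out on purpose, and the scoring is a crude smoothed log-frequency sum rather than Linguist's actual classifier.

```ruby
# Hypothetical stand-in for the per-language token counts in samples.json.
# The numbers are invented, and "Ada" is deliberately missing.
TOKENS = {
  "C"    => { "int" => 40, "main" => 12, "(" => 90, ")" => 90, "{" => 60, "}" => 60 },
  "Ruby" => { "def" => 50, "end" => 80, "(" => 30, ")" => 30, "puts" => 20 }
}.freeze

# Crude tokenizer: words plus a few punctuation characters.
def tokenize(text)
  text.scan(/\w+|[(){}]/)
end

# Sum of log token frequencies with add-one smoothing. When a language has
# no entry in TOKENS, table is nil and table[t] raises NoMethodError.
def score(tokens, language)
  table = TOKENS[language]
  tokens.inject(0.0) do |sum, t|
    sum + Math.log((table[t].to_i + 1.0) / (table.values.inject(:+) + table.size))
  end
end

# Returns the candidates ordered from most to least likely.
def classify(text, candidates)
  tokens = tokenize(text)
  candidates.sort_by { |lang| -score(tokens, lang) }
end

# Same idea as the filter above: keep only languages that have token data.
def safe_candidates(candidates)
  candidates.select { |lang| TOKENS.key?(lang) }
end

all_languages = ["C", "Ada", "Ruby"]
p classify("int main(){}", safe_candidates(all_languages))
```

Calling classify with the unfiltered list raises NoMethodError on nil in this toy, much like the stack trace in the question; with the filter applied, the snippet ranks "C" ahead of "Ruby".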
Amadan
  • thanks! I tried that. It did not work. See the error that triggered. – Martin Velez Sep 09 '16 at 07:27
  • Linguist selection strategies (be that the Bayesian classifier or the heuristic rules) weren't built to choose a language among all possible languages. They are only refinement strategies. If you try to use them with all languages as input, you'll most likely end up with very poor results. – pchaigno Aug 12 '17 at 20:48
  • @pchaigno: So? The question was "How can I avoid having to define a list of candidates?", and the answer answered it (at the time; I haven't looked whether or not it still does). It is quite unnecessary to put a disclaimer "if you take away some of the methods we use to improve the classification accuracy, the classification accuracy might suffer". – Amadan Aug 13 '17 at 17:03
  • I disagree. Someone reading your answer might think that Linguist is not doing that by default because of a bug. And it's actually "if you take away some of the methods we use to improve the classification accuracy, the classification accuracy **will** suffer". – pchaigno Aug 13 '17 at 20:54