How to distinguish between code and "human text" on Apple platforms?

Question

My problem in a nutshell: for a given string, I would like to identify whether it is a piece of code, or freeform text in human language. This should work on Apple devices (both macOS and iOS) locally on device.

So:

If the input string is `body { color: #c00; }`, it could be classified as `css` or `code`. (Ditto for more complex multi-line code snippets.)
If the input is `the quick brown fox jumps over the lazy dog`, it should be classified as `text`.

I thought of using CoreML. There is a great example of how to identify a programming language, but it is missing one crucial piece for my use case: there is no “other” category for input that doesn’t match any programming language. CoreML also does not provide a confidence score for a prediction. (If there were a low confidence score for all languages, I could assume the text is not code.)

One way around this with CoreML would be to also train my model on human-language samples alongside the programming-language samples, but I would rather not do that, because I want to keep the model size fairly small.

There is some related work based on Keras available, where I can see it is capable of outputting the confidence score for each language. I’m not an expert in Keras or ML though, and don’t know how to bring this over to the Apple world.
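
To make the low-confidence idea concrete, here is a minimal Swift sketch of the fallback I have in mind. It assumes some model can hand back a per-language confidence score in the range 0...1; the `classify` helper, the dictionary shape, and the 0.6 threshold are all hypothetical and not part of any existing API.

```swift
// Hypothetical helper: `scores` maps a language label to a confidence in 0...1,
// however the underlying model produces it. The default threshold of 0.6 is an
// arbitrary assumption and would need tuning on real data.
func classify(scores: [String: Double], threshold: Double = 0.6) -> String {
    // Take the label with the highest confidence; fall back to "text"
    // when there are no scores or the best score is below the threshold.
    guard let best = scores.max(by: { $0.value < $1.value }),
          best.value >= threshold else {
        return "text"
    }
    return best.key
}

print(classify(scores: ["css": 0.91, "swift": 0.05, "python": 0.04])) // prints "css"
print(classify(scores: ["css": 0.38, "swift": 0.33, "python": 0.29])) // prints "text"
```

The second call shows the case I care about: no language is convincing, so the input is treated as human text.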

What solution could I use to distinguish between “code” and “text” on Apple platforms? (Identifying the specific programming language would be a bonus, but is not strictly needed.) The solution doesn’t necessarily have to be machine-learning-based, though that seems to be the most promising avenue.

What is the use case? How long are the input strings expected to be? I would first filter on encoding, and further try to identify the human language (there are existing solutions for that). If it is (mostly) English, then proceed to finding and counting special characters peculiar to programming languages. So, not an ML-only solution. — Maxim Volgin, May 15 '19 at 04:55
The use case is something like a chat app, where users can enter or paste both human-language text and code of varying length. I note that `NSLinguisticTagger` is available on Apple platforms and lets me identify human languages. It has an "unknown language" response, but in my testing it still identifies code as English. Your suggested approach sounds promising, thanks. I wonder if there is a ready-made recipe for the “finding and counting special characters” part :) — Jaanus, May 29 '19 at 12:24
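
The “finding and counting special characters” step from the comments could be sketched roughly as follows. This is only an illustration in plain Swift: the `codeSymbols` set, the `symbolDensity` and `looksLikeCode` names, and the 0.08 cutoff are assumptions made up for this example, not a tested recipe.

```swift
// Characters that are common in source code but comparatively rare in prose.
// The set is an untuned assumption.
let codeSymbols: Set<Character> = [
    "{", "}", "(", ")", "[", "]", ";", "=", "<", ">",
    "#", "$", "&", "|", "\\", "_", "*", "/", "+"
]

/// Fraction of non-whitespace characters that belong to `codeSymbols`.
func symbolDensity(_ input: String) -> Double {
    let visible = input.filter { !$0.isWhitespace }
    guard !visible.isEmpty else { return 0 }
    let hits = visible.filter { codeSymbols.contains($0) }.count
    return Double(hits) / Double(visible.count)
}

/// Hypothetical classifier: the 0.08 cutoff is a guess, not a measured value.
func looksLikeCode(_ input: String, cutoff: Double = 0.08) -> Bool {
    return symbolDensity(input) >= cutoff
}

print(looksLikeCode("body { color: #c00; }"))                        // prints "true"
print(looksLikeCode("the quick brown fox jumps over the lazy dog")) // prints "false"
```

A heuristic like this will misfire on short inputs, for example prose with a parenthetical remark or a code line with few symbols, so it is probably best combined with the language-identification step rather than used alone.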

