My problem in a nutshell: for a given string, I would like to identify whether it is a piece of code, or freeform text in human language. This should work on Apple devices (both macOS and iOS) locally on device.
So:
- If the input string is `body { color: #c00; }`, it could be classified as `css` or `code`. (Ditto for more complex multi-line code snippets.)
- If the input is `the quick brown fox jumps over the lazy dog`, it should be classified as `text`.
I thought of using CoreML. There is a great example of how to identify a programming language, but it is missing one crucial piece for my use: there is no “other” category for input that doesn’t match any programming language. As far as I can tell, CoreML also does not provide a confidence score for a prediction. (If there were a low confidence score for all languages, I could assume the text is not code.)
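To illustrate the fallback I have in mind, here is a minimal sketch in plain Swift. It assumes I could somehow obtain a per-language confidence as a dictionary; the `classify` function, the `scores` parameter, and the 0.6 threshold are all made up for illustration and do not come from any Apple API.

```swift
// Hypothetical helper: `scores` maps a language label to a confidence in 0...1.
// Neither this function nor the 0.6 threshold comes from any Apple framework.
func classify(scores: [String: Double], threshold: Double = 0.6) -> String {
    // Pick the label with the highest confidence, if there is one.
    guard let best = scores.max(by: { $0.value < $1.value }) else {
        return "text"
    }
    // Fall back to "text" when no language is confident enough.
    return best.value >= threshold ? best.key : "text"
}
```

With something like this, `classify(scores: ["css": 0.92, "swift": 0.05])` would give `css`, while a flat distribution such as `["css": 0.3, "swift": 0.25]` would fall through to `text`.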
One way out of this with CoreML would be to also train my model on human-language samples alongside the programming-language samples, but I would rather not, because I want to keep the model size fairly small.
There is some related work based on Keras available, where I can see it is capable of outputting the confidence score for each language. I’m not an expert in Keras or ML though, and don’t know how to bring this over to the Apple world.
What solution could I use to distinguish between “code” and “text” on Apple platforms? (Identifying the specific programming language would be a bonus, but is not strictly needed.) It doesn’t necessarily have to be machine-learning-based, though that seems to be the most promising avenue.
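For reference, the kind of non-ML approach I could imagine is a crude symbol-density check like the sketch below. This is only an assumption about what a heuristic might look like: the `looksLikeCode` name, the symbol set, and the 0.05 cutoff are guesses that would need tuning on real samples.

```swift
// Hypothetical heuristic, not a trained model: measures how many of the
// non-whitespace characters are symbols that are common in source code
// but rare in prose. The symbol set and the cutoff are untested guesses.
func looksLikeCode(_ input: String, cutoff: Double = 0.05) -> Bool {
    let codeSymbols: Set<Character> = ["{", "}", ";", "=", "<", ">", "[", "]", "#", "$"]
    // Ignore whitespace so indentation does not dilute the ratio.
    let visible = input.filter { !$0.isWhitespace }
    guard !visible.isEmpty else { return false }
    let symbolCount = visible.filter { codeSymbols.contains($0) }.count
    return Double(symbolCount) / Double(visible.count) >= cutoff
}

let cssResult = looksLikeCode("body { color: #c00; }")                          // true
let proseResult = looksLikeCode("the quick brown fox jumps over the lazy dog")  // false
```

I suspect this would misfire on prose that contains hashtags or prices, and on code with very few symbols (Python, for instance), which is why I am asking whether there is something more robust.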