
While researching best practices and experimenting with multiple options for an ongoing project (i.e., a Unity3D iOS project in Vuforia with native integration, extracting frames with AVFoundation, then passing the image through cloud-based image recognition), I have come to the conclusion that I would like to use ARKit, the Vision framework, and CoreML; let me explain.

I am wondering how I would capture ARFrames and use the Vision framework to detect and track a given object using a CoreML model.

Additionally, it would be nice to have a bounding box once the object is recognized, with the ability to add an AR object upon a touch gesture, but this is something that could be implemented after getting the core of the project working.

This is undoubtedly possible, but I am unsure of how to pass the ARFrames to CoreML via Vision for processing.

Any ideas?

pythlang
  • Start with the basics. Oh, and understand you **must** use Xcode 9 and iOS 11. Now, by the "basics", I mean learn what each piece *is* and *isn't*. Vision will be the highest level - it can track and use a CoreML model. But do you *know* how to train a model? If not, then learn ML first, and *then* learn how to import a trained model into CoreML. I won't address ARKit (sorry, that's way too broad for my niche - are you sure it's needed for yours?) but if you want to address CoreML directly, you will need to understand what a `CVPixelBuffer` is. Good luck! – dfd Jul 07 '17 at 17:48
  • Thanks for your response; yes, I'm using both Xcode 9 and iOS 11. I can train a model, as that's how I actually got into programming to begin with (NLP in Python), and I can convert and insert the model into Xcode. Besides that, there are pre-trained models I can use to test the functionality of the app for the time being. I'm having trouble understanding the methodology of extracting the ARFrame and passing it through Vision using CoreML's model. There isn't really any deep documentation yet, and I was curious if someone could shed insight. @dfd – pythlang Jul 07 '17 at 17:53
  • You're welcome. Now, what *specific* issue do you have? Something with tracking an object using Vision? Importing a trained model into CoreML? Displaying a bounding box? – dfd Jul 07 '17 at 17:56
  • Sorry, I accidentally pressed "Return" during my response; please see the edit above. @dfd – pythlang Jul 07 '17 at 17:57
  • @pythlang Hi, thanks for posting this question. I have the same objective as you. Did you achieve what you asked here? – jegadeesh Jan 11 '18 at 10:45
  • @jegadeesh Yes, actually I did! The answer below, combined with more independent research, trial and error, and long hours, proved to be worth it. – pythlang Feb 12 '18 at 22:08
  • Awesome! Nice work, dude. @pythlang – jegadeesh Feb 13 '18 at 05:59

1 Answer

Update: Apple now has a sample code project that does some of these steps. Read on for those you still need to figure out yourself...

Just about all of the pieces are there for what you want to do... you mostly just need to put them together.


You obtain ARFrames either by periodically polling the ARSession for its currentFrame or by having them pushed to your session delegate. (If you're building your own renderer, that's ARSessionDelegate; if you're working with ARSCNView or ARSKView, their delegate callbacks refer to the view, so you can work back from there to the session to get the currentFrame that led to the callback.)

ARFrame provides the current capturedImage in the form of a CVPixelBuffer.
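
For illustration, here's a minimal sketch of both approaches; the class name, the `process` helper, and the `sceneView` property are placeholders, not anything ARKit requires:

```swift
import ARKit

class FrameHandler: NSObject, ARSessionDelegate {
    // Pushed approach: ARKit calls this once per new frame,
    // after you set session.delegate to this object.
    func session(_ session: ARSession, didUpdate frame: ARFrame) {
        let pixelBuffer = frame.capturedImage   // CVPixelBuffer for this frame's camera image
        process(pixelBuffer)                    // hypothetical: hand it to Vision (see below)
    }

    // Placeholder for the Vision work sketched in the next step.
    func process(_ pixelBuffer: CVPixelBuffer) { }
}

// Polling approach, e.g. from a view controller that owns an ARSCNView:
// if let frame = sceneView.session.currentFrame {
//     let pixelBuffer = frame.capturedImage
// }
```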

You pass images to Vision for processing using either the VNImageRequestHandler or VNSequenceRequestHandler class, both of which have methods that take a CVPixelBuffer as an input image to process.
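
A hedged sketch of that wiring, continuing from the pixel buffer above. `YourModel` is a placeholder for whatever class Xcode generates from your .mlmodel file, and in a real app you'd create the model and request once rather than per frame:

```swift
import CoreML
import CoreVideo
import Vision

func runVision(on pixelBuffer: CVPixelBuffer) throws {
    // "YourModel" is hypothetical; substitute your Xcode-generated model class.
    let visionModel = try VNCoreMLModel(for: YourModel().model)

    let request = VNCoreMLRequest(model: visionModel) { request, error in
        // Inspect request.results here; see the observation types discussed below.
    }
    // Vision handles scaling/cropping the buffer to the model's input size for you.
    request.imageCropAndScaleOption = .centerCrop

    // One-off processing of a single image:
    let handler = VNImageRequestHandler(cvPixelBuffer: pixelBuffer, options: [:])
    try handler.perform([request])

    // Alternatively, for a stream of frames, keep one sequence handler around and reuse it:
    // let sequenceHandler = VNSequenceRequestHandler()
    // try sequenceHandler.perform([request], on: pixelBuffer)
}
```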

You can find general code for passing images to Vision + Core ML attached to the WWDC17 session on Vision, and if you watch that session the live demos also include passing CVPixelBuffers to Vision. (They get pixel buffers from AVCapture in that demo, but if you're getting buffers from ARKit the Vision part is the same.)


One sticking point you're likely to have is identifying/locating objects. Most "object recognition" models people use with Core ML + Vision (including those that Apple provides pre-converted versions of on their ML developer page) are scene classifiers. That is, they look at an image and say, "this is a picture of a (thing)," not something like "there is a (thing) in this picture, located at (bounding box)".

Vision provides easy API for dealing with classifiers — your request's results array is filled in with VNClassificationObservation objects that tell you what the scene is (or "probably is", with a confidence rating).
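
For example, the completion handler of the VNCoreMLRequest above might hand its request to something like this (a sketch; the 0.5 cutoff is an arbitrary value for illustration):

```swift
import Vision

// Hypothetical helper to call from the VNCoreMLRequest completion handler
// when the model is a classifier.
func handleClassification(for request: VNRequest) {
    guard let observations = request.results as? [VNClassificationObservation],
          let best = observations.first else { return }
    // Classification observations arrive sorted by confidence, highest first.
    if best.confidence > 0.5 {
        print("Scene is probably \(best.identifier) (confidence \(best.confidence))")
    }
}
```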

If you find or train a model that both identifies and locates objects — and for that part, I must stress, the ball is in your court — using Vision with it will result in VNCoreMLFeatureValueObservation objects. Those are sort of like arbitrary key-value pairs, so exactly how you identify an object from those depends on how you structure and label the outputs from your model.
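
As a very rough sketch of that case (how you decode the output is entirely up to your model's design, so this only shows where the data lands):

```swift
import CoreML
import Vision

// Hypothetical helper to call from the VNCoreMLRequest completion handler
// when the model's outputs are not plain classifications.
func handleFeatureValues(for request: VNRequest) {
    guard let observations = request.results as? [VNCoreMLFeatureValueObservation] else { return }
    for observation in observations {
        // Interpreting this depends on how you structured and labeled your
        // model's outputs (e.g. a grid of box coordinates and scores).
        if let multiArray = observation.featureValue.multiArrayValue {
            print("Got an output tensor with shape \(multiArray.shape)")
        }
    }
}
```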

If you're dealing with something that Vision already knows how to recognize, instead of using your own model — stuff like faces and QR codes — you can get the locations of those in the image frame with Vision's API.
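
For instance, a small sketch of the built-in detectors, with no Core ML model involved (`pixelBuffer` would again be the frame's capturedImage):

```swift
import CoreVideo
import Vision

func detectBuiltInFeatures(in pixelBuffer: CVPixelBuffer) {
    let faceRequest = VNDetectFaceRectanglesRequest { request, _ in
        guard let faces = request.results as? [VNFaceObservation] else { return }
        for face in faces {
            // boundingBox is normalized (0...1), origin at the lower left of the image.
            print("Face at \(face.boundingBox)")
        }
    }
    let barcodeRequest = VNDetectBarcodesRequest { request, _ in
        guard let codes = request.results as? [VNBarcodeObservation] else { return }
        for code in codes {
            print("Barcode at \(code.boundingBox): \(code.payloadStringValue ?? "no payload")")
        }
    }
    let handler = VNImageRequestHandler(cvPixelBuffer: pixelBuffer, options: [:])
    try? handler.perform([faceRequest, barcodeRequest])
}
```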


If after locating an object in the 2D image, you want to display 3D content associated with it in AR (or display 2D content, but with said content positioned in 3D with ARKit), you'll need to hit test those 2D image points against the 3D world.
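
A sketch of that hit test, assuming `sceneView` is an ARSCNView and `screenPoint` is the detected object's location already converted into view coordinates (Vision's bounding boxes are normalized with a lower-left origin, so they need converting first; ARFrame's displayTransform(for:viewportSize:) can help there):

```swift
import ARKit
import SceneKit

func placeMarker(at screenPoint: CGPoint, in sceneView: ARSCNView) {
    // Hit test the 2D point against planes and feature points in the 3D scene.
    let results = sceneView.hitTest(screenPoint, types: [.existingPlaneUsingExtent, .featurePoint])
    guard let result = results.first else { return }

    // Drop a small placeholder sphere at the hit location in world space.
    let node = SCNNode(geometry: SCNSphere(radius: 0.01))
    node.simdTransform = result.worldTransform
    sceneView.scene.rootNode.addChildNode(node)
}
```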

Once you get to this step, placing AR content with a hit test is something that's already pretty well covered elsewhere, both by Apple and the community.

rickster
  • Although I did know some of this information, putting it together in such a cohesive way is absolutely refreshing, so... wow, thanks so much for such a great, detailed response. I love answers like this: while everyone loves details and walkthroughs, the guide you have provided will keep me busy researching and implementing for a while. Thanks again @rickster – pythlang Jul 15 '17 at 23:46