
Has anybody tried developing a SLAM system that uses deep-learned features instead of the classical AKAZE/ORB/SURF features?

Scanning recent computer vision conferences, there seem to be quite a few reports of successfully using neural nets to extract features and descriptors, and benchmarks indicate that they may be more robust than their classical computer vision equivalents. I suspect that extraction speed is an issue, but assuming one has a decent GPU (e.g. an NVIDIA GTX 1050), is it even feasible to build a real-time SLAM system running at, say, 30 FPS on 640x480 grayscale images with deep-learned features?
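To give a sense of the compute budget, one can simply time the forward pass at the target resolution and compare it to the ~33 ms available per frame at 30 FPS. Here is a minimal PyTorch sketch; the small CNN is a made-up stand-in for whatever learned detector/descriptor one would actually use (e.g. a SuperPoint-style network):

```python
import time
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

# Toy stand-in for a learned detector/descriptor net; swap in the real model.
net = torch.nn.Sequential(
    torch.nn.Conv2d(1, 64, 3, padding=1), torch.nn.ReLU(),
    torch.nn.Conv2d(64, 128, 3, stride=2, padding=1), torch.nn.ReLU(),
    torch.nn.Conv2d(128, 256, 3, stride=2, padding=1),
).to(device).eval()

frame = torch.rand(1, 1, 480, 640, device=device)  # one 640x480 grayscale frame

with torch.no_grad():
    for _ in range(10):              # warm-up (CUDA init, cudnn autotuning)
        net(frame)
    if device == "cuda":
        torch.cuda.synchronize()     # don't start the clock with kernels pending
    t0 = time.perf_counter()
    n = 100
    for _ in range(n):
        net(frame)
    if device == "cuda":
        torch.cuda.synchronize()     # wait for all kernels before stopping the clock
    ms = 1000 * (time.perf_counter() - t0) / n

print(f"{ms:.1f} ms/frame; budget is ~33 ms at 30 FPS")
```

Extraction is of course only part of the per-frame budget; matching, tracking, and mapping still have to fit in whatever is left.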

Daniel Danciu

1 Answer


This was a bit too long for a comment, so I'm posting it as an answer.

I think it is feasible, but I don't see how this would be useful. Here is why (please correct me if I'm wrong):

  • In most SLAM pipelines, precision is more important than long-term robustness. You obviously need your feature detections/matches to be precise to get reliable triangulation/bundle adjustment (or whatever equivalent scheme you might use); see the back-of-the-envelope example after this list. However, the high level of robustness that neural networks provide is only needed in systems that do relocalization/loop closure over long time intervals (e.g. relocalization across different seasons). Even in such scenarios, since you already have a GPU, I think it would be better to use a photometric (or even just geometric) model of the scene for localization.

  • We don't have any reliable noise models for the features detected by neural networks. I know there have been a few interesting works (Gal, Kendall, etc.) on propagating uncertainties through deep networks, but these methods still seem too immature for deployment in SLAM systems; the sketch after this list shows what such an approach might look like.

  • Deep learning methods are usually good for initializing a system, and the solution they provide needs to be refined. Their results depend too heavily on the training dataset and tend to be hit-or-miss in practice. So I think you could trust them for an initial guess, or for additional constraints (e.g. in pose estimation: if you have a geometric algorithm that drifts over time, you can use the output of a neural network to constrain it; but the absence of a noise model, mentioned previously, will make that fusion a bit difficult).
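To make the precision point in the first bullet concrete, here is a back-of-the-envelope computation (the camera numbers are purely illustrative) of how a one-pixel matching error maps to depth error when triangulating from two views, using the stereo relation z = f*b/d:

```python
# Depth from disparity: z = f*b/d, so a disparity error dd gives
# dz ~= z^2 / (f*b) * dd. Numbers below are purely illustrative.
f = 500.0   # focal length [px]
b = 0.10    # baseline / translation between views [m]
dd = 1.0    # matching error [px]

for z in (1.0, 5.0, 10.0):
    dz = z**2 / (f * b) * dd
    print(f"point at {z:4.1f} m -> ~{dz:.2f} m depth error per pixel of mismatch")
```

The error grows quadratically with depth, which is why sub-pixel precision of the matches matters so much to the back-end.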
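As for the noise-model bullet, this is roughly what Monte-Carlo dropout (the Gal line of work) would look like if bolted onto a feature extractor: keep dropout stochastic at test time and treat the spread of repeated forward passes as a crude uncertainty estimate. The network here is a toy placeholder; only the mechanism matters:

```python
import torch

# Toy stand-in for a feature extractor with dropout; not a real detector.
net = torch.nn.Sequential(
    torch.nn.Conv2d(1, 32, 3, padding=1), torch.nn.ReLU(),
    torch.nn.Dropout2d(p=0.2),
    torch.nn.Conv2d(32, 64, 3, padding=1),
)

def mc_dropout_descriptors(patch, n_samples=20):
    net.train()  # .train() keeps dropout stochastic at inference time
    with torch.no_grad():
        samples = torch.stack([net(patch) for _ in range(n_samples)])
    # mean descriptor map + per-element variance across the MC samples
    return samples.mean(0), samples.var(0)

patch = torch.rand(1, 1, 32, 32)
mean, var = mc_dropout_descriptors(patch)
print(var.mean())  # higher variance -> less trustworthy feature
```

Even then, these variances are not calibrated measurement covariances, which is exactly what makes fusing them into a SLAM back-end awkward.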

So yes, I think it is feasible, and that with careful engineering and tuning you could probably produce a few interesting demos, but I wouldn't trust it in real life.

Ash
  • Thanks Ash; so, if I understand your answer correctly, in your experience deep-learned features suffer from precision issues, in addition to lacking an adequate noise model. That's good to know. BTW, at some point we made a half-hearted attempt at using learned depth to initialize the SLAM system, but abandoned it in the end. We also had much better success with geometric features than with photometric ones. – Daniel Danciu Oct 08 '18 at 19:40
  • @DanielDanciu Yes, that's what I think at the moment... May I ask for a few details about why you abandoned your learned-depth-based initialization? I'm just curious how you tried to do that fusion (was it some sort of constrained bundle adjustment, or just an initialization using those depth values followed by some sort of geometric refinement?)... Yeah, I agree that what I said about using a photometric model wasn't very intelligent (especially for long-term operation, thanks for correcting me), but I think a geometric model would be fine (like building models and so on). – Ash Oct 08 '18 at 20:22
  • Re depth integration: at some point we were using a depth filter to refine the depth of the 3D map points, as described here: http://rpg.ifi.uzh.ch/docs/TRO17_Forster-SVO.pdf. One could use the predicted depth to initialize the depth filter. We played around with that a bit, but it didn't seem like a promising path to pursue (compared to simply initializing by triangulating direct or geometric matches), so we abandoned it. In the end we actually abandoned the whole depth-filter idea and simply use triangulation plus a structure-only bundle adjustment instead. – Daniel Danciu Oct 10 '18 at 08:08
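A simplified sketch of the initialization described in this last comment (all numbers hypothetical; the real SVO depth filter models inverse depth with a Gaussian + uniform-outlier mixture, so this only captures the flavor):

```python
# Simplified per-point Gaussian depth filter seeded from a CNN depth
# prediction instead of a wide uninformative prior.
class DepthFilter:
    def __init__(self, z0, sigma0):
        self.mu = z0              # prior mean from the network prediction
        self.var = sigma0 ** 2    # prior variance (keep it fat: nets are coarse)

    def update(self, z_meas, sigma_meas):
        # standard Gaussian fusion with a triangulated depth measurement
        k = self.var / (self.var + sigma_meas ** 2)
        self.mu += k * (z_meas - self.mu)
        self.var *= 1.0 - k

# e.g. the network predicts 4.2 m; triangulated observations then refine it
flt = DepthFilter(z0=4.2, sigma0=1.0)
for z in (3.9, 4.0, 3.95):
    flt.update(z, sigma_meas=0.2)
print(f"depth {flt.mu:.2f} m +/- {flt.var ** 0.5:.2f} m")
```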