I am trying to write a program from scratch that can estimate the pose of a camera. I am open to any programming language and using inbuilt functions/methods for feature detection...

I have been exploring different ways of estimating pose, like SLAM, PTAM, DTAM, etc., but I don't really need tracking and mapping; I just need the pose.

Can any of you suggest an approach or any resource that can help me? I know what pose is and have a rough idea of how to estimate it, but I am unable to find any resources that explain how it can be done.

I was thinking of starting with a recorded video, extracting features from it, and then using these features and geometry to estimate the pose.
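Roughly, the first step I have in mind looks like this (a minimal sketch with OpenCV's built-in ORB detector in Python; the file name and parameters are placeholders):

```python
import cv2

cap = cv2.VideoCapture("video.mp4")   # placeholder path to the recorded video
orb = cv2.ORB_create(nfeatures=1000)  # any inbuilt detector would do
matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)

ok, prev = cap.read()
assert ok, "could not read the first frame"
prev_gray = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)
prev_kp, prev_des = orb.detectAndCompute(prev_gray, None)

while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    kp, des = orb.detectAndCompute(gray, None)
    # Matched keypoint pairs between consecutive frames; these 2D
    # correspondences are what the geometry step would consume.
    matches = matcher.match(prev_des, des)
    prev_kp, prev_des = kp, des

cap.release()
```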

(Please forgive my naivety, I am not a computer vision person and am fairly new to all of this)

Rohit H.S.

2 Answers


In order to compute a camera pose, you need a reference frame that is given by some known points in the image. These known points come, for example, from a calibration pattern, but they can also be known landmarks in your images (for example, the 4 corners of the base of the Giza pyramids).

The problem of estimating the pose of the camera given known landmarks seen by the camera (i.e., recovering the camera's 3D position and orientation from 2D-3D point correspondences) is classically known as PnP (Perspective-n-Point). OpenCV provides a ready-made solver for this problem.
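A minimal sketch of that solver in Python; the landmark coordinates, their pixel positions, and the intrinsics K below are all made-up placeholders, not calibrated values:

```python
import numpy as np
import cv2

# Known 3D landmarks in a world frame (e.g., corners of a 1 m square
# marker). Purely illustrative values.
object_points = np.array([[0., 0., 0.],
                          [1., 0., 0.],
                          [1., 1., 0.],
                          [0., 1., 0.]])

# Where those landmarks were detected in the image (pixels); placeholders.
image_points = np.array([[320., 240.],
                         [420., 245.],
                         [415., 345.],
                         [315., 340.]])

# Intrinsics from a prior calibration (see the next paragraph).
K = np.array([[800., 0., 320.],
              [0., 800., 240.],
              [0., 0., 1.]])
dist = np.zeros(5)  # assume negligible lens distortion

ok, rvec, tvec = cv2.solvePnP(object_points, image_points, K, dist)

# solvePnP returns the transform that maps world points into the camera
# frame; inverting it yields the camera's own pose in the world frame
# (the distinction raised in the comments below).
R, _ = cv2.Rodrigues(rvec)
camera_position = -R.T @ tvec
print(camera_position.ravel())
```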

However, you first need to calibrate your camera, i.e., you need to determine what makes it unique. The parameters you need to estimate are called intrinsic parameters, because they depend on the camera's focal length, sensor size, etc., but not on the camera's location or orientation. These parameters mathematically describe how world points are projected onto your camera's sensor frame. You can estimate them from known planar patterns (again, OpenCV has ready-made functions for that).
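A sketch of that calibration step with a planar chessboard, using OpenCV's ready-made functions (the 9x6 board size and the image paths are assumptions):

```python
import glob
import numpy as np
import cv2

pattern = (9, 6)  # inner-corner count of the chessboard (assumed)

# 3D coordinates of the corners in the board's own plane (z = 0),
# in units of one square side.
objp = np.zeros((pattern[0] * pattern[1], 3), np.float32)
objp[:, :2] = np.mgrid[0:pattern[0], 0:pattern[1]].T.reshape(-1, 2)

obj_points, img_points = [], []
for path in glob.glob("calib/*.png"):  # hypothetical calibration shots
    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    found, corners = cv2.findChessboardCorners(gray, pattern)
    if found:
        obj_points.append(objp)
        img_points.append(corners)

rms, K, dist, rvecs, tvecs = cv2.calibrateCamera(
    obj_points, img_points, gray.shape[::-1], None, None)
print("RMS reprojection error:", rms)
print("intrinsic matrix K:\n", K)
```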

sansuiso
    Thanks for the reply! But can I find the pose of the camera itself using this technique? From what I understand this gives us the pose of the objects inside the image... – Rohit H.S. Nov 26 '14 at 17:37
  • The output of the PnP solver is a 3D rotation and a 3D translation that describe the orientation and position (respectively) with respect to the known landmarks. – sansuiso Nov 26 '14 at 18:44
  • Should you care about camera calibration if you aren't looking for absolute scale/values, but rather relative values? I think, e.g., in the case of SLAM (where you don't do any distance measurement or something similar), this step is unnecessary. I am referring to normal cameras here, not cameras with fisheye lenses or something similar... – privetDruzia Sep 10 '17 at 11:04
  • If you don't calibrate the intrinsics of a camera but still try to use 2 views (or equivalently 2 cameras) then you'll get a correct reconstruction up to a homography, which is worse than just missing a scale. Knowledge of the fundamental matrix is a requirement for 2 view geometry. If you know a bit more about your cameras then you can get results up to a given scale. – sansuiso Sep 10 '17 at 15:38
  • @RohitH.S. You are right; if you are interested in finding the camera pose relative to the calibration object's frame, you need to perform a coordinate transformation between the calibration object frame and the camera frame. You may find this link helpful: https://ksimek.github.io/2012/08/22/extrinsic/ – el psy Congroo Aug 25 '23 at 07:03

Generally, you can extract the pose of a camera only relative to a given reference frame. It is quite common to estimate the relative pose between one view of a camera and another view. The most general relationship between two views of the same scene from two different cameras is given by the fundamental matrix (google it). You can calculate the fundamental matrix from correspondences between the images (for example, see the Matlab implementation: http://www.mathworks.com/help/vision/ref/estimatefundamentalmatrix.html). After calculating this, you can use a decomposition of the fundamental matrix to get the relative pose between the cameras (look here for an example: http://www.daesik80.com/matlabfns/function/DecompPMatQR.m).
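The links above are Matlab; the equivalent estimation step in OpenCV/Python looks roughly like this. The correspondences below are fabricated just so the snippet runs on its own; in practice they come from real feature matches between the two images:

```python
import numpy as np
import cv2

# Fabricated matched points: a planar scene under a pure horizontal
# shift, only to make the snippet self-contained.
rng = np.random.default_rng(0)
pts1 = rng.uniform(0, 640, (30, 2))
pts2 = pts1 + np.array([40.0, 0.0])

F, inlier_mask = cv2.findFundamentalMat(pts1, pts2, cv2.FM_RANSAC)
print(F)  # 3x3 rank-2 matrix satisfying x2' * F * x1 = 0 for inliers
```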

You can follow a similar procedure if you have a calibrated camera; in that case you need the essential matrix instead of the fundamental matrix.
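A sketch of that calibrated case; the intrinsics and the small synthetic two-view geometry are invented so the example is self-contained. Note that cv2.recoverPose also resolves the four-fold decomposition ambiguity that comes up in the comments below:

```python
import numpy as np
import cv2

K = np.array([[800., 0., 320.],
              [0., 800., 240.],
              [0., 0., 1.]])  # assumed prior calibration

# Synthetic correspondences: random 3D points seen from two poses that
# differ by a small known motion (illustration only).
rng = np.random.default_rng(0)
X = rng.uniform([-1, -1, 4], [1, 1, 8], (50, 3))
rvec_true = np.array([0.0, 0.1, 0.0])     # small yaw between the views
t_true = np.array([[0.5], [0.0], [0.0]])  # baseline along x
zero = np.zeros(3)
pts1 = cv2.projectPoints(X, zero, zero, K, None)[0].reshape(-1, 2)
pts2 = cv2.projectPoints(X, rvec_true, t_true, K, None)[0].reshape(-1, 2)

E, mask = cv2.findEssentialMat(pts1, pts2, K, method=cv2.RANSAC)
n_inliers, R, t, _ = cv2.recoverPose(E, pts1, pts2, K)
# R and t are the second view's rotation and translation *direction*
# relative to the first; the magnitude of t is unrecoverable (see the
# scale discussion in the comments below).
print(cv2.Rodrigues(R)[0].ravel(), t.ravel())
```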

ezfn
  • Thanks ezfn! So this is my understanding as of now; let me know if I am on the right track. 1. Extract features from 2 consecutive images (say, in a video) using maybe SURFPoints or cornerPoints. 2. Create 2 different matrices with the coordinates of these points, then pass these 2 to the estimateFundamentalMatrix function. 3. The output will be a fundamental matrix. 4. Using DecompPMatQR, decompose the fundamental matrix to get 3 matrices: intrinsic matrix, rotation, and translation, which should give me the rotation and translation of the camera between the two images. – Rohit H.S. Nov 26 '14 at 16:06
  • Cool, I am going to try this approach first, will let you know how it goes! – Rohit H.S. Nov 26 '14 at 19:32
  • Hey @ezfn, this process is working so far. I got the fundamental matrix, and then using the camera intrinsics (E = K'FK) I was able to get the essential matrix. I was then able to decompose E using SVD to get the rotation matrix, which gives theta x, y and z, and the translation matrix. I don't know how I can extract the translation from the translation matrix. The matrix itself looks like this in Matlab: [0,-Tz,Ty; Tz,0,-Tx; -Ty,Tx,0]. I need the translation along the X, Y and Z axes. Are these values the translation? If yes, then what is the scale, and is it - or +? If not, then how do I get the translation? – Rohit H.S. Nov 30 '14 at 00:07
  • Hi, there are 4 possible solutions for the essential matrix decomposition. – ezfn Dec 01 '14 at 08:33
  • Quoting from Wikipedia(http://en.wikipedia.org/wiki/Essential_matrix): "It turns out, however, that only one of the four classes of solutions can be realized in practice. Given a pair of corresponding image coordinates, three of the solutions will always produce a 3D point which lies behind at least one of the two cameras and therefore cannot be seen. Only one of the four classes will consistently produce 3D points which are in front of both cameras. This must then be the correct solution. Still, however, it has an undetermined positive scaling related to the translation component." – ezfn Dec 01 '14 at 08:34
  • Indeed, {Tx,Ty,Tz} are the translations up to a sign. You can verify which decomposition is correct (the sign of {Tx,Ty,Tz} and the direction of R) by using the constraint that all points must be in front of both cameras; see the sketch after these comments. – ezfn Dec 01 '14 at 08:39
  • About the scale: you cannot determine the scale, since you don't have any physical knowledge about the scene (it could be a small, close dollhouse or a huge, distant mansion). You can recover the real scale by factoring in some measured length in the scene. Clear enough? – ezfn Dec 01 '14 at 08:42
  • Awesome, thank you so much! The end goal is to make a 3D model copy the camera's pose as it records a video. Now the new problem I am facing is that there might be a problem with my fundamental matrix, which leads to ThetaZ (from R) being mostly equal to ~90°, which is not true. Here is a link to the question and code: http://stackoverflow.com/questions/27223361/error-in-fundamental-matrix Can you please help me out if you have time? Thank you so much for all your help so far! – Rohit H.S. Dec 01 '14 at 16:54
  • If we are done with this post and you are satisfied with the answer, I'll be happy if you can mark it as the best answer for your question. – ezfn Dec 01 '14 at 18:01
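
For completeness, here is a minimal NumPy sketch (function name invented) of the four-candidate decomposition of E discussed in the comments above; the correct candidate is the one whose triangulated points lie in front of both cameras, i.e., the cheirality check quoted from Wikipedia:

```python
import numpy as np

def decompose_essential(E):
    """Return the four (R, t) candidates for an essential matrix E."""
    U, _, Vt = np.linalg.svd(E)
    # Flip signs so both factors are proper rotations (det = +1);
    # E is only defined up to sign, so this is harmless.
    if np.linalg.det(U) < 0:
        U = -U
    if np.linalg.det(Vt) < 0:
        Vt = -Vt
    W = np.array([[0., -1., 0.],
                  [1.,  0., 0.],
                  [0.,  0., 1.]])
    R1, R2 = U @ W @ Vt, U @ W.T @ Vt
    t = U[:, 2]  # translation direction: up to sign and overall scale
    return [(R1, t), (R1, -t), (R2, t), (R2, -t)]

# Tiny exercise: E = [t]x R for a pure x-translation and identity rotation.
t_true = np.array([1.0, 0.0, 0.0])
E = np.array([[0., -t_true[2], t_true[1]],
              [t_true[2], 0., -t_true[0]],
              [-t_true[1], t_true[0], 0.]])
for R, t in decompose_essential(E):
    print(np.round(R, 3), np.round(t, 3))
```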