
Overview

I'm writing a simple Python program to calculate the distance to a chosen target from a pair of arbitrarily located stereo images. Both images are taken with a single camera with known intrinsics K, and although the camera locations and poses are arbitrary, I always know the baseline distance between the two locations.

I believe I'm having issues getting a good essential matrix from OpenCV, and I can't figure out whether that is actually the case or I'm just interpreting the results incorrectly. Here is a sample image pair: Stereo image pair (left camera on the left, right camera on the right). (If someone could put the images inline, that would be much appreciated :P)

My imagery is currently from an iPhone 12 camera, which almost certainly comes undistorted automatically. I still calibrated the camera anyway, but I am not undistorting anything at any point in my script. (Should I be?)
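In case undistortion does matter, my understanding is that the matched points could be undistorted before estimating E, roughly like the sketch below (dist would be the distortion coefficient vector from my calibration, and passing P=K keeps the points in pixel coordinates so the rest of the pipeline is unchanged):

import cv2
import numpy as np

# Sketch: undistort the matched pixel coordinates before estimating E.
# `dist` is the distortion coefficient vector from calibration (assumed available).
# P=K re-projects the normalized points back into pixel coordinates, so K can
# still be passed to cv2.findEssentialMat afterwards.
pts0_ud = cv2.undistortPoints(np.float32(pts0).reshape(-1, 1, 2), K, dist, P=K).reshape(-1, 2)
pts1_ud = cv2.undistortPoints(np.float32(pts1).reshape(-1, 1, 2), K, dist, P=K).reshape(-1, 2)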

Method

The process I currently use for computing geometry is as follows:

  1. Perform keypoint detection and matching to obtain two lists of corresponding image points (a rough sketch of this step is shown just after this list)
  2. Calculate the essential matrix E using OpenCV's 5-point algorithm cv2.findEssentialMat(). I use RANSAC here for outlier filtering.
  3. Recover rotation matrix R and translation unit vector t from E using OpenCV's cv2.recoverPose(). Here I also multiply t by the baseline distance to get real-world scale.
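
For completeness, step 1 in my script looks roughly like this (a sketch: SIFT features, a FLANN matcher, and Lowe's ratio test; the image filenames are placeholders):

import cv2
import numpy as np

# Sketch of step 1: detect and match keypoints (SIFT + FLANN + ratio test).
left_img = cv2.imread('left.jpg', cv2.IMREAD_GRAYSCALE)     # placeholder filenames
right_img = cv2.imread('right.jpg', cv2.IMREAD_GRAYSCALE)

sift = cv2.SIFT_create()
kp0, des0 = sift.detectAndCompute(left_img, None)
kp1, des1 = sift.detectAndCompute(right_img, None)

flann = cv2.FlannBasedMatcher({'algorithm': 1, 'trees': 5}, {'checks': 50})
matches = flann.knnMatch(des0, des1, k=2)

# Lowe's ratio test keeps only confident correspondences
good = [m for m, n in matches if m.distance < 0.7 * n.distance]
pts0 = np.float32([kp0[m.queryIdx].pt for m in good])
pts1 = np.float32([kp1[m.trainIdx].pt for m in good])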

Problems

The resulting calculated rotation and translation do not match expected results. I will start by asserting that my keypoint detection and matching is very good and is almost certainly not the culprit of my issues. The left image was taken head-on and the right image was taken 7m to the right and then angled in towards the scene, so I would expect a translation from the left camera to the right camera of t = [7, 0, 0]. Instead I am getting t = [-6.597, 0.256, 2.324] from cv2.recoverPose(). Could I be interpreting the coordinate frame of this result incorrectly? Could this instead be the translation from the right camera to the left? Or could cv2.recoverPose() be converging on the wrong solution (unlikely)?
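One interpretation check, based on OpenCV's documented convention for cv2.recoverPose() (the returned R and t map points from the first camera's frame into the second camera's frame, i.e. x2 = R @ x1 + t): t is then camera 1's position expressed in camera 2's frame, and camera 2's position in camera 1's frame is -R.T @ t. A quick sketch of the comparison I should probably be making:

import numpy as np

# OpenCV convention (per the recoverPose docs): x_cam2 = R @ x_cam1 + t.
# So t is camera 1's position expressed in camera 2's frame (up to scale),
# and camera 2's position in camera 1's frame is -R.T @ t.
def cam2_center_in_cam1(R, t, baseline):
    t_unit = t.reshape(3) / np.linalg.norm(t)
    return -R.T @ t_unit * baseline

# e.g. with the R, t returned by cv2.recoverPose() and the known 7 m baseline:
# C2 = cam2_center_in_cam1(R, t, 7.0)
# For my setup I'd expect C2 to be roughly [7, 0, 0] in camera 1's axes
# (x to the right, y down, z forward).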

Visualizing the epilines on the images also raises concern: Images with epilines. I read that the 8-point algorithm for finding the fundamental matrix is sensitive to noise, so I decided to back-solve for the fundamental matrix using F = Kinv.T @ E @ Kinv, where Kinv is the inverse of the intrinsics matrix K. Looking at the resulting corresponding epilines, it appears as if the "other" camera is to the left of the imaging camera in BOTH images, which obviously doesn't make sense. Additionally, changing the algorithm or RANSAC parameters sometimes drastically changes the resulting image: Same stereo images with different RANSAC parameters. The epipoles should not be visible in either image, and in this case the calculated R and t are obviously incorrect.
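As another check on F, the epipoles can be located directly (the right and left null vectors of F) and tested against the image bounds; a sketch:

import numpy as np

def epipole_in_image(F, width, height, which=1):
    # which=1: epipole in image 1 (right null vector, F @ e = 0)
    # which=2: epipole in image 2 (left null vector, F.T @ e = 0)
    M = F if which == 1 else F.T
    _, _, Vt = np.linalg.svd(M)
    e = Vt[-1]                     # null vector in homogeneous coordinates
    if abs(e[2]) < 1e-12:
        return e[:2], False        # epipole at (or near) infinity, not visible
    e = e / e[2]                   # normalize to pixel coordinates
    inside = (0 <= e[0] < width) and (0 <= e[1] < height)
    return e[:2], inside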

Is the poor performance a result of noise in the system? A very good homography is easily obtained (cv2.findHomography()), as visualized here: left image warped onto right. I curated this pair of images for best performance (by imaging a wall). Could the small amount of parallax in the system be adding noise? Noise is clearly filtered out for the homography, and I'm passing the 5-point algorithm nearly 1500 pairs of matched points for these images.
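Since I'm imaging a wall, one alternative I've been considering is to decompose the homography itself into candidate (R, t, n) solutions instead of going through E. A sketch, assuming H maps left-image points to right-image points and both views share the same K (note the translation from this decomposition is only known up to the unknown distance to the plane, so the known baseline would still be needed for scale):

import cv2
import numpy as np

# Sketch: estimate the homography and decompose it into candidate poses.
H, h_mask = cv2.findHomography(pts0, pts1, cv2.RANSAC, 3.0)
num, Rs, ts, normals = cv2.decomposeHomographyMat(H, K)
for R_cand, t_cand, n_cand in zip(Rs, ts, normals):
    print(R_cand, t_cand.ravel(), n_cand.ravel())
# The physically valid solution among the candidates can be selected with a
# cheirality check on the matched points (e.g.
# cv2.filterHomographyDecompByVisibleRefpoints).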

At the end of the day, I need to get an accurate R and t between the two perspectives, and it seems to me that I first need a better E matrix to do so. Any suggestions would be much appreciated, thank you in advance!

Code

Note: assume pts0 and pts1 are appropriately populated with good correspondences (left and right respectively).

import cv2
import numpy as np

# matched points in each image
pts0 = np.int32(<left image points here>)
pts1 = np.int32(<corresponding right image points here>)

Kinv = np.linalg.inv(K)  # K is the calibrated intrinsics matrix

E, mask = cv2.findEssentialMat(pts0, pts1, K, cv2.RANSAC, prob=0.99999, threshold=0.1)
F = Kinv.T @ E @ Kinv  # back-solve for the fundamental matrix from E

# recover pose to compare the calculated transform against the actual camera placement
_, R, t, _ = cv2.recoverPose(E, pts0, pts1, K, mask=mask)
t *= scale  # scale = known baseline distance; converts the unit translation to real-world units

pts0 = pts0[mask.ravel() == 1]
pts1 = pts1[mask.ravel() == 1]

"""
followed tutorial at <https://docs.opencv.org/4.x/da/de9/tutorial_py_epipolar_geometry.html>
for displaying epilines
"""
llines = cv2.computeCorrespondEpilines(pts1.reshape(-1, 1, 2), 2, F)
llines = llines.reshape(-1, 3)
out02, _ = drawlines(left_img, right_img, llines, pts0, pts1)
rlines = cv2.computeCorrespondEpilines(pts0.reshape(-1, 1, 2), 1, F)
rlines = rlines.reshape(-1, 3)
out01, _ = drawlines(right_img, left_img, rlines, pts1, pts0)

cv2.imshow('epilines', np.hstack((out02, out01)))
cv2.waitKey(0)
cv2.destroyAllWindows()

1 Answer


Looking at the images, it's likely (though hard to say for sure) that most of the features lie on a single plane, in which case estimating the essential matrix from these points has known degeneracies (see Multiple View Geometry in Computer Vision, section 11.9).

Also, testing on a simulated problem first helps to ensure your code is correct: e.g. sample N random 3D points, define two poses, project the points into the two corresponding images, and compare the output with the expected essential matrix derived from the two poses.
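
A minimal sketch of such a synthetic check (the intrinsics, point cloud, and poses below are made up for illustration):

import cv2
import numpy as np

rng = np.random.default_rng(0)
K = np.array([[800.0, 0, 320], [0, 800.0, 240], [0, 0, 1]])

# random 3D points in front of camera 1
pts3d = rng.uniform([-2, -2, 4], [2, 2, 8], size=(200, 3))

# ground-truth relative pose: camera 2 is 1 unit to the right and yawed slightly inward,
# expressed in OpenCV's convention x_cam2 = R_gt @ x_cam1 + t_gt
R_gt, _ = cv2.Rodrigues(np.array([[0.0], [0.2], [0.0]]))
t_gt = -R_gt @ np.array([[1.0], [0.0], [0.0]])

def project(K, X_cam):
    # pinhole projection of Nx3 camera-frame points to Nx2 pixel coordinates
    x = (K @ X_cam.T).T
    return x[:, :2] / x[:, 2:3]

pts1 = project(K, pts3d)                      # image 1
pts2 = project(K, pts3d @ R_gt.T + t_gt.T)    # image 2

E, mask = cv2.findEssentialMat(pts1, pts2, K, cv2.RANSAC, prob=0.999, threshold=1.0)
_, R, t, _ = cv2.recoverPose(E, pts1, pts2, K, mask=mask)

print("rotation error:", np.linalg.norm(R - R_gt))
print("t (unit):", t.ravel(), "expected:", (t_gt / np.linalg.norm(t_gt)).ravel())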