
We crawled a set of images from the Google Street View (GSV) API. I want to estimate the 3D world coordinates of points in a 2D image, given the following:

1. The GPS location (i.e., latitude and longitude) of the camera capturing the image

Conversion of GPS coordinates to a translation vector: I used two conversion methods to obtain the translation, UTM conversion and conversion to Cartesian coordinates.

  • UTM conversion: used Python's utm library to convert the GPS coordinates to UTM coordinates, then used the northing and easting values with a fixed height to build the translation vector.
  • Cartesian conversion: used the following formulas to generate the translation vector (see the sketch after these formulas):

x = Radius*math.cos(math.radians(latitude))*math.cos(math.radians(longitude))

y = Radius*math.cos(math.radians(latitude))*math.sin(math.radians(longitude))

z = Radius*math.sin(math.radians(latitude))

(Note that math.cos and math.sin expect radians, while GPS latitude and longitude are given in degrees.)
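For concreteness, here is a minimal sketch of both conversions. The utm package's from_latlon function is real; the function names, the fixed-height default, and the spherical Earth radius are my own placeholders:

```python
import math
import utm  # pip install utm

def translation_utm(latitude, longitude, fixed_height=0.0):
    """Translation from UTM easting/northing plus a fixed height (placeholder name)."""
    easting, northing, zone_number, zone_letter = utm.from_latlon(latitude, longitude)
    return (easting, northing, fixed_height)

def translation_cartesian(latitude, longitude, radius=6_371_000.0):
    """Spherical-Earth Cartesian conversion; trig functions expect radians."""
    lat, lon = math.radians(latitude), math.radians(longitude)
    x = radius * math.cos(lat) * math.cos(lon)
    y = radius * math.cos(lat) * math.sin(lon)
    z = radius * math.sin(lat)
    return (x, y, z)
```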

2. The rotation matrix calculated using OpenSfM (a structure-from-motion library).

The library provides alpha, beta, and gamma angles (in radians), which map to the yaw, pitch, and roll angles, respectively. The rotation matrix is constructed using the formula from http://planning.cs.uiuc.edu/node102.html:

Rotation matrix: R(alpha, beta, gamma) = R_z(alpha) * R_y(beta) * R_x(gamma)
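A small sketch of that construction, following the elementary rotation matrices from the linked page (the function name is mine):

```python
import numpy as np

def rotation_matrix(alpha, beta, gamma):
    """R = R_z(alpha) @ R_y(beta) @ R_x(gamma), angles in radians (yaw, pitch, roll)."""
    ca, sa = np.cos(alpha), np.sin(alpha)
    cb, sb = np.cos(beta), np.sin(beta)
    cg, sg = np.cos(gamma), np.sin(gamma)
    Rz = np.array([[ca, -sa, 0], [sa, ca, 0], [0, 0, 1]])   # yaw about z
    Ry = np.array([[cb, 0, sb], [0, 1, 0], [-sb, 0, cb]])   # pitch about y
    Rx = np.array([[1, 0, 0], [0, cg, -sg], [0, sg, cg]])   # roll about x
    return Rz @ Ry @ Rx
```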

3. Based on the field-of-view angle and the dimensions of the image, we estimate the calibration matrix as follows (see https://codeyarns.com/2015/09/08/how-to-compute-intrinsic-camera-matrix-for-a-camera/):

K = [[f_x, s, x],
     [0, f_y, y],
     [0,   0, 1]]

where s is the skew (assumed zero), and x and y are half of the image dimensions (i.e., x = width/2 and y = height/2).

The GSV API provides the field-of-view angle θ in degrees (e.g., 45 or 80), so the focal lengths can be calculated as

f_x = x / tan(θ/2)

f_y = y / tan(θ/2)
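Putting these together, a minimal sketch of the intrinsic matrix estimate (it mirrors the formulas above, so it assumes the same field of view horizontally and vertically; the function name and zero-skew default are mine):

```python
import math
import numpy as np

def intrinsic_matrix(width, height, fov_deg, skew=0.0):
    """Estimate K from the image size and field-of-view angle (in degrees)."""
    x, y = width / 2.0, height / 2.0         # principal point at the image center
    half_fov = math.radians(fov_deg) / 2.0   # GSV provides the fov in degrees
    f_x = x / math.tan(half_fov)
    f_y = y / math.tan(half_fov)
    return np.array([[f_x, skew, x],
                     [0.0, f_y, y],
                     [0.0, 0.0, 1.0]])
```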

Using the matrices T, R, and K, how can we estimate the 3D world coordinates of each pixel in the 2D image?

fararjeh
1 Answer


This is impossible from a single image: 3D depth information is lost in the projection. And it is very hard to do with any accuracy from the data you have, even if you use multiple images.
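To make the loss of depth concrete: with only K, R, and T, a pixel back-projects to a ray, not a point; every depth along the ray projects to the same pixel. A hedged sketch (the function name is mine, and it assumes a pinhole model, which, as explained below, Street View imagery does not actually satisfy):

```python
import numpy as np

def backproject_ray(K, R, T, u, v):
    """All world points X = T + d * direction (d > 0) project to pixel (u, v).

    Assumes R maps world axes to camera axes and T is the camera position
    in world coordinates, as constructed in the question.
    """
    pixel = np.array([u, v, 1.0])
    direction = R.T @ np.linalg.inv(K) @ pixel  # ray direction in the world frame
    return T, direction / np.linalg.norm(direction)
```

Recovering a single 3D point per pixel requires choosing the depth d, which none of the given matrices provide.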

The GSV API does not give you the original image data, but rather images already projected into cubical panoramas, following a pipeline of transformations whose goal is to enhance the visual appearance of the final panorama. Moreover, the original images themselves are captured with rolling-shutter cameras from a moving platform, so the standard pinhole model does not apply to them, quite apart from the nonlinear lens distortion. Attempting to run structure from motion on Street View images is bound to be an endless disappointment, unless you know exactly what you are doing and are working at Google with access to internal data.

The "real" way to do it is to register the LIDAR data collected by the same vehicles with the imagery. Google does this internally, but I don't believe they have ever exposed the results into an externally accessible product.

Francesco Callari