I'm having the same problem at the moment. My current solution is below:
Quote from the paper you cited:
Puppeteering: Our model can also be used for virtual puppeteering and facial triggers. We built a small fully connected model that predicts 10 blend shape coefficients for the mouth and 8 blend shape coefficients for each eye. We feed the output of the attention mesh submodels to this blend shape network. In order to handle differences between various human faces, we apply Laplacian mesh editing to morph a canonical mesh into the predicted mesh [3]. This lets us use the blend shape coefficients for different human faces without additional fine-tuning. We demonstrate some results in Figure 5.
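For concreteness, here is a rough sketch of the kind of small fully connected blend shape network they describe (10 mouth + 8 + 8 eye coefficients). The 468-landmark input and the hidden sizes are my own guesses, not values from the paper:

```python
import torch
import torch.nn as nn

NUM_LANDMARKS = 468      # assumed face mesh landmark count (my guess, not from the paper)
NUM_COEFFS = 10 + 8 + 8  # mouth + left eye + right eye coefficients, as stated in the paper

class BlendShapeNet(nn.Module):
    """Small fully connected net: mesh landmarks -> blend shape coefficients."""
    def __init__(self, num_landmarks=NUM_LANDMARKS, num_coeffs=NUM_COEFFS):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(num_landmarks * 3, 256), nn.ReLU(),  # hidden sizes are placeholders
            nn.Linear(256, 128), nn.ReLU(),
            nn.Linear(128, num_coeffs),
            nn.Sigmoid(),  # assuming coefficients live in [0, 1]
        )

    def forward(self, landmarks):
        # landmarks: (batch, num_landmarks, 3) -> (batch, num_coeffs)
        return self.net(landmarks.flatten(start_dim=1))
```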
I think my approach at the moment is pretty much the same as what they've done.
My approach: first sample many random (blend shapes -> face mesh) pairs by driving a rigged 3D model with random blend shape weights and detecting the face mesh on it, then learn an inverse model from those pairs (a simple fully connected network would do).
You therefore end up with a model that predicts blend shapes given a face mesh. A rough sketch of that step is below.
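To make step one concrete, here is a sketch under a big simplifying assumption: I approximate the 3D model as a linear blend shape rig (neutral vertices plus weighted per-shape deltas) so the snippet stays self-contained. In practice you would pose the actual rigged model and run face mesh detection on the render; all names, sizes, and the placeholder rig below are mine, not from the paper.

```python
import numpy as np
import torch
import torch.nn as nn

NUM_LANDMARKS = 468  # assumed face mesh landmark count
NUM_COEFFS = 26      # 10 mouth + 8 per eye, as in the paper

rng = np.random.default_rng(0)
# Placeholder rig: random neutral mesh and per-blend-shape vertex deltas.
neutral = rng.normal(size=(NUM_LANDMARKS, 3)).astype(np.float32)
deltas = rng.normal(scale=0.1, size=(NUM_COEFFS, NUM_LANDMARKS, 3)).astype(np.float32)

# 1) Sample many random (blend shapes -> face mesh) pairs.
weights = rng.uniform(0.0, 1.0, size=(10000, NUM_COEFFS)).astype(np.float32)
meshes = neutral + np.einsum("bk,kij->bij", weights, deltas)  # linear blend shape model

# 2) Learn the inverse model: face mesh -> blend shapes.
model = nn.Sequential(
    nn.Flatten(),
    nn.Linear(NUM_LANDMARKS * 3, 256), nn.ReLU(),
    nn.Linear(256, NUM_COEFFS), nn.Sigmoid(),  # coefficients assumed to live in [0, 1]
)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x, y = torch.from_numpy(meshes), torch.from_numpy(weights)
for epoch in range(100):
    opt.zero_grad()
    loss = nn.functional.mse_loss(model(x), y)
    loss.backward()
    opt.step()
print("final training loss:", loss.item())
```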
The catch, which is also mentioned in the blurb above, is that you want to handle face mesh inputs from different people. In the blurb it seems that they sample the 3D model but transform the sampled mesh into the canonical face mesh, and hence end up with a canonical inverse model. At inference time you transform a given mesh into the canonical face mesh as well.
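As a much simpler stand-in for the Laplacian mesh editing they use, you could start with a plain similarity (Procrustes) alignment onto the canonical mesh before feeding a mesh to the inverse model. This is just a simplification on my end, not the paper's method:

```python
import numpy as np

def align_to_canonical(mesh, canonical):
    """Similarity-align `mesh` (N, 3) onto `canonical` (N, 3), assuming matching landmark order."""
    mu_m, mu_c = mesh.mean(axis=0), canonical.mean(axis=0)
    m, c = mesh - mu_m, canonical - mu_c
    # Optimal rotation via the Kabsch/Umeyama method (SVD of the cross-covariance).
    u, s, vt = np.linalg.svd(m.T @ c)
    d = np.sign(np.linalg.det(u @ vt))
    rot = u @ np.diag([1.0, 1.0, d]) @ vt
    scale = (s * np.array([1.0, 1.0, d])).sum() / (m ** 2).sum()
    return scale * (m @ rot) + mu_c
```

The same alignment would then be applied both when generating the training pairs and to any new mesh at inference time, so the inverse model only ever sees meshes in the canonical frame.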
Another solution might be to transform each person's face mesh directly into the 3D model's mesh instead.
I haven't done the canonical mesh part yet, but step one should work.
Best regards,
C