Paper
- Gaze Estimation from Multimodal Kinect Data
- authors: Kenneth Alberto Funes Mora and Jean-Marc Odobez
- affiliations: Idiap Research Institute, CH-1920, Martigny, Switzerland; École Polytechnique Fédérale de Lausanne, CH-1015, Lausanne, Switzerland
- link: https://ieeexplore.ieee.org/document/6239182/
Gaze Estimation
- tracking where a person is looking.
Summary
- exploit the depth sensor to accurately track a 3D mesh model and robustly estimate the person's head pose
- compute the person's eye-in-head gaze direction from the RGB image modality.
Related Works
Pipeline
- a) Offline step.
From multiple 3D face instances, the 3DMM is fitted to obtain a person-specific 3D model.
- b)-d) Online steps.
- b) The person model is registered at each instant to multimodal data to retrieve the head pose. In the figure, the model is rendered with a horizontal spacing for visualization. The region used for tracking is rendered in purple.
- c) Head stabilization computed from the inverse head pose parameters and 3D mesh, creating a frontal pose face image. Further steps show the gaze estimation in the head coordinate system. The final gaze vector is corrected according to the estimated head pose.
- d) Obtained gaze vectors (in red our estimation and in green the ground truth).
3D Morphable Model/Basel Face Model
The faces are parameterized as triangular meshes with m = 53490 vertices and shared topology.
- vertices:
- $(x_j, y_j, z_j)^T \in \mathbb{R}^3$
- $s = (x_1, y_1, z_1, \dots, x_m, y_m, z_m)^T$
- colors:
- $(r_j, g_j, b_j)^T \in [0, 1]^3$
- $t = (r_1, g_1, b_1, \dots, r_m, g_m, b_m)^T$
BFM assumes independence between shape and texture, constructing two independent linear models, $M_s = (\mu_s, \sigma_s, U_s)$ and $M_t = (\mu_t, \sigma_t, U_t)$:
- $s(\alpha) = \mu_s + U_s\,\mathrm{diag}(\sigma_s)\,\alpha$
- $t(\beta) = \mu_t + U_t\,\mathrm{diag}(\sigma_t)\,\beta$
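A minimal NumPy sketch of evaluating these linear models (array shapes follow the definitions above; the actual $\mu$, $\sigma$, $U$ are loaded from the BFM data files, and the function names here are illustrative):

```python
import numpy as np

def sample_shape(mu_s, sigma_s, U_s, alpha):
    """Evaluate s(alpha) = mu_s + U_s diag(sigma_s) alpha.

    mu_s:    (3m,)   mean shape, vertices flattened as (x1, y1, z1, ..., z_m)
    sigma_s: (k,)    per-component standard deviations
    U_s:     (3m, k) orthonormal principal components
    alpha:   (k,)    shape coefficients; alpha = 0 yields the mean face
    """
    return mu_s + U_s @ (sigma_s * alpha)

def sample_texture(mu_t, sigma_t, U_t, beta):
    """t(beta) = mu_t + U_t diag(sigma_t) beta -- identical form to the shape model."""
    return mu_t + U_t @ (sigma_t * beta)
```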
Pose Tracking/ICP (Iterative Closest Points)
- Given: two corresponding point sets,
- $X = \{x_1, \dots, x_{N_p}\}$, $P = \{p_1, \dots, p_{N_p}\}$
- Wanted: translation $t$ and rotation $R$ that minimize the sum of squared errors:
- $E(R, t) = \frac{1}{N_p}\sum_{i=1}^{N_p} \|x_i - R p_i - t\|^2$
- solution: closed form via the SVD of the centered cross-covariance matrix (see the sketch below)
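A minimal NumPy/SciPy sketch of this closed-form alignment wrapped in the closest-point iteration. This is the generic algorithm, not the paper's tracker, which registers the person-specific mesh to the depth data and restricts matching to the rigid face region:

```python
import numpy as np
from scipy.spatial import cKDTree

def rigid_align(P, X):
    """Closed-form R, t minimizing (1/N) sum ||x_i - R p_i - t||^2.

    P, X: (N, 3) arrays of corresponding points.
    """
    p_bar, x_bar = P.mean(axis=0), X.mean(axis=0)
    H = (P - p_bar).T @ (X - x_bar)          # 3x3 cross-covariance matrix
    U, _, Vt = np.linalg.svd(H)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])  # avoid reflections
    R = Vt.T @ D @ U.T
    return R, x_bar - R @ p_bar

def icp(P, X, n_iters=30):
    """Minimal ICP loop: match closest points, re-solve the closed form."""
    R, t = np.eye(3), np.zeros(3)
    tree = cKDTree(X)
    for _ in range(n_iters):
        _, idx = tree.query(P @ R.T + t)     # nearest target point per source point
        R, t = rigid_align(P, X[idx])
    return R, t
```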
Head stabilization
render the scene using the inverse rigid transformation of the head pose parameters: $p_t^{-1} = \{R_t^T, -R_t^T t_t\}$
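A small sketch of that inversion, assuming the pose $\{R_t, t_t\}$ maps head coordinates to camera coordinates:

```python
import numpy as np

def invert_pose(R_t, t_t):
    """Inverse of the rigid transform x_cam = R_t @ x_head + t_t,
    i.e. p_t^{-1} = {R_t^T, -R_t^T t_t}."""
    return R_t.T, -R_t.T @ t_t

# Rendering the textured mesh under (R_t^T, -R_t^T t_t) maps the tracked face
# back to a canonical frontal pose, from which the eye regions are cropped.
```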
Eye-in-Head Gaze Estimation/Adaptive Linear Regression (ALR)
Key idea:
- adaptively find the subset of training samples in which the test sample is most linearly representable (a plain regression sketch follows this list).
eye appearance feature extraction:
- feature vector: $e_i = \frac{[s_1, s_2, \dots, s_{r \times c}]^T}{\sum_j s_j}$
- $E = [e_1, e_2, \dots, e_n] \in \mathbb{R}^{m \times n}$, $X = [x_1, x_2, \dots, x_n] \in \mathbb{R}^{2 \times n}$
- $AE = X$
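A minimal sketch of the feature extraction and a plain least-squares fit of $A$ in $AE = X$. The actual ALR method additionally performs the adaptive, sparse selection of training samples per test sample described above, which is omitted here; names and shapes are illustrative:

```python
import numpy as np

def eye_feature(eye_img):
    """Raster-scan an (r, c) eye crop and normalize by the total intensity,
    giving the feature vector e_i defined above."""
    s = eye_img.astype(np.float64).ravel()
    return s / s.sum()

def fit_gaze_mapping(E, X):
    """Least-squares A solving AE = X.

    E: (m, n) matrix whose columns are eye features e_i (m = r*c)
    X: (2, n) gaze angles (e.g. yaw/pitch) for the n training samples
    """
    # AE = X  <=>  E^T A^T = X^T, solved in the least-squares sense
    At, *_ = np.linalg.lstsq(E.T, X.T, rcond=None)
    return At.T                              # shape (2, m)

# A test gaze estimate is then: x_test = fit_gaze_mapping(E, X) @ eye_feature(img)
```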