On the 18th day of Christmas, Santa brought me … eye tracking on neuromorphic hardware.
Gaze direction, iris/pupil detection, eye openness, and eye landmarks.
2.84 degree gaze accuracy (on in-domain data)
1000+ FPS on neuromorphic hardware
<250 KB model size
Asymmetrical eye tracking (left/right eye independent)
Designed specifically for digital avatars, human-machine interfaces, robots, and driver monitoring systems.
Here’s how he did it.
1. Defining what we really need.
Over the past 10 years there has been an explosion of methods for gaze tracking, especially using deep learning. However, we found that most of this work was driven by researchers focused on research problems and did not align with what we actually need in real applications. Secondly, we wanted an approach that could work on next generation AI hardware, especially neuromorphic hardware, where we have constraints on branching and skip connections. Most recent research seemed to focus on transformers and self-attention, both of which are difficult to run on neuromorphic hardware architectures. Hence we first defined what we really wanted out of gaze tracking.
Extreme real-time performance – we want to be able to track eyes at speeds greater than 240 FPS to capture rapid eye movements.
High-accuracy gaze – we want to detect gaze direction with an error of 2.0 to 3.0 degrees in ideal situations.
We want to be able to check for conditions of lazy-eye, walled-eye, and cross-eye, so left and right eye tracking needs to be completely independent.
We want to be able to detect eye openness and iris and pupil size for cognitive and emotional states.
We want this in a model that fits in less than 256 KB of storage, so it can go into low cost IoT devices and MCUs.
We want to be able to use low cost CMOS RGB or IR sensors without any special hardware.
Finally we want to know if the eye is occluded or is not visible and the confidence of the outputs.
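One way to picture the shopping list is as a per-frame output record. The sketch below is for illustration only: the field names, units, and the 0.5 confidence threshold are assumptions and are not taken from the actual model.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class EyeState:
    """Hypothetical per-eye output; field names and units are illustrative."""
    yaw_deg: float           # horizontal gaze angle in degrees
    pitch_deg: float         # vertical gaze angle in degrees
    openness: float          # 0.0 = closed, 1.0 = fully open
    pupil_iris_ratio: float  # pupil diameter divided by iris diameter
    occluded: bool           # True when the eye is hidden or out of view
    confidence: float        # 0.0 to 1.0


@dataclass
class GazeFrame:
    """Left and right eyes are kept as fully independent records."""
    left: EyeState
    right: EyeState


def is_usable(eye: EyeState, min_confidence: float = 0.5) -> bool:
    """An eye counts only if it is visible and the model is confident enough."""
    return (not eye.occluded) and eye.confidence >= min_confidence


def horizontal_disparity_deg(frame: GazeFrame) -> Optional[float]:
    """Left yaw minus right yaw, or None when either eye is unusable."""
    if not (is_usable(frame.left) and is_usable(frame.right)):
        return None
    return frame.left.yaw_deg - frame.right.yaw_deg
```

Because each eye is its own record, a large and persistent disparity between the two yaw angles could be flagged for follow-up without one eye's estimate influencing the other.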
Now with the shopping list – elves to work.
2. Choosing an architectural approach early on.
Based on our experience over the last 10 years with state-of-the-art recognition of facial expressions using deep learning, we decided to go with a simple method early on. We chose a well-tested and proven feed-forward-only deep CNN with a well-designed single multi-task head, and focused on the quality of the data and loss functions to generate a latent space representation that is coherent across all the tasks we are trying to solve. Our assumption here is that when we model the eye, we are looking at a fairly constrained physical problem. Every person’s eye behaves more or less the same way: an eyeball rotates in 3D space, a pupil dilates, skin (the eyelid) occludes what we are observing when blinking, and there is a finite number of degrees of freedom (your gaze can only go so far left and right, and up and down).
Secondly, many people in the literature use LSTM or other RNN networks to reduce gaze errors. We didn’t like this approach, as feedback loops ultimately imply that the frame rate needs to be fixed, and we wanted to be able to run eye tracking on low power robots running at 1 FPS or on VR headsets running at >250 FPS with the same model. Building in a time-dependent error correction scheme is not ideal in hardware architectures where frame latency may vary, and it should be a “last resort” rather than a first-principles design approach.
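To get a feel for what a skip-free CNN with a single head means under a 256 KB budget, here is a rough parameter count. The layer widths, the 40 head outputs, the global average pooling, and the int8 versus float32 storage comparison are all assumptions made for this sketch; the real architecture is not described in this post.

```python
def conv_params(in_channels: int, out_channels: int, kernel: int = 3) -> int:
    """Weights plus biases of one standard convolution layer."""
    return in_channels * out_channels * kernel * kernel + out_channels


def plain_cnn_params(channels, head_outputs, in_channels=1):
    """Parameter count of a skip-free conv stack followed by one linear head.

    Assumes global average pooling before the head, so the head sees
    channels[-1] input features.
    """
    total = 0
    previous = in_channels
    for width in channels:
        total += conv_params(previous, width)
        previous = width
    total += previous * head_outputs + head_outputs
    return total


BUDGET_BYTES = 256 * 1024

# Hypothetical layout: four conv layers and 40 outputs shared by all tasks.
PARAMS = plain_cnn_params([16, 32, 64, 96], head_outputs=40)
INT8_BYTES = PARAMS * 1      # one byte per parameter
FLOAT32_BYTES = PARAMS * 4   # four bytes per parameter

print(PARAMS, INT8_BYTES <= BUDGET_BYTES, FLOAT32_BYTES <= BUDGET_BYTES)
```

With these made-up widths the network has 82,568 parameters, which fits the budget at one byte per parameter but not at four. That gives a sense of how little room there is to add capacity.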
3. To build or not to build a new dataset?
As we have almost a decade of experience building large scale computer vision datasets for human behaviours, especially emotions, we knew that choosing or creating the right training dataset would be key to solving this problem. However, we lacked a rich dataset of eye tracking data specifically for the “shopping list” of features. Many datasets exist in the literature; however, we knew that the inner distributions of existing datasets involve two fundamental issues:
- Implicitly biased – data is almost always skewed which leads to inherent biases related to gender, ethnicity, socioeconomic status, age, etc;
- Unrepresented corner cases – collection of data samples for corner cases can be complicated and/or expensive, so model quality can suffer from the small diversity in data.
A further issue is that existing datasets did not come with the labels we needed. Taking existing datasets and labelling them manually is very labor-intensive. For large production datasets, even ordinary labelling tasks such as segmentation or classification can cost hundreds of thousands of dollars per dataset. Moreover, some desired labels cannot be produced by hand at all and require complex and expensive data collection using special equipment (e.g. lighting environments). Also, many existing datasets are restricted to non-commercial use and so cannot be used. Hence we chose to build a new dataset specifically for the task at hand.
4. Taking a data centric model driven approach to the training data.
We now knew we needed to build a dataset, and the question was how to do this fast and at low cost. Existing approaches for large scale datasets involve creating a special hardware setup and getting people to come into a lab to have photos taken of their eyes (see the ETH-XGaze dataset: a large scale (over 1 million samples) gaze estimation dataset with high-resolution images under extreme head poses and gaze directions). This involves building a hardware setup to capture, store, and synchronize the data, creating task-specific tools, manually labelling the captured images, and recruiting people to come into a lab. Typically this process takes 12 months to do properly, and budgets can easily extend into the hundreds of thousands. We had neither the time nor the money to invest.
To solve this problem, we decided to go with a synthetic data pipeline for the training dataset. Synthetic data means using a computer to generate data for model training. In the case of humans, a growing trend is to use computer gaming hardware (what you get in a PS5) to create images for computer vision. Given the success of Microsoft researchers at Cambridge with Face Synthetics, we decided that this approach would have the highest chance of success.
However, using synthetic data has its own challenges, as it is very difficult, if not impossible, to simulate the diversity of the real world. Like training a fighter pilot in a simulator, you simply need to have a certain number of “real” hours in a plane and experience real situations before you can say you can fly a plane.
In the synthetic data world this is called the “domain gap”, and it is a real problem for eye tracking, as the appearance of eyes varies a lot around the world across cultures and ages. For example, East Asian eyes tend to appear narrower, and older people undergo changes to the skull from age 55 onwards that cause geometric deformations. Creating a simulation pipeline to capture all this diversity simply isn’t possible.
5. Using AI to solve the “domain gap” faster and cheaper.
One of the key benefits of AI is driving efficiency – and that applies nowhere more strongly than in building the AI systems themselves. To solve the synthetic “domain gap” problem we decided that AI was the tool for the job.
The question we posed: could we use AI to generate the massive diversity of how eyes appear in the real world? Given OpenAI’s success (ChatGPT) in putting internet-scale data (text, images, etc.) into “foundation models”, we should be able to use these tools to generate the missing data we needed. With the release of Stable Diffusion in August 2022, we could finally generate realistic images of various eye types across age, gender, and ethnicity at scale!
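As a rough idea of how diversity "at scale" can be driven from text prompts, the sketch below enumerates attribute combinations into prompt strings that could be handed to a text-to-image model. The attribute lists, the prompt wording, and the function name are hypothetical; the post does not describe the actual prompts or pipeline used.

```python
from itertools import product

# Hypothetical attribute values; a real pipeline would use a richer taxonomy.
AGE_GROUPS = ["child", "young adult", "middle-aged adult", "older adult"]
GENDERS = ["female", "male"]
ETHNICITIES = ["East Asian", "South Asian", "Black", "White",
               "Hispanic", "Middle Eastern"]
EYE_STATES = ["wide open", "half closed", "squinting"]


def build_prompts():
    """Return one text prompt per combination of the attributes above."""
    prompts = []
    for age, gender, ethnicity, state in product(
            AGE_GROUPS, GENDERS, ETHNICITIES, EYE_STATES):
        prompts.append(
            f"close-up photo of the eye region of a {gender} {ethnicity} "
            f"{age}, eyes {state}, sharp focus"
        )
    return prompts


if __name__ == "__main__":
    all_prompts = build_prompts()
    print(len(all_prompts), "prompts, for example:", all_prompts[0])
```

Each prompt would then be sampled several times with different random seeds, so a few hundred combinations can fan out into a very large number of distinct eye images.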
With Xmas now only 3 months away – the elves got cranking generating 100’000 different AI eye shapes and forms.
6. Putting it all together.
Now the magic happens.
Training a large scale machine learning model involves finding a large set of hyperparameters that tune the model for a given setup. Multi-task setups are even more challenging, as there are few methods to help you weight tasks, and time was ticking for Xmas.
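The simplest form of task weighting is a hand-tuned weighted sum of the per-task losses, sketched below. The task names and weight values are invented for illustration, and the post does not say which weighting scheme was actually used.

```python
def combine_task_losses(task_losses, task_weights):
    """Weighted sum of per-task losses for one training step.

    Both arguments are dicts keyed by task name. Every task that reports
    a loss must have a weight, so a forgotten task fails loudly.
    """
    missing = set(task_losses) - set(task_weights)
    if missing:
        raise KeyError(f"no weight for tasks: {sorted(missing)}")
    return sum(task_weights[name] * value
               for name, value in task_losses.items())


# Hypothetical tasks and weights, for illustration only.
example_losses = {"gaze": 2.0, "landmarks": 1.0, "openness": 0.5}
example_weights = {"gaze": 1.0, "landmarks": 0.5, "openness": 2.0}
example_total = combine_task_losses(example_losses, example_weights)
```

Every weight is one more hyperparameter to search over, which is part of why multi-task training is slow to tune.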
The elves were busy – generating a lot of data (both synthetic and domain gap data) and in the workshop Santa was preparing the parameters for training.
First results were impressive; however, they were not meeting accuracy targets: the model’s gaze error on in-domain data was 4.2 degrees. State-of-the-art transformer methods for gaze were at 4.0 degrees error.
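Gaze error in degrees is commonly measured as the angle between the predicted and ground-truth gaze direction vectors. The sketch below assumes gaze is represented as a 3D direction vector; the function names are ours, and the post does not state the exact evaluation protocol.

```python
import math


def angular_error_deg(predicted, target):
    """Angle in degrees between two 3D gaze direction vectors."""
    dot = sum(p * t for p, t in zip(predicted, target))
    norm_p = math.sqrt(sum(p * p for p in predicted))
    norm_t = math.sqrt(sum(t * t for t in target))
    # Clamp to guard against rounding slightly outside [-1, 1].
    cosine = max(-1.0, min(1.0, dot / (norm_p * norm_t)))
    return math.degrees(math.acos(cosine))


def mean_angular_error_deg(predictions, targets):
    """Average angular error over paired lists of gaze vectors."""
    errors = [angular_error_deg(p, t) for p, t in zip(predictions, targets)]
    return sum(errors) / len(errors)
```

The vectors do not need to be unit length, since the function normalises them before taking the arccosine.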
We needed to do better.
At this point you don’t have too many options. The elves were hitting a wall, as there was little time left to push data production further. We wanted our model architecture to stay neuromorphic compliant, and to hit 1000+ FPS we could not increase the number of parameters in the network.
So we decided to go “old school”.
7. Going “old school” – the final boost.
Before we had deep learning for eye and face tasks, we had Active Shape Models (ASMs), introduced by Tim Cootes back in 1995! These models were built on a very old school technique from the Principal Component Analysis (PCA) world. The technique worked iteratively to converge on a solution, and the search involved solving a Jacobian matrix problem. This limited the amount of training data that could be used effectively before it became computationally too heavy (typically 10’000 examples was the limit). When deep learning came along, deep neural networks allowed us to put millions of examples into the learning process, and ASMs were no longer state-of-the-art.
However, ASMs had a nice property: any result they generated belonged to a statistical distribution of the training data based on PCA, which meant that typically everything was “smoothed” and close to a statistical average. So what if we could use the PCA version of our landmark data to guide our multi-task learning process, staying close to the mean early on and then using the full power of deep learning to drive the performance home?
This approach worked. There was no need to expand the network width, and we could squeeze further performance from a better balanced latent space representation, hitting our target gaze accuracy of less than 3 degrees on in-domain data!
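One possible reading of this idea, sketched with NumPy: fit a PCA model to the landmark labels, project each label onto a few principal axes to get a "smoothed" version, and anneal the regression target from the smoothed labels to the raw ones during training. The function names, the linear schedule, and the number of components are assumptions for this sketch and not a description of the actual training code.

```python
import numpy as np


def fit_pca(landmarks, n_components):
    """Return the mean shape and the top principal axes.

    landmarks has shape (n_samples, n_coords), one flattened shape per row.
    """
    mean = landmarks.mean(axis=0)
    # Rows of vt are principal axes, sorted by decreasing variance.
    _, _, vt = np.linalg.svd(landmarks - mean, full_matrices=False)
    return mean, vt[:n_components]


def pca_smooth(landmarks, mean, axes):
    """Project shapes onto the PCA subspace and reconstruct them."""
    coeffs = (landmarks - mean) @ axes.T
    return mean + coeffs @ axes


def guided_target(raw, smoothed, step, warmup_steps):
    """Blend from PCA-smoothed labels (step 0) to raw labels (after warmup)."""
    alpha = min(1.0, step / warmup_steps)
    return (1.0 - alpha) * smoothed + alpha * raw
```

Early in training the network is only asked to reproduce shapes the PCA model considers plausible. Once the warmup ends it sees the full detail of the raw labels.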
8. Fire up the reindeer – Santa is coming to town.
The result is pretty amazing (see video above).
Eye tracking trained exclusively on synthetic or AI-generated data, running on next generation neuromorphic hardware.
There are still a number of corner cases to solve, especially in the combination of certain facial expressions and eye movements; however, from the initial results the approach looks very promising.