Recurrent YOLO for object tracking

The Overview

The Network

Demo on Unseen Videos

Red: Ground truth of the testing sequence
Blue: Detection result of the YOLO small model (trained on VOC, 20 classes, which do NOT include faces)
Green: Tracking result of ROLO

Spatio-temporal Robustness Against Occlusion

Visualization with Regression of Locations (Unseen Frames)

ROLO is effective for several reasons: (1) the representational power of the high-level visual features from the ConvNets; (2) the LSTM's ability to interpret those features and thereby detect visual objects, spatially supervised by a location or heatmap vector; and (3) its ability to regress effectively using spatio-temporal information.
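The three points above can be illustrated with a minimal sketch, not the authors' implementation: per-frame visual features are concatenated with the detector's box, passed through one LSTM step, and a linear layer regresses a location from the hidden state. The function names (`lstm_step`, `track`, `random_matrix`), the tiny dimensions, the box layout (x, y, w, h), and the random untrained weights are all assumptions made for illustration.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def random_matrix(rows, cols, rng, scale=0.1):
    # Hypothetical helper: small random weights stand in for trained ones.
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


def lstm_step(x, h, c, W, b):
    """One LSTM step. W has 4*H rows of length len(x)+H; b has 4*H entries."""
    H = len(h)
    z = x + h  # concatenate the input with the previous hidden state
    pre = [sum(w * v for w, v in zip(row, z)) + bias for row, bias in zip(W, b)]
    i = [sigmoid(v) for v in pre[0:H]]            # input gate
    f = [sigmoid(v) for v in pre[H:2 * H]]        # forget gate
    o = [sigmoid(v) for v in pre[2 * H:3 * H]]    # output gate
    g = [math.tanh(v) for v in pre[3 * H:4 * H]]  # candidate cell state
    c_new = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c, i, g)]
    h_new = [ok * math.tanh(ck) for ok, ck in zip(o, c_new)]
    return h_new, c_new


def track(frames, W, b, W_out, hidden_size):
    """frames: list of (visual_features, detected_box) pairs, one per frame.

    Returns one regressed box (4 numbers) per frame. The hidden and cell
    states carry information from earlier frames, which is what lets the
    output depend on history and not only on the current detection.
    """
    h = [0.0] * hidden_size
    c = [0.0] * hidden_size
    boxes = []
    for features, det_box in frames:
        x = list(features) + list(det_box)  # visual features + spatial constraint
        h, c = lstm_step(x, h, c, W, b)
        boxes.append([sum(w * hk for w, hk in zip(row, h)) for row in W_out])
    return boxes
```

With trained weights, the regressed boxes would be compared against ground-truth locations during training; here the weights are random, so only the data flow is meaningful.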

Visualization with Regression of Heatmaps (Unseen Videos)

The figure above shows that ROLO tracks the object through near-complete occlusion. Even though two similar targets appear simultaneously in this video, ROLO tracks the correct one, because the detection module inherently feeds the LSTM unit a spatial constraint. Note that between frames 47 and 60, YOLO fails to detect the target, yet ROLO does not lose the track. The heatmap contains minor noise when no detection is present, since the similar target is still in sight. Nevertheless, ROLO is more confident about the real target even when it is fully occluded, because it exploits its history of locations as well as its visual features.
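To make the idea of a heatmap vector concrete, the following sketch rasterizes a bounding box onto a square grid and flattens it. It is an illustration under assumptions, not the code used to produce the figures: the function name `box_to_heatmap`, the box layout (center x, center y, width, height, all normalized to [0, 1]), and the default 32x32 grid are choices made for this example.

```python
import math


def box_to_heatmap(box, size=32):
    """Return a flattened size*size grid with 1.0 inside the box, 0.0 elsewhere.

    box = (x_center, y_center, width, height), normalized to [0, 1] (assumed layout).
    """
    x, y, w, h = box
    # Convert the box edges to grid indices, clipped to the grid.
    x0 = max(0, int((x - w / 2) * size))
    x1 = min(size, int(math.ceil((x + w / 2) * size)))
    y0 = max(0, int((y - h / 2) * size))
    y1 = min(size, int(math.ceil((y + h / 2) * size)))
    heat = [[0.0] * size for _ in range(size)]
    for row in range(y0, y1):
        for col in range(x0, x1):
            heat[row][col] = 1.0
    # Flatten row by row so the grid can serve as a regression target vector.
    return [v for row in heat for v in row]
```

A vector of this kind can express uncertainty in a way four box coordinates cannot: a predicted heatmap may carry weak activation around a distractor while staying strongest on the real target.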

Experimental Results





[arXiv] [GitHub]

If you find this work useful, please cite:
  title={Spatially Supervised Recurrent Convolutional Neural Networks for Visual Object Tracking},
  author={Ning, Guanghan and Zhang, Zhi and Huang, Chen and He, Zhihai and Ren, Xiaobo and Wang, Haohong},
  journal={arXiv preprint arXiv:1607.05781},