Scene Classification in Images with CNN-SVM

Under Construction... Coming Soon

In this project, we classify an input image into one of the following 15 classes:
(1) Airport (2) Beach (3) Bedroom (4) BookHouse (5) Classroom (6) DinningRoom (7) Food (8) Hospital (9) Human Closeup (10) InsideCar (11) Office (12) Park (13) Restaurant (14) ShoppingMall (15) Toilet

There are several key points to note:

1. Finetuning the pre-trained model with our own data
Training a model from scratch may take weeks. Therefore, we prefer finetuning a pre-trained model that is relevant to our task. One advantage of using Caffe is the variety of models available in the Caffe Model Zoo.
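As a sketch, finetuning in Caffe is typically launched from the command line, initializing the network from downloaded weights instead of random values. The file names below are placeholders (here assuming the CaffeNet reference model):

```
# solver.prototxt points at the network definition and training schedule;
# -weights initializes the net from the pre-trained model rather than from scratch.
caffe train -solver solver.prototxt -weights bvlc_reference_caffenet.caffemodel
```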

2. During finetuning, only finetune the last layer
In a traditional CNN architecture, the neurons in the second-to-last layer form a 4096-dimensional high-level visual feature representation. The pre-trained model is powerful precisely because of this general feature representation. In order not to ruin these robust features, we set the learning rates of the lower layers to zero while keeping the learning rate of the last layer non-zero. Otherwise, finetuning would ruin the features and result in overfitting, because the data used for finetuning is usually limited compared to the ImageNet-scale data used in pre-training. (If the data for finetuning is abundant, we can allow small learning rates in the lower layers.)
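In a Caffe network prototxt, this freezing is expressed with `lr_mult: 0` on the lower layers, and a non-zero multiplier on the (renamed) last layer. A minimal sketch; the layer names and multiplier values here are illustrative:

```
layer {
  name: "conv1"            # a lower layer: frozen
  type: "Convolution"
  param { lr_mult: 0 }     # weights: no update
  param { lr_mult: 0 }     # bias: no update
  # ... convolution_param etc. as in the pre-trained net ...
}
layer {
  name: "fc8_scenes"       # renamed so its weights are re-initialized
  type: "InnerProduct"
  param { lr_mult: 10 }    # weights: learn
  param { lr_mult: 20 }    # bias: learn
  inner_product_param { num_output: 15 }   # our 15 scene classes
}
```

Renaming the last layer matters: Caffe copies weights by layer name, so a new name gets fresh, randomly initialized weights while every matching name inherits the pre-trained ones.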

3. When do we use SVM for the classification of the CNN features?
Usually, a linear classifier on top of the CNN works fine. But in some cases it is better to train an additional SVM classifier on these CNN features.
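As a minimal sketch of the idea, here is a linear SVM trained by sub-gradient descent on the hinge loss. The random 2-D points stand in for the 4096-D CNN features, and the hyperparameters are illustrative, not the project's actual settings:

```python
import numpy as np

# Toy stand-in for the 4096-D CNN features: two separable 2-D clusters.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2.0, 0.5, (20, 2)), rng.normal(2.0, 0.5, (20, 2))])
y = np.array([-1] * 20 + [1] * 20)

def train_linear_svm(X, y, lr=0.1, lam=0.01, epochs=100):
    """Sub-gradient descent on the L2-regularized hinge loss."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            if yi * (xi @ w + b) < 1:        # margin violated: push the plane
                w += lr * (yi * xi - lam * w)
                b += lr * yi
            else:                            # margin satisfied: only decay w
                w -= lr * lam * w
    return w, b

w, b = train_linear_svm(X, y)
print((np.sign(X @ w + b) == y).mean())  # training accuracy on the toy set
```

With real CNN features, one would extract the second-to-last layer's activations for each image and feed them to the SVM in exactly this way.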

In our case, the data used for finetuning consists of real-world images from movies, since our goal is to classify scenes and attach a label to each camera-take sequence (click here for the Camera Take Project). The data is therefore quite different from the data used in pre-training, and in such cases an additional linear SVM classifier is believed to work better.
More importantly, scene classification is multi-level: several scenes share common attributes. We therefore use multi-level SVMs for image scene labeling. Specifically, we train 7 SVM models across the 3 levels of our scene classification hierarchy.

The scenes are first classified into: (1) indoor, (2) outdoor, and (3) human close-up. This is followed by a middle-level classification within each class, based on the semantic similarity of scenes, e.g., homes and hotels in one group, workplaces in another. Finally, a last-level SVM classifies the image within its mid-level group.
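The level-by-level routing can be sketched as follows. The "SVMs" here are stub functions, and the group name "home_hotel" is an illustrative assumption (cf. the [homehotel] tag in the label example), not the project's exact model layout:

```python
# Sketch of the 3-level SVM routing: each node's classifier picks a child
# until a leaf scene label is reached.

def route(feature, models, node="root"):
    """Descend the label tree, letting each level's model pick a child."""
    labels = []
    while node in models:
        node = models[node](feature)
        labels.append(node)
    return labels

# Stub classifiers standing in for the trained SVMs at each node.
models = {
    "root":       lambda f: "indoor",       # level 1: indoor/outdoor/human close-up
    "indoor":     lambda f: "home_hotel",   # level 2: semantic group (assumed name)
    "home_hotel": lambda f: "DinningRoom",  # level 3: final scene label
}

print(route(None, models))  # ['indoor', 'home_hotel', 'DinningRoom']
```

The path collected while descending is exactly the multi-level label attached to each image.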
For each image, we have several labels. One example is shown here:
/home/Guanghan/test/dinningroom/002.jpg, /Indoor /[homehotel]dinningroom+bedroom+toilet+bookhouse /dinningroom
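A small parser for this annotation format could look like the following sketch; the function name is hypothetical, and it simply splits the image path from the slash-separated coarse-to-fine labels:

```python
# Sketch of a parser for the annotation line format above.

def parse_label_line(line):
    """Return (image_path, [top_label, mid_label, final_label])."""
    path, rest = line.split(",", 1)
    labels = [tok.strip() for tok in rest.split("/") if tok.strip()]
    return path.strip(), labels

line = ("/home/Guanghan/test/dinningroom/002.jpg, "
        "/Indoor /[homehotel]dinningroom+bedroom+toilet+bookhouse /dinningroom")
path, labels = parse_label_line(line)
print(path)    # the image path
print(labels)  # the three labels, from top level to final scene class
```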