YOLO, short for You Only Look Once, is a real-time object detection algorithm proposed in the paper You Only Look Once: Unified, Real-Time Object Detection, by Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi.
As discussed in my previous post (in Chinese), the Jetson TX1 from NVIDIA is a boost to the application of deep learning on mobile devices and embedded systems. Many potentially inspiring products are on the way, one of which is the real-time realization of computer vision tasks on mobile devices. Imagine real-time abnormal action recognition under surveillance cameras, real-time scene text recognition by smart glasses, or real-time object recognition by smart vehicles or robots. Not excited? How about this: real-time computer vision tasks on egocentric videos, or on your AR and even VR devices. Imagine watching a clip of video shot by Kespry (What is this?), experiencing how Messi beat fewer than a dozen players and scored a goal. This could be used for educational purposes, where you stand in a player's shoes and study how he/she observes the circumstances in real time and handles the ball. (If you are considering a patent, please put my name at the end of the inventors list.)
That being said, I assume you have at least some interest in this post. The author has already illustrated how to quickly run the code; this article is about how to immediately start training YOLO with our own data and object classes, in order to apply object recognition to specific real-world problems.
Here are two DEMOS of YOLO trained with customized classes:
The cfg that I used is here: darknet/cfg/yolo_2class_box11.cfg
The weights that I trained can be downloaded here: yolo_2class_box11_3000.weights
The pre-compiled software with source code package for the demo: darknet-video-2class.zip
You can use this as an example. The code above is ready to run the demo.
In order to run the demo on a video file, just type:
./darknet yolo demo_vid cfg/yolo_2class_box11.cfg model/yolo_2class_box11_3000.weights /video/test.mp4
If you would like to repeat the training process or get a feel of YOLO, you can download the data I collected and the annotations I labeled.
I have forked the original GitHub repository and modified the code so it is easier to start with. Well, it was already easy to start with, but I have added some additional features that might be helpful, so that you do not have to do the same work again (unless you want to do it better):
(1). Read a video file, process it, and output a video with bounding boxes.
(2). Some utility functions, like image_to_Ipl, which converts an image from darknet's format back to the IplImage format of OpenCV (C).
(3). Some Python scripts to label our own data and preprocess the annotations into the format required by darknet.
(…More may be added)
This forked repository also illustrates how to train a customized neural network on our own data, with our own classes.
1. Collect Data and Annotation
(1). For videos, we can use video summarization, shot boundary detection, or camera-take detection to extract static images (a simple frame-sampling sketch is given after this list).
(2). For images, we can use BBox-Label-Tool to label objects.
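As mentioned above, here is a minimal frame-sampling sketch using OpenCV's Python bindings. It uses fixed-interval sampling as a simpler stand-in for shot boundary detection; the paths and the sampling interval are placeholders, not part of the darknet code.

# Sample frames from a video at a fixed interval so they can be
# hand-labeled later. Paths and the sampling step are placeholders.
import os
import cv2

def extract_frames(video_path, out_dir, every_n=30):
    os.makedirs(out_dir, exist_ok=True)
    cap = cv2.VideoCapture(video_path)
    index = saved = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % every_n == 0:
            cv2.imwrite(os.path.join(out_dir, "%06d.jpg" % saved), frame)
            saved += 1
        index += 1
    cap.release()
    return saved

# e.g. keep one frame per second of a 30-fps video:
# extract_frames("/video/test.mp4", "images/stopsign", every_n=30)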
2. Create Annotation in Darknet Format
(1). If we choose to use VOC data to train, use scripts/voc_label.py to convert existing VOC annotations to darknet format.
(2). If we choose to use our own collected data, use scripts/convert.py to convert the annotations.
At this step, we should have darknet annotations (.txt) and a training list (.txt).
Upon labeling, the format of annotations generated by BBox-Label-Tool is:
box1_x1 box1_y1 box1_width box1_height
box2_x1 box2_y1 box2_width box2_height
After conversion, the format of annotations converted by scripts/convert.py is:
class_number box1_x1_ratio box1_y1_ratio box1_width_ratio box1_height_ratio
class_number box2_x1_ratio box2_y1_ratio box2_width_ratio box2_height_ratio
Note that each image corresponds to one annotation file, but we only need a single training list of images. Remember to put the folder “images” and the folder “annotations” in the same parent directory, as the darknet code looks for annotation files this way (by default).
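For illustration, here is a minimal sketch of what such a conversion has to do, assuming the BBox-Label-Tool pixel format above and the normalization convention of scripts/voc_label.py (box center and size divided by the image width and height). The file names and the class number are placeholders, and this is not the actual scripts/convert.py.

# Convert one BBox-Label-Tool annotation ("x1 y1 width height" in pixels,
# one box per line) into a darknet-style .txt with values normalized by
# the image size. File names and the class number are placeholders.
from PIL import Image

def convert_annotation(img_path, ann_path, out_path, class_number=0):
    img_w, img_h = Image.open(img_path).size
    with open(ann_path) as f_in, open(out_path, "w") as f_out:
        for line in f_in:
            parts = line.split()
            if len(parts) != 4:  # skip the box-count line and blanks
                continue
            x1, y1, w, h = map(float, parts)
            x_center = (x1 + w / 2.0) / img_w   # normalized box center
            y_center = (y1 + h / 2.0) / img_h
            f_out.write("%d %f %f %f %f\n" %
                        (class_number, x_center, y_center, w / img_w, h / img_h))

# convert_annotation("images/stopsign/001.jpg",
#                    "annotations/stopsign/001.txt",
#                    "labels/stopsign/001.txt", class_number=0)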
You can download some examples to understand the format:
3. Modify Some Code
(1) In src/yolo.c, change the class numbers and class names. (Also change the paths to the training data and annotations, i.e., the list we obtained in step 2.)
If we want to train new classes, then in order to display the correct PNG label files we also need to modify and run data/labels/make_labels.
(2) In src/yolo_kernels.cu, change class numbers.
(3) Now we are able to train with new classes, but there is one more thing to deal with. In YOLO, the number of parameters of the second-to-last layer is not arbitrary; it is determined by other parameters, including the number of classes and the side (the number of splits of the whole image). Please read the paper for details.
Assuming no other parameters are modified, the number of outputs is (5 x 2 + number_of_classes) x 7 x 7. For example, with the default 20 VOC classes this is (5 x 2 + 20) x 7 x 7 = 1470, and with 2 classes on an 11 x 11 grid it is (5 x 2 + 2) x 11 x 11 = 1452. (A small helper for this calculation is given after the cfg changes below.)
Therefore, in cfg/yolo.cfg, change the “output” in line 218, and “classes” in line 222.
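To avoid miscounting when editing the cfg, here is a tiny helper reproducing the formula above (a sketch; the function name and parameter names are mine):

# Outputs of the second-to-last (connected) layer: side x side grid
# cells, each predicting num_boxes boxes (5 values apiece) plus the
# per-cell class probabilities.
def yolo_output_size(side, num_boxes, classes):
    return side * side * (5 * num_boxes + classes)

print(yolo_output_size(7, 2, 20))   # 1470: default VOC configuration
print(yolo_output_size(11, 2, 2))   # 1452: the 2-class, 11x11 cfg used here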
(4) Now we are good to go. If we want to change the number of layers and experiment with various parameters, we can just play with the cfg file. For the original YOLO configuration, we have pre-trained weights to start from; for an arbitrary configuration, I am afraid we have to generate a pre-trained model ourselves.
4. Start Training
Try something like:
./darknet yolo train cfg/yolo.cfg extraction.conv.weights
If you find any problems regarding the procedure, contact me at email@example.com.
Or you can join the aforementioned Google Group; there are many brilliant people answering questions there.
I also have a Windows version of darknet available:
But you need Visual Studio 2015 to open the project. Also note that this Windows version is only ready for testing; its purpose is fast testing of cpuNet.
Here is a quick hands-on guide:
1. Open VS2015. If you don't have it, you can install it for free from the official Microsoft website.
2. Open: darknet-windows\darknet\darknet.sln
3. Compile.
4. Copy the exe file from darknet-windows\darknet\x64\Debug\darknet.exe to the root folder: darknet-windows\darknet.exe
5. Open cmd and run: darknet yolo test [cfg_file] [weight_file] [img_name]
6. The image will be output to: darknet-windows\prediction.png and darknet-windows\resized.png
Recently I have received a lot of e-mails asking about YOLO training and testing. Some of the questions concern the same issues, so I have picked some representative ones for this FAQ section. If you find a similar question here, you may have your answer right away. Since I am also a student of darknet, if you find any of my answers erroneous, please comment below. Thanks!
Sorry about the last email; I just re-read your GitHub page and found you used (5 x 2 + 2) x 11 x 11 = 1452.
I have another confusion about subdivisions: the author set subdivisions = 64. Can you give me some clues about what this variable means?
In my understanding, if you have a batch of 128 images, for example, you batch-update the weights after processing those 128 images.
And if subdivisions is set to 64, you have 2 images in each subdivision. For each subdivision, you concatenate the ground-truth image feature vectors into one and process it as a whole (see the toy calculation below).
If you set subdivisions to 2, training is the fastest, but you see fewer results printed out.
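A toy calculation of how these two cfg values interact, using the numbers from the question (a sketch of my understanding, not darknet code):

# batch: number of images per weight update (from the question above)
batch = 128
# subdivisions: how many sub-batches the batch is split into,
# mainly so that each sub-batch fits in memory
subdivisions = 64
print(batch // subdivisions)  # -> 2 images processed per sub-batch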
– How to get the predicted bounding box coordinates? I am planning to write the complete results (detection class labels and the associated predicted bounding box coordinates) to a flat text file. Which source file should I look into or modify to do this?
– When to stop training? While training YOLO, it does not show any information about the accuracy of the network.
– I am wondering how to test it on videos. The test command takes one image at a time, so how do I pass multiple images to YOLO for testing?
As for the bounding boxes, you can find them in [yolo_kernels.cu], in the function “void *detect_in_thread(void *ptr)”.
If you want to control the training, just modify the configuration file. For example, in [yolo_2class_box11.cfg], change max_batches and steps in lines 14 and 12, respectively.
In order to test a video file, just type this in the terminal: ./darknet yolo demo_vid yolo_2class_box11.cfg yolo_2class_box11_3000.weights /video/test.mp4
I downloaded all your images and lables, modified my training_list.txt, and started to train, but it always ends with a ‘Couldn’t open file: /root/cnn/guanghan/darknet/scripts/labels/stopsign/rural_027.txt’ problem.
Each time it names a different file that cannot be opened.
Please check if the folder [labels] and the folder [images] are in the same directory. I think this is the problem.
Yes, they are all in the darknet/scripts directory:
Oh, you made a typo… It is “labels”, not “lables”. That’s the issue. >_<!
Thanks for your reply, Ning. I tried to repeat the training results of stopsign and yieldsign, but it looks like I am not able to get the right results.
- I downloaded your images and annotations.
- I changed line 218 to output=588 and classes=2 in cfg/yolo-tiny.cfg. On your GitHub page, line 218 of yolo_2class_box11.cfg is output=1452. That is not correct for 2 classes, right?
- After executing the training, over several tries, AOU = nan….
- Anyway, I waited a couple of hours and got a weights file at 2000 iterations.
- Testing on this video https://www.youtube.com/watch?v=OCaRH_C_USg with threshold=0, no bounding boxes show up.
Any comments? I am afraid something went wrong at step 3, because everything is nan….
- The second-to-last layer is correct. With 11 x 11 splits of the image (in order to be more accurate on small objects), the number of neurons at this layer should be: (5*2 + 2) * 11 * 11 = 1452.
- During training, “AOU = nan” sometimes occurs. This is a little complicated to explain, but I will do my best to simplify the answer:
- First, try reducing the learning rate to a smaller value; this usually helps.
- It can happen when the annotation data is not correct, by which I mean there exists a training image whose annotation is empty. (A sanity-check sketch is given after this answer.)
- If no object is labeled, the code will still try to update the weights at some point, with no actual data fed in. That batch update will be nonsense and will ruin the current weights.
- But since you downloaded my data, this should not be the case.
- When the annotation data is correct, this sometimes happens because there is a hidden bug in the darknet code.
- During training, the code “randomly” samples image patches from the training image as a form of data augmentation. If a bounding box is near the edge of the image, the sampled patch will sometimes cross the border. This invalid data will ruin the weight update as well.
- One way to mitigate this is to batch-update the weights with many images instead of a single image, which alleviates the effect.
- This is why in the cfg file I set subdivisions = 2 instead of the default 64.
- If you use the tiny model, you should pre-train the model yourself or use darknet.conv.weights, rather than fine-tuning from the provided pre-trained model extraction.conv.weights, because extraction.conv.weights has more layers than the tiny model.
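As promised above, here is a minimal sanity-check sketch for the two failure cases (mine, not part of darknet; the images/labels layout from step 2 and the file names are assumptions). It flags missing or empty annotation files, as well as normalized boxes that already touch the image border and may end up outside a randomly sampled patch:

# Flag annotations that can lead to "AOU = nan": missing or empty label
# files, and normalized boxes that touch the image border (a random crop
# can push these outside the image). Paths and names are placeholders.
def check_annotations(list_file="training_list.txt"):
    for img_path in open(list_file):
        # assumes the "images"/"labels" sibling-folder layout from step 2
        ann_path = img_path.strip().replace("images", "labels").rsplit(".", 1)[0] + ".txt"
        try:
            rows = [line.split() for line in open(ann_path) if line.strip()]
        except IOError:
            print("MISSING:", ann_path)
            continue
        if not rows:
            print("EMPTY:", ann_path)
            continue
        for row in rows:
            cls = row[0]
            x, y, w, h = map(float, row[1:5])  # normalized center and size
            if x - w / 2 <= 0 or x + w / 2 >= 1 or y - h / 2 <= 0 or y + h / 2 >= 1:
                print("AT BORDER:", ann_path, "class", cls)

check_annotations("training_list.txt")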
I tried to add a 3rd class to the training, but somehow the test always gives me only the first two. The 3rd class never shows up, even if I set the threshold to 0 or even -1.
Does your training code only work for 2 classes?
I changed the number of classes to 3 both in yolo.c and in the cfg.
Have you checked out yolo_kernels.cu? You should set CLS_NUM to 3 there as well.
I see that the bounding boxes for the yieldsign and stopsign are not tight. Also, the top-left corner does not always seem to be where it should be. Is there any specific reason for the bounding box being bigger than it should be, and for the inexact location of the top-left corner?
By default, YOLO splits the image into 7*7 partitions. To predict more precisely, you could modify the second-to-last layer to use 11*11 partitions, for example. This only alleviates the problem rather than completely solving it, because the splits are still not arbitrary.
Fast R-CNN can generate more precise detections because its bounding boxes can occur at arbitrary positions, which is also why it is much slower than YOLO.
One thing we can do is add a post-processing step to align the bounding boxes.
I wanted to ask what data you used for this. Is it public data, or did you annotate your own data?
How many frames did you have to train on for each of these classes to get a decent performance?
The data I used was downloaded from [Google Images] and hand-labeled by my intern employees. Just kidding, I had to label it myself. How to label data is covered in the README.
Since I am training with only two classes, and the signs have less distortion and variance (compared to persons or cars, for example), I only needed around 300 images per class to get decent performance. But if you are training with more classes or harder classes, I suggest you have at least 1000 images per class.
Is it possible to share the data and annotations you collected, so I can get a feel for YOLO? I am limited by resources and hence not able to test on large datasets as of now. Also, it will be easier for me to comprehend your instructions with this data.
I have uploaded the data that I used for the demo, along with the corresponding annotations. If you download them, you should be able to repeat the training process easily.
I have successfully run your example, but when I switched to running my own data with 10 classes (1000 images), I got a “cannot load image” error.
I have checked the image files; the links to the images are fine and they open normally. Do you have any suggestions?
I have encountered similar problems before. In my case, the solution was to check the txt file: there seems to be a difference between Python 2 and Python 3 in how they treat whitespace in text files. You cannot see the difference in the text file, though (unless you look at the individual chars). I used Windows (Python 3) to generate the text file and Ubuntu (Python 2) to load the images, and that is where I met this problem.
In short, the program failed to interpret and find the correct path.
Maybe you are having the same issue? Please check it out.
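If you suspect the same issue, here is a minimal sketch (the file name is a placeholder) that rewrites the training list with stripped whitespace and Unix newlines, which usually settles this kind of path problem:

# Rewrite the training list with stripped whitespace and Unix newlines,
# so darknet on Linux does not receive paths with a trailing '\r' from
# a Windows-generated file. The file name is a placeholder.
def normalize_list(path="training_list.txt"):
    with open(path, "rb") as f:
        lines = f.read().decode("utf-8").splitlines()
    cleaned = [line.strip() for line in lines if line.strip()]
    with open(path, "wb") as f:
        f.write(("\n".join(cleaned) + "\n").encode("utf-8"))

normalize_list("training_list.txt")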