Detection of Militia Object in Libya by Using YOLO Transfer Learning

Corresponding Author: Yosi Kristian, Institut Sains dan Teknologi Terpadu Surabaya, Indonesia. Tel. +62 89680847005, yosi@stts.edu

Humans can quickly and accurately recognize and classify the shapes and names of objects received visually, and respond to them. Ideally, an automated system should provide the same kind of response across tasks and situations, for example when driving, walking in a crowd, or patrolling as a member of the military on dangerous terrain. This remains a problem for systems used on the battlefield. In the proposed system, the object detection model must be able to distinguish armed human objects (militia) from unarmed human objects. To address this problem, the authors use transfer learning on the YOLO algorithm, which is currently in its third version. YOLOv3 is reported to offer extreme speed with competitive accuracy: at a mean Average Precision (mAP) measured at 0.5 IOU, YOLOv3 is on par with Focal Loss while being about 4x faster. Moreover, YOLOv3 trades off speed and accuracy simply by changing the size of the model, without the need for retraining.


INTRODUCTION
Various methods for object detection have been developed. Object detection systems are built to approximate as closely as possible the way humans process what they observe visually. The Histogram of Oriented Gradients (HOG) with an SVM [1] is strong at detecting humans in limited-quality images, but it fails when the main object is partially occluded by other objects such as vehicles or trees. This has become a problem for systems used on the battlefield. In the proposed system, the object detection model must be able to distinguish armed human objects (militia) from unarmed human objects. To address this problem, the authors use transfer learning on the YOLO algorithm, which is currently in its third version. YOLOv3 is reported to have extreme speed and accuracy. General-purpose robots likewise need the ability to interact with and manipulate objects in the physical world: humans see novel objects and know immediately, almost instinctively, how they would grab them to pick them up, while robotic grasp detection lags far behind human performance [2]. At a mean Average Precision (mAP) measured at 0.5 IOU, YOLOv3 is on par with Focal Loss while being about 4x faster [3]. Moreover, YOLOv3 trades off speed and accuracy simply by changing the size of the model [4], without the need for retraining. The objectives and benefits of this research are to train YOLO to recognize armed human (militia) objects in images and to help classify militia objects versus ordinary people. The contributions of this research are:
a. To detect human objects labeled not merely as "Person" but also as "militia", according to the given image.
b. To lay a foundation for future development, for example special auto-turret technology for militia detection: automatic weapons that can engage targets arranged by object category.

RESEARCH METHOD
The following picture is the system diagram of this study, running from the input image through to militia classification. In order to build up to object detection, we first learn about object localization.
a. Input Image
Start with the image classification task, where an algorithm looks at a picture and may be responsible for saying that it shows a man with a gun (militia) [1]. The next problem is classification with localization [5], which means we not only label the image as, say, militia or non-militia, but the algorithm is also responsible for putting a bounding box [6], drawing a red rectangle, around the position of the militia in the image. This is called the classification with localization problem [7], where the term localization refers to figuring out where in the picture the detected militia is.
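The training target for classification with localization can be sketched as a vector: a presence flag, four bounding-box numbers, and a class indicator. The layout below (p_c, b_x, b_y, b_h, b_w, plus a two-class one-hot for militia vs. ordinary person) is an illustrative assumption, not necessarily the exact encoding used in this system:

```python
# Hypothetical target vector for classification with localization:
# y = [p_c, b_x, b_y, b_h, b_w, c_militia, c_person], where p_c flags
# whether any object is present, (b_x, b_y) is the box centre and
# (b_h, b_w) its size (all relative to the image), and the last two
# entries one-hot encode militia vs. ordinary person.
def make_target(present, box=None, is_militia=False):
    if not present:
        # With p_c = 0 the remaining entries are "don't care"; use zeros.
        return [0.0] * 7
    bx, by, bh, bw = box
    c_mil, c_per = (1.0, 0.0) if is_militia else (0.0, 1.0)
    return [1.0, bx, by, bh, bw, c_mil, c_per]

# A militia centred slightly right of the middle of the frame:
y = make_target(True, box=(0.55, 0.5, 0.6, 0.3), is_militia=True)
```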
Jurnal Teknologi dan Manajemen Informatika (2020)
b. Sliding Windows
If we have a test image like this, we start by picking a certain window size [6], for example using an image of a man with a gun (militia). We then input a small rectangular region [9] into this ConvNet [8]. So, take just a red square region, input it into the ConvNet, and have the ConvNet make a prediction. Presumably, for that little region, it will say NO, the little red square does not contain a militia. In the Sliding Windows Detection algorithm [6], we then pass as input a second region, the red square shifted over a little, and feed that to the ConvNet. So we feed just the region of the image inside the red square to the ConvNet and run the ConvNet again, and likewise for the third region and so on.
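The window-by-window procedure described above can be sketched as follows; `classifier` stands in for the ConvNet, and the window size and stride are arbitrary illustrative values:

```python
def sliding_windows(image, win, stride):
    """Yield (row, col, crop) for every win x win window across the image."""
    rows, cols = len(image), len(image[0])
    for r in range(0, rows - win + 1, stride):
        for c in range(0, cols - win + 1, stride):
            crop = [row[c:c + win] for row in image[r:r + win]]
            yield r, c, crop

def detect(image, classifier, win, stride):
    """Run the classifier on every window; keep the positions it flags."""
    return [(r, c, win) for r, c, crop in sliding_windows(image, win, stride)
            if classifier(crop)]

# Stub classifier that flags nothing, mirroring the "NO militia" case above:
image = [[0] * 64 for _ in range(64)]
hits = detect(image, classifier=lambda crop: False, win=32, stride=16)
```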

Keep going until we have slid the window across every position in the image, passing lots of little cropped images of this size into the ConvNet and having it classify each position as 0 or 1 at some stride. Next, repeat the process with a larger window. This algorithm is called Sliding Windows Detection because we take these windows, square boxes [10], slide them across the entire image, and classify every square region at some stride as containing a militia or not. In particular, the Sliding Windows object detector can be implemented convolutionally, which is much more efficient [11].
c. People Detection
In order to build up to object (people) detection, we first do object localization.
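The efficiency of the convolutional implementation can be illustrated with a toy numpy sketch: running a single "classifier" filter convolutionally over a larger image scores every window in one pass, and each output cell equals the filter applied to the corresponding crop. The sizes here are arbitrary illustrative values:

```python
import numpy as np

def conv_valid(x, k):
    """Plain 'valid' 2-D cross-correlation (no padding, stride 1)."""
    H, W = x.shape
    kh, kw = k.shape
    out = np.empty((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k)
    return out

# A "classifier" reduced to one 8x8 filter. Applied convolutionally to a
# 14x14 image, the 7x7 output grid scores all 49 possible 8x8 windows in
# a single pass instead of 49 separate cropped forward passes.
rng = np.random.default_rng(0)
img = rng.random((14, 14))
filt = rng.random((8, 8))
scores = conv_valid(img, filt)
```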

Image of People Classification
There might be multiple objects [12] in the picture, and we have to detect and localize [13] them all; the system may need to detect not just people but other object categories as well. Classification with localization problems usually have one object, typically one large object in the middle of the image, that we are trying to recognize and localize [14]. In contrast, in the detection problem there can be multiple objects, possibly even multiple objects of different categories, within a single image. So the ideas learned for image classification are useful for classification with localization, and the ideas learned for localization in turn prove useful for detection [15].

Militia or non-militia classification with localization
For example, consider an input image for detection below. There might be multiple objects in the picture, and we have to detect and localize them all. In part 2 (Sliding Windows), we used a convolutional implementation of sliding windows, which is more computationally efficient, but it still has the problem of not quite outputting the most accurate bounding boxes. In this part, we look at how to make the bounding box predictions more accurate. In the earlier example, none of the boxes matched up perfectly with the position of the people; one box may be the best match, but the perfect bounding box is not even quite square. It is actually a slightly wider rectangle, with a slightly horizontal aspect ratio. So we need the algorithm to output more accurate bounding boxes. A good way to get more accurate bounding boxes is the YOLO algorithm. When it comes to deep learning-based object detection, there are three primary object detectors we will encounter:
• R-CNN and its variants, including the original R-CNN, Fast R-CNN, and Faster R-CNN
• Single Shot Detectors (SSDs)
• YOLO
R-CNNs are among the first deep learning-based object detectors and are an example of a two-stage detector.
1. In the first R-CNN publication [13], Girshick et al. proposed an object detector that required an algorithm such as Selective Search [16] (or equivalent) to propose candidate bounding boxes that could contain objects.
2. These regions were then passed into a CNN for classification, ultimately leading to one of the first deep learning-based object detectors.
The problem with the standard R-CNN method was that it was painfully slow and not a complete end-to-end object detector. The Fast R-CNN algorithm made considerable improvements to the original R-CNN, namely increasing accuracy and reducing the time taken by a forward pass; however, the model still relied on an external region proposal algorithm [17]. R-CNNs became a true end-to-end deep learning object detector by removing the Selective Search requirement and instead relying on a Region Proposal Network (RPN) that is (1) fully convolutional and (2) can predict object bounding boxes and "objectness" scores (i.e., a score quantifying how likely it is that a region of an image contains an object). The outputs of the RPN are then passed into the R-CNN component for final classification and labeling [18]. YOLO is a good example of a single-stage detector [19]: an object detector capable of super real-time detection, obtaining 45 FPS on a GPU. YOLO has gone through a number of iterations, including YOLOv2, capable of detecting over 9,000 object classes [20]. Redmon and Farhadi achieved such a large number of detectable classes by performing joint training for both object detection and classification, training YOLO9000 simultaneously on both the ImageNet classification dataset and the COCO detection dataset. The result is a YOLO model, called YOLO9000, that can predict detections for object classes that have no labeled detection data.
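The 0.5-IOU mAP criterion quoted earlier, and the notion of a bounding box "matching up" with the true object position, both rest on Intersection over Union (IoU). A minimal sketch:

```python
def iou(a, b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    # Overlap rectangle (may be empty, hence the max(0, ...) clamps).
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

print(iou((0, 0, 2, 2), (1, 1, 3, 3)))  # 1/7 ≈ 0.1429
```

A predicted box counts as correct at the 0.5-IOU threshold when its IoU with the ground-truth box is at least 0.5.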
YOLOv3 is significantly larger than previous models and is the best one yet out of the YOLO family of object detectors [3].
e. Militia Classification
For the box in the bottom left, the value of y output for that box would hopefully be something like zero for bounding box one, followed by a set of numbers that are effectively noise. Hopefully, the network also outputs a set of numbers specifying a reasonably accurate bounding box for the person/militia. That is how the neural network makes its predictions. Finally, we run the result through non-max suppression. Non-max suppression addresses the multiple-detection problem: one issue with object detection is that the algorithm may find multiple detections of the same object. Rather than detecting an object just once, it might detect it multiple times. Non-max suppression is a way to make sure the algorithm detects each object only once.
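Non-max suppression as described, greedily keeping the highest-scoring box and discarding boxes that overlap it too much, can be sketched as follows (the 0.5 IoU threshold is a common default, not necessarily this system's setting):

```python
def iou(a, b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union

def non_max_suppression(boxes, scores, iou_thresh=0.5):
    """Return the indices of boxes kept after greedy NMS."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)          # highest-scoring remaining box
        keep.append(best)
        # Suppress every remaining box that overlaps the kept one too much.
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_thresh]
    return keep

# Two overlapping detections of one person plus one separate detection:
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
kept = non_max_suppression(boxes, scores=[0.9, 0.8, 0.7])
# kept == [0, 2]: the duplicate detection (index 1) is suppressed
```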

RESULT AND DISCUSSION
Several stages of implementation need to be carried out for the two object classes, militia and non-militia: preparing a training dataset for each class, labeling the objects in the image dataset, training, and testing.

Dataset for training
The following are the datasets for militia and non-militia (person). Each class consists of 100 images: 100 images of militia and 100 images of non-militia.

Labeling
Given this labeled training set, we can then train a ConvNet that takes an image as input, such as one of these closely cropped images, and outputs y, 0 or 1: is there a militia or not. For the best test results, a minimum of around 10,000 epochs may be needed.
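Darknet-style YOLO training expects one text line per labeled object, in the form `class x_center y_center width height`, with all values normalized by the image dimensions. A small sketch of converting a pixel box into that format (the class ordering 0 = militia, 1 = non-militia is an assumption, not stated in the paper):

```python
def to_yolo_label(class_id, box, img_w, img_h):
    """Convert a pixel box (x1, y1, x2, y2) into a darknet-style
    'class x_center y_center width height' line, with coordinates
    expressed as fractions of the image size."""
    x1, y1, x2, y2 = box
    xc = (x1 + x2) / 2 / img_w
    yc = (y1 + y2) / 2 / img_h
    w = (x2 - x1) / img_w
    h = (y2 - y1) / img_h
    return f"{class_id} {xc:.6f} {yc:.6f} {w:.6f} {h:.6f}"

# Assumed class 0 = militia; a 200x300-pixel box in a 400x400 image:
line = to_yolo_label(0, (100, 50, 300, 350), img_w=400, img_h=400)
```

One such line per object is written to a `.txt` file sharing the image's base name.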

CONCLUSION
For accuracy, our model achieved 0.803, which means it is approximately 80% accurate. For precision, we obtained 0.788, which is fairly good. For recall, we
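The reported figures follow the standard definitions of accuracy, precision, and recall over true/false positives and negatives; the counts below are illustrative only, not the experiment's actual confusion matrix:

```python
def metrics(tp, fp, fn, tn):
    """Standard classification metrics from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)  # all correct / all cases
    precision = tp / (tp + fp)   # correct militia calls / all militia calls
    recall = tp / (tp + fn)      # militia found / all militia present
    return accuracy, precision, recall

# Illustrative counts only (200 test cases):
acc, prec, rec = metrics(tp=80, fp=22, fn=18, tn=80)
```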