



Indonesian Food Identification and Detection in the Smart Nutrition Box using Faster-RCNN

Sigit [email protected]

Faculty of Computer Science,Brawijaya UniversityMalang, Indonesia

Putra Pandu [email protected]

Faculty of Computer Science,Brawijaya UniversityMalang, Indonesia

Edy [email protected]

Faculty of Computer Science,Brawijaya UniversityMalang, Indonesia

Restu [email protected]

Faculty of Computer Science,Brawijaya UniversityMalang, Indonesia

Kresentia [email protected] of Computer Science,

Brawijaya UniversityMalang, Indonesia

Kenza Dwi [email protected] of Computer Science,

Brawijaya UniversityMalang, Indonesia

ABSTRACT
Food detection and localization are useful for recognizing consumer preference and the amount of consumption. Ultimately, they can be integrated into food production to reduce oversupply and food waste. In this paper, the Faster R-CNN model is employed to locate and classify food objects in a tray box as a first stage for food loss quantification. To this end, the model was trained on images scraped from the Internet. Performance evaluation was then conducted on both complex images and tray box images. On the tray box images, the model achieved an Average Precision of 0.455 and an Average Recall of 0.628 for IoU 0.50:0.95.

CCS CONCEPTS
• Computing methodologies → Object detection; Neural networks.

KEYWORDS
food detection, Faster R-CNN, deep learning

ACM Reference Format:
Sigit Adinugroho, Putra Pandu Adikara, Edy Santoso, Restu Amara, Kresentia Septiana, and Kenza Dwi Anggita. 2020. Indonesian Food Identification and Detection in the Smart Nutrition Box using Faster-RCNN. In 5th International Conference on Sustainable Information Engineering and Technology (SIET '20), November 16–17, 2020, Malang, Indonesia. ACM, New York, NY, USA, 5 pages. https://doi.org/10.1145/3427423.3427429

1 INTRODUCTION
Food waste is a serious problem worldwide. Globally, the Food and Agriculture Organization (FAO) estimates that one-third of global food production, equal to about 1.3 billion tons, ends

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
SIET '20, November 16–17, 2020, Malang, Indonesia
© 2020 Association for Computing Machinery.
ACM ISBN 978-1-4503-7605-1/20/09...$15.00
https://doi.org/10.1145/3427423.3427429

up as waste [6]. Indonesia ranks second, after Saudi Arabia, in terms of the amount of food wasted; on average, an Indonesian discards 300 kilograms of food annually [1]. Apart from the consumption process, food waste is also caused by inefficient food processing [13]. Within the consumption process, the primary sources of food waste are expired food and oversupply [15].

Information about the type and amount of food is useful for managing food production so that oversupply can be prevented. The information is also helpful for nutrition counseling and for providing appropriate food for particular needs.

The conventional method for analyzing food leftovers is the Comstock method. It is a visual method that requires a trained expert to assess the amount of leftovers on a plate using one of six scales: 0%, 25%, 50%, 75%, 90%, or 100% [14]. However, the Comstock method has several drawbacks: a trained expert is required, and an observer's accuracy may be diminished by fatigue. An alternative method has been developed by the Computer Vision Research Group, Faculty of Computer Science, Brawijaya University, to serve a similar function to the Comstock method in an automated way. It employs image segmentation in a device called the Smart Nutrition Box to estimate the amount of food remaining in a food tray [17]. One limitation of the design is its inability to recognize the name of the food, so a user must manually assign a label to each compartment in a tray.

Image recognition, specifically for recognizing food, is available in various techniques. A common approach is to extract features from an image and feed them to a machine learning algorithm to train a model, followed by segmentation to mark areas in the image using the trained model. Zhang et al. use Histogram of Oriented Gradients (HOG) and Scale-Invariant Feature Transform (SIFT) as features for an SVM classifier, while selective search is used to extract regions for classification. The method achieves above 85% accuracy for recognizing 15 types of food [18]. Although the method achieves such a high degree of accuracy, it needs predefined types of features that may not suit all types of food.

A type of deep learning method, the Convolutional Neural Network (CNN), is able to generate appropriate features through its convolutional layers; thus, handcrafted features are no longer required.




In general, this method outperforms other methods in image classification. Nevertheless, in object recognition problems, a plain CNN cannot locate objects in an image [7]. The R-CNN method generates around 2000 regions automatically using a bottom-up technique; these regions are then passed to a CNN to generate features that are further classified by an SVM [9]. This method is known to be complicated, since three different models need to be trained, which leads to high computation and detection time. Fast R-CNN resolves the problem by avoiding per-region processing [8]. Fast R-CNN processes an entire image through several pairs of convolution and max-pooling layers. Then, the Region of Interest (RoI) pooling layer reshapes each region to produce a fixed-length feature. After that, the features are forwarded to fully connected layers with a softmax activation function for recognition. The next enhancement is Faster R-CNN [16]. This architecture involves a deep convolutional network to generate preliminary regions and R-CNN for region recognition. Faster R-CNN is well known for its accuracy and speed. However, we only examine Faster R-CNN's accuracy in detecting food in an image; speed comparison with other methods is not covered in this research.

The performance of an image recognition model is highly influenced by the quantity and quality of training data. Specific to food recognition, the process demands training data that contain the types of foods and their locations in the training images. At the moment, there are various food-themed datasets, such as Food101 [3], Egocentric Food [2], and UECFood256 [11]. Despite their broad coverage, those datasets do not contain Indonesian food, so they cannot be used to train a network aiming to recognize Indonesian food. Therefore, building a dataset of Indonesian food is an additional aspect of this research.

The rest of this paper is organized as follows. The details of the dataset and the methods used in this research are explained in Section 2. Section 3 provides the results of the experiments. Conclusions and future work are presented in the last section.

2 DATASET AND METHOD
2.1 Overview of the Dataset
Our dataset consists of hand-labeled image data. First, a set of images was downloaded from the Google image search engine using food names as queries. Then, each image was examined to check whether it matched the query. Any improper image was rejected; for instance, the query "fried rice" may return an instant seasoning product rather than the final dish. In order to reduce training time, the crawled images were resized: if a dimension of an image (either height or width) exceeds 512 pixels, that dimension is squeezed to 512 pixels while the other is resized proportionally. After all images were acquired, they were labeled using CVAT 1 with four-point rectangular markers. Each rectangle covers the minimum possible entity of a food. For instance, when labeling fried tofu, a rectangle covers a single piece of tofu, while for grouped food such as rice the same technique cannot be applied; for this kind of food, a rectangle covers a blob of food. Due to limited resources for labeling the images, only a sample of objects was labeled in each image. A sample of the images gathered from the Internet is given in Figure 1a.

1https://github.com/opencv/cvat/
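As a side note, the resizing rule described above can be expressed in a few lines of Python. This is a minimal sketch assuming the Pillow library; the function name and file paths are illustrative, not the exact script used to prepare the dataset.

```python
from PIL import Image

MAX_SIDE = 512  # longest allowed dimension, as described in Section 2.1

def resize_max_side(path_in, path_out, max_side=MAX_SIDE):
    """Shrink an image so that neither side exceeds max_side pixels,
    keeping the aspect ratio; smaller images are saved unchanged."""
    img = Image.open(path_in)
    width, height = img.size
    longest = max(width, height)
    if longest > max_side:
        scale = max_side / longest
        img = img.resize((round(width * scale), round(height * scale)))
    img.save(path_out)
```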

Figure 1: Two types of images used in this research. (a) Sample image from the Internet; (b) image of food in a tray box.

The types of food and their corresponding numbers of labeled objects are presented in Table 1.

Table 1: Categories of food and the number of labeled objects

No  Food category                                Number of labeled objects
1   Ayam bakar (roasted chicken)                 126
2   Ayam goreng (fried chicken)                  201
3   Bakwan (corn fritters)                       181
4   Kering tempe (sweet and spicy fried tempeh)  78
5   Lumpia (spring roll)                         151
6   Mie goreng (fried noodle)                    113
7   Nasi goreng (fried rice)                     56
8   Nasi kuning (yellow rice)                    48
9   Nasi putih (rice)                            135
10  Perkedel (potato fritters)                   131
11  Rendang (beef rendang)                       97
12  Sate (satay)                                 137
13  Tahu goreng (fried tofu)                     157
14  Telur mata sapi (sunny side up)              104
15  Telur dadar (omelet)                         147
16  Tempe goreng (fried tempeh)                  132
17  Urap (salad with coconut dressing)           134
Total                                            2128

There are 1391 images obtained from the Google search engine that are used for training and testing. 80% of the images are used for training, while the remaining images are reserved for testing. During the testing phase, a second set of images is employed to measure the performance of the model. These images contain food placed in a tray box, captured by a digital camera located inside the Smart Nutrition Box; there are 32 food objects in 10 images captured under this condition. Figure 1b depicts an example of a food image taken in a tray box.
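For reproducibility, the 80/20 partition can be obtained with a simple random split of the downloaded image list. The sketch below is illustrative only; the directory name, file extension, and random seed are assumptions, not details from the paper.

```python
import random
from pathlib import Path

def split_dataset(image_dir, train_ratio=0.8, seed=0):
    """Randomly partition the labeled images into training and test lists."""
    images = sorted(Path(image_dir).glob("*.jpg"))
    random.Random(seed).shuffle(images)
    n_train = int(len(images) * train_ratio)
    return images[:n_train], images[n_train:]

# Example: train_files, test_files = split_dataset("indonesian_food_images")
```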

2.2 Faster R-CNN for Object Detection
The object detection task consists of two main objectives: object localization and recognition. In the Faster Region-based Convolutional Neural Network (Faster R-CNN) [16], those objectives are translated into two main tasks:




(1) Region of Interest (RoI) extraction. A RoI is an area of an image that may contain an object. For Faster R-CNN, around 2000 RoIs are extracted from each training image.

(2) Classification. In this step, each RoI is refined to better fit the object and is assigned a label from a predefined set of classes.

RoI extraction from an image is conducted using a Region Proposal Network (RPN), which replaces the selective search used in earlier models. The RPN accepts an image as input and outputs several RoIs together with their corresponding object probabilities. The step-by-step process in an RPN is as follows (a minimal sketch of the sibling heads is given after the list):

(1) The RPN receives an image and passes it through several convolutional layers.

(2) The convolutional layers emit a feature volume that is further processed by a convolutional filter of size 3×3×D, where D represents the depth of the volume.

(3) For each position of the feature volume, the filter generates a 1×D vector.

(4) RoI coordinates and their probability values are calculated by two sibling 1×1 convolutional layers.
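The sketch below makes the sliding 3×3 filter and the two sibling 1×1 layers concrete using Keras. The number of anchors per location (k = 9), the 512-filter width, and the layer names are assumptions for illustration, not values reported in this paper.

```python
import tensorflow as tf

def rpn_head(feature_map, k=9):
    """Minimal RPN head: a 3x3 convolution slides over the backbone feature
    volume, and two sibling 1x1 convolutions emit, for each of the k anchors
    at every position, an objectness score pair and four box offsets."""
    shared = tf.keras.layers.Conv2D(
        512, 3, padding="same", activation="relu", name="rpn_conv_3x3")(feature_map)
    objectness = tf.keras.layers.Conv2D(2 * k, 1, name="rpn_cls")(shared)  # object / not object
    box_deltas = tf.keras.layers.Conv2D(4 * k, 1, name="rpn_reg")(shared)  # box regression offsets
    return objectness, box_deltas
```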

During training, the RPN aims to minimize the loss function represented by Equation 1.

L(\{p_i\}, \{t_i\}) = \frac{1}{N_{cls}} \sum_i L_{cls}(p_i, p_i^*) + \lambda \frac{1}{N_{reg}} \sum_i p_i^* L_{reg}(t_i, t_i^*)   (1)

where p_i is the predicted object probability of anchor i, p_i^* is its ground-truth label, t_i and t_i^* are the predicted and ground-truth box parameters, N_cls and N_reg are normalization terms, and λ is a balancing weight.

The classification stage obtains a set of RoIs from the RPN along with their feature volumes. Classification is conducted by several layers that label and refine each RoI; the architecture of these layers may vary. The classification part consists of the following steps:

(1) Receive the feature map and RoIs from the RPN.

(2) Resize each RoI to fit the fully connected layers.

(3) Apply the fully connected layers to generate a feature vector.

(4) Apply two sibling output layers that handle classification and RoI refinement, respectively.

The classification process tries to minimize the loss function represented by Equation 2.

L(p, u, t^u, v) = L_{cls}(p, u) + \lambda [u \geq 1] L_{loc}(t^u, v)   (2)

where p is the predicted class distribution, u is the ground-truth class, t^u is the predicted box for class u, v is the ground-truth box, and [u ≥ 1] is an indicator that excludes the background class.
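Read this way, Equation 2 is a cross-entropy term plus a box-regression term that is switched off for background RoIs (u = 0). The TensorFlow sketch below illustrates that structure; the use of a smooth L1 regression loss and the tensor layout are assumptions borrowed from the Fast R-CNN formulation, not details stated in this paper.

```python
import tensorflow as tf

def smooth_l1(x):
    """Smooth L1 loss, commonly used for box regression in Fast/Faster R-CNN."""
    absx = tf.abs(x)
    return tf.where(absx < 1.0, 0.5 * x * x, absx - 0.5)

def detection_loss(class_logits, class_labels, box_preds, box_targets, lam=1.0):
    """Per-RoI loss in the spirit of Equation 2:
    L = L_cls(p, u) + lam * [u >= 1] * L_loc(t^u, v).
    box_preds holds the predicted offsets for each RoI's ground-truth class."""
    cls_loss = tf.nn.sparse_softmax_cross_entropy_with_logits(
        labels=class_labels, logits=class_logits)
    is_foreground = tf.cast(class_labels >= 1, tf.float32)  # indicator [u >= 1]
    loc_loss = tf.reduce_sum(smooth_l1(box_preds - box_targets), axis=-1)
    return tf.reduce_mean(cls_loss + lam * is_foreground * loc_loss)
```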

Training of the RPN and classification parts of Faster R-CNN is handled by so-called 4-step alternating training, which can be simplified as follows:

(1) Train the RPN to produce RoIs.

(2) Train the classification network using the RoIs as inputs. At the end of this step, the RPN and classification networks have different weights since they are trained separately.

(3) Replace the weights of the convolutional layers in the RPN with those from the classification network, then train the last layers of the RPN without changing the weights of its convolutional layers.

(4) Train the last layers of the classification network using the output of the RPN.

2.3 Evaluation Method
In order to examine the performance of the implemented method, several object detection performance metrics adapted from the PASCAL VOC Challenge 2010 and the COCO dataset are evaluated. For better understanding, several key concepts of the evaluation metrics are explained here.

Intersection over Union (IoU) is defined as the ratio of two areas: the intersection of two bounding boxes (detection box B_p and ground truth box B_gt) and the union of those boxes. This metric measures how much the detection and ground-truth bounding boxes overlap. IoU is expressed mathematically as:

IoU = \frac{area(B_p \cap B_{gt})}{area(B_p \cup B_{gt})}   (3)
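Equation 3 translates directly into code. The following minimal sketch assumes boxes in (x1, y1, x2, y2) corner format; it only illustrates the metric and is not taken from the authors' implementation.

```python
def iou(box_p, box_gt):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_p[0], box_gt[0]), max(box_p[1], box_gt[1])
    ix2, iy2 = min(box_p[2], box_gt[2]), min(box_p[3], box_gt[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_p = (box_p[2] - box_p[0]) * (box_p[3] - box_p[1])
    area_gt = (box_gt[2] - box_gt[0]) * (box_gt[3] - box_gt[1])
    union = area_p + area_gt - inter
    return inter / union if union > 0 else 0.0
```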

A detection bounding box is declared a True Positive (TP) if its confidence score exceeds a threshold and its IoU exceeds a threshold; otherwise, it is considered a False Positive (FP).

In typical machine learning classification tasks, two measures, called Precision and Recall, are derived from the counts of True Positives, False Positives, and False Negatives [4]. Precision is defined as the ratio of correctly predicted positive instances to all predicted positives, while Recall is the ratio of correctly predicted positive instances to all actual positives. Precision and Recall are defined in Equation 4 and Equation 5, respectively.

precision = \frac{TP}{TP + FP}   (4)

recall = \frac{TP}{TP + FN}   (5)

Precision and Recall can be mapped into a precision-recall curve to examine the association between the two measures. Based on the curve, the Average Precision (AP) is derived. AP is the average of precision across all recall values [5]. Typically, the recall values are equally spaced from 0 to 1, so the formula for AP is written as:

AP = \frac{1}{11} \sum_{r \in \{0, 0.1, \ldots, 1\}} p_{interp}(r)   (6)

p_{interp} refers to the interpolated precision at each recall value r, obtained by taking the maximum precision found at recall values greater than or equal to r, as denoted in Equation 7:

p_{interp}(r) = \max_{r' \geq r} p(r')   (7)

AP only reflects the precision of a single class. For detection over K classes, the mean Average Precision (mAP) is used. mAP is defined as follows:

mAP = \frac{\sum_{i=1}^{K} AP_i}{K}   (8)
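As an illustration, the 11-point interpolation in Equations 6 and 7 and the class averaging in Equation 8 can be sketched as follows. The recall and precision inputs are assumed to be arrays derived from detections sorted by descending confidence; this is a sketch of the metric, not the evaluation code used in this work.

```python
import numpy as np

def average_precision_11pt(recall, precision):
    """11-point interpolated AP (Equations 6 and 7): for each recall level r
    in {0, 0.1, ..., 1}, take the maximum precision observed at recall >= r."""
    recall, precision = np.asarray(recall), np.asarray(precision)
    ap = 0.0
    for r in np.linspace(0.0, 1.0, 11):
        mask = recall >= r
        p_interp = precision[mask].max() if mask.any() else 0.0
        ap += p_interp / 11.0
    return ap

def mean_average_precision(ap_per_class):
    """mAP (Equation 8): the mean of the per-class AP values."""
    return float(np.mean(ap_per_class))
```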

Average Recall (AR) is defined as the recall averaged over all IoU ∈ [0.5, 1] [10]. The formula for AR is shown in Equation 9:

AR = 2 \int_{0.5}^{1} recall(o) \, do = \frac{2}{n} \sum_{i=1}^{n} \max(IoU(gt_i) - 0.5, 0)   (9)

where o is the IoU, recall(o) is the recall at that IoU, n is the number of ground-truth objects, and IoU(gt_i) is the IoU between ground truth gt_i and its matched detection result.

The PASCAL VOC Challenge 2010 measures the performance of an object detection method based on AP@0.5, i.e., AP for IoU ≥ 0.5, for each object category. It also measures mAP@0.5,




which is simply the average of AP@0.5 over all categories. Unlike PASCAL VOC, the COCO dataset [12] measures several mAP values:

• mAP@IoU=.50:.05:.95 is the mAP averaged over 10 IoU thresholds (from 0.5 to 0.95 in increments of 0.05).

• mAP@IoU=0.5 is the mAP at the 0.5 threshold. This metric is equal to PASCAL's mAP@0.5.

• mAP@IoU=0.75 is the mAP at the 0.75 threshold.

Apart from mAP, the COCO challenge also measures Average Recall (AR) for IoU ranging from 0.5 to 0.95 (AR@IoU=.50:.05:.95).
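These COCO metrics are typically computed with the pycocotools library rather than by hand. The sketch below shows the standard evaluation flow; the annotation and result file names are placeholders, not files from this work.

```python
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

# Ground-truth annotations and detections in COCO JSON format
# (both file names are placeholders).
coco_gt = COCO("annotations.json")
coco_dt = coco_gt.loadRes("detections.json")

coco_eval = COCOeval(coco_gt, coco_dt, iouType="bbox")
coco_eval.evaluate()
coco_eval.accumulate()
coco_eval.summarize()  # prints AP@[.50:.95], AP@.50, AP@.75 and AR values
```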

2.4 Experiment Settings
Our experiments were conducted using the TensorFlow Object Detection API 2 on TensorFlow 2.2.0. Because of a limitation in our hardware, which is discussed below, we selected ResNet50 with a 640×640 image input as the convolutional architecture. The network was obtained from the TensorFlow 2 Detection Model Zoo and had been pre-trained on the COCO 2017 dataset.

As mentioned before, the model was trained on a consumer computer with limited capability. The GPU at the core of the training process was an Nvidia RTX 2060 with 6 GB of memory. With this GPU, we were only able to set the batch size to 2; any higher value resulted in an out-of-memory error. We set the number of training steps to 35,000. Other parameters were left at their default values.
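For reference, a model trained and exported with the TensorFlow Object Detection API can be loaded as a SavedModel and applied to a tray box image roughly as follows. This is a sketch of the usual inference pattern; the export path, image name, and confidence threshold are assumptions, not details from the paper.

```python
import numpy as np
import tensorflow as tf
from PIL import Image

# Path to the exported SavedModel directory (placeholder).
detect_fn = tf.saved_model.load("exported_model/saved_model")

image = np.array(Image.open("tray_box_sample.jpg").convert("RGB"))
input_tensor = tf.convert_to_tensor(image[np.newaxis, ...], dtype=tf.uint8)

detections = detect_fn(input_tensor)
boxes = detections["detection_boxes"][0].numpy()    # normalized [ymin, xmin, ymax, xmax]
scores = detections["detection_scores"][0].numpy()
classes = detections["detection_classes"][0].numpy().astype(int)

keep = scores >= 0.5  # confidence threshold of 0.5 is an assumption
print(list(zip(classes[keep], scores[keep])))
```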

3 EXPERIMENTAL RESULTS
The performance of the model is first evaluated using the test split of the dataset. This evaluation scheme is intended to assess how well the model works on general food images. Table 2 shows the performance evaluation using metrics derived from the COCO and PASCAL VOC challenges. The Average Precision drops slightly as the IoU becomes stricter, from 0.5 to 0.75; the recorded decline is only 17.44%. The reduction is a normal phenomenon caused by the stricter requirement for judging a detection as a True Positive: a higher IoU means that, for a detection result to be marked as a True Positive, it requires a wider overlapping area with the corresponding ground truth. Thus, the number of True Positives is lower, leading to a lower Average Precision.

The evaluation result also reveals that there is a significant gap between Average Precision and Average Recall at the same IoU range of 0.50:0.95. The Average Recall is 0.67, while the Average Precision is only about a third of it. From a detection perspective, high recall accompanied by low precision means that the detection method is good at finding objects but often fails when determining the type of object. In other words, the localization performs well, but that is not the case for the classification.

To understand the reason for the low Average Precision, the Average Precision of each food category is presented in Table 3, since the overall Average Precision is the mean of the Average Precision over all classes. The table uncovers several food categories with an Average Precision lower than 0.1: bakwan, tahu goreng, and tempe goreng. Figure 2 depicts examples of ground truth images and detection results from those categories. While the ground truths contain multiple objects, the detection process was only able

2https://github.com/tensorflow/models/tree/master/research/object_detection

Table 2: Precision and Recall at Various IoU

Metrics               Value
AP@IoU=0.50:0.95      0.265
AP@IoU=0.5            0.367
AP@IoU=0.75           0.303
AR@IoU=0.50:0.95      0.670

to locate one object in each image, even though multiple similar objects are present. This lack of ability to extract general characteristics from the training data indicates that the model suffers from overfitting. However, the model does not overfit for food that is served as a single object, such as mie goreng, nasi putih, rendang, sate, and urap, as these classes have high Average Precision.

Table 3: Average Precision for Each Food Class

No  Class             AP@IoU=0.5
1   Ayam bakar        0.180647
2   Ayam goreng       0.107065
3   Bakwan            0.062225
4   Kering tempe      0.403315
5   Lumpia            0.131284
6   Mie goreng        0.607228
7   Nasi goreng       0.31958
8   Nasi kuning       0.208186
9   Nasi putih        0.791951
10  Perkedel          0.103436
11  Rendang           0.869433
12  Sate              0.719913
13  Tahu goreng       0.085623
14  Telur mata sapi   0.36572
15  Telur dadar       0.309334
16  Tempe goreng      0.069806
17  Urap              0.910522

The second stage of performance evaluation involves images of food in tray boxes. Table 4 highlights the Average Precision and Average Recall on that dataset. Compared to the test data, the results show a large improvement in Average Precision: the metric surges by 71.70%, 73.84%, and 87.46% for IoU=0.50:0.95, IoU=0.5, and IoU=0.75, respectively. Nevertheless, the Average Recall goes in the opposite direction, declining slightly by a mere 6.27%. The overall performance jump is attributable to the simplicity of the data: unlike the images taken from the Internet, a captured tray box image contains only a single type of food inside each compartment of the tray box. This condition differs greatly from the images gathered from the Internet, which contain multiple objects inside a container or on a plate.

4 CONCLUSION
This research implements Faster R-CNN object detection to locate and classify food objects in tray box images. Limited to food in tray boxes, the model achieves an Average Precision of 0.455 and an Average Recall of 0.628 for IoU 0.50:0.95.




Figure 2: Examples of ground truth images and detection results for the bakwan, tahu goreng, and tempe goreng categories. Panels: (a) ground truth (bakwan); (b) detection (bakwan); (c) ground truth (tahu goreng); (d) detection (tahu goreng); (e) ground truth (tempe goreng); (f) detection (tempe goreng).

Table 4: Precision and Recall for Tray Box Images

Metrics               Value
AP@IoU=0.50:0.95      0.455
AP@IoU=0.5            0.638
AP@IoU=0.75           0.568
AR@IoU=0.50:0.95      0.628

However, for general-purpose food object detection, more work should be conducted to improve the precision by adding more training data. Based on the results, the objects that are difficult to recognize are those where multiple objects of the same type of food appear in an image. Thus, more bounding boxes should be extracted from such images to enrich the training data.

ACKNOWLEDGMENTS
The authors would like to thank the Faculty of Computer Science, Brawijaya University, for providing support and funding for this research.

REFERENCES
[1] 2016. Food Waste Infographic. https://foodsustainability.eiu.com/food-loss-and-waste/
[2] Marc Bolaños and Petia Radeva. 2016. Simultaneous food localization and recognition. In 2016 23rd International Conference on Pattern Recognition (ICPR). 3140–3145. https://doi.org/10.1109/ICPR.2016.7900117
[3] Lukas Bossard, Matthieu Guillaumin, and Luc Van Gool. 2014. Food-101 – Mining Discriminative Components with Random Forests. In Computer Vision – ECCV 2014 (Lecture Notes in Computer Science), David Fleet, Tomas Pajdla, Bernt Schiele, and Tinne Tuytelaars (Eds.). Springer International Publishing, Cham, 446–461. https://doi.org/10.1007/978-3-319-10599-4_29
[4] D.M.W. Powers. 2011. Evaluation: From Precision, Recall and F-Measure to ROC, Informedness, Markedness & Correlation. Journal of Machine Learning Technologies 2, 1 (2011), 37–63. http://bioinfopublication.org/viewhtml.php?artid=BIA0001114
[5] Mark Everingham and John Winn. 2011. The PASCAL Visual Object Classes Challenge 2012 (VOC2012) Development Kit. Pattern Analysis, Statistical Modelling and Computational Learning, Tech. Rep 8 (2011).
[6] FAO. 2011. Global food losses and food waste – Extent, causes and prevention. SAVE FOOD: An Initiative on Food Loss and Waste Reduction (2011).
[7] Spyros Gidaris and Nikos Komodakis. 2015. Object Detection via a Multi-Region and Semantic Segmentation-Aware CNN Model. In Proceedings of the IEEE International Conference on Computer Vision (ICCV).
[8] Ross Girshick. 2015. Fast R-CNN. (April 2015). https://arxiv.org/abs/1504.08083v2
[9] Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. 2014. Rich feature hierarchies for accurate object detection and semantic segmentation. arXiv:1311.2524 [cs] (Oct. 2014). http://arxiv.org/abs/1311.2524
[10] Jan Hosang, Rodrigo Benenson, Piotr Dollár, and Bernt Schiele. 2015. What makes for effective detection proposals? IEEE Transactions on Pattern Analysis and Machine Intelligence 38, 4 (2015), 814–830.
[11] Yoshiyuki Kawano and Keiji Yanai. 2015. Automatic Expansion of a Food Image Dataset Leveraging Existing Categories with Domain Adaptation. In Computer Vision – ECCV 2014 Workshops (Lecture Notes in Computer Science), Lourdes Agapito, Michael M. Bronstein, and Carsten Rother (Eds.). Springer International Publishing, Cham, 3–17. https://doi.org/10.1007/978-3-319-16199-0_1
[12] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. 2014. Microsoft COCO: Common Objects in Context. In European Conference on Computer Vision. Springer, 740–755.
[13] Piergiuseppe Morone, Apostolis Koutinas, Nicholas Gathergood, Mehrdad Arshadi, and Avtar Matharu. 2019. Food waste: Challenges and opportunities for enhancing the emerging bio-economy. Journal of Cleaner Production 221 (June 2019), 10–16. https://doi.org/10.1016/j.jclepro.2019.02.258
[14] Maxim Parent, Helen Niezgoda, Heather H. Keller, Larry W. Chambers, and Shauna Daly. 2012. Comparison of Visual Estimation Methods for Regular and Modified Textures: Real-Time vs Digital Imaging. Journal of the Academy of Nutrition and Dietetics 112, 10 (Oct. 2012), 1636–1641. https://doi.org/10.1016/j.jand.2012.06.367
[15] Julian Parfitt, Mark Barthel, and Sarah Macnaughton. 2010. Food waste within food supply chains: quantification and potential for change to 2050. Philosophical Transactions of the Royal Society B: Biological Sciences 365, 1554 (Sept. 2010), 3065–3081. https://doi.org/10.1098/rstb.2010.0126
[16] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. 2016. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. arXiv:1506.01497 [cs] (Jan. 2016). http://arxiv.org/abs/1506.01497
[17] Yuita Arum Sari, Ratih Kartika Dewi, Jaya Mahar Maligan, Anindya Sasri Ananta, and Sigit Adinugroho. 2019. Automatic Food Leftover Estimation in Tray Box Using Image Segmentation. In 2019 International Conference on Sustainable Information Engineering and Technology (SIET). 212–216. https://doi.org/10.1109/SIET48054.2019.8986104
[18] Weiyu Zhang, Qian Yu, Behjat Siddiquie, Ajay Divakaran, and Harpreet Sawhney. 2015. "Snap-n-Eat": Food Recognition and Nutrition Estimation on a Smartphone. Journal of Diabetes Science and Technology 9, 3 (May 2015), 525–533. https://doi.org/10.1177/1932296815582222
