
Proceedings of the 23rd Annual Meeting of the Association for Natural Language Processing (March 2017)

Generating Video Description using RNN with Semantic Attention

Natsuda Laokulrat (1), Naoaki Okazaki (2,1) and Hideki Nakayama (3,1)

(1) AIRC, AIST  (2) Tohoku University  (3) The University of Tokyo

Abstract

Being able to understand videos can have a great impact and can be useful to many other applications. However, descriptions generated by computers often fail to mention the correct objects appearing in videos. This work aims to alleviate this problem by including external fine-grained visual information detected from all video frames. In this paper, we propose an LSTM-based sequence-to-sequence model with semantic attention for video description generation. The results show that using semantic attention to selectively focus on external fine-grained visual information can guide the system to correctly mention objects in videos and improve the quality of video descriptions.

1 Introduction

Automatic video description generation has been tackled with combinations of RNNs and CNNs. Venugopalan et al. (2015b) proposed the first end-to-end system to translate a video into natural language by extending the CNN-RNN encoder-decoder framework for image captioning proposed by Vinyals et al. (2014) to generate descriptions for videos. They performed mean pooling over the CNN feature vectors of the frames to obtain a single vector representation for a video, and then used that vector as input to the RNN decoder to generate a sentence.

Later, they proposed an RNN-based sequence-to-sequence model for generating descriptions of videos (Venugopalan et al., 2015a). They used two layers of RNNs for both encoding the videos and decoding into sentences, so their model is able to learn both the temporal structure of a sequence of video frames and a sequence model for generating sentences.

However, one problem of video description generation is that descriptions generated by computers often fail to mention the correct objects and actions appearing in videos. Inspired by the image captioning model with semantic attention proposed by You et al. (2016), in this paper we present a sequence-to-sequence encoder-decoder model with a semantic attention mechanism, which is a novel approach to integrating fine-grained visual information appearing in video frames to help the model generate descriptions. The results show that the semantic attention mechanism can guide the system to correctly mention objects and actions and improve the quality of video descriptions.

2 LSTM encoder-decoder model

Given an input x_t at time step t, one unit of an LSTM can be formulated as

\begin{align}
i_t &= \mathrm{sigmoid}(W_{xi} x_t + W_{hi} h_{t-1} + b_i) \nonumber \\
f_t &= \mathrm{sigmoid}(W_{xf} x_t + W_{hf} h_{t-1} + b_f) \nonumber \\
o_t &= \mathrm{sigmoid}(W_{xo} x_t + W_{ho} h_{t-1} + b_o) \nonumber \\
g_t &= \tanh(W_{xg} x_t + W_{hg} h_{t-1} + b_g) \nonumber \\
c_t &= f_t \odot c_{t-1} + i_t \odot g_t \nonumber \\
h_t &= o_t \odot \tanh(c_t) \tag{1}
\end{align}

where i_t, f_t, and o_t are the input, forget, and output gates. The symbol ⊙ denotes element-wise multiplication. W_{xi}, W_{hi}, W_{xf}, W_{hf}, W_{xo}, W_{ho}, W_{xg}, W_{hg} and b_i, b_f, b_o, b_g are the parameters to be learned during training. h_t is the hidden state at time step t, which is passed as input to the LSTM unit at the next time step.
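As a minimal illustrative sketch of Eq. (1) — not the actual implementation, which uses Chainer — one LSTM step can be written in NumPy as follows; the random initialization and toy dimensions are only there to make the example self-contained.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM unit as in Eq. (1): gates, cell update, hidden state."""
    W_xi, W_hi, b_i = params["i"]
    W_xf, W_hf, b_f = params["f"]
    W_xo, W_ho, b_o = params["o"]
    W_xg, W_hg, b_g = params["g"]

    i_t = sigmoid(W_xi @ x_t + W_hi @ h_prev + b_i)   # input gate
    f_t = sigmoid(W_xf @ x_t + W_hf @ h_prev + b_f)   # forget gate
    o_t = sigmoid(W_xo @ x_t + W_ho @ h_prev + b_o)   # output gate
    g_t = np.tanh(W_xg @ x_t + W_hg @ h_prev + b_g)   # candidate cell state

    c_t = f_t * c_prev + i_t * g_t                    # element-wise products
    h_t = o_t * np.tanh(c_t)
    return h_t, c_t

# Toy usage with random parameters (illustrative only, not trained weights).
d_in, d_hid = 512, 1000
rng = np.random.default_rng(0)
params = {k: (rng.standard_normal((d_hid, d_in)) * 0.01,
              rng.standard_normal((d_hid, d_hid)) * 0.01,
              np.zeros(d_hid)) for k in "ifog"}
h, c = np.zeros(d_hid), np.zeros(d_hid)
h, c = lstm_step(rng.standard_normal(d_in), h, c, params)
```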

2.1 Non-attention model

Figure 1 depicts our two-layer LSTM model for generating a description sentence from a video. A video is given as a sequence of frames V = {v_1, v_2, ..., v_n}, where the video V has n frames and v_t is the t-th frame of the video.


Figure 1: System architecture of our model. (Figure labels: visual concept detector; detected concepts "cook, bowl, kitchen, woman, oven"; context vector; concatenation; output "<BOS> a woman is cooking in the kitchen <EOS>".) In the figure, we omit the image embedding layer, the word embedding layer, and the softmax layer due to the space constraint.

The input frames x_t can be described as

\begin{equation}
x_t =
\begin{cases}
v_t, & t \le n \\
\vec{0}, & t > n
\end{cases}
\tag{2}
\end{equation}

The video frames are taken as input one by one at encoding time, and x_t is set to the zero vector at decoding time. Then, the first (upper) LSTM layer can be formulated as

\begin{equation}
h_t^{(1)} = \mathrm{LSTM}^{(1)}(x_t, h_{t-1}^{(1)}) \tag{3}
\end{equation}

where h_t^{(1)} is the hidden state of the first LSTM layer, LSTM^{(1)}, at time step t.

The input to the second (lower) LSTM layer is the concatenation of the word generated at the previous time step, w_{t-1}, and the hidden state of the first LSTM layer:

\begin{equation}
h_t^{(2)} = \mathrm{LSTM}^{(2)}([w_{t-1}; h_t^{(1)}], h_{t-1}^{(2)}) \tag{4}
\end{equation}

where h_t^{(2)} is the hidden state of the second LSTM layer, LSTM^{(2)}, at time step t.

At encoding time, w_{t-1} is set to the zero vector since no word has been generated yet. The distribution over all the words at time step t can be computed by

\begin{equation}
p(w_t \mid w_1, \ldots, w_{t-1}, V) = \mathrm{softmax}(W_s h_t^{(2)} + b_s) \tag{5}
\end{equation}

As in Venugopalan et al. (2015a), we have x_t = 0 at decoding time and w_{t-1} = 0 at encoding time in order to use a single LSTM (per layer) that learns both encoding and decoding, so the weights can be shared.
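As a rough illustration of how Eqs. (2)-(5) fit together in the non-attention model, the following sketch runs the two-layer encoder-decoder with greedy decoding. It reuses the lstm_step sketch above; the `layers` container, the helpers embed_frame and embed_word, and greedy (rather than beam) decoding are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def generate_description(frames, layers, embed_frame, embed_word,
                         W_s, b_s, vocab, max_dec_steps=20):
    """Two-layer LSTM encoder-decoder of Section 2.1 (greedy decoding sketch).

    `layers` holds the parameters of LSTM^(1) and LSTM^(2) plus the relevant
    dimensions; `embed_frame` and `embed_word` are assumed embedding helpers.
    """
    h1 = c1 = np.zeros(layers["dim1"])
    h2 = c2 = np.zeros(layers["dim2"])
    zero_word = np.zeros(layers["word_dim"])

    # Encoding: frames go in one by one; the word input is the zero vector.
    for v_t in frames:
        h1, c1 = lstm_step(embed_frame(v_t), h1, c1, layers["lstm1"])
        h2, c2 = lstm_step(np.concatenate([zero_word, h1]), h2, c2,
                           layers["lstm2"])

    # Decoding: the frame input becomes the zero vector; the previous word
    # is fed back, and Eq. (5) gives the distribution over the vocabulary.
    zero_frame = np.zeros(layers["frame_dim"])
    word, sentence = "<BOS>", []
    for _ in range(max_dec_steps):
        h1, c1 = lstm_step(zero_frame, h1, c1, layers["lstm1"])
        h2, c2 = lstm_step(np.concatenate([embed_word(word), h1]), h2, c2,
                           layers["lstm2"])
        logits = W_s @ h2 + b_s
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()                  # softmax over the vocabulary
        word = vocab[int(np.argmax(probs))]   # greedy choice (assumption)
        if word == "<EOS>":
            break
        sentence.append(word)
    return " ".join(sentence)
```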

2.2 Semantic attention model

Given a set of visual concepts of the video S = {s_1, s_2, ..., s_k}, where each s_i is represented by a word vector in the same space as the word input, the second-layer LSTM at decoding time can be formulated as

\begin{equation}
h_t^{(2)} = \mathrm{LSTM}^{(2)}([w_{t-1}; c_t; h_t^{(1)}], h_{t-1}^{(2)}) \tag{6}
\end{equation}

where the context vector c_t at time step t in the decoding stage is the weighted sum of the visual concepts:

\begin{equation}
c_t = \sum_{i=1}^{k} a_t(i)\, s_i \tag{7}
\end{equation}

The weight a_t(i) is computed at every time step t by

\begin{equation}
a_t(i) = \frac{\exp\big(\mathrm{score}(h_{t-1}^{(2)}, s_i)\big)}{\sum_{j=1}^{k} \exp\big(\mathrm{score}(h_{t-1}^{(2)}, s_j)\big)} \tag{8}
\end{equation}

where score(h_{t-1}^{(2)}, s_i) is the score function used to calculate the alignment weight between each visual concept s_i and the hidden state h_{t-1}^{(2)}:

\begin{equation}
\mathrm{score}(h_{t-1}^{(2)}, s_i) = v_a^{\top} \tanh\big(W_a [h_{t-1}^{(2)}; s_i]\big) \tag{9}
\end{equation}

The parameters W_a and v_a of the score function are jointly learned during training.
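Equations (7)-(9) amount to scoring each concept against the previous decoder state, normalizing the scores with a softmax, and taking the weighted sum of the concept vectors. A minimal NumPy sketch follows; the attention dimension of 256 and the random values are illustrative assumptions, not settings from the paper.

```python
import numpy as np

def semantic_attention(h2_prev, concepts, W_a, v_a):
    """Attention weights a_t(i) (Eq. 8) and context vector c_t (Eq. 7).

    h2_prev  : previous hidden state of the second LSTM layer, shape (d2,)
    concepts : word vectors of the k visual concepts, shape (k, d_w)
    W_a, v_a : score-function parameters of Eq. (9)
    """
    # Eq. (9): score(h, s_i) = v_a^T tanh(W_a [h; s_i])
    scores = np.array([v_a @ np.tanh(W_a @ np.concatenate([h2_prev, s_i]))
                       for s_i in concepts])
    # Eq. (8): softmax over the k concepts
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    # Eq. (7): weighted sum of the concept vectors
    context = weights @ concepts
    return weights, context

# Toy usage with random values (shapes only; not trained parameters).
rng = np.random.default_rng(1)
d2, d_w, k = 1000, 1000, 20
W_a = rng.standard_normal((256, d2 + d_w)) * 0.01
v_a = rng.standard_normal(256) * 0.01
weights, c_t = semantic_attention(rng.standard_normal(d2),
                                  rng.standard_normal((k, d_w)), W_a, v_a)
```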

3 Experiment

3.1 Dataset and pre-processing

We use the Microsoft Research Video Description Corpus (MSVD) (Chen and Dolan, 2011), which is a set of 1,970 YouTube clips with


approximately 40 captions per clip. We split the dataset into train/validation/test sets following Venugopalan et al. (2015b).

Table 1: Scores of video description generation results on the MSVD dataset.

Model                                            BLEU    METEOR  CIDEr   ROUGE-L
Results reported by Venugopalan et al. (2015a):
  Mean pooling (VGG16)                           -       0.277   -       -
  Sequence to sequence (VGG16)                   -       0.292   -       -
  Sequence to sequence (VGG16) + Flow (AlexNet)  -       0.298   -       -
Our system (VGG16) - non-attention               0.393   0.308   0.580   0.666
Our system (VGG16) - semantic attention          0.386   0.313   0.589   0.665

We downsample the video clips by selecting every 8th frame and resize the frames to 224x224. Then, we extract features for each frame using a pre-trained image classification model provided in the Caffe Model Zoo (Jia et al., 2014). In this work, we use the 4096-dimensional fc7 layer of the VGG16 model (Simonyan and Zisserman, 2014) as frame features and embed them into 512-dimensional embeddings.
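A minimal sketch of this pre-processing step is shown below; extract_fc7 is a hypothetical placeholder for the pre-trained VGG16 forward pass (the paper uses a Caffe model, whose exact invocation is not reproduced here), and W_embed/b_embed stand for the learned 4096-to-512 embedding.

```python
import numpy as np
from PIL import Image

def sample_and_resize(frame_paths, step=8, size=(224, 224)):
    """Keep every 8th frame and resize to 224x224, as in the pre-processing."""
    frames = []
    for path in frame_paths[::step]:
        img = Image.open(path).convert("RGB").resize(size)
        frames.append(np.asarray(img, dtype=np.float32))
    return np.stack(frames)

def embed_frames(frames, extract_fc7, W_embed, b_embed):
    """Hypothetical helper: fc7 gives 4096-d features, projected to 512-d."""
    feats = np.stack([extract_fc7(f) for f in frames])   # shape (n, 4096)
    return feats @ W_embed.T + b_embed                   # shape (n, 512)
```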

For text input, we represent words with the pre-trained GloVe word embeddings proposed by Pennington et al. (2014). We map the 300-dimensional GloVe word vectors into 1000-dimensional vectors. The visual concepts are treated in the same way as text input.
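For illustration, loading GloVe vectors and projecting them into the model space might look as follows; W_proj stands for the learned 300-to-1000 mapping, and the zero-vector fallback for unknown words is an assumption. Such a function could also play the role of the word-embedding helper assumed in the earlier decoding sketch.

```python
import numpy as np

def load_glove(path, dim=300):
    """Read GloVe vectors from the standard whitespace-separated text format."""
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            vectors[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return vectors

def embed_word(word, glove, W_proj):
    """Project a 300-d GloVe vector into the 1000-d model space."""
    v = glove.get(word, np.zeros(300, dtype=np.float32))  # zero fallback (assumed)
    return W_proj @ v                                     # shape (1000,)
```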

3.2 Experiment setting

In order to enable batch training, we constrain the numbers of encoding and decoding time steps to 60 and 20, respectively. We use the Adam optimizer with a learning rate of 0.0001 and a mini-batch size of 200. The LSTM hidden layer size is set to 1,000. To avoid overfitting, we apply dropout with a ratio of 0.3 at the frame input layer. All the parameters are jointly learned at training time.
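A minimal sketch of fitting each video into the fixed number of encoding steps is given below; padding with zero vectors (the same zero input used at decoding time) and truncating longer sequences is the assumed convention.

```python
import numpy as np

def pad_or_truncate(frame_embeds, max_enc_steps=60, feat_dim=512):
    """Fit a video's frame-embedding sequence into exactly max_enc_steps steps."""
    out = np.zeros((max_enc_steps, feat_dim), dtype=np.float32)  # zero padding
    n = min(len(frame_embeds), max_enc_steps)
    out[:n] = frame_embeds[:n]                                   # truncate if longer
    return out
```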

3.3 Visual concept detection

We use the pre-trained model provided by Fang et al. (2015) to detect visual concepts from every frame of the downsampled videos. The visual concepts include actions, objects, attributes of objects, and locations.

The detected visual concepts of all frames of a video are combined into one collection. For each video, we select 20 concepts from the collection and treat them equally, ignoring the scores provided by the concept detector, as shown in the boxes in Figure 2.
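A simple way to realize this merging step is sketched below; ranking the pooled concepts by their maximum per-frame detector score is an assumption, since the paper only states that 20 concepts are selected and then treated equally.

```python
from collections import defaultdict

def select_concepts(per_frame_detections, k=20):
    """Pool visual concepts detected in all frames and keep k distinct words.

    `per_frame_detections` is a list (one entry per frame) of (word, score)
    pairs produced by the concept detector.
    """
    best = defaultdict(float)
    for detections in per_frame_detections:
        for word, score in detections:
            best[word] = max(best[word], score)   # keep each word's best score
    ranked = sorted(best, key=best.get, reverse=True)
    return ranked[:k]                             # scores are discarded afterwards
```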

3.4 Evaluation

We performed a quantitative analysis of the results based on four evaluation metrics: BLEU, METEOR, CIDEr, and ROUGE-L.

We implemented our system using Chainer (Tokui et al., 2015) and used the caption evaluation package provided by the Microsoft COCO Image Captioning Challenge (Chen et al., 2015).
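For reference, scoring with the commonly distributed coco-caption package might look as follows; the module paths and the dict-of-captions input format follow that package's usual layout, but should be checked against the installed version.

```python
# Assumed module layout of the MS COCO caption evaluation code (coco-caption).
from pycocoevalcap.bleu.bleu import Bleu
from pycocoevalcap.meteor.meteor import Meteor
from pycocoevalcap.cider.cider import Cider
from pycocoevalcap.rouge.rouge import Rouge

def evaluate(references, hypotheses):
    """references/hypotheses: dicts mapping video id -> list of caption strings."""
    results = {}
    for name, scorer in [("BLEU", Bleu(4)), ("METEOR", Meteor()),
                         ("CIDEr", Cider()), ("ROUGE-L", Rouge())]:
        score, _ = scorer.compute_score(references, hypotheses)
        results[name] = score
    return results
```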

3.5 Experimental results

Table 1 shows the experimental results of our proposed system. The METEOR and CIDEr scores slightly increased, while the BLEU and ROUGE-L scores dropped when using semantic attention.

Even though the semantic attention mechanism does not clearly improve the scores on the test set, we can see some promising results in Figure 2. The relevant visual concepts were attended to, and the alignment weights changed appropriately as each word of the sentences was generated.

In the bottom-right example, even though the model mistakenly focused on the visual concept 'woman', the object mis-mentioned by the non-attention model (bicycle) was correctly identified (bike) by the model with semantic attention.

We can also see that many irrelevant visual concepts were detected. This is because we used a visual concept detector that was trained on other datasets, so it did not perform well on MSVD video frames. We believe that re-training the concept detector on our dataset would lead to better results.

4 Conclusion

In this paper, we have proposed an LSTM-based sequence-to-sequence model with semantic attention for video description generation. The scores do not show an obvious improvement; however, the model is able to learn to focus on external fine-grained information about videos and improve the quality of video descriptions.


Figure 2: Examples of generated descriptions and the alignment weights of visual concepts as each word of the sentences was generated. The values are clipped at 0.1 for easier reading. The four panels show:

Panel: Non-attention: "a man is riding a car"; Attention: "a woman is riding a boat"; concepts plotted: water, sitting, woman, boat; detected visual concepts: boat, water, sitting, cat, plate, pizza, table, man, surfboard, snow, wave, woman, girl, umbrella, baseball, bathroom, people, laptop, cake.

Panel: Non-attention: "a man is drinking"; Attention: "a girl is doing makeup"; concepts plotted: woman, brushing, hair, mirror; detected visual concepts: scissors, man, bathroom, woman, brushing, holding, person, teeth, toothbrush, hair, cat, her, his, up, mirror, baseball, close, mouth, face.

Panel: Non-attention: "a dog is playing with a dog"; Attention: "a boy is playing with a dog"; concepts plotted: dog, boy; detected visual concepts: man, bathroom, cat, dog, boy, toilet, baby, standing, kitchen, young, his, little, riding, brown, child, elephant, bed, sitting, holding, plate.

Panel (bottom right): Non-attention: "a man is riding a bicycle"; Attention: "a man is riding a bike"; concepts plotted: man, riding, bike, woman; detected visual concepts: motorcycle, flowers, man, riding, field, green, horse, people, bike, motorcycles, baseball, ball, player, woman, holding, bat, playing, person, down.

Acknowledgement

This paper is based on results obtained from a project commissioned by the New Energy and Industrial Technology Development Organization (NEDO).

References

David L. Chen and William B. Dolan. 2011. Collecting highly parallel data for paraphrase evaluation. In ACL 2011.

Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Dollar, and C. Lawrence Zitnick. 2015. Microsoft COCO captions: Data collection and evaluation server. arXiv:1504.00325.

Hao Fang, Saurabh Gupta, Forrest N. Iandola, Rupesh Kumar Srivastava, Li Deng, Piotr Dollar, Jianfeng Gao, Xiaodong He, Margaret Mitchell, John C. Platt, C. Lawrence Zitnick, and Geoffrey Zweig. 2015. From captions to visual concepts and back. In CVPR 2015.

Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio Guadarrama, and Trevor Darrell. 2014. Caffe: Convolutional architecture for fast feature embedding. arXiv:1408.5093.

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In EMNLP 2014.

Karen Simonyan and Andrew Zisserman. 2014. Very deep convolutional networks for large-scale image recognition. arXiv:1409.1556.

Seiya Tokui, Kenta Oono, Shohei Hido, and Justin Clayton. 2015. Chainer: A next-generation open source framework for deep learning. In Proceedings of the Workshop on Machine Learning Systems (LearningSys) at NIPS 2015.

Subhashini Venugopalan, Marcus Rohrbach, Jeff Donahue, Raymond Mooney, Trevor Darrell, and Kate Saenko. 2015a. Sequence to sequence – video to text. In ICCV 2015.

Subhashini Venugopalan, Huijuan Xu, Jeff Donahue, Marcus Rohrbach, Raymond Mooney, and Kate Saenko. 2015b. Translating videos to natural language using deep recurrent neural networks. In NAACL-HLT 2015.

Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. 2014. Show and tell: A neural image caption generator. arXiv:1411.4555.

Quanzeng You, Hailin Jin, Zhaowen Wang, Chen Fang, and Jiebo Luo. 2016. Image captioning with semantic attention. In CVPR 2016.
