Video Caption Generation (Attention-based)

Suppose the video has only 4 frames:

But this sometimes goes wrong. For example:

At both the second and the fourth time step, the model focuses on the second frame and generates the word "woman" each time.

So the output might end up as something like: a woman and a woman are doing a woman

Because it attends to the same place several times, it never looks at the frame that shows cooking.

A good attention should cover every frame of the input: each frame gets attended to at some point, but no single frame gets too much attention.

To solve this problem, we can put a regularization term on the attention, forcing the attention weights into the pattern we want.
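
The notes do not give a formula for this regularization, so the following is only a minimal sketch of one possible form. It assumes a penalty that sums, for each frame, the attention it receives over all output time steps and penalizes the squared distance from a target value. The names `coverage_penalty` and `tau` are hypothetical, and so are the two attention matrices, which are made up to mirror the 4-frame example above.

```python
def coverage_penalty(attention, tau=1.0):
    """Penalize frames that receive too much or too little total attention.

    attention[t][i] is the weight placed on frame i at output time step t
    (each row is assumed to sum to 1). tau is an assumed target for the
    total attention each frame should receive across all time steps.
    """
    num_frames = len(attention[0])
    penalty = 0.0
    for i in range(num_frames):
        # Total attention frame i received over all output time steps.
        total = sum(row[i] for row in attention)
        penalty += (tau - total) ** 2
    return penalty


# Hypothetical attention for 4 output steps over 4 frames.
# Steps 2 and 4 both attend to frame 2; frame 4 is never attended.
repeated = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
]

# Every frame is attended exactly once.
balanced = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]

print(coverage_penalty(repeated))  # larger: frame 2 over-attended, frame 4 ignored
print(coverage_penalty(balanced))  # zero: every frame covered once
```

In training, a term like this would be added to the captioning loss with some weight, so that the model is pushed away from attending to the same frame repeatedly.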