Improving Video Semantic Segmentation using Deep Neural Networks and Optical Flow

Document Type: Persian Original Article

Authors

1 MSc Student, Faculty of Electrical and Computer Engineering, Malek-e-Ashtar University of Technology,

2 Computer and Artificial Intelligence Center, Faculty of Computer and Electronics, Malek Ashtar University of Technology, Tehran, Iran

Abstract

Video semantic segmentation is used in many applications, such as autonomous driving, navigation systems, and virtual reality. Semantic segmentation of still images has progressed significantly in recent years. However, consecutive video frames must be processed at high speed, with low latency, and in real time, so applying an image segmentation method to each frame independently is inefficient. Real-time semantic segmentation of video frames with adequate accuracy therefore remains a challenging problem. To address this challenge, we introduce a video semantic segmentation framework that reuses the semantic segmentation of the previous frame to increase both speed and accuracy. To this end, the framework combines optical flow (the apparent motion between consecutive frames) with a convolutional gated recurrent unit (ConvGRU). One input to the GRU is an estimate of the current frame's semantic segmentation, produced by a pre-trained convolutional neural network; the other is the previous frame's semantic segmentation warped along the optical flow. The proposed method is competitive in both accuracy and speed. It achieves 83.1% mIoU on Cityscapes and 79.8% mIoU on CamVid, two challenging video semantic segmentation datasets. On a Tesla P4 GPU, segmentation runs at 34 fps on Cityscapes and 36.3 fps on CamVid.
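The two GRU inputs described in the abstract can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation. It rests on several assumptions that are not in the source: the function names (`warp_backward`, `gru_fuse_pixel`, `fuse_frame`, `labels`) are hypothetical, the flow is assumed to point from each pixel of the current frame back to its source in the previous frame, sampling is nearest-neighbour with border clamping, and the learned convolution kernels of a real ConvGRU are replaced by scalar per-pixel weights passed in a dictionary.

```python
import math


def warp_backward(prev_scores, flow):
    """Warp per-pixel class scores from frame t-1 into frame t.

    prev_scores: H x W grid of class-score lists for frame t-1.
    flow: H x W grid of (dx, dy); for each pixel of frame t, the offset
          to its source pixel in frame t-1 (assumed convention).
    """
    h, w = len(prev_scores), len(prev_scores[0])
    warped = []
    for y in range(h):
        row = []
        for x in range(w):
            dx, dy = flow[y][x]
            # Nearest-neighbour lookup, clamped to the image border.
            sx = min(max(int(round(x + dx)), 0), w - 1)
            sy = min(max(int(round(y + dy)), 0), h - 1)
            row.append(list(prev_scores[sy][sx]))
        warped.append(row)
    return warped


def sigmoid(v):
    # Numerically stable logistic function.
    if v >= 0:
        return 1.0 / (1.0 + math.exp(-v))
    e = math.exp(v)
    return e / (1.0 + e)


def gru_fuse_pixel(x, h, p):
    """GRU update for one pixel.

    x: current-frame class scores (estimate from the per-frame network).
    h: warped previous-frame class scores (the recurrent state).
    p: scalar weights standing in for learned convolution kernels.
    """
    fused = []
    for xi, hi in zip(x, h):
        z = sigmoid(p["wz"] * xi + p["uz"] * hi + p["bz"])  # update gate
        r = sigmoid(p["wr"] * xi + p["ur"] * hi + p["br"])  # reset gate
        cand = math.tanh(p["wh"] * xi + p["uh"] * (r * hi) + p["bh"])
        fused.append((1.0 - z) * hi + z * cand)
    return fused


def fuse_frame(cur_scores, prev_scores, flow, p):
    """Warp the previous scores along the flow, then fuse pixel by pixel."""
    warped = warp_backward(prev_scores, flow)
    return [
        [gru_fuse_pixel(x, h, p) for x, h in zip(cur_row, warped_row)]
        for cur_row, warped_row in zip(cur_scores, warped)
    ]


def labels(scores):
    """Pick the highest-scoring class index at every pixel."""
    return [
        [max(range(len(px)), key=lambda c: px[c]) for px in row]
        for row in scores
    ]
```

In this sketch the update gate `z` decides, per pixel and per class, how much of the warped previous result is kept and how much is replaced by a candidate computed from the current-frame estimate. A strongly negative `bz` makes the output follow the warped previous segmentation, and a strongly positive `bz` makes it follow the current-frame estimate. In a trained ConvGRU these gates would instead come from learned spatial convolutions, and the flow would come from an optical flow estimator.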

Keywords

