Abstract
This paper aims to caption daily life, i.e., to create a textual description of people's activities and interactions with objects in their homes. Addressing this problem requires novel methods beyond traditional video captioning, as most people would have privacy concerns about deploying cameras throughout their homes. We introduce RF-Diary, a new model for captioning daily life by analyzing privacy-preserving radio signals in the home, together with the home's floormap. RF-Diary can further observe and caption people's lives through walls and occlusions and in dark settings. In designing RF-Diary, we exploit the ability of radio signals to capture people's 3D dynamics, and we use the floormap to help the model learn people's interactions with objects. We also use a multi-modal feature alignment training scheme that leverages existing video-based captioning datasets to improve the performance of our radio-based captioning model. Extensive experimental results demonstrate that RF-Diary generates accurate captions under visible conditions. It also sustains its good performance in dark or occluded settings, where video-based captioning approaches fail to generate meaningful captions. (For more information, please visit our project webpage: http://rf-diary.csail.mit.edu.)
L. Fan and T. Li contributed equally.
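To make the multi-modal feature alignment idea mentioned in the abstract concrete, below is a minimal PyTorch-style sketch of one training step: the radio branch is trained to caption while its features are pulled toward those of a frozen video encoder on paired data. The module names, tensor shapes, losses, and loss weighting are illustrative assumptions, not the authors' actual RF-Diary architecture or objective.

```python
# A minimal sketch of multi-modal feature-alignment training (assumed setup;
# encoders, decoder, shapes, and weights are placeholders, not the paper's model).
import torch
import torch.nn as nn
import torch.nn.functional as F

rf_encoder = nn.Sequential(nn.Flatten(), nn.Linear(4096, 512))     # hypothetical RF + floormap encoder
video_encoder = nn.Sequential(nn.Flatten(), nn.Linear(8192, 512))  # hypothetical pretrained video encoder
decoder = nn.Linear(512, 10000)                                     # stand-in caption decoder (vocab size 10000)

optimizer = torch.optim.Adam(
    list(rf_encoder.parameters()) + list(decoder.parameters()), lr=1e-4
)

def training_step(rf_input, video_input, caption_tokens):
    """One step: caption from RF features while aligning them with video features."""
    rf_feat = rf_encoder(rf_input)
    with torch.no_grad():                          # video branch is frozen; it only provides targets
        video_feat = video_encoder(video_input)
    logits = decoder(rf_feat)                      # predict caption tokens from RF features
    caption_loss = F.cross_entropy(logits, caption_tokens)
    align_loss = F.mse_loss(rf_feat, video_feat)   # feature-alignment term
    loss = caption_loss + 0.1 * align_loss         # the 0.1 weighting is an arbitrary choice here
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Example call with random tensors of the assumed shapes.
loss = training_step(torch.randn(8, 4096), torch.randn(8, 8192), torch.randint(0, 10000, (8,)))
```

In this sketch the video encoder only supplies alignment targets during training; at deployment time captions would be generated from the radio branch alone.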