In-Home Daily-Life Captioning Using Radio Signals | SpringerLink

In-Home Daily-Life Captioning Using Radio Signals

  • Conference paper
  • First Online:
Computer Vision – ECCV 2020 (ECCV 2020)

Part of the book series: Lecture Notes in Computer Science (LNIP, volume 12347)

Included in the following conference series: European Conference on Computer Vision (ECCV)

Abstract

This paper aims to caption daily life, i.e., to create a textual description of people's activities and interactions with objects in their homes. Addressing this problem requires novel methods beyond traditional video captioning, as most people would have privacy concerns about deploying cameras throughout their homes. We introduce RF-Diary, a new model for captioning daily life by analyzing privacy-preserving radio signals in the home, together with the home's floormap. RF-Diary can further observe and caption people's lives through walls and occlusions and in dark settings. In designing RF-Diary, we exploit the ability of radio signals to capture people's 3D dynamics, and we use the floormap to help the model learn people's interactions with objects. We also use a multi-modal feature alignment training scheme that leverages existing video-based captioning datasets to improve the performance of our radio-based captioning model. Extensive experimental results demonstrate that RF-Diary generates accurate captions under visible conditions. It also sustains this performance in dark or occluded settings, where video-based captioning approaches fail to generate meaningful captions. (For more information, please visit our project webpage: http://rf-diary.csail.mit.edu)

L. Fan and T. Li contributed equally.
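The multi-modal feature alignment scheme mentioned in the abstract is the part of RF-Diary most amenable to a short sketch. The PyTorch example below shows one plausible reading of it: an RF encoder is trained so its features land in the same space as those of a frozen, pretrained video encoder, so that a shared caption decoder trained on existing video-captioning datasets can operate on radio input. All module names, tensor shapes, and the choice of an L2 alignment loss are illustrative assumptions, not the authors' implementation.

```python
# Minimal PyTorch sketch of one plausible form of multi-modal feature
# alignment. Module names, tensor shapes, and the L2 alignment loss are
# illustrative assumptions, not the paper's actual architecture.
import torch
import torch.nn as nn

FEAT_DIM = 256     # assumed shared feature dimension
VOCAB_SIZE = 1000  # assumed caption vocabulary size

class ToyEncoder(nn.Module):
    """Stand-in for a clip encoder (RF heatmaps or video) -> feature vector."""
    def __init__(self, in_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Flatten(), nn.Linear(in_dim, FEAT_DIM), nn.ReLU())

    def forward(self, clip):
        return self.net(clip)

class CaptionDecoder(nn.Module):
    """Toy language decoder shared across both modalities."""
    def __init__(self):
        super().__init__()
        self.rnn = nn.GRU(FEAT_DIM, FEAT_DIM, batch_first=True)
        self.out = nn.Linear(FEAT_DIM, VOCAB_SIZE)

    def forward(self, feat, seq_len):
        # Condition the decoder by feeding the clip feature at every step.
        steps = feat.unsqueeze(1).expand(-1, seq_len, -1).contiguous()
        hidden, _ = self.rnn(steps)
        return self.out(hidden)  # (batch, seq_len, vocab)

# Synthetic paired batch: an RF clip, the corresponding video clip, and
# ground-truth caption tokens (all shapes are made up for illustration).
rf_clip = torch.randn(4, 10, 32, 32)        # 10 RF frames of 32x32
video_clip = torch.randn(4, 10, 3, 16, 16)  # 10 RGB frames of 16x16
tokens = torch.randint(0, VOCAB_SIZE, (4, 12))

rf_enc = ToyEncoder(10 * 32 * 32)
vid_enc = ToyEncoder(10 * 3 * 16 * 16)
vid_enc.requires_grad_(False)  # video encoder treated as pretrained/frozen
decoder = CaptionDecoder()

opt = torch.optim.Adam(list(rf_enc.parameters()) + list(decoder.parameters()))
xent = nn.CrossEntropyLoss()

rf_feat = rf_enc(rf_clip)
with torch.no_grad():
    vid_feat = vid_enc(video_clip)

# Alignment loss pulls RF features toward the video feature space, so a
# decoder trained on video-caption data transfers to radio input.
align_loss = ((rf_feat - vid_feat) ** 2).mean()
logits = decoder(rf_feat, tokens.size(1))
caption_loss = xent(logits.flatten(0, 1), tokens.flatten())

opt.zero_grad()
(align_loss + caption_loss).backward()
opt.step()
```

The design point the sketch illustrates is that only the alignment loss couples the two modalities: once RF features are close to video features, captioning supervision available for video transfers to the radio model, which is how, per the abstract, existing video-based captioning datasets can improve the radio-based captioner.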



Author information


Corresponding author

Correspondence to Tianhong Li.


Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (PDF 5860 KB)


Copyright information

© 2020 Springer Nature Switzerland AG

About this paper


Cite this paper

Fan, L., Li, T., Yuan, Y., Katabi, D. (2020). In-Home Daily-Life Captioning Using Radio Signals. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.M. (eds) Computer Vision – ECCV 2020. ECCV 2020. Lecture Notes in Computer Science, vol. 12347. Springer, Cham. https://doi.org/10.1007/978-3-030-58536-5_7


  • DOI: https://doi.org/10.1007/978-3-030-58536-5_7

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-58535-8

  • Online ISBN: 978-3-030-58536-5

  • eBook Packages: Computer Science, Computer Science (R0)
