In-Home Daily-Life Captioning Using Radio Signals | SpringerLink

In-Home Daily-Life Captioning Using Radio Signals

  • Conference paper
  • First Online:
Computer Vision – ECCV 2020 (ECCV 2020)

Part of the book series: Lecture Notes in Computer Science (LNIP, volume 12347)

Included in the following conference series: European Conference on Computer Vision (ECCV)

Abstract

This paper aims to caption daily life, i.e., to create a textual description of people's activities and interactions with objects in their homes. Addressing this problem requires novel methods beyond traditional video captioning, as most people would have privacy concerns about deploying cameras throughout their homes. We introduce RF-Diary, a new model for captioning daily life by analyzing privacy-preserving radio signals in the home, together with the home's floormap. RF-Diary can further observe and caption people's lives through walls and occlusions and in dark settings. In designing RF-Diary, we exploit the ability of radio signals to capture people's 3D dynamics, and we use the floormap to help the model learn people's interactions with objects. We also use a multi-modal feature alignment training scheme that leverages existing video-based captioning datasets to improve the performance of our radio-based captioning model. Extensive experimental results demonstrate that RF-Diary generates accurate captions under visible conditions. It also sustains this performance in dark or occluded settings, where video-based captioning approaches fail to generate meaningful captions. (For more information, please visit our project webpage: http://rf-diary.csail.mit.edu)

L. Fan and T. Li contributed equally.
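The multi-modal feature alignment scheme mentioned in the abstract is the part of RF-Diary most amenable to a short sketch. The PyTorch example below shows one plausible reading of it: an RF encoder is trained so its features land in the same space as those of a frozen, pretrained video encoder, so that a shared caption decoder trained on existing video-captioning datasets can operate on radio input. All module names, tensor shapes, and the choice of an L2 alignment loss are illustrative assumptions, not the authors' implementation.

```python
# Minimal PyTorch sketch of one plausible form of multi-modal feature
# alignment. Module names, tensor shapes, and the L2 alignment loss are
# illustrative assumptions, not the paper's actual architecture.
import torch
import torch.nn as nn

FEAT_DIM = 256     # assumed shared feature dimension
VOCAB_SIZE = 1000  # assumed caption vocabulary size

class ToyEncoder(nn.Module):
    """Stand-in for a clip encoder (RF heatmaps or video) -> feature vector."""
    def __init__(self, in_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Flatten(), nn.Linear(in_dim, FEAT_DIM), nn.ReLU())

    def forward(self, clip):
        return self.net(clip)

class CaptionDecoder(nn.Module):
    """Toy language decoder shared across both modalities."""
    def __init__(self):
        super().__init__()
        self.rnn = nn.GRU(FEAT_DIM, FEAT_DIM, batch_first=True)
        self.out = nn.Linear(FEAT_DIM, VOCAB_SIZE)

    def forward(self, feat, seq_len):
        # Condition the decoder by feeding the clip feature at every step.
        steps = feat.unsqueeze(1).expand(-1, seq_len, -1).contiguous()
        hidden, _ = self.rnn(steps)
        return self.out(hidden)  # (batch, seq_len, vocab)

# Synthetic paired batch: an RF clip, the corresponding video clip, and
# ground-truth caption tokens (all shapes are made up for illustration).
rf_clip = torch.randn(4, 10, 32, 32)        # 10 RF frames of 32x32
video_clip = torch.randn(4, 10, 3, 16, 16)  # 10 RGB frames of 16x16
tokens = torch.randint(0, VOCAB_SIZE, (4, 12))

rf_enc = ToyEncoder(10 * 32 * 32)
vid_enc = ToyEncoder(10 * 3 * 16 * 16)
vid_enc.requires_grad_(False)  # video encoder treated as pretrained/frozen
decoder = CaptionDecoder()

opt = torch.optim.Adam(list(rf_enc.parameters()) + list(decoder.parameters()))
xent = nn.CrossEntropyLoss()

rf_feat = rf_enc(rf_clip)
with torch.no_grad():
    vid_feat = vid_enc(video_clip)

# Alignment loss pulls RF features toward the video feature space, so a
# decoder trained on video-caption data transfers to radio input.
align_loss = ((rf_feat - vid_feat) ** 2).mean()
logits = decoder(rf_feat, tokens.size(1))
caption_loss = xent(logits.flatten(0, 1), tokens.flatten())

opt.zero_grad()
(align_loss + caption_loss).backward()
opt.step()
```

The design point the sketch illustrates is that only the alignment loss couples the two modalities: once RF features are close to video features, captioning supervision available for video transfers to the radio model, which is how, per the abstract, existing video-based captioning datasets can improve the radio-based captioner.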



Author information


Corresponding author

Correspondence to Tianhong Li.


Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (PDF 5860 KB)


Copyright information

© 2020 Springer Nature Switzerland AG

About this paper


Cite this paper

Fan, L., Li, T., Yuan, Y., Katabi, D. (2020). In-Home Daily-Life Captioning Using Radio Signals. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.M. (eds) Computer Vision – ECCV 2020. ECCV 2020. Lecture Notes in Computer Science, vol. 12347. Springer, Cham. https://doi.org/10.1007/978-3-030-58536-5_7


  • DOI: https://doi.org/10.1007/978-3-030-58536-5_7

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-58535-8

  • Online ISBN: 978-3-030-58536-5

  • eBook Packages: Computer Science, Computer Science (R0)
