StreetAware: A High-Resolution Synchronized Multimodal Urban Scene Dataset
Abstract
1. Introduction
- Multimodal: video, audio, LiDAR;
- Multi-angular: four perspectives;
- High-resolution video: 2592 × 1944 pixels;
- Synchronization across videos and audio streams;
- Fully anonymized: human faces blurred.
- The StreetAware dataset, which contains multiple data modalities and multiple synchronized high-resolution video viewpoints in a single dataset;
- A new method to synchronize high-sample rate audio streams;
- A demonstration of use cases that would not be possible without the combination of features contained in the dataset;
- A description of real-world implementation and limitations of REIP sensors.
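The outline above highlights a new method for synchronizing high-sample-rate audio streams. The paper's actual method is not reproduced in this outline; as a hedged illustration of the general idea only, the sample offset between two 48 kHz streams can be recovered from the peak of their cross-correlation (the function name and test signal below are illustrative, not from the paper):

```python
import numpy as np

def estimate_offset(ref: np.ndarray, sig: np.ndarray) -> int:
    """Estimate how many samples `sig` lags behind `ref` by locating
    the peak of their full cross-correlation."""
    corr = np.correlate(sig, ref, mode="full")
    # Shift the peak index so that a result of 0 means "aligned".
    return int(np.argmax(corr)) - (len(ref) - 1)

# Example: 0.1 s of white noise delayed by 480 samples (10 ms at 48 kHz).
rng = np.random.default_rng(0)
ref = rng.standard_normal(4800)
sig = np.concatenate([np.zeros(480), ref])[: len(ref)]
print(estimate_offset(ref, sig))  # 480
```

A brute-force cross-correlation like this is O(n²); for long recordings an FFT-based correlation would be used instead, and the dataset itself reports a tighter, hardware-assisted tolerance of 1 sample.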
2. Related Work
2.1. Datasets
2.2. Deep Learning Applications
3. The StreetAware Dataset
3.1. REIP Sensors
3.2. Data Collection
- Commodore Barry Park. This intersection is adjacent to a public school. Its low-to-medium traffic volume makes it a relatively uncrowded intersection.
- Chase Center. This intersection is adjacent to the Chase Bank office building within Brooklyn’s MetroTech Center. It is also an active pedestrian intersection.
- DUMBO. The intersection of Old Fulton Street and Front Street sits under the Brooklyn Bridge. As a tourist destination, it is the busiest of the three. Its smaller crosswalks and heavy traffic pose challenges such as occlusion and a diverse range of pedestrian types.
3.3. Data Synchronization
3.3.1. Audio
3.3.2. Video
4. Use Cases
4.1. Audio Source Localization
4.2. Audiovisual Association
4.3. Occupancy Tracking & Pedestrian Speed
5. Discussion
Limitations
6. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
- Kostić, Z.; Angus, A.; Yang, Z.; Duan, Z.; Seskar, I.; Zussman, G.; Raychaudhuri, D. Smart City Intersections: Intelligence Nodes for Future Metropolises. Computer 2022, 55, 74–85. [Google Scholar] [CrossRef]
- World Health Organization. Global Status Report on Road Safety. 2018. Available online: https://www.who.int/publications/i/item/9789241565684 (accessed on 31 January 2023).
- Sighencea, B.I.; Stanciu, R.I.; Căleanu, C.D. A Review of Deep Learning-Based Methods for Pedestrian Trajectory Prediction. Sensors 2021, 21, 7543. [Google Scholar] [CrossRef] [PubMed]
- Ballardini, A.L.; Hernandez Saz, A.; Carrasco Limeros, S.; Lorenzo, J.; Parra Alonso, I.; Hernandez Parra, N.; García Daza, I.; Sotelo, M.A. Urban Intersection Classification: A Comparative Analysis. Sensors 2021, 21, 6269. [Google Scholar] [CrossRef] [PubMed]
- Piadyk, Y.; Steers, B.; Mydlarz, C.; Salman, M.; Fuentes, M.; Khan, J.; Jiang, H.; Ozbay, K.; Bello, J.P.; Silva, C. REIP: A Reconfigurable Environmental Intelligence Platform and Software Framework for Fast Sensor Network Prototyping. Sensors 2022, 22, 3809. [Google Scholar] [CrossRef] [PubMed]
- Google LLC. Google Street View. Available online: https://www.google.com/streetview/ (accessed on 20 February 2023).
- Warburg, F.; Hauberg, S.; López-Antequera, M.; Gargallo, P.; Kuang, Y.; Civera, J. Mapillary Street-Level Sequences: A Dataset for Lifelong Place Recognition. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 2623–2632. [Google Scholar] [CrossRef]
- Miranda, F.; Hosseini, M.; Lage, M.; Doraiswamy, H.; Dove, G.; Silva, C.T. Urban Mosaic: Visual Exploration of Streetscapes Using Large-Scale Image Data. In Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems, New York, NY, USA, 20–30 April 2020; pp. 1–15. [Google Scholar] [CrossRef]
- Cartwright, M.; Cramer, J.; Méndez, A.E.M.; Wang, Y.; Wu, H.; Lostanlen, V.; Fuentes, M.; Dove, G.; Mydlarz, C.; Salamon, J.; et al. SONYC-UST-V2: An Urban Sound Tagging Dataset with Spatiotemporal Context. arXiv 2020, arXiv:2009.05188. [Google Scholar]
- Fuentes, M.; Steers, B.; Zinemanas, P.; Rocamora, M.; Bondi, L.; Wilkins, J.; Shi, Q.; Hou, Y.; Das, S.; Serra, X.; et al. Urban Sound & Sight: Dataset And Benchmark For audio–visual Urban Scene Understanding. In Proceedings of the ICASSP 2022–2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore, 23–27 May 2022; pp. 141–145. [Google Scholar] [CrossRef]
- Geiger, A.; Lenz, P.; Urtasun, R. Are we ready for autonomous driving? The KITTI vision benchmark suite. In Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, 16–21 June 2012; pp. 3354–3361. [Google Scholar] [CrossRef]
- Caesar, H.; Bankiti, V.; Lang, A.H.; Vora, S.; Liong, V.E.; Xu, Q.; Krishnan, A.; Pan, Y.; Baldan, G.; Beijbom, O. nuScenes: A multimodal dataset for autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 11621–11631. [Google Scholar]
- Sun, P.; Kretzschmar, H.; Dotiwalla, X.; Chouard, A.; Patnaik, V.; Tsui, P.; Guo, J.; Zhou, Y.; Chai, Y.; Caine, B.; et al. Scalability in Perception for Autonomous Driving: Waymo Open Dataset. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 2446–2454. [Google Scholar]
- Ben Khalifa, A.; Alouani, I.; Mahjoub, M.A.; Rivenq, A. A novel multi-view pedestrian detection database for collaborative Intelligent Transportation Systems. Future Gener. Comput. Syst. 2020, 113, 506–527. [Google Scholar] [CrossRef]
- Braun, M.; Krebs, S.; Flohr, F.; Gavrila, D.M. The EuroCity Persons Dataset: A Novel Benchmark for Object Detection. arXiv 2018, arXiv:1805.07193. [Google Scholar]
- Rasouli, A.; Kotseruba, I.; Kunic, T.; Tsotsos, J. PIE: A Large-Scale Dataset and Models for Pedestrian Intention Estimation and Trajectory Prediction. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 6261–6270. [Google Scholar] [CrossRef]
- Singh, K.K.; Fatahalian, K.; Efros, A.A. KrishnaCam: Using a longitudinal, single-person, egocentric dataset for scene understanding tasks. In Proceedings of the 2016 IEEE Winter Conference on Applications of Computer Vision (WACV), Lake Placid, NY, USA, 7–10 March 2016; pp. 1–9. [Google Scholar] [CrossRef]
- Corona, K.; Osterdahl, K.; Collins, R.; Hoogs, A. MEVA: A Large-Scale Multiview, Multimodal Video Dataset for Activity Detection. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Snowmass Village, CO, USA, 1–5 March 2020; pp. 1060–1068. [Google Scholar]
- Chakraborty, A.; Stamatescu, V.; Wong, S.C.; Wigley, G.B.; Kearney, D.A. A data set for evaluating the performance of multi-class multi-object video tracking. In Proceedings of the Automatic Target Recognition XXVII, Anaheim, CA, USA, 9–13 April 2017; SPIE: Cergy-Pontoise, France, 2017; Volume 10202, pp. 112–120. [Google Scholar]
- Cordts, M.; Omran, M.; Ramos, S.; Rehfeld, T.; Enzweiler, M.; Benenson, R.; Franke, U.; Roth, S.; Schiele, B. The Cityscapes Dataset for Semantic Urban Scene Understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 3213–3223. [Google Scholar]
- Neumann, L.; Karg, M.; Zhang, S.; Scharfenberger, C.; Piegert, E.; Mistr, S.; Prokofyeva, O.; Thiel, R.; Vedaldi, A.; Zisserman, A.; et al. NightOwls: A pedestrians at night dataset. In Proceedings of the Asian Conference on Computer Vision, Perth, WA, Australia, 2–6 December 2018; Springer: Berlin/Heidelberg, Germany, 2018; pp. 691–705. [Google Scholar]
- Dahmane, K.; Essoukri Ben Amara, N.; Duthon, P.; Bernardin, F.; Colomb, M.; Chausse, F. The Cerema pedestrian database: A specific database in adverse weather conditions to evaluate computer vision pedestrian detectors. In Proceedings of the 2016 7th International Conference on Sciences of Electronics, Technologies of Information and Telecommunications (SETIT), Hammamet, Tunisia, 18–20 December 2016; pp. 472–477. [Google Scholar] [CrossRef]
- Zhang, C.; Fan, H.; Li, W.; Mao, B.; Ding, X. Automated Detecting and Placing Road Objects from Street-level Images. Comput. Urban Sci. 2019, 1, 18. [Google Scholar] [CrossRef]
- Doiron, D.; Setton, E.; Brook, J.; Kestens, Y.; Mccormack, G.; Winters, M.; Shooshtari, M.; Azami, S.; Fuller, D. Predicting walking-to-work using street-level imagery and deep learning in seven Canadian cities. Sci. Rep. 2022, 12, 18380. [Google Scholar] [CrossRef]
- Zhao, H.; Shi, J.; Qi, X.; Wang, X.; Jia, J. Pyramid Scene Parsing Network. arXiv 2016, arXiv:1612.01105. [Google Scholar]
- Redmon, J.; Farhadi, A. YOLOv3: An Incremental Improvement. arXiv 2018, arXiv:1804.02767. [Google Scholar]
- Charitidis, P.; Moschos, S.; Pipertzis, A.; Theologou, I.J.; Michailidis, M.; Doropoulos, S.; Diou, C.; Vologiannidis, S. StreetScouting: A Deep Learning Platform for Automatic Detection and Geotagging of Urban Features from Street-Level Images. Appl. Sci. 2023, 13, 266. [Google Scholar] [CrossRef]
- Cai, Z.; Vasconcelos, N. Cascade R-CNN: Delving into High Quality Object Detection. arXiv 2017, arXiv:1712.00726. [Google Scholar]
- Deng, J.; Guo, J.; Zhou, Y.; Yu, J.; Kotsia, I.; Zafeiriou, S. RetinaFace: Single-stage Dense Face Localisation in the Wild. arXiv 2019, arXiv:1905.00641. [Google Scholar]
- Zhang, Y.; Sun, P.; Jiang, Y.; Yu, D.; Yuan, Z.; Luo, P.; Liu, W.; Wang, X. ByteTrack: Multi-Object Tracking by Associating Every Detection Box. arXiv 2021, arXiv:2110.06864. [Google Scholar]
- Xue, F.; Zhuo, G.; Huang, Z.; Fu, W.; Wu, Z.; Ang, M.H., Jr. Toward Hierarchical Self-Supervised Monocular Absolute Depth Estimation for Autonomous Driving Applications. arXiv 2020, arXiv:2004.05560. [Google Scholar]
- Chen, L.; Zhu, Y.; Papandreou, G.; Schroff, F.; Adam, H. Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation. arXiv 2018, arXiv:1802.02611. [Google Scholar]
- Sukel, M.; Rudinac, S.; Worring, M. Urban Object Detection Kit: A System for Collection and Analysis of Street-Level Imagery. In Proceedings of the 2020 International Conference on Multimedia Retrieval (ICMR '20), Dublin, Ireland, 8–11 June 2020; pp. 509–516. [Google Scholar] [CrossRef]
- Zhao, T.; Liang, X.; Tu, W.; Huang, Z.; Biljecki, F. Sensing urban soundscapes from street view imagery. Comput. Environ. Urban Syst. 2023, 99, 101915. [Google Scholar] [CrossRef]
- Lumnitz, S.; Devisscher, T.; Mayaud, J.R.; Radic, V.; Coops, N.C.; Griess, V.C. Mapping trees along urban street networks with deep learning and street-level imagery. ISPRS J. Photogramm. Remote. Sens. 2021, 175, 144–157. [Google Scholar] [CrossRef]
- Tokuda, E.K.; Lockerman, Y.; Ferreira, G.B.A.; Sorrelgreen, E.; Boyle, D.; Cesar, R.M., Jr.; Silva, C.T. A new approach for pedestrian density estimation using moving sensors and computer vision. arXiv 2018, arXiv:1811.05006. [Google Scholar]
- Chen, L.; Lu, Y.; Sheng, Q.; Ye, Y.; Wang, R.; Liu, Y. Estimating pedestrian volume using Street View images: A large-scale validation test. Comput. Environ. Urban Syst. 2020, 81, 101481. [Google Scholar] [CrossRef]
- Nassar, A.S. Learning to Map Street-Side Objects Using Multiple Views. Ph.D. Thesis, Université de Bretagne Sud, Brittany, France, 2021. [Google Scholar]
- Korbmacher, R.; Tordeux, A. Review of Pedestrian Trajectory Prediction Methods: Comparing Deep Learning and Knowledge-Based Approaches. IEEE Trans. Intell. Transp. Syst. 2022, 23, 24126–24144. [Google Scholar] [CrossRef]
- Tordeux, A.; Chraibi, M.; Seyfried, A.; Schadschneider, A. Prediction of Pedestrian Speed with Artificial Neural Networks. arXiv 2018, arXiv:1801.09782. [Google Scholar] [CrossRef]
- Ahmed, S.; Huda, M.N.; Rajbhandari, S.; Saha, C.; Elshaw, M.; Kanarachos, S. Pedestrian and Cyclist Detection and Intent Estimation for Autonomous Vehicles: A Survey. Appl. Sci. 2019, 9, 2335. [Google Scholar] [CrossRef]
- Girshick, R.B. Fast R-CNN. arXiv 2015, arXiv:1504.08083. [Google Scholar]
- Ren, S.; He, K.; Girshick, R.B.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. Adv. Neural Inf. Process. Syst. 2015, 28, 1497. [Google Scholar] [CrossRef]
- Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.E.; Fu, C.; Berg, A.C. SSD: Single Shot MultiBox Detector. arXiv 2015, arXiv:1512.02325. [Google Scholar]
- Fourkiotis, M.; Kazaklari, C.; Kopsacheilis, A.; Politis, I. Applying deep learning techniques for the prediction of pedestrian behaviour on crossings with countdown signal timers. Transp. Res. Procedia 2022, 60, 536–543. [Google Scholar] [CrossRef]
- Sainju, A.M.; Jiang, Z. Mapping Road Safety Features from Streetview Imagery: A Deep Learning Approach. ACM/IMS Trans. Data Sci. 2020, 1, 1–20. [Google Scholar] [CrossRef]
- Wang, Y.; Liu, D.; Luo, J. Identification and Improvement of Hazard Scenarios in Non-Motorized Transportation Using Multiple Deep Learning and Street View Images. Int. J. Environ. Res. Public Health 2022, 19, 14054. [Google Scholar] [CrossRef]
- GStreamer. Available online: https://gstreamer.freedesktop.org/ (accessed on 20 February 2023).
- City Report, Inc. New York Rolling Out Noise Law, Listening Tech for Souped-Up Speedsters. Available online: https://www.thecity.nyc/2022/2/24/22949795/new-york-rolling-out-noise-law-listening-tech-for-souped-up-speedsters (accessed on 16 January 2023).
- Wang, J.; Sun, K.; Cheng, T.; Jiang, B.; Deng, C.; Zhao, Y.; Liu, D.; Mu, Y.; Tan, M.; Wang, X.; et al. Deep High-Resolution Representation Learning for Visual Recognition. arXiv 2019, arXiv:1908.07919. [Google Scholar] [CrossRef] [PubMed]
- Lin, T.; Maire, M.; Belongie, S.J.; Bourdev, L.D.; Girshick, R.B.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft COCO: Common Objects in Context. arXiv 2014, arXiv:1405.0312. [Google Scholar]
- Zhang, J.; Zheng, M.; Boyd, M.; Ohn-Bar, E. X-World: Accessibility, Vision, and Autonomy Meet. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 11–17 October 2021; pp. 9762–9771. [Google Scholar]
- Xu, Y.; Yan, W.; Sun, H.; Yang, G.; Luo, J. CenterFace: Joint Face Detection and Alignment Using Face as Point. arXiv 2019, arXiv:1911.03599. [Google Scholar] [CrossRef]
- NVIDIA. Deepstream SDK. Available online: https://developer.nvidia.com/deepstream-sdk (accessed on 31 January 2023).
Dataset | Location | Size | Description | Annotations? |
---|---|---|---|---|
Google Street View [6] | >100 countries | >220 B | Vehicle-mounted camera images; download not free | No |
Mapillary Street-Level Sequences [7] | 30 cities on 6 continents | >1.6 M | Vehicle-mounted camera images; condition-diverse; GPS-logged | No |
Urban Mosaic [8] | New York | 7.7 M | Vehicle-mounted camera images | No |
SONYC [9] | New York | 150 M | 10-s audio samples | Yes |
Urbansas [10] | European cities and Uruguay | 15 h | 10-s audio & video samples | Yes |
KITTI [11] | Germany | 1 k | Vehicle-mounted camera images; laser scans; GPS-logged | Yes |
NuScenes [12] | Boston, MA & Singapore | 1.4 M | Vehicle-mounted camera images; radar & LiDAR; multi-camera | Yes |
Waymo Open Dataset [13] | California & Arizona | 1 M | Vehicle-mounted camera images; LiDAR; condition-diverse | Yes |
Infrastructure to Vehicle Multi-View Pedestrian Detection Database (I2V-MVPD) [14] | Tunisia | 9.48 k | Vehicle-mounted & stationary synchronized images | Yes |
EuroCity Persons [15] | 31 cities in 12 European countries | 47 k | Vehicle-mounted camera images; condition-diverse; pedestrian-oriented | Yes |
Pedestrian Intention Estimation (PIE) [16] | Toronto | 911 k | Vehicle-mounted camera images; pedestrian & vehicle-oriented | Yes |
KrishnaCam [17] | Pittsburgh, PA | 7.6 M | Images from Google Glass worn by a pedestrian | No |
Multi-view Extended Video with Activities (MEVA) [18] | Facility in Indiana, USA | 9300 h | Stationary RGBIR & UAV video | Yes |
Neovision2 Tower [19] | Hoover Tower at Stanford University | 20 k | Stationary camera images | Yes |
Cityscapes [20] | 50 cities, most in Germany | 25 k | Vehicle-mounted camera images | Yes |
NightOwls [21] | Germany, Netherlands, & UK | 279 k | Vehicle-mounted camera images at night | Yes |
Cerema [22] | Controlled testing environment | 62 k | Stationary camera images of pedestrians; varied rain/fog/light conditions | Yes |
StreetAware | Brooklyn, NY | 7.75 h | Stationary audio & video; synchronized, multi-perspective | No |
Feature | Specification |
---|---|
Internal Storage | 250 GB |
Power capacity | 300 Wh |
Camera resolution | 5 MP |
Camera field-of-view | 160° (85° max per camera)
Camera frame rate | 15 fps (nominal) |
Audio channels | 12 (4 × 3 array) |
Audio sampling rate | 48 kHz |
NVIDIA Jetson Nano GPU and CPU cores | 128 and 4 |
NVIDIA Jetson Nano CPU processor speed | 1.43 GHz |
NVIDIA Jetson Nano RAM | 4 GB LPDDR4 |
Feature | Specification |
---|---|
Number of geographic locations | 3 |
Number of recording sessions | 11 |
Typical recording length | 30–45 min |
Total unique footage time | 465 min (7.75 h) |
Total number of image frames | ≈403,000 |
Video resolution | 2592 × 1944 pixels |
Number of data modalities | 3 |
Synchronized and anonymized | Yes
Video synchronization tolerance | 2 frames |
Audio synchronization tolerance | 1 sample |
Total audio & video size | 236 GB |
Total LiDAR size | 291 GB |
Total size | 527 GB |
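The figures in the summary table can be cross-checked against the sensor specifications: at the nominal 15 fps, 465 min of footage would yield about 418,500 frames, so the reported ≈403,000 frames imply an effective rate just under the nominal one. A rough sanity check (assuming, as an illustration, that the frame count corresponds to a single synchronized stream):

```python
total_frames = 403_000      # "Total number of image frames" (approximate)
total_seconds = 465 * 60    # "Total unique footage time" in seconds
nominal_fps = 15            # sensor spec: nominal camera frame rate

# Effective frame rate implied by the reported totals.
effective_fps = total_frames / total_seconds
print(f"{effective_fps:.2f} fps vs {nominal_fps} fps nominal")  # 14.44 fps vs 15 fps nominal
```

The small shortfall relative to the nominal rate is consistent with the "(nominal)" qualifier in the sensor specification table.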
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Piadyk, Y.; Rulff, J.; Brewer, E.; Hosseini, M.; Ozbay, K.; Sankaradas, M.; Chakradhar, S.; Silva, C. StreetAware: A High-Resolution Synchronized Multimodal Urban Scene Dataset. Sensors 2023, 23, 3710. https://doi.org/10.3390/s23073710