XR can be visualized through portable devices such as smartphones and tablets, or through HMDs that can be worn on the head or integrated into helmets (akin to helmet-mounted displays for aviation pilots). HMDs contain a display and lens assembly in front of either one eye (monocular HMD) or both eyes (binocular HMD). The employed display technologies include liquid-crystal displays (LCDs), organic light-emitting diodes (OLEDs), liquid crystal on silicon (LCoS), or multiple micro-displays to increase total resolution and field of view.
Virtual reality HMDs display only computer-generated imagery (CGI) and feature an electronic inertial measurement unit (IMU) that uses a combination of accelerometers, gyroscopes, and sometimes magnetometers to track the headset’s acceleration, angular rate, and orientation.
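To make the fusion step concrete, the following minimal complementary-filter sketch (in Python) illustrates how gyroscope and accelerometer readings are commonly blended to estimate orientation; the sample values and the 0.98 blending weight are illustrative assumptions rather than figures from any specific headset.

```python
import numpy as np

def fuse_orientation(pitch, roll, gyro, accel, dt, alpha=0.98):
    """One complementary-filter step: blend gyro integration (smooth,
    but drifting) with accelerometer tilt (noisy, but drift-free)."""
    # Integrate angular rate (rad/s) to propagate the previous estimate.
    pitch_gyro = pitch + gyro[0] * dt
    roll_gyro = roll + gyro[1] * dt
    # Recover absolute tilt from the gravity vector seen by the accelerometer.
    ax, ay, az = accel
    pitch_acc = np.arctan2(-ax, np.sqrt(ay**2 + az**2))
    roll_acc = np.arctan2(ay, az)
    # Weighted blend: trust the gyro short-term, the accelerometer long-term.
    return (alpha * pitch_gyro + (1 - alpha) * pitch_acc,
            alpha * roll_gyro + (1 - alpha) * roll_acc)

# Hypothetical 100 Hz sample: slight rotation, gravity mostly along z.
pitch, roll = fuse_orientation(0.0, 0.0,
                               gyro=(0.01, -0.02, 0.0),
                               accel=(0.1, 0.05, 9.79),
                               dt=0.01)
```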
AR and MR headsets, instead, must overlay CGI onto the view of the real world; therefore, they feature optical head-mounted displays (OHMDs). OHMDs employ optical mixers, which consist of partly silvered mirrors that let the user look through the lens while reflecting artificial imagery produced by the device.
Currently, portable XR devices still have limited capacity in terms of the size and complexity of the models they can display, because the CPU, RAM, and storage performance required for continuous 3D rendering is ill-suited to miniaturization and low battery consumption without resorting to an external computing unit such as a high-end PC. The current limitations of XR technology, namely the compromise between desktop and portable solutions in terms of performance, fidelity, and mobility, can be effectively addressed through cloud computing, meaning that a key role will be played by progress in wireless communication, in both local (802.11ax, or Wi-Fi 6) and mobile networks (5G).
4.1. VR Technologies
VR systems normally include more than one hardware device to allow full operation. Generally speaking, the hardware components of any VR system can be classified into three categories according to their function, namely displays, controllers, and motion capture (mocap) devices. The display outputs stereoscopic images to users and is an essential element of the VR system. Commonly used display types include HMDs, mobile devices, and display walls. VR platforms themselves can in turn be grouped into three categories, namely head-based, stationary, and hand-based.
Head-based VR devices consist of helmets or HMDs in which CGI is displayed on the internal screen or pair of screens, one for each eye, with an embedded position-tracking sensor that keeps track of where the user is looking (see
Figure 2). Conversely, stationary VR platforms are usually fixed in place and employ projectors and/or large screens to display CGI to viewers. Lastly, hand-based VR devices, such as smartphones or tablets, are held by viewers up to their eyes.
Concerning head-mounted systems, three types of VR headsets on the market are suitable for the AECO sector (see
Table 6): tethered VR (also known as desktop VR or PC VR), standalone VR (also known as wireless VR or all-in-one VR), and smartphone VR headsets (also known as mobile VR or VR viewers).
Tethered VR headsets (HTC VIVE Pro 2, HP Reverb G2 (HP, Palo Alto, CA, USA), and Varjo AERO (Varjo Technologies Oy, Helsinki, Finland)) are designed to be connected to a PC, with or without wires, to exploit its computing resources; they require a compatible operating system and impose specific hardware requirements on the host machine. Standalone VR headsets (Meta Quest 2 (Meta, Cambridge, MA, USA), HTC VIVE Focus 3/Plus (HTC, Taoyuan, Taiwan), and Pico Neo 3 Pro (VR Expert, Utrecht, The Netherlands)) are completely independent viewers designed to operate without wires or other peripherals: they are equipped with internal memory and integrate all functions, although Bluetooth or a smartphone connection may be required for configuration. Smartphone VR headsets are designed to be connected to a smartphone, which can be physically inserted into the viewer or connected via Bluetooth. These viewers may be designed exclusively for a single smartphone brand or be universal, adapting to multiple models and operating systems (BNEXT VR PRO,
https://www.aniwaa.com/product/vr-ar/bnext-vr-pro/ (accessed on 15 March 2022)).
Notwithstanding the type of headset, the level of immersion that can be achieved depends on several characteristics of the device, including the field of view (FoV), the quality of the display (pixel density, color accuracy, dynamic range, and brightness), the refresh rate (or frame rate) of the CGI, the number of movements allowed to the user (degrees of freedom, DoF), the accuracy of the tracking system, the presence of controllers, and the audio system. Generally speaking, a satisfactory immersive experience requires a FoV of at least 100 degrees (for reference, the human eye has a FoV of about 220 degrees), a refresh rate between 90 Hz and 120 Hz, six degrees of freedom (three rotational DoF for full 360-degree head rotation and three positional DoF for up/down, left/right, and forward/backward movement), as well as an accurate motion tracking system.
Wide-FoV displays enable viewers to experience the virtual environment in a more lifelike way, i.e., focusing on what is in front of them while also perceiving peripheral objects. Higher refresh rates and low latency are recommended to avoid motion sickness symptoms (so-called cybersickness). 6DoF controllers allow for more advanced interactions than less sophisticated point-and-click controllers (limited to 3DoF).
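To make these figures concrete, the short sketch below (hypothetical helper functions, with thresholds taken from the values cited above) computes a display’s angular pixel density and checks the stated comfort baseline.

```python
def angular_resolution(h_pixels: int, h_fov_deg: float) -> float:
    """Pixels per degree (PPD) across the horizontal field of view."""
    return h_pixels / h_fov_deg

def meets_comfort_baseline(fov_deg: float, refresh_hz: float, dof: int) -> bool:
    """Check the baseline cited above: >=100 degrees FoV,
    90-120 Hz refresh rate, and six degrees of freedom."""
    return fov_deg >= 100 and 90 <= refresh_hz <= 120 and dof == 6

# Example: a hypothetical 2160-pixel-wide per-eye display over a
# 100-degree FoV yields ~21.6 PPD (the human eye resolves roughly 60 PPD).
ppd = angular_resolution(2160, 100)
ok = meets_comfort_baseline(fov_deg=100, refresh_hz=90, dof=6)
```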
Motion tracking is a crucial function in the VR system because it ensures user movements and orientation are effectively replicated on screen, in turn enabling a satisfactory interaction with the immersive VR environment [
116]. VR mocap systems continuously capture and process the real-world motions of the user in order to track their current view and provide positioning for interaction with the virtual environment. More precise motion tracking translates into a more seamless and lifelike immersion in VR; conversely, any perceived gap or lag between the user’s actions in real life and their reproduction in VR may greatly disrupt the immersive experience. Positional tracking can be either external or internal.
External tracking (also known as outside-in tracking) employs external sensors and/or cameras to keep track of the VR headset’s position and orientation within a user-defined space (room scale). For full room-scale coverage, more than two sensors are installed to avoid occlusions. Internal tracking (also known as inside-out tracking) uses one or more front-facing cameras embedded in the VR headset to detect its position and may function with the support of external markers. Internal tracking generally performs worse and is less accurate than external tracking; however, it is much more convenient to set up.
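At the core of both approaches lies the same geometric problem: recovering the headset pose from observed reference points. The sketch below illustrates a marker-based variant of this step using OpenCV’s solvePnP; the marker coordinates and camera intrinsics are hypothetical placeholders, not values from any real tracking system.

```python
import numpy as np
import cv2

# Hypothetical 3D marker positions in the room frame (meters) and their
# detected 2D pixel coordinates in the current camera frame.
object_points = np.array([[0.0, 0.0, 0.0], [0.5, 0.0, 0.0],
                          [0.5, 0.5, 0.0], [0.0, 0.5, 0.0]], dtype=np.float32)
image_points = np.array([[320.0, 240.0], [420.0, 238.0],
                         [424.0, 338.0], [318.0, 342.0]], dtype=np.float32)

# Assumed pinhole intrinsics (focal length and principal point in pixels).
camera_matrix = np.array([[800.0, 0.0, 320.0],
                          [0.0, 800.0, 240.0],
                          [0.0, 0.0, 1.0]])
dist_coeffs = np.zeros(4)  # assume an undistorted image

# Solve the perspective-n-point problem: rvec/tvec give the pose of the
# markers relative to the camera, i.e., where the headset is looking from.
ok, rvec, tvec = cv2.solvePnP(object_points, image_points,
                              camera_matrix, dist_coeffs)
```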
Mocap technology commonly employs optical, inertial, mechanical, and magnetic sensors. In portable VR systems, sensors are usually embedded in the headset itself and their data feed is processed by algorithms to provide motion tracking. Other solutions may rely on more complex systems including volumetric capture (MS Azure Kinect, Microsoft, Redmond, WA, USA), hand tracking and haptics (Ultraleap, Leap Motion, San Francisco, CA, USA), eye tracking, or even full body tracking thanks to special suits.
Finally, interactive controllers (joysticks or wands, data gloves, and haptic devices) are an equally important medium for enhancing the realism of VR environments, as they determine how users interact with objects in the virtual environment and which sensory feedback they receive, including haptic and auditory cues.
4.2. AR and MR Technologies
Similarly to VR, the hardware components required to use augmented and mixed reality include processing units, displays, several sensors, and dedicated input devices [
54].
On the other hand, both AR and MR technologies are context-aware rather than immersive: their key characteristic is the capacity to combine the reality that users see with their own eyes with CG objects seamlessly overlaid in their specific positions, unlike VR, in which users are completely isolated from the real world upon entering the virtual environment. A crucial function of context-aware technologies is therefore geolocation, that is, ensuring the correct alignment of the virtual environment onto the real world so that the device displays the virtual content in its expected position.
In particular, AR and MR devices can calculate the coordinates of their actual 3D position in the real world by processing the spatial relationship between themselves, external markers, and key points through Simultaneous Localization and Mapping (SLAM). As soon as the AR/MR device turns on, its sensor equipment (cameras, gyroscope, and accelerometer) scans the surroundings and feeds its data to an algorithm that reconstructs a 3D model of the real-world environment and then positions the device within it. Following this process, the system understands its environment well enough to display CG objects that are realistically placed, oriented, and illuminated so as to feel part of the real world, with the viewer able to move closer and inspect them from multiple directions. Unlike AR, after the SLAM process is complete and the CG content is properly positioned in the real space, MR additionally allows virtual objects to be occluded from view when they would be obscured by real ones (such as walls, floors, and columns that stand between the viewer and their expected position). MR also allows this occlusion to be controlled to display, for example, the pipes beneath a floor or wall surface as an X-ray view by regulating the transparency of the related shaders in the graphics engine.
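At the pixel level, the occlusion and X-ray behaviors just described reduce to a depth comparison between the reconstructed real scene and the virtual content. The following numpy sketch (all buffers hypothetical) composites a virtual layer over a camera frame, hiding it where real geometry is closer and, optionally, ghosting it through for an X-ray effect.

```python
import numpy as np

def composite(real_rgb, real_depth, virt_rgb, virt_depth, xray_alpha=0.0):
    """Per-pixel MR compositing: draw virtual content only where it is
    nearer than real geometry; with xray_alpha > 0, occluded virtual
    pixels are blended through for an X-ray effect."""
    out = real_rgb.astype(np.float32).copy()
    visible = virt_depth < real_depth              # virtual object in front
    occluded = ~visible & np.isfinite(virt_depth)  # behind real surfaces
    out[visible] = virt_rgb[visible]
    if xray_alpha > 0:
        out[occluded] = (xray_alpha * virt_rgb[occluded]
                         + (1 - xray_alpha) * out[occluded])
    return out.astype(np.uint8)
```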
In particular, there are several methods to display concealed utilities in AR/MR, which mainly differ in how they provide the perception of depth of the virtual object in relation to the viewer, and therefore in how accurate and intuitive the AR/MR scene is to the user. Muthalif et al. [
57] focused on the AR/MR visualization of underground pipe utilities and investigated six main display techniques, namely X-ray view, topographic view, shadow view, image rendering, transparent view, and cross-sectional view.
Among these, the X-ray view, which superimposes a cutout of the ground—a virtual “excavation box”—in which utilities are shown at their correct depth and position, was found to be the most accurate and intuitive for the viewer, allowing them to distinguish even multiple utilities in the same view. However, this method requires an accurate and detailed 3D model (a BIM file or even a 3D point cloud from a previous excavation) as well as pre-captured data to perform at its best, and the virtual excavation box may end up covering a large part of the real-world view, which may pose hazards in working conditions.
The topographic view instead superimposes a 2D map of underground utilities directly on the ground, akin to traditional paint markings on streets and paved surfaces. This technique is very intuitive, leaves most of the real-world view unoccluded, and does not require any 3D model; however, it lacks any depth information and may prove confusing if multiple utilities are shown at once. The shadow view integrates depth information into the topographic view by means of shadow lines and projections on the surface, but this adds complexity to the scene.
Image rendering improves the 3D visualization of objects in space by integrating additional reference points in context, e.g., adding virtual edges to the real world and masking virtual objects behind real ones (occlusion). This improves the understanding of the scene at the expense of the additional computing power and 3D model accuracy needed for localization. Lastly, the transparent view and cross-sectional view see little application in the field.
Table 7 shows the different static and dynamic methods with which virtual content can be superimposed on the real-world view.
In order to support human activities effectively, AR and MR devices should preferably be head mounted, as this eliminates hands-on interaction with the device; should be equipped with semi-transparent lenses or optical displays to allow CGI to be superimposed on the real-world view; and should feature cameras and sensors to continuously scan the real environment and enable mixed reality experiences. AR and MR headsets can also be equipped with directional speakers and active noise reduction microphones, with more advanced products allowing voice control through virtual assistants.
Multiple models of AR and MR devices are currently on the market [
55], designed to meet different needs (see
Table 8).
Overall, AR can be implemented according to four methods: optical see-through, video see-through, eye multiplexed, and projection based. The first two are the most widely adopted in AR headsets available on the market [
56]. In optical see-through systems, AR is achieved by superimposing virtual images over the direct view of the real world, commonly by projecting CG content through half mirrors or prisms. With this method, the real-time view of the world is maintained while AR content is displayed. In video see-through systems, on the other hand, the camera of the AR device continuously captures the real world in front of it, processes each frame by adding CG content, and finally displays the AR image to the viewer on the device’s screen. Concerning the types of display featured in AR devices, there are three main solutions: monocular (a screen in front of one eye), binocular (the same screen in front of both eyes), or dichoptic (a different screen in front of each eye, enabling depth perception).
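A video see-through pipeline can be summarized as a simple per-frame loop, sketched below with OpenCV; render_cg_overlay stands in for whatever AR engine produces the virtual layer and is purely illustrative.

```python
import cv2

def render_cg_overlay(frame):
    """Placeholder for the AR engine: returns the frame with CG content
    drawn on it (here, just a label where a virtual object would go)."""
    cv2.putText(frame, "CG object", (50, 50),
                cv2.FONT_HERSHEY_SIMPLEX, 1.0, (0, 255, 0), 2)
    return frame

cap = cv2.VideoCapture(0)                   # the device's forward-facing camera
while cap.isOpened():
    ok, frame = cap.read()                  # 1. capture the real world
    if not ok:
        break
    frame = render_cg_overlay(frame)        # 2. add CG content to the frame
    cv2.imshow("video see-through", frame)  # 3. display the AR image
    if cv2.waitKey(1) == 27:                # Esc to quit
        break
cap.release()
cv2.destroyAllWindows()
```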
Regarding AR HMDs, Google Glass Enterprise Edition 2 (see
Figure 3) is among the most widespread optical see-through devices in the industry. It is used to access documents on the go, which may include texts, images annotated with detailed instructions, training videos, or quality assurance checklists. It can also connect with other AR devices to communicate, for instance by livestreaming one’s view to enable real-time collaboration and remote assistance. Control methods include voice commands to launch applications, allowing full hands-free operation.
Moverio BT-300FPV AR glasses by Epson (Suwa City, Japan) are optimized for drone flying: the video feed from the aircraft is displayed in first-person view (FPV) on a transparent screen, which allows the viewer to pilot the UAV, monitor its flight statistics, and keep the UAV in sight at all times (see
Figure 3). In 2022, XYZ (London, UK) released The Atom, a next-generation engineering-grade AR headset for construction. Combining a safety-certified hard hat, augmented reality displays, and the built-in computing power of the HoloSite platform, the device can position holograms of BIM models on-site with millimeter accuracy.
In addition to headsets, smartphones and tablets can provide handheld AR experiences: the user points the device’s camera at the real-world environment, and the installed AR app superimposes content such as images, animations, or data on the screen. Indeed, most mobile devices on the market feature high-resolution cameras as well as accelerometers, a GNSS receiver, a solid-state compass, and even LiDAR (Measure Australia, Surry Hills, Australia), making them fully equipped for AR operation. AR applications and software development kits (SDKs) are already available from the largest consumer tech companies, such as Apple, Facebook, and Google, and some AR headsets are explicitly designed to mount smartphones (Mira Prism, Vuzix M300,
https://www.aniwaa.com/product/vr-ar/vuzix-m300/, accessed on 15 March 2022).
On the other hand, the market for MR devices is still smaller than that for AR, and currently consists of Microsoft HoloLens and Magic Leap (Plantation, FL, USA), with the recent introduction of the Varjo XR-3 (see
Figure 4). Indeed, MR goggles are high-performing HMDs, which must include sensors such as gyroscopes, accelerometers, Wi-Fi antennas, digital compasses, GNSS, and conventional and depth sensing cameras [
1,
2] in order to scan and capture the surroundings and integrate them with fully interactive CG content.
The flagship MR device on the market is arguably the MS HoloLens, currently in its second iteration, which takes advantage of eye and hand tracking to support the positioning of virtual content and a user interface that does not rely on external controllers. Users can also log into the HoloLens seamlessly using iris recognition. Moreover, the HoloLens features smart microphones and natural language speech processing algorithms to ensure voice controls work properly even in noisy industrial environments. Park et al. [
58] conducted a literature review to investigate the current status of and trends in HoloLens studies published over the past five years (2016–2020), showing a growing use of the MS HoloLens in multiple fields, from medical and surgical aids and systems, medical education, and simulation to industrial engineering, architecture, civil engineering, and other engineering fields.
A recent example of a video pass-through MR headset is the Varjo XR-3, a tethered HMD that claims the industry’s highest resolution (2880 × 2720) and widest field of view (115°) and provides LiDAR-powered depth awareness for pixel-perfect real-time occlusion and 3D world reconstruction. In addition, the device allows a full VR experience.
Apart from standalone MR headsets, several smartphone-based MR headsets are already available or about to enter the market, including Tesseract Holoboard Enterprise Edition, Occipital Bridge (
https://www.aniwaa.com/product/vr-ar/occipital-bridge/, accessed on 15 March 2022), and Zappar ZapBox (
https://www.aniwaa.com/product/vr-ar/zappar-zapbox/, accessed on 15 March 2022). These can provide MR experiences using the camera and display of the smartphone or semi-transparent mirrors to overlay the CG content onto the real world.