1. Introduction
Augmented reality (AR) technology, which overlays virtual objects on the real world, is gaining increasing attention and various AR applications are being developed. These applications were initially developed as mobile AR running on devices such as smartphones and tablet PCs. Mobile AR systems use the smartphone camera to show the real world while virtual objects are augmented on the camera screen. As smartphones have become widely available, anyone can now easily enjoy AR applications. For example, Pokemon Go is the most successful mobile AR game based on location-based services (LBS). In this game, creatures are registered at nearby locations and the user hunts the augmented creatures through touch gestures [1]. IKEA’s catalog application is another typical example of a mobile-based AR application. Early versions of this application used IKEA’s printed catalog as a kind of marker: if the user places the catalog where they want to pre-position a piece of furniture, the mobile device recognizes the marker and augments the furniture there. With the development of computer vision technology, a mobile device running Apple ARKit or Google AR Toolkit [2] can distinguish the floor and walls, so the furniture can be placed at the desired position without markers. However, mobile-based AR requires the user to continuously hold up the device to view the superimposed virtual objects, which is its most significant limitation.
Subsequently, Google launched Google Glass, a wearable AR device, and applications based on head-mounted displays (HMDs) began to be proposed in earnest. Unlike mobile AR, in which virtual objects must be viewed on a narrow display screen, HMD-based AR applications provide a convenient and immersive experience. HMD-based AR can be applied to a wide variety of fields because, while wearing the HMD, the user can view augmented objects and manipulate them at the same time. In particular, AR applications using HMDs are expected to be used in teleconferencing, tele-education, the military, maintenance and repair operations (MRO) and exhibitions [3]. Google Glass pioneered the market for wearable AR devices by implementing a simple level of AR functionality. Microsoft, on the other hand, further improved the realization of augmented reality with the HoloLens, which provides immersive 3D scenes with high-quality realism. Because the HoloLens is equipped with simultaneous localization and mapping (SLAM), it can acquire the geometry of the space in which the user is located and track the user’s position. This makes it possible to place virtual objects on a desk or in mid-air in 3D space. Once a virtual object is correctly registered at a specific point, the user can move around it and view it like a hologram.
Interfaces developed for 2D displays [4], such as the keyboard, mouse, touch screen and stylus pen of existing computer and mobile device environments, are difficult to use smoothly in an AR environment where objects are registered and manipulated in real 3D space, such as HMD-based AR [5]. This is because these user interfaces are 2D input methods that cannot control the z-axis (depth) [6]. Thus, with the increasing number of applications and usage scenarios that augment 3D virtual objects in the real world and interact with them, there is a growing demand for dedicated AR interfaces that work naturally within a 3D AR environment.
Even the most advanced HMD AR device, the HoloLens, lacks an interface that allows easy interaction. Although precise selection and manipulation of objects are among the most important issues in a 3D AR environment [6], the current gaze-assisted selection (GaS) interface of the HoloLens is somewhat inaccurate and has poor usability [7]. This is because it casts an invisible ray in the direction of the user’s head and uses the point where the ray intersects as a cursor, like a mouse pointer. The user must keep moving their head to control the pointer, resulting in fatigue (especially of the neck), and the accuracy of selecting distant objects is greatly reduced [5]. To use the fingertip gesture, the user’s hand must be located within the field of view of the camera mounted on the HoloLens, so the user must continuously hold their hand up high, resulting in fatigue, as shown in Figure 1. Ultimately, this fatigue limits the user’s application usage time.
In addition, a rather unintuitive way of interacting with objects is used. When an object is selected, eight wireframe editing boxes are activated and the user can adjust the size of the object by pulling these editing boxes one by one [7]. This is borrowed from the method used to adjust the size of 2D media, such as a photograph or graphic, in office programs such as word processors and presentation software. Because it is similar to an interface already learned in the existing computer environment, there is no need to separately learn an interface for object manipulation. However, this interface is less accurate and less usable in 3D environments, because it requires selecting editing boxes that are small relative to the object, and ray-casting-based interfaces generally have difficulty selecting such small targets [5].
There is also a limitation in the object registration method. With the existing HoloLens interface, there are no significant problems when registering an object that is in contact with a plane, such as a floor, desk or wall. However, unlike the real world, virtual reality (VR) and AR environments frequently register objects floating in mid-air, including educational objects, flying objects, personal widgets and multi-purpose windows. In the HoloLens AR interface, the area in which such objects can be registered is limited to the distance the user can reach. In other words, to register an object at a long distance, the user must move directly to the vicinity of that location. In an application that requires frequent object registration and manipulation, the user must travel a long distance every time a remote object is moved. Therefore, authoring situations and tasks become somewhat inefficient when this interface is used in an AR environment.
In this paper, we propose the AR Pointer, a new user interface for the convenient registration and manipulation of remote objects in an AR environment. This interface is designed to be intuitive and highly usable, based on ray-casting with a depth-variable method [5]. In particular, it is well suited to replacing the existing interfaces of see-through HMD-based AR applications, such as those of the HoloLens. Instead of creating an input device with separate sensors, we let users employ their own mobile device as the input device, because mobile devices are already equipped with a six-degree-of-freedom (6-DoF) inertial measurement unit (IMU) and a touch screen. The proposed interface uses the depth-variable method to register an object at a specific distant point or in mid-air. It also allows the user to manipulate objects using familiar, user-friendly mobile swipe/pinch gestures. By applying this familiar interaction method, users without expertise or experience in AR environments can use it easily. This design allows users to interact actively with AR content.
In Section 2, we describe related work on 3D object manipulation. Section 3 describes the implementation of the AR Pointer and its object manipulation interactions. Sections 4 and 5 present the results of the two experiments, respectively. In the final section, we discuss the conclusions and future work of this paper.
3. AR Pointer
3.1. Implementation
Our proposed AR Pointer was implemented using ray-casting technology, employing a mobile device as a sensor-based controller. The ray-casting method reduces fatigue when using an HMD because the user does not have to hold their hand up, and it is useful for controlling remote objects. One of the biggest problems of ray-casting interfaces, object manipulation, was solved by using the smartphone’s swipe/pinch gestures. A smartphone with an accelerometer, gyro sensor and touch screen was used as the input device, so the direction of the ray could be generated accurately from the sensor data. The following describes the implementation of the proposed AR Pointer in detail.
The AR Pointer is an interface designed for object registration and manipulation in AR environments. We used the IMU (gyro and accelerometer sensors) of a smartphone (an Apple iPhone 8 in our prototype, although any iPhone or Android phone with a 6-DoF IMU can be used) to point accurately in the desired direction and guarantee the high accuracy required for a sensor-based approach. We also designed the user interface so that it can be used like a laser pointer, providing a natural and intuitive interaction; to achieve this, we constructed the interface based on ray-casting technology.
Ray-casting is a 3D user interface technique often used in VR environments [14,18]; it determines the intersection between a ray extending from the user and a virtual object. Ray-casting is mainly used to select the 3D virtual objects that intersect the ray [19] and is rarely used for manipulations such as the registration, translation, scaling and rotation of objects. In traditional ray-casting, the ray is assumed to extend indefinitely, like a physical laser pointer, rather than specifying a point in mid-air in 3D space. Therefore, interactions such as registering an object in mid-air or setting its movement path are impossible. We solve this problem by including the length of the ray segment, which we define as the ray-depth. This ray-depth can be adjusted to specify the desired position in 3D space. In our prototype, ray-depth control and content manipulation were implemented using the smartphone touch screen.
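To make the difference concrete, the following is a minimal sketch in Python (our own illustration rather than the prototype’s Unity implementation; the object representation and names are hypothetical), contrasting conventional infinite-ray selection with the finite ray-depth used to specify a point in mid-air:

```python
import numpy as np

def ray_endpoint(origin, direction, ray_depth):
    """AR Pointer: a finite ray segment whose end point can lie in mid-air."""
    d = np.asarray(direction, dtype=float)
    d /= np.linalg.norm(d)                          # unit pointing direction
    return np.asarray(origin, dtype=float) + ray_depth * d

def select_object(origin, direction, objects):
    """Traditional ray-casting: pick the nearest object hit by an infinite ray."""
    d = np.asarray(direction, dtype=float)
    d /= np.linalg.norm(d)
    o = np.asarray(origin, dtype=float)
    best, best_t = None, np.inf
    for obj in objects:                             # obj has .center and .radius (bounding sphere)
        oc = np.asarray(obj.center, dtype=float) - o
        t = float(np.dot(oc, d))                    # distance along the ray to the closest point
        if 0.0 < t < best_t and np.linalg.norm(oc - t * d) <= obj.radius:
            best, best_t = obj, t
    return best
```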
The information needed to specify the location of the AR Pointer (i.e., the starting point of the ray) is based on the user’s location. Because of its highly accurate SLAM, the HoloLens can determine the position of the user in the reconstructed 3D space. Next, we need to determine the location of the mobile device the user is holding. As described above, the HoloLens can detect only a hand located within a narrow area in the viewing direction; in other words, it cannot track the exact location of the user’s mobile device. Therefore, to perform Experiment 2, which compares the basic interface of the HoloLens with the proposed AR Pointer, we measured the relative position of the mobile device and the HMD while the AR Pointer was in use. An external camera was needed to determine the precise location of the AR Pointer. Because our prototype could use a predefined and reconstructed space, we installed a Microsoft Kinect v2 device to detect the user’s hand position. The Kinect measures and stores the user’s 3D skeleton joint information and converts it from the camera coordinate system to the world coordinate system using pre-calibrated values. This information can also be applied when using the AR Pointer in HMD-based AR in predefined environments or in other AR environments without SLAM capabilities (such as spatial AR, for example, projection-based AR). We used this 3D skeleton information throughout Experiment 1, which compares the manipulation method of the Unity Engine [20] with the AR Pointer method on a monitor screen instead of the HMD.
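As a brief sketch (our notation, not the paper’s), assuming the pre-calibrated values form a rigid transform, with rotation R_cw and translation t_cw, from the Kinect camera frame to the world frame, a measured hand-joint position p_cam would be converted as:

```latex
\mathbf{p}_{\mathrm{world}} = R_{cw}\,\mathbf{p}_{\mathrm{cam}} + \mathbf{t}_{cw}
```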
To implement the AR Pointer, we estimated its horizontal and vertical angles from the IMU sensor data of the smartphone. We used the gyro sensor because it provides accurate values over short periods while the user is moving, and the accelerometer because it provides an accurate tilt value over long periods when the user is stationary. However, the mobile device’s raw rotational data presents certain problems. Because the gyro sensor integrates angular velocity, its error increases with time; in the case of the accelerometer, the user’s own acceleration is added to the sensor value, causing errors. Therefore, a complementary filter is applied to the sensor values [1].
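The following is a minimal sketch of a standard complementary filter for one tilt axis, assuming a gyro rate reading and an accelerometer vector are available at each time step (the blending coefficient and function names are illustrative, not values from the paper):

```python
import math

ALPHA = 0.98  # blending coefficient (illustrative value, not from the paper)

def complementary_filter(prev_angle, gyro_rate, accel, dt):
    """Fuse gyro and accelerometer readings for one tilt axis (angles in radians)."""
    # Gyro path: integrate angular velocity (accurate short-term, drifts over time).
    gyro_angle = prev_angle + gyro_rate * dt
    # Accelerometer path: tilt from the gravity direction (noisy short-term, stable long-term).
    ax, ay, az = accel
    accel_angle = math.atan2(ay, math.sqrt(ax * ax + az * az))
    # Trust the gyro at high frequency and the accelerometer at low frequency.
    return ALPHA * gyro_angle + (1.0 - ALPHA) * accel_angle
```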
In the AR Pointer, ray-depth information is included in addition to the sensor information, because the ray-depth must be adjusted to point in mid-air in 3D space. Touch gestures are used for depth control: swiping a finger up (in the pointing direction) on the touch screen allows the user to point farther away, while swiping the finger down (towards the user) allows them to point closer, as shown in Figure 2. A particular point in 3D space can thus be specified by combining the positions of the user and the AR Pointer, the IMU sensor values and the ray-depth information.
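A minimal sketch of this depth control (the gain and range constants are our own illustrative assumptions, not values from the paper):

```python
DEPTH_GAIN = 0.01                   # metres of ray-depth per pixel of vertical swipe (assumed)
MIN_DEPTH, MAX_DEPTH = 0.3, 10.0    # assumed working range in metres

def update_ray_depth(ray_depth, swipe_dy_pixels):
    """Swipe up (positive dy) points farther; swipe down (negative dy) points closer."""
    ray_depth += DEPTH_GAIN * swipe_dy_pixels
    return max(MIN_DEPTH, min(MAX_DEPTH, ray_depth))
```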
In Figure 2a, D denotes the end point of the AR Pointer, θ and φ denote the vertical and horizontal angles, respectively, and R represents the AR Pointer’s ray-depth. Using these three pieces of information, the end point can be obtained in the orthogonal coordinate system when the user’s position S is fixed. The height h of the pointing device held by the user is also included in the end-point formula. In addition, when the AR Pointer is used while the user is moving, the user’s displacements along the X and Z axes are added to the X and Z coordinates, respectively. As shown in Figure 2b–g, we were finally able to visualize the user’s location in 3D space, as well as the direction and ray-depth of the AR Pointer.
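As a sketch using the symbols above (the paper’s exact axis convention may differ): if θ is measured upward from the horizontal plane, φ is the azimuth in the X–Z plane and the ray starts at height h above the user’s ground position (S_x, 0, S_z), the end point would be

```latex
D =
\begin{pmatrix}
S_x + R\cos\theta\sin\varphi \\
h   + R\sin\theta            \\
S_z + R\cos\theta\cos\varphi
\end{pmatrix}
```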
3.2. User Interaction for Object Manipulations
The AR Pointer specifies the ray direction, and the ray-depth along that direction can be adjusted to calculate the ray’s end point. Because this method is based on ray-casting, when a touch event is input, the system checks whether any virtual object intersects with the ray. When the ray does not intersect any object (Figure 3c), the object registration mode is entered, in which virtual objects can be registered in space. To register an object at a specific location (Figure 3d), a double-tap gesture is used on the smartphone; the virtual object to be registered in the real world can be previewed in advance with a single tap. When the ray intersects an object (Figure 3f), the object is selected with a single tap (Figure 3g) and the object manipulation mode is entered, in which various manipulations can be performed. In this mode, core gestures (Figure 3h,i), mainly those used on smartphones, are used in a similar way to manipulate the 3D object [2,21]. Object control proceeds as follows: a pinch gesture, which brings two fingers together, shrinks the object, while a spread gesture magnifies it. Two-finger drag gestures rotate the object around any of its three rotational DoF. To move the object, a long-tap gesture de-registers it so that it is ready to be re-registered. Finally, a single tap releases the selected object and exits the manipulation mode.
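The following is a minimal sketch of this mode and gesture mapping (our own illustration; the state handling, gesture names and action names are assumptions rather than the prototype’s code):

```python
# Hypothetical gesture-to-action dispatch for the two AR Pointer interaction modes.
REGISTRATION, MANIPULATION = "registration", "manipulation"

def handle_gesture(mode, gesture, ray_hits_object):
    """Return the action to perform and the next interaction mode."""
    if mode == REGISTRATION:
        if gesture == "single_tap" and ray_hits_object:
            return "select_object", MANIPULATION        # ray hit an object: enter manipulation mode
        if gesture == "single_tap":
            return "preview_object", REGISTRATION       # preview the object to be registered
        if gesture == "double_tap":
            return "register_at_endpoint", REGISTRATION # place the object at the ray's end point
    elif mode == MANIPULATION:
        if gesture == "pinch":
            return "scale_down", MANIPULATION
        if gesture == "spread":
            return "scale_up", MANIPULATION
        if gesture == "two_finger_drag":
            return "rotate", MANIPULATION
        if gesture == "long_tap":
            return "deregister_for_move", REGISTRATION  # ready to re-register at a new end point
        if gesture == "single_tap":
            return "release_object", REGISTRATION       # exit manipulation mode
    return "none", mode
```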
3.3. Use-Cases
For existing AR content and applications, a professional authoring tool must be used to register 3D virtual objects in the real world. Therefore, only skilled application programmers capable of handling an authoring tool (3D graphics software) can register content, such as 3D virtual objects, at a specific location, while users can only view the registered objects passively. As such, the interaction that ordinary users can experience in AR is greatly limited. However, with the proposed AR Pointer, users can directly participate in registering and manipulating virtual objects, making it possible to create a wider variety of applications. We expect our proposed interface to be useful in 3D AR environments that involve the registration and manipulation of 3D virtual objects. We now introduce some use-case examples where the proposed interface can be useful.
First, there are placement scenarios for furniture and appliances. Because these products occupy a large area in the home, the products that can be placed are limited by the available free space. In addition, the atmosphere of the room and the harmony of the furniture are important aspects of interior design, so if users can render these products and place them in the room in advance, they can save the effort of actually moving and rotating large, heavy objects. When decorating a home, users can also expand their choices and improve their satisfaction by placing virtual furniture. A few applications have already implemented this scenario. For example, there is an IKEA furniture layout application [20] that includes 3D virtual object registration and manipulation scenarios. In this application, the catalog is used as a marker for mobile marker recognition and the furniture is placed at the corresponding position when the marker is recognized. Therefore, the user must move the catalog to the desired position each time, and it is difficult to place several furniture items at the same time. In such scenarios, the AR Pointer can overcome the inconvenience of existing applications: a ray is computed and generated from the AR Pointer according to the actual user position, so users can place virtual furniture at the desired location even where there are no markers. The object manipulation methods introduced in Section 3.2 can also be applied to enable interaction with the virtual furniture.
The next scenario is the smart AR classroom. In this smart class, 3D virtual objects can be used to provide students with an immersive learning environment. Because the virtual objects are multimedia materials that complement the class content, the teacher should be able to freely and effectively register and manipulate each object during lectures and explanations. Within the STEM domain, AR education for spatial orientation skills [22,23] or chemistry is a good example. A variety of studies have examined AR technologies that support the understanding of atomic and molecular structures [1,24]. However, these applications almost always use markers or computer vision-based feature-point matching to augment atoms at a given location. As a result, the space available for placing objects is limited to the desk, and the only additional interaction the teacher or students can perform is moving the markers closer together to combine atoms into a molecule. In this scenario, if the AR Pointer is used, the teacher can register and augment the atom to be described at a student’s desk or in the middle of the classroom. Then, for example, hydrogen and oxygen atoms can be moved closer to each other to illustrate the form of the H2O molecular bond, and the molecule object can be rotated slowly to show the three-dimensional structure of the bond.
Another example is an astronomy class in which the movements and physical laws of celestial bodies are described and observed. To explain the orbits of the solar system and orbital motion, the teacher can use the AR Pointer to register solar system objects in the middle of the classroom. Then, for example, to explain the rings of Saturn, the corresponding object could be selected and scaled up. When the teacher and students exchange questions and answers, the students can interact with these objects in the same way; instead of passing the teacher’s interface to the students, the students’ own mobile devices can be used as AR Pointers to participate in the object manipulation tasks.
4. Experiment 1
We conducted two experiments to evaluate the performance and usability of the proposed AR interface, the AR Pointer. The first experiment is a comparison with existing authoring tools: object manipulation using authoring tools employed by programmers, such as OpenGL [25] and the Unity Engine [20], was compared with object manipulation using our proposed AR Pointer. We also tested whether there is a meaningful difference between the results of users who are familiar with 3D AR environments (users with 3D game/rendering/graphics programming experience) and those who are not. In the second experiment, we performed three interactions (rotation, scaling and translation) using the GaS-based wireframe interface and our proposed AR Pointer in the HoloLens environment and compared the results in terms of task completion time, fatigue and usability.
4.1. Design
Evaluations of 3D user interfaces focus on three main aspects [26]: object manipulation [21,27], viewpoint manipulation [28] and application control [29]. Since the AR Pointer is proposed for the efficient registration and manipulation of 3D virtual objects, our experiment focused only on the first aspect. Interviews were also carried out covering three categories: speed, intuitiveness and ease of learning. A total of 26 participants took part in Experiment 1. One group had experience with 3D graphics tools and was aged between 25 and 33 years (mean [M] ± standard deviation [SD] = 28.2 ± 2.2; 10 male, 3 female, one left-handed). The other group had no experience with 3D graphics tools and was aged between 25 and 31 years (M ± SD = 27.8 ± 1.9; 9 male, 4 female, all right-handed). Each group performed four tasks: registration, translation, scaling and rotation of 3D virtual objects. Each task was performed four times with the proposed AR Pointer interactions and with the keyboard-and-mouse combination, respectively. In this experiment, we implemented the furniture layout application prototype described in Section 3.3 and used it as the experimental task. Completion time was measured as how quickly the furniture was moved to the suggested location in the room and transformed to the correct scale and pose.
The subjects manipulated the objects using the interfaces of the Unity Engine [20]. They could register and manipulate the objects while continuing to view the actual surrounding environment on the PC screen. The objects could be manipulated with the mouse in the scene view or by entering numbers through the keyboard in the inspector view. The basic usage of the authoring tool (Unity Engine) was briefly explained for about 5 min before the experiment started. The experiment proceeded in two stages. First, a ground-truth object was randomly generated in the Unity test scene and the subjects had to register an object at the corresponding location (registration task). In the next step, a guide object was created at a new location; the subjects performed the re-registration (translation) task by selecting the object registered in the previous step and then manipulated its orientation (rotation) and size (scaling). The first experiment was performed in the Unity Engine authoring tool without wearing the HMD; when manipulating objects using the AR Pointer, subjects still checked their work on a PC monitor. The first experiment was usually completed within 5 min.
4.2. Results
We found that, for all interaction methods, users achieved faster completion times with the proposed AR Pointer than with the keyboard-and-mouse combination representing the existing 2D interface. Participants familiar with the 3D AR environment did not show a significant difference in completion time between the keyboard-and-mouse combination and the AR Pointer; for certain manipulation methods, the existing 3D graphics tool method even yielded similar or better completion times. However, participants who were not familiar with 3D AR environments achieved significantly better results when using the AR Pointer, as shown in Figure 4.
4.3. Data Analysis and Discussion
We documented and analyzed the reactions and responses of the participants throughout the experiments and during the interviews conducted afterwards. Participants who were not familiar with the 3D AR environment found it very difficult to understand the 3D space. This confusion continued throughout the four experimental attempts, and we noticed that these participants kept trying to identify where each axis was, with little success. We therefore conclude that the existing 3D graphics tool method is difficult to learn and has a low level of intuitiveness for general users. Conversely, users familiar with the 3D AR environment were also familiar with the 3D graphics tool, so their completion times did not differ significantly. Because only professionals capable of using a 3D graphics tool could register and manipulate 3D AR application content as desired, ordinary users could experience only limited, passive interaction with object manipulation. This results in limited content and limited object interaction in various AR applications.
However, with the AR Pointer, users could intuitively register the object in the pointing direction, without considering the x-, y- and z-axes, thereby reducing completion time. The AR Pointer also continued to show shorter completion times than the keyboard-and-mouse combination. Furthermore, while completion time with the keyboard-and-mouse combination did not change significantly even with repeated attempts, completion time with the AR Pointer decreased continuously, demonstrating its ease of learning.