1. Introduction
Attitude estimation using low-cost sensors plays an important role in many consumer electronic applications and has attracted much research attention. For example, in rehabilitation and biomedical engineering, the attitude information is applied for elderly fall detection [
1]. In indoor pedestrian dead reckoning application, the attitude of the measurement unit is used for step detection and heading estimation [
2].
MARG (magnetic, angular rate, and gravity) sensors and monocular cameras are two kinds of low-cost sensors that are widely used in consumer electronic applications to provide attitude information [
3,
4,
5]. However, using MARG sensors or camera standalone for attitude estimation has some limitations.
On the one hand, the MARG sensors contain a magnetometer which is used to correct heading drift of the attitude estimation. Generally speaking, the magnetometer is factory calibrated to compensate for any error sources that are internal to the device. However, for the errors that are introduced externally by mounting structures or adjacent devices, an additional calibration process is essential [
6]. In order to calibrate the magnetometer, the device needs to be moved in all possible directions to collect data. This is not user-friendly. Moreover, when it works in an environment with abnormal magnetic fields, the attitude estimation performance will deteriorate significantly.
On the other hand, when there are artificial vision fiducials arranged in the environment, the attitude and position of a camera can be recovered from one image by solving the perspective-and-point (PnP) problem [
7,
8]. The point correspondence-based algorithms require at least four feature points in the field of view. However, in a dynamic and possibly cluttered environment, the number of feature points may be less than four.
The goal of this work is to propose a low-cost fusion method to achieve absolute attitude estimation in an environment with only two pre-calibrated artificial vision fiducials. The sensors to be fused include gyroscope, accelerometer, and camera. The advantages of such a sensor combination are twofold:
Compared with the MARG sensor combination, our method can work in an abnormal magnetic field environment, especially in an indoor environment.
Compared with the combination of gyroscope and camera, our method still converges when there are only two feature points in the image.
The second one is important for the resource-constrained artificial fiducial system, such as the AprilTag system [
9] and the visible light communication reference system [
10,
11].
The main contributions of our work are summarised as:
We derive an implicit geometry measurement for camera-based attitude estimation. This measurement is associated with attitude and is independent of the position of the camera.
A nonlinear complementary filter is proposed to fuse angle rate measurement, vector measurement, and geometry measurement. There are only two parameters to be adjusted.
The remainder of this paper is organized as follows.
Section 2 explores the literature of attitude estimation algorithms based on MARG sensors, camera standalone, and visual-inertial fusion, respectively.
Section 3 presents the sensor models including angle rate measurement, vector measurement, and the proposed geometry measurement model.
Section 4 presents the nonlinear complementary filter fusing inertial sensors and a camera. The stability analysis of the proposed filter is in this section. An attitude initial alignment method is proposed to provide the initial value of the filter. The discrete implementation of the filter on quaternion is also given in this section. The algorithms are validated using data collected by a smartphone with built-in inertial sensors and a rear camera. Three other representative methods of attitude estimation algorithms are also implemented on the collected data. The results are shown in
Section 5. Finally, concluding remarks and future work are presented in
Section 6.
2. Related Works
Various solutions have been proposed for attitude estimation using low-cost sensors, including (1) MARG (magnetic, angular rate, and gravity) sensors-based methods, (2) monocular camera-based methods.
For MARG sensors-based methods, the gyroscope provides angle rate measurement. The accelerometer and magnetometer provide attitude associated vector measurements. When the initial attitude is known, attitude can be computed by integrating the angle rate measurement [
12]. Meanwhile, the attitude can be constructed directly from the vector measurements [
13]. To estimate the attitude from vector measurements is to solve a least square problem, the Wahba’s problem. A unique closed-form solution can be provided by the QUEST (quaternion estimator) algorithm in [
14] and singular value decomposition (SVD)-based method in [
15].
The attitude estimation accuracy obtained by numerical integration of the angle rate measurement is good in a short time. However, the bias and noise of the gyroscope make the estimated value deviate more and more from the true value over time. On the other hand, the attitude recovered from vector measurements hosts long-term stability. However, the instantaneous linear acceleration and magnetic field anomalies will decrease the estimation accuracy.
To achieve good bandwidth and long-term stability, many MARG sensors-based attitude estimation algorithms resort to fuse the angle rate measurement and the vector measurements. The classic fusion algorithms are based on the extended Kalman filter [
16,
17]. These stochastic approaches involve the update of the error covariance matrix and gain matrix which will lead to a large computational burden.
The Mahony’s nonlinear complementary filter formulates the fusion problem as deterministic nonlinear observer kinematics on the special orthogonal group [
18]. The observer kinematics include a prediction term based on the angle rate measurement and a correction term derived from the estimation residual. To calculate the correction term, the direct and passive versions of Mahony’s complementary filter rely on the algebraic reconstruction of attitude from vector observations, while the explicit version explicitly uses the cross product of the reference vectors and the observed vectors. Mahony’s complementary filter has only two adjustable parameters and ensures almost global asymptotic stability. Two passive nonlinear complementary filters algorithms that are implemented on quaternion are proposed in [
19,
20]. The algebraic reconstruction of attitude from vector observation is based on Levenberg–Marquardt optimization algorithm in [
19] and singular value decomposition (SVD) in [
20], respectively.
Different from the nonlinear complementary filter, the linear complementary filter linearly combines the attitude quaternion integrated from the angular velocity with the one reconstructed from the vector observations. Therefore, the linear complementary filter has a frequency domain interpretation. The Madgwick’s linear complementary filter in [
21] applies the gradient descent algorithm to solve the quaternion version of Wahba’s problem. The optimization algorithm only computes one iteration per time sample, provided that the convergence rate of the estimated attitude is equal to or greater than the change rate of physical orientation. In [
22], a fast complementary filter is proposed by deriving a quaternion increment that is free of iterations. An improved gradient descent based attitude complementary filter in [
23] provides fast error convergence and robustness by decoupling the magnetic field variance from roll and pitch. Gain-scheduled or adaptive complementary filters are more robust to strong accelerations and magnetic field disturbances than gain-fixed complementary filters [
4,
24].
For attitude estimation, in addition to MARG sensors, the monocular camera is also an attractive low-cost sensor. Using a camera standalone for pose estimation is quite a standard task. When there are artificial vision fiducials arranged in the environment, the attitude and position of the camera can be recovered from one image by solving the PnP problem [
7,
8]. In an environment without artificial fiducials, the relative rotation and scaled translation can be restored from two images with natural features. This problem has been extensively researched, and a large number of algorithms have been developed. The most well-known ones are the 8-point algorithm [
25] and the 5-point minimal algorithm [
26].
Considering that the frame rate of the low-cost camera is relatively low, it is expected to fuse inertial sensors and a camera to get a higher data rate. The monocular visual-inertial system (VINS) based on extended Kalman filter [
27] or bundle adjustment formulation [
28] can provide relative attitude estimation. However, VINS or standalone camera pose estimation simultaneously calculates the attitude and position of the camera. When the attitude is the only one to be interested, position and attitude should be decoupled to avoid unnecessary calculations.
Recently, a generalized linear complementary filter for attitude estimation from multi-sensor measurements is proposed in [
29]. The point-correspondence constraints of the camera are modeled as vector measurements. This model allows the camera to be fused with a gyroscope in the same way as an accelerometer and a magnetometer. However, to satisfy the premise of vector measurement, the position of the camera must be close enough to the origin of the reference coordinate system.
An implicit measurement model proposed in [
30] enables a strict decoupling of attitude and position. The measurements of the camera are fused with angle rate measurement using a nonlinear observer. However, this implicit measurement model is based on line-correspondences instead of point-correspondence. Compared with the point feature, tracking the line feature is computationally more intensive.
In this paper, we use two feature points to derive an implicit geometry measurement that has the same expression as the implicit measurement in [
30], but without tracking the line-correspondences. The new geometry measurement is fused with the vector measurement from accelerometers and the angle-rate measurement from gyroscopes. This combination of sensors makes it possible to determine the absolute attitude with only two feature points in the field of view. Meanwhile, the fusion of these three kinds of measurements removes the restrictions on the position of the camera in [
29].
4. Attitude Estimation Algorithm
In this section, a nonlinear complementary filter on the special orthogonal group is introduced. This filter fuses the angle-rate measurement, vector measurement, and implicit geometry measurement to estimate the attitude in continuous dynamics. The attitude estimation algorithm for real-world signals is also considered in this section. Specifically, it includes the initial alignment and the discrete realization of the filter based on the unit quaternion.
4.1. Nonlinear Complementary Filter on SO(3)
The proposed attitude estimation algorithm is based on a nonlinear observer. The goal of the attitude estimation observer is to provide a set of dynamics for an estimated attitude to drive the estimation error converges.
The observer kinematics include a prediction term and a correction term. In this paper, the prediction term is based on the angle rate measurement. The correction term is added to the measured angle rate as it does in classic attitude observer design [
18,
30].
Let
denote the estimated direction cosine matrix
. The proposed attitude observer is
where the correction term
is
is the correction term caused by the vector measurement. Referring to the explicit version of Mahony’s complementary filter [
18], the cross product of the estimated vectors
and the observed vectors
is used to construct the correction term.
is the correction term caused by the geometry measurement. This correction term is referred from the observer design in [
30] that fuses angle-rate measurement and geometry measurement. The rationale for the
is the following. From the implicit geometry measurement Equation (
9), the ideal
satisfies
. If it is not satisfied, then a corrective angle rate should be applied. The angle rate needed for this is directed along
. The magnitude of the correction is simply the
.
In the above attitude observer,
and
are two fixed parameters. The stability of the new attitude observer is analyzed in
Section 4.2.
As [
18] does, we also term the observer as a nonlinear complementary filter here.
4.2. Stability Analysis
The estimation error
is defined as the relative rotation from the true navigation frame
to the estimated navigation frame
, that is
Differentiate both sides of estimation error definition Equation (
12) and it is straightforward to verify that the error system is
Substitute the angle rate measurement
using its true values
. Using
, we obtain:
Based on the correction term in (
11), the error system is given by
Substitute the vector measurement
and geometry measurement
using their true values
and
. Here, the explicit geometry measurement model is used. It is straightforward that
For any vector
, there is
. Using this relationship,
is eliminated from the error dynamics. The error system is
The error system in (
17) is a nonlinear time-varying system because the feature plane’s unit normal vector
is varying with the position of the camera.
It is easily verified that the identity matrix
is an equilibrium point of the error system. According to the definition of the small perturbing rotation [
32], the attitude error around the equilibrium point
is approximated by
where vector
is the angle-axis form of the small perturbing rotation.
The linearization of the error dynamics is computed to analyze the local stability of the equilibrium point [
33]. Substitute (
18) into (
17) with
, we get
Using the properties of linear operations, including the skew-symmetric matrix transformation and the vector cross product, the state equation describing the evolving of the
is obtained as follows:
For , , the linearized system is asymptotically stable as long as the third component of is not 0. Otherwise, the error of the yaw angle will not converge to zero. According to the geometric relationship, is the vector located in the feature plane and perpendicular to the line of two feature points. This geometric structure is determined by the positions of features and camera. Since the linearized error system is asymptotically stable, the nonlinear error system ensures locally asymptotic stability around the equilibrium point .
The nonlinear complementary filter proposed in this paper is based on observer design. The local asymptotic stability of the filter ensures that the initial attitude estimation error will converge to zero when the gains are any constant greater than zero. Generally speaking, the larger the gain, the faster the error convergence. However, the observer design method is a deterministic method in which the measurement errors are assumed to be zero. The measurement noise in practice will lead to a big attitude estimation variance when the gains are tuned too large inappropriately. Therefore, gains tuning is a compromise process. Moreover, a proper initial attitude guess is important for the filter with local asymptotic stability.
4.3. Initial Alignment
The initial alignment is the problem of attitude determination in the initial static stage using accelerometers and a camera. It is important since the attitude estimation filter in (
10) requires a proper initial value.
According to the chain rule, the direction cosine matrix’s transpose
can be written as the multiplication of two rotation matrixes
where
is an intermediate frame system, referred to as the horizontal frame, whose z-axis coincides with the z-axis of
. Under the definition of the navigation frame and the body frame in this paper,
and
can be written in the following form:
The is a set of Euler angles. is the pitch angle with . and are the roll angle and yaw angle, where . The yaw angle equals to the angle formed by the y-axis of and the y-axis of .
The initial alignment method proposed in this paper includes horizontal alignment and azimuth alignment and solution confirmation.
First, use normalized accelerometer vector measurement to calculate the horizontal attitude. It is easily verified that the third column of
is the projection of the normalized “up” vector in the body frame
. Therefore,
can be recovered roughly from the acceleration measurement by following expressions
Second, determine the azimuth. After calculating of the
, use camera implicit geometry constraint
to construct one linear equation about
and
. With the constraint
, two sets of solutions can be found, one of which can be confirmed by the constraints of the camera’s field of view.
Third, confirm the true direction cosine matrix. According to the camera model in (
5) and (
6), it is straightforward that
In (
25),
and
are the depth of feature points P1 and P2 in the camera frame. Using (
25), three linear equations about
and
can be obtained. The
that leads to positive
and
will be accepted as the initial direction cosine matrix.
4.4. Discrete Implementation on Quaternion
The filter in (
10) is a continuous system. In practical implementation, sensor data will be sampled and the filter needs to be integrated in discrete time. The unit quaternion representation of the rotations is commonly used for the realization of algorithms on SO(3) since it offers considerable efficiency in code implementation [
18]. The proposed attitude observer in quaternion representation is
where
Since the term
can be seen as the corrected angle rate in the frame of
, according to the quaternion-based attitude update algorithm [
12], the discrete implementation of filter (
26) is
∘ in (
28) is the quaternion multiplication operator.
is defined by
in (
28) is the quaternion increment from time
to time
and is calculated by
in (
30) is the angle increment vector,
and
where
is the time interval between
and
.
is calculated by the average value of the gyroscope measurements in
and
.
in angle increment vector (
31) is the angle rate correction term constructed by the estimation in time
and measurements in time
as
is the coordinate transformation matrix and equals to
To get a high-bandwidth system, the sample interval of the gyroscope can be selected as the discretization time interval of the filter. Considering that the measurements from the camera and accelerometer play a role in providing long-term stability, they can be updated at a low frequency. If vector or geometry measurement is not available in current sample time , the corresponding correction term is set to zero, where x is a or c.
This structure of the proposed filter is the so-called “explicit version” of the nonlinear attitude observer. As we discussed in the section “Related works”, Mahony provides three versions of nonlinear attitude observer in his work: direct, passive, and explicit. The direct and passive versions depend on the algebraic reconstruction of the attitude. For MARG sensors, the sampling frequencies of accelerometer and magnetometer are the same. This will not cause any problems. However, for inertial sensors/camera combination, the difference in sampling frequency makes the algebraic reconstruction have to align with the sensor sampled in low-frequency. The consequence is that a lot of information in the high-frequency sensor is lost. On the contrary, the angle rate correction term of the explicit complementary filter in (32) is aligned with the sensor sampled in high-frequency. This is the advantage of the explicit version of the nonlinear attitude observer in handling sensors with different sample frequencies.
5. Evaluation
5.1. Experiment Setup
To evaluate the performance of the proposed attitude estimation algorithm, an experiment system is constructed as shown in
Figure 2. A smartphone with built-in inertial sensors and a rear camera is used as the measurement equipment. Two artificial fiducials from AprilTag family Tag36H11 are placed on the ground. The 3DM-GX3-25 Attitude and Heading Reference System (AHRS) attached to the smartphone is used as the ground truth provider.
An Android application is developed to capture the data from inertial sensors and images from the camera. The sampling frequency of inertial sensors and camera are 100 Hz and 5 Hz respectively. All the data are time-stamped and stored in the smartphone’s SD card. The recorded data are processed offline on the laptop so that the different algorithms and control variables can be evaluated on the same recorded data.
To keep the AHRS from magnetic disturbance, the experiment is implemented in an outdoor environment and the magnetometer of the 3DM AHRS is re-calibrated after installation. The y-axis of the navigation frame is chosen as the magnetic north.
The intrinsic parameters and distortion parameters of the camera are pre-calibrated using the geometric method proposed in [
34]. The process of extracting the normalized image coordinates of the feature points from the image is as follows. Undistort the image according to the distortion parameters. Detect the AprilTag features using the Python package pupil_apriltags. Recover the normalized image coordinates of features from the corresponding index coordinates using camera intrinsic parameters.
In
Section 2 and
Section 3, we assume that the axis direction of the camera frame
is the same as that of the body frame
. However, in our smartphone experiment platform, the optical axis of the camera is opposite to the z-axis of the body frame. Moreover, in data collection software, the image orientation is “Landscape” relative to the smartphone. The actual direction relationship between the body frame
-
-
, image pixel frame
u-
v, and camera frame
-
-
is as shown in
Figure 3. Therefore, the normalized image coordinate values should be transformed to adapt to the complementary filter algorithm. The coordinate transformation matrix
is
Under the definition of the reference frames in
Figure 3, the relationship between the normalized image coordinates of the feature points and their camera coordinates is as follows.
This makes the third step of the initial alignment a little bit different from
Section 4.3. Specifically, the “depth equations” in (
25) should be replaced by (
34). The attitude solutions that lead to negative
and
will be accepted as the initial attitude.
Two sets of measurement data are collected to evaluate the static and dynamic performance of the proposed attitude estimation algorithm.
- (1)
In the static case, the smartphone keeps static in each fixed attitude about 15 s. Moreover, the smartphone in this case only makes the rotational motion between each fixed attitude without displacement relative to the navigation frame.
- (2)
In the dynamic case, the smartphone makes arbitrary rotation and translation movements while ensuring that the two visual labels are always within the camera’s field of view.
Representative algorithms are implemented to process the collected data for performance comparison.
- (1)
SINS: Attitude update algorithm of strap-down inertial navigation in [
12].
- (2)
VIN-EKF: This is a vision-aided inertial navigation algotithm based on extended Kalman filter (EKF). The system state vector of the filter includes unit quaternion, velocity, position, gyroscope and accelerometer measurement bias. The measurement residual is computed by the measured normalized image coordinates of the feature points with the predicted normalized image coordinates in the filter. The linearized system model for the IMU error-state and the linearized measurement model about the estimates for the attitude and position of the camera are as described in [
27]. The standard deviations of sensor noise are
,
, and
, where
is the camera focal length comes from the calibrated intrinsic parameters. Fusing camera and inertial sensors in a tightly coupled scheme is the core idea of [
10] so that the vision-based visible light communication positioning system is able to continually provide location service when the number of the feature points is less than four.
- (3)
Proposed CF: This algorithm fuses inertial sensors and a camera using the quaternion-based discrete implementation in (28) and (32). Two gain parameters of the proposed nonlinear complementary filter are and .
- (4)
CF-1: This algorithm also fuses inertial sensors and a camera using nonlinear complementary filter. The main difference between CF-1 and the proposed CF is that in CF-1, the measurement models of accelerometer and camera are all the vector measurements. The reference vectors are the normalized gravity vector and the normalized image coordinates of feature points in the initial frame [
29]. Therefore, this is a relative attitude estimation method. To get the full attitude, the initial attitude should be known. This filter is implemented as a quaternion filter in (28), but the correction terms are all constructed using cross product of the estimated vectors and the observed vectors. There are three parameters for CF-1 when two feature points are captured in one image, that is,
,
and
.
- (5)
CF-2: This algorithm fuses the gyroscope with a camera using nonlinear complementary filter. Same as the proposed CF, CF-2 also uses the camera measurements as the implicit geometry measurement. However, there is no gravity measurement in CF-2 [
30]. This filter is implemented as a quaternion filter in (28), but the accelerometer correction term is zero. There is only one parameter for complementary filter when two feature points are captured in one image, that is,
.
The gains of the proposed CF are tuned to achieve relatively good attitude estimation results in our experimental conditions. To be fair, the gains of the other two complementary filter algorithms are set to the same constant as the proposed CF. The parameters of VINS-EKF are chosen according to the sensor measurement noise characteristics.
5.2. Result and Discussion
Figure 4 shows the attitude ground truth from 3DM AHRS, the attitude updated from the gyroscope, and the attitude estimation results from four estimation algorithms, in the static case. The attitude errors with respect to reference angles from AHRS are shown in
Figure 5. To further verify the performances,
Table 1 gives the root-mean-squared errors (RMSEs) of various estimation algorithms.
As can be seen from
Figure 5 and
Table 1, in the static case, the accuracy of the proposed algorithm is as good as that of VIN-EKF. For the CF-1, the estimation error of pitch and roll is close to that of the proposed filter, but the yaw angle error is the maximum among all the algorithms. This is due to the fact that the CF-1 algorithm corrects the gyroscope prediction error resort to vector measurement and the reference vectors of two feature points are near the gravity vector during the experiment. The result of CF-2 is just the opposite. The yaw angle error is close to the proposed algorithm, but the pitch and roll errors are almost equal to the errors of the SINS algorithm. The reason is that artificial vision fiducials are arranged on the ground, which leads to a horizontal reference vector of
in the implicit geometry measurement. Since no gravity vector or other reference vectors with vertical components are fused in the filter, the pitch and roll errors cannot converge.
Figure 6 shows the attitude ground truth from 3DM AHRS, the attitude updated from the gyroscope, and the attitude estimation results from four estimation algorithms, in the dynamic case. The attitude errors with respect to reference angles from AHRS are shown in
Figure 7.
Table 2 gives the root-mean-squared errors (RMSEs) of various estimation algorithms.
As can be seen from
Figure 7 and
Table 2, in the dynamic case, due to the translation motion of the smartphone, the accuracy of the VIN-EKF and CF-1 is significantly reduced. The translation motion even increases the attitude error of SINS in the early stage. In the dynamic case, the proposed algorithm offers the best performance in estimation accuracy.
For the CF-1 algorithm, the translation motion increases the error of the camera vector measurement model. The estimation accuracy of pitch and roll angle is obviously affected since the reference vector is near the vertical line. Different from the CF-1, the proposed nonlinear complementary filter fuses the implicit geometry measurement so that the position of the camera is decoupled with the attitude estimation.
When it comes to the VIN-EKF, another reason for the performance degradation that must be mentioned is the update rate of the filter. The filter propagates the state in 100 Hz and performs update in 5 Hz. Although the gravity vector constraint is implicit in the velocity equation of the system model, the innovation of this constraint becomes available only when a new image comes and the filter is updated. The estimation error of pitch and roll angle using our method is less than VIN-EKF since the gravity vector constraint is an explicit measurement and the corresponded error is corrected in 100 Hz.
Figure 8 shows the instantaneous magnitude of the accelerometer output during the static case and dynamic case.
The mean time consumption and related standard deviation of different algorithms are presented in
Table 3. Here, time consumption refers to the execution time between two consecutive camera sample updates. The mean values and the standard deviation are calculated after 700 image sample updates. It can be seen that three nonlinear complementary filters show comparable time consumption. The execution time of CF-2 is the smallest since there is no accelerometer correction. The execution time of CF-2 is less than the proposed CF which means the update of vector measurement is simpler than the update of implicit geometry measurement. The execution time of VIN-EKF is the largest due to the calculation of the gain matrix and covariance matrix.
6. Conclusions
In this paper, we consider the attitude estimation problem of fusing camera and inertial sensors in an environment with only two pre-calibrated artificial vision fiducials. The main contributions of this paper are twofold. First, we derive an implicit geometry measurement for camera-based attitude estimation. Second, a nonlinear complementary filter with only two parameters to be adjusted is proposed to fuse angle rate measurement, vector measurement, and geometry measurement.
Compared to MARG sensor-based attitude estimation, the method in this paper doesn’t rely on a magnetometer. It provides an attractive solution for the environment with complex magnetic field distribution. Compared to camera- and gyroscope-based attitude estimation, our method still converges when there are only two feature points in the field of view. The experimental results show that our algorithm outperforms all the compared counterparts in estimated accuracy and provides competitive computational complexity.
We believe that the proposed method can potentially benefit related navigation applications. The main drawback of our method is that the designed nonlinear complementary filter only ensures locally asymptotic stability. Future work will focus on system improvements to achieve unrestricted local stability and even global stability.