1. Introduction
Robots are complex systems composed of various technologies and are widely applied
in areas such as homes, factories, workplaces, and disaster sites. In these fields,
accurately determining a robot's current position is crucial. Whether it is a robotic
vacuum cleaner, an autonomous vehicle, or a disaster-response robot, all face the significant
challenge of precisely locating themselves on a given map. One technology being researched
to address this challenge is Simultaneous Localization And Mapping (SLAM). SLAM uses
information obtained from one or more sensors to determine the position of a robot in
real time while simultaneously building a map of the surrounding environment.
To enable SLAM, cameras, LiDAR, or Inertial Measurement Units (IMUs) are commonly
used as sensors. Cameras have the advantage of being cost-effective, easily detecting
and tracking objects, and providing visual semantic information. LiDAR, whose laser
interacts directly with objects, offers high accuracy and performs relatively well even
in environments with limited visual features. IMUs have the advantage of being able
to estimate relative poses.
However, each of these sensors has its own drawbacks when used alone. Visual sensors
are vulnerable to lighting conditions, such as low light or sudden changes in illumination,
and they may also face initialization issues. LiDAR is limited by the range of its laser,
and pose accuracy can degrade when the device rotates or moves while capturing a scan.
Moreover, LiDAR can be difficult to use in environments with sparse features, such as
open areas. The IMU also has its drawbacks: when used as a single sensor, it is difficult
to correct the drift that accumulates over time. Consequently, a single IMU alone is
generally not sufficient to operate SLAM effectively.
Therefore, many studies integrate multiple sensors so that they complement one another's
drawbacks while leveraging their strengths. In this way, each sensor's limitations are
compensated for by the others, leading to improved performance. The information from
each sensor is then used to estimate the position, and the estimated position is optimized
using graph-based or filter-based methods.
In this paper, we explore studies that integrate various sensors to address the limitations
of individual sensors, and we investigate the methods used to fuse the sensor information
in each case. Notably, each method is not exclusive to the sensors for which it was initially
designed. Just as techniques employed in Camera-IMU systems can be adapted to LiDAR-IMU
systems, the choice of a specific method does not restrict it to a single sensor combination.
The structure of this paper is as follows. In Section 2, we discuss the maximum a
posteriori (MAP) method, maximum likelihood estimation (MLE), and the Kalman filter,
which are the estimation frameworks primarily used for sensor data fusion. In Section 3,
we examine how information is fused for each sensor combination. Finally, Section 4
concludes the paper.
Table 1. Pros & cons of sensors.
| | LiDAR | Camera | Inertial Measurement Unit (IMU) |
|---|---|---|---|
| Pros | Accuracy, Reliability | Cost, Visual Information | Relative Pose Estimation |
| Cons | Limited Laser Range, Pose Estimation, Open Area | Light, Texture of Material, Sparse Feature Environments, Initialization | Drift Accumulation |
2. Related Work
2.1. Notation
The notation used in this paper is as follows: subscripts or superscripts $I$, $L$, and
$C$ denote the IMU, LiDAR, and Camera frames, respectively. The state of a robot is
denoted by $x$, and measurements from sensors are represented by $z$. The difference
between an observed and a predicted value is referred to as a residual, denoted by $r$;
the residuals of the LiDAR, Camera, and IMU are written as $r_L$, $r_C$, and $r_I$,
respectively. Feature points obtained from a Camera or LiDAR are denoted by $p$, and
their predicted or estimated counterparts by $\hat{p}$. The distance between two points,
or between a point and an edge or a plane, is denoted by $d$. Lastly, extrinsic parameters,
composed of rotation and translation, are represented by $T$.
2.2. Maximum A Posteriori (MAP)
Before introducing Maximum A Posteriori (MAP), we briefly discuss the state of a robot
as used in this paper. The state refers to the robot's position and related information,
and its exact definition can vary slightly from one study to another. For instance, it
may include the robot's current position, rotation matrix, translation matrix, and velocity.
Among the several methods for estimating such states in SLAM, one is MAP.
MAP is a Bayesian method that, given measured and predicted values, calculates the
probability distribution of the position at the time the information was acquired.
MAP is designed to improve on a limitation of Maximum Likelihood Estimation (MLE).
The likelihood is the probability of an outcome given particular data or a particular
model, whereas the posterior is the probability that a particular model, among many
candidates, produced an observed outcome. To illustrate with a cause-and-effect example
in which $x$ is a cause and $z$ is an effect, the likelihood is the probability of the
result $z$ occurring given that the cause $x$ has occurred, while the posterior is the
probability of the cause $x$ given that the result $z$ has been observed. The MAP method,
which uses the posterior, takes into account the prior probability of the already known
$x$, whereas the MLE method, which uses only the likelihood, does not. MAP is therefore
generally considered to produce more accurate estimates than MLE. In SLAM, $x$ represents
the state of the robot, and $z$ is the observed sensor data.
During the state estimation process in MAP, the difference between the values predicted
by the model and the values measured by the sensors is computed. In this process,
information from multiple sensors is used collectively.
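The MAP objective can be sketched in its most basic form as follows (individual studies add weights, information matrices, and additional terms):

$$\hat{x} = \arg\max_{x}\, p(x \mid z) = \arg\min_{x}\, \| z - h(x) \|^{2} \tag{1}$$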
where $z$ is an observed value, $h(x)$ is the state (measurement) function, and $\|\cdot\|$
denotes the Mahalanobis norm. The estimated state is thus determined from the difference
between the observed value and the value predicted by the state function. This difference
is the residual, denoted by $r$, and the goal is to find the state $x$ that minimizes it.
When employing the MAP method in SLAM, a residual is obtained and computed for each
sensor. Based on these residuals, this paper explains the fusion process using the following
formula, in which $a$, $b$, and $c$ denote sensor types and can vary depending on the
number of sensors involved.
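A sketch of this per-sensor form, with summations over individual measurements omitted, is:

$$\hat{x} = \arg\min_{x} \left\{ \| r_a(x) \|^{2} + \| r_b(x) \|^{2} + \| r_c(x) \|^{2} \right\} \tag{2}$$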
While additional terms or values may be introduced or omitted in individual studies,
this paper proceeds with explanations primarily based on Eq. (2).
2.3. Kalman Filter
The Kalman Filter is also a method used to estimate states. This approach involves
predicting a state of a robot and calculating the error between a measured values
obtained through sensors and the predicted values. In this process, information from
multiple sensors can be fused and utilized. The Kalman Filter and MAP utilize the
errors between predicted and measured values to estimate a current position of the
robot. However, MAP calculates estimated values by maximizing the posterior probability
and a result is a probability, while the Kalman filter does not directly calculate
the probability, but it calculates the estimate through the mean and covariance of
the state variables.
The Kalman Filter is a system that removes noise from uncertain observations. When
calculating the current state with the filter, the state is first predicted using the
previous state and the motion input, and then updated with the estimate obtained from
the sensor information; this process is repeated to track the current state. State
prediction estimates the distribution of the current state using the probability
distributions from the previous measurement update and the motion input. The measurement
update then computes the posterior probability distribution from the predicted distribution
of the current state and the distribution of the observations measured at the current
position (see Fig. 1).
Fig. 1. Example of filter based SLAM.
There are primarily two types of Kalman Filters: the Linear Kalman Filter and the
Extended Kalman Filter. The Linear Kalman Filter is a Bayesian filter for linear system
models with Gaussian probability distributions. The Extended Kalman Filter addresses
the limitation that the Linear Kalman Filter cannot be applied to nonlinear systems;
it does so by linearizing the system with Jacobian matrices.
In this paper, we focus on the extended Kalman filter (EKF). The basic equations of the
extended Kalman filter are as follows.
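In a standard form consistent with the definitions given below (the notation of the individual papers may differ slightly):

$$\bar{x} = f(x, \mu) \tag{3}$$
$$\bar{P} = J_f P J_f^{T} + P_Q \tag{4}$$
$$K = \bar{P} J_h^{T} \left( J_h \bar{P} J_h^{T} + P_R \right)^{-1} \tag{5}$$
$$x = \bar{x} + K \left( z - h(\bar{x}) \right) \tag{6}$$
$$P = \left( I - K J_h \right) \bar{P} \tag{7}$$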
State prediction is performed using Eqs. (3) and (4). Here, $f(x, \mu)$ represents the nonlinear motion model, and its Jacobian matrix
is denoted by $J_f$. The predicted covariance of the state variables is $\overline{P}$,
$P$ is the covariance from the previous state, and $P_Q$ is the covariance of the robot motion.
The measurement update is carried out through Eqs. (5)-(7). Eq. (5) gives the Kalman gain, where $J_h$ is the Jacobian matrix
of the nonlinear observation model $h(x)$ and $P_R$ is the covariance of the observations.
In Eq. (6), $z$ is the measurement and $(z - h(\bar{x}))$ is the measurement residual;
Eq. (6) updates the current state of the robot, while Eq. (7) computes the covariance of the current state.
Furthermore, the precise definitions of the terms in Eqs. (3)-(7) vary depending on how
each study defines them, leading to different research outcomes.
3. Multi-Sensor Fusion in SLAM
In most studies, the current position is determined using the error between a predicted
value, computed from the previous position, and a measured value obtained from sensors
such as cameras or LiDAR, as described in Eq. (2). In each study, pre-integration is performed on the IMU, which runs at a higher frequency
than cameras or LiDAR, in order to synchronize it with the other sensors. The specific
pre-integration methods vary across studies, but all of them compute the change in pose
from the IMU measurements over a given time interval; for detailed explanations, please
refer to the individual papers. Likewise, the methods for calculating the residual of
each sensor differ among studies, but they all compute the difference between an estimated
value and a measured value; for the details of each residual calculation, please refer
to the respective paper.
In particular, in Subsection 3.4 we explore research that post-processes the sensor
information rather than feeding the received measurements directly into MAP, MLE, or
a Kalman Filter. Specifically, we examine studies that employ marginalization and
plane-extraction techniques. Although marginalization and plane-related studies are
briefly mentioned in Subsections 3.1 and 3.2, please refer to Subsection 3.4 for details.
3.1. Integrating Camera and IMU
Using a camera as a single sensor has the advantages of being cost-effective and providing
visual information. However, it is sensitive to factors such as lighting conditions and
object textures, and it faces challenges with initialization. Therefore, to enhance
position estimation accuracy, IMU sensors are often used in conjunction with cameras.
The information from these sensors is typically fused using the Maximum A Posteriori
(MAP) method, with the basic equation represented by Eq. (2).
Among the studies that utilize both camera and IMU, some examples include VINS-Mono
[2], VIO-DualProNet [3], OpenVINS [5], OV Plane [8], PL-VINS [20], and PLE-SLAM [21]. In this context, the primary focus will be on discussing VINS-Mono and OV Plane.
VINS-Mono and VIO-DualProNet employ similar forms of the MAP equation. The MAP method
fuses the sensor information using the camera residual $r_C$ with the Huber norm $\rho(s)$,
the IMU residual $r_I$, and the marginalization term $r_m$ to estimate the current state,
as shown in Eq. (8).
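A sketch of this objective, with the summations over individual measurements omitted, is:

$$\hat{x} = \arg\min_{x} \left\{ \| r_m \|^{2} + \| r_I(x) \|^{2} + \rho\!\left( \| r_C(x) \|^{2} \right) \right\} \tag{8}$$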
Compared to Eq. (2), the differences are that the Huber norm is applied to the camera
residual and that a marginalization term, which is not a value obtained directly from
a sensor, is included.
OV Plane [8] was developed on top of OpenVINS [5]. In this research, sensor information is fused without using MAP. Planes are identified,
and either a standard extended Kalman filter (EKF) update or a multi-state constraint
Kalman filter (MSCKF) update is chosen depending on the situation. In the process of
obtaining these planes, the plane features $x_{\pi}$, the IMU state $x_I$, the historical
IMU poses $x_H$, and the 3D point features $x_f$ are combined into a single state $x$
and applied to the Kalman filter; see Eq. (10).
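A sketch of this stacked state vector is:

$$x = \begin{bmatrix} x_I^{T} & x_H^{T} & x_f^{T} & x_{\pi}^{T} \end{bmatrix}^{T} \tag{10}$$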
In this equation, $x$ plays the role of the state in Eqs. (3)-(7) of the Kalman Filter,
and by defining the state as in Eq. (10), the sensor information is fused.
3.2. Integrating LiDAR and IMU
When using LiDAR, a laser directly contacts objects, providing highly accurate and
reliable information. However, it has limitations in terms of range and may be challenging
to use in open areas. Additionally, pose estimation can be more challenging compared
to cameras, as distortions may occur while moving. To address these challenges, IMU
sensors are often used alongside LiDAR to improve pose estimation.
Among the many studies in this category, we describe CLINS [6], LIO-Mapping [7], LIPS [9], IN2LAMA [4], and FAST-LIO [10]. CLINS utilizes the LiDAR residual $r_L(x)$ along with the residuals $r_a(x)$ and
$r_w(x)$ obtained from the acceleration and angular velocity of the IMU.
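A sketch of the corresponding objective, with per-measurement summations omitted, is:

$$\hat{x} = \arg\min_{x} \left\{ \| r_L(x) \|^{2} + \| r_a(x) \|^{2} + \| r_w(x) \|^{2} \right\} \tag{11}$$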
Unlike in Eq. (2), where the information from the IMU is condensed into a single residual, here the
residuals of the IMU's acceleration and angular velocity are used separately.
In LIO-Mapping, the marginalization term and the residuals from the IMU and LiDAR are
each summed, and then, unlike Eq. (2) or Eq. (8), the total cost is divided by 2.
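A sketch of this form, assuming the factor of 1/2 applies to the whole sum, is:

$$\hat{x} = \arg\min_{x} \frac{1}{2} \left\{ \| r_m(x) \|^{2} + \| r_I(x) \|^{2} + \| r_L(x) \|^{2} \right\} \tag{12}$$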
Next, we examine research that utilizes only maximum likelihood estimation (MLE) instead
of maximum a posteriori (MAP). LIPS fuses information with MLE while incorporating plane
information in the process. In LIPS, points are obtained from LiDAR and planes are
constructed from these points. Predictions are then made on how these planes move between
frames, and the difference between the measured position of a plane and its predicted
position at the same time is taken as the residual $r_{\pi}(x)$. The residuals in this
work are handled via MLE.
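A minimal sketch of the resulting objective is:

$$\hat{x} = \arg\min_{x} \| r_{\pi}(x) \|^{2} \tag{13}$$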
IN2LAMA minimizes a geometric distance $d_L$ (point-to-line or point-to-plane) associated
with LiDAR features, along with the time-shift residual $r_t$, the accelerometer-bias
residual $r_{b_a}$ from the IMU, and the gyroscope-bias residual $r_{b_\omega}$.
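A sketch of this objective, with per-feature summations omitted, is:

$$\hat{x} = \arg\min_{x} \left\{ \| d_L \|^{2} + \| r_t \|^{2} + \| r_{b_a} \|^{2} + \| r_{b_\omega} \|^{2} \right\} \tag{14}$$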
where, unlike the previous formulations, a time-shift residual is computed and geometric
distances are used.
In LiDAR-IMU SLAM, there are also studies that use the Kalman Filter to estimate and
optimize positions, and FAST-LIO is one example. In FAST-LIO, the scan-end projected
point $p^{L_k}_{f_j}$ is obtained using the LiDAR extrinsic parameters $T^{I}_{L}$,
the IMU relative pose $T^{I_k}_{I_j}$, and the local LiDAR measurement point
$p^{L_j}_{f_j}$ expressed in the IMU frame.
Furthermore, the estimated feature point in the global frame $\hat{p}^{\kappa}_{f_j}$
is obtained from this projected point $p^{L_k}_{f_j}$, the LiDAR extrinsic parameter
$T^{I}_{L}$, and the estimated IMU pose in the global frame $\hat{T}^{\kappa}_{I_k}$.
This point is then used in the Kalman Filter and the MAP formulation to compute the residual.
In the MAP step, this residual is used to obtain the $\kappa$-th update of $x$ in the
iterated Kalman Filter, while the Kalman Filter step uses it to obtain the $(\kappa+1)$-th
update of $x$.
Besides these, there are cases where additional sensors are incorporated and the information
from all of them is fused to estimate the state; examples include LIO-SAM [11] and LIRO [12].
LIO-SAM incorporates GPS as an additional sensor: similar to how LiDAR and IMU are used
in LOAM [13], it utilizes LiDAR and IMU while adding GPS to estimate the robot's state. Another
example is LIRO, which incorporates UWB anchors as an additional sensor. This study
computes residuals $r_L$, $r_I$, and $r_U$ from the LiDAR, IMU, and UWB anchors, similar
to the other studies, and estimates the robot's position by minimizing a cost function
built from them (Eq. (17)).
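A sketch of such a cost function is:

$$\hat{x} = \arg\min_{x} \left\{ \| r_L(x) \|^{2} + \| r_I(x) \|^{2} + \| r_U(x) \|^{2} \right\} \tag{17}$$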
3.3. Integrating LiDAR, Camera, and IMU
Using LiDAR with an IMU, or a Camera with an IMU, can complement the shortcomings of
each sensor and lead to better results than using a single sensor alone. However, some
shortcomings may still remain. Therefore, there are SLAM systems that combine LiDAR,
Camera, and IMU to address these remaining shortcomings and achieve better results.
The studies introduced here are LE-VINS [14], R${}^{2}$LIVE [15], LVIO-Fusion [18], and LIC-Fusion [19]. First, LE-VINS uses marginalization as in Eq. (8), but it additionally uses a LiDAR residual $r_L(x)$.
R${}^{2}$LIVE constructs its state estimate using a prior distribution $r_m$, an observation
matrix $H$ composed of Jacobians, the error $\delta x$ between the estimated and measured
states, and the residuals from each sensor (Eq. (19)).
LVIO-Fusion employs a formulation in which the squared residuals $\|r_I\|^2$, $\|r_C\|^2$,
and $\|r_L\|^2$ are multiplied by the respective weights $\omega_I$, $\omega_C$, and
$\omega_L$, thereby utilizing the LiDAR, Camera, and IMU together. Furthermore, the
Huber norm is applied to the Camera and LiDAR residuals (Eq. (20)).
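A sketch of this weighted objective, including the marginalization term mentioned in Subsection 3.4, is:

$$\hat{x} = \arg\min_{x} \left\{ \| r_m \|^{2} + \omega_I \| r_I(x) \|^{2} + \omega_C\, \rho\!\left( \| r_C(x) \|^{2} \right) + \omega_L\, \rho\!\left( \| r_L(x) \|^{2} \right) \right\} \tag{20}$$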
In LiDAR-Camera-IMU SLAM, there are also studies that use the Kalman Filter to fuse
sensor information. Among them, LIC-Fusion [19] defines the state $x$ as a combination of the IMU state $x_I^T$, the IMU-camera
extrinsics $x_{calib\_C}^T$, the IMU-LiDAR extrinsics $x_{calib\_L}^T$, the IMU information
at selected image times $x_C^T$, and the IMU information at selected LiDAR times $x_L^T$.
3.4. SLAM Utilizing Marginalization or Plane Extraction
In addition to using sensor information directly within MAP, MLE, or the Kalman Filter,
there are methods that post-process the sensor data before estimating the state. Among
these, we explore the cases where marginalization or plane extraction is employed;
these methods are not strongly tied to the type of sensors used.

Marginalization is employed to reduce the computational complexity of an odometry system
and to improve odometry performance. It is commonly used in sliding-window approaches:
if the second most recent frame in the window is a keyframe, the oldest keyframe is
removed; if it is not a keyframe, that frame itself is removed. In Fig. 2 (a), if $x_4$ is a keyframe, $x_0$ is removed (Fig. 2 (b)); if it is not a keyframe, $x_4$ is removed (Fig. 2 (c)).
Fig. 2. Example of marginalization.
Examples of such studies include VINS-Mono [2], LIO-Mapping [7], LVIO-Fusion [18], R${}^{2}$LIVE [15], and FAST-LIVO [16]. Excluding FAST-LIVO, these correspond to Eqs. (8), (12), (20), and (19), respectively, all of which include the marginalization residual $r_m$.
Next, we explore research that estimates states by reconstructing planes. These planes
are integrated with the information from other sensors into a state vector $x$ for further
use. Studies that detect planes, as shown in Fig. 3, identify planar regions and observe the movement of these planes. The reason for
estimating planes is that they are abundant in the real world and cover wide areas,
making them suitable as landmarks or references. For further details, refer to [8].
Fig. 3. Example of using a plane.
Among these studies are OV Plane [8], LIPS [9], and LVI-SAM [17], where the equations for OV Plane and LIPS are Eqs. (10) and (13), respectively. In LVI-SAM, visual-inertial odometry and LiDAR-inertial odometry are
combined. In this case, the IMU provides not only bias estimates but also inertial
measurements. Additionally, visual features and LiDAR features are used together for
plane extraction and depth registration, and this information is then utilized in the
visual-inertial odometry.
Table 3. Summary of sensor types used and classification based on integration methods.
| Sensors | Title | Integration Method | Additional Remarks |
|---|---|---|---|
| Camera + IMU | VINS-Mono | MAP | Marginalization |
| Camera + IMU | VIO-DualProNet | MAP | Marginalization |
| Camera + IMU | OV Plane | Kalman Filter | Plane |
| LiDAR + IMU | CLINS | MAP | - |
| LiDAR + IMU | LIO-Mapping | MAP | Marginalization |
| LiDAR + IMU | IN2LAMA | MLE | - |
| LiDAR + IMU | LIPS | MLE | Plane |
| LiDAR + IMU | FAST-LIO | MAP | Kalman Filter |
| LiDAR + IMU | LIO-SAM | MAP | - |
| LiDAR + IMU | LIRO | Cost Func. | - |
| LiDAR + Camera + IMU | LE-VINS | MAP | Marginalization |
| LiDAR + Camera + IMU | R2LIVE | MAP | Marginalization |
| LiDAR + Camera + IMU | Fast-LIVO | MAP | Marginalization |
| LiDAR + Camera + IMU | LVI-SAM | MAP | Plane |
| LiDAR + Camera + IMU | LVIO-Fusion | MAP | Weights |
| LiDAR + Camera + IMU | LIC-Fusion | Kalman Filter | - |
Additionally, besides the methods introduced in this paper, there are various other
approaches, such as methods utilizing point and line features, e.g., PL-VINS [20] or PLE-SLAM [21]. These are not covered here because this paper focuses on the plane-based methods instead.
4. Conclusion
Simultaneous Localization And Mapping (SLAM) is a technology that scans the surrounding
environment, creates a map based on this information, and simultaneously estimates
the device's position. This technology is widely recognized, especially in the fields
of mobility, robotics, and autonomous driving.
In SLAM, cameras and LiDAR are commonly used as the main sensors. Additionally, various
sub-sensors are employed to complement them, with the Inertial Measurement Unit (IMU)
being used most frequently. When each sensor is used individually, there are limitations
and challenges: LiDAR has range constraints and pose estimation issues during movement,
cameras are susceptible to lighting conditions, and IMUs struggle with cumulative drift.
To mitigate these limitations, multiple sensors are combined so that the drawbacks of
each are compensated for.
Each fusion method is not exclusive to the sensors for which it was initially designed.
Just as techniques employed in Camera-IMU systems can be adapted to LiDAR-IMU systems,
the choice of a specific method does not restrict it to a single sensor combination.
Recent multi-sensor fusion SLAM research has shown a slightly greater focus on graph-based
SLAM than on filter-based SLAM. A prominently cited work in these studies is VINS-Mono [2].
As SLAM technologies become more commercialized, various SLAM products are emerging.
However, many of these products are difficult for individuals to utilize directly: it
is hard to acquire and use the raw measurements from the products' sensors, or to modify
their algorithms and code.
As time progresses, SLAM is expected to be increasingly integrated into individual
products and other robotic systems. However, every product is bound to have limitations,
and it is challenging for individuals to make further improvements directly. The SLAM
methods introduced so far obtain raw measurements directly from sensors and fuse them
to estimate the position. Beyond these approaches, we believe research should also focus
on leveraging the already-estimated data provided by existing products, achieving more
accurate position estimation through additional processing rather than by directly
modifying their algorithms.