1. Introduction
Forensic Face Recognition (FFR) has become a critical tool in investigations due to the proliferation of surveillance cameras and mobile phones that continuously produce trace images. It consists of comparing images representing faces and evaluating the results in a forensic context. The comparison involves two images: the trace image, which is typically taken under uncontrolled conditions, and the reference image, which is taken under controlled conditions. In the trace image, the identity of the depicted subject is contested or uncertain. In the reference image, the identity of the subject is known; this is the mug shot. A traditional mug shot is a front and side view of a person from the shoulders up, taken by law enforcement agencies. Its primary purpose is to provide a photographic record of arrested individuals for identification by victims, the public, and investigators. In a mug shot, the facial expression must remain neutral, the eyes must be open, and no hair or other objects should obscure the face, following the same requirements that police standards impose on passport photographs. In forensics, images are used for different purposes. The ultimate aim of FFR is the evaluative purpose, that is, to interpret the result of the comparison between the trace and the mug shot of the suspect so that it can be presented as evidence in court [
1]. As FFR represents a means to determine the strength of evidence used in a court of law, it must meet different legal requirements depending on the country. The specific guidelines regarding procedures, quality principles, and approaches are those provided by the European Network of Forensic Science Institutes (ENFSI) and by the Facial Identification Scientific Working Group (FISWG) to guarantee the reliability of the analysis process. Both ENFSI [
2] and FISWG [
3] define the entire FFR workflow and also which methods may be used for image comparison. Among them, the most recommended is morphological analysis, whereby facial images are compared feature by feature, methodically, using a checklist. Currently, 2D mug shots are used in FFR, but the increasing number of surveillance (CCTV) cameras is creating a large repository of trace images that need to be matched against millions of existing mug shots. The comparison with traditional front and side mug shots is complicated by the fact that CCTV cameras mainly capture images from above. This is one of the problems that the investigator encounters: the images of the anonymous person are captured by cameras at significantly different angles from those of the conventional mug shot (front and right profile). For this reason, facial comparison may not be possible (ENFSI) or may even produce false negatives.
Figure 1 [
2] shows the variation in the appearance of the same individual at different camera angles.
New research in the field of forensic science is exploring the benefits of new technologies, and the use of 3D tools in the analysis, interpretation, and presentation of forensic data is increasing in the criminal justice system [
4]. As a result, the evolving needs of the forensic community are moving away from traditional portrait-based facial recognition methods [
5] toward a 3D mug shot. Unlike 2D mug shots, which are fixed in perspective, 3D mug shots can be rendered from any viewpoint, allowing the comparison to be adapted to the investigators’ needs. Nevertheless, the creation of a 3D mug shot is a more challenging process than that of a 2D one, because a highly precise 3D reconstruction is needed to prevent distortion of the subject’s facial features.
The FISWG has developed a detailed facial-feature checklist document to standardize examinations. Every facial component is described by means of characteristics and descriptors such as size and shape. Moreover, the document specifies that the term distance does not refer to the precise value of a dimension, but rather to the relative size of that dimension within a morphological analysis.
Figure 2 [
3] shows examples of alterations to the position of facial components and the effect those positions have on the overall face composition. It is important to emphasize that the proportionality required by the FISWG guidelines is respected if the reconstruction minimizes differences from the real face.
According to this requirement, it can be assumed that a 3D face reconstruction (3DFR) that is closer to the subject will produce a 3D mug shot that is closer to the real one. It is difficult to define the required accuracy, relative or absolute, because, to the best of our knowledge, no explicit reference establishes an error threshold; at this preliminary stage, however, a threshold of no more than 1 mm error seems reasonable. Techniques using laser or structured light have achieved submillimetre accuracy in the production of 3D models. However, their long acquisition time makes them less suitable for producing a 3D face model in a non-cooperative scenario, such as that of an arrested person. In [
6], Schipper et al., 2024 compare a stereo photogrammetric system (3dMD) with two handheld structured-light scanners, the Artec Eva and the Artec Space Spider from Artec 3D. The acquisition time is 20 s for the first scanner and 60 s for the other, against 0.15 ms for the photogrammetric system. A short exposure time is achieved through camera settings such as shutter speed. Typically, fast shutter speeds, such as 1/250th of a second (4 ms) or faster, freeze quick-moving action, resulting in a clear image of a subject that would otherwise be blurred.
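As a back-of-the-envelope illustration of why acquisition time matters, the blur caused by involuntary subject movement is simply speed multiplied by exposure time. The sketch below uses an assumed head-sway speed (a round figure, not a measured value) to contrast the exposure times mentioned above:

```python
# Illustrative estimate of motion blur vs. acquisition time.
# The head-sway speed is an assumed round figure, not a measured value.

def motion_blur_mm(subject_speed_mm_s: float, exposure_s: float) -> float:
    """Distance the subject moves during the exposure, in millimetres."""
    return subject_speed_mm_s * exposure_s

if __name__ == "__main__":
    sway = 10.0  # mm/s, assumed involuntary head sway of a standing person
    for label, exposure in [("photogrammetric shot (0.15 ms)", 0.15e-3),
                            ("1/250 s shutter (4 ms)", 4e-3),
                            ("handheld scan (20 s)", 20.0)]:
        print(f"{label}: blur = {motion_blur_mm(sway, exposure):.4f} mm")
```

Under this assumption, only exposure times in the millisecond range keep the blur well below a 1 mm error budget, while a multi-second scan requires a perfectly still subject.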
In recent years, image-based 3D facial reconstruction technology has developed rapidly. Systems for 3D face reconstruction (3DFR) based on imagery are currently being developed by several researchers [
7,
8,
9]. More recent approaches are those presented in [
10], where Zhang et al., 2024 summarize the most relevant algorithms for 3D facial reconstruction based on a single image or on two images, such as the frontal and lateral images of a conventional mug shot. More specifically, methods based on 3D Morphable Models and deep learning have attracted research interest. Zhang et al., 2024 also describe the available public datasets, which allow a quantitative analysis with respect to a ground truth and a comparison between methods. Reconstruction algorithms may be based on mug shots or on sequences of video frames, and their aim is to improve matching against an existing mug shot database. In [
11], La Cava et al., 2022 provide a comprehensive and up-to-date review of the state of the art in 3DFR algorithms for forensic applications. Algorithms suitable for forensic applications should satisfy constraints that support the legal validity of the conclusions drawn during a lawsuit or in the investigation phase. The 3DFR approaches were divided into two main groups: evidence-based approaches, which start from trace material such as CCTV frames, and so-called model-based approaches, which start from 2D mug shots or photographs. The approaches were evaluated against the essential requirements of a forensic system, including robustness to facial ageing and pose variation, robustness to occlusions, use of facial scars and marks, and adherence to biometric characteristics. The model-based approach allows a gallery of various predefined poses to be introduced in the 2D domain to enhance the representation capability. The authors analyse the 3DFR obtained from the 2D mug shots (frontal and lateral view) or from one or multiple images. The conclusion of La Cava et al., 2022 is that a rigorous photogrammetric approach based on a large number of images is the recommended solution to ensure that the criteria are met in a way that is both effective and reliable, specifically in terms of high quality of biometric characteristics. Photogrammetry uses two or more cameras to extract 3D information about the target object, such as a face [
12]. The output is a point cloud that can be used to generate a 3D model of the object. Texture information (RGB values) can be associated with each point to produce a photorealistic model. The use of a large number of synchronized cameras is essential to capture every part of the face redundantly due to the complexity of facial geometry. A large number of overlapping images is mandatory not only to guarantee the coverage of the whole face but also to optimize the image orientation and image alignment. Photogrammetric systems for facial or human body reconstruction are effectively applied in applications in the entertainment industry and in the medical field. The commercial company, Xangle Studio [
13], provides 3D human body reconstruction for film visual effects and games. It uses about 100 full-frame cameras for head reconstruction and about 200 full-frame cameras for body reconstruction. 3D stereo photogrammetric imaging systems are increasingly used in clinical and research settings for facial surgery and rhinoplasty [
14]. The 3dMD System [
15] is one of the most widely used imaging systems currently on the market [
16,
17]. In this system, a random light pattern is projected onto the subject while precisely synchronized cameras capture images from different angles according to an optimal configuration. The accuracy reported by the manufacturer is 0.2 mm. In [
6], Schipper et al., 2024 verify the reliability of the 3dMD system by scanning the head of a mannequin and the faces of healthy volunteers several times, achieving a mean error of 0.23 mm on a few reference distances. The Botscan system by Botspot [
18] uses 70 synchronized DSLR cameras to simultaneously capture images of a standing person for 3D reconstructions. In [
19], Michuenzi et al., 2018 compare measurements acquired on a 3D model produced with the Botscan system and Agisoft software against measurements extracted from forensic photographs. The measurements obtained by photogrammetry were significantly more accurate than those obtained by standard forensic methods based on 2D mug shots, with mean differences of 1.5 mm compared to 3.6 mm. In [
20], Leipner et al., 2019 present a 3D mug shot system specifically developed for forensic identification using a photogrammetric approach. The system comprises 26 digital single-lens reflex (DSLR) Canon EOS 80D cameras arranged in a semicircle with a radius of 1.46 m. The model is scaled with a single reference distance, and its validation is primarily focused on capturing the morphological facial features, analyzing different focal distances. Full-frame cameras, known for their high performance and high resolution, ensure accuracy and minimal noise and distortion, especially when paired with top-quality lenses. The primary goal of this research is to evaluate the potential of using low-cost cameras to achieve submillimetre accuracy in reconstructing 3D facial models. In collaboration with the Scientific Investigations Department (RIS) in Rome, we conducted a pilot study, developing a multi-view system utilizing high-resolution Raspberry Pi sensors [
21]. These sensors offer a significant cost advantage compared to full-frame cameras. The system’s performance was assessed by acquiring a 3D model and analyzing the point-wise reconstruction error.
2. Materials and Methods
Before delving into the methodology, it is crucial to understand the entire photogrammetric process involved in producing a 3D mug shot. An object, defined within an established 3D world frame, is captured using two or more images. Each image is acquired from a unique camera position, defining a 3D camera frame with its origin at the camera’s projection centre and its z-axis aligned with the optical axis of the camera. For each camera frame, the three coordinates of the projection centre in the 3D world frame and the three rotations needed to align the 3D camera frame with the 3D world frame define the six orientation parameters of that camera. These exterior parameters establish the relationship between the 3D world frame and the individual camera frames. The photogrammetric process can be divided into three stages. The first stage is the acquisition step (
Figure 3a), where each image is captured from a different camera location according to a predetermined network design that defines the orientation parameters,
T, of each camera. To ensure accurate facial reconstruction, the images should provide full coverage from ear to ear, with at least double coverage for every facial feature. Additionally, the cameras must be synchronized to capture all images simultaneously in a single shot, minimizing motion blur caused by subject movements.
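The six orientation parameters described above (three projection-centre coordinates plus three rotations) are what map a world point into each camera. A minimal pinhole-projection sketch follows; the rotation order and the function names are illustrative assumptions, not the convention of any specific photogrammetric software:

```python
import numpy as np

def rotation_matrix(omega: float, phi: float, kappa: float) -> np.ndarray:
    """Rotation from the camera frame to the world frame, composed from
    three angles (Z-Y-X composition order assumed for illustration)."""
    co, so = np.cos(omega), np.sin(omega)
    cp, sp = np.cos(phi), np.sin(phi)
    ck, sk = np.cos(kappa), np.sin(kappa)
    Rx = np.array([[1.0, 0.0, 0.0], [0.0, co, -so], [0.0, so, co]])
    Ry = np.array([[cp, 0.0, sp], [0.0, 1.0, 0.0], [-sp, 0.0, cp]])
    Rz = np.array([[ck, -sk, 0.0], [sk, ck, 0.0], [0.0, 0.0, 1.0]])
    return Rz @ Ry @ Rx

def project(point_w, centre_w, R, focal):
    """Pinhole projection: express the world point in the camera frame
    (origin at the projection centre, z along the optical axis), then
    divide by depth and scale by the focal length."""
    p_cam = R.T @ (np.asarray(point_w, float) - np.asarray(centre_w, float))
    return focal * p_cam[0] / p_cam[2], focal * p_cam[1] / p_cam[2]
```

For a camera at the origin looking down the world z-axis (all rotations zero, unit focal length), a point at (1, 0, 2) projects to (0.5, 0) in normalized image units.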
The second stage (
Figure 3b) involves the orientation process, where the orientation parameters,
T, are estimated using a set of points that allow projective relationships to be written between the images. Key points for image alignment are primarily tie points (TPs), which connect images sharing common object features. Automatic methods for detecting corresponding interest points between images often generate a very large number of TPs. Ground control points (GCPs), in addition to TPs, are a limited set of points with precisely known coordinates in an external reference frame. For high-accuracy surveys, these points are often marked and can be located in images with sub-pixel precision. The knowledge of TPs is crucial for this stage of the process and is derived entirely from the images. The step can be approached in two different ways, depending on whether we are working solely with TPs or with TPs and GCPs. In the first case, the object is reconstructed in a relative frame, whereby the orientation parameters of all cameras are determined relative to each other (relative orientation); the resulting 3D model is out of scale and not oriented. In the second case, the knowledge of the GCPs provides a 3D model that is both scaled and oriented (external orientation). The third stage (
Figure 4) involves object reconstruction based on the acquired images and known orientation parameters. This process relies on a dense matching algorithm and generates a point cloud representing the object either in a relative frame or in the world frame.
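A model reconstructed from TPs alone lives in an out-of-scale relative frame, while GCPs fix scale and orientation. The link between the two frames is a 3D similarity transform (scale, rotation, translation), which can be estimated in closed form from at least three non-collinear point correspondences. The following sketch uses the classic Procrustes/Umeyama solution; the helper name is ours, not taken from any particular software:

```python
import numpy as np

def similarity_from_gcps(model_pts, world_pts):
    """Estimate s, R, t such that world ≈ s * R @ model + t, from matched
    points in the relative frame (model_pts) and their known coordinates
    in the world frame (world_pts). Closed-form Procrustes/Umeyama fit."""
    A = np.asarray(model_pts, float)
    B = np.asarray(world_pts, float)
    ca, cb = A.mean(axis=0), B.mean(axis=0)
    A0, B0 = A - ca, B - cb                      # centre both point sets
    U, S, Vt = np.linalg.svd(A0.T @ B0)          # cross-covariance SVD
    d = np.sign(np.linalg.det(Vt.T @ U.T))       # guard against reflection
    D = np.diag([1.0, 1.0, d])
    R = Vt.T @ D @ U.T
    s = np.trace(np.diag(S) @ D) / (A0 ** 2).sum()
    t = cb - s * R @ ca
    return s, R, t
```

Applying the recovered transform to the whole point cloud yields a model that is scaled and oriented in the world frame, mirroring the external orientation case described above.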
As in any photogrammetric survey for automated surface reconstruction, image orientation is a fundamental aspect of the process. To improve the robustness of the multi-view system in determining the orientation parameters, coverage higher than double is recommended. Additionally, accurate face reconstruction must account for intrinsic facial movements. This necessitates very short image acquisition times, achievable only with multiple synchronized cameras.
Our study presents preliminary results of a feasibility study for a low-cost photogrammetric system for 3D face reconstruction in forensics, aiming for submillimetre accuracy. The low-cost camera we tested is a Raspberry Pi camera, which is particularly interesting because it supports C-mount lenses. The main challenge with low-cost cameras is their lower signal-to-noise ratio due to smaller sensors. Additionally, smaller sensors limit the field of view, which is another obstacle to overcome. The price ratio between a Raspberry Pi camera and a DSLR is approximately 1:10. This is crucial in the forensic field, where there is great interest in producing a more affordable system that guarantees the same accuracy. As an example, in Italy, the ‘Arma dei Carabinieri’ alone operates a network of at least 500 systems distributed throughout the country for producing traditional mug shots. The system was tested in two phases: in a virtual environment and in a real-world setting. In both phases, we conducted a quantitative analysis of the reconstruction error using a 3D virtual model as our ground truth. This reference model was captured directly within the virtual environment, allowing a point-by-point measurement of the reconstruction error between our reconstructed model and the ground truth.
It is important to note that in the real-world setting, we captured a 3D-printed version of the reference model using the Raspberry Pi cameras. The virtual model was printed using a Stratasys J750 3D printer [
22]. At full scale, this printer has a declared accuracy of up to 0.2 mm for rigid materials, which is negligible compared to our acceptable error range of less than 1 mm.
Figure 5 shows the virtual model on the left, some details of its mesh in the middle, and the printed model on the right. Reconstruction errors were then calculated by measuring the point-by-point distance between our reconstructed model and the virtual ground truth. Therefore, the reconstruction error of the real-world test also includes the intrinsic printer error. Reconstruction errors are visually represented as a colour map overlaid onto the 3D model, providing a clear visualization of the error distribution across the facial surface.
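The per-point error computation can be sketched as follows. CloudCompare measures distances from points to the ground-truth mesh surface; the snippet below approximates this with a nearest-neighbour cloud-to-cloud distance (a simplification that slightly overestimates the true point-to-surface distance) and reports the fraction of points inside a tolerance such as the 1 mm budget used here:

```python
import numpy as np

def cloud_to_cloud_distances(reconstructed, reference):
    """Distance from each reconstructed point to its nearest reference point.
    Brute-force version, adequate for small clouds; a mesh-based distance
    (as in CloudCompare's Cloud-to-Mesh tool) would be slightly smaller."""
    P = np.asarray(reconstructed, float)[:, None, :]  # shape (n, 1, 3)
    Q = np.asarray(reference, float)[None, :, :]      # shape (1, m, 3)
    return np.sqrt(((P - Q) ** 2).sum(axis=-1)).min(axis=1)

def fraction_within(distances, tol_mm=1.0):
    """Share of points whose error is within the tolerance (e.g. 1 mm)."""
    return float((np.asarray(distances, float) <= tol_mm).mean())
```

The resulting distances can then be mapped to a colour scale per point to produce the error colour map described above.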
The virtual test aims to anticipate some practical aspects of the study by investigating the accuracy and completeness of the model reconstruction as a function of the camera network configuration and of the camera orientation. The virtual environment was created in Autodesk 3ds Max 2022 [
23]. The first test was carried out using the true camera orientation parameters to produce the 3D face reconstruction. It served as a crucial benchmark for the following tests. Subsequently, we investigated the two photogrammetric orientation approaches and the camera configurations to determine which workflow most closely approximated the accuracy achieved with true orientation parameters.
In the real setting, we transferred the results obtained from the simulated network to evaluate the performance of the sensor. We had only three Raspberry Pi cameras, which were moved through the acquisition positions. Three cameras are less than ideal for a multi-view photogrammetric system. However, as we were capturing a static object (the 3D-printed model), camera synchronization was not essential. Synchronization becomes critical when capturing images of a real person, especially in non-collaborative contexts. Although the frames were not synchronized, since we took all the images in sequence, relocating the cameras to cover all the angles, we successfully obtained 3D models for three volunteers who maintained their positions throughout the acquisition process. The volunteers’ collaboration allowed us to verify the performance of dense matching on real subjects, which is influenced by camera resolution. For one of them, we produced a complete, satisfactory 3D facial model that allowed us to test the face-matching software (NeoFace Watch ver. 5.1.3.15) routinely used by investigators, to assess potential improvements in matching accuracy compared to traditional 2D mug shots. It is important to emphasize that this last test, which focuses on improving facial recognition, is still in its early stages, and we present only preliminary results.
When designing the camera configuration, we considered the following factors: a minimum triple coverage of the face; an angle between adjacent cameras below 30° to improve image alignment and subsequent orientation; and a camera-to-subject distance of 1.5 m or less to meet the space constraints of the RIS investigators. The most basic camera configuration that meets these guidelines is a single row of eight cameras arranged in a semicircle with a radius of 1.5 m, covering a 180° arc. The angle between adjacent cameras is 25.74°, the distance between the projection centres of adjacent cameras is 40.39 cm, and a field of view of 20° ensures triple coverage. Starting from this basic configuration, we gradually increased the number of cameras to achieve additional facial coverage by adding cameras above and below the initial row, creating both side and lateral overlap. The final configuration was obtained by adding two more rows of 8 cameras positioned half a metre above and below the first row, resulting in a total cluster of 24 cameras. The cameras in the additional rows were tilted 18° about the x-axis: downwards for the 8 cameras in the row above the first, and upwards for the 8 cameras in the row below.
Figure 6 shows the most basic camera configuration, composed of 8 cameras, and the complete 24-camera configuration.
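The row-of-cameras layout can also be generated programmatically. The sketch below places n projection centres evenly along a semicircular arc around a subject at the origin; the axis conventions and the even-spacing rule are our own assumptions, and the exact angles and chord distances obtained depend on the layout convention adopted:

```python
import numpy as np

def camera_ring(n=8, radius=1.5, arc_deg=180.0, height=0.0):
    """Projection centres evenly spaced on an arc of `arc_deg` degrees,
    all at distance `radius` from a subject at the origin (y up, assumed)."""
    angles = np.radians(np.linspace(-arc_deg / 2.0, arc_deg / 2.0, n))
    return np.stack([radius * np.sin(angles),
                     np.full(n, float(height)),
                     radius * np.cos(angles)], axis=1)

def adjacent_spacing(positions):
    """Chord (straight-line) distance between neighbouring projection centres."""
    return np.linalg.norm(np.diff(positions, axis=0), axis=1)

# Example: the basic single row of 8 cameras at 1.5 m over a 180° arc.
ring = camera_ring(8, 1.5, 180.0)
```

Additional rows (as in the 24-camera cluster) can be produced by calling the same generator with a different height and tilting each camera toward the subject.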
We used the software Agisoft Metashape ver.1.7 [
24] to solve the image orientation and image matching steps. Metashape first detects correspondences across the photos, then applies a greedy algorithm to find approximate camera locations, and finally refines them using a bundle adjustment algorithm to obtain accurate camera positions, orientations, and distortion parameters. Image alignment is followed by dense matching (the reconstruction stage), whose outcome is the point cloud of the acquired object. All reconstruction errors were calculated using the Cloud-to-Mesh Distance tool in the open-source software CloudCompare [
25].
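The bundle adjustment step mentioned above refines camera parameters by minimizing the total squared reprojection error, i.e. the mismatch between where each tie point actually appears in an image and where the current parameters predict it should appear. A toy version of that objective is sketched below; this is not Metashape's implementation, which also models lens distortion:

```python
import numpy as np

def reproject(point_w, centre, R, focal):
    """Predicted image coordinates of a world point for one camera."""
    p = R.T @ (np.asarray(point_w, float) - np.asarray(centre, float))
    return focal * p[:2] / p[2]

def total_reprojection_error(points_w, cameras, observations):
    """Sum of squared residuals over all cameras and observed points.
    cameras: list of (centre, R, focal); observations[i][j] is the observed
    (u, v) of point j in camera i, or None if the point is not visible."""
    err = 0.0
    for (centre, R, focal), obs in zip(cameras, observations):
        R = np.asarray(R, float)
        for X, uv in zip(points_w, obs):
            if uv is None:
                continue
            r = reproject(X, centre, R, focal) - np.asarray(uv, float)
            err += float(r @ r)
    return err
```

Bundle adjustment searches for the camera parameters (and tie-point coordinates) that drive this sum toward its minimum, typically with a nonlinear least-squares solver.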
4. Discussion
Research in the forensic context is pushing for acceptance of the 3D data offered by new technology, which must prove the strength of the evidence. Morphological features must be preserved, and accurate metric reconstruction is essential to maintain proportionality. These advances aim to increase the evidential value and acceptance of forensic facial recognition technology and of the 3D mug shot within the legal system; facial reconstruction for application in a forensic context therefore requires more rigorous approaches. This study proposes a multi-view photogrammetric approach utilizing low-cost Raspberry Pi cameras to create high-precision 3D mug shots with submillimetre accuracy.
We first designed the network in a virtual environment to verify the camera configuration; we then compared the outcomes obtained working in a world frame, through GCPs distributed in the environment, with those obtained in a relative frame. The results of the tests performed using the exterior orientation parameters set in the virtual environment, compared with those obtained using estimated orientation parameters, highlight the critical influence of the exterior orientation task, which is more prone to error than alignment in a relative frame. Our results indicate that cameras with reduced fields of view, due to their sensor dimensions, require three rows of cameras to produce a robust image bundle oriented in a relative frame. Tests performed in a real setting emphasize the importance of accurately estimating the scale factor whenever metric information must be extracted from the face. Specifically, we observed that 80% of the points had a distance of less than ±1 mm when multiple scale bars were used to estimate the scale factor. More specifically, we used 10 scale bars extracted from the GCPs behind the model.
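The scale factor from multiple scale bars can be pooled in a single least-squares estimate rather than averaging per-bar ratios. A minimal sketch of that idea follows (our formulation; the software's internal handling of scale bars may differ):

```python
import numpy as np

def scale_from_bars(model_lengths_mm, true_lengths_mm):
    """Least-squares scale factor s minimizing sum_i (true_i - s * model_i)^2,
    pooled over all scale bars. With a single bar this reduces to the plain
    ratio true/model."""
    m = np.asarray(model_lengths_mm, float)
    t = np.asarray(true_lengths_mm, float)
    return float((m @ t) / (m @ m))
```

In this formulation longer bars naturally carry more weight, which is one reason pooling several bars is more robust than relying on a single reference distance.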
Furthermore, despite the limited number of synchronized cameras, we were able to obtain 3D models from three volunteers, who were able to maintain their position for several minutes. More specifically, one of these was a complete model, which allowed us to evaluate the improvement in forensic recognition of 3D mug shots over 2D mug shots, showing an increase in matching score of up to 0.42 points, especially in scenarios where the subject is not captured from a frontal view.
In reality, the images of a subject to be identified are rarely taken from a frontal or profile view. Surveillance cameras mainly capture people from above, and this factor can be decisive in investigations. Therefore, having a 3D model of a suspect’s face would allow forensic facial-recognition experts to improve the alignment of the known face with anonymous images captured by CCTV. This study represents an initial feasibility analysis for the design of a multi-view system for facial reconstruction. Further developments, such as acquiring a complete synchronized multi-view system and creating a dedicated environment for system development, can only enhance these preliminary results.