1. Introduction
Recent innovations in the medical field have led to the proliferation of technological advances inside operating rooms (ORs). As advanced as today’s operating theaters are, increasing surgical workflow complexity, emerging clinician needs, and new patient preferences on one side, and advances in data science, artificial intelligence (especially deep learning), and computer vision on the other, are all likely to shape future ORs [1,2,3]. By integrating intelligent context-aware systems (CASs), future ORs will enable the transition from knowledge-based toward more data-driven surgical treatment. Data-driven treatment includes perceptively interacting with medical teams (e.g., surgical and anesthesiological teams), enabling multi-perspective knowledge-sharing between these teams, providing medical support, and mitigating possible complications [2,4]. In this context, a CAS should be able to comprehend the workflow inside the OR, understand the current situation by fusing data from different perspectives (surgical and patient-related data) [5,6], and predict upcoming surgical events. Thus, analyzing the surgical workflow inside the OR represents a central goal of CASs [7,8].
Analyzing a surgical workflow relies on modeling surgical procedures as surgical process models (SPMs) [8]. In this context, surgical workflows can be described as sequences of surgical phases that represent the main tasks performed during surgery [7,8]. Surgical phases consist of goal-specific high-level tasks, and different granularity levels have been defined to model the surgical procedure [8]. Recognizing surgical phases and detecting surgical tools have great potential for providing intra-operative and post-operative assistance to clinicians. Recognizing the current surgical phase and predicting upcoming phases promote better situational awareness inside the OR and allow medical support to be provided to the surgical team by detecting abnormal cases. Moreover, the duration of the surgical procedure can be estimated, and the schedule and resources of the surgical department can be optimized [9]. Automatic surgical phase and tool recognition systems can also be utilized to label recorded data and, therefore, provide trainees with training material.
In the domains of surgical phase recognition and tool detection, the described approaches have relied on different data sources, such as surgical videos (laparoscopic [10] and microscopic videos [11]), sensor data [12,13], instrument sensors [14,15], and medical device data [16]. As laparoscopic surgery has become established practice and has largely replaced open surgery, significant research has been conducted on laparoscopic video data. The main advantages of laparoscopic videos over other data sources are that they are already integrated into the current OR setup, can be easily accessed and captured, and provide comprehensive information about the surgical instruments used, the anatomies treated, and the activities conducted. On the other hand, analyzing laparoscopic videos has been a challenging task for researchers in the field of surgical data science (SDS) [2]. Extensive efforts have been made to develop video-based approaches for the automatic recognition of surgical phases [17] and the detection and localization of surgical tools. Earlier approaches relied on extracting visual features from laparoscopic images and then employing an adequate classifier [10,18,19]. In recent years, the evolution of deep learning (DL), enabled by the development of high-performance hardware infrastructure, has shifted the focus from traditional machine learning toward DL-based approaches. DL techniques, specifically convolutional neural networks (CNNs), have shown performance superior to other methods [20].
In this work, a spatiotemporal, weakly-supervised deep learning approach for analyzing laparoscopic surgical videos (in terms of surgical phase recognition, surgical tool classification, and localization) is proposed. ResNet-50 was chosen as the base model, with the following modifications: first, four squeeze-and-excitation (SE) attention modules were incorporated into the CNN architecture to enhance its capability to learn more discriminative features and focus on tool-related regions in the image; second, feature maps from lower and upper layers were aggregated to generate a better representation of the image content. The aggregated features were then shared by two branches (a tool branch and a phase branch). Following earlier approaches [21], the tool branch contained a convolutional layer that generates multiple feature maps per tool class. Through tool-wise pooling and spatial pooling operations, the tool-related feature maps were transformed into a localization map and a tool-presence confidence, respectively. The phase branch was composed of a global average pooling (GAP) layer followed by a concatenation layer that includes the tool presence probabilities in the final feature vector for phase recognition. An LSTM network was finally employed to model the temporal information that is crucial for the phase recognition and tool presence detection tasks. The proposed model was evaluated on the Cholec80 dataset [22].
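To make the two-branch design concrete, the following is a minimal PyTorch-style sketch of the shared backbone features feeding a tool branch (per-class multi-maps reduced by tool-wise and spatial pooling) and a phase branch (GAP features concatenated with tool probabilities). The module names, the number of maps per class, and the layer sizes are illustrative assumptions, not the exact configuration used in this work.

```python
import torch
import torch.nn as nn

class TwoBranchHead(nn.Module):
    """Sketch of the shared-feature, two-branch design (assumed sizes)."""
    def __init__(self, in_channels=2048, num_tools=7, num_phases=7, maps_per_class=8):
        super().__init__()
        self.num_tools = num_tools
        self.maps_per_class = maps_per_class
        # Tool branch: one 1x1 convolution producing M maps per tool class.
        self.multi_map = nn.Conv2d(in_channels, num_tools * maps_per_class, kernel_size=1)
        # Phase branch: GAP features concatenated with tool probabilities.
        self.phase_fc = nn.Linear(in_channels + num_tools, num_phases)

    def forward(self, feats):                      # feats: (B, C, H, W) aggregated backbone features
        maps = self.multi_map(feats)               # (B, T*M, H, W)
        b, _, h, w = maps.shape
        maps = maps.view(b, self.num_tools, self.maps_per_class, h, w)
        loc_maps = maps.mean(dim=2)                # tool-wise pooling -> one localization map per tool
        tool_logits = loc_maps.amax(dim=(2, 3))    # spatial (max) pooling -> tool-presence confidence
        tool_probs = torch.sigmoid(tool_logits)
        gap = feats.mean(dim=(2, 3))               # global average pooling for the phase branch
        phase_logits = self.phase_fc(torch.cat([gap, tool_probs], dim=1))
        return tool_logits, loc_maps, phase_logits
```

In the full approach, the per-frame phase features are subsequently passed to an LSTM over the video sequence; the spatial pooling above uses a simple maximum, whereas the original multi-map formulation [21] combines maximum and minimum responses.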
2. State of the Art
For surgical phase recognition, several DL approaches have been presented in the literature, including spatial and temporal models. Temporal information along the surgical video sequence is essential to model dependencies between surgical phases, which are typically performed in a specific order [17]. Therefore, a base CNN model, such as ResNet-50 [23] or VGG-16 [24], was first adapted and utilized for extracting spatial features from laparoscopic images. Then, a temporal model (such as a hidden Markov model (HMM) [22,25] or a recurrent neural network (RNN) [26,27,28]) was incorporated to refine the CNN predictions. Twinanda et al. presented a multi-task CNN model that performed surgical phase recognition and tool classification [22]. A hierarchical HMM (HHMM) was employed as a temporal model to perform online and offline recognition of surgical phases. To overcome the drawbacks of statistical models, later approaches implemented long short-term memory (LSTM) networks to learn temporal features [26,28,29]. For example, Twinanda et al. substituted the HHMM in the EndoNet methodology with an LSTM model [26]. Jin et al. proposed a CNN-LSTM deep learning framework (SV-RCNet), trained end-to-end with a prior knowledge inference scheme, to carry out phase recognition [29]. Similarly, Jin et al. devised the MTRCNet approach, which performed both surgical phase and tool recognition and employed a novel loss function that considered the phase-tool relation [27]. Jalal et al. suggested using a nonlinear autoregressive network with exogenous input (NARX) together with a CNN for surgical phase prediction [28]. In [30], a temporal approach called TeCNO, which combined a ResNet-50 model with a multi-stage temporal convolutional network (MS-TCN), was proposed. Recently, various transformer-based models tailored for laparoscopic phase recognition have been introduced [31]. For instance, Czempiel et al. designed the OperA approach, which is based on a transformer model, to concurrently learn spatial and temporal features along video sequences [32]. Gao et al. employed a hybrid embedding aggregation transformer to aggregate spatial and temporal features generated by ResNet-50 and TCN models [33].
Surgical tool classification and localization have been tackled in a similar fashion to phase recognition, and several methods suggested multi-task models for tool presence detection and phase recognition. In [22], surgical tool classification was conducted solely based on spatial features learned by a CNN model. Subsequent studies addressed typical challenges facing tool classification methods, such as imbalanced data distribution and obscured images. Loss-sensitive and resampling techniques were introduced in [34] to mitigate the effects of the imbalanced distribution of surgical tools on the CNN training process. Spatiotemporal models were introduced to refine the tool predictions obtained by the CNN model. Several temporal models, such as LSTMs [35,36,37], graph convolutional networks (GCNs) [38], and convolutional LSTMs [39], were presented in previous works. Abdulbaki Alshirbaji et al. proposed combining a CNN model with two-stage LSTM models to capture temporal dependencies within short video clips and along the complete surgical video sequence [35]. Despite the various successful approaches that have been developed, progress in the SDS field is still limited and lacks significant applications in practice [2]. The main reason is the scarcity of labeled surgical data. Therefore, several techniques have been introduced to increase the amount of data available for training CNN models [40], including data augmentation and generative adversarial networks (GANs) [41]. Moreover, weakly-supervised learning of CNN models represents a potential solution for object localization: the CNN is designed to perform object localization but is trained only with binary object presence labels. Durand et al. suggested an approach for weakly-supervised object localization by adding a multi-map localization layer on top of the CNN model [21]. They also introduced a novel spatial pooling strategy to transform the multi-maps into a class-wise localization map. Their approach was investigated by Vardazaryan et al. [42] and Nwoye et al. [39] for surgical tool localization in laparoscopic videos and performed very well.
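As a rough illustration of the multi-map idea of Durand et al. [21], the sketch below shows the spatial pooling step that turns a class-wise localization map into a presence score by combining the strongest positive and negative spatial responses. The k values and the weighting factor are hypothetical; the code follows the general WILDCAT-style formulation rather than a specific configuration from the cited works.

```python
import torch

def wildcat_spatial_pooling(class_maps, kmax=3, kmin=3, alpha=0.7):
    """Pool class-wise localization maps (B, C, H, W) into presence scores (B, C).

    Scores combine the mean of the kmax highest activations with a weighted
    mean of the kmin lowest activations, as in WILDCAT-style spatial pooling.
    """
    b, c, h, w = class_maps.shape
    flat = class_maps.view(b, c, h * w)
    sorted_vals, _ = flat.sort(dim=2, descending=True)
    top = sorted_vals[:, :, :kmax].mean(dim=2)       # evidence for the class
    bottom = sorted_vals[:, :, -kmin:].mean(dim=2)   # evidence against the class
    return top + alpha * bottom
```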
Recently, CNN-attention networks have been proposed; these incorporate attention modules into the CNN architecture to help generate more focused features. Shi et al. proposed an attention-based CNN to perform surgical tool detection in laparoscopic images [43]. In [44], an attention-guided network (AGNet) for surgical tool presence detection achieved high performance on the m2cai16-tool dataset. Furthermore, Jalal et al. emphasized the value of employing attention modules for surgical tool localization in their feasibility study [45]. The attention CNN was capable of generating finer and more focused gradient class activation maps (Grad-CAM) that were utilized to extract bounding boxes. These approaches were evaluated on a single dataset or a single type of surgical procedure. Therefore, the robustness and generalizability of deep learning approaches toward new data represent the main concerns that need to be investigated before translating these approaches into clinical practice [46].
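One widely used attention module of this kind, and the type adopted in the approach proposed here, is the squeeze-and-excitation (SE) block, which re-weights feature channels using globally pooled statistics. The following is a minimal sketch of a standard SE block; the reduction ratio and the exact placement within ResNet-50 are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Standard squeeze-and-excitation block (channel attention)."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):                 # x: (B, C, H, W)
        squeeze = x.mean(dim=(2, 3))      # squeeze: global average pooling per channel
        scale = self.fc(squeeze)          # excitation: channel-wise attention weights
        return x * scale.view(x.size(0), x.size(1), 1, 1)
```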
4. Results
The results of tool presence detection obtained by the CNN-MMC, CNN-SE-MSF, and CNN-SE-MSF-LSTM approaches are shown in Figure 3. Figure 4 shows the F1-score for tool localization using the CNN-MMC and CNN-SE-MSF approaches. Both figures demonstrate the value of adding the attention modules and combining features from multiple stages to improve tool presence detection and to generate better localization maps for all tools. The average precision of all tools was enhanced by a large margin over the CNN-MMC, and the most notable enhancement was achieved after employing the LSTM network. To further validate the results of the proposed approach, tool-wise comparisons between the CNN-SE-MSF-LSTM and the state-of-the-art methods are presented in Table 3. As can be seen, the proposed approach achieved superior performance over the state-of-the-art methods in most tool categories. Table 4 lists the phase recognition results on the Cholec80 dataset using the CNN-SE-MSF-LSTM approach. The precision and recall of all phases and their mean values are presented. Additionally, a comparison with the leading methods is presented in Table 5. The training and inference times of the evaluated approaches are presented in Table 6.
In order to provide insight into the performance improvement achieved by the proposed approach, qualitative results for tool detection and phase recognition were visualized. Figure 5 visualizes the localization maps of every tool obtained by the CNN-MMC and CNN-SE-MSF models. Every image contains the manually labeled bounding box and the predicted bounding box of the corresponding tool and is annotated with the IoU value between the two boxes. For all images, the probability of the examined tool class obtained by CNN-MMC or CNN-SE-MSF was higher than 98%. Figure 6 shows the predictions and ground truth of the top-3 and bottom-3 procedures for surgical phase recognition.
5. Discussion
This study presents a multi-task, weakly-supervised deep learning approach, trained with binary tool presence labels and phase labels, to analyze laparoscopic videos. The approach is intended to recognize surgical phases and to detect and discriminate between surgical tools. An extensive evaluation of the proposed model was conducted on the Cholec80 dataset [22].
5.1. Phase Recognition
The proposed approach yielded a mean recall and mean precision of 89.0% and 87.9%, respectively. These values improve on the base ResNet-50 model, which achieved a mean recall and precision of 71.8% and 72.0%, respectively. Hence, the attention modules and the combination of features from multiple stages helped the model learn a better phase-related feature representation of the laparoscopic image content. Additionally, the LSTM network contributed effectively to modeling the temporal constraints of surgical phases.
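For reference, the phase-wise precision and recall reported here follow the usual frame-based definitions, averaged over the phases; this is the standard formulation rather than a formula quoted from the paper:

```latex
\mathrm{Precision}_p = \frac{TP_p}{TP_p + FP_p}, \qquad
\mathrm{Recall}_p = \frac{TP_p}{TP_p + FN_p}, \qquad
\mathrm{mPrecision} = \frac{1}{P}\sum_{p=1}^{P}\mathrm{Precision}_p, \qquad
\mathrm{mRecall} = \frac{1}{P}\sum_{p=1}^{P}\mathrm{Recall}_p
```

where TP_p, FP_p, and FN_p denote the numbers of frames of phase p that are correctly recognized, falsely attributed to p, and missed, respectively, and P is the number of phases.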
As can be seen from the precision and recall values in Table 4, the proposed approach achieved the best performance for P1, P2, and P4, with recall values of 94.6%, 95.8%, and 95.2%, respectively. Conversely, the results for the other phases were lower, particularly for P6. This high variance between phases can be explained by the imbalanced data distribution: P2 and P4 typically last longer than the other phases in cholecystectomy procedures, as can be seen from the mean duration of each phase presented in Table 1. The first four phases are performed consecutively (i.e., linear phase transitions), while the last three phases are associated with non-linear transitions. Consequently, P5, P6, and P7 had lower recognition performance. Similarly, the high precision value obtained for P2 (98.4%) helps explain the high recognition results for P1 despite its low representation in the dataset.
The tool-phase relation has already been addressed in other works and is also described in the introduction and methodology sections. It is worth noting that there is a high correlation between the results obtained for tools and phases. For instance, the high hook presence detection performance of 99.7% matches the high recognition performance of P2 and P4, which are mainly performed with the hook. Furthermore, the improvement obtained by the CNN-SE-MSF-LSTM approach over the CNN-MMC for the scissors explains the improvement obtained for P3 (recall of 86.3%).
5.2. Tool Classification and Localization
Experimental results show that adding the SE attention modules and combining features from low and high layers improved the tool classification performance over previous methods. Moreover, employing the LSTM network yielded the most notable improvement for all tools, particularly the scissors. The CNN-SE-MSF-LSTM and CNN-SE-MSF approaches both achieved higher mAP values than the established CNN-MMC [42] (see Figure 3), which implies the advantage of using the attention modules and the MSF for tool classification. Moreover, modeling temporal dependencies along the video sequence helped refine the classifications obtained by the purely spatial models.
Every surgical phase is performed by the surgeon using a specific set of tools. This is the basis for developing a multi-task approach that jointly performs tool and phase recognition. Since the surgical phases are performed in a specific sequence, the appearance of tools during the surgery is also constrained. Therefore, the best classification performance was achieved after employing the LSTM (Figure 3). The AP of some tools (e.g., the scissors) was enhanced by a larger margin over the spatial model, while other tools showed a smaller improvement (e.g., the grasper). This high variance in improvements between the tools can be explained by the surgical phases associated with these tools. For instance, the scissors are only required during the third phase (cutting and clipping) to cut the cystic duct; hence, the LSTM learned discriminative temporal information for the scissors. The grasper, on the other hand, is utilized during all phases to grasp tissue, so modeling temporal information provided a negligible performance enhancement for grasper classification.
Figure 5 shows the qualitative assessment of the CNN-MMC and CNN-SE-MSF approaches. From the localization maps of each tool, it can be noticed that the CNN-SE-MSF was capable of learning the tool regions better than the base CNN-MMC model. The IoU values between the manually labeled and predicted bounding boxes show better localization performance for the CNN-SE-MSF approach. Moreover, adding the SE and MSF to the CNN-MMC helped smooth the localization map and bring it closer to the shape of the tooltip (Figure 5, grasper tool).
In the Cholec80 dataset, the grasper has multiple tool instances (i.e., up to three graspers may appear in the image), while all other surgical tools have a single instance. The proposed approach was designed to generate one localization map per tool; nevertheless, multiple instances of the grasper were detected through a post-processing step. Here, the three largest objects in the localization map of the grasper were considered as ’detections’, and their bounding boxes were assigned the same confidence, namely the grasper presence probability.
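A minimal sketch of such a post-processing step is given below, assuming the localization map is first thresholded and then split into connected components; the threshold value, the function name, and the use of SciPy are illustrative assumptions rather than the exact implementation of this work.

```python
import numpy as np
from scipy import ndimage

def grasper_boxes(loc_map, presence_prob, threshold=0.5, max_instances=3):
    """Extract up to `max_instances` bounding boxes from one localization map.

    loc_map: (H, W) localization map for the grasper, assumed normalized to [0, 1].
    Returns a list of (x_min, y_min, x_max, y_max, confidence) tuples.
    """
    binary = loc_map >= threshold
    labeled, num_objects = ndimage.label(binary)            # connected components
    if num_objects == 0:
        return []
    sizes = ndimage.sum(binary, labeled, range(1, num_objects + 1))
    largest = np.argsort(sizes)[::-1][:max_instances] + 1   # labels of the largest blobs
    boxes = []
    for lbl in largest:
        ys, xs = np.where(labeled == lbl)
        # All instance boxes share the tool-presence confidence of the grasper.
        boxes.append((xs.min(), ys.min(), xs.max(), ys.max(), presence_prob))
    return boxes
```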
Figure 7 shows an example of multiple instances of the grasper and the detected bounding boxes. As can be seen, the proposed approach was able to localize the three instances of the grasper; however, only two of the ’detections’ exceeded the IoU threshold and were counted as correct. The region containing the shaft of the third grasper (with IoU = 27.83%) was also detected in the localization map as part of the tool, rather than only the characteristic tip. Indeed, the weakly-supervised training of the proposed approach resulted in relatively large bounding boxes that included the tip and parts of the tool shaft. In principle, the shaft is also part of the tool, but according to the evaluation criteria of this work, only the tooltip should be localized. However, labeling the tool shaft with additional bounding boxes has the potential to better evaluate the model’s ability to separate background information from tool regions, and potentially also to capture the tool orientation as well as its location.
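The IoU values referred to above follow the standard intersection-over-union definition for a predicted box and a manually labeled box (stated here for reference, not quoted from the paper):

```latex
\mathrm{IoU}(B_{\mathrm{pred}}, B_{\mathrm{gt}}) =
\frac{\lvert B_{\mathrm{pred}} \cap B_{\mathrm{gt}} \rvert}
     {\lvert B_{\mathrm{pred}} \cup B_{\mathrm{gt}} \rvert}
```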
Figure 8 shows cases where the proposed approach failed to detect the bounding box precisely, even when the activated regions matched the tool location in the image. For the first image (Figure 8a), the manually labeled and predicted bounding boxes of the bipolar are presented in green and blue, respectively. The feature map obtained for the bipolar is shown in Figure 8c. The bipolar only partially appears in the image, and only a small part of the tip was detected. The bipolar tip consists of two parts: a characteristic blue clevis and a grasping part that has a similar appearance to the grasper tip. Therefore, the detection of the bipolar relied on localizing its blue clevis, not the entire tooltip. Both the tooltip and the clevis were considered when labeling the Cholec80-Boxes; hence, the labeling protocol could be modified accordingly by only considering the characteristic clevis of the bipolar. Similarly, the clevis of the bipolar in Figure 8b was detected by the proposed approach as two separate objects, as can be seen in the localization map of the bipolar (Figure 8d). In the post-processing step, only one object was counted for detecting the bipolar bounding box, which led to a false prediction with IoU = 28.43%. The rate of this kind of false detection could therefore potentially be reduced by refining the post-processing step (e.g., using morphological image processing).
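One simple refinement of the post-processing step mentioned above would be a morphological closing of the thresholded localization map before the connected-component analysis, so that nearby fragments of the same tool (such as the two parts of the bipolar clevis) are merged into a single object. The sketch below illustrates this idea; the kernel size and threshold are illustrative assumptions.

```python
import numpy as np
from scipy import ndimage

def merge_fragments(loc_map, threshold=0.5, closing_size=9):
    """Threshold a localization map and merge nearby fragments by morphological closing."""
    binary = loc_map >= threshold
    structure = np.ones((closing_size, closing_size), dtype=bool)
    closed = ndimage.binary_closing(binary, structure=structure)
    labeled, num_objects = ndimage.label(closed)   # fragments bridged by closing become one object
    return labeled, num_objects
```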
5.3. Comparison with the State of the Art
Table 3 and Table 5 present comparisons with the leading methods for surgical tool presence detection and phase recognition. Twinanda et al. introduced the base model EndoNet, which performed tool presence detection and phase recognition in a multi-task manner [22]. An HHMM was employed to model temporal dependencies between surgical phases. They achieved an mAP of 81.0% for tool presence detection and a mean precision and mean recall of 73.7% and 79.6%, respectively, for surgical phase recognition. Jin et al. tackled the tool detection and phase recognition tasks in a similar fashion but employed an LSTM network as the temporal model. They also introduced a novel loss function that better considered the tool-phase relation. Their method showed clear improvements over EndoNet, with an mAP of 89.1% for tool detection and a mean recall of 88% for phase recognition. However, in both approaches, tool presence detection was carried out solely based on spatial information learned by the CNN model. In contrast, this work uses the LSTM network for both the phase recognition and tool presence detection tasks.
Wang et al. [38] proposed using a graph convolutional neural network (GCN) to learn temporal information from short video clips for the tool classification task. They evaluated their methodology on the Cholec80 dataset and reported an mAP of 90.1%. In a recent study [35], two stages of temporal modeling were proposed to learn dependencies, first from short video sequences of unlabeled frames and then across the whole surgical video. This approach yielded the best performance among previous tool presence detection methods, reaching an mAP of 94.6% [35]. Vardazaryan et al. [42] transferred the weakly-supervised WILDCAT approach [21] to tool localization in laparoscopic videos. Indeed, the CNN-MMC approach (Table 2) represents a reproduction of their work. Nwoye et al. [39] built upon the work in [42] and employed a convolutional LSTM layer to learn spatiotemporal coherence along the video sequence. Similar to this study, the approaches of Vardazaryan et al. [42] and Nwoye et al. [39] were trained only with binary tool labels, and they reported tool presence mAP values of 87.2% and 92.9%, respectively. Interestingly, the model presented in this study (CNN-SE-MSF-LSTM) achieved an mAP of 95.6%, higher than those of [35,38,39,42]. The tool localization results of this work were not compared with other works because of the different types of evaluation data.
For phase recognition, Jin et al. [29] proposed the SV-RCNet deep learning approach, which is composed of a CNN and an LSTM network, together with a prior knowledge inference scheme. Their method yielded a high recognition performance, with a mean precision and mean recall of 88.1% and 88.9%, respectively. More recently, Czempiel et al. [30] proposed using a temporal convolutional network and reported 80.9% precision and 87.4% recall. In [52], by combining a CNN with a two-stage LSTM, the authors achieved 92.9% accuracy on Cholec80. Recent studies have proposed using transformers instead of LSTM networks for temporal modeling; Czempiel et al. [32] proposed the OperA approach based on a transformer model and reported 82.2% precision and 86.9% recall. Indeed, the CNN-SE-MSF-LSTM exceeded the performance of most state-of-the-art methods for phase recognition (Table 5), achieving the best recall value of 89.0% with a precision of 88.5%.
5.4. Limitations and Future Scope
An experimental evaluation of the proposed approach was carried out using a single dataset (Cholec80). To establish the robustness and generalization capability of the approach, extensive evaluations on other datasets should be performed. Furthermore, the spatial and temporal models were trained separately rather than end-to-end. The main drawback of end-to-end training is its computational burden, since a large amount of GPU memory is required. Nevertheless, end-to-end training could be performed on short image sequences to learn better spatiotemporal features; the LSTM could then be trained with complete video sequences.
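To illustrate the separate training scheme discussed above, the following sketch outlines the two stages: training the spatial CNN on frames, then freezing it and training the temporal model on whole videos. The function and variable names (including the `extract_features` helper), sequence handling, loss weighting, and optimizer settings are hypothetical and only indicate the overall training flow, not the exact procedure used in this work.

```python
import torch
import torch.nn.functional as F

def train_two_stage(cnn, temporal_model, frame_loader, video_loader,
                    epochs_cnn=20, epochs_temporal=10, device="cuda"):
    """Stage 1: train the spatial CNN on single frames with tool and phase labels.
    Stage 2: freeze the CNN, extract per-frame features, and train the temporal model
    (e.g., an LSTM with a classification head) on complete video sequences."""
    cnn.to(device)
    opt = torch.optim.Adam(cnn.parameters(), lr=1e-4)
    for _ in range(epochs_cnn):
        for frames, tool_labels, phase_labels in frame_loader:
            tool_logits, _, phase_logits = cnn(frames.to(device))
            loss = (F.binary_cross_entropy_with_logits(tool_logits, tool_labels.to(device))
                    + F.cross_entropy(phase_logits, phase_labels.to(device)))
            opt.zero_grad()
            loss.backward()
            opt.step()

    # Stage 2: the CNN is frozen; only the temporal model is updated.
    cnn.eval()
    temporal_model.to(device)
    opt = torch.optim.Adam(temporal_model.parameters(), lr=1e-4)
    for _ in range(epochs_temporal):
        for video_frames, phase_labels in video_loader:          # one complete procedure at a time
            with torch.no_grad():
                feats = cnn.extract_features(video_frames.to(device))   # hypothetical helper: (T, D)
            phase_logits = temporal_model(feats.unsqueeze(0)).squeeze(0)  # (T, num_phases)
            loss = F.cross_entropy(phase_logits, phase_labels.to(device))
            opt.zero_grad()
            loss.backward()
            opt.step()
```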
The developed framework has the potential to be employed as a first step in labeling new datasets. For instance, bounding boxes can be generated and then modified by labeling specialists. Consequently, manual tagging to support DL model development could be achieved with less time and effort.