1. Introduction
Optoacoustic imaging, also termed photoacoustic imaging, is an emerging, hybrid technology that enables non-invasive visualization of tissue morphological and functional characteristics at depths of up to several centimeters [
1]. Non-invasive optoacoustic imaging is based on the optoacoustic effect (
Figure 1), in which an instantaneous optical excitation of tissue (usually by means of a pulsed laser) causes the thermoelastic expansion of light-absorbing biomolecules and the subsequent generation of wideband pressure waves that are recorded by ultrasonic transducers positioned on the tissue surface [
2]. This hybrid approach leverages and integrates the desirable characteristics offered by pure optical and ultrasonic methods, i.e., optical contrast and acoustic resolution, respectively. Mathematical inversion of the acquired signals renders planar or volumetric images of the initial spatial distribution of acoustic pressure, which is proportional to the absorbed optical energy [
3]. Illumination with specific wavelengths enables targeting chromophores of interest, such as oxygenated or deoxygenated hemoglobin, melanin, lipids, and water [
4]. In principle, quantitative imaging of chromophore concentrations is possible with multi-wavelength measurements [
5], making (‘multispectral’ or ‘spectral’) optoacoustic imaging attractive for an expansive range of clinical [
4] and preclinical [
6] applications, including, inter alia, histology, dermatology, endocrinology, vascular imaging, and imaging of cancer and inflammation.
Current optoacoustic imaging systems can be broadly categorized as microscopic, mesoscopic, and macroscopic/tomographic, depending on the targeted tissue penetration depth and resolution (
Figure 1). Optoacoustic microscopy systems typically employ raster-scanning imaging heads that accommodate a single acoustic detector, and are further classified into optical-resolution (OR-PAM, optical-resolution photoacoustic microscopy) and acoustic-resolution (AR-PAM) categories, depending on whether the achievable resolution is limited by optical or acoustic diffraction, respectively [
7]. OR-PAM systems depend on tightly focused illumination to achieve very high-resolution imaging of superficial structures, whereas AR-PAM systems enable deeper penetration at reduced resolution. Mesoscopic systems, such as raster-scan optoacoustic mesoscopy (RSOM), further trade resolution for depth, with wide field illumination and either single detectors or arrays of wideband, focused detectors [
8]. Macroscopic implementations also rely on wide field illumination and allow for a significant increase in penetration depth through narrowband detector arrays, at the expense of resolving power [
9].
With its strong ability to provide label-free imaging of the endogenous optical contrast of tissue, optoacoustic imaging promises great value for clinical use. Microscopic systems are naturally applicable in histological imaging scenarios [
10,
11,
12], whereas both microscopic and mesoscopic configurations are effective tools for imaging the skin and revealing indicators (inter alia, characteristics of the microvasculature, melanin, or lipid content) of dermatological conditions [
4,
13,
14]. Macroscopic systems, such as monochromatic optoacoustic tomography (OAT) or multispectral optoacoustic tomography (MSOT), have great clinical potential for applications including cancer imaging, vascular imaging, and imaging of inflammation, lipids/adipose tissue, and endocrine disorders [
15,
16,
17,
18,
19,
20,
21,
22]. Examples of optoacoustic images obtained in clinically relevant settings are shown in
Figure 2.
Figure 1.
Optoacoustic principle and imaging configurations.
Above, the optoacoustic effect, in which light absorbed by chromophores in tissue results in emission of ultrasound.
Below, categorization of optoacoustic imaging configurations into microscopic, mesoscopic, and macroscopic, with the corresponding tissue penetration depth limits and resolution orders of magnitude, as reported in relevant reviews [
8,
9,
23].
Many factors reduce or limit the quality of optoacoustic images, which, in turn, hinders the modality’s potential for clinical translation. These factors may emerge from the imaging hardware [
25,
26,
27,
28,
29,
30], the inexact or approximative image reconstruction algorithms [
25,
26,
31,
32,
33], the attenuative properties and inhomogeneous nature of tissue (light-tissue interaction phenomena) [
34,
35,
36,
37,
38,
39], or particularities of the acquisition procedure [
40,
41], and manifest as image noise, artifacts, and poor overall image fidelity.
Naturally, research activity in the fields of denoising and reconstruction quality enhancement has generated diverse approaches in all stages of the imaging pipeline, i.e., prior to, during, and after tomographic reconstruction. For such approaches to become part of what will eventually be standard technical procedures for clinical optoacoustic imaging, quality assessment is necessary. Quality assessment, or quality evaluation, is an important topic in medical imaging that aims to provide means for the performance evaluation of imaging systems and image restoration or compression algorithms [
42]. Establishing well-designed image quality assessment methodologies enables impartial comparison of heterogeneous image quality improvement techniques in the context of practical expectations, whereas identifying their strengths and limitations helps to establish trust and facilitates progress towards image quality enhancement and, finally, clinical translation.
A common initial step in the process of testing imaging technologies is validation in controlled settings, i.e., via numerical (computational) or physical imaging phantoms that are designed to replicate imaging scenarios of varying complexities; such techniques have been commonly employed in optoacoustic imaging [
43]. Numerical phantoms are used in digital simulation environments that offer convenience and flexibility in modifying parameters of the imaging system (e.g., properties of the ultrasonic detector) and the imaged targets (e.g., optical/acoustic properties of absorbers and surrounding media) that may affect image quality [
25], whereas physical phantoms provide more realistic testing conditions at the expense of flexibility. Preliminary testing with phantoms enables the assessment of imaging performance in terms of different, isolated characteristics of interest; a fundamental example relevant to optoacoustic imaging is the characterization of image resolution and sensitivity at various depths, by imaging thin absorbers embedded in a typically homogeneous medium, e.g., light absorbing wires inserted in porcine gelatin [
44,
45]. Additionally, absorbers arranged in a grid pattern can be used to measure geometric accuracy and identify distortions [
31], as well as to examine the uniformity of intensity across the field of view. Certain properties of simple phantoms, such as high optical absorbance of targets and low optical and acoustic attenuation of the enclosing medium, oversimplify imaging, and could, therefore, lead to falsely optimistic estimates of imaging performance. To this end, biologically realistic phantoms have been proposed [
46,
47]. These are fabricated using material that resembles real tissue in terms of morphology and acoustic and optical properties, providing a more complex and challenging testing environment.
Though testing in such controlled environments facilitates quantitative benchmarking and provides valuable insights, it cannot substitute for evaluation with images obtained in real-world clinical scenarios. An image denoising algorithm, for example, could achieve excellent results in images of simply structured phantom targets, but perform poorly in the presence of multiple unpredictable real-world variables, such as subject motion, reflections from out-of-plane absorbers, and inter-subject variability in anatomy or in other properties of tissue. Thus, during the assessment of image quality improvement techniques, the ultimate benchmark should be their effect on clinical utility, i.e., their capacity to reveal additional, clinically relevant content (e.g., known anatomic landmarks, characteristic biomarkers of disease, or other measurable features related to tissue function, such as blood oxygenation) in scans of human subjects in vivo. Proof-of-concept-level evidence of image quality improvement can be derived with one or a few subjects [
48,
49], whereas more mature and quantitative evaluation can be conducted by involving larger subject groups and performing statistical analyses [
50,
51].
Although a literature review on signal and image processing methodologies in optoacoustic imaging was recently published [
52], its focus was not on studies that involve human subjects, and the matter of quality assessment was not considered. To address this gap, we conducted a systematic literature search to identify existing approaches to optoacoustic image denoising and quality improvement that have been evaluated on human subjects in vivo. First, a concise summary of factors that limit optoacoustic image quality, as well as their manifestations in the image domain, is drawn from the identified body of study material. Second, a methodological overview is given, offering a technical outline of every individual approach, and a categorization into signal domain, image domain, and reconstruction or hybrid methodologies. Moreover, to deviate from the simple survey paradigm and provide additional value for upcoming research efforts, we identify subgroups of approaches to improving optoacoustic image quality with a common purpose and similar evaluation procedures, and critically analyze the subgroups’ efficacies; in each subgroup, the individual studies are comparatively rated according to a set of criteria designed to capture important aspects of optoacoustic image quality assessment. The individual criteria ratings are combined to yield comprehensive total ratings, which, after normalization by the subgroup mean rating, result in a final ranking of the studies. This enables identification of the most effective approaches to evaluating optoacoustic image quality to inform future work. In parallel, a broader examination of the distribution of individual criteria ratings reveals limitations and areas of potential improvement in the assessment of optoacoustic image quality. Finally, the primary findings are summarized in a set of recommendations for the comprehensive evaluation of optoacoustic image quality improvement approaches.
To the best of our knowledge, this is the first work to combine a technical overview with a critical, in-depth analysis of reported quality assessment procedures while having evaluation on humans as a prerequisite in optoacoustic imaging.
3. Material and Methods
3.1. Systematic Literature Search
We conducted a systematic search in PubMed, Scopus, Web of Science, IEEE Xplore, ACM Digital Library, and Google Scholar, for articles published from 1 January 2010 to 31 October 2021. The searched information fields included publication titles, abstracts, and keywords. A comprehensive search query consisting of three main clauses was used: (photoacoustic* OR “photo-acoustic” OR “photo-acoustics” OR optoacoustic* OR “opto-acoustic” OR “opto-acoustics”) AND (reconstruct* OR denois* OR noise OR artifact* OR artefact* OR clutter) AND (clinic* OR “hand-held” OR handheld OR “hand held” OR freehand OR “free-hand” OR “free hand” OR human OR humans OR patient OR patients OR volunteer OR volunteers OR subject OR subjects OR individual OR individuals OR participant OR participants OR man OR men OR woman OR women OR person OR people OR “in-vivo” OR “in vivo” OR experiment*).
In Google Scholar, publication titles were searched with the first two clauses only, due to its limited functionality. Records of ineligible publication types, such as book chapters or academic theses, were manually excluded. Offline inspection also revealed numerous falsely identified records whose retrieved information did not satisfy all search clauses. Such records were automatically identified and excluded from screening. This automatic process was only applied to records with available abstracts. Records whose abstracts were not retrieved were handled separately. The removal of duplicate records was performed in a semi-automated manner. Groups of duplicate records were automatically identified and subsequently manually inspected to eliminate false positives. The titles and abstracts of the remaining records were screened, and potentially eligible articles were sought for retrieval and full-text evaluation according to the following four inclusion criteria:
- (i)
The proposed method is employed in an imaging scenario.
- (ii)
Image quality improvement or noise reduction is the primary objective.
- (iii)
The proposed method functions entirely after the acquisition of optoacoustic data.
- (iv)
Evaluation of the proposed method with human subjects in vivo is reported.
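The automatic exclusion of falsely identified records described above can be illustrated with a minimal Python sketch. The record structure, the field names, and the abbreviated regular expressions are assumptions made here for illustration; they do not reproduce the full term lists of the search query or the tooling actually used.

```python
import re

# Abbreviated stand-ins for the three query clauses (assumption: the full
# term lists of the actual search query are longer than these patterns).
CLAUSES = [
    re.compile(r"photo-?acoustic|opto-?acoustic", re.IGNORECASE),
    re.compile(r"reconstruct|denois|noise|artifact|artefact|clutter", re.IGNORECASE),
    re.compile(r"clinic|hand[- ]?held|human|patient|volunteer|in[- ]vivo", re.IGNORECASE),
]


def satisfies_all_clauses(record):
    """Return True if title, abstract, and keywords jointly match every clause."""
    fields = ("title", "abstract", "keywords")  # hypothetical field names
    text = " ".join(record.get(field) or "" for field in fields)
    return all(pattern.search(text) for pattern in CLAUSES)


def split_records(records):
    """Partition records into kept, excluded, and manual-review lists."""
    kept, excluded, manual = [], [], []
    for record in records:
        if not record.get("abstract"):
            manual.append(record)    # no abstract retrieved: handled separately
        elif satisfies_all_clauses(record):
            kept.append(record)      # proceeds to title and abstract screening
        else:
            excluded.append(record)  # does not satisfy all search clauses
    return kept, excluded, manual
```

Records without a retrieved abstract are deliberately routed to a separate list rather than excluded, mirroring the separate handling described in the text.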
To eliminate potential bias during the selection of articles, the title and abstract screening process was performed independently by two authors (ID, LH). The same individuals collaborated in the full-text evaluation of the inclusion criteria. All disagreements were resolved by discussion, until consensus was reached. Forty-five eligible papers were finally selected, including one that was not retrieved by the search, but was identified in the references of a related, included work. An overview of the search and screening process is given in
Figure 5, which depicts the corresponding PRISMA 2020 diagram [
56].
3.2. Subgroup Analysis, Rating Criteria, and Procedure
For the systematic assessment of the included studies in terms of evaluation adequacy, a comparative rating approach was adopted. Initially, four authors (ID, AK, PS, LH) collaborated in the conceptualization and design of the rating process. A focus group of three authors (ID, AK, LH) then executed the rating process, with all papers evaluated cooperatively by all members of the group. All disagreements were resolved by discussion, until consensus was reached.
First, studies were classified according to their general motivation, forming subgroups of studies with a common purpose (i.e., intention to solve a common problem) and evaluation procedures with comparable characteristics. The subgroup analysis was essential to enable fair comparison, since the studies exhibited significant variety in the design of evaluation experiments. Thirty-six out of forty-five studies were finally included in the subgroup analysis; the remaining nine studies were not included either due to their purpose not suiting any subgroup, or due to the impossibility of meaningful comparison to the other studies of the subgroup. The subgroups are summarized in
Table 2.
Following the assignment of studies to the subgroups, the evaluation section of each paper was thoroughly analyzed, considering, in detail, the design and results of all reported experiments on numerical and physical phantoms, as well as on humans or animals in vivo. Through this analysis, a set of criteria was established, according to which the studies were compared and rated. The criteria, described in
Table 3, aim to comprehensively capture the important aspects of the evaluation procedure in an organized and analytic manner.
The rating procedure was independently conducted for each individual subgroup. Initially, a subset of applicable criteria was selected according to the specific characteristics of the included papers. In particular, C2 is only applicable in the SRES, SSPR, and SLVB subgroups, where reference methods could be identified in preceding research, and C6 is only applicable in the SSPR subgroup. The remaining criteria are applicable to all subgroups.
In all subgroups, except for SMOT, the experiments reported in each study were classified into three categories: numerical experiments, phantom experiments, and experiments with human subjects. Experiments involving animals were reported in only two studies [48,61], and added little new information to the evaluation; these experiments were, therefore, considered as complementary to studies with human subjects, and assigned to the category of experiments with human subjects. Each category of experiments was analyzed according to the applicable criteria, and ratings were given to each criterion according to the following scale: 0: absent, 1: lacking, 2: adequate, 3: ample, 4: thorough. The minimum possible increment was 0.5. A fundamental motivation behind the rating assignment procedure was to reflect the relative quality, with respect to each criterion, of the studies included in each subgroup. Therefore, it is more appropriate to consider the ratings as being relative, not absolute. The ratings given for each criterion were combined into a subtotal rating for each category of experiments, via weighted addition. All criteria contributed equally to the subtotal with a weight of 1, except for C6, which was empirically assigned a weight of 1/5 to reflect its lesser importance; it was generally considered a secondary criterion. The subtotals were combined into a total rating via weighted addition, followed by division by the sum of the weights of the applicable criteria, as shown in Figure 6. Subtotals corresponding to numerical and phantom experiments were empirically assigned a weight of 1/3 to emphasize the significance of experiments on human subjects.
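The weighted combination described above can be sketched in a few lines of Python. This is one literal reading of the text, not a reproduction of the exact formula in Figure 6: the weight of 1 for the human-subject subtotal is an assumption (only the 1/3 weights are stated), and the function and dictionary names are hypothetical.

```python
# Weights for the experiment categories. The 1/3 values are stated in the
# text; the weight of 1 for human-subject experiments is an assumption.
CATEGORY_WEIGHTS = {"numerical": 1 / 3, "phantom": 1 / 3, "human": 1.0}


def criterion_weight(name):
    """All criteria weigh 1, except the secondary criterion C6 (1/5)."""
    return 0.2 if name == "C6" else 1.0


def subtotal(ratings):
    """Weighted sum of criterion ratings for one category of experiments."""
    return sum(criterion_weight(c) * r for c, r in ratings.items())


def total_rating(ratings_by_category):
    """Weighted sum of subtotals divided by the summed criterion weights."""
    applicable = set().union(*(r.keys() for r in ratings_by_category.values()))
    weight_sum = sum(criterion_weight(c) for c in applicable)
    weighted = sum(
        CATEGORY_WEIGHTS[category] * subtotal(ratings)
        for category, ratings in ratings_by_category.items()
    )
    return weighted / weight_sum
```

Each rating is expected on the 0 to 4 scale in increments of 0.5; the sketch does not enforce this, and a study lacking a category of experiments would simply omit that key.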
The SMOT subgroup was handled slightly differently, as the studies primarily performed their evaluations via experiments with human subjects, i.e., the most appropriate way to draw useful conclusions in such a setting. In some cases, synthetic motion was added to scans obtained from either physical phantoms or human subjects for preliminary validation of the proposed methods. Such experiments were taken into consideration, but were not analyzed into separately rated criteria (as in the other subgroups) due to the simplicity they exhibited. Instead, they were handled as a single criterion, i.e., they were given a single rating and contributed to the total rating equally to the other criteria.
The ultimate objective of the rating procedure was to highlight studies that conducted, in their subgroup, relatively comprehensive and unbiased evaluations of their proposed methods for the improvement of optoacoustic image quality. To achieve this, the total rating was divided by the mean rating of the corresponding subgroup, yielding the normalized total rating, a metric of the deviation from the subgroup average. This metric was finally used to select a set of studies that surpass the threshold defined by the upper quartile (Q3) of the normalized total ratings distribution. An illustrative overview of the rating procedure is given in
Figure 6.
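The normalization and selection step can likewise be sketched as follows. The sketch assumes that the upper quartile is computed over the pooled normalized ratings of all subgroups and uses the "inclusive" quantile method of the Python standard library; the text does not specify either detail, and the subgroup and study labels in any input are placeholders.

```python
import statistics


def normalize_by_subgroup(totals_by_subgroup):
    """Divide each study's total rating by the mean rating of its subgroup."""
    normalized = {}
    for totals in totals_by_subgroup.values():
        subgroup_mean = statistics.mean(totals.values())
        for study, total in totals.items():
            normalized[study] = total / subgroup_mean
    return normalized


def featured_studies(normalized):
    """Return studies whose normalized total rating exceeds Q3."""
    # Assumption: Q3 is taken over the pooled distribution of all studies.
    q3 = statistics.quantiles(normalized.values(), n=4, method="inclusive")[2]
    return sorted(study for study, value in normalized.items() if value > q3)
```

A normalized total rating above 1 indicates a study rated above its subgroup average, so the Q3 threshold selects studies with the largest positive deviation.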
5. Discussion
5.1. Close Inspection of Featured Works
We initiate the discussion with a close inspection of the works that surpassed the selection threshold (
Figure 7,
Table 7), selecting and mentioning key characteristics that contributed to their deviation from the rest of the studies in the respective subgroup.
The work by Chowdhury et al. [31] stands out in SACM, primarily due to the experiments with both numerical and physical (printed) phantom targets that uniformly cover the entire field of view with point absorbers. Imaging such targets enabled objective baseline validation of the method’s correction capacity, and could constitute a universally applicable preliminary step to evaluate a wide range of quality enhancement approaches. The availability of a reference image for the printed phantom target allowed quantitative assessment of structural fidelity with the Structural Similarity Index Measure (SSIM) [93], applied usually in simulated settings only. Extensive simulations rendered the beamforming approach by Ma et al. [44] distinct in SRES; in addition to the typically used arrangement of point targets on the axial direction, a large circular source was simulated to assess shape distortion. Additionally, closely situated numerical and physical point targets were imaged to investigate improvement in the separability of adjacent structures, a commonly overlooked, but important, feature. In SSPR, the learning-based reconstruction technique by Hauptmann et al. [59] is also set apart, owing to numerical experiments in which a dataset of vessel-rich volumetric images was generated using a collection of lung CT scans to simulate optoacoustic measurement data. Such data may more closely resemble measurements obtained in real practice, and could be useful for the initial evaluation of various approaches, besides learning-based ones. Another advantage comes from the availability of multiple different images to evaluate with, making statistical quantitative analysis possible.
Two approaches [49,75] were featured in SMOT. The OR-PAM motion correction technique by Cheng et al. [75] was validated with a well-balanced set of experiments in both artificial and realistic settings. The former examined the robustness of displacement estimation to synthetically added noise at varying levels, as well as the tolerance to different magnitudes of misalignment. In the latter, both the back of a human hand and a palm were scanned, adding to the variety. The calculation of the PSNR and SSIM metrics between all consecutive pairs of adjacent B-scans provided a measure of inter-B-scan similarity, a useful means to quantify the increase in alignment and its consistency. Three distinct depth layers in the reconstructed volume were visualized separately to demonstrate the correction effects on the vascular networks of three different scales. Resolution was quantified with the full width at half-maximum (FWHM) metric at six randomly selected vessel profiles in each depth layer. This depth-wise qualitative and quantitative assessment showcases diligence and attention to detail, also seen in the evaluation of the RSOM motion correction algorithm by Schwarz et al. [49]. An interesting addition therein is the separation of low- and high-frequency sub-bands in the visualization and calculation of the depth-wise contrast-to-noise ratio (CNR), again facilitating assessment at different scales and depths. Nevertheless, what distinguishes this study further is the examination regarding quantification capacity and clinical utility; first, images of healthy skin and a psoriasis plaque were compared to observe that the broadening of capillary loops, a typical biomarker of psoriatic skin, was only visible after motion correction. In addition, using a multispectral RSOM configuration, blood oxygenation measurements across a single vessel were shown to exhibit less abrupt, more biologically plausible variations that were spatially associated with regions of vessel bifurcation. Though analysis on a single vessel might not be considered sufficient to provide significant evidence, this is a step in the right direction, exploiting the full potential of multispectral optoacoustic imaging.
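To make the notion of inter-B-scan similarity concrete, the following minimal Python sketch computes the PSNR between every pair of consecutive B-scans in a volume. It is an illustration of the general metric only, not the implementation used in [75]; the nested-list image representation, the function names, and the default peak value of 1.0 are assumptions.

```python
import math


def psnr(reference, test, peak=1.0):
    """PSNR in dB between two equally sized 2D images (nested lists)."""
    squared_errors = [
        (r - t) ** 2
        for row_r, row_t in zip(reference, test)
        for r, t in zip(row_r, row_t)
    ]
    mse = sum(squared_errors) / len(squared_errors)
    # Identical images have zero error, i.e., unbounded PSNR.
    return math.inf if mse == 0 else 10 * math.log10(peak ** 2 / mse)


def inter_bscan_psnr(volume, peak=1.0):
    """PSNR for each pair of consecutive B-scans in a volume."""
    return [psnr(a, b, peak) for a, b in zip(volume, volume[1:])]
```

Higher and more uniform values across the returned list would indicate better and more consistent alignment between adjacent B-scans, which is the property the metric is used to quantify before and after motion correction.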
Among the featured works, those that exhibited exemplary attention to clinical and quantification aspects are discussed in the following. The evaluation of the beamformer by Yang et al. [51] attained the highest rank in SRES, with the quantitative analysis in terms of discrimination capacity being a key determining factor; in an experiment involving a cohort of 28 patients, descriptive features of histograms of optoacoustic image regions corresponding to cancerous and non-cancerous ovaries were analyzed to test for a statistically significant difference between the two groups. In addition, a logistic regression analysis on sets of features was performed, and area under the receiver operating characteristic curve (AUROC) metrics were compared between methods. Overall, the proposed method performed favorably. In SLVB, attention to clinically relevant aspects also distinguished the two featured studies [50,81]. The prior-integrated reconstruction approach by Yang et al. [81] was first assessed visually by confirming an enhancement in clarity of the radial artery at three different depths, as well as in visibility of the carotid artery lumen and, if present, atherosclerotic plaque. Quantitatively, between groups of three healthy volunteers and five atherosclerotic patients, a statistically significant difference in the lipid content of the carotid region could only be detected with the proposed approach. Lastly, the quantification capacity, in the work of Steinberg et al. [50], was initially evaluated in vitro with multispectral measurements of an indocyanine green (ICG) tube in chicken breast tissue, demonstrating that sufficient correlation between the measured and the known reference ICG spectra could only be achieved using the proposed reconstruction technique. In the clinically relevant in vivo assessment, ten patients with suspected prostate cancer lesions were scanned with multiple optical wavelengths before and after injection of ICG. A statistically significant difference in optoacoustic amplitude in the prostate region, as well as a dependence between relative optoacoustic amplitude and ICG dose, was only observable in images reconstructed with the proposed method.
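For readers less familiar with the AUROC metric mentioned above, a minimal Python sketch is given below. It computes the AUROC directly from classifier scores as the probability that a randomly chosen positive case scores higher than a randomly chosen negative case, with ties counted as one half. This is the generic definition, not the analysis pipeline of [51]; the function name and the score lists are hypothetical.

```python
def auroc(positive_scores, negative_scores):
    """AUROC as the fraction of positive-negative pairs ranked correctly."""
    wins = 0.0
    for p in positive_scores:
        for n in negative_scores:
            if p > n:
                wins += 1.0   # positive case ranked above negative case
            elif p == n:
                wins += 0.5   # ties count as half a correct ranking
    return wins / (len(positive_scores) * len(negative_scores))
```

When comparing image quality improvement methods, the scores would come from a classifier (e.g., a logistic regression model) fitted to features extracted from the images produced by each method; a higher AUROC then suggests better discrimination capacity.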
5.2. Broader Assessment of the Analyzed Works Per Individual Criteria
Following the inspection of notable details in the featured studies, a broader assessment through the distribution of individual criteria ratings (Figure 8) is also valuable to reveal significant shortcomings and challenges. Concerning quantitative evaluation adequacy, expressed by criterion C3, the relatively large number of studies in the low end of the rating spectrum indicates a scarce availability of objective, quantitative measures of image fidelity. This is not unexpected, considering the absence of true reference images when imaging subjects in vivo. Remarkably, the distribution of C4 ratings clearly reveals an almost universal absence of evaluation in terms of anatomical verity, quantification capacity, or clinical utility. In other words, image quality was rarely examined in the context of clinical value. This highlights an area with great potential for improvement on the way to more mature evaluation procedures. Another significant observation is that little attention was given to the effect of absorber depth on image quality, as made evident by the ratings for criterion C5. However, the highly depth-dependent optical and acoustic attenuation properties of tissue call for more granular assessment, ideally at multiple depth levels, to fully investigate a method’s correction capacity.
5.3. Recommendations
In light of the above, a set of recommendations for the comprehensive evaluation of future research efforts in optoacoustic image quality improvement can be assembled. In the preliminary stage, experiments with numerical and physical phantoms may assess the resolution, existence of artifacts, and overall structural fidelity of the images. This would be greatly facilitated by widely available, standardized targets that cover the entire field of view with absorbers of multiple scales, in a variety of simple and complex geometric arrangements, and situated in realistic, lossy media. Such standardization would provide common references for an objective comparison of different methods, which is currently infeasible. Absorbing substances with known spectral responses could be involved to enable further validation of the quantification capacity in multispectral configurations. Following the preliminary phase, it is preferable to advance to evaluation in relevant clinical settings, designed with clinical utility in scope. Experiments with one or a few individuals can demonstrate possible improvement in clinical utility, especially if findings are reported at multiple depths and scales, and if the enhancement is visually obvious. Nevertheless, studies involving multiple participants and reporting quantitative, statistically significant findings will provide more substantial evidence, less prone to bias.
5.4. Limitations
Our work comes with limitations. It can be argued that the current design of the rating procedure does not establish an absolutely objective, infallible measure of study quality; an alternative set of weights for the individual criteria or the experiment categories could be proposed, potentially affecting the final selection of studies. Our choices reflect our intent to minimize subjective bias by weighting the primary criteria equally, given that they were elicited from, and were developed to be applicable to, works with a broad range of purposes, as well as to emphasize the importance of evaluation with human subjects. Future efforts could explore alternative designs and the standardization of such sets of criteria to establish thorough study quality control pipelines. Nevertheless, works that have demonstrated evaluation rigor can be reasonably expected to stand out in terms of deviation from their subgroup mean rating, as expressed by the normalized total rating scores. Furthermore, though the exclusive examination of studies that have reported human-involving experiments allowed us to maximize clinical relevance, future work could identify approaches with substantial potential for clinical application by seeking studies that have performed validation with animal subjects in vivo, probing into the domain of pre-clinical optoacoustics research.
5.5. Overall Perspective
From an overall perspective, with image quality improvement and assessment approaches in clinical optoacoustic imaging as the core domain of interest, this review retrieved relevant research material through an extensive, systematic search and screening process. An overview of limiting factors that contribute to optoacoustic image quality deterioration was presented; they were categorized into limitations stemming from the hardware, reconstruction algorithms, tissue, and acquisition process. The landscape of the identified image quality improvement approaches was outlined, with concise descriptions of each method’s purpose and key technical details. At a high level, the methods were clustered into signal-domain, image-domain, reconstruction, and hybrid techniques, depending on whether they were introduced prior to, after, or during image reconstruction.
Regarding image quality assessment, proper practices and prevalent shortcomings were sought. In a systematic analysis of the included material, subgroups of studies with common objectives and similarly structured evaluation procedures were composed. The evaluation sections of each study were extensively analyzed, yielding a set of criteria that enabled, within each subgroup, a comparative rating in terms of evaluation sufficiency. The rating process generated two primary outcomes. First, a selection of works that exhibited significant positive deviation from their subgroup average in terms of total rating was featured, prompting a discussion of the characteristics that made them stand out, in the context of the corresponding criteria. Such characteristics were identified in numerical, physical phantom, and human-involving evaluation experiments. Second, concentrating on the latter, the inspection of the criteria-wise distribution of ratings revealed improvement potential in the quantitative assessment; a substantial shortfall in depth-wise evaluation; and a wide disregard of anatomical verity, quantification capacity, and clinical utility aspects.