1. Introduction
Among the issues related to human vision, the processing of visually complex entities is one of the most important. Information processing is often based on local-to-global or global-to-local connections [1]. The local-to-global concept concerns transitions from the local details of a scene to a global configuration, while global-to-local works in the reverse order, from the global configuration towards the details. For example, a face recognition algorithm using the local-to-global approach starts by recognizing eyes, nose, and ears and finally infers the face configuration. Conversely, a global-to-local algorithm first identifies the face, which then leads to the identification of the details (eyes, nose, and ears). During recognition tasks, the global configuration of a scene plays a key role, especially when subjects see images for a short duration of time. Furthermore, humans leverage local information as an effective way to recognize scene categories. Higher-level visual perception theories distinguish individual elements at the local and global level, in which the information on many local components is perceptually grouped [
2]. Nodes and edges in a graph representation encode information so as to highlight relations among raw data. Many fields, such as computer vision and pattern recognition, adopt graph representations of data and the related manipulation algorithms. Specifically, in image processing, graphs are used to represent digital images in many ways. The standard approach partitions the image into dominant disjoint regions, where local and spatial features are encoded as nodes and edges, respectively. Local features describe the intrinsic properties of regions (such as shape, color, and texture), while spatial features provide topological information about the neighborhood. Image representation is one of the crucial steps for systems working in the image retrieval field. Modern Content-Based Image Retrieval (CBIR) systems essentially consider the image's basic elements (colors, textures, shapes, and topological relationships), extracted from the entire image, in order to provide an effective representation. Through the analysis of these elements, compositional structures are produced. Other systems, called Region-Based Image Retrieval [
3] (RBIR), focus their attention on specific image regions, instead of the entire content, to extract features. In this paper, a graph structure for image representation, called the Attributed Relational SIFT-based Regions Graph (ARSRG), is described, analyzed, and discussed with reference to previous works [4,5,6,7]. There are two main parts: an examination of the structure, through the definition of its components, and the collection and analysis of previously obtained results. The main goal of the ARSRG is to create a connection between local and global features of the image through a hierarchical description. Global features are extracted using a segmentation technique, while local features are based on a Local Invariant Feature Extraction (LIFE) method. The structure provides different information arising from image regions, topological relations among them, and local invariant features. In order to extract stable descriptors (robust to deformations), SIFT features were selected among all LIFE methods, as they extract salient points of an image in the scale-space. Moreover, new definitions and properties arising from a detailed analysis of the structure are introduced. This theoretical analysis has been helpful to go into the main and secondary components of the ARSRG, with the purpose of better understanding the construction and comparison phases. Finally, a wide experimental phase shows how the structure adapts to different application contexts. The latter involved the collection and in-depth analysis of results previously obtained in different fields, in terms of both image content and application. It was a crucial phase, because the goal was to identify the common aspects that mainly supported the theoretical basis. The paper is organized as follows:
Section 2 reviews related research on graph-based image representation, including the Scale-Invariant Feature Transform (SIFT) [
8].
Section 3,
Section 4 and
Section 5 are dedicated to ARSRG’s description, definitions, and properties. Experimental results and conclusions are respectively reported in
Section 6 and
Section 7.
2. Related Work
The literature reports many approaches that combine local and spatial information arising from SIFT features. Commonly, a graph structure encodes information about keypoints located in a certain position of an image. Nodes represent SIFT descriptors, while edges describe spatial relationships between different keypoints.
In [9], a graph G represents a set of SIFT keypoints extracted from the image and is defined as:

G = (V, D, A),

where v_i ∈ V is a node related to a SIFT keypoint with position (x_i, y_i), d_i ∈ D is the SIFT descriptor attached to node v_i, and A is the adjacency matrix. If A(i, j) = 1, the nodes v_i and v_j are adjacent, A(i, j) = 0 otherwise.
In [10], the authors combined the local information of SIFT features with global geometrical information in order to estimate a robust set of feature matches. This information is encoded using a graph structure:

G = (V, B, D),

where v ∈ V is a node associated with a SIFT keypoint, B is the adjacency matrix, with B(v, v') = 1 if the nodes v and v' are connected and B(v, v') = 0 otherwise, while d_v ∈ D is the SIFT descriptor associated with node v.
In [11], nodes were associated with N image regions arranged on an image grid, while edges connect each node with its four neighbors. The basic elements are not pixels but regions extended in the x (horizontal) and y (vertical) directions. The nodes are identified through their coordinates on the grid, so the spatial information associated with a node is its pair of indices (i, j). Furthermore, a feature vector f_{i,j} is associated with the corresponding image region and, therefore, with a node. The image is divided into overlapping regions of 32 × 32 pixels. Four 128-dimensional SIFT descriptors are extracted from each region and concatenated.
In [12], the graph-based image representation included SIFT features, MSER [13], and Harris-affine [14] features. Given two graphs G_1 = (V_1, E_1, A_1) and G_2 = (V_2, E_2, A_2), representing images I_1 and I_2, V is the set of nodes (the image features extracted), E the set of edges (the features' spatial relations), and A the set of attributes (the information associated with the extracted features).
In [15], SIFT features were combined in the form of a hyper-graph. A hyper-graph H = (V, E, A) is composed of nodes V, hyper-edges E, and attributes A related to the hyper-edges. A hyper-edge e encloses a subset of nodes of size k from V, where k represents the order of the hyper-edge.
In [16], an approach to 3D object recognition was presented. A graph matching framework is used to enable the utilization of SIFT features and to improve robustness. Unlike standard methods, test images are not converted into finite graphs through discretization or quantization operations; instead, the continuous graph space of the test image is explored at detection time. To this end, local kernels are applied to index image features and to enable fast detection.
In [17], an approach to the feature matching problem, with application to scene recognition and topological SLAM, was proposed. For this purpose, the scene images are encoded using a dedicated data structure. The image representation is built in two steps: image segmentation using the JSEG [18] algorithm and invariant feature extraction combining MSER and SIFT descriptors.
In [19], SIFT features selected on the basis of visual saliency were extracted to construct object models. A Class Specific Hypergraph (CSH) was introduced to model objects in a compact way. The hypergraphs are built on different Delaunay graphs, each created from a set of selected SIFT features using a single prototype image of an object. Using this approach, the object models can be represented through a minimum number of object views.
The authors in [20] provided a solution to the object recognition problem by representing images through a SIFT-based graph structural expression. A graph structure is created by connecting SIFT keypoints with lines. G = (V, E, X) represents the graph: the set E represents the edges; the set V represents the vertices; and the set X the associated SIFT descriptors. A node represents a keypoint detected by the SIFT algorithm, and its label is the 128-dimensional SIFT descriptor. An edge e_{ij} connects two nodes v_i and v_j. The graph is complete if all keypoints extracted from the image are connected among them. Formally, the set of edges is defined so that an edge (v_i, v_j) exists only when the distance between the keypoints' spatial coordinates p_i and p_j, relative to their scales σ_i and σ_j, is below a threshold value τ. An edge is not created when this value is greater than the threshold τ, so no extra edges are introduced. This formulation of the proximity graph reduces the computational complexity and, at the same time, improves the detection performance.
In [21], the median K-Nearest-Neighbor (K-NN) graph G = (V, E) was defined. A vertex v_i is created for each of the N points x_i, with i = 1, …, N. Furthermore, a non-directed edge (v_i, v_j) is created when x_j is one of the K closest neighbors of x_i and the distance between x_i and x_j does not exceed med, the median of all distances between pairs of vertices:

med = median{ dist(x_i, x_j) : 1 ≤ i < j ≤ N }.

During K-NN graph construction, a vertex can remain completely disconnected if there are no K vertices that support the structure. The graph has the adjacency matrix A, where A(i, j) = 1 when (v_i, v_j) ∈ E and A(i, j) = 0 otherwise.
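The median K-NN construction above can be sketched in a few lines of Python. This is a brute-force illustration under the definitions just given; the function name and tie-breaking are our own, not from [21]:

```python
from itertools import combinations
import math

def median_knn_graph(points, k):
    """Median K-NN graph: an undirected edge (i, j) is added when j is
    among the k nearest neighbors of i AND their distance does not
    exceed the median of all pairwise distances."""
    n = len(points)
    dist = lambda a, b: math.dist(a, b)
    # Median of all pairwise distances.
    pair_d = sorted(dist(points[i], points[j])
                    for i, j in combinations(range(n), 2))
    med = pair_d[len(pair_d) // 2]
    adj = [[0] * n for _ in range(n)]
    for i in range(n):
        # The k nearest neighbors of point i (excluding itself).
        nbrs = sorted((j for j in range(n) if j != i),
                      key=lambda j: dist(points[i], points[j]))[:k]
        for j in nbrs:
            if dist(points[i], points[j]) <= med:
                adj[i][j] = adj[j][i] = 1   # non-directed edge
    return adj
```

Note that, as stated above, an isolated point whose k neighbors all lie beyond the median distance remains disconnected.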
3. Attributed Relational SIFT-Based Regions Graph
In this section, the Attributed Relational SIFT-based Regions Graph (ARSRG) is defined based on two main steps: feature extraction and graph construction. Feature extraction consists of Region of Interest (ROI) extraction from the image through a segmentation algorithm. Connected components in the image are then detected with the aim of building the
Region Adjacency Graph (RAG) [
22], to describe the spatial relations between image regions. Simultaneously, SIFT [
8] descriptors, which ensure invariance to rotation, scaling, translation, illumination changes, and projective transforms, are extracted from the original image. Graph construction then builds the graph structure itself. Three levels can be distinguished in the ARSRG:
root node,
RAG nodes, and
leaf nodes. At the first level, the
root node represents the image and is connected to all
RAG nodes at the second level. Adjacency relationships among different image regions are encoded through
RAG nodes. Thus, adjacent regions in the image are represented by connected nodes. In addition, each
RAG node is connected to the
root node at the higher level. Finally, SIFT descriptors extracted from the image are represented through
leaf nodes. At the third level, two types of configurations can appear:
region based and
region graph based. In the
region-based configuration, a keypoint is associated with a region based on its spatial coordinates, whereas the
region graph-based configuration describes keypoints belonging to the same region connected by edges (which encode spatial adjacency). Below, the steps of feature extraction and graph construction are described in detail.
3.1. Feature Extraction
3.1.1. Region of Interest Extraction
ROIs are extracted from the image using a segmentation algorithm called JSEG [18]. JSEG performs segmentation in two steps: color quantization and spatial segmentation. The first step consists of a coarse quantization that does not degrade the image quality significantly. The second step performs a spatial segmentation on the class-map without considering the color similarity of the related pixels.
3.1.2. Labeling Connected Components
The next step involves the labeling of connected components on the segmentation result. A connected component is an image region consisting of contiguous pixels of the same color. The process of connected component labeling of an image B produces an output image whose pixels contain labels (positive integers or characters). A label is a symbol naming an entity exclusively. Regions connected under the four-neighborhood or eight-neighborhood will receive the same label (represented in Algorithms 1 and 2 by the variable m, containing a numerical value). Algorithm 1 shows a version of connected component labeling.
Algorithm 1. Connected component labeling.
Require: I - image to label
Ensure: I - labeled image
1: m = 0
2: for i = 1 : height(I) do
3:   for j = 1 : width(I) do
4:     if I[i][j] == 0 then
5:       m = m + 1
6:       label(I, i, j, m)   (Algorithm 2)
7:     end if
8:   end for
9: end for
10: return I
Algorithm 2. label - recursive labeling of the eight-neighborhood.
Require: I - image to label; (i, j) - pixel index; m - label
Ensure: ⊘
1: if I[i][j] == 0 then
2:   I[i][j] = m
3:   label(I, i − 1, j − 1, m)
4:   label(I, i − 1, j, m)
5:   label(I, i − 1, j + 1, m)
6:   label(I, i, j − 1, m)
7:   label(I, i, j + 1, m)
8:   label(I, i + 1, j − 1, m)
9:   label(I, i + 1, j, m)
10:   label(I, i + 1, j + 1, m)
11: end if
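A runnable counterpart of Algorithms 1 and 2 can be sketched in Python. The recursion of Algorithm 2 is replaced here by an explicit stack to avoid recursion-depth limits; as in the pseudocode, pixels equal to zero are treated as unlabeled:

```python
def label_components(img):
    """Eight-connected component labeling (Algorithms 1 and 2):
    unlabeled pixels are 0; each component receives a positive integer
    label, assigned in place."""
    h, w = len(img), len(img[0])
    m = 0
    for i in range(h):
        for j in range(w):
            if img[i][j] == 0:
                m += 1                      # new label for a new component
                stack = [(i, j)]            # explicit stack instead of recursion
                while stack:
                    y, x = stack.pop()
                    if 0 <= y < h and 0 <= x < w and img[y][x] == 0:
                        img[y][x] = m
                        # push the eight neighbors (eight-neighborhood)
                        stack.extend((y + dy, x + dx)
                                     for dy in (-1, 0, 1)
                                     for dx in (-1, 0, 1)
                                     if (dy, dx) != (0, 0))
    return img
```

A real segmentation map would be labeled one color class at a time; the sketch keeps the pseudocode's convention of a single unlabeled value.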
3.1.3. Region Adjacency Graph Structure
The second level of the ARSRG hosts a graph-based image representation named the
Region Adjacency Graph (RAG) [
22]. In Algorithm 3, a pseudocode version of the
RAG algorithm is shown. A region represents an elementary component of the image, based on the image segmentation result. In the
RAG, a node is a region, and an edge describes the adjacency between two nodes.
RAG is built with reference to the spatial relations between regions. Two regions are defined as spatially close if they share a boundary. In this regard, a neighborhood check of the labeled region is performed between the label l1 of the pixel under consideration (Lines 2–4) and the labels of the pixels belonging to its neighborhood (the eight directions included in the eight-neighborhood). If the latter contain a label l2 different from l1, then there are two adjacent regions to be represented in the RAG. The RAG is defined as a graph G = (V, E), where the nodes in V are regions and the edges in E are the boundaries that connect them. G is encoded through the adjacency matrix A (Line 5), which describes the topology of the graph connections: if A[l1][l2] contains one, the regions labeled l1 and l2 are adjacent in the image. Moreover, one of the main properties of the RAG is its invariance to translation and rotation, useful for a high-level image representation.
Algorithm 3. RAG construction.
Require: L - labeled image
Ensure: A - region adjacency matrix
1: A = 0
2: for each pixel (i, j) of L do
3:   for each pixel (i′, j′) in the eight-neighborhood of (i, j) do
4:     if L[i][j] ≠ L[i′][j′] then
5:       A[L[i][j]][L[i′][j′]] = 1
6:     end if
7:   end for
8: end for
9: return A
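Algorithm 3 can also be sketched as runnable Python. The function name and the assumption that labels run from 1 to n_regions are illustrative:

```python
def build_rag(labels, n_regions):
    """Region Adjacency Graph from a labeled image: A[r1][r2] = 1 when
    regions r1 and r2 share a boundary under the eight-neighborhood.
    Labels are assumed to be 1..n_regions (index 0 is unused)."""
    h, w = len(labels), len(labels[0])
    A = [[0] * (n_regions + 1) for _ in range(n_regions + 1)]
    for i in range(h):
        for j in range(w):
            # scan the eight-neighborhood of pixel (i, j)
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    y, x = i + dy, j + dx
                    if (dy, dx) != (0, 0) and 0 <= y < h and 0 <= x < w:
                        a, b = labels[i][j], labels[y][x]
                        if a != b:
                            A[a][b] = A[b][a] = 1   # adjacent regions
    return A
```

The symmetric assignment makes the invariance to translation and rotation mentioned above explicit: only which labels touch matters, not where they sit in the image.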
3.1.4. Scale-Invariant Feature Transform
SIFT [
8] descriptors are extracted to ensure invariance to rotation, scaling, translation, partial illumination changes, and projective transforms in the image description. SIFT computation is performed during the feature extraction phase, as a task parallel to RAG creation.
3.2. Graph Construction
The ARSRG building process consists of the creation of three levels:
Root node: the node located at the first level of the graph structure and representing the image. It is connected to all nodes at the next level.
Region Adjacency Graph (RAG) nodes: adjacency relations among different image regions based on the segmentation result. Thus, adjacent image regions are represented by nodes connected at this level.
Leaf nodes: The set of SIFT features extracted from the image. Two types of connections are provided:
- (a)
Region based: A leaf node represents a SIFT keypoint obtained during feature extraction. Each leaf node-keypoint is associated with a region based on its spatial coordinates in the image. At this level, each node is connected to just one
RAG higher level node (
Figure 1a).
- (b)
Region graph based: In addition to the previous configuration, leaf nodes-keypoints belonging to the same region are connected by edges, which encode spatial adjacency, based on thresholding criteria (
Figure 1b).
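The region-based configuration, i.e., attaching each keypoint to the region whose label covers its spatial coordinates, can be sketched as follows (a simplified illustration; the function name and the tuple layout are our own):

```python
def assign_keypoints(labels, keypoints):
    """Region-based leaf configuration: each SIFT keypoint is attached
    to the region whose label covers its (x, y) position.
    `keypoints` is a list of (x, y, descriptor) tuples; the result maps
    each region label to the list of its keypoints."""
    regions = {}
    for x, y, desc in keypoints:
        # row index = y, column index = x, rounded to the nearest pixel
        r = labels[int(round(y))][int(round(x))]
        regions.setdefault(r, []).append((x, y, desc))
    return regions
```

Each leaf node ends up connected to exactly one RAG node, as required by the region-based configuration; the region graph-based variant would additionally link keypoints within each list.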
4. Formal Definitions
This section introduces detailed definitions for the purpose of formally specifying the ARSRG structure. Definitions 1 and 2 describe the components related to the two configurations (as shown in
Figure 1). Definitions 3 and 4 define the sets of attributes associated with the nodes, through the functions introduced by Definitions 5 and 6, of the second and third level. Definitions 7 and 8 introduce connection structures for SIFTs. From Definitions 9 to 11, different types of edges between the levels are described. Finally, Definitions 12 and 13 include the support structures for the second and third level. The ARSRG structure is defined based on two leaf node configurations.
Definition 1. ARSRG (first leaf nodes' configuration): G is defined as a tuple (V_r, E_r, V_s, E_v), where:
V_r, the set of region nodes.
E_r, the set of undirected edges, where E_r ⊆ V_r × V_r and e = (v_i, v_j) is an edge that connects nodes v_i, v_j ∈ V_r.
V_s, the set of SIFT nodes.
E_v, the set of directed edges, where E_v ⊆ V_r × V_s and e = (v_i, v_j) is an edge that connects the source node v_i ∈ V_r and the destination node v_j ∈ V_s.
Definition 2. ARSRG (second leaf nodes' configuration): G is defined as a tuple (V_r, E_r, V_s, E_v, E_s), where:
V_r, the set of region nodes.
E_r, the set of undirected edges, where E_r ⊆ V_r × V_r and e = (v_i, v_j) is an edge that connects nodes v_i, v_j ∈ V_r.
V_s, the set of SIFT nodes.
E_v, the set of directed edges, where E_v ⊆ V_r × V_s and e = (v_i, v_j) is an edge that connects the source node v_i ∈ V_r and the destination node v_j ∈ V_s.
E_s, the set of undirected edges, where E_s ⊆ V_s × V_s and e = (v_i, v_j) is an edge that connects nodes v_i, v_j ∈ V_s.
ARSRG structures, in the first and second leaf node configurations, are created based on Definitions 1 and 2. The nodes belonging to the sets V_r and V_s are associated with features extracted from the image. In particular:
Definition 3. A_r is a set of vector attributes associated with the nodes in V_r. An element, a_v ∈ A_r, is associated with a node v of the ARSRG structure at the second level. It contains the region dimension (in pixels).
Definition 4. A_s is a set of vector attributes associated with the nodes in V_s. An element, a_v ∈ A_s, is associated with a node v of the ARSRG structure at the third level. It contains a SIFT descriptor.
The association between features and nodes is performed through assignment functions defined as follows:
Definition 5. The node-labeling function f_r : V_r → A_r assigns a label to each node of the ARSRG at the second level. The node label is a feature attribute extracted from the image; its value is the dimension of the region (number of pixels). The labeling of a node v occurs during the ARSRG construction process.
Definition 6. The SIFT node-labeling function f_s : V_s → A_s assigns a label to each node of the ARSRG at the third level. The node label is a feature vector d, the keypoint descriptor, extracted from the image. The labeling procedure of a node checks the position of the keypoint in the image against the region to which it belongs.
Furthermore, the RAG nodes are doubly linked in horizontal order, among themselves, and in vertical order, with the leaf nodes. Horizontal edges are all undirected, while vertical edges are all directed from top to bottom. The root node maintains a list of edges outgoing to RAG nodes. Each RAG node maintains three linked lists of edges: one for edges to other RAG nodes, one for edges outgoing to leaf nodes, and one for the edge ingoing from the root node. Finally, each leaf node maintains two linked lists of edges: one for edges ingoing from RAG nodes and one for edges to other leaf nodes. The edges in each list are ordered based on the distances between their end nodes: shorter edges come first. These lists of edges have a direct geometrical meaning: each node is connected to other nodes along one of the directions left, right, top, and bottom.
A very important aspect concerns the organization of the third level of the ARSRG structure. To this end, the SIFT Nearest-Neighbor Graph (SNNG) is introduced.
Definition 7. An SNNG is defined as a graph (V_snng, E_snng):
V_snng: the set of nodes associated with SIFT keypoints;
E_snng: the set of edges, where for each v_i ∈ V_snng, an edge (v_i, v_p) exists if and only if dist(v_i, v_p) ≤ τ. dist is the Euclidean distance applied to the x and y positions of the keypoints in the image; τ is a threshold value; and p runs from one to k, k being the size of V_snng.
This notation is very useful during the matching phase. Indeed, each SNNG indicates the set of SIFT features belonging to an image region, with reference to Definition 2, and represents SIFT features organized from the local and spatial points of view. A different version of the SNNG is called the complete SIFT Nearest-Neighbor Graph (SNNG_c).
Definition 8. An SNNG_c is defined as a graph (V_snng, E_snng):
V_snng: the set of nodes associated with SIFT keypoints;
E_snng: the set of edges, where for each v_i ∈ V_snng, an edge (v_i, v_p) exists if and only if dist(v_i, v_p) ≤ τ. dist is the Euclidean distance applied to the x and y positions of the keypoints in the image; τ is a threshold value; and p runs from one to k, k being the size of V_snng. In this case, τ is greater than the maximum distance between keypoints, so the graph is complete.
Another important aspect concerns the difference between vertical and horizontal relationships among nodes in the ARSRG structure. Below, these relations, edges, are defined.
Definition 9. A region horizontal edge e = (v_i, v_j), with v_i, v_j ∈ V_r, is an undirected edge that connects nodes at the second level.
Definition 10. A SIFT horizontal edge e = (v_i, v_j), with v_i, v_j ∈ V_s, is an undirected edge that connects nodes at the third level.
Definition 11. A vertical edge e = (v_i, v_j), with v_i ∈ V_r and v_j ∈ V_s, is a directed edge that connects the source node v_i to the destination node v_j.
As can be seen, horizontal and vertical edges connect nodes of the same and different levels, respectively. Finally, these relations are represented through adjacency matrices defined below.
Definition 12. The binary region adjacency matrix M_r describes the spatial relations among RAG nodes. An element M_r(i, j) defines an edge, e = (v_i, v_j), connecting nodes v_i, v_j ∈ V_r. Hence, an element M_r(i, j) is set to one if node v_i is connected to node v_j, and zero otherwise.
Definition 13. The binary SIFT adjacency matrix M_s describes the spatial relations among leaf nodes. An element M_s(i, j) defines an edge, e = (v_i, v_j), connecting nodes v_i, v_j ∈ V_s. Hence, an element M_s(i, j) is set to one if node v_i is connected to node v_j, and zero otherwise.
Figure 2 shows the two different ARSRG structures on a sample image.
5. Properties
In this section, the ARSRG structure’s properties arising from the feature extraction and graph construction steps are highlighted.
Region features and structural information: The main goal of the ARSRG structure is to connect regional features and structural information. The first step concerns image segmentation in order to extract ROIs. This is a step towards the extraction of semantic information from a scene. Once the image has been segmented, the RAG structure is created. This feature representation highlights individual regions and spatial relations existing among them.
Horizontal and vertical relations: The ARSRG structure presents two types of relations (edges) between image features: horizontal and vertical. Vertical edges define the image topological structure, while horizontal edges define the spatial constraints for node (region) features. Horizontal relations (Definitions 9 and 10) concern the ROIs and SIFT features located at the second and third levels of the structure. Their general goal is to provide information about spatial closeness, to define spatial constraints on the node attributes, and to characterize the feature map at a specific resolution level (detail) of a given image; they can be differentiated according to computational complexity and occurrence frequency. Their order is bounded by n, where n is the number of features specified through the relations. In a different way, vertical relations (Definition 11) concern the connections between individual regions and their features. The vertical directed edges connect nodes between the second and third levels of the ARSRG (RAG nodes to leaf nodes) and provide a parent-child relationship. In this context, the role of the ARSRG structure is to create a bridge between the defined relations. This aspect leads to some advantages, i.e., the possibility to explore the structure both in breadth and in depth during the matching process.
Region features invariant to point of view, illumination, and scale: Building local invariant region descriptors is a hot research topic with applications such as object recognition, matching, and reconstruction. Over the last few years, great success has been achieved in designing descriptors invariant to certain types of geometric and photometric transformations. Local Invariant Feature Extraction (LIFE) methods extract stable descriptors starting from a particular set of characteristic regions of the image. LIFE methods were chosen for region representation in order to provide invariance to certain conditions. These local representations, created using information extracted from each region, are robust to certain image deformations such as illumination and viewpoint changes. The ARSRG structure includes SIFT features, identified in [23] as the most stable representation among the different LIFE methods.
Advantages due to detailed information located on different levels: The detailed image description, provided by the ARSRG structure, represents an advantage during the comparison phase. In a hierarchical way, the matching procedure explores global, local, and structural information, within the ARSRG. The first step involves a filtering procedure for regions based on size. Small regions, containing poor information, are removed. Subsequently, the matching procedure goes to the next level of the ARSRG structure, analyzing features of single regions to obtain a stronger match. The goal is to solve the mapping on multiple SNNGs (Definition 7) of the ARSRGs. In essence, this criterion identifies partial matches among SNNGs belonging to ARSRGs. During the procedure, different combinations of graph SNNGs are identified, and a hierarchy of the matching process is constructed. In this way, the overall complexity is reduced, which is expected to show a considerable advantage especially for large ARSRGs.
Advantages due to matching region-by-region: Region-Based Image Retrieval (RBIR) [
24] systems work with the goal of extracting and defining the similarity between two images based on regional features. It has been demonstrated that users focus their attention on specific regions rather than on the entire image, and region-based image representation has proven to be closer to human perception. In this context, in order to compare ARSRG structures, a region matching scheme based on the appearance similarities of the image segmentation results can be adopted. The region matching algorithm exploits the regions provided by segmentation and compares the features associated with them. The pairwise region similarities are computed from the sets of SIFT features belonging to the regions. The matching procedure is asymmetric: the input image is segmented into regions, and its groups of SIFT keypoints are matched within a consistent portion of the other image. In this way, the segmentation result is used to create regions of candidate keypoints, avoiding incompatible regions for two images of the same scene.
False match removal: One of the main issues of LIFE methods concerns the removal of false matches. It has been shown that LIFE methods produce a number of false matches during the comparison phase that significantly affect accuracy. The main reason is the lack of correspondence among image features (for example, due to partial background occlusion of the scene). Standard similarity measures based on the feature descriptors are widely used, even though they rely only on region appearance; in some cases, they are not sufficiently discriminating to ensure correct matches. This problem is more relevant in the presence of low or homogeneous textures and leads to many false matches. The application of the ARSRG structure provides a solution to this problem: in order to reduce false matches, small ARSRG region nodes and the associated SIFT descriptors are removed. Indeed, small regions and their associated features are not very informative, either for image description or for matching. The ratio test [8] or graph matching [25] can then be applied to compare the remaining regions. This filtering procedure has a strong impact on the experiments, resulting in a relevant accuracy improvement.
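The ratio test cited above can be sketched as follows; this is a simplified brute-force version of Lowe's criterion [8], with an illustrative threshold of 0.8 and our own function name:

```python
import math

def ratio_test_matches(desc_a, desc_b, ratio=0.8):
    """Ratio test: a descriptor in image A is matched to its nearest
    neighbor in image B only when the nearest distance is clearly
    smaller than the second-nearest one, which filters out ambiguous
    (likely false) matches."""
    matches = []
    for i, d in enumerate(desc_a):
        # distances from descriptor i to every descriptor of image B
        dists = sorted((math.dist(d, e), j) for j, e in enumerate(desc_b))
        if len(dists) >= 2 and dists[0][0] < ratio * dists[1][0]:
            matches.append((i, dists[0][1]))   # accepted match (i -> j)
    return matches
```

In the ARSRG setting, such a test would be run region by region, after the small-region filtering described above, rather than over all keypoints of the whole image.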