Article

Simplified Machine Learning Model as an Intelligent Support for Safe Urban Cycling

by Alejandro Hernández-Herrera 1, Elsa Rubio-Espino 1,*, Rogelio Álvarez-Vargas 2 and Victor H. Ponce-Ponce 1
1 Centro de Investigación en Computación, Instituto Politécnico Nacional, Av. Juan de Dios Bátiz, Esq. Miguel Othón de Mendizábal, Col. Nueva Industrial Vallejo, Alcaldía Gustavo A. Madero, Mexico City C.P. 07700, Mexico
2 ProfTech Servicios, S. A. de C. V., Semilla 2, Col. Arquitos, Querétaro C.P. 76048, Mexico
* Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(3), 1395; https://doi.org/10.3390/app15031395
Submission received: 19 December 2024 / Revised: 17 January 2025 / Accepted: 18 January 2025 / Published: 29 January 2025
(This article belongs to the Special Issue Road Safety in Sustainable Urban Transport)

Abstract

Urban cycling is a sustainable mode of transportation in large cities and offers many advantages: it is eco-friendly, accessible to the population, easy to use, more economical than other means of transport, and beneficial to both physical health and mental well-being. Achieving sustainable mobility and the evolution towards smart cities demand a comprehensive analysis of all the essential aspects that enable their inclusion. Road safety is particularly important and must be prioritized to ensure safe transportation and reduce the incidence of road accidents. To help reduce the number of accidents involving urban cyclists, this work proposes an alternative solution in the form of an intelligent computational assistant that uses simplified machine learning (SML) to detect potential risks of unexpected collisions. Through our methodology, we identified the research problem, designed and developed the solution proposal, collected and analyzed data, and obtained preliminary results. These results experimentally demonstrate that the proposed model outperforms most state-of-the-art models that use a metric learning layer on small image sets.

1. Introduction

1.1. Urban Mobility

Population growth is undeniably steady, and in Mexico over the last 25 to 30 years, this growth, combined with very poor urban planning, has produced severe vehicle congestion in the nation’s large cities. As a result, a daily trip from one point to another, or to the outskirts of these large cities, takes between 1 and 3 h on average, when it could normally be completed in 30 min to 1 h.
Faced with these circumstances, sustainable mobility has become a highly relevant issue in planning urban mobility systems, since it is a model that promotes means of transport that are environmentally friendly, inclusive, and accessible [1]. In this research, we mainly consider three means of mobility that, by their nature and place in the mobility hierarchy [2], are considered sustainable: walking, urban cycling, and public transport, the last of which comprises what is considered common public transport in Mexico, such as the metro, metrobus, light rail, commuter train, trolleybus, and cable car. This type of mobility brings multiple benefits to the environment by avoiding large gas emissions and energy waste, and even by helping to reduce the carbon dioxide footprint in the atmosphere. With respect to social benefits, since these are collective and individual means of transportation, they significantly promote inclusion and are viable options for making long-distance journeys at a reduced cost.
Among these three means of mobility, urban cycling is worth highlighting, since it is a transportation choice that lightens the vehicle load in road congestion. It is an ecological option because it promotes a significant reduction in harmful gases such as CO2, and it is an accessible, easy-to-use means of transportation that, at the same time, contributes to improving both emotional and physical health. However, this alternative requires more attention from the authorities in charge of planning mobility in large cities: exclusive lanes are insufficient; public access systems are limited and costly for a sector of the population; there are not enough spaces to park or store bicycles; there is inequality in the spaces available for travel; and urban cyclists are not given priority in terms of road safety.

1.2. Road Safety for Cyclists

Pedestrians and cyclists head the mobility hierarchy (see Figure 1), which classifies modes of transportation according to the vulnerability of their users, as well as the negative and positive peculiarities of each mode. It is worth mentioning that the negative peculiarities include the possible risks that a given mode represents for users in other hierarchies.
To promote the use of bicycles, it is necessary to guarantee safety conditions through a focus on preventing road events that cause deaths and injuries. The philosophy of the Vision Zero road safety concept [3] is based on the simple fact that we are all human beings and, therefore, make mistakes. This is why there is a need to generate systems that support the cyclist during their road trajectory, thereby reducing the possibility that these errors end in injuries and deaths. Institutions such as the World Health Organization (WHO) point out that in the event of a transport accident, the population of cyclists and pedestrians is more vulnerable and their safety is strategic in a global culture that needs to intensify sustainable travel [4].
In Mexico, as in other countries, we face a great public health challenge as a result of injuries generated by poor road safety. Statistically, it has been reported that traffic injuries are the leading cause of death among children between the ages of 4 and 14 years. Among the highest percentages of deaths are those of young people between 15 and 30 years old (32.2%), followed by the group between 30 and 44 years old (25.5%) [5], which has a strong impact on the economy and emotional stability of families, as well as society as a whole.

1.3. Intelligent Urban Cycling

Although the bicycle exists in the collective consciousness as a pleasant element associated with a livable city, most citizens would use it for their trips only if there were safe, coherent infrastructure and pleasant environments. In this sense, technology can currently serve as a cyclist’s assistant, allowing them to travel more safely during the trip.
Currently, the aim is to effectively incorporate urban cycling into the transportation networks of so-called smart cities and to promote more modern, clean, and safe modes of transportation. This requires establishing an individual mobility model that, through the collection and analysis of various types of data, generates information that integrates smart mobility [6,7,8].
Correspondingly, the organization of this paper is arranged based on the following sequence:
  • Section 2 outlines the intuitive information presented by the problem and the main research method adopted to identify the dominant characteristics and advantages of simplified machine learning for the task of identifying unexpected collision risks, which is crucial to the training and testing procedures that maximize the effectiveness of the model’s classification.
  • Section 3 explains the methodology used and gives an essential description of the proposed solution, including the architecture of the proposed cognitive model and its main parts.
  • Section 4 describes the experimental stage used to observe the performance of the model with different datasets and provides comparative tables and graphs that illustrate its performance with different feature extractors and in each of the sample regimes used (one-shot and five-shot).
  • Section 5 discusses notable results related to the evaluation of the proposed model, its generalization capacity, and its comparison against other state-of-the-art methods, as well as other particular aspects of its operation and performance.
  • Section 6 presents the main research conclusions.
Generally, the paper’s contribution is reflected in the following points:
1. Reducing time, effort, and costs related to the number of examples or training samples used in conventional deep learning (DL) and machine learning (ML) models;
2. Presenting this proposal, based on cutting-edge technology, as a novel support option for cyclists, allowing them to travel more safely during trips within urban areas;
3. Contributing to the area of machine learning with a model that uses fewer examples or information samples for training, thereby resembling the natural learning of human beings.

2. Simplified Machine Learning

2.1. The Challenge of Machine Learning

Increasing safety in urban cycling, and incorporating it as a smarter and safer means of transportation in future smart cities, suggests a technologically supported solution to assist the urban cyclist. Intuitively, the problem shows that some information is available and some is not; Table 1 summarizes these two aspects to consider when modeling a possible solution.
Additionally, Figure 2 presents, in a general way, the characteristics that the suggested machine learning model should have in order to be applied to the problem of this research.
The information and intuitive characteristics of the problem suggest that the proposed solution can be oriented towards artificial intelligence, and specifically machine learning, because assisting the cyclist in identifying collision risk requires an accelerated learning process that also exploits previously learned information. Machine learning has recently been very successful in various tasks, specifically image and speech classification, pattern recognition, and improved web information searches. However, these models usually require a large amount of data and training time to achieve reliable learning.
Therefore, the purpose of this research is to propose, based on the statement and description of the problem, a technological solution that supports the urban cyclist in reducing the risk of collision through automatic detection within a dynamic environment. The research question is thus defined as follows: can a machine learning model, based on a few examples, learn the concept of unexpected collision risk and detect it in real time?

2.2. Related Work

Reviewing possible technological solutions to this problem, there are preventive approaches that rely on unsupervised machine learning to explore the circumstances associated with urban cycling safety, such as the one presented by Zhao, H. et al. [9], where large amounts of publicly available data, such as satellite images and neighborhood and city maps, are used to collect information about the environment of cyclist accidents, and machine learning methods such as generative adversarial networks (GANs) learn from these datasets to explore the factors associated with cyclist crashes. Along the same lines, work has been carried out in Spain, such as that presented by Galán R. et al. [10], which studies the variables that cause accidents among bicycle users, with the aim of reducing the number of accidents and thus increasing the number of people who can use the bicycle as a means of transportation with greater safety.
However, at the time this research was carried out, no evidence was found of a technological proposal that involves a means of transport such as the bicycle and supports the reduction of accidents in this sense. Nor was specific evidence found of a geospatial analysis of urban cycling accidents that would allow a comparison of methodologies and results.
It is known that, recently, machine learning has been very successful in various tasks, such as pattern classification and searching for massive information on the web, as well as image and speech recognition. However, these machine learning models often require, as a training input, a large amount of example data to learn. Likewise, the technology known as deep learning (DL) [11,12] is booming and has been playing an important role in the advancement of machine learning. However, it also requires large amounts of data.
In addition, large data volumes tend to lead to slow learning, mainly due to the parametric nature of deep learning algorithms, whose operating characteristics require the training examples to be learned in parts and gradually. As mentioned, one of the characteristics of the simplified machine learning model is that it must be able to learn from very few examples; therefore, deep learning does not apply for this purpose. Instead, the model should be more similar to the way humans learn, that is, generalize knowledge from a few examples.
The simplified machine learning model has several advantages; as a summary, Table 2 indicates the most relevant advantages it has over other machine learning models.

2.3. Contrastive Learning

Intuitively, we can say that contrastive learning mimics the way humans learn about the world around them. According to many specialists, children learn new concepts more easily, for example, by looking at a picture of an owl and trying to find a similar one among a set of images of various animals. In this case, the child has to examine each animal’s characteristics, compare them with those of the owl in the original image, and then conclude which image represents a similar animal.
From what was described above, it turns out that it is easier for a person without prior knowledge, like a child, to learn new things by contrasting between similar and different things instead of learning to recognize them one by one. Initially, the child may not be able to identify an owl as such, but after a while, the child learns to distinguish common characteristics among owls, such as the shape of their head, their posture, their wings, and the shape of their body.
Within machine learning (ML), contrastive learning is a paradigm in which unlabeled data are juxtaposed against each other to teach a model which samples are similar and which are different. That is, as the name suggests, the samples are contrasted against each other: those belonging to the same distribution are pushed towards each other in a compact Euclidean space, while those belonging to different distributions are pushed away from each other [13].
Contrastive learning improves the performance of computer vision tasks and has shown promising results relative to deep learning, thus gaining importance in the field. It uses the principle of contrasting samples against each other to learn the attributes that are common within a data class and the attributes that differentiate one class from another.
In recent years, there has been a resurgence in the field of contrastive learning, which has led to important advances in self-supervised learning [14,15,16,17,18]. The common idea in these works is the following: join an anchor sample and a “positive” sample in the representation (embedding) space and separate the anchor from many “negative” samples. Since no labels are available, a positive pair often consists of the anchor and an augmented version of the same sample, while negative pairs consist of the anchor and samples chosen at random from the small image set. These concepts are compared graphically in Figure 3. Likewise, [15,16] describe certain connections between contrastive loss and the maximization of mutual information between different data samples.
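As a toy illustration of the anchor, positive, and negative samples described above (our sketch, not the cited works’ code; PyTorch is an assumed framework), the positive below is an augmented view of the anchor, and the negatives are other images drawn from the same unlabeled batch:

```python
import torch

def augment(img):
    # A stand-in augmentation: horizontal flip plus small pixel noise.
    return torch.flip(img, dims=[-1]) + 0.01 * torch.randn_like(img)

batch = torch.rand(8, 3, 100, 100)   # a small batch of unlabeled images
anchor = batch[0]
positive = augment(anchor)           # augmented view of the anchor -> positive pair
negatives = batch[1:]                # randomly drawn other samples -> negative pairs
```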

2.4. Approach Overview and Contributions

Contrastive learning, as mentioned by Khosla, P. et al. [13], mimics the way humans learn and aims to learn low-dimensional representations of data by contrasting similar and dissimilar samples. This is why humans can learn new things from a small set of examples: when presented with new examples, a person can understand new concepts quickly and will then recognize variations of those concepts in the future. In the same way, a child can learn to recognize a cat from a single image, whereas current machine learning systems need many examples to learn the characteristics of a cat and recognize them in other examples in the future.
We can observe that in standard associative learning, an animal must repeatedly experience a series of associations between a stimulus and a consequence before it completely learns a particular stimulus; learning is thus inevitably incremental. However, animals sometimes infer outcomes they have never observed before and from which they need to learn quickly to survive. In such cases, animals can learn from a single exposure to the stimulus. Drawing an analogy with machine learning, this is what we have defined as simplified machine learning (Hernández-Herrera et al. [19,20]), generally known in the literature as one-shot learning [21].
Simplified machine learning is particularly advantageous for various applications where traditional machine learning models are insufficient due to limitations in the amount of data or the need for rapid adaptation and flexibility.
  • Medicine: In this field, its application focuses on supporting health professionals in the diagnosis of rare diseases using a limited number of medical images as training examples, which supports identifying conditions using minimal or scarce data. On the other hand, it could accelerate the development of more precise treatments, drugs, and diagnoses.
  • Security systems and access control: Within facial recognition systems, being able to identify a person from a single image is essential in situations such as access control, where several images of the person are unlikely to be available. It also helps improve biometric authentication systems by verifying identities using minimal training samples.
  • Handwriting recognition: In document scanning systems, it enables the recognition of handwritten or uncommon fonts from limited examples, making it possible to digitize historical or old documents. It also supports the classification and understanding of textual data with only a minimal set of labeled examples.
  • Robotics: Using object recognition with few examples, robots can identify and manipulate new objects with minimal training. This functionality allows for increased adaptability in non-uniform or dynamic environments such as commercial warehouses. Additionally, the ability of robots to learn tasks from a single demonstration facilitates the learning of complex tasks using few examples. Likewise, autonomous robotics uses recognition with few examples to detect or avoid unknown objects in its path for better self-driving.
  • Species identification: It can support the identification and tracking of rare or completely unidentified species from a few photographic samples. Furthermore, it facilitates the study of biodiversity through the recognition of new species from limited observations or images.
Compared with computers, it is a hallmark of human intelligence to be able to learn quickly, whether it is recognizing objects from a few examples or quickly learning new skills after a few minutes of experience. Today it is claimed that artificial intelligence systems should be able to do the same, learn and adapt quickly from a few examples, and continue to adapt as these systems are exposed to more and more available data. This type of learning with the characteristics of speed and flexibility is a challenge, since the system must integrate its prior knowledge with a small amount of new information, efficiently avoiding overfitting to new data. Likewise, this previous experience and the new data will depend on the task at all times (see Figure 4).
Our proposed solution for supporting an urban cyclist in riding safely in urban environments is to specify and design a model that can perceive its surroundings and perform machine learning with few training examples, assisting in collision risk detection and thereby alerting the cyclist to possible danger, as shown in Figure 5. We propose a prototype system, installed on the bicycle, that performs visual perception and, through the simplified machine learning model, detects that a vehicle is approaching the cyclist, estimating a possible risk of collision.
The main justification behind this simplified machine learning is then to be able to train a cognitive model with one or very few examples, as well as to generalize to unknown categories without extensive retraining, and thus to adapt better as a solution to the problem posed in this research.
So far, no concrete evidence of a specific work related to the problem has been found, so it is considered that this research is the first attempt to define a machine learning model that allows for detecting and evaluating the risk of unexpected collisions in urban cycling.
Some of the work conducted during our research and presented in this article was inspired by [22], which showed how similarity based on the Euclidean distance ($L_2$) was superior to the cosine similarity presented in [23]. Therefore, we assumed that adding a combined affinity layer would improve the classification accuracy. With this approach, we proposed to implement this combined affinity layer in our Siamese artificial neural network for one-shot learning. This is a major contribution of what we have called simplified machine learning: the development of a new type of affinity layer (bi-layer) for deep affinity neural networks, which is the basis of our Siamese artificial neural network.

3. Materials and Methods

This section presents a brief general description of the methodology used, including a substantial description of the proposed solution, with an explanation of the architecture of the proposed cognitive model and its essential parts, such as the affinity layers and the combined affinity layer.

3.1. Methodology

For the development of this research, the mixed methodology, or mixed research route, proposed by Hernández-Sampieri and Mendoza [24] was used as a basis, with some variations required for the present work. The main reason for using this mixed methodology is that it does not replace quantitative or qualitative research but rather combines the strengths of both types of inquiry while trying to minimize their potential weaknesses. Likewise, we consider that it is better suited to the problem statement.
Our methodology integrates quantitative and qualitative approaches and is constituted by the following general stages (see Figure 6):
1. Research problem statement;
2. Design and development of the solution proposal;
3. Data collection and analysis;
4. Preliminary results.

3.2. Cognitive Model Architecture

Metric learning [25], that is, learning with a distance measure or similarity learning, is a method that performs mapping in the feature space through feature transformation to subsequently form groups within the feature space. Metric learning-based methods are widely used for facial recognition and person identification. Metric learning learns the similarity of two images through the distance between them, where similar targets move closer in distance and different targets move farther away from each other. Therefore, metric learning requires certain key characteristics of the learning objectives, that is, individualized characteristics of each object.
When distinguishing objects, similar objects share very similar appearance characteristics; these constitute the common features of nearly identical objects. Distinctive characteristics such as shape, color, texture, and size are used to differentiate two or more objects. Metric learning distinguishes different identities by learning these key distinguishing features.
The most commonly used loss functions within metric learning include binary cross-entropy loss, contrastive loss, triplet loss, and quadruplet loss.
As background, contrastive loss was introduced in 2006 by Hadsell, Chopra, and LeCun [26] and is generally described as a metric learning loss function that computes the Euclidean distance or cosine similarity between pairs of vectors and assigns a loss value based on a predefined margin threshold. For a dissimilar pair, if the distance between the two vectors exceeds the margin, the loss is zero; if the distance is smaller than the margin, the loss is positive and grows as the vectors move closer together. Contrastive loss plays a crucial role in maintaining the similarity and correlation of latent representations across different information modalities, ensuring that similar instances are represented by similar vectors and different instances by different vectors.
In a simple way, the loss in metric learning can be exemplified as follows: given two input images $x_1$ and $x_2$ and their respective feature vectors $f(x_1)$ and $f(x_2)$, the Euclidean distance characterizes the similarity, that is, the closeness between the two objects in Euclidean space, defined in Equation (1) (adapted from [26]) as

$D(x_1, x_2) = \lVert f(x_1) - f(x_2) \rVert_2$ (1)
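As a quick numeric check of Equation (1), the following snippet (with made-up feature vectors; PyTorch is an assumed framework) computes the Euclidean distance between two embeddings:

```python
import torch

f_x1 = torch.tensor([0.2, 0.7, 0.1])   # hypothetical feature vector f(x1)
f_x2 = torch.tensor([0.3, 0.6, 0.4])   # hypothetical feature vector f(x2)

# ||f(x1) - f(x2)||_2 = sqrt(0.01 + 0.01 + 0.09) = sqrt(0.11)
D = torch.linalg.vector_norm(f_x1 - f_x2, ord=2)
print(D)   # tensor(0.3317)
```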
Formally, a contrastive loss function is used to learn the parameters $W$ of a parameterized function $G_W$ such that neighboring objects move together and non-neighbors move apart. Prior knowledge is then used to identify the neighbors in the training data. Hadsell et al. [26] exemplify an energy-based model where, given the neighborhood relations, these are used to learn the mapping function. In this context, given a family of functions $G$ (such as a CNN) parameterized by $W$ (the weights of the CNN), the objective is to find values of $W$ that map a set of high-dimensional inputs such that a simple comparator in the output space, for example, the Euclidean distance, reflects the “semantic similarity” of the entries in the input space, as given by the neighborhood relations.
Contrastive loss is mainly used to train so-called Siamese neural networks, introduced in the early 1990s by Bromley et al. [27]. The Siamese network is a “connected neural network”, and its structure is shown in Figure 7. Likewise, Algorithm 1 establishes the general training procedure for a Siamese neural network.
Here, the network has a “connected body” through a shared set of weights; that is, the weights of the two neural networks are identical. The Siamese network is mainly used to measure the similarity between two inputs, and its branches can be built from a CNN or an LSTM [28]. For example, when two images are input, they are fed into the two neural networks, which map the inputs separately into a new representation space, so that each input is represented as a value within that space. The similarity or dissimilarity between the two inputs is then evaluated by calculating the loss value.
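To make the weight sharing concrete, here is a minimal sketch (PyTorch assumed; the linear backbone is a stand-in, not the paper’s CNN) in which a single module embeds both inputs:

```python
import torch
import torch.nn as nn

# One shared "connected body": the same weights embed both inputs.
backbone = nn.Sequential(nn.Flatten(), nn.Linear(3 * 100 * 100, 128))

x1 = torch.rand(1, 3, 100, 100)       # first input image (random stand-in)
x2 = torch.rand(1, 3, 100, 100)       # second input image
h1, h2 = backbone(x1), backbone(x2)   # two passes through the same weights
distance = torch.norm(h1 - h2, p=2)   # small distance -> similar inputs
```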
Algorithm 1: General training algorithm for a Siamese neural network
Require: $S$ (training dataset), $f$ (CNN), $z_{ij}$ (binary label, $z_{ij} \in \{0, 1\}$), $w$ (shared weights), $\eta$ (learning rate)
Ensure: $L(x_i, x_j, z_{ij})$
1: for $(x_i, x_j, z_{ij}) \in S$, $i, j \le K$ do
2:   $h_i \leftarrow f(x_i)$ {feature vector of $x_i$}
3:   $h_j \leftarrow f(x_j)$ {feature vector of $x_j$}
4:   $d(h_i, h_j) \leftarrow \lVert h_i - h_j \rVert_2$ {Euclidean distance}
5:   Compute the contrastive loss function $L(x_i, x_j, z_{ij})$ (Equation (3))
6:   Compute the total error function to be minimized, $L(w)$ (Equation (2))
7:   Optimize and update the weights: $w_{n+1} \leftarrow w_n - \eta \nabla L(w_n)$
8: end for
Let us say that each pair of training images intrinsically has a binary label $Y$ assigned to it: if $Y = 0$, $x_1$ and $x_2$ are considered similar; if $Y = 1$, they are considered dissimilar. The contrastive loss function in its general form can then be defined as

$L(W) = \sum_{i=1}^{P} L(W, (Y, X_1, X_2)^i)$ (2)

$L(W, (Y, X_1, X_2)^i) = (1 - Y) L_S(D^i) + Y L_D(D^i)$ (3)

where $X_1$ and $X_2$ are a pair of input vectors shown to the system (in our case, the feature vectors of the input images), $(Y, X_1, X_2)^i$ is the $i$-th labeled sample pair, $L_S$ is the partial loss function for a pair of similar points, $L_D$ is the partial loss function for a pair of dissimilar points, and $D$ shortens the Euclidean distance notation $D(X_1, X_2)$. Finally, $P$ is the number of training pairs. Both $L_S$ and $L_D$ must be designed such that minimizing $L$ with respect to $W$ yields low values of $D$ for similar pairs and high values of $D$ for dissimilar pairs.
Specifically, the exact loss function is the following Equation (4):

$L_c(W, (Y, X_1, X_2)^i) = (1 - Y)\,\frac{1}{2} D^2 + Y\,\frac{1}{2} \max(0, m - D)^2$ (4)

where $m > 0$ is a margin. The margin defines a radius around $G_W(X)$, acting as a set threshold. The contrastive loss can thus both express the matching of sample pairs and effectively train the model on the extracted features. Through the continuous reduction of the loss value, the distance between pairs of similar samples is continuously reduced, while the distance between pairs of dissimilar samples is continuously increased.
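The following is a runnable sketch, under stated assumptions, of Equation (4) and one optimization step in the spirit of Algorithm 1. The PyTorch framework, the margin value of 1.0, and the linear stand-in backbone are our assumptions; the learning rate of 0.0005 and the batch size of 18 mirror the experimental setup reported in Section 4.1.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(h_i, h_j, y, margin=1.0):
    """Equation (4): y = 0 for a similar pair, y = 1 for a dissimilar pair."""
    d = F.pairwise_distance(h_i, h_j)                      # Euclidean distance D
    similar_term = (1 - y) * 0.5 * d.pow(2)                # pulls similar pairs together
    dissimilar_term = y * 0.5 * F.relu(margin - d).pow(2)  # pushes dissimilar pairs past margin m
    return (similar_term + dissimilar_term).mean()

# One optimization step; the linear backbone is a placeholder for the CNN f.
f = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 100 * 100, 1024))
optimizer = torch.optim.Adam(f.parameters(), lr=0.0005)

x_i = torch.rand(18, 3, 100, 100)        # batch of image pairs (random stand-ins)
x_j = torch.rand(18, 3, 100, 100)
y = torch.randint(0, 2, (18,)).float()   # binary pair labels z_ij

optimizer.zero_grad()
loss = contrastive_loss(f(x_i), f(x_j), y)   # both branches share the weights of f
loss.backward()
optimizer.step()
```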
The general description of the architecture of the proposed Siamese cognitive model is presented in Figure 8. As a basis for our Siamese network model, we adopted an approach similar to the one shown in [28], but adjusted the convolutional neural networks (CNNs) to generate 1024 features instead of 4096; the networks also share the same parameters since they are copies of the same CNN. The feature extractor was implemented with a neural network architecture that learns image embeddings and attribute vectors in the same vector space (embedding space), so that distances between affinity features can be calculated. The two input images $(x_i, x_j)$ feed the CNNs, from which the two fixed-length feature vectors $f(x_i)$ and $f(x_j)$ are obtained. Since both feature extraction networks are identical, $f(x_i) \approx f(x_j)$ if the two images are affine and $f(x_i) \not\approx f(x_j)$ otherwise.
As main feature extractors, standard state-of-the-art CNNs (ResNet-18 [29] and EfficientNet-B0 [30]) were used, which helped accelerate the training of the proposed model by generating fewer network parameters. These outputs then feed the affinity layers, as detailed in the following section.

3.3. Affinity Layer Overview

In the design and specification of the affinity layers, $A_1$ is computed using the Euclidean distance, shown in Equation (5), and $A_2$ using the Manhattan distance, shown in Equation (6), where $u$ and $v$ are the feature vectors. For the proposed model, we adapt a perspective shown in [28,31] to integrate these as layers in an artificial neural network. In this way, an element-wise operation is performed on the feature vectors to finally generate a new one.

$\delta_e(u, v) = \sqrt{\sum_{i=1}^{n} (u_i - v_i)^2}$ (5)

$\delta_m(u, v) = \sum_{i=1}^{n} \lvert u_i - v_i \rvert$ (6)

3.4. Combined Affinity Layer Overview

The basis of our Siamese artificial neural network is the so-called combined affinity layer, which unifies the feature vectors of the Euclidean ($A_1$) and Manhattan ($A_2$) layers into a single one that evaluates the similarity or dissimilarity of the input images.
The combined affinity layer $A_{max}$ works as follows: we take the element-wise maxima of the two affinity layers ($A_1$ and $A_2$). Assuming that $A = (A_{ik})$ is the affinity layer, for $1 \le i \le m$ and $1 \le k \le n$, where $m$ is the total number of affinity layers and $n$ is the size (number of rows) of each affinity layer, the maximum of the corresponding elements in each row of the two layers is taken to form a layer of size $n$, defined by Equation (7):

$A_{max}(k) = \max_{1 \le i \le m} A_{ik}$ (7)
In the design and implementation of the combined layers, the 1024-feature output of the CNNs was conditioned with a ReLU activation function, a kernel regularizer to prevent overfitting, and a bias initializer. The regularizer works with a mean of 0, while the bias initializer has a mean of 0.5 and a standard deviation of 0.01. Given the two inputs to compare, each of the model’s feature extractors produces a vector of 1024 features. Those outputs then become the inputs of the two separate layers that calculate the corresponding affinities: the $A_1$ layer (Euclidean distance, $L_2$) and the $A_2$ layer (Manhattan distance, $L_1$). Each of these affinity layers produces an output of 1024 features, which is then passed through the maximum affinity layer $A_{max}(k)$; this layer takes the element-wise maximum of the two layers and generates a maximum vector of 1024 features. As a last step, a sigmoid activation function is applied to a final filter layer, producing a value between 0 and 1 that establishes the probability of affinity between the input images.
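A minimal PyTorch sketch of this bi-layer is given below; the framework, the element-wise forms of the $A_1$ and $A_2$ affinities, and the final dense layer reflect our reading of Sections 3.3 and 3.4 rather than the authors’ released code, and the kernel regularizer is omitted for brevity.

```python
import torch
import torch.nn as nn

class CombinedAffinity(nn.Module):
    """Bi-layer sketch: element-wise A1/A2 affinities, their maximum, and a sigmoid filter."""
    def __init__(self, n_features=1024):
        super().__init__()
        self.classifier = nn.Linear(n_features, 1)                 # final filter layer
        nn.init.normal_(self.classifier.bias, mean=0.5, std=0.01)  # bias initializer from the text

    def forward(self, u, v):
        a1 = (u - v).pow(2)            # element-wise Euclidean-style affinity (A1)
        a2 = (u - v).abs()             # element-wise Manhattan affinity (A2)
        a_max = torch.maximum(a1, a2)  # Equation (7): element-wise maximum, A_max
        return torch.sigmoid(self.classifier(a_max))  # affinity probability in (0, 1)

# Usage on two hypothetical 1024-feature extractor outputs:
layer = CombinedAffinity()
u, v = torch.rand(1, 1024), torch.rand(1, 1024)
print(layer(u, v))   # value near 1 -> the inputs are judged affine
```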

3.5. Dataset Overview

In the area of one-shot learning, much of the research evaluating models for image categorization commonly uses the MiniImageNet [23] and CIFAR-100 [32] datasets. For this reason, we used both datasets in the experimental phase, which allowed us to compare our model with other state-of-the-art methods that have also used them. Furthermore, we added the CUB-200-2011 (Caltech-UCSD Birds-200-2011) dataset [33], which has recently become a standard benchmark for few-shot learning tasks and is also the most used dataset for fine-grained visual categorization. We also incorporated the DroNet dataset [34] to establish a comparison with a proposed autonomous drone navigation approach for obstacle avoidance and to make our evaluation and comparison more comprehensive.
As a common practice in the area of machine learning, each dataset was divided into three subsets for one-shot learning: the training set ( T s ), the validation set ( V s ), and the search/query set ( Q s ). T s is a disjoint set of the sets Q s and V s , but V s and Q s belong to the same category or class. Suppose there are a number i of categories in the training set, a number j of categories in the validation set, and a number k of categories in the search set. The set of category labels in T s , V s , and Q s would then be C i , C j , and C k , respectively.
Therefore, the label pairs for the images in the training set are given by Equation (8):

$T_s = \{(x_i, x_j, A(x_i, x_j))\}_{i,j=1}^{n}$ (8)

where $(x_i, x_j)$ are the image pairs in the training set and $A(x_i, x_j)$ is the affinity score of the image pair: if $x_i$ and $x_j$ are equal (belong to the same class), the score has a value of 1; otherwise, it has a value of 0. Here, $n$ is the number of training samples or examples.
Continuing with the label specification, Equation (9) defines the image label pairs in the validation set:

$V_s = \{(x_k, x_l, A(x_k, x_l))\}_{k,l=1}^{m}$ (9)

where $(x_k, x_l)$ are the image pairs in the validation set and $A(x_k, x_l)$ is the affinity score of the image pair, taking a value of 1 if $x_k$ and $x_l$ are equal (belong to the same class) and 0 otherwise. Here, $m$ is the number of validation samples or examples.
Finally, the images that form the query set are specified by Equation (10):

$Q_s = \{x_k\}_{k=1}^{n}$ (10)

where the $x_k$ are image samples from categories in the validation set. The final objective of this learning is to classify samples in the search/query set given some examples in the validation set, subject to the following restrictions: $C_i \cap C_j = \emptyset$; $T_s \cap V_s = \emptyset$; $T_s \cap Q_s = \emptyset$; and $Q_s \cap V_s \neq \emptyset$. Therefore, the categories in the training and validation sets are disjoint, but the classes in the validation and query sets intersect.
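As an illustration of how such labeled pairs can be produced (function and variable names are hypothetical, not the authors’ code), the following sketch assigns affinity score 1 to same-class pairs and 0 to cross-class pairs, as in Equation (8):

```python
import random

def build_pairs(images_by_class, n_pairs):
    """images_by_class: dict mapping a class label to a list of (at least two) images."""
    labels = list(images_by_class)
    pairs = []
    for _ in range(n_pairs):
        if random.random() < 0.5:                           # positive pair: same class
            c = random.choice(labels)
            x_i, x_j = random.sample(images_by_class[c], 2)
            pairs.append((x_i, x_j, 1))                     # affinity score A = 1
        else:                                               # negative pair: different classes
            c_i, c_j = random.sample(labels, 2)
            x_i = random.choice(images_by_class[c_i])
            x_j = random.choice(images_by_class[c_j])
            pairs.append((x_i, x_j, 0))                     # affinity score A = 0
    return pairs
```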
The segmentation of the datasets used for the experimental phase of this study are briefly presented below:
  • The MiniImageNet dataset, as stated by Vinyals et al. [23], contains 100 classes chosen randomly from the original ImageNet dataset, and each of those classes is itself composed of 600 images. The dataset was divided following what was presented in [31,35,36,37], into 64, 16, and 20 training, validation, and test classes, respectively. The main reason for using this dataset is its complexity and its repeated use to test many other one-shot learning tasks.
  • The CIFAR-100 dataset, as stated in [32], contains 100 classes with 600 images each. The dataset was divided as suggested in [23,35,36,38] into 64, 16, and 20 training, validation, and test classes, respectively. This division is in line with other research that evaluated one-shot learning models with this dataset.
  • The CUB-200-2011 dataset, defined in [33] and previously mentioned, is a fine-grained dataset consisting of 200 classes and 11,788 images. A split similar to the one proposed in [35] was applied, with 100, 50, and 50 classes for training, validation, and testing, respectively, in line with the splits also established in [14,22,23,36,38,39,40].
  • The DroNet dataset, as described by Loquercio et al. [34], contains 32,000 images distributed in 137 classes for a diverse set of obstacles. The dataset was divided into 88, 22, and 27 training, validation, and testing classes, using a dataset division similar to that proposed in [23,35,36,38].

4. Results

4.1. Experimental Setup

In our experimental phase, two CNN networks were used: EfficientNet-B0 and ResNet-18. These two neural networks were chosen mainly for their characteristics as feature extractors, for their generalized use in other one-shot learning models, and to allow our results to be compared against the state of the art. For our model, the ResNet-18 implementation was similar to the one shown in [29], except that the input image size was set to 100 × 100. Likewise, the EfficientNet-B0 implementation was similar to the one presented in [30], again with the input image size set to 100 × 100. The outputs are then passed through the proposed combined affinity layer with a sigmoid activation to determine similarity or dissimilarity.
In the experimental design, the number of epochs was set to 200, and the batch size (processing block) was set to 18. For training the Siamese network, the contrastive loss function (Equation (4)) was used as the objective function, together with an Adam optimizer with an initial learning rate of 0.0005.
A comparison was made between the current reference models in the literature, which are based on the cosine, Manhattan, and Euclidean similarity layers, and the proposed combined affinity layers, over the four datasets detailed in the previous section: MiniImageNet, CIFAR-100, CUB-200-2011, and DroNet. The evaluation specifically measured classification accuracy with one random example image in one-shot mode and five random example images in five-shot mode. This setup is illustrated in Figure 9.
The following section presents a comparison of our experimental results in a descriptive and detailed manner, using representative tables and figures.

4.2. Comparison of the Model Against Reference Data

With the experimental phase complete, the results obtained with each of the datasets used to evaluate the behavior, performance, and efficiency of the model are presented below. Likewise, the model developed in this research is compared against the various models in the state of the art to assess its performance and efficiency.
The main objective of the present research was to develop a biologically inspired computational model that allows simplified machine learning from few examples to detect a possible risk of unexpected collision and thereby assist the cyclist riding in an urban environment. Therefore, the model was specifically evaluated in terms of accuracy in identifying the perceived information (images) in single-example (one-shot) and five-example (five-shot) modes, with a comparison across the datasets using the two indicated feature extractors.
Table 3 and Table 4 show the average accuracy with 95% confidence when performing image classification using the four datasets, the affinity methods separately, and the proposed SML model ( A m a x layer). Training was performed using the MiniImageNet dataset and the ResNet-18 and EfficientNet-b0 CNNs as feature extractors.
The results show that, for all datasets and feature extractor networks, our SML model outperformed one-shot mode learning methods (one-shot classification accuracy) using separate similarity layers A 1 (Euclidean) and A 2 (Manhattan). Therefore, our SML model using both the ResNet-18 and EfficientNet-B0 feature extractors had the best performance in all cases.
As can be seen in Figure 10a, the proposed SML model for the MiniImageNet dataset, with a ResNet-18 feature extractor, in five-shot mode, performs better by 16.85 % compared with the best result, which is the similarity layer A1. On the other hand, in the one-shot mode, the model performs only 6.55 % better compared with the best result for this mode, which is the similarity layer A1.
Next, for the CIFAR-100 dataset, as shown in Figure 10b, the SML model with the same feature extractor ResNet-18, in five-shot mode, performs better by 17.04 % compared with the best result, which is the similarity layer A1. Likewise, in one-shot mode, the model performs better only by 4.41 % compared with the best result for this mode, which is the similarity layer A2.
Continuing with the analysis, as shown in Figure 10c, the proposed SML model using the CUB-200–2011 dataset, with a ResNet-18 feature extractor in five-shot mode, performs better by 8.33 % compared with the best result, which is the similarity layer A1. Now in one-shot mode, the model performs only 4.31 % better compared with the best result for this mode, which is the similarity layer A2.
Finally, as seen in Figure 10d, for the DroNet dataset, the SML model with a similar feature extractor (ResNet-18), in five-shot mode, performs better by 21.46 % compared with the best result, which is the similarity layer A1, being the best performance within the four datasets used for five-shot mode. Similarly, in one-shot mode, the model performs better by 7.64 % compared with the best result for this mode, which is also the similarity layer A1, this being also the best performance within the four datasets used for one-shot mode.
Figure 11 summarizes the previous results comparatively, as well as the behavior of the four datasets using the ResNet-18 convolutional network as a feature extractor, as well as the affinity layers ( A 1 , A 2 , and A m a x ) used in the comparative.
Continuing with the analysis of the results, it can be seen in Figure 12a that for the MiniImageNet dataset, the SML model with EfficientNet-b0 as a feature extractor, in five-shot mode, performs better by 15.95 % compared with the best alternative result, which is the similarity layer A1. Likewise, in one-shot mode, the model performs better by 4.30 % compared with the best alternative result for this mode, also the similarity layer A1. It should be noted that this is the best performance among the four datasets used in one-shot mode.
Now, for the CIFAR-100 dataset, as shown in Figure 12b, the SML model with the same EfficientNet-b0 feature extractor performs better in five-shot mode by 10.23 % compared with the best alternative result, which is the A1 similarity layer. On the other hand, in one-shot mode, the model performs only 3.57 % better compared with the best result for this mode, which is the similarity layer A1.
Continuing with the analysis, as shown in Figure 12c, the proposed SML model using the CUB-200–2011 dataset, with an EfficientNet-b0 feature extractor in five-shot mode, performs better by 5.03 % compared with the best result, which is the similarity layer A1. Instead, in one-shot mode, the model performs only 1.03 % better compared with the best result for this mode, which is the similarity layer A1.
Finally, as seen in Figure 12d, for the DroNet dataset, the SML model with EfficientNet-b0 as a feature extractor performs better in five-shot mode by 16.94 % compared with the best alternative result, the similarity layer A1, this being the best performance within the four datasets used for five-shot mode. Additionally, in one-shot mode, the model performs better by 3.81 % compared with the best result for this mode, which is also the similarity layer A1.
Figure 13 summarizes the previous results comparatively, as well as the behavior of the four datasets using the EfficientNet-B0 convolutional network as a feature extractor, as well as the affinity layers ( A 1 , A 2 , and A m a x ) used in the comparative.
As a summary of the performance of the SML model, evaluating both feature extractors and the datasets in each of the similarity layers, we can state that the model achieves its best average accuracy gains with the ResNet-18 feature extractor on the DroNet dataset, in both five-shot and one-shot modes, with improvements of 21.46 % and 7.64 %, respectively.

4.3. Performance and Generalization in the State-of-the-Art

When classifying new data with state-of-the-art reference models, accuracy tends to decrease due to the change in data distribution, as demonstrated in a study by Li et al. [31], where all the data share the same statistical distribution even if they come from different classified groups. Our Siamese network, the basis of the SML model, used the ResNet-18 and EfficientNet-B0 networks as feature extractors; it was trained with the MiniImageNet dataset and validated with the CIFAR-100, CUB-200-2011, and DroNet datasets. For the state-of-the-art models used in the comparison, very similar datasets and CNNs were used, allowing the results to be evaluated through their classification accuracies in the two modes, one-shot and five-shot, with 95 % confidence. Table 5, Table 6, Table 7 and Table 8 present these results.
The above results allow us to conclude that our SML model, compared with the state-of-the-art models, using two of the datasets (CIFAR-100 and DroNet), has better performance and generalization in the one-shot and five-shot mode using both the ResNet-18 and EfficientNet-B0 feature extractors. However, for the MiniImageNet and CUB-200-2011 datasets, results close to the RENet model were obtained, which was the one that obtained the best results in terms of average accuracy.

5. Discussion

The results presented in Table 5, Table 6, Table 7 and Table 8 allow us to compare the average classification accuracies of the models selected from the state of the art and the proposed SML model. Models using very similar feature extractors and datasets were explored and considered, allowing us to evaluate our SML model’s classification accuracy in both modes (one-shot and five-shot) at 95 % confidence. Training of the SML model was performed with the MiniImageNet dataset, using the ResNet-18 and EfficientNet-B0 CNNs as feature extractors.
Graphically, Figure 14a shows that for the MiniImageNet dataset in one-shot mode, the proposed SML model with an EfficientNet-B0 feature extractor performs better than all models except RENet, which was better by 1.04 %. Similarly, in five-shot mode (Figure 14b), RENet was 1.06 % better than the SML model. Therefore, RENet was the model that presented the best result in both modes (one-shot and five-shot) among the comparative models using the MiniImageNet dataset.
Similarly, Figure 15a,b show that for the CUB-200-2011 dataset, the SML model in both modes (one-shot and five-shot) did not reach an average accuracy close to that of the RENet model, the latter being better by 6.93 % and 7.92 % for the one-shot and five-shot modes, respectively.
Continuing with the results obtained, Figure 16a,b show that the SML model using the CIFAR-100 dataset, when compared with the state-of-the-art models, surpassed the Dual TriNet model, whose results had been the best as observed in Table 7, in average accuracy by 12.24 % in one-shot mode and by 3.43 % in five-shot mode.
Finally, for the DroNet dataset, as can be seen in Figure 17a, the SML model in one-shot mode outperforms the Dual TriNet model by only 0.91 %, the latter being the model with the best average accuracy among those analyzed for that particular mode. Similarly, for the five-shot mode, the Dual TriNet model was outperformed by our SML model in average accuracy, but only by 0.40 %, as seen in Figure 17b.
As shown in this previous section, in the experiments carried out with the described datasets and the CNN networks, the proposed Siamese network model was able to perform better in the one-shot and five-shot learning methods. This is because the ResNet-18 CNN learns from residuals, and as shown in [35], it is a practical feature extractor for one-shot learning tasks. It could also be seen that its demonstrated classification accuracy was very close to EfficientNet-B0 and was consistent with the CNNs that have been used for comparison.
It can also be observed that when classifying new data with the experimental models, classification accuracy decreases due to the change in the distribution of the data present in the dataset. Although the data used in one-shot learning come from disjoint classes, they all come from the same data distribution. Likewise, based on the work presented in [31], we report the classification accuracy of our model using the ResNet-18 and EfficientNet-B0 CNNs, trained with the MiniImageNet dataset and validated with the CIFAR-100 and DroNet datasets; the classification accuracy is presented in Table 7 and Table 8 with 95 % confidence.
Our experimental phase with the datasets and CNN networks showed that our proposed model achieves better classification accuracy than other one-shot learning methods that use the cosine function as a similarity layer. In particular, our Siamese network model performed better with the ResNet-18 feature extractor than with EfficientNet-B0 on the MiniImageNet dataset. Architectures with fewer parameters were used, similar to those in [23,35]. It should be noted that the performance of our model is due to the fused affinity layer (bi-layer) that was developed, and the careful combination of the affinity layers yielded a significant improvement in classification accuracy in one-shot and five-shot learning tasks.

6. Conclusions

Feature detection, accuracy, and learning speed represent three of the most important problems in the field of machine learning systems. These systems, in many real-world scenarios, will operate in unstructured environments and will therefore require architectures that can adapt to variations and perturbations in those environments.
The Siamese artificial neural network model proposed as a solution for automatically recognizing a possible collision risk for a cyclist within the urban environment is based on two affinity layers, resembling human contrastive learning, to perform the few-shot learning task. The experimental results demonstrate that the proposed model performs at or above the baseline set by almost all state-of-the-art models on the aforementioned datasets.
It was also observed that our Siamese artificial neural network model produces results consistent with those of other feature extraction networks while using a smaller training set of examples. One of the main results has been to demonstrate that the proposed SML model performs comparably to, and in some cases above, the baseline established by the state-of-the-art Siamese network models, using similar datasets as well as data specific to the problem at hand. This allows us to infer that the technological tool developed in this research can be considered an applicable partial solution.
Additionally, results with larger sample sets are consistent and even more accurate, because as the feature space groups and generalizes, object identification also increases in accuracy. This was demonstrated in the results with the reference data and with the state of the art: the number of images in the MiniImageNet dataset [23] is considerably larger than in the CUB-200-2011 dataset [33], and the results are consistent with this difference.
Finally, we can mention that this is a work in progress whose usefulness can be extended to various fields of application and whose components can be improved as the research advances, for example, with feature extractors offering better performance, speed, and precision.

Author Contributions

Conceptualization, A.H.-H.; Data curation, A.H.-H.; Formal analysis, E.R.-E. and R.Á.-V.; funding acquisition, E.R.-E., V.H.P.-P. and R.Á.-V.; Investigation, A.H.-H. and E.R.-E.; Methodology, E.R.-E. and V.H.P.-P.; Project administration, V.H.P.-P.; Resources, E.R.-E., V.H.P.-P. and R.Á.-V.; Supervision, E.R.-E.; Validation, R.Á.-V.; Visualization, V.H.P.-P.; Writing—original draft, A.H.-H.; Writing—review and editing, E.R.-E. and V.H.P.-P. All authors have read and agreed to the published version of the manuscript.

Funding

The authors are thankful for the financial support of the projects to the Secretaría de Investigación y Posgrado del Instituto Politécnico Nacional with grant numbers 20242742, 20242954, and 20242280, as well as the support from Comisión de Operación y Fomento de Actividades Académicas and Consejo Nacional de Humanidades Ciencia y Tecnología (CONAHCYT).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The raw data supporting the conclusions of this article will be made available by the authors on request.

Conflicts of Interest

Author Rogelio Álvarez-Vargas was employed by the company ProfTech Servicios, S. A. de C. V. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
CNN    convolutional neural network
CNNs   convolutional neural networks
DL     deep learning
LSTM   long short-term memory
ML     machine learning
SML    simplified machine learning

References

1. López Gómez, L. La Bicicleta como Medio de Transporte en la Movilidad Sustentable; Technical Report; Dirección General de Análisis Legislativo: Senado de la República, Mexico, 2018.
2. ITDP. Manual Ciclociudades I. La Movilidad en Bicicleta como Política Pública. In Manual Ciclociudades; Instituto de Políticas para el Transporte y el Desarrollo: Cuauhtémoc, Mexico, 2011; Volume I, p. 62.
3. Vision Zero Network. What Is Vision Zero? 2022. Available online: https://visionzeronetwork.org/about/what-is-vision-zero/ (accessed on 17 January 2025).
4. WHO. Global Status Report on Road Safety 2018; Technical Report; World Health Organization: Geneva, Switzerland, 2018.
5. INEGI. Estadísticas a Propósito del Día de Muertos, DATOS NACIONALES; Technical Report; Instituto Nacional de Estadística y Geografía: Aguascalientes, Mexico, 2019.
6. Hilmkil, A.; Ivarsson, O.; Johansson, M.; Kuylenstierna, D.; van Erp, T. Towards Machine Learning on Data from Professional Cyclists. arXiv 2018, arXiv:1808.00198.
7. Ngiam, J.; Khosla, A.; Kim, M.; Nam, J.; Lee, H.; Ng, A.Y. Multimodal Deep Learning. In Proceedings of the 28th International Conference on Machine Learning, ICML 2011, Bellevue, WA, USA, 28 June–2 July 2011; Getoor, L., Scheffer, T., Eds.; Omnipress: Madison, WI, USA, 2011; pp. 689–696.
8. Srivastava, N.; Salakhutdinov, R. Multimodal Learning with Deep Boltzmann Machines. J. Mach. Learn. Res. 2014, 15, 2949–2980.
9. Zhao, H.; Wijnands, J.S.; Nice, K.A.; Thompson, J.; Aschwanden, G.D.P.A.; Stevenson, M.; Guo, J. Unsupervised Deep Learning to Explore Streetscape Factors Associated with Urban Cyclist Safety. In Smart Transportation Systems 2019; Qu, X., Zhen, L., Howlett, R.J., Jain, L.C., Eds.; Springer: Singapore, 2019; pp. 155–164.
10. Galán, R.; Calle, M.; García, J.M. Análisis de variables que influencian la accidentalidad ciclista: Desarrollo de modelos y diseño de una herramienta de ayuda. In Proceedings of the XIII Congreso de Ingeniería de Organización, Barcelona, Spain, 2–4 September 2009; Asociación para el Desarrollo de la Ingeniería de Organización-ADINGOR: Valencia, Spain, 2009; pp. 696–703.
11. Caterini, A.L.; Chang, D.E. Deep Neural Networks in a Mathematical Framework, 1st ed.; Springer Publishing Company, Incorporated: Berlin/Heidelberg, Germany, 2018.
12. Cuomo, S.; Di Cola, V.S.; Giampaolo, F.; Rozza, G.; Raissi, M.; Piccialli, F. Scientific Machine Learning Through Physics-Informed Neural Networks: Where We Are and What's Next. J. Sci. Comput. 2022, 92, 88.
13. Khosla, P.; Teterwak, P.; Wang, C.; Sarna, A.; Tian, Y.; Isola, P.; Maschinot, A.; Liu, C.; Krishnan, D. Supervised Contrastive Learning. arXiv 2021, arXiv:2004.11362.
14. Chen, T.; Kornblith, S.; Norouzi, M.; Hinton, G. A Simple Framework for Contrastive Learning of Visual Representations. In Proceedings of the 37th International Conference on Machine Learning, Virtual, 13–18 July 2020; Daumé, H., III, Singh, A., Eds.; Proceedings of Machine Learning Research; PMLR: London, UK, 2020; Volume 119, pp. 1597–1607.
15. van den Oord, A.; Li, Y.; Vinyals, O. Representation Learning with Contrastive Predictive Coding. arXiv 2019, arXiv:1807.03748.
16. Tian, Y.; Krishnan, D.; Isola, P. Contrastive Multiview Coding. arXiv 2020, arXiv:1906.05849.
17. Hjelm, R.D.; Fedorov, A.; Lavoie-Marchildon, S.; Grewal, K.; Bachman, P.; Trischler, A.; Bengio, Y. Learning deep representations by mutual information estimation and maximization. In Proceedings of the International Conference on Learning Representations, New Orleans, LA, USA, 6–9 May 2019.
18. He, K.; Fan, H.; Wu, Y.; Xie, S.; Girshick, R. Momentum Contrast for Unsupervised Visual Representation Learning. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 14–19 June 2020; pp. 9726–9735.
19. Hernández-Herrera, A.; Rubio-Espino, E.; Álvarez-Vargas, R.; Ponce-Ponce, V.H. Una Exploración Sobre el Aprendizaje Automático Simplificado: Generalización a partir de Algunos Ejemplos. Komput. Sapiens 2021, 3, 36–41.
20. Hernández-Herrera, A.; Rubio-Espino, E.; Álvarez-Vargas, R.; Ponce-Ponce, V.H. Simplified Machine Learning Model as an Intelligent Support for Safe Urban Cycling. Preprints 2024, 2024121671.
21. Lee, S.W.; O'Doherty, J.P.; Shimojo, S. Neural Computations Mediating One-Shot Learning in the Human Brain. PLoS Biol. 2015, 13, e1002137.
22. Snell, J.; Swersky, K.; Zemel, R. Prototypical Networks for Few-shot Learning. In Advances in Neural Information Processing Systems; Guyon, I., Luxburg, U.V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., Garnett, R., Eds.; Curran Associates, Inc.: Sydney, NSW, Australia, 2017; Volume 30.
23. Vinyals, O.; Blundell, C.; Lillicrap, T.; Kavukcuoglu, K.; Wierstra, D. Matching Networks for One Shot Learning. In Advances in Neural Information Processing Systems; Lee, D., Sugiyama, M., Luxburg, U., Guyon, I., Garnett, R., Eds.; Curran Associates, Inc.: Sydney, NSW, Australia, 2016; Volume 29.
24. Hernández-Sampieri, R.; Mendoza, C. Metodología de la Investigación: Las Rutas Cuantitativa, Cualitativa y Mixta; McGraw-Hill Interamericana: New York, NY, USA, 2018.
25. Xing, E.; Jordan, M.; Russell, S.J.; Ng, A. Distance Metric Learning with Application to Clustering with Side-Information. In Advances in Neural Information Processing Systems; Becker, S., Thrun, S., Obermayer, K., Eds.; MIT Press: Cambridge, MA, USA, 2002; Volume 15.
26. Hadsell, R.; Chopra, S.; LeCun, Y. Dimensionality Reduction by Learning an Invariant Mapping. In Proceedings of the 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06), New York, NY, USA, 17–22 June 2006; Volume 2, pp. 1735–1742.
27. Bromley, J.; Bentz, J.W.; Bottou, L.; Guyon, I.M.; LeCun, Y.; Moore, C.; Säckinger, E.; Shah, R. Signature Verification Using a "Siamese" Time Delay Neural Network. Int. J. Pattern Recognit. Artif. Intell. 1993, 7, 669–688.
28. Koch, G.R. Siamese Neural Networks for One-Shot Image Recognition. In Proceedings of the 32nd International Conference on Machine Learning, Lille, France, 6–11 July 2015.
29. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
30. Tan, M.; Le, Q. EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks. In Proceedings of the International Conference on Machine Learning, Long Beach, CA, USA, 10–15 June 2019; pp. 6105–6114.
31. Li, X.; Yu, L.; Fu, C.W.; Fang, M.; Heng, P.A. Revisiting metric learning for few-shot image classification. Neurocomputing 2020, 406, 49–58.
32. Krizhevsky, A.; Hinton, G. Learning Multiple Layers of Features from Tiny Images; Technical Report; University of Toronto: Toronto, ON, Canada, 2009.
33. Wah, C.; Branson, S.; Welinder, P.; Perona, P.; Belongie, S. The Caltech-UCSD Birds-200-2011 Dataset. 2011. Available online: https://www.vision.caltech.edu/datasets/cub_200_2011/ (accessed on 17 January 2025).
34. Loquercio, A.; Maqueda, A.I.; del Blanco, C.R.; Scaramuzza, D. DroNet: Learning to Fly by Driving. IEEE Robot. Autom. Lett. 2018, 3, 1088–1095.
35. Chen, Z.; Fu, Y.; Zhang, Y.; Jiang, Y.G.; Xue, X.; Sigal, L. Multi-Level Semantic Feature Augmentation for One-Shot Learning. IEEE Trans. Image Process. 2019, 28, 4594–4605.
36. Sung, F.; Yang, Y.; Zhang, L.; Xiang, T.; Torr, P.H.; Hospedales, T.M. Learning to Compare: Relation Network for Few-Shot Learning. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 1199–1208.
37. Hilliard, N.; Phillips, L.; Howland, S.; Yankov, A.; Corley, C.D.; Hodas, N.O. Few-Shot Learning with Metric-Agnostic Conditional Embeddings. arXiv 2018, arXiv:1802.04376.
38. Zhou, F.; Wu, B.; Li, Z. Deep Meta-Learning: Learning to Learn in the Concept Space. arXiv 2018, arXiv:1802.03596.
39. Mangla, P.; Kumari, N.; Sinha, A.; Singh, M.; Krishnamurthy, B.; Balasubramanian, V.N. Charting the Right Manifold: Manifold Mixup for Few-shot Learning. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Snowmass Village, CO, USA, 1–5 March 2020.
40. Kang, D.; Kwon, H.; Min, J.; Cho, M. Relational Embedding for Few-Shot Classification. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 8822–8833.
41. Li, Z.; Zhou, F.; Chen, F.; Li, H. Meta-SGD: Learning to Learn Quickly for Few-Shot Learning. arXiv 2017, arXiv:1707.09835.
42. Finn, C.; Abbeel, P.; Levine, S. Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks. In Proceedings of the 34th International Conference on Machine Learning, Sydney, NSW, Australia, 6–11 August 2017; Precup, D., Teh, Y.W., Eds.; Proceedings of Machine Learning Research (PMLR): London, UK, 2017; Volume 70, pp. 1126–1135.
Figure 1. Urban mobility hierarchy (adapted from [2]).
Figure 2. Intuitive features that the proposed machine learning model should be able to handle.
Figure 3. Contrastive losses. The self-supervised contrastive loss (a) contrasts a single positive image for each anchor (i.e., an augmented version of the same image) with a set of negative images from the sample set. The supervised contrastive loss (b) considered in this article instead treats all samples of the same class as positives and contrasts them with the negatives from the rest of the image set. As the condor photo illustrates, taking class label information into account yields a representation space (embedding space) where elements of the same class are more closely aligned than in the self-supervised case.
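For reference, the supervised contrastive loss illustrated in panel (b) is defined in [13] (in its "outside summation" variant) as

```latex
\mathcal{L}^{\mathrm{sup}}
  = \sum_{i \in I} \frac{-1}{|P(i)|}
    \sum_{p \in P(i)}
    \log \frac{\exp\left( z_i \cdot z_p / \tau \right)}
              {\sum_{a \in A(i)} \exp\left( z_i \cdot z_a / \tau \right)}
```

where z_i is the normalized embedding of anchor i, P(i) is the set of positives in the batch that share the anchor's class label, A(i) is the set of all other samples in the batch, and τ is a temperature parameter.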
Figure 4. Learning new information based on prior knowledge.
Figure 5. Example diagram of the operation of the proposed solution.
Figure 6. General methodology and research development.
Figure 7. Schematic diagram of the Siamese network.
Figure 8. The architecture of the Siamese cognitive model. The input images feed CNN-based feature extraction. The outputs are two feature vectors that are passed to the affinity layers A1 and A2. These are then integrated in the combined affinity layer (bi-layer), where the maximum affinity is computed. The output of the combined affinity layer is passed through the activation function to determine the similarity or dissimilarity between the inputs.
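A minimal sketch of this data flow is given below, assuming a PyTorch implementation with a ResNet-18 backbone. The embedding dimension, the linear form of the affinity layers A1 and A2, and the sigmoid output are illustrative assumptions drawn from the caption, not the authors' exact implementation.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class SiameseAffinityNet(nn.Module):
    """Sketch of the Siamese cognitive model described in Figure 8."""

    def __init__(self, embed_dim=512):
        super().__init__()
        # Shared CNN feature extractor (ResNet-18 with its final FC removed).
        backbone = models.resnet18(weights=None)
        backbone.fc = nn.Identity()
        self.encoder = backbone
        # Two affinity layers operating on the pair of feature vectors.
        self.A1 = nn.Linear(embed_dim, 1)   # e.g., scores |f1 - f2|
        self.A2 = nn.Linear(embed_dim, 1)   # e.g., scores f1 * f2
        self.activation = nn.Sigmoid()

    def forward(self, x1, x2):
        f1, f2 = self.encoder(x1), self.encoder(x2)
        a1 = self.A1(torch.abs(f1 - f2))    # affinity estimate 1
        a2 = self.A2(f1 * f2)               # affinity estimate 2
        # Combined affinity bi-layer: keep the maximum affinity.
        a_max = torch.maximum(a1, a2)
        return self.activation(a_max)       # similarity score in [0, 1]
```

Taking the maximum of the two affinity estimates in the combined bi-layer lets the model keep whichever pairwise comparison is more discriminative for a given input pair.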
Figure 9. An example prediction over five random sample images in single-example (one-shot) mode. The maximum predicted affinity score, indicated in the fourth row (boxed), approaches the target value of 1 for matching images; this therefore represents a correct prediction by the model.
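Continuing the hypothetical sketch above, a one-shot prediction of the kind shown in Figure 9 ranks the support images by predicted affinity and selects the maximum; the random tensors below are placeholders for real images.

```python
# Five candidate support images (one per class) and one query, as in Figure 9.
query = torch.randn(1, 3, 224, 224)
supports = [(torch.randn(3, 224, 224), label) for label in range(5)]

model = SiameseAffinityNet().eval()
with torch.no_grad():
    scores = [model(query, img.unsqueeze(0)).item() for img, _ in supports]
predicted_label = supports[scores.index(max(scores))][1]  # argmax affinity
```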
Figure 10. Comparison with reference data. Performance results on different datasets using the ResNet-18 feature extractor. (a) Comparison with MiniImageNet; (b) comparison with CIFAR-100; (c) comparison with CUB-200-2011; (d) comparison with DroNet.
Figure 11. ResNet-18 summary. Comparative summary of average accuracy across the different datasets.
Figure 12. Comparison with reference data. Performance results on different datasets using the EfficientNet-B0 feature extractor. (a) Comparison with MiniImageNet; (b) comparison with CIFAR-100; (c) comparison with CUB-200-2011; (d) comparison with DroNet.
Figure 13. EfficientNet-B0 summary. Comparative summary of average accuracy across the different datasets.
Figure 14. Performance of the state-of-the-art models and the proposed model (SML) on the MiniImageNet dataset, in one-shot (a) and five-shot (b) modes.
Figure 15. Performance of the state-of-the-art models and the proposed model (SML) on the CUB-200-2011 dataset, in one-shot (a) and five-shot (b) modes.
Figure 16. Performance of the state-of-the-art models and the proposed model (SML) on the CIFAR-100 dataset, in one-shot (a) and five-shot (b) modes.
Figure 17. Performance of the state-of-the-art models and the proposed model (SML) on the DroNet dataset, in one-shot (a) and five-shot (b) modes.
Table 1. Intuitive information that is observed in the problem analysis.

Available | Not Available
Position | Types of possible moving obstacles
Orientation | Number of moving obstacles
Velocity | Position of moving obstacles
Acceleration | Known dataset according to the problem for analysis and testing
Image/Video |
Table 2. Summary of advantages that the simplified machine learning model presents compared with other machine learning techniques.

Highlighted Advantage | Overview
Inspired by the process of human cognition | The model mimics the human ability to learn new concepts quickly from limited exposure to a few examples.
Requires less data | The model can learn from a small number of examples, reducing the need for large datasets.
Faster training | The model can be trained faster because it requires fewer examples in the process.
Lower costs | From a data collection perspective, the model is more cost-effective because it requires less data collection and labeling.
Greater adaptability | Due to its flexibility in the amount of information it requires, the model can quickly adapt to new tasks or scenarios.
Efficient learning process | The model uses techniques such as contrastive learning, learning from a similarity function or by transferring previously learned knowledge (see the sketch after this table).
Improves generalization | The model is designed to generalize from limited data, so it can recognize new instances from only a few examples.
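As a concrete instance of the similarity-function learning mentioned in the "Efficient learning process" row, the classic pairwise contrastive loss of Hadsell et al. [26] can be sketched as follows. The margin value is an illustrative choice, and this is not claimed to be the exact loss used in this work.

```python
import torch

def contrastive_loss(f1, f2, same_class, margin=1.0):
    """Pairwise contrastive loss in the style of Hadsell et al. [26].

    f1, f2: embedding tensors of shape (B, D); same_class: float tensor (B,)
    with 1 for positive pairs and 0 for negative pairs.
    """
    d = torch.nn.functional.pairwise_distance(f1, f2)
    # Pull positive pairs together; push negative pairs at least `margin` apart.
    positive_term = same_class * d.pow(2)
    negative_term = (1 - same_class) * torch.clamp(margin - d, min=0).pow(2)
    return 0.5 * (positive_term + negative_term).mean()
```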
Table 3. Identification accuracy. Average accuracy results with 95% confidence of the four datasets against three different affinity methods (A1, A2, and the proposed SML model) using ResNet-18 as a feature extractor. All experiments with the proposed model with the maximum affinity layer A_max are highlighted in bold.

Feature Extractor | Dataset | A1 (1-Shot / 5-Shot) | A2 (1-Shot / 5-Shot) | SML Model, A_max (1-Shot / 5-Shot)
ResNet-18 | MiniImageNet | 62.93 / 69.66 | 62.86 / 67.66 | 67.05 / 81.40
ResNet-18 | CIFAR-100 | 62.83 / 68.50 | 64.30 / 67.99 | 67.14 / 80.17
ResNet-18 | CUB-200-2011 | 70.46 / 77.29 | 70.48 / 76.32 | 73.52 / 83.73
ResNet-18 | DroNet | 59.93 / 66.56 | 59.28 / 63.07 | 64.51 / 80.85
Table 4. Identification accuracy. Average accuracy results with 95% confidence of the four datasets against three different affinity methods (A1, A2, and the proposed SML model) using EfficientNet-b0 as a feature extractor. All experiments with the maximum affinity layer A_max are highlighted in bold.

Feature Extractor | Dataset | A1 (1-Shot / 5-Shot) | A2 (1-Shot / 5-Shot) | SML Model, A_max (1-Shot / 5-Shot)
EfficientNet-b0 | MiniImageNet | 64.14 / 70.47 | 61.65 / 64.49 | 66.90 / 81.71
EfficientNet-b0 | CIFAR-100 | 68.72 / 73.59 | 66.95 / 68.99 | 71.17 / 79.12
EfficientNet-b0 | CUB-200-2011 | 73.58 / 80.38 | 71.83 / 76.90 | 74.34 / 84.42
EfficientNet-b0 | DroNet | 61.99 / 69.14 | 60.08 / 64.28 | 64.35 / 80.85
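For context, "average accuracy with 95% confidence" figures such as those in Tables 3 and 4 are typically estimated by averaging over many randomly sampled evaluation episodes. A minimal sketch follows; the helper name evaluate_one_episode and the episode count of 600 are hypothetical placeholders, not details taken from this paper.

```python
import numpy as np

def mean_accuracy_with_ci(episode_accuracies, z=1.96):
    """Mean episode accuracy and the half-width of its 95% confidence interval."""
    acc = np.asarray(episode_accuracies, dtype=float)
    # Standard error of the mean; z = 1.96 corresponds to a 95% interval.
    half_width = z * acc.std(ddof=1) / np.sqrt(len(acc))
    return acc.mean(), half_width

# Hypothetical usage:
# accs = [evaluate_one_episode(model, dataset, k_shot=1) for _ in range(600)]
# mean, hw = mean_accuracy_with_ci(accs)   # report as "mean ± hw"
```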
Table 5. Comparison with the state of the art using MiniImageNet. The accuracy of the state-of-the-art models and the proposed SML model on the MiniImageNet dataset. Columns show the model, the feature extractor, and the accuracy in the two k-shot learning modes. The best results with our model are highlighted in bold.

Model | Feature Extractor | 1-Shot | 5-Shot
[23] MatchNet | ResNet-12 | 63.08 | 75.99
[41] Meta-SGD | ResNet-50 | 50.47 | 64.66
[22] ProtoNet | ResNet-12 | 62.39 | 68.20
[36] RelationNet | ResNet-34 | 57.02 | 71.07
[40] RENet | ResNet-12 | 67.60 | 82.58
SML (Our model) | ResNet-18 | 67.05 | 81.40
SML (Our model) | EfficientNet-B0 | 66.90 | 81.71
Table 6. Comparison with the state of the art using CUB-200-2011. The accuracy of the state-of-the-art models and the proposed SML model on the CUB-200-2011 dataset. Columns show the model, the feature extractor, and the accuracy in the two k-shot learning modes. The best results with our model are highlighted in bold.

Model | Feature Extractor | 1-Shot | 5-Shot
[23] MatchNet | ResNet-12 | 71.87 | 85.08
[41] Meta-SGD | ResNet-50 | 53.34 | 67.59
[22] ProtoNet | ResNet-12 | 66.09 | 82.50
[36] RelationNet | ResNet-34 | 66.20 | 82.30
[40] RENet | ResNet-12 | 79.49 | 91.11
SML (Our model) | ResNet-18 | 73.52 | 83.73
SML (Our model) | EfficientNet-B0 | 74.34 | 84.42
Table 7. Comparison with the state of the art using CIFAR-100. The accuracy of the state-of-the-art models and the proposed SML model on the CIFAR-100 dataset. Columns show the model, the feature extractor, and the accuracy in the two k-shot learning modes. The best results with our model are highlighted in bold.

Model | Feature Extractor | 1-Shot | 5-Shot
[42] MAML | ResNet-12 | 49.28 | 58.30
[23] MatchNet | ResNet-12 | 50.53 | 60.30
[41] Meta-SGD | ResNet-50 | 53.83 | 70.40
[38] DEML+Meta-SGD | ResNet-50 | 61.62 | 77.94
[35] Dual TriNet | ResNet-18 | 63.41 | 78.43
SML (Our model) | ResNet-18 | 67.14 | 80.17
SML (Our model) | EfficientNet-B0 | 71.17 | 81.12
Table 8. Comparison with the state of the art using DroNet. The accuracy of the state-of-the-art models and the proposed SML model on the DroNet dataset. Columns show the model, the feature extractor, and the accuracy in the two k-shot learning modes. The best results with our model are highlighted in bold.

Model | Feature Extractor | 1-Shot | 5-Shot
[42] MAML | ResNet-12 | 45.59 | 54.61
[23] MatchNet | ResNet-12 | 48.09 | 57.45
[41] Meta-SGD | ResNet-50 | 48.65 | 64.74
[38] DEML+Meta-SGD | ResNet-50 | 62.25 | 79.52
[35] Dual TriNet | ResNet-18 | 63.77 | 80.53
SML (Our model) | ResNet-18 | 64.51 | 81.43
SML (Our model) | EfficientNet-B0 | 64.35 | 80.85
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
