1. Introduction
Self-supervised learning (SSL) methods are alternatives to supervised ones. In recent years, the gap between SSL and supervised methods in performing downstream tasks, including image classification [1], object detection [2], and semantic image segmentation [3,4], has decreased. A general idea behind SSL models for image classification is to train an embedding network, often called an encoder, on an unlabeled dataset and then to use this pretrained encoder for the downstream tasks. The general goal is to ensure the invariance of embeddings to different inputs known as augmentations or views. However, this approach might lead to trivial solutions when the two branches of encoders produce the same output. As a result, one observes an effect known as a collapse in training, when no meaningful representation can be learned for different inputs. Therefore, many recent works tackle this issue by regularizing the networks to avoid such a collapse. Several key approaches have been developed to mitigate these negative impacts, using different tactics. The first group of methods aims to directly maximize the mutual information between the input image and its positive pairs created on the basis of augmentations. In this view, one assumes an exponential prior on the conditional distribution in the representation space and an associated contrastive loss with positive–negative pairs, as in InfoNCE [5]. Unfortunately, such an approach is quite computationally expensive in practice, due to the need for a large batch size to incorporate a large number of negative pairs. The second group of methods aims to avoid collapse by introducing different asymmetries in the two branches at the training stage. Examples of this approach are training one network with gradient descent and updating the other with an exponential moving average of the weights of the first network [6], or introducing regularizers on the learned representations, such as regularization by decorrelation of the dimensions of the embeddings [7]. The third group of methods is Masked Image Modeling (MIM). These methods primarily focus on avoiding collapse and learning rich image representations by predicting the missing parts in masked inputs. This methodology relies on masking a portion of the input image and training the model to predict these masked parts, thereby learning contextual and semantic information. A notable method in this domain is BEiT [8], which introduces a transformer-based model that learns to predict masked visual tokens, analogous to masked language modeling in NLP. Another significant approach is the Masked Autoencoder (MAE) [9], which uses an asymmetric encoder–decoder structure, where the encoder processes only visible patches and the decoder reconstructs the masked patches. While MIM effectively learns representations, it is a transformer-specific approach and is not transferable to other architectures.
The proposed approach avoids the embeddings' collapse by introducing the dependence maximization between trainable embeddings and hand-crafted features using distance correlation [10]. Distance correlation, unlike other losses in the latent space, allows computing dependencies between feature vectors of different shapes. We maximize the dependence between different embeddings while preserving their variance. We show that variance preservation maximizes the entropy of the embeddings, which makes them unique and distinguishable. Our approach differs from InfoNCE [5], which advocates a contrastive loss that maximizes the mutual information (MI) between an input image and its positive pairs. In contrast to InfoNCE, our approach is not contrastive, does not require large batch sizes, and allows computing the distance between embeddings and features of any shape. It also differs from methods such as Barlow Twins [7] and VICReg [11], since we do not explicitly minimize the dependencies between the components within the embedding.
We also show that the proposed approach can be used for efficient representation learning and latent-space-agnostic knowledge distillation. The approach is based on the dependence maximization between the embeddings of the target trainable encoder, represented by ResNet50 [12], and the embeddings of the pretrained encoder, represented by CLIP [13] (based on ViT-B-16 [14]). Since the distance correlation is agnostic to the latent space shape, any pretrained encoder with any latent space can be used for knowledge distillation. To the best of our knowledge, we are the first to propose a model-distillation method that is agnostic to the latent space shape.
The main goal behind MV–MR is twofold: (i) maximizing the invariance of embeddings, i.e., maximizing the proximity of embeddings for the same image observed under different views, and (ii) maximizing the amount of information in each embedding, i.e., maximizing the variability of the embedding. Furthermore, to avoid the collapse during training, we regularize the branch with the augmentations by imposing the dependence constraints on a set of representations extracted from various encodings.
The proposed approach introduces several unique features: (i) we introduce a novel SSL approach that avoids collapse thanks to an additional regularization term that maximizes the dependence between trainable embeddings and various feature vectors using distance correlation; (ii) to the best of our knowledge, the proposed method is among the first to use dependence maximization in the latent space based on distance correlation for SSL; (iii) the proposed method is agnostic to the latent space shape and, thus, can be used with any type of features; (iv) we introduce a novel knowledge distillation technique that is agnostic to the model and the shape of the latent space; (v) we demonstrate state-of-the-art classification results on the STL10 [15] (89.71%) and CIFAR20 [16] (73.2%) datasets using a linear evaluation protocol for non-contrastive SSL methods; (vi) we provide an information-theoretic explanation of the proposed method that contributes to explainable ML; (vii) we demonstrate how the complex CLIP model with 86.2 M parameters trained on 400 M text–image pairs can be distilled into a ResNet50 model with just 23.5 M parameters trained on the STL10, CIFAR100 [17], and ImageNet-1k [18] datasets; (viii) we achieve state-of-the-art performance in knowledge distillation for the image-classification task using the ResNet50 model as student and CLIP ViT-B-16 as teacher on CIFAR100, with 78.6% accuracy.
We have three loss terms in our objective function: (a) the first term consists of the mean square error (MSE) loss between the embeddings from the non-augmented view and augmented views of the same image; it is used for the invariance of embeddings, and we introduce additional variation terms that are used for the maximization of the variability of the embeddings (we demonstrate that this term originates from an upper bound on the mutual information between these embeddings under corresponding assumptions); (b) the second term stands for the distance correlation between the embeddings from the augmented and non-augmented views, which complements the first term to capture non-linear relations between the embeddings; and (c) the third term corresponds to the distance correlation between the embeddings from the augmented view and multiple image representations. For the non-learnable or hand-crafted representations, we have studied various techniques of invariant data representation that are well known in computer vision and image-processing applications. The studied hand-crafted features include, but are not limited to, ScatNet [19] features, local standard deviation (LSD)-based [20] filters, and histograms of oriented gradients (HOG) [21]. Additionally, to demonstrate the flexibility of the proposed method, we have also considered random augmentations of the original images flattened into feature vectors as instances of hand-crafted features. Since distance correlation is shape-agnostic with respect to the features, we are able to combine features of different shapes in the loss functions. Moreover, replacing hand-crafted features with embeddings from pretrained networks yields model distillation, without the need to change the losses, architecture, or feature dimensionality.
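For illustration, here is a minimal PyTorch sketch of such a three-term objective. It is a sketch under our assumptions, not the exact published loss: the weights `lambda_`, `mu`, and `nu` are placeholders, the variance term uses a hinge form with the margin set to 1 (cf. Section 7), and `distance_correlation` refers to the sample estimator sketched in Section 2.2.1.

```python
import torch
import torch.nn.functional as F

def variance_hinge(z: torch.Tensor, margin: float = 1.0, eps: float = 1e-4) -> torch.Tensor:
    # Keep the per-dimension standard deviation over the batch above `margin`,
    # preserving embedding variability and preventing collapse.
    std = torch.sqrt(z.var(dim=0) + eps)
    return F.relu(margin - std).mean()

def mvmr_loss(z, z_aug, handcrafted, lambda_=1.0, mu=1.0, nu=1.0):
    """Sketch of the three-term MV-MR objective (term weights are placeholders).

    z, z_aug:    (B, D) embeddings of non-augmented / augmented views.
    handcrafted: list of (B, D_k) flattened hand-crafted feature batches.
    """
    # (a) invariance (MSE) plus variance preservation of both embeddings.
    l_inv = F.mse_loss(z, z_aug) + variance_hinge(z) + variance_hinge(z_aug)
    # (b) non-linear dependence between the two embeddings, maximized by
    # minimizing (1 - dCor).
    l_dcor = 1.0 - distance_correlation(z, z_aug)
    # (c) dependence between augmented embeddings and each hand-crafted feature.
    l_feat = sum(1.0 - distance_correlation(z_aug, h) for h in handcrafted)
    return lambda_ * l_inv + mu * l_dcor + nu * l_feat
```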
2. MV–MR: Motivation and Intuition
MV–MR pretraining and distillation schemes are schematically shown in Figure 1 and Figure 2, respectively. The dimensions of the embeddings with and without augmentations are the same, i.e., $\tilde{\mathbf{z}} \in \mathbb{R}^{D}$ and $\mathbf{z} \in \mathbb{R}^{D}$, respectively. These embeddings are extracted from the augmented view $\tilde{\mathbf{x}}$ and the non-augmented view $\mathbf{x}$ via a generalized parametrized embedder $f_{\boldsymbol{\theta}}(\cdot)$ that can be deterministic or stochastic with parameters $\boldsymbol{\theta}$. The encoder can be a parametrized neural network of any architecture. A hand-crafted descriptor $\mathbf{h}_k$, where $k = 1, \dots, K$ and $K$ stands for the total number of hand-crafted descriptors, is generally a tensor and is flattened to a vector $\mathbf{h}_k \in \mathbb{R}^{D_k}$. This descriptor is generally obtained via a deterministic assignment $\mathbf{h}_k = g_k(\mathbf{x})$ or sometimes via a stochastic mapping $\mathbf{h}_k \sim p_{\boldsymbol{\phi}_k}(\mathbf{h}_k \mid \mathbf{x})$, where $\boldsymbol{\phi}_k$ denotes the parameters of the $k$-th feature extractor.
2.1. Motivation: Regularization in Self-Supervised Representation Learning
The learned representation should contain an informative representation of the data with lower dimensionality and should be invariant under some transformations, i.e., it should ensure the same latent representation for data from the same sample passed through certain transformations. Satisfying these conflicting requirements in practice is not a trivial task. Many state-of-the-art SSL techniques try to find a reasonable compromise between these requirements and practical feasibility solely within the scope of a machine learning formulation by imposing certain constraints on the properties of the learned representation via the optimization of encoder parameters under augmentations.
At the same time, there exists a rich body of achievements in the computer vision community in the domain of the hand-crafted design of robust, invariant, yet discriminating data representations [21,22,23,24]. Generally, computer vision descriptors are very rich in terms of targeted invariant features and quite efficient in terms of computation. However, to the best of our knowledge, such descriptors are not yet fully integrated into the development of SSL techniques. Therefore, one of the objectives of this paper is to propose a framework where SSL representation learning might be coupled with constraints on the embedding space offered by invariant computer vision representations. Our objective is not to consider a case-by-case approach on how to couple SSL with a particular computer vision representation but instead to propose a generic approach where any form of desirable computer vision representation can be integrated into the SSL optimization problem in an easy and tractable way. This ensures that the learned representation possesses the targeted properties inherited from the used computer vision descriptors. Furthermore, features extracted by such descriptors might be considered a form of invariant data representation, which is one of the desired properties of trained encoders. Thus, maximizing the dependence between the trainable embedding and such a representation might be a novel form of regularization, leading to an increased-invariance yet collapse-avoiding technique. Since a single computer vision descriptor might not capture all desirable properties, and since descriptors come in different representation formats, the targeted framework should be flexible enough to deal uniformly with all these descriptors within a simple optimization problem. Distance correlation is very useful for this kind of representation learning, since it allows one to incorporate features of any shape, without the need to match the shape of learnable embeddings and hand-crafted target embeddings.
In summary, our motivation is to include regularization constraints on the solution by borrowing some inherent computer vision feature invariance to certain transformations. In this way, we target learning the low-dimensional embedding, which contains only essential information about the data that might be of interest for the targeted downstream task and where all information about the augmentations is excluded.
2.2. Intuition
The basic idea behind MV–MR is to introduce constraints on the invariance of the embedding via a new loss function. Our overall objective is to maximize the mutual information between the augmented embedding $\tilde{\mathbf{z}}$ and the embedding $\mathbf{z}$ without the augmentation and to maximize the mutual information between $\tilde{\mathbf{z}}$ and some invariant feature $\mathbf{h}_k$ extracted from $\mathbf{x}$ using a mapper $g_k(\cdot)$ that ensures a known invariance to the desired transformation.
2.2.1. Measuring Dependencies between Embeddings of Non-Augmented and Augmented Data
Upper bound on mutual information: In the first case, one can decompose the mutual information as
$$I(\mathbf{Z}; \tilde{\mathbf{Z}}) = \mathbb{E}\left[\log \frac{p(\mathbf{z}, \tilde{\mathbf{z}})}{p(\mathbf{z})\, p(\tilde{\mathbf{z}})}\right] = h(\tilde{\mathbf{Z}}) - h(\tilde{\mathbf{Z}} \mid \mathbf{Z}), \quad (1)$$
where $h(\tilde{\mathbf{Z}})$ denotes the differential entropy and $h(\tilde{\mathbf{Z}} \mid \mathbf{Z})$ denotes the conditional differential entropy (we assume that the differential entropy is non-negative under the considered settings). Since the computation of the marginal distribution $p(\tilde{\mathbf{z}})$ and the conditional distribution $p(\tilde{\mathbf{z}} \mid \mathbf{z})$ is difficult in practice, we proceed by bounding these terms. We assume that the desired embeddings need to be bounded by some variance to avoid a training collapse, when the encoders produce constant and non-informative vectors, so that the entropy-maximizing distribution for the first entropy term is the Gaussian one, i.e., $\tilde{\mathbf{Z}} \sim \mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma})$, where $\boldsymbol{\Sigma}$ represents the covariance matrix.
The conditional entropy is minimized when the embedding $\tilde{\mathbf{z}}$ contains as much information as possible about $\mathbf{z}$, i.e., when the two vectors are dependent. Assuming that $p(\tilde{\mathbf{z}} \mid \mathbf{z}) = \frac{1}{C} \exp(-\lambda\, d(\tilde{\mathbf{z}}, \mathbf{z}))$, where $d(\cdot, \cdot)$ denotes some distance between two vectors, such as the $\ell_2$-norm for the Gaussian distribution or the $\ell_1$-norm for the Laplacian one, $C$ stands for the normalization constant, and $\lambda$ denotes a scaling parameter, the minimization of the conditional entropy reduces to the minimization of the distance $d(\tilde{\mathbf{z}}, \mathbf{z})$.
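To make the last step explicit (a short derivation under the exponential-family assumption stated above, in our notation):

```latex
h(\tilde{\mathbf{Z}} \mid \mathbf{Z})
  = -\,\mathbb{E}\left[\log p(\tilde{\mathbf{z}} \mid \mathbf{z})\right]
  = -\,\mathbb{E}\left[\log \tfrac{1}{C} \exp\left(-\lambda\, d(\tilde{\mathbf{z}}, \mathbf{z})\right)\right]
  = \lambda\, \mathbb{E}\left[d(\tilde{\mathbf{z}}, \mathbf{z})\right] + \log C .
```

Hence, with $d$ chosen as the squared $\ell_2$ distance, minimizing the empirical MSE between $\tilde{\mathbf{z}}$ and $\mathbf{z}$ directly minimizes the conditional entropy term, which is the form used in the first loss term.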
Distance covariance: Another way to measure the dependency between the data is based on distance covariance, as proposed by [10]. In the general case of dependence between the data, the distance covariance is non-invariant to strictly monotonic transformations, unlike mutual information. Nevertheless, the distance covariance has several attractive properties: (i) it can be efficiently computed for two vectors that have generally different dimensions $D_1$ and $D_2$, such that $D_1 \neq D_2$, and (ii) it is easier to compute in practice in contrast to the mutual information. Additionally, the distance covariance captures higher-order dependencies between the data, in contrast to the Pearson correlation. The distance covariance, proposed by [10], is defined as
$$\mathrm{dCov}^2(\mathbf{Z}, \tilde{\mathbf{Z}}) = \frac{1}{c_{D_1} c_{D_2}} \int_{\mathbb{R}^{D_1 + D_2}} \frac{\left| \varphi_{\mathbf{Z}, \tilde{\mathbf{Z}}}(\mathbf{t}, \mathbf{s}) - \varphi_{\mathbf{Z}}(\mathbf{t})\, \varphi_{\tilde{\mathbf{Z}}}(\mathbf{s}) \right|^2}{\|\mathbf{t}\|^{1+D_1}\, \|\mathbf{s}\|^{1+D_2}} \, d\mathbf{t}\, d\mathbf{s}, \quad (2)$$
where $c_{D_1}$ and $c_{D_2}$ are normalization constants, which measures the distance between the joint characteristic function $\varphi_{\mathbf{Z}, \tilde{\mathbf{Z}}}(\mathbf{t}, \mathbf{s})$ and the product of the marginal characteristic functions $\varphi_{\mathbf{Z}}(\mathbf{t})\, \varphi_{\tilde{\mathbf{Z}}}(\mathbf{s})$ [10]. This definition has a lot of similarities to the mutual information in (1), which measures the ratio between the joint distribution $p(\mathbf{z}, \tilde{\mathbf{z}})$ and the product of marginals $p(\mathbf{z})\, p(\tilde{\mathbf{z}})$. Since $\varphi_{\mathbf{Z}, \tilde{\mathbf{Z}}}(\mathbf{t}, \mathbf{s}) = \varphi_{\mathbf{Z}}(\mathbf{t})\, \varphi_{\tilde{\mathbf{Z}}}(\mathbf{s})$ when $\mathbf{Z}$ and $\tilde{\mathbf{Z}}$ are independent random vectors, the distance covariance is equal to zero in that case.
In the following, we proceed with the normalized version of the distance covariance, known as distance correlation, defined as
$$\mathrm{dCor}(\mathbf{Z}, \tilde{\mathbf{Z}}) = \frac{\mathrm{dCov}(\mathbf{Z}, \tilde{\mathbf{Z}})}{\sqrt{\mathrm{dVar}(\mathbf{Z})\, \mathrm{dVar}(\tilde{\mathbf{Z}})}}, \quad (3)$$
where $\mathrm{dVar}(\mathbf{Z}) = \mathrm{dCov}(\mathbf{Z}, \mathbf{Z})$ and $\mathrm{dVar}(\tilde{\mathbf{Z}}) = \mathrm{dCov}(\tilde{\mathbf{Z}}, \tilde{\mathbf{Z}})$.
The sample distance covariance, for a given $\mathbf{Z} = [\mathbf{z}_1, \dots, \mathbf{z}_B]^\top$, denoting a batch of size $B$ of embeddings from original views, and $\tilde{\mathbf{Z}} = [\tilde{\mathbf{z}}_1, \dots, \tilde{\mathbf{z}}_B]^\top$, referring to a batch of embeddings from augmented views, is defined as
$$\mathrm{dCov}_B^2(\mathbf{Z}, \tilde{\mathbf{Z}}) = \frac{1}{B^2} \sum_{i=1}^{B} \sum_{j=1}^{B} A_{ij}\, \tilde{A}_{ij}. \quad (4)$$
In Equation (4), we use the notations $A_{ij} = a_{ij} - \bar{a}_{i\cdot} - \bar{a}_{\cdot j} + \bar{a}_{\cdot\cdot}$, where $a_{ij} = \|\mathbf{z}_i - \mathbf{z}_j\|_2$, and $\tilde{A}_{ij} = \tilde{a}_{ij} - \bar{\tilde{a}}_{i\cdot} - \bar{\tilde{a}}_{\cdot j} + \bar{\tilde{a}}_{\cdot\cdot}$, where $\tilde{a}_{ij} = \|\tilde{\mathbf{z}}_i - \tilde{\mathbf{z}}_j\|_2$; here $\bar{a}_{i\cdot}$, $\bar{a}_{\cdot j}$, and $\bar{a}_{\cdot\cdot}$ denote the $i$-th row mean, the $j$-th column mean, and the grand mean of the distance matrix, respectively. Finally, the sample distance correlation is defined as
$$\mathrm{dCor}_B(\mathbf{Z}, \tilde{\mathbf{Z}}) = \frac{\mathrm{dCov}_B(\mathbf{Z}, \tilde{\mathbf{Z}})}{\sqrt{\mathrm{dVar}_B(\mathbf{Z})\, \mathrm{dVar}_B(\tilde{\mathbf{Z}})}}, \quad (5)$$
with $\mathrm{dVar}_B(\mathbf{Z}) = \mathrm{dCov}_B(\mathbf{Z}, \mathbf{Z})$ and $\mathrm{dVar}_B(\tilde{\mathbf{Z}}) = \mathrm{dCov}_B(\tilde{\mathbf{Z}}, \tilde{\mathbf{Z}})$.
2.2.2. Dependence between Embeddings of Augmented Data and Multiple Hand-Crafted Representations
The second mutual information term, between $\tilde{\mathbf{z}}$ and some invariant feature $\mathbf{h}_k$, deals with vectors of different dimensions. Thus, one can either map these vectors to the same dimension and apply the above arguments, use the Hilbert–Schmidt proxy [25], or proceed with the distance correlation dependence measure for uniformity of consideration. We focus on the distance correlation case due to its ability to handle vectors of different dimensions and to capture higher-order data statistics.
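Continuing the sketch above, the shape-agnostic property means features of different dimensionality plug in directly (the dimensions below are made up, for illustration only):

```python
# Batch of trainable embeddings and a flattened hand-crafted descriptor;
# only the batch size matches, the feature dimensions differ.
z_aug = torch.randn(256, 8192)   # embeddings of augmented views
h_hog = torch.randn(256, 5400)   # e.g., flattened HOG features

dep = distance_correlation(z_aug, h_hog)  # scalar in [0, 1], maximized in training
```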
3. Related Work
Pretext task methods. The main idea behind these methods is to design a specific task, a.k.a. a pretext task, for the dataset that contains some “labels” of the pretext task without having any access to the labels of the target task. Such pretext tasks include, but are not limited to, applying and predicting the parameters of geometric transformations [26], jigsaw puzzle solving [27], inpainting [28] and colorization [29] of images, and reversing augmentations. Typically, the pretext task methods have been coupled with other SSL techniques in recent years [30,31,32].
Contrastive methods. Most contrastive SSL methods are based on different extensions of the InfoNCE [5] formulation. The InfoNCE method is based on the direct maximization of the mutual information between the input image and its positive pairs via the minimization of a contrastive loss. Examples of contrastive methods are SimCLR [33], SwAV [34], and DINO [35].
Clustering methods. Clustering-based SSL methods are based on the idea of assigning cluster labels to the learned representations in an unsupervised manner with some regularization, such as maintaining the uniformity of these cluster labels. The DeepCluster [36] method iteratively groups the features from the encoder using standard k-means clustering and then uses the cluster assignments as supervision to update the weights of the encoder in the next iterations. SwAV [34] and DINO [35] are other notable clustering-based SSL methods that combine contrastive learning and clustering by clustering the data while requiring the same cluster assignment for different views of the same image.
Distillation methods. Distillation-based SSL methods like BYOL [37], SimSiam [6], and others use a teacher–student type of training, where the student network is trained with gradient descent, while the teacher network is not updated with gradient descent but rather with an exponential moving-average update or another method. Such a design is used to avoid collapse.
Collapse-preventing methods. Similar to distillation, collapse-preventing methods try to prevent the collapse by using a special regularization of the embeddings. The Barlow Twins [7] method aims to make the covariance matrix of the embeddings an identity matrix. This means that each dimension of the embeddings should be decorrelated from all other dimensions. Additionally, the minimum variance of the embeddings per dimension within the batch is constrained. The VICReg [11] method extends the Barlow Twins [7] approach by imposing an additional constraint on the distance between the embeddings with and without augmentations.
Masked Image Modeling. Masked Image Modeling (MIM) for self-supervised learning has emerged as a compelling approach, diverging from traditional methods like pretext task, contrastive, or clustering methods. Central to MIM is the principle of intentionally masking portions of an input image and training a model to predict these occluded parts. This process enables the model to learn valuable representations of the data without relying on explicit labels. Unlike contrastive learning methods like SimCLR or SwAV that require negative samples, MIM directly utilizes the spatial coherence of images to enhance the model’s ability to recognize and predict the structure within masked areas. Pioneering examples include the BEiT [8] algorithm, which employs a transformer architecture to predict masked visual tokens, drawing inspiration from masked language modeling. Another notable implementation is the Masked Autoencoder (MAE) [9], which uses an asymmetric encoder–decoder structure to efficiently reconstruct masked patches. These approaches contrast with distillation methods like BYOL, where a teacher–student model is used, and with clustering methods like DeepCluster that focus on feature clustering. MIM’s uniqueness lies in its direct engagement with the raw image data, offering a pathway to learn intricate image features in a self-supervised manner without the need for complex negative-sample handling or clustering mechanisms.
Knowledge distillation. Knowledge distillation [38] is a type of model optimization where a simple small model (the student model) is trained to match a bigger complex model (the teacher model). There are multiple types of knowledge-distillation schemes: offline distillation [39], online distillation [40], and self-distillation [41]. There are also multiple types of knowledge used for distillation: response-based knowledge [39,42], feature-based knowledge [43], and others. We show how our method can be used for offline feature-based knowledge distillation.
7. Implementation Details
The architecture of MV–MR is similar to the ones used in other SSL methods such as Barlow Twins [7], VICReg [11], and others. The model $f_{\boldsymbol{\theta}}$, shown in Figure 1, consists of two main parts: (i) the encoder, which is used for the downstream tasks, and (ii) the projector, which maps the encoder outputs to the embeddings used in the training loss functions (Figure 1). In our experiments, we use the standard ResNet50 [12], available in the PyTorch library [55], as the encoder, and a projector that consists of two linear layers of size 8192, each followed by batch normalization and ReLU, and an output linear layer.
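A minimal PyTorch sketch of this encoder–projector pair follows; the 8192-d width of the final output layer is our assumption, since the text only specifies the two hidden layers.

```python
import torch.nn as nn
from torchvision.models import resnet50

# Encoder: standard ResNet50 with its classification head removed, so the
# 2048-d pooled features are exposed for the projector and downstream tasks.
encoder = resnet50()
encoder.fc = nn.Identity()

# Projector: two 8192-d linear layers, each followed by batch normalization
# and ReLU, plus an output linear layer (width assumed to also be 8192).
projector = nn.Sequential(
    nn.Linear(2048, 8192), nn.BatchNorm1d(8192), nn.ReLU(inplace=True),
    nn.Linear(8192, 8192), nn.BatchNorm1d(8192), nn.ReLU(inplace=True),
    nn.Linear(8192, 8192),
)
```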
We use computer vision feature-extraction methods applied to the original data: the original RGB image (flattened into a feature vector), ScatNet features of the image [56], randomly augmented images flattened into a feature vector, the histogram of oriented gradients (HOG), and the local standard deviation filter (LSD filter) [20].
ScatNet transform: ScatNet [19,56] is a class of Convolutional Neural Networks (CNNs) with a set of useful properties: (i) deformation stability, (ii) fixed weights, (iii) sparse representations, and (iv) interpretable representations.
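As an illustration, such scattering features can be computed with the kymatio library; the scale J, the number of orientations L, and the input size below are illustrative defaults, not the parameters used in our experiments.

```python
import torch
from kymatio.torch import Scattering2D

# Scattering transform with illustrative parameters: 2 scales, 8 orientations,
# applied to a batch of 32x32 grayscale images.
scattering = Scattering2D(J=2, shape=(32, 32), L=8)
x = torch.randn(16, 1, 32, 32)
features = scattering(x)                   # scattering coefficients per image
features = features.flatten(start_dim=1)   # flatten into per-image feature vectors
```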
Randomly augmented image: In our experiments, we have applied the following augmentations to the image: random cropping, horizontal flipping, random color augmentations, grayscale, and Gaussian blur. Then, the image is flattened into a one-dimensional feature vector.
HOG: The histogram of oriented gradients (HOG) [21] is a feature descriptor based on counting the occurrences of gradient orientations in localized portions of an image.
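For instance, HOG features can be extracted with scikit-image's hog; mapping the "number of bins" to orientations and the "pool size" to the cell size is our interpretation of the parameters listed below, and the image size is a placeholder.

```python
import numpy as np
from skimage.feature import hog

image = np.random.rand(96, 96)  # a grayscale image, e.g., STL10-sized
features = hog(image, orientations=24, pixels_per_cell=(8, 8),
               cells_per_block=(1, 1), feature_vector=True)  # 1-D HOG descriptor
```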
LSD filter: A local standard deviation filter [20] computes the standard deviation within a defined region slid over the image. The region is usually a rectangular window of a fixed size in pixels.
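A minimal sketch of such a filter in PyTorch follows, assuming a square window; the 3x3 default below is a placeholder, not our actual setting.

```python
import torch
import torch.nn.functional as F

def local_std_filter(img: torch.Tensor, kernel: int = 3) -> torch.Tensor:
    """Local standard deviation over a sliding square window.

    img: (B, C, H, W) tensor; `kernel` is a hypothetical default window size.
    Uses std = sqrt(E[x^2] - E[x]^2) computed with average pooling.
    """
    pad = kernel // 2
    mean = F.avg_pool2d(img, kernel, stride=1, padding=pad, count_include_pad=False)
    mean_sq = F.avg_pool2d(img * img, kernel, stride=1, padding=pad, count_include_pad=False)
    var = (mean_sq - mean * mean).clamp(min=0.0)  # guard against tiny negatives
    return torch.sqrt(var)
```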
We use the PyTorch framework [55] for the implementation of the proposed approach. We use ScatNet with fixed scattering parameters, and we use the HOG feature extractor with the following parameters: number of bins, 24, and pool size, 8. We use a fixed square kernel in the STD filter. As augmentations, both for the image representations and for the input to the encoder, we use randomly resized cropping; random horizontal flipping; random color-jittering of brightness, contrast, saturation, and hue; random grayscale; and Gaussian blur with a kernel size proportional to the image size and mean 0.
For the losses, the margin parameter in (7) is set to 1.
During the self-supervised pretraining experiments that are presented in Table 1 and Table 2, we train models for 1000 epochs with batch size 256, gradient accumulation every 4 steps, the Adam [57] optimizer, a cosine learning-rate schedule, and 16-bit precision. During linear evaluation, we train a single-layer linear model for 100 epochs with batch size 256 and the Adam optimizer. During semi-supervised evaluation on ImageNet-1K, we train a model for 100 epochs with batch size 128 and the Adam optimizer. During knowledge distillation, we train the model for 200 epochs with batch size 512, the Adam optimizer, a cosine learning-rate schedule, and 16-bit precision.
During training, the weighting parameters of the loss terms in the objective function are kept fixed.
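To make the distillation setup concrete, here is a hedged sketch of a single distillation step under the setup above. The model handles are placeholders, the pure dCor loss (with no additional terms) is our simplification, and `distance_correlation` is the estimator sketched in Section 2.2.1.

```python
import torch

def distillation_step(student, teacher, images, optimizer):
    """One offline feature-based distillation step (illustrative sketch).

    student: trainable ResNet50 returning, e.g., (B, 2048) embeddings.
    teacher: frozen CLIP ViT-B-16 image encoder returning (B, 512) embeddings.
    The mismatched dimensions are handled natively by distance correlation.
    """
    with torch.no_grad():
        t_emb = teacher(images)           # frozen teacher embeddings
    s_emb = student(images)               # trainable student embeddings
    loss = 1.0 - distance_correlation(s_emb, t_emb)  # maximize dependence
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.detach()
```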