Next Article in Journal
Integration of 24 Feature Types to Accurately Detect and Predict Seizures Using Scalp EEG Signals
Next Article in Special Issue
A Novel Fault Diagnosis Method for Rotating Machinery Based on a Convolutional Neural Network
Previous Article in Journal
Displacements Study of an Earth Fill Dam Based on High Precision Geodetic Monitoring and Numerical Modeling
Previous Article in Special Issue
Cross-Participant EEG-Based Assessment of Cognitive Workload Using Multi-Path Convolutional Recurrent Neural Networks
 
 
Font Type:
Arial Georgia Verdana
Font Size:
Aa Aa Aa
Line Spacing:
Column Width:
Background:
Article

Unsupervised Learning for Monaural Source Separation Using Maximization–Minimization Algorithm with Time–Frequency Deconvolution

1
School of Electrical and Electronic Engineering, Newcastle University, Newcastle upon Tyne NE1 7RU, UK
2
School of Automation Engineering, University of Electronic Science and Technology of China, Chengdu 610054, China
3
Department of Computer and Information Sciences, Northumbria University, Newcastle upon Tyne NE1 8ST, UK
4
Faculty of Information Engineering, Guangdong University of Technology, Guangzhou 510006, China
5
Faculty of Science Agriculture and Engineering, Newcastle University, Singapore 599493, Singapore
*
Author to whom correspondence should be addressed.
This paper is an extended version of Yu, K.; Woo, W.L.; Dlay, S.S. Variational regularized two-dimensional nonnegative matrix factorization with the flexible β4-divergence for single channel source separation. In Proceedings of the 2nd IET International Conference in Intelligent Signal Processing (ISP), London, UK, 1–2 December 2015.
Sensors 2018, 18(5), 1371; https://doi.org/10.3390/s18051371
Submission received: 9 April 2018 / Revised: 22 April 2018 / Accepted: 24 April 2018 / Published: 27 April 2018
(This article belongs to the Special Issue Sensor Signal and Information Processing)

Abstract

:
This paper presents an unsupervised learning algorithm for sparse nonnegative matrix factor time–frequency deconvolution with optimized fractional β -divergence. The β -divergence is a group of cost functions parametrized by a single parameter β . The Itakura–Saito divergence, Kullback–Leibler divergence and Least Square distance are special cases that correspond to β = 0 ,   1 ,   2 , respectively. This paper presents a generalized algorithm that uses a flexible range of β that includes fractional values. It describes a maximization–minimization (MM) algorithm leading to the development of a fast convergence multiplicative update algorithm with guaranteed convergence. The proposed model operates in the time–frequency domain and decomposes an information-bearing matrix into two-dimensional deconvolution of factor matrices that represent the spectral dictionary and temporal codes. The deconvolution process has been optimized to yield sparse temporal codes through maximizing the likelihood of the observations. The paper also presents a method to estimate the fractional β value. The method is demonstrated on separating audio mixtures recorded from a single channel. The paper shows that the extraction of the spectral dictionary and temporal codes is significantly more efficient by using the proposed algorithm and subsequently leads to better source separation performance. Experimental tests and comparisons with other factorization methods have been conducted to verify its efficacy.

1. Introduction

Blind source separation (BSS) [1,2,3,4,5,6,7,8] is an ill-posed problem that cannot be totally solved without some prior information. This entails a certain number of assumptions have to be imposed to render the problem solvable such as channel type (linear [1] versus nonlinear [2]), mutual statistical independence among the sources [3], the number of sources [4], how the sources are mixed (instantaneous [5] versus convolutive [6]), and the location of the sources with respect to the microphones. Several recent solutions have been developed to mitigate some of these constraints. In the work [7], it was previously shown that non-Gaussian stationary process can be approximated as non-stationary Gaussian process which enabled separation involving mixtures of non-Gaussian sources. Of similar concept, a method is proposed for separation by decorrelating multiple non-stationary stochastic sources using a multivariable crosstalk-resistant adaptive noise canceller [8]. In a related method, the problem of speech quality enhancement is tackled using adaptive and non-adaptive filtering algorithms [9]. A two-microphone Gauss–Seidel pseudo affine projection algorithm combined with forward blind source separation is proposed. A higher efficiency in speech enhancement in noisy environment has been attained. The paper [10] proposes rational polynomial functions to replace the original score functions used in standard independent component analysis (ICA). The rational polynomials are derived by the Pade approximant from Taylor series expansion of the original nonlinearities which can be quickly evaluated to enable large-scale multidimensional sets of data characterized by super-Gaussian distribution to be separated within a short period of time. Recently, a bi-variate empirical mode decomposition algorithm combined with complex ICA by entropy bound minimization technique is proposed for convolutive signal separation [11]. In telecommunication problems, neither the direction of arrival (DOA) nor a training sequence is assumed to be available at the receiver. The only assumption is that the transmitted signals satisfy the constant modulus property. In the work [12], a multistage space–time equalizer is proposed to blindly separate signals received by an antenna array from different sources simultaneously. In the algorithm, each stage consists of an adaptive beamformer, a DOA estimator and an equalizer which are jointly optimized using the constant modulus property of the sources. Other than statistical independence and non-Gaussianity, signal separation approach based on second-order statistics of the speech signals using canonical correlation approach [13] has also been proposed. The work [14] considers complex-valued mixing matrix estimation and direction-of-arrival estimation of synchronous orthogonal frequency hopping signals in the underdetermined blind source separation (UBSS). A mixing matrix estimation algorithm is proposed by detecting single source points where only one source contributes its power. While traditional algorithms are usually applied in the ideal sparse environment, the work [15] proposes a solution where multiple input multiple output mixed signals are insufficiently sparse in both time and frequency domains under noisy conditions. The work [16] demonstrates the application of UBSS in addresses the mixing of pipe abrasive debris problem and focuses on the superimposed abrasive debris separation of a radial magnetic field abrasive sensor. Through accurately separating and calculating the morphology and amount of the abrasive debris, the abrasive sensor has provided the system with wear trend and sizes estimation of the wear particles.
In recent years, an alternate class of solutions for BSS based on nonnegative matrix factorization (NMF) [17] has been proposed. Compared to ICA, NMF gives a more part based decomposition and the decomposition is unique under certain conditions, making it unnecessary to impose the constraints in the form of orthogonality and independence [18]. These properties have led to a significant interest in NMF lately for its application in areas of BSS [5,19,20,21,22,23,24], pattern recognition [25], and dimensionality reduction [26]. Multiplicative update-based families of parameterized cost functions such as the Csiszar’s divergences [27,28] were also presented. The NMF is a matrix decomposition technique. Let the data matrix V be a nonnegative matrix of dimensions I × J . The aim of NMF is to find two matrices W and H such that:
V = W H
or in scalar form,
V i , j = k W i , k H k , j
where i = 1, 2, ..., k = 1 ,   2 , ,   K , and j = 1 ,   2 ,   ,   J . When W and H are nonnegative matrices of dimensions I × K and K × J , then is usually chosen such that
I × K + K × J I × J
A sparseness constraint can be added to the cost function [26,27,28,29,30,31], and this can be achieved by regularization using the L1-norm leading to Sparse NMF (SNMF). Here, “sparseness” refers to a representational scheme where only a few units (out of a large population) are effectively used to represent typical data vectors. In effect, this implies most units taking values close to zero while only few take significantly non-zero values. Several other types of prior distribution over W and H can be defined, e.g., it is assumed that the prior of W and H satisfy the exponential density and the prior for the noise variance is chosen as an inverse gamma density [27]. In the work [28], Gaussian distributions are chosen for both W and H. The model parameters and hyper parameters are adapted by using the Markov chain Monte Carlo (MCMC) [32]. In all cases, a fully Bayesian treatment is applied to approximate inference for both model parameters and hyper parameters. While these approaches increase the accuracy of matrix factorization, it only works efficiently when a large sample dataset is available. Moreover, it consumes significantly high computational complexity at each iteration to adapt the parameters and its hyper parameters. The NMF with the β-divergence has been previously used in music signal processing [33,34]. In our previous paper [35], we investigated β-divergence for source separation problem. It was shown that improved performance has been attained over integer-based β-divergence. Thus, this motivates research of using β-divergence for music signal processing and source separation. However, all of these works fixed β to some constant values within 0–2, and have not presented any method to determine the desired β value. This significantly constrains the performance of matrix factorization and its ability in separating mixed sources. In addition, these works do not consider the issue of sparsity of the temporal codes which would undermine the quality of matrix factorization when the β value is inappropriately chosen. The selection of the β value should consider the sparseness constraint used in the cost function.
Regardless of the cost function and sparseness constraint being used, the standard NMF or SNMF models are only satisfactory for solving source separation provided that the spectral frequencies of the analyzed audio signal do not change over time. However, this is not the case for many realistic signals such as music and speech. As a result, the spectral dictionary obtained via the NMF or SNMF decomposition is not adequate to capture the temporal dependency of the frequency patterns within the signal. To remedy the situation, a pragmatic approach is to work on a more holistic model based on matrix factor deconvolution [21,22,23,24]. In this paper, we work with NMF model extended to two-dimensional time–frequency deconvolution of W and H where (W, H) are considered as the matrix factors [22]. Mathematically, this is expressed as
V i , j = k = 1 K τ = 0 τ m a x ϕ = 0 ϕ m a x W i ϕ , k τ H k , j τ ϕ V = τ = 0 τ m a x ϕ = 0 ϕ m a x W τ ϕ H ϕ τ
where i and i represent the frequency and time index, respectively, k indicates the factor number, τ represents the temporal shift and ϕ is the frequency shift. The terms τ m a x and ϕ m a x are the maximum temporal and frequency shift, respectively. With this definition, both W i , k τ and H k , j ϕ have tensorial structures with dimension I × K × τ m a x and K × J × ϕ m a x , respectively. Thus, W i , k τ represents the τ th -slice of the k th -spectral basis while H k , j ϕ represents the associated ϕ th -slice of the k th -temporal code. The downward and rightward arrow signs denote the corresponding shifting direction of each column in W τ and each row in H ϕ by the amount indicated by τ and ϕ , respectively.
Model (4) represents both temporal structure and the pitch change which occur when an instrument plays different notes. In the log-frequency spectrogram, the pitch change corresponds to a displacement on the frequency axis. Where previous NMF methods needed one component to model each note for each instrument, Model (4) represents each instrument compactly by a single time–frequency profile convolved in both time and frequency by a time–pitch weight matrix. This model dramatically decreases the number of components needed to model various instruments and effectively solves the blind single channel source separation problem for certain classes of musical signals. When polyphonic music is modeled by factorizing the magnitude spectrogram with NMF, each instrument is modeled by an instantaneous frequency signature which can vary over time. However, the NMF requires multiple basis functions to represent tones with different pitch values. The two-dimensional time–frequency deconvolution model implicitly solves the problem of grouping notes. Thus, all notes for an instrument is an identical pitch shifted time–frequency signature, Model (4) will give better estimates of these signatures, because more examples of different notes are used to compute each time–frequency signature. In the event when this assumption does not hold, it might still hold in a region of notes for an instrument. Furthermore, the two-dimensional time–frequency deconvolution model can explain the spectral differences between two notes of different pitch by the two-dimensional deconvolution of the time–frequency signature.
The novelty of this paper can be summarized as follows: Firstly, a new algorithm is developed for sparse nonnegative matrix factor time–frequency deconvolution optimized with fractional β-divergence. Secondly, the maximization–minimization algorithm is developed to derive the auxiliary cost function which caters for any β value. The paper shows that the optimal β that leads to the desired performance is not necessarily limited to the special cases of integer β but extends to fractional values. Thirdly, it is analytically shown that the convergence of the proposed algorithm is guaranteed under the auxiliary function. Fourthly, a method is proposed to estimate the fractional β within the context of monoaural source separation. Finally, the paper proposes an adaptive method to estimate the sparsity parameter for each of the individual temporal code.
The remainder of the paper is organized as follows: In Section 2, the new algorithm for matrix factor time–frequency deconvolution model with β-divergence based on the maximization–minimization algorithmic framework is derived. Real application of blind source separation using the proposed method and comparisons with other matrix factorization methods are presented in Section 3. Finally, Section 4 concludes the paper.

2. Background

2.1. β -Divergence Cost Function

The NMF problem can be written as the minimization of an objective function:
D ( V | W H ) = i , j d β ( V i , j | Λ i , j )
The general β-divergence [24,31] is defined as:
d β ( y | x ) = { y β β ( β 1 ) + x β β y x β 1 β 1 , β R / { 0 , 1 } y ( log y log x ) + ( x y ) , β = 1 y x log y x 1 β = 0
when β = 2 , this matches with the first β-divergence and the update algorithm is referred to as the “Least Square” [17]. When we use the second β-divergence with β = 1 , the update algorithm is referred to as the “Kullback–Leibler” [17]. When the third β-divergence with β = 0 is used, the update algorithm is referred to as the “Itakura–Saito” [33]. These algorithms have their own advantages and disadvantages. If the sources have large dynamic difference in the power, the Itakura–Saito divergence would have better performance than other NMF algorithms. The Least Square and Kullback–Leibler NMFs are more suited when the power of sources are close to other. However, it is difficult to define the difference of power between the sources, and therefore it is difficult to choose the algorithms. In this paper, we present the results to show that the best results are not necessarily limited to the above integer β special cases. The use the fractional β-divergence is expected to yield more realistic and optimized results than the previous NMF algorithms. For completeness of presentation, the following section briefly reviews the update function based on the Least Square and Kullback–Leibler criterion.

2.1.1. Least Square Distance

The Least Square NMF algorithm introduced by Lee and Seung [17] defines the β-divergence as Least Squares divergence when β = 2 . First, we consider the least square cost function:
C L S = 1 2 | | V Λ | | F 2 = i , j d 2 ( V i , j | Λ i , j ) = 1 2 i , j ( V i , j Λ i , j ) 2
Differentiating C L S with respect to W i , k τ and H k , j ϕ , and plugging the multiplicative update algorithm for θ = { ( W i , k τ ) I × K × τ m a x , ( H k , j ϕ ) K × J × ϕ m a x } :
θ θ · ( [ d β ( y | x ) ] [ d β ( y | x ) ] + )
where d β ( y | x ) / θ = [ d β ( y | x ) ] + [ d β ( y | x ) ] , which leads to the following W τ and H ϕ updates:
W τ = W τ · ϕ V ϕ τ   · H ϕ τ   T ϕ Λ ϕ τ   · H ϕ τ   T
H ϕ = H ϕ · τ W τ ϕ   T · V ϕ τ   τ W τ ϕ   T · Λ ϕ τ  
where “ A · B ” represents element-wise multiplication.

2.1.2. Kullback–Liebler Divergence

When β = 1 , the β-divergence is identical to the Kullback–Leibler divergence. The Kullback–Leibler divergence is expressed as:
C K L = i , j d 1 ( V i , j | Λ i , j ) = i , j V i , j log V i , j Λ i , j V i , j + Λ i , j
By following similar steps as the Least Square, we can derive the update function as follow:
W τ = W τ · ϕ ( V ϕ τ   Λ ) · H ϕ τ   T ϕ 1 · ϕ T
H ϕ = H ϕ · τ W τ ϕ   T · ( V ϕ τ   Λ ) τ W τ ϕ   T · 1
where “ A B   ” represents element-wise division and “1” is a column vector of unit elements.

2.2. Auxiliary Cost Function of Fractional β-Divergence for Matrix Factors Time–Frequency Deconvolution

In this subsection, we introduce the cost function for the fractional β -divergence matrix factors time–frequency deconvolution model. The algorithm allows the user to choose a fractional β value instead of using the previous NMF algorithms which constrain β to special cases of integer value. After the derivation, this paper shows the steps on how the update function of the fractional β-divergence is obtained for the parameters. Firstly, the first derivative of d β ( y | x ) are given by
d β ( y | x ) = y β 2 ( y x )
This shows that y is continuous in β and thus the second derivative of d β ( y | x ) is given by
d β ( y | x ) = y β 3 [ ( β 1 ) y ( β 2 ) x ]
The second derivative shows that the β-divergence is convex for y in β [ 1 , 2 ] . Outside of this range, d β ( y | x ) can be expressed as:
d β ( y | x ) = d ˇ ( y | x ) + d ^ ( y | x ) + d ¯ ( x )
where d ˇ ( y | x ) is a convex function of y , d ^ ( y | x ) is a concave function of y, and d ¯ ( x ) is a constant of y. Table 1 shows the various functions for d ˇ ( y | x ) , d ^ ( y | x ) and d ¯ ( x ) . The problem we want to tackle is to minimize the following function with respect to θ = { ( W i , k τ ) I × K × τ m a x , ( H k , j ϕ ) K × J × ϕ m a x } where β can assume fractional number:
G ( θ ) = i , j d β ( V i , j | k = 1 K τ = 0 τ m a x ϕ = 0 ϕ m a x W i ϕ , k τ H k , j τ ϕ ) = 1 β ( β 1 ) i , j V i , j β + i , j 1 β ( k , τ ϕ W i ϕ , k τ H k , j τ ϕ ) β G 1 ( β ) + i , j V i , j ( 1 β 1 k , τ ϕ W i ϕ , k τ H k , j τ ϕ ) β 1 G 2 ( β )
In Equation (16), G 1 ( β ) is convex for β 1 and concave for β < 1 , and G 2 ( β ) is convex for β 2 and concave for β > 2 . Thus, there is a need to alleviate this problem by decomposing the above function into several terms to be either convex or concave depending on the value of β and use the appropriate inequalities to build an auxiliary function.
Lemma 1.
For the case of β 1 , we have
1 β ( k , τ ϕ W i ϕ , k τ H k , j τ ϕ ) β 1 β k , τ ϕ ω i , j , k , τ , ϕ ( W i ϕ , k τ H k , j τ ϕ ω i , j , k , τ , ϕ ) β = P i , j ( β )
where ω i , j , k , τ , ϕ 0 for all k , τ , ϕ and k , τ ϕ ω i , j , k , τ , ϕ = 1 . The equality holds when
ω i , j , k , τ , ϕ = W i ϕ , k τ H k , j τ ϕ k , τ , ϕ W i ϕ , k τ H k , j τ ϕ
Proof. 
Let f : R R be a convex function. If α k ( k = 1 ,   2 , , K ) satisfies k , α k > 0 and k α k = 1 , then for any x k ( k = 1 ,   2 , , K ) R ,
f ( k x k ) k α k f ( x k α k )
and with equality holds if and only if α k = x k / k x k . Substituting f ( · ) = 1 β ( · ) β with β 1 , x k = W i ϕ , k τ H k , j τ ϕ and α k = ω i , j , k , τ , ϕ yields Equation (16). ☐
Lemma 2.
For the case of β < 1 , we have
1 β ( k , τ ϕ W i ϕ , k τ H k , j τ ϕ ) β Λ i , j β 1 ( k , τ ϕ W i ϕ , k τ H k , j τ ϕ Λ i , j ) + Λ i , j β β = Q i , j ( β )
The equality holds when
Λ i , j = k , τ ϕ W i ϕ , k τ H k , j τ ϕ
Proof. 
Let f : R R be a continuously differentiable and concave function. Then, for any point z ,
f ( x ) f ( x ) ( x z ) + f ( z )
and with equality holds if and only if x = z . Substituting f ( · ) = 1 β ( · ) β with β < 1 , x = k , τ ϕ W i ϕ , k τ H k , j τ ϕ and z = Λ i , j yields Equation (17). ☐
Using Lemmas 1 and 2, we may proceed with the following analysis. When β < 1 , we use Q i , j ( β ) instead of G 1 ( β ) and V i , j P i , j ( β 1 ) instead of G 2 ( β ) , then the cost function becomes a convex function. Let us denote G + ( θ | θ ^ ) as an auxiliary function for and θ ^ = { ( ω i , j , k , τ , ϕ ) I × J × K × τ m a x × ϕ m a x , ( Λ i , j ) I × J } as the auxiliary variables. For G + ( θ | θ ^ ) to qualify as auxiliary function, it must satisfy G ( θ ) = min θ ^ G + ( θ | θ ^ ) . Thus, the cost function can be shown to be bounded by the auxiliary function G + ( θ | θ ^ ) :
G ( θ ) G + ( θ | θ ^ ) = i , j V i , j β β ( β 1 ) + Q i , j ( β ) V i , j P i , j ( β 1 )
when 1 β 2 , we use P i , j ( β ) instead of G 1 ( β ) and V i , j P i , j ( β 1 ) instead of G 2 ( β ) , then the cost function becomes a convex function and is bounded by the auxiliary function G + ( θ | θ ^ ) :
G ( θ ) G + ( θ | θ ^ ) = i , j V i , j β β ( β 1 ) + P i , j ( β ) V i , j P i , j ( β 1 )
Finally, when β > 2 , we use P i , j ( β ) instead of G 1 ( β ) and V i , j Q i , j ( β 1 ) instead of G 2 ( β ) , then the cost function is bounded by
G ( θ ) G + ( θ | θ ^ ) = i , j V i , j β β ( β 1 ) + P i , j ( β ) V i , j Q i , j ( β 1 )
From above, we can conclude that
G ( θ ) G + ( θ | θ ^ ) = i , j V i , j β β ( β 1 ) + { Q i , j ( β ) V i , j P i , j ( β 1 ) , ( β < 1 ) P i , j ( β ) V i , j P i , j ( β 1 )   , ( 1 β 2 ) P i , j ( β ) V i , j Q i , j ( β 1 )   , ( β > 2 )
The equality holds when θ ^ satisfies Equations (18) and (20). The above function yields three different sub-functions which depend on the β value. In different β range, we use different cost function in the algorithm. This allows the user to choose the optimal β value to separate the mixture and caters for more flexibility than the previous algorithms.

2.3. Auxiliary Update Function of “Fractional” β-Divergence

To minimize G + ( θ | θ ^ ) , we formulate the derivative of G + ( θ | θ ^ ) with respect to θ . First, we consider the derivative for W i , k τ :
G + ( θ | θ ^ ) W i , k τ = V w W w
where
V w = { j , ϕ Λ i , j β 1 H k , j τ ϕ , ( β < 1 ) ( W i , k τ ) β 1 j , ϕ ω i + ϕ , j , k , τ , ϕ 1 β ( H k , j τ ϕ ) β , ( β 1 )
W w = { ( W i , k τ ) β 2 j , ϕ V i + ϕ , j ω i + ϕ , j , k , τ , ϕ 2 β ( H k , j τ ϕ ) β 1 ( β 2 ) j , ϕ V i + ϕ , j Λ i + ϕ , j β 2 H k , j τ ϕ   , ( β > 2 )
The second derivative of G + ( θ | θ ^ ) with respect to W i , k τ in then expressed as:
2 G + ( θ | θ ^ ) W i , k τ W i , k τ = ( V w W w ) δ i , i δ k , k δ τ , τ
where δ i , j = 1 if i = j and δ i , j = 0 if i j , and
V w = { 0 , ( β < 1 ) ( β 1 ) ( W i ϕ , k τ ) β 2 j , ϕ ω i + ϕ , j , k , τ , ϕ 1 β ( H k , j τ ϕ ) β   ,     ( β 1 ) ( β 1 )
W w = { ( β 2 ) ( W i , k τ ) β 3 j , ϕ V i + ϕ , j ω i + ϕ , j , k , τ , ϕ 2 β ( H k , j τ ϕ ) β 1 , ( β 2 ) 0 , ( β > 2 )
We can see G ( θ | θ ^ ) is a convex function in W i , k τ , so by setting G ( θ | θ ^ ) W i , k to 0, we can then express the update function for W i , k τ as:
W i , k τ = { ( j , ϕ V i + ϕ , j   ω i + ϕ , j , k , τ , ϕ 2 β ( H k , j τ ϕ ) β 1 j , ϕ Λ i , j β 1 H k , j τ ϕ ) 1 2 β   , ( β < 1 ) j , ϕ V i + ϕ , j   ω i + ϕ , j , k , τ , ϕ 2 β ( H k , j τ ϕ ) β 1 j , ϕ ω i + ϕ , j , k , τ , ϕ 1 β ( H k , j τ ϕ ) β   , ( 1 β 2 ) ( j , ϕ V i + ϕ , j   Λ i + ϕ , j β 2   H k , j τ ϕ j , ϕ ω i + ϕ , j , k , τ , ϕ 1 β ( H k , j τ ϕ ) β ) 1 β 1   , ( β > 2 )
We next consider the auxiliary variables θ ^ . Since both Equations (17) and (18) minimize G + ( θ | θ ^ ) with respect to θ ^ , substituting these into Equation (30) gives the following update rule:
W i , k τ = W i , k τ ( j , ϕ V i + ϕ , j   Λ i + ϕ , j β 2   H k , j τ ϕ j , ϕ Λ i + ϕ , j β 1   H k , j τ ϕ ) δ ( β )
where
δ ( β ) = { 1 2 β , ( β < 1 )   1 , ( 1 β 2 ) 1 β 1 , ( β > 2 )
The above can be written in the matrix form:
W τ = W τ · [ ϕ ( V ϕ · Λ ( β 2 ) ϕ ) H ϕ τ   T ϕ Λ ( β 1 ) ϕ   H ϕ τ T ] δ ( β )
Similarly, for H k , j ϕ update function, first we have
G + ( θ | θ ^ ) H k , j ϕ = V H W H
where
V H = { i , τ Λ i , j + τ β 1   W i ϕ , k τ   , ( β < 1 ) ( H k , j ϕ ) β 1 i , τ ω i , j + τ , k , τ , ϕ 1 β ( W i ϕ , k τ ) β   , ( β 1 )
W H = { ( H k , j ϕ ) β 2 i , τ V i , j + τ   ω i , j + τ , k , τ , ϕ 2 β ( W i ϕ , k τ ) β 1   , ( β 2 ) i , τ V i , j + τ Λ i , j + τ β 2 W i ϕ , k τ   , ( β > 2 )
From Equations (17)–(31), we can minimize the cost-function by setting G s ( θ | θ ^ ) H k , j φ = 0 and obtain H k , j ϕ as:
H k , j ϕ = { ( i , τ V i , j + τ   ω i , j + τ , k , τ , ϕ 2 β ( W i ϕ , k τ ) β 1 i , τ Λ i , j + τ β 1   W i ϕ , k τ ) 1 2 β ( β < 1 ) i , τ V i , j + τ   ω i , j + τ , k , τ , ϕ 2 β ( W i ϕ , k τ ) β 1 i , τ ω i , j + τ , k , τ , ϕ 1 β ( W i ϕ , k τ ) β ( 1 β 2 ) ( i , τ V i , j + τ Λ i , j + τ β 2 W i ϕ , k τ i , τ ω i , j + τ , k , τ , ϕ 1 β ( W i ϕ , k τ ) β ) 1 β 1 , ( β > 2 )
Again, since both Equations (19) and (20) minimizes G + ( θ | θ ^ ) with respect to θ ^ , substituting these into Equation (38) gives the following update rule for H k , j ϕ :
H k , j ϕ = H k , j ϕ ( i , τ V i , j + τ   Λ i , j + τ β 2   W i ϕ , k τ i , τ Λ i , j + τ β 1   W i ϕ , k τ ) δ ( β )
In matrix form, the above can be written as
H ϕ = H ϕ · [ τ W τ ϕ   T ( V τ   · Λ ( β 2 ) τ   ) τ W τ ϕ   T Λ ( β 1 ) τ   ] δ ( β )

2.4. Sparsity-Aware Optimization

The cost-function in Equation (21) can be augmented with a regularization term to render sparsity to the solution. We can define a prior on H as an exponential distribution with independent decay parameters, namely,
p ( H | λ ) = ϕ p ( H ϕ | λ ϕ ) = ϕ k j p ( H k , j ϕ | λ k , j ϕ )
where p ( H k , j ϕ | λ k , j ϕ ) = ϕ k j λ k , j ϕ exp ( λ k , j ϕ H k , j ϕ ) . The negative log prior on H is defined as log p ( H | λ ) = f ( H ) = ϕ , k , j ( λ k , j ϕ H k , j ϕ log λ k , j ϕ ) . It is worth pointing out that each individual element in H is constrained to an exponential distribution with independent decay parameter λ k , j ϕ so that each element in H can be driven to be optimally sparse in the L 1 -norm. Other forms of sparseness exist19 but the proposed L 1 -norm is computationally favourable. First, we define G s ( θ ) and G s ( θ | θ ^ ) as follow:
G s ( θ ) G ( θ ) + α f ( H ) G + ( θ | θ ^ ) + α f ( H ) = G s ( θ | θ ^ )
where α is the regularization constant. To avoid the scaling misbehavior when incorporating the sparseness for H , we reformulate the cost function to work with normalized matrix for W τ i.e.,
W ¯ i , k τ = W i , k τ τ , i ( W i , k τ ) 2 = W i , k τ | | W k | | 2
and
Λ ¯ i , j = k , τ , ϕ W ¯ i ϕ , k τ   H k , j τ ϕ
Thus, the cost function takes the following form:
G s ( θ ) G s ( θ | θ ^ ) = i , j V i , j β β ( β 1 ) + α ϕ , k , j ( λ k , j ϕ H k , j ϕ log λ k , j ϕ ) + { Q ¯ i , j ( β ) V i , j P ¯ i , j ( β 1 )   , ( β < 1 ) P ¯ i , j ( β ) V i , j P ¯ i , j ( β 1 )   , ( 1 β 2 ) P ¯ i , j ( β ) V i , j Q ¯ i , j ( β 1 )   , ( β > 2 )
where
Q ¯ i , j ( β ) = Λ ¯ i , j β 1 ( k , τ ϕ W ¯ i ϕ , k τ H k , j τ ϕ Λ ¯ i , j ) + Λ ¯ i , j β 1 β
and
P ¯ i , j ( β ) = 1 β k , τ , ϕ ω i , j , k , τ , ϕ ( W ¯ i ϕ , k τ H k , j τ ϕ ω i , j , k , τ , ϕ ) β
To obtain λ k , j ϕ , we minimize the cost function with respect λ k , j ϕ and set it to zero which results:
λ k , j ϕ = 1 H k , j ϕ
provided that H k , j ϕ 0 . However, it has been observed in many cases that optimizing the factor matrices with β-divergence and the sparseness in Equation (46) increases the likelihood for some H k , j ϕ to converge very close to zero, thus leading to numerical divergence when dividing by zero. Other practices introduced a small constant to H k , j ϕ to prevent direct division by zero. Unfortunately, such approach is identical to constant sparsity and no longer preserves the L 1 -norm optimal solution. In this paper, we adopt the maximum likelihood approach to formulating the adaptive estimation of sparsity parameter λ k , j ϕ . Considering the following maximum likelihood criterion [31,36]:
λ M L = arg   max λ ln p ( v | λ , W ˇ )
where ln p ( v | λ , W ˇ ) is the log-likelihood conditional probability of the observations given W ˇ and λ . By using the Jensen’s inequality, for any distribution Q ( h ) , the log-likelihood function satisfies the following:
λ M L = arg   max λ Q ( h ) ln p ( v , h | λ , W ˇ ) d h = arg   max λ Q ( h ) ( ln p ( v | h , W ˇ ) + ln p ( h | λ ) ) d h = arg   max λ Q ( h ) ln p ( h | λ ) d h
Since each element of H is constrained to be exponential distributed with independent decay parameters, Equation (48) becomes:
λ g M L = arg   max λ Q ( h ) ( ln λ g λ g h g ) d h
Thus, we have
λ g M L = 1 h g Q ( h ) d h  
One can easily check that the distribution that maximizes the maximum likelihood is given by Q ( h ) = p ( h | v , λ ,   W ˇ ) = p ( v | h , λ ,   W ˇ ) p ( h | λ ) / p ( v | λ ,   W ˇ ) which is the posterior distribution of h and p ( v | h , λ ,   W ˇ ) is the log-likelihood function of the observation which is usually expressed by a Gaussian density function with mean centered at k , τ , ϕ W ¯ i ϕ , k τ   H k , j τ ϕ . However, as H k , j ϕ is directly acquired from the original code matrix H k , j 0 , we can simply work with τ m a x = 0 . This allows us to express the log-likelihood function of the posterior distribution of h up to a constant as
ln p ( h | v , λ ,   W ˇ ) = ˙ ln p ( v | h , λ ,   W ˇ ) + ln p ( h | λ ) = ˙ 1 2 | | v e c ( V ) ϕ ( I W ¯ ϕ   ) v e c ( H ϕ ) | | F 2 + α ϕ { ( λ ϕ ) T v e c ( H ϕ ) ( log λ ϕ ) T 1 } = F ( H , λ )
where “ = ˙ ” denotes equality up to a constant, “ “ is the Kronecker product, 1 is vector contains unit elements, I is the identity matrix, α assumes the role of a regularization constant to balance the cost function fit and smoothness of H . For ease of presentation, we simplify the above terms as v = v e c ( V ) , W ˇ = [ I W ¯ 0   I W ¯ ϕ m a x   ] , h = { h g } = [ v e c ( H 0 ) T v e c ( H ϕ m a x ) T ] T , λ = { λ g } = [ λ 0 T λ ϕ m a x T ] T which enables us to rewrite Equation (46) as
F ( H , λ ) = 1 2 | | v W ˇ h | | F 2 + α ( λ T h ( log λ ) T 1 )
For ease of analysis, Q ( h ) is represented using Gibbs distribution as Q ( h ) = 1 Z exp ( F ( h ) ) where Z = exp ( F ( h ) ) d h . Let P represents the index set of inactive code i.e., P = { ϕ , k , j | H k , j ϕ = 0 } and M the index set of active code i.e., M = { ϕ , k , j | H k , j ϕ 0 } . Thus, Q ( h ) can be factorized as
Q ( h ) = 1 Z exp ( F ( h , λ ) ) 1 Z P exp ( F ( h P , λ P ) ) 1 Z M exp ( F ( h M , λ M ) ) = Q P ( h P ) Q M ( h M )
Since h M corresponds to the original non-zero value of h , it then follows that Q M ( h M ) is not of interest to us. We are only interested in h P and therefore, to characterize Q P ( h P ) , we need to allow some positive deviation to h P . A suitable distribution is to use the factorized exponential distribution given by
Q ^ P ( h P 0 ) = p P 1 u p exp ( h p u p )
as the approximate distribution. The variational parameters u = { u p } are determined by minimizing the Kullback–Leibler divergence between true Q P and approximate Q ^ P :
u = arg   min u Q ^ P ( h P ) ln Q ^ P ( h P ) Q P ( h P ) d h P = arg   min u Q ^ P ( h P ) { ln Q ^ P ( h P ) Q P ( h P ) } d h P
which leads to the following optimization:
min u p b P T u + 1 2   u T C u p P ln u p
where b P = ( C h W ˇ T v + λ ) P and C = C P + d i a g ( C P ) with C = W ˇ T W ˇ , C P = W ˇ P T W ˇ P . Solving Equation (56) for u p leads to the following update:
u p u p b p + b p 2 + 4 ( C u ) p u p 2 ( C u ) p
Once u p is obtained and re-arranged to the original form u k , j ϕ , the final update for λ k , j ϕ takes the form of:
λ k , j ϕ = 1 H k , j ϕ + δ k , j ϕ
where
δ k , j ϕ = {   0   if   H k , j ϕ 0   u k , j ϕ   if   H k , j ϕ = 0
Equipped with above, we obtain the multiplicative update for the normalized W as
W τ = W ¯ τ · [ ϕ ( V ϕ   · Λ ¯ ( β 2 ) ϕ   ) H ϕ τ   T + W ¯ τ d i a g ( τ 1 ( ( Λ ¯ ( β 1 ) ϕ   H ϕ τ   T ) · W ¯ τ ) ) ϕ Λ ¯ ( β 1 ) ϕ     H ϕ τ   T + W ¯ τ d i a g ( τ 1 ( ( ( V ϕ   · Λ ¯ ( β 2 ) ϕ   )   H ϕ τ   T ) · W ¯ τ ) ) ] δ ( β )
for τ = 0 ,   1 ,   ,   τ m a x . By using the same approach, we can obtain the update for the sparse H as follows:
H ϕ = H ϕ · [ τ W ¯ τ ϕ   T ( V τ   · Λ ¯ ( β 2 ) τ   ) τ W ¯ τ ϕ   T Λ ¯ ( β 1 ) τ   + α λ ϕ ] δ ( β )
for ϕ = 0 ,   1 ,   ,   ϕ m a x . In Equation (61), α assumes the role of a regularization constant to balance the cost function fit and smoothness of H . In this work, we set α [ 0.5 ,   1 ] which has been found to give satisfactory results.

2.5. Optimizing the Fractional β

To determine the optimal value for β, we perform the investigation from the source separation viewpoint. Mathematically, the single-channel signal separation (SCSS) [37,38,39] problem can be treated as one mixture of N unknown source signals:
y ( t ) = x 1 ( t ) + x 2 ( t ) + + x N ( t )
where t = 1 ,   2 ,   ,   T denotes time index and the goal is to estimate the sources x k ( t ) ,   k N of length T when only the observation signal y ( t ) is available. For simplicity, we consider only N = 2 sources in the mixture. We also use 50 different pieces of piano music, 50 different pieces of trumpet music and 50 different pieces of violin music from the RWC [40] database to generate different mixtures. The signal-to-distortion (SDR) [41] is used to measure the performance. The SDR results and its corresponding β value that produces the best performance in the separation of mixtures are shown in Table 2. From these experiments, we can propose some general ideas of how to choose a suitable β value: (i) The mixtures from same type of music share similar β value, e.g., the best results of piano and trumpet mixture occur around β = 2 but the best results of piano and violin mixture occur around β = 1 . (ii) If the power of one source is clearly weaker than the other source, then a smaller β value should be selected. (iii) When there is a large amount of overlap between the two sources in the in the time–frequency domain, a larger β value should be selected.
Table 2 strongly suggests that the β value depends on the mixture of original sources. Generally, it depends mainly on the two factors: (i) the weight of each source in the mixture; and (ii) the frequency spread of each source which the frequency band contains most weight of the signal. Firstly, we define weight of each source in the mixture by the following function:
γ k = 1 | x k ( t ) y ( t ) | l = 1 N | x l ( t ) y ( t ) | 2
for k = 1 , , N . The term γ k is nonnegative and bounded to unity. It measures the dominance of k -th source in the mixture. The higher value of γ k , the greater the contribution from the k -th source is to the mixture. Secondly, we consider the separability of each source in the time–frequency domain by the following function:
η k = | | M k ( i , j ) X k ( i , j ) | | F 2 | | M k ( i , j ) l = 1 , l k N X l ( i , j ) | | F 2 | | X k ( i , j ) | | F 2
where | | · | | F is the Frobenius norm, X k ( i , j ) is the short-time Fourier Transform (STFT) of x k ( t ) with i representing the frequency bins and j the timeslot, and M k ( i , j ) is the binary mask Obtained from the k -th source as
M k ( i , j ) = { 1 if | X k ( i , j ) | 2 > | X l ( i , j ) | 2 0 otherwise
The function η k is also nonnegative and determines the degree separability of the signal in each frequency band. Based on the experiments conducted, both γ k and η k have an inverse relation to β . Thus, one possible empirical approach to determine β is proposed as follows:
β ( n + 1 ) = ρ ( n ) β ( n ) + ( 1 ρ ( n ) ) min [ ( k = 1 N ε 1 · η k + ( 1 ε 1 ) · γ k γ k η k ) ,   ε 2 ]
where ρ ( n ) is step size, ε 1 is a constant to weight the effects of η k and γ k . For example, in the experiments conducted, we have given more emphasis to γ k and set ε 1 = 1 / 3 . The term ε 2 is a constant to control the value of β ( n + 1 ) to ensure its value is bounded within an interval chosen by the user (for example, in the experiments conducted we have set ε 2 = 4 as normally does not exceed 4). Equation (66) is inserted into the update funtions in Equations (60) and (61) to update β at every iteration in conjunction with the update of W and H . In this case, β can be optimized based on the type of sources and the separation process. This enables the separation process to be fully automated and enables more accurate performance. In the case of SCSS, the sources are unknown and these are estimated from the mixture as:
X ^ k ( i , j ) = M k ( i , j ) Y ( i , j )
where
M k ( i , j ) = { 1   i f   | X ˇ k ( i , j ) | 2 > | X ˇ l ( i , j ) | 2 0   o t h e r w i s e  
and
| X ˇ k ( i , j ) | 2 = τ = 0 τ m a x ϕ = 0 ϕ m a x W ¯ i ϕ , k τ H k , j τ ϕ
The expression in (69) is computed using the time–frequency deconvolution model. The main steps of the proposed algorithm have been summarized in Algorithm 1.
Algorithm 1. Overview Proposed Algorithm
1. 
Initialize W τ and H ϕ with non-negative random values.
2. 
Compute the STFT:
    Y ( i , j ) = STFT ( y ( t ) ) , and let V i , j = | Y ( i , j ) | 2 .
3. 
Compute Λ ¯ i , j = k , τ , ϕ W ¯ i ϕ , k τ   H k , j τ ϕ .
4. 
Compute u p u p { ( b p + b p 2 + 4 ( C u ) p u p ) / 2 ( C u ) p }
5. 
Assign λ k , j ϕ = 1 H k , j ϕ + δ k , j ϕ where δ k , j ϕ = {   0   if   H k , j ϕ 0   u k , j ϕ   if   H k , j ϕ = 0
6. 
Update H ϕ = H ϕ · [ τ W ¯ τ ϕ   T ( V τ   · Λ ¯ ( β 2 ) τ   ) τ W ¯ τ ϕ   T Λ ¯ ( β 1 ) τ   + α λ ϕ ] δ ( β )
7. 
Compute Λ ¯ i , j = k , τ , ϕ W ¯ i ϕ , k τ   H k , j τ ϕ .
7. 
Update the spectral bases:
W τ = W ¯ τ · [ ϕ ( V ϕ · Λ ¯ ( β 2 ) ϕ ) H ϕ τ T + W ¯ τ d i a g ( τ 1 ( ( Λ ¯ ( β 1 ) ϕ H ϕ τ T ) · W ¯ τ ) ) ϕ Λ ¯ ( β 1 ) ϕ H ϕ τ T + W ¯ τ d i a g ( τ 1 ( ( ( V ϕ · Λ ¯ ( β 2 ) ϕ ) H ϕ τ T ) · W ¯ τ ) ) ]
9. 
For k = 1 , , N , compute:
| X ˇ k ( i , j ) | 2 = τ = 0 τ m a x ϕ = 0 ϕ m a x W ¯ i ϕ , k τ H k , j τ ϕ M k ( i , j ) = { 1   i f   | X ˇ k ( i , j ) | 2 > | X ˇ l ( i , j ) | 2 0   o t h e r w i s e   X ^ k ( i , j ) = M k ( i , j ) Y ( i , j ) x ^ k ( t ) = STFT 1 [ X ^ k ( i , j ) ] γ k = 1 | x ^ k ( t ) y ( t ) | l = 1 N | x ^ l ( t ) y ( t ) | 2 η k = M k ( f , t ) X ^ k ( f , t ) F 2 M k ( f , t ) l = 1 , l k N X ^ l ( f , t ) F 2 X ^ k ( f , t ) F 2 β ρ β + ( 1 ρ ) min [ ( k = 1 N ε 1 η k + ( 1 ε 1 ) γ k γ k η k ) ,   ε 2 ]
10. 
Repeat Steps 3–9 until it converges or reaches the pre-defined number of iteration.

3. Experiments, Results and Analysis

In this section, we conduct in-depth investigations of the proposed algorithm to analyze the impact of fixed and adaptive sparsity, the adaptive behavior of the sparsity parameter, and the analysis of fractional β-divergence. The analysis is necessary as the issue of sparsity of the temporal codes would undermine the quality of matrix factorization when the β value is inappropriately chosen. The selection of the β value should consider the sparseness constraint used in the cost function. In addition, the proposed algorithm based on matrix factor time–frequency deconvolution is also compared to conventional NMF models. This allows us to quantify the impacts of fractional β-divergence and sparsity behaviors when using the time–frequency deconvolution model.

3.1. Experimental Set-Up

To investigate the proposed method, we use the algorithm to separate several pieces of mixed music signals. Several experimental simulations under different conditions have been designed to investigate the efficacy of the proposed method. All simulations were performed using MATLAB as the programming platform and performed using a PC with dual core processor @ 2.4 GHz (i7 Intel processor) 8 GB RAM and 320 GB HDD. The tested signals are generated by mixing several music sources. The polyphonic music is 4 s long and the sampling frequency is 16 kHz. In this experiment, we randomly chose 50 different pieces of piano music, 50 different pieces of trumpet music and 50 different pieces of violin music from the RWC database to produce the different mixtures. The mixed signal was then generated by adding the chosen sources. In all cases, the sources were mixed with equal average power over the duration of the signals. The time–frequency (TF) representation was obtained by first normalizing the time-domain signal to unit power and then by computing the STFT using 2048 point Hanning window FFT with a 50% overlap. We evaluated our separation performance in terms of the signal-to-distortion ratio (SDR) which is one form of perceptual measure. This is a global measure that unifies source-to-interference ratio (SIR), source-to-artifacts ratio (SAR) and source-to-noise ratio (SNR). The definition and mathematical expression and MATLAB routines for computing these criteria can be obtained online [42].

3.2. Analysis of Adaptive and Fixed Sparsity

In this implementation, we conducted several experiments to compare the performance of the proposed method using different β values. Our aim was to investigate the impact of β value used in the separation. Figure 1 shows the time and TF domains of the original trumpet, piano music and its mixture. The TF domain is displayed using the log-frequency spectrogram. The trumpet and the piano play a different short melodic passage each consisting of three distinct notes. However, both trumpet and piano overlap in time, and the piano notes are interspersed in frequency with the trumpet notes. Hence, this is a challenging task for single channel separation which tests the impact of flexible β for matrix factorization.
Figure 2 shows the estimation of bases W τ and temporal codes H ϕ when using different λ values. In Figure 2a, we set λ = 0 which renders non-sparse solution. There is obvious spreading of the estimated temporal codes, as shown in the red part of the figure. In Figure 2b, when λ = 0.1 , there are some improvements over the spreading but they still exist in the red parts. Here, the sparseness is not strong enough and, as a result, the estimated mixture becomes under-sparse. In Figure 2c, when λ = 100 , it is visibly shown in the blue parts that some information has been lost in the estimated temporal codes and the resulting estimated mixture becomes very noisy. Finally, Figure 2d shows the case where the sparseness parameters are adaptively and individually estimated using the prior information of H . The obtained result has shown that the estimated temporal codes are just appropriately sparse and, by visual inspection, the resulting estimated mixture retains all information, as evidenced by the musical notes, while the noise level has been kept small, which very closely resembles the original mixture.

3.3. Adaptive Behavior of Sparsity Parameter

In this section, we use the piano and trumpet mixture, and fix β value to 1 and the λ value ranges from 0 to 3 with each step of 0.1 to show the results. The adaptive behavior of the sparsity parameters using the proposed method is demonstrated. Figure 3 presents the convergence trajectory of four adaptive sparsity parameters, λ 1 , 1 ϕ = 0 , λ 1 , 5 ϕ = 0 , λ 1 , 10 ϕ = 0 and λ 1 , 15 ϕ = 0 , corresponding to their respective element codes, h 1 , 1 ϕ = 0 , h 1 , 5 ϕ = 0 , h 1 , 10 ϕ = 0 and h 1 , 15 ϕ = 0 . All sparsity parameters are initialized as λ = 1 . After 150 iterations, the above sparsity parameters converge to their steady-states. By examining Figure 3, it is noted that the converged steady-state values are significantly different for each sparsity parameter, e.g., λ 1 , 1 ϕ = 0 = 0.9 , λ 1 , 5 ϕ = 0 = 0.18 , λ 1 , 10 ϕ = 0 = 0.29 and λ 1 , 15 ϕ = 0 = 0.08 , even though they started at the same initial condition. This shows that each element code has its own sparseness.
In Figure 4, we compare the SDR results of using different λ values. In the figure, we can see the λ value that can get the best result changes with the mixture. For each different mixture, the best λ values are different. In Figure 4, we can see the best separation results of piano and trumpet mixture occurs near λ = 1 , and the SDR = 14.7 dB. However, as λ increases, the SDR performance begins to deteriorate rapidly due to over-sparseness of the temporal code H ϕ .
Although the SDR performance in Figure 4 seems to suggest good performance, it may not necessary refer to the optimum setting. In fact, when λ is fixed to a constant value, the matrix factors deconvolution process may still be subjected to under- and over-fitting. In the previous sparsity algorithm, λ is fixed for the whole process and this sparsity may not necessarily suited for the whole signal. This calls for the need to allow each temporal code to have its own sparsity parameter. In the adaptive sparsity algorithm in Equation (53), the sparsity parameter λ is updated alongside W τ and H ϕ in the process. Therefore, the sparsity parameter is optimized for each element of the temporal code. In addition, we plot the histogram of the converged adaptive sparsity parameters in Figure 5. The plot strongly suggests that the histogram can be represented as a bimodal distribution. We have used the Gaussian mixture model (GMM) [43] to learn the distribution of this histogram and the result produces two Gaussian distributions with mean 0.16 and 1.1. The global mean of the GMM is given by 0.92.
With the GMM analysis, we can proceed to further investigate the assignment of sparsity parameters and compare them with the adaptive approach. We considered the following sparseness assignments:
Case (1):
No sparseness λ = 0 .
Case (2):
Uniform and constant sparseness λ = 0.16 corresponding to the mean of the first Gaussian distribution of the GMM.
Case (3):
Uniform and constant sparseness λ = 1.1 corresponding to the mean of the second Gaussian distribution of the GMM.
Case (4):
Uniform and constant sparseness λ = 0.92 corresponding to the global mean of the converged adaptive sparsity.
Case (5):
Uniform and constant sparseness λ = 1 corresponding to the global mean of the converged adaptive sparsity.
Case (6):
Maximum likelihood adaptive sparseness, i.e., Equation (55).
The SDR results are tabulated in Table 3 where we can see the separation results of all the six cases. The obtained results readily informed that the source separation with adaptive sparsity has rendered the best separation result.

3.4. Analysis of Fractional β-Divergence

Figure 6 shows the SDR values of the separation results using different values of β values. In this implementation, we fixed the sparsity parameter λ to 0, and the value of β ranges from 0 to 4, and each step is 0.1. In the figure, we can observe that the SDR value changes as the β value is increased. The best result is obtained at when β = 2.5 , where SDR = 14.26 dB. The SDR value keeps increasing for values of β value within the range of 0 β < 2.5 , and, after the best performance is attained, the performance deteriorates as β increases. In this figure, we can see that the best performance does not necessarily occur at value of β used other algorithms, i.e., β = 2 ,   1 ,   0 . This means, if we choose the best β to carry out the separation, we can obtain better results than the other algorithms.
We also conducted several experiments to compare the performance of the proposed method with the non-sparse and normal sparse methods using different β value. To investigate the impact of β and λ value used in the separation, we used β that ranges from 0 to 4 (with every increment of 0.1), and λ was set to several configurations, i.e., non-sparse, best fixed sparsity that is obtained in Figure 4 and the proposed adaptive sparsity. The results are plotted in Figure 7, which shows the mutual influence between λ and λ . It is noted that the SDR performance of non-sparse algorithm is the lowest of which its best result occurs when β = 2.5 with SDR = 14.26 dB. The normal sparse algorithm using best λ value gives better performance than the non-sparse method with its best result occuring when β = 0.5 with SDR = 15.96 dB. On the other hand, the proposed algorithm with adaptive λ delivers the best performance where β occurs around 0.3 with SDR = 16.71 dB. It is also noted that the worst SDR performance given by β = 2.9 with adaptive λ is still higher than the highest SDR when β = 2.5 without sparsity optimization λ = 0 .
All the above experiments used the pre-determined β that are fixed for the whole process. In this section, we present the results of experiments where β is adaptively tuned alongside with the adaptation of W τ , Hϕ and λ ϕ . We applied this method to the source separation problem and compared the results with the situation where β = 1 and β = 2 corresponding to the Kullback–Leibler divergence and Least Square cost function, respectively. In the update of adaptive β , the step size ρ ( n ) = 0.95 n is selected which represents an exponential decay update process. The SDR results of the top five best performance are tabulated in Table 4. It is interesting to note from the table that the proposed algorithm with adaptive β delivers better performance by about 2.1 dB compared to that when β = 1 (Kullback–Leibler divergence) and 1.9 dB compared to that when β = 2 (Least Square distance). The obtained performance improvement is attributed to the fact that the joint optimization of β with W τ , H ϕ and λ ϕ has enabled the current estimate of β to better fit mixture of sources and thus rendered better source separation performance.

3.5. Comparison with Other Nonnegative Factorization Models

In this section, we compare the proposed algorithm with other signal separation algorithms, in both time-domain representation and analysis the SDR results of all algorithms. The signal chosen was the same piano and trumpet mixture music used in Section 3.4. We compared the Least Square NMF (NMF-LS) and Kullback–Leibler NMF (NMF-KLD) algorithms introduced in the earlier sections of this paper, NMF with temporal continuity and sparseness criteria (NMF-TCS) [23], and NMF with automatic relevance determination (NMF-ARD) [44]. The obtained results are summarized in Table 5.
When using the various NMF models, it is seen that the average improvement per source of about 3 dB has been gained by leveraging the fractional β value and adaptive sparsity. In addition, a step jump of approximately 5–8 dB in performance improvement is further obtained when the model is switched to the matrix factor time–frequency deconvolution. This is attributed to the latter model which represents both temporal structure and the pitch change which occur when an instrument plays different notes. The pitch change corresponds to a displacement on the frequency axis. Where NMF methods needed one component to model each note for each instrument, the time–frequency deconvolution model represents each instrument compactly by a single time–frequency profile convolved in both time and frequency by a time–pitch weight matrix [45]. The model dramatically decreases the number of components needed to model various instruments and effectively solves the blind single channel source separation problem for certain classes of musical signals.

4. Conclusions

This paper presents an adaptive fractional β-divergence with sparsity-aware optimization for non-negative factor time–frequency deconvolution algorithm. The impetus behind this work is that the previous β -divergence algorithms are all limited to special cases of β , and the previous sparsity methods are limited to a fixed sparsity parameter which are determined manually. Thus, these algorithms may not always produce the best results. In the proposed method, β is made adaptive and takes on fractional value. The sparsity parameter is also concurrently updated along with the estimation of β and model parameters. The convergence is theoretically proven for any β based on the auxiliary function method. This paper has shown that the proposed method is more general and can deliver better performance than other algorithms, as demonstrated using real audio recordings.

Author Contributions

W.L.W. and B.G. contributed to the mathematical development, implementation of the codes and writing of this article. A.B., B.W-K.L. and C.S.C. contributed to the simulations, technical analysis and comparison of the results.

Funding

This research was funded by National Natural Science Foundation of China (No. 61401071, No. 61527803), supported by NSAF (Grant No. U1430115) and EPSRC IAA Phase 2 funded project: “3D super-fast and portable eddy current pulsed thermography for railway inspection (EP/K503885/1).

Acknowledgments

The authors would like to thank K.Y. and S.S.D. for the helpful information on the conference edition of the work.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Mitianoudis, N.; Davies, M.E. Audio source separation: Solutions and problems. Int. J. Adapt. Control Signal Process. 2004, 18, 299–314. [Google Scholar] [CrossRef]
  2. Gao, P.; Woo, W.L.; Dlay, S.S. Nonlinear signal separation for multi-nonlinearity constrained mixing model. IEEE Trans. Neural Netw. 2006, 17, 796–802. [Google Scholar] [CrossRef] [PubMed]
  3. Alvarez, S.C.; Cichocki, A.; Ribas, L.C. An iterative inversion approach to blind source separation. IEEE Trans. Neural Netw. 2000, 11, 1423–1437. [Google Scholar]
  4. Gao, B.; Woo, W.L.; Dlay, S.S. Single channel blind source separation using EMD-subband ariable regularized sparse features. IEEE Trans. Audio Speech Lang. Process. 2011, 19, 961–976. [Google Scholar] [CrossRef]
  5. Zha, D.; Qiu, T. A new blind source separation method based on fractional lower-order statistics. Int. J. Adapt. Control Signal Process. 2006, 20, 213–223. [Google Scholar] [CrossRef]
  6. Ozerov, A.; Févotte, C. Multichannel nonnegative matrix factorization in convolutive mixtures for audio source separation. IEEE Trans. Audio Speech Lang. Process. 2010, 18, 550–563. [Google Scholar] [CrossRef]
  7. Zhang, J.; Woo, W.L.; Dlay, S.S. Blind source separation of post-nonlinear convolutive mixture. IEEE Trans. Audio Speech Lang. Process. 2007, 15, 2311–2330. [Google Scholar] [CrossRef]
  8. Moir, T.J.; Harris, J.I. Decorrelation of multiple non-stationary sources using a multivariable crosstalk-resistant adaptive noise canceller. Int. J. Adapt. Control Signal Process. 2013, 27, 349–367. [Google Scholar] [CrossRef]
  9. Djendi, M. A new two-microphone Gauss-Seidel pseudo affine projection algorithm for speech quality enhancement. Int. J. Adapt. Control Signal Process. 2017, 31, 1162–1183. [Google Scholar] [CrossRef]
  10. He, X.; He, F.; Zhu, T. Large-scale super-Gaussian sources separation using Fast-ICA with rational nonlinearities. Int. J. Adapt. Control Signal Process. 2017, 31, 379–397. [Google Scholar] [CrossRef]
  11. Kemiha, M.; Kacha, A. Complex blind source separation. Circuits Syst. Signal Process. 2017, 36, 1–18. [Google Scholar] [CrossRef]
  12. Moazzen, I.; Agathoklis, P. A multistage space–time equalizer for blind source separation. Circuits Syst. Signal Process. 2016, 35, 185–209. [Google Scholar] [CrossRef]
  13. Kumar, V.A.; Rao, C.V.R.; Dutta, A. Performance analysis of blind source separation using canonical correlation. Circuits Syst. Signal Process. 2018, 37, 658–673. [Google Scholar] [CrossRef]
  14. Zhang, C.; Wang, Y.; Jing, F. Underdetermined blind source separation of synchronous orthogonal frequency hopping signals based on single source points detection. Sensors 2017, 17, 2074. [Google Scholar] [CrossRef] [PubMed]
  15. Guo, Q.; Ruan, G.; Liao, Y. A time-frequency domain underdetermined blind source separation algorithm for mimo radar signals. Symmetry 2017, 9, 104. [Google Scholar] [CrossRef]
  16. Li, T.; Wang, S.; Zio, E.; Shi, J.; Hong, W. Aliasing signal separation of superimposed abrasive debris based on degenerate unmixing estimation technique. Sensors 2018, 18, 866. [Google Scholar] [CrossRef] [PubMed]
  17. Lee, D.; Seung, H. Learning the parts of objects by nonnegative matrix factorization. Nature 1999, 401, 788–791. [Google Scholar] [PubMed]
  18. Donoho, D.; Stodden, V. When Does Non-Negative Matrix Factorisation Give a Correct Decomposition into Parts; MIT Press: Cambridge, MA, USA, 2004. [Google Scholar]
  19. Bertin, N.; Badeau, R.; Vincent, E. Enforcing harmonicity and smoothness in Bayesian non-negative matrix factorization applied to polyphonic music transcription. IEEE Trans. Audio Speech Lang. Process. 2010, 18, 538–549. [Google Scholar] [CrossRef]
  20. Vincent, E.; Bertin, N.; Badeau, R. Adaptive harmonic spectral decomposition for multiple pitch estimation. IEEE Trans. Audio Speech Lang. Process. 2010, 18, 528–537. [Google Scholar] [CrossRef]
  21. Smaragdis, P. Non-negative matrix factor deconvolution; extraction of multiple sound sources from monophonic inputs. Int. Conf. Indep. Compon. Anal. Blind Signal Sep. 2004, 3195, 494–499. [Google Scholar]
  22. Schmidt, M.N.; Morup, M. Nonnegative matrix factor two-dimensional deconvolution for blind single channel source separation. Intl. Conf. Indep. Compon. Anal. Blind Signal Sep. 2006, 3889, 700–707. [Google Scholar]
  23. Virtanen, T. Monaural sound source separation by nonnegative matrix factorization with temporal continuity and sparseness criteria. IEEE Trans. Audio Speech Lang. Process. 2007, 15, 1066–1074. [Google Scholar] [CrossRef]
  24. Laroche, C.; Papadopoulos, H.; Kowalski, M.; Richard, G. Drum extraction in single channel audio signals using multi-layer non-negative matrix factor deconvolution. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, New Orleans, LA, USA, 5–9 March 2017. [Google Scholar]
  25. Zhi, R.C.; Flierl, M.; Ruan, Q.; Kleijn, W.B. Graph-preserving sparse nonnegative matrix factorization with application to facial expression recognition. IEEE Trans. Syst. Man Cybern. Part B 2011, 41, 38–52. [Google Scholar]
  26. Okun, O.; Priisalu, H. Unsupervised data reduction. Signal Process. 2007, 87, 2260–2267. [Google Scholar] [CrossRef]
  27. Kompass, R. A generalized divergence measure for nonnegative matrix factorization. Neural Comput. 2007, 19, 780–791. [Google Scholar] [CrossRef] [PubMed]
  28. Cichocki, A.; Zdunek, R.; Amari, S. Csiszar’s divergences for non-negative matrix factorization: Family of new algorithms. Int. Conf. Indep. Compon. Anal. Blind Signal Sep. 2006, 3889, 32–39. [Google Scholar]
  29. Gao, B.; Woo, W.L.; Ling, B.W.K. Machine learning source separation using maximum a posteriori nonnegative matrix factorization. IEEE Trans. Cybern. 2014, 44, 1169–1179. [Google Scholar] [PubMed]
  30. Wu, Z.; Ye, S.; Liu, J.; Sun, L.; Wei, Z. Sparse non-negative matrix factorization on GPUs for hyperspectral unmixing. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2014, 7, 3640–3649. [Google Scholar] [CrossRef]
  31. Gao, B.; Woo, W.L.; Dlay, S.S. Adaptive sparsity non-negative matrix factorization for single-channel source separation. IEEE J. Sel. Top. Signal Process. 2011, 5, 989–1001. [Google Scholar] [CrossRef]
  32. Cemgil, A.T. Bayesian inference for nonnegative matrix factorization models. Comput. Intell. Neurosci. 2009. [Google Scholar] [CrossRef] [PubMed]
  33. Fevotte, C.; Bertin, N.; Durrieu, J.L. Nonnegative matrix factorization with the Itakura-Saito divergence: With application to music analysis. Neural Comput. 2009, 21, 793–830. [Google Scholar] [CrossRef] [PubMed]
  34. Fevotte, C.; Idier, J. Algorithms for nonnegative matrix factorization with the β-divergence. Neural Comput. 2010, 23, 2421–2456. [Google Scholar] [CrossRef]
  35. Yu, K.; Woo, W.L.; Dlay, S.S. Variational regularized two-dimensional nonnegative matrix factorization with the flexible β-divergence for single channel source separation. In Proceedings of the 2nd IET International Conference in Intelligent Signal Processing (ISP), London, UK, 1–2 December 2015. [Google Scholar]
  36. Gao, B.; Woo, W.L.; Dlay, S.S. Variational regularized two-dimensional nonnegative matrix factorization. IEEE Trans. Neural Netw. Learn. Syst. 2012, 23, 703–716. [Google Scholar] [PubMed]
  37. Parathai, P.; Woo, W.L.; Dlay, S.S. Single-channel blind separation using L1-sparse complex nonnegative matrix factorization for acoustic signals. J. Acoust. Soc. Am. 2015. [Google Scholar] [CrossRef] [PubMed]
  38. Tengtrairat, N.; Woo, W.L.; Dlay, S.S.; Gao, B. Online noisy single-channel blind separation by spectrum amplitude estimator and masking. IEEE Trans. Signal Process. 2016, 64, 1881–1895. [Google Scholar] [CrossRef]
  39. Tengtrairat, N.; Gao, B.; Woo, W.L.; Dlay, S.S. Single-channel blind separation using pseudo-stereo mixture and complex 2-D histogram. IEEE Trans. Neural Netw. Learn. Syst. 2013, 24, 1722–1735. [Google Scholar] [CrossRef] [PubMed]
  40. Goto, M.; Hashiguchi, H.; Nishimura, T.; Oka, R. RWC music database: Music genre database and musical instrument sound database. In Proceedings of the International Symposium on Music Information Retrieval, Baltimore, MD, USA, 26–30 October 2003. [Google Scholar]
  41. Vincent, A.; Gribonval, R.; Fevotte, C. Performance measurement in blind audio source separation. IEEE Trans. Speech Audio Process. 2005, 14, 1462–1469. [Google Scholar] [CrossRef]
  42. Signal Separation Evaluation Campaign (SiSEC 2018). 2018. Available online: http://sisec.wiki.irisa.fr (accessed on 22 April 2018).
  43. Rabiner, L.R. A tutorial on hidden Markov models and selected applications in speech recognition. Proc. IEEE 1989, 77, 257–286. [Google Scholar] [CrossRef]
  44. Mørup, M.; Hansen, K.L. Tuning pruning in sparse non-negative matrix factorization. In Proceedings of the 17th European Signal Processing Conference (EUSIPCO’09), Glasgow, Scotland, 24–28 August 2009. [Google Scholar]
  45. Al-Tmeme, A.; Woo, W.L.; Dlay, S.S.; Gao, B. Underdetermined convolutive source separation using GEM-MU with variational approximated optimum model order NMF2D. IEEE Trans. Audio Speech Lang. Process. 2017, 25, 35–49. [Google Scholar] [CrossRef]
Figure 1. Time-domain representation and log-frequency spectrogram of: the piano music (top panels); the trumpet music (middle panels); and the mixed signal (bottom panels).
Figure 1. Time-domain representation and log-frequency spectrogram of: the piano music (top panels); the trumpet music (middle panels); and the mixed signal (bottom panels).
Sensors 18 01371 g001
Figure 2. Estimation of W τ and H ϕ with: (a) λ = 0 ; (b) λ = 0.1 ; (c) λ = 100 ; (d) λ = a d a p t i v e .
Figure 2. Estimation of W τ and H ϕ with: (a) λ = 0 ; (b) λ = 0.1 ; (c) λ = 100 ; (d) λ = a d a p t i v e .
Sensors 18 01371 g002
Figure 3. Convergence trajectory of the sparsity parameter: (A) λ 1 , 1 ϕ = 0   ; (B) λ 1 , 5 ϕ = 0 ; (C) λ 1 , 10 ϕ = 0 ; and (D) λ 1 , 15 ϕ = 0 .
Figure 3. Convergence trajectory of the sparsity parameter: (A) λ 1 , 1 ϕ = 0   ; (B) λ 1 , 5 ϕ = 0 ; (C) λ 1 , 10 ϕ = 0 ; and (D) λ 1 , 15 ϕ = 0 .
Sensors 18 01371 g003
Figure 4. SDR results of piano and trumpet mixture when using different λ values.
Figure 4. SDR results of piano and trumpet mixture when using different λ values.
Sensors 18 01371 g004
Figure 5. Histogram of the converged adaptive sparsity parameter.
Figure 5. Histogram of the converged adaptive sparsity parameter.
Sensors 18 01371 g005
Figure 6. SDR values of the separation results of mixture using different β values.
Figure 6. SDR values of the separation results of mixture using different β values.
Sensors 18 01371 g006
Figure 7. SDR values of the separation results of mixture using different β values and sparse methods.
Figure 7. SDR values of the separation results of mixture using different β values and sparse methods.
Sensors 18 01371 g007
Table 1. Differentiable Convex-Concave-Constant Decomposition of β -Divergence.
Table 1. Differentiable Convex-Concave-Constant Decomposition of β -Divergence.
Range d ˇ ( y | x ) d ^ ( y | x ) d ¯ ( x )
β < 1 and β 0 1 β 1 x y β 1 1 β y β y β 1
β = 0 x y β 1 log y y 1
1 β 2 d β ( x | y ) 0 0
β > 2 1 β y β 1 β 1 x y β 1 x y β 2
Table 2. Results using different sources.
Table 2. Results using different sources.
MixturesSDR (dB)β
Piano + trumpet16.112.11
9.192.13
9.431.93
7.731.82
12.212.09
Piano + violin13.071.07
8.151.23
6.250.92
9.331.20
8.190.89
Trumpet + violin14.630.68
8.140.62
7.810.67
9.810.51
7.550.52
Table 3. Results of separation for different mixture.
Table 3. Results of separation for different mixture.
MethodsSDR (dB)
Case (1)12.77
Case (2)13.01
Case (3)14.60
Case (4)14.62
Case (5)14.70
Case (6)15.60
Table 4. SDR (dB) results of adaptive β versus fixed β .
Table 4. SDR (dB) results of adaptive β versus fixed β .
MixturesSDR (dB) Using Adaptive βSDR (dB) Using β = 1SDR (dB) Using β = 2
Piano + Trumpet16.8514.1115.93
10.747.959.01
9.938.129.11
8.956.577.44
13.6410.2612.03
Piano + Violin14.1712.1211.67
9.047.957.11
8.136.095.81
10.49.088.71
9.597.857.19
Trumpet + Violin15.4012.4912.13
8.876.236.31
9.146.877.17
10.517.928.11
9.177.777.95
Table 5. Performance comparison of proposed method with NMF models.
Table 5. Performance comparison of proposed method with NMF models.
AlgorithmSDR (dB)
NMF-LS4.17
NMF-KLD3.47
NMF-TCS5.12
NMF-ARD3.98
NMF using proposed method7.63
Proposed method using matrix factor time–frequency deconvolution12.02

Share and Cite

MDPI and ACS Style

Woo, W.L.; Gao, B.; Bouridane, A.; Ling, B.W.-K.; Chin, C.S. Unsupervised Learning for Monaural Source Separation Using Maximization–Minimization Algorithm with Time–Frequency Deconvolution . Sensors 2018, 18, 1371. https://doi.org/10.3390/s18051371

AMA Style

Woo WL, Gao B, Bouridane A, Ling BW-K, Chin CS. Unsupervised Learning for Monaural Source Separation Using Maximization–Minimization Algorithm with Time–Frequency Deconvolution . Sensors. 2018; 18(5):1371. https://doi.org/10.3390/s18051371

Chicago/Turabian Style

Woo, Wai Lok, Bin Gao, Ahmed Bouridane, Bingo Wing-Kuen Ling, and Cheng Siong Chin. 2018. "Unsupervised Learning for Monaural Source Separation Using Maximization–Minimization Algorithm with Time–Frequency Deconvolution " Sensors 18, no. 5: 1371. https://doi.org/10.3390/s18051371

APA Style

Woo, W. L., Gao, B., Bouridane, A., Ling, B. W. -K., & Chin, C. S. (2018). Unsupervised Learning for Monaural Source Separation Using Maximization–Minimization Algorithm with Time–Frequency Deconvolution . Sensors, 18(5), 1371. https://doi.org/10.3390/s18051371

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Metrics

Back to TopTop