1. Introduction
Scientific discovery typically proceeds via falsification of the null hypothesis, using a test guided by an alternative hypothesis against which we wish to compare. In many cases, we can write the alternative as an extension of the null hypothesis, such that, for example, there is a parameter of the alternative hypothesis whose value is fixed under the null hypothesis. In these situations, the null hypothesis is well specified in terms of the prior distributions of its parameters, while we may have little or no idea what the prior distribution of the parameters of the alternative hypothesis is. Often, we are searching over many parameters, some with a well-specified prior and some without (e.g., the amplitude of the new effect). In this paper, we are interested in these mixed prior situations.
The Bayesian methodology of hypothesis testing compares the marginal likelihoods of the two hypotheses via their ratio, the Bayes factor [1]. If the prior distribution for the alternative is known, this is a valid methodology that yields optimal results. If the prior is only partially known, the resulting answer is sensitive to features of the model over which we have little control. In particular, we can always arbitrarily down-weight the alternative hypothesis in the Bayes factor by choosing a very broad prior for its amplitude parameters, which may lead to rejecting the alternative. This is not justifiable when the prior is poorly known: the end result can be a missed discovery opportunity. The opposite situation can also occur, where the chosen prior and the corresponding Bayes factor are too optimistic for the alternative hypothesis.
To overcome the arbitrariness of the prior choice, one could try to learn the prior from the data under some hyperparametrization, leading to a hierarchical or empirical Bayesian analysis [2,3]. This approach is not possible if we have no data that can inform the hyperparameters, as is the case when testing a new model (such as a new theory or a new phenomenon) that has never been observed before: a test of a new model should therefore not depend on the reported Bayes factor alone.
Various test statistics have been proposed that attempt to remove the dependence of the hypothesis test on the prior. They fall into two categories. The first bases the test on the posterior rather than the prior. One approach is to compare the posterior predictive data densities under both hypotheses, which can also be related to cross-validation [4]. Other proposed test statistics include suspiciousness [5] (the posterior-averaged log-likelihood ratio) and the e-value [6] (the posterior probability of the parameters whose posterior density under the alternative is higher than their posterior density under the null).
A common failure of the posterior-based hypothesis tests is that they do not account for the look-elsewhere effect or the effective multiplicity of the test: if we perform N tests, the probability that at least one of them has an anomalous value under the null hypothesis is increased by up to a factor of N.
A related class of methods are the so-called information criterion (IC) heuristics. The AIC and WAIC are meant to be approximations to the posterior predictive densities, and as such, they are closely related to cross-validation [4]. Another is the Bayesian information criterion (BIC), which is meant to approximate the Bayes factor, but the approximations do not hold for the class of scanning parameters we are concerned with in this paper. Thus, the look-elsewhere effect is not correctly taken into account by these ICs either, even if we account for the “effective number” of dimensions [5]. This is because the look-elsewhere effect does not depend only on the dimensionality difference between the null and the alternative but also on the prior range of the additional parameters: if one searches over a broader prior, the multiplicity increases, and the look-elsewhere effect is more severe.
One typical example is searching for a signal in a time series, such as a localized feature (e.g., planet transit in stellar flux): the signal could be anywhere in the time series, and we must pay the multiplicity penalty for looking at many time stamps. The larger the scan, the more false positives our search will produce. This is completely ignored if we only look at the posterior, which is typically very localized.
An alternative approach is that of frequentist hypothesis testing, which is defined in terms of the false-positive rate of the null hypothesis (the p-value, or Type I error). This is independent of the validity of the assumed prior for the alternative hypothesis: we can rule out the null hypothesis even in the absence of a well-developed alternative hypothesis. However, to do so, we need a test statistic, and the best test statistic is the one developed based on the expected properties of an alternative hypothesis. In some situations, we may have a known prior distribution for some parameters, and taking advantage of this information will increase the power of the test statistic.
The strength of the Bayes factor lies in the parameters other than the amplitude that do have well-specified priors: for these, Bayesian hypothesis testing has Occam’s razor built in [7], and it automatically accounts for effects such as the look-elsewhere effect [8]. For example, if we scan for a signal with a large template bank, we must account for the trials factor, and the corresponding p-value is increased [9,10]. There is no single established frequentist procedure to do this, while the Bayes factor accounts for it automatically. In this case, the dependence on the prior in the Bayes factor analysis is an asset. In many realistic situations, the alternative hypothesis has several parameters, some with well-specified priors and some without. This paper aims to address these situations from both the Bayesian and frequentist perspectives and proposes a solution that takes advantage of both methodologies. The main themes of this paper are:
The Bayes factor is the test statistic with the highest power and should be used even in frequentist analyses, assuming some of the priors are informative (known).
For many applications, some of the priors are known, and some are unknown. This mixed prior information requires an analysis that combines the Bayes factor with the frequentist p-value.
For mixed prior problems with some unknown priors, the frequentist p-value or Type I error (the false-positive rate) evaluated on the Bayes factor is a better way to summarize the significance of the alternative hypothesis than the Bayes factor itself. In many situations, this can be performed analytically, without simulations.
The Neyman–Pearson lemma guarantees that the maximum likelihood ratio (MLR) is the highest-power test statistic for a simple alternative hypothesis with no free parameters. Similarly, the Bayes factor is the test statistic with the highest power, i.e., it minimizes the Type II error (the false-negative rate) at a fixed Type I error, if the alternative hypothesis has multiple parameters with some prior information [11,12]. This is because the Bayes factor optimally summarizes the prior information we have about the alternative hypothesis. It does not mean that the estimated Type II error is correct, as the prior we assumed could be wrong. It does, however, mean that we cannot find a test statistic with a better Type II error without knowing a better prior.
The main contributions of this paper are:
The systematic treatment of hypothesis testing with only partial prior information.
Analytic methods for the evaluation of the p-value with the Bayes factor test statistic. Our formalism enables evaluating the p-value without running expensive simulations and generalizes the Wilks theorem to non-Gaussian posteriors and high p-values.
The interpretation of the results in terms of statistical mechanics, using concepts such as the counting of states, uncertainty quantification, and entropy. These connections have been explored before in the context of Bayesian statistics [13]; here, we extend them to frequentist statistics.
The outline of the paper is as follows: In Section 2, we define Bayesian hypothesis testing. In Section 3, we develop an analytic formalism for computing the false-positive rate. In Section 4, we apply this formalism to practical examples, notably an exoplanet transit search. In Section 5, we offer an interpretation of the developed formalism based on statistical mechanics.
2. Bayesian Hypothesis Testing
We are given some data and want to test them against competing hypotheses. We will assume there is a single null hypothesis $H_0$, which assumes there is no discovery in the data, and a collection of alternative hypotheses $H_1(\theta)$, all predicting some new signal. For example, when we are looking for a planet transit in time-series data, the null hypothesis predicts that stellar variability and noise alone are responsible for the flux variations, while the alternative hypothesis additionally predicts the presence of exoplanet transit dips. There are multiple alternative hypotheses because we do not know the planet’s properties, such as its period, phase, or amplitude, so the alternative hypothesis is not simple and has parameters $\theta$ that we need to vary.
The Bayesian approach to hypothesis testing is to examine the ratio of the marginal likelihoods, i.e., the Bayes factor

$B = p(x \mid H_1)/p(x \mid H_0)$,

where the marginal integral is

$p(x \mid H_1) = \int p(x \mid \theta)\, \pi(\theta)\, d\theta$.

Here, $\pi(\theta)$ is the prior for the alternative hypothesis parameters and $p(x \mid \theta)$ is the data likelihood under a general $\theta$. A typical situation is that the null hypothesis corresponds to some specific values of $\theta$, such as $A = 0$, where $A$ is the amplitude of the signal under the alternative hypothesis.
We are interested in the posterior odds between the competing hypotheses, which follow directly from the Bayes factor as $P(H_1 \mid x)/P(H_0 \mid x) = B\, P(H_1)/P(H_0)$, where $P(H_0)$ and $P(H_1)$ are the prior probabilities. We typically assume the prior probabilities of the two hypotheses to be equal, in which case the Bayes factor is also the posterior odds, and in the following, we will focus on the Bayes factor only.
The Bayes factor is an integral over all possible alternative hypotheses, but usually, a relatively small range of parameters, where the data likelihood peaks, dominates the integral. In this case, it suffices to apply a local integration at the peak of the posterior mass, which is often (but not always) at the maximum a posteriori (MAP) peak, where the posterior density peaks. If the location of the highest posterior mass peak under $H_1$ is $\hat\theta$, this gives the Bayes factor as [8]

$B \approx \dfrac{p(x \mid \hat\theta)}{p(x \mid H_0)}\, \pi(\hat\theta)\, V(\hat\theta)$. (3)

The Bayes factor depends on the MLR at the peak, $p(x \mid \hat\theta)/p(x \mid H_0) = e^{E}$, the alternative hypothesis prior $\pi(\hat\theta)$, and the posterior volume $V(\hat\theta)$, which is a result of the local integration of the posterior ratio around the peak. We denote the log-Bayes factor by $F = \ln B$, so that $F \approx E + \ln[\pi(\hat\theta) V(\hat\theta)]$. The posterior volume can be approximated using the Laplace approximation as $V \approx (2\pi)^{d/2} \sqrt{\det \Sigma}$, where d is the dimension of the parameter space and $\Sigma$ is the covariance matrix, given by the inverse Hessian of the negative log-likelihood evaluated at $\hat\theta$,

$\Sigma^{-1}_{ij} = -\partial_i \partial_j \ln p(x \mid \theta)\big|_{\theta = \hat\theta}$.
Here, $\partial_i$ denotes the derivative with respect to $\theta_i$. The choice of prior is subjective, and when the prior is known, it is informative and should be used. When the prior is not known, an option for a non-informative prior is the Jeffreys prior, which is given by the square root of the determinant of the Fisher information matrix. The Fisher matrix is the data-averaged Hessian,

$I_{ij}(\theta) = \big\langle -\partial_i \partial_j \ln p(x \mid \theta) \big\rangle_{x \sim p(x \mid \theta)}$,

so that $\pi_J(\theta) \propto \sqrt{\det I(\theta)}$. Note that the Jeffreys prior is an inverse of the data-averaged posterior volume, computed under the Laplace approximation. Therefore, it will on average cancel out the parameter dependence of $V(\hat\theta)$ in the Bayes factor, making it dependent only on the local MLR.
We can generalize these concepts to non-Gaussian posteriors. As we will show in Section 4.3, the Laplace approximation can be poor if the posterior is non-Gaussian, but the local Bayes factor integration is still well defined, in which case Equation (3) can be viewed as the definition of $V(\hat\theta)$ (we use this in, e.g., Equations (A8) and (25)). The properly generalized Jeffreys prior is then $\pi_J(\theta) \propto \langle V(\theta) \rangle^{-1}$, where the average is over data realizations. This is the prior which on average makes the Bayes factor directly proportional to the MLR. Such a prior choice may be called non-informative, as only the likelihood is used to test the two hypotheses.
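To make the local approximation of Equation (3) concrete, the following minimal Python sketch evaluates a Laplace-approximated local Bayes factor for a toy one-peak problem. The toy likelihood (a Gaussian bump in white noise), the prior box, and the use of the BFGS inverse-Hessian estimate are illustrative assumptions, not the setup analyzed in this paper.

```python
import numpy as np
from scipy.optimize import minimize

# Toy data: a time series of pure standard Gaussian noise (the null hypothesis).
rng = np.random.default_rng(0)
t = np.linspace(0.0, 1.0, 200)
x = rng.standard_normal(t.size)

def signal(theta):
    # Illustrative alternative: a Gaussian bump with amplitude A and location mu.
    A, mu = theta
    return A * np.exp(-0.5 * ((t - mu) / 0.05) ** 2)

def neg_log_lr(theta):
    # -log[ p(x|theta) / p(x|H0) ] for unit-variance Gaussian noise
    r = x - signal(theta)
    return 0.5 * np.sum(r ** 2) - 0.5 * np.sum(x ** 2)

def prior_density(theta):
    # Illustrative separable uniform prior: A in [-5, 5], mu in [0, 1]
    A, mu = theta
    return 0.1 if (-5 <= A <= 5 and 0 <= mu <= 1) else 0.0

# Local MAP/MLE peak (the prior is flat inside the box, so MAP = MLE here)
res = minimize(neg_log_lr, x0=np.array([0.5, 0.5]))   # BFGS by default
theta_hat = res.x
E = -res.fun                                           # maximum log-likelihood ratio at the peak

# Laplace posterior volume: V = (2 pi)^(d/2) sqrt(det Sigma), Sigma = inverse Hessian.
# BFGS only provides an approximation to the inverse Hessian; good enough for a sketch.
d = theta_hat.size
Sigma = res.hess_inv
V_post = (2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(Sigma))

# Local Bayes factor of Equation (3): B = e^E * prior(theta_hat) * V_post
log_B = E + np.log(prior_density(theta_hat)) + np.log(V_post)
print(f"E = {E:.2f}, log B = {log_B:.2f}")
```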
3. Frequentist Hypothesis Testing
In frequentist hypothesis testing, we first define a test statistic, which should be chosen to maximize the contrast against the alternative hypothesis. As guaranteed by the Neyman–Pearson argument [11,12], the optimal statistic between two well-specified hypotheses is the posterior odds. Any monotonically increasing function of the posterior odds is an equally good test statistic in the frequentist sense, so for convenience, we will work with the log-Bayes factor $F = \ln B$.
In frequentist methodology, we quantify a test statistic using its Type I false-positive rate (p-value) under the null hypothesis $H_0$. If this number is sufficiently small, the null hypothesis is rejected. Unlike the posterior odds, the p-value is independent of the correctness of the prior of the alternative hypothesis: it only depends on the null hypothesis itself. This is a great advantage of the p-value over the posterior odds and is the main reason why posterior odds have not been widely adopted for hypothesis testing, even in Bayesian textbooks [4,14].
While the posterior odds can be derived entirely from the data using the likelihood (the likelihood principle), the p-value is given by the frequency distribution of the test statistic under the null hypothesis, which generally requires simulating the null hypothesis many times to obtain it. Because this requires evaluating the test statistic for each simulation, and because we argue the test statistic itself should be the posterior odds, this can be significantly more expensive in the frequentist methodology than evaluating the posterior odds once on the data, as performed in the Bayesian methodology. For this reason, we will study analytic techniques for evaluating the false-positive rate. The discussion here will not be rigorous, so we rederive the results under the Gaussian likelihood assumption in Appendix A.
The log-Bayes factor F and the maximum log-likelihood ratio E are functions of a particular data realization, but here we will approximate them solely as functions of the local MAP parameters $\hat\theta$ (for a motivating example, see Appendix A.2). Note that this is not possible in general, for example, if multiple correlated peaks contribute to the Bayes factor or if the data realizations are very discrete. Nevertheless, here we proceed with this assumption and note that the p-value of the Bayes factor can then be inferred from the distribution of the parameters $\hat\theta$ under the null hypothesis, $p(\hat\theta \mid H_0)$. If the prior is correct, the Bayes factor correctly predicts the relative occurrence of the two hypotheses in a long series of trials:

$\dfrac{p(\hat\theta \mid H_1)}{p(\hat\theta \mid H_0)} \approx B(\hat\theta) = e^{F(\hat\theta)}$.
The distribution $p(\hat\theta \mid H_1)$ follows the prior: each sample with local MAP parameters $\hat\theta$ will have a true value of the parameters approximately within the posterior volume of $\hat\theta$. Thus, as long as the prior is sufficiently smooth relative to the posterior, we have $p(\hat\theta \mid H_1) \approx \pi(\hat\theta)$. Therefore, the distribution of the local MAP parameters under the null hypothesis is

$p(\hat\theta \mid H_0) \approx \pi(\hat\theta)\, e^{-F(\hat\theta)} = \big[e^{E(\hat\theta)}\, V(\hat\theta)\big]^{-1}$. (7)
Note that the distribution $p(\hat\theta \mid H_0)$ is independent of the prior because the prior dependence of the Bayes factor cancels $\pi(\hat\theta)$. Therefore, Equation (7) holds regardless of the correctness of the prior. The p-value is found by integrating over the parameters which yield at least the desired test statistic $F^*$:

$p(F^*) = \int_{F(\hat\theta) > F^*} \pi(\hat\theta)\, e^{-F(\hat\theta)}\, d\hat\theta = \int_{F^*}^{\infty} \pi(F)\, e^{-F}\, dF$. (8)
Here, $\pi(F)$ is the prior, marginalized over the parameters $\hat\theta$ which yield F. This result says that the probability density of finding some F under the null hypothesis is given by the probability density of finding it under the alternative hypothesis prior, but exponentially damped with F. Note that Equation (8) is equivalent to

$p(F^*) = \int_{F(\hat\theta) > F^*} \big[e^{E(\hat\theta)}\, V(\hat\theta)\big]^{-1}\, d\hat\theta$, (9)

so the region of integration depends on the prior, but the integrand does not. This means that the prior selects the parameter range where the false positives can be generated, but the rate at which they are generated at those parameters is prior independent.
These expressions are useful when we have an analytic expression for the posterior volume and can perform the integral analytically (see examples in Section 4.1 and Section 4.2). Here, we will derive an approximation which directly relates the p-value to the Bayes factor, so it is applicable even if the Bayes factor is only available numerically.
Assuming that the alternative has an amplitude parameter A whose prior is separable from the other parameters and whose posterior volume can be approximated by the Laplace approximation, $V_A \approx \sqrt{2\pi}\, \sigma_A$, we rewrite Equation (3) and define the reduced Bayes factor $\bar{B}$ (Equation (10)). Computing $\bar{B}$ is completely analogous to computing the Bayes factor, but there is no need to perform the amplitude integral, as this is performed analytically by the Laplace approximation.
Typically, the variations of the posterior volume are much slower than the likelihood suppression $e^{-E}$, so we may evaluate the integral in (9) by considering the posterior volume to be a constant evaluated at the observed local MAP parameters. This gives Equation (11), implying that the p-value can be inferred directly from the observed reduced Bayes factor $\bar{B}$, with no need to perform any additional integrals.
Non-Asymptotic p-Value
We have found an analytic expression for the false-positive rate; however, it is only valid when the false-positive rate is low. Following [8], we will show that this result can be easily extended to all p-values if the posterior volume is much smaller than the prior volume, i.e., precisely when the look-elsewhere effect is important.
Let us partition the alternative hypothesis manifold into K smaller manifolds and consider searches over the smaller manifolds. Let us choose the partition in a way that the false-positive rate is the same in all of the small searches and call it $p_K$.
Suppose the posterior volume is much smaller than the prior volume. In that case, most data realizations will have their posterior volume well within the prior range of some smaller manifold, even when K is relatively large. Therefore, the probability of not finding a false positive in the original search equals the probability of not finding it in any of the small searches:

$1 - p = (1 - p_K)^K$.

If K is large, $p_K$ becomes small, and we can compute it using the asymptotic expression of Equation (8) or (11): the asymptotic false-positive rate splits additively over the partition, so $p_K = p_{\infty}/K$, where $p_{\infty}$ denotes the asymptotic p-value of the full search. Taking the large K limit, we find

$p = 1 - e^{-p_{\infty}}$. (13)
This is a continuous-parameter generalization of the Šidák correction, which is itself a generalization of the Bonferroni correction, commonly used for discrete states where the trials factor is a well-defined concept; this procedure is referred to as a multiple-comparisons correction.
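A hedged numerical reading of this limit: assuming Equation (13) takes the Šidák-type form $p = 1 - e^{-p_\infty}$ as written above, the conversion from the asymptotic false-positive rate of Equation (8) or (11) to a p-value valid over the whole range is a one-liner.

```python
import numpy as np

def full_range_p_value(p_asym):
    """Convert the asymptotic (small-p) false-positive rate into a p-value valid over
    the whole range, assuming the large-K limit p = 1 - exp(-p_asym) described above
    (our reading of Equation (13))."""
    return -np.expm1(-np.asarray(p_asym))   # numerically stable 1 - exp(-p_asym)

# For small p_asym the two agree; for large p_asym the result saturates below 1.
for pa in (1e-6, 0.1, 1.0, 5.0):
    print(pa, float(full_range_p_value(pa)))
```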
4. Results
In this section, we apply the developed formalism to three examples with increasing complexity. We start with the linear model and periodogram and compare our results with the literature. We then turn to the more realistic analysis of the exoplanet transit search.
In all cases, the data are a real vector $x \in \mathbb{R}^n$. The null hypothesis assumes the data are independently and identically distributed (IID):

$p(x \mid H_0) = \prod_{i=1}^{n} q(x_i)$,

where q is the probability density function of the noise. The alternative predicts some feature $s(\theta)$ in the data, such that the residuals $x - s(\theta)$ are described by the null hypothesis. Note that in the exoplanet case, we will need to go to the Fourier basis and normalize by the power spectrum to make the data IID.
4.1. Linear Model
In this first example, the noise is standard Gaussian distributed and the model is a linear superposition of
d features
:
Without the loss of generality, we may assume the features are orthonormal by applying the Gram–Schmidt algorithm if they are not.
The MAP model is the projection of the data onto the model plane and has the log-MLR $E = \frac{1}{2} \sum_{k=1}^{d} \hat{A}_k^2$, where $\hat{A}_k = f^{(k)} \cdot x$. The posterior is Gaussian, so the Laplace approximation is exact and gives the posterior volume $V = (2\pi)^{d/2}$. Often, we do not have any prior information, and it makes sense to adopt the Jeffreys prior, which is uniform for the linear parameters. For simplicity, we assume the prior volume to be a ball with radius $A_{\max}$. The Bayes factor (3) is

$B = \dfrac{e^{E}\, (2\pi)^{d/2}}{c_d\, A_{\max}^{d}}$,

where $c_d = \pi^{d/2}/\Gamma(d/2+1)$ is the volume of the d-dimensional unit ball. Note that the Bayes factor cannot be taken at face value because it is sensitive to the unknown amplitude cutoff $A_{\max}$. Furthermore, there is no advantage in using the Bayes factor over the MLR because F and E only differ by a constant and are therefore equally good as frequentist test statistics. We will therefore use E as a test statistic in this example. The posterior volume is a constant, so the integral of Equation (9) just picks up the volume of the constant-likelihood surface, which is a (d − 1)-dimensional sphere. We obtain

$p(E) = \dfrac{\Gamma(d/2,\, E)}{\Gamma(d/2)}$,

where we have used that the volume of a (d − 1)-dimensional sphere of radius r is $2\pi^{d/2} r^{d-1}/\Gamma(d/2)$, with $\Gamma$ the Gamma function and $\Gamma(\cdot,\cdot)$ the upper incomplete Gamma function. Note that the p-value is independent of the unknown amplitude cutoff $A_{\max}$.
The resulting cumulative distribution function is that of a $\chi^2$ distribution with d degrees of freedom, and we reproduce the well-known result that $2E$ of a linear model with d features is distributed as a $\chi^2$ distribution with d degrees of freedom [15]. The p-value is increasing with d at a constant E, which is a reflection of the entropy versus energy competition discussed in Section 5: there are more states on the shell of a constant-energy E if d is higher.
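This result can be checked numerically. The sketch below draws a null-hypothesis realization, projects it onto d orthonormal features, and evaluates the p-value both as the regularized upper incomplete Gamma function $\Gamma(d/2, E)/\Gamma(d/2)$ and as the $\chi^2$ survival function at $2E$; the two agree. The data size and the feature construction are illustrative choices.

```python
import numpy as np
from scipy import stats
from scipy.special import gammaincc

rng = np.random.default_rng(1)
n, d = 500, 4

# Orthonormal features via QR (a numerical Gram-Schmidt), standard Gaussian noise under H0
f = np.linalg.qr(rng.standard_normal((n, d)))[0]       # columns are orthonormal features
x = rng.standard_normal(n)

amp_hat = f.T @ x                                      # MAP/MLE amplitudes (projection)
E = 0.5 * np.sum(amp_hat ** 2)                         # maximum log-likelihood ratio

# p-value: regularized upper incomplete Gamma, Gamma(d/2, E)/Gamma(d/2),
# identical to the survival function of a chi^2 distribution with d dof evaluated at 2E
p_gamma = gammaincc(d / 2, E)
p_chi2 = stats.chi2.sf(2 * E, df=d)
print(p_gamma, p_chi2)    # the two expressions agree
```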
4.2. Periodogram
In this example, we are given n time-series measurements $x(t_i)$ with Gaussian uncertainties $\sigma_i$. We are searching for harmonic periodic signals [16,17,18] of the form $A \sin(2\pi \nu t + \phi)$, up to a normalizing factor introduced for later convenience. Here, A is the signal’s amplitude, $\nu$ is the unknown frequency, and $\phi$ is the phase. We assume a uniform prior on all three parameters, where $A_{\max}$ is some arbitrary large cutoff on the amplitude and the frequency cutoff $\nu_{\max}$ is set by some physical properties of the signal or by the experimental limitations, e.g., by the Nyquist frequency.
Using Equations (A7) and (A15), we compute the Bayes factor, which again suffers from the unknown amplitude cutoff, so we turn to the frequentist analysis to interpret the Bayes factor. We observe that the Bayes factor and the MLR are a simple function of one another, so their null distributions are related by the change-of-variables formula. Once again, the posterior volume is independent of the amplitude and phase parameters, and the integral (9) picks up the volume of the constant-likelihood surface, which is, in this case, a cylinder; the resulting expression simplifies due to the convenient normalization of the template.
Note that while the amplitude cutoff cancels in the p-value, the frequency cutoff does not. This is the look-elsewhere effect: the larger the frequency space search, the larger the false-positive rate at a fixed MLR.
These results agree with the expressions of [19], which use the formalism of [20,21]. This mathematical formalism is based on the extremes of random processes of various distributions, such as the gamma distribution, and needs to be derived separately for each distribution. In addition, this formalism formally only gives an analytic lower limit to the corresponding extreme value distributions, while in practice, equality is assumed without proper justification. Our formalism provides a different derivation that results in the same expressions in the periodogram case. However, here, we do not consider the more challenging periodogram problem with sparse data sampling and correlated noise.
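For concreteness, the sketch below computes the periodogram test statistic $E(\nu)$, i.e., the maximum log-likelihood ratio over amplitude and phase at each trial frequency, by weighted least squares on a cosine/sine pair (equivalent to a Lomb–Scargle-type power). The regular sampling, unit uncertainties, and frequency grid are illustrative assumptions; the look-elsewhere correction then follows from the size of the frequency range scanned, as discussed above.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 1000
t = np.arange(n) * 1.0                    # regular time sampling (illustrative)
sigma = np.ones(n)                        # per-point Gaussian uncertainties
x = rng.standard_normal(n) * sigma        # a null-hypothesis (noise-only) realization

def mlr_at_frequency(nu):
    """Max log-likelihood ratio over amplitude and phase at fixed frequency nu,
    obtained by weighted least squares on the (cos, sin) pair."""
    design = np.column_stack([np.cos(2 * np.pi * nu * t),
                              np.sin(2 * np.pi * nu * t)]) / sigma[:, None]
    y = x / sigma
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ coef
    return 0.5 * (y @ y - resid @ resid)

freqs = np.linspace(1e-3, 0.5, 2000)      # up to the Nyquist frequency of this grid
E = np.array([mlr_at_frequency(f) for f in freqs])
print("max E =", E.max(), "at frequency", freqs[np.argmax(E)])
```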
4.3. Exoplanet Search
As a more complex, non-trivial example of the formalism we developed, we consider exoplanet detections in transit data, where the planet orbiting the star dims the star when it transits across its surface. We have a time series of a star’s flux measurements $x(t_i)$. In the absence of a transiting planet, the data are described by stationary correlated Gaussian noise modeling the stellar variability. Note that the long-term trends in the Kepler data are removed by the preprocessing module [22], and outliers can be efficiently Gaussianized without affecting the planet transits [23]. Here, we ignore other defects in the data, such as binary stars, sudden pixel-sensitivity dropouts, etc. Such data are, for example, collected by the Kepler Space Telescope [24] and the Transiting Exoplanet Survey Satellite (TESS) [25].
One would like to compare the hypothesis $H_1$ that we have a planet in the data to the null hypothesis $H_0$ that there is only noise. As argued in this paper, we will use the Bayes factor as a test statistic to incorporate informative prior information and non-Gaussian posteriors. The significance of a discovery is then reported as the probability of a false positive exceeding the observed value of the Bayes factor. We will first outline the procedure for calculating the Bayes factor given the prior for the transit model parameters. Then, using simulations of the null hypothesis, we will demonstrate that Equation (13) gives reliable results for the p-value of the Bayes factor (Section 4.3.2). In Section 4.3.3, we will discuss several prominent prior choices and demonstrate how a realistic prior choice can reduce the Type II error at a constant Type I error. In Section 4.4, we will consider a noise distribution with strong power-law tails and show that the p-value can still be computed by the analytical formula (11).
4.3.1. Bayes Factor
The planet transit model can be parametrized by four parameters: transit amplitude, period, phase, and transit duration, $\theta = (A, P, \phi, \tau)$. The transit model is a periodic train of M transits with period P, phase $\phi$, amplitude A, and transit duration $\tau$, built from a U-shaped transit template that is nonzero in the region (−1/2, 1/2) and depends on the limb darkening of the stellar surface [26].
The likelihood is Gaussian, and stationarity ensures that the Fourier transformation diagonalizes the covariance matrix with the power spectrum on the diagonal. The power spectrum $P(f)$ can be learned from the data [27]. To make the noise standard Gaussian IID, we introduce the whitened data, obtained by dividing the Fourier components by the square root of the power spectrum. We also rescale the amplitude A to make the model normalized.
It will be convenient to first optimize over the linear parameter A and then over the remaining nonlinear parameters $(P, \phi, \tau)$. The amplitude’s MAP value at fixed nonlinear parameters is given by Equation (A3), where we have restored a model normalization factor. This is the matched filtering expression for the signal-to-noise ratio [23] and can be computed efficiently using Fourier transforms [23] for all phases $\phi$ on a fine grid at once. Our strategy for finding the highest MAP solutions is therefore to first find a maximum likelihood estimator (MLE) by scanning over the entire parameter space (for details, see [23]) and then use it as an initial guess in a nonlinear MAP optimization.
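The following sketch illustrates the FFT-based matched filter: the signal-to-noise ratio of a transit template at every phase is obtained from a single cross-correlation of the noise-weighted data with the template. The power-spectrum convention (E|fft(noise)_f|² = power_f on the full FFT grid) and the box-shaped template are our own illustrative choices, not the pipeline of [23].

```python
import numpy as np

def matched_filter_snr(x, template, power):
    """Matched-filter signal-to-noise ratio at every (circular) transit phase.
    `template` is the transit model at zero phase; `power` is the noise power spectrum
    on the full FFT grid, with the convention E|fft(noise)_f|^2 = power_f (so
    unit-variance white noise of length n has power_f = n)."""
    n = x.size
    X, S = np.fft.fft(x), np.fft.fft(template)
    # noise-weighted correlation <x, s_phi> for all shifts phi at once
    num = n * np.fft.ifft(X * np.conj(S) / power).real
    # noise-weighted template norm; equals 1/sigma_A, so num/den is A_hat/sigma_A
    den = np.sqrt(np.sum(np.abs(S) ** 2 / power))
    return num / den

# Illustrative usage: white Gaussian noise and a crude box-shaped transit dip template
rng = np.random.default_rng(3)
n = 4096
x = rng.standard_normal(n)                       # a noise-only (null) realization
template = np.zeros(n); template[:60] = -1.0     # zero-phase transit template
power = np.full(n, float(n))                     # flat spectrum for unit-variance white noise
snr = matched_filter_snr(x, template, power)
print("best phase:", int(snr.argmax()), "max SNR:", float(snr.max()),
      "E =", 0.5 * float(snr.max()) ** 2)
```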
We calculate the posterior volume associated with the most promising peaks by marginalizing over the planet’s parameters in the vicinity of the peak. The amplitude has a Gaussian posterior which is not correlated with the other parameters, because the likelihood is Gaussian and we are properly normalizing the template (see the discussion below Equation (A7)). The posterior volume is therefore a product of the Laplace amplitude factor $\sqrt{2\pi}\,\sigma_A$ and a local integral over the remaining nonlinear parameters (Equation (25)). One would be tempted to employ the Laplace approximation, but Figure 1 and Figure 2 show this does not give satisfactory results: we are not in the asymptotic limit. While a frequentist approach using Wilks’ theorem would become invalid, here we can compute the full marginal integral of the Bayesian evidence. The integral (25) is only three-dimensional, so we take the Hessian at the peak to define a Gaussian quadrature integration scheme [28] of degree 7, implemented in [29], which requires 24 integrand evaluations.
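As a simpler, hedged alternative to the degree-7 cubature of [28,29], one can integrate the local likelihood ratio over the nonlinear parameters with a tensor-product Gauss–Hermite rule in coordinates defined by the Hessian at the peak. This uses more integrand evaluations (order³ instead of 24 in three dimensions) but requires no special library; the function below is an illustrative sketch, not the integrator used for the figures.

```python
import numpy as np
from itertools import product
from numpy.polynomial.hermite_e import hermegauss   # probabilists' Hermite nodes/weights

def log_local_evidence(log_lr, theta_peak, hessian, order=5):
    """Log of the local marginal integral  log( int exp(log_lr(theta)) d theta )
    around a peak, via tensor-product Gauss-Hermite quadrature in coordinates set by
    the Hessian of -log_lr at the peak. `log_lr` is the log-likelihood ratio with
    respect to the null; `hessian` is the (d x d) Hessian of -log_lr at theta_peak."""
    d = theta_peak.size
    L = np.linalg.cholesky(np.linalg.inv(hessian))   # theta = theta_peak + L @ z
    nodes, weights = hermegauss(order)               # rule for the weight exp(-z^2 / 2)
    log_lr0 = log_lr(theta_peak)
    total = 0.0
    for idx in product(range(order), repeat=d):
        z = nodes[list(idx)]
        w = np.prod(weights[list(idx)])
        # divide out the implicit exp(-|z|^2/2) weight; work relative to the peak value
        total += w * np.exp(log_lr(theta_peak + L @ z) - log_lr0 + 0.5 * z @ z)
    return log_lr0 + np.log(total) + np.log(np.abs(np.linalg.det(L)))

# Illustrative check on a 3D Gaussian peak, where the Laplace result is exact:
# the log evidence should equal (3/2) log(2 pi), since log_lr = -|theta|^2 / 2.
print(log_local_evidence(lambda th: -0.5 * th @ th, np.zeros(3), np.eye(3)),
      1.5 * np.log(2 * np.pi))
```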
4.3.2. p-Value
To obtain the p-value, we use Equation (11) and extend it to all p-values using (13). We test our analytical expression for the p-value of the Bayes factor by simulating the null hypothesis and comparing the computed p-value with the empirically determined value. We evaluate 4600 realizations of the null hypothesis. We take a realistic power spectrum extracted from Kepler’s data for the star Kepler 90. The power spectrum and example realizations are shown in Figures 2 and 4 of [23]. A realization is a uniformly distributed choice of the phases of the Fourier components and normally distributed amplitudes of the Fourier components, with a zero mean and variance given by the power spectrum. In each of the resulting time series, we then determine the Bayes factor of the planet hypothesis against the null hypothesis (Section 4.3.1) and its p-value.
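A minimal sketch of generating such a null realization, under the convention (ours, for illustration) that the power spectrum is defined on the rfft grid with E|rfft(noise)_f|² = power_f:

```python
import numpy as np

def simulate_null(power, n, rng):
    """Draw one realization of stationary Gaussian noise with a given power spectrum,
    by giving each Fourier mode a Gaussian amplitude and a uniformly random phase.
    `power` is defined on the rfft grid with E|rfft(noise)_f|^2 = power_f (a convention
    chosen here for illustration)."""
    nf = n // 2 + 1
    re = rng.standard_normal(nf)
    im = rng.standard_normal(nf)
    coeffs = (re + 1j * im) * np.sqrt(power / 2.0)
    coeffs[0] = re[0] * np.sqrt(power[0])            # DC mode is real
    if n % 2 == 0:
        coeffs[-1] = re[-1] * np.sqrt(power[-1])     # Nyquist mode is real for even n
    return np.fft.irfft(coeffs, n=n)

# Illustrative check: a flat spectrum with power_f = n gives roughly unit-variance noise
rng = np.random.default_rng(4)
n = 2 ** 14
x = simulate_null(np.full(n // 2 + 1, float(n)), n, rng)
print(x.std())   # close to 1
```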
The analytic and empirical p-values are compared in Figure 2, showing excellent agreement. This demonstrates that our formalism enables the evaluation of the p-value beyond the asymptotic limit, thus generalizing Wilks’ theorem.
4.3.3. The Choice of Prior
Here, we discuss the prior choice for the period, phase, and transit duration. We will see that using an informative prior, when available, can significantly improve the detection efficiency.
We scan over the period range from 3 to 300 days, and at each period over all phases from 0 to 1. As we argued in this paper, when we do not know the prior, it is simplest to adopt the Jeffreys choice. We will assume this for the period and phase parameters. Note that for the phase, the Jeffreys prior is flat from 0 to 1, which we can also view as a known (informative) prior because the phase cannot affect the exoplanet detectability. There is a natural value $\tau_0$ for the transit duration parameter $\tau$, given the planet’s period and assuming a circular, non-inclined orbit. We illustrate the impact of the prior choice by considering three scenarios:
- (1)
A fixed value of $\tau$, defined by assuming a circular, perfectly aligned orbit. The transit duration is then fixed by Kepler’s third law, scaling as $\tau_0 \propto (P/\rho_\star)^{1/3}$, where $\rho_\star$ is the star’s density. This choice can be too restrictive for non-circular or inclined orbits and can penalize real planets.
- (2)
Jeffreys prior on a broad domain of transit durations. This choice may include physically implausible transit times, leading to a larger multiplicity penalty.
- (3)
A realistic prior distribution, taking into account inclined and eccentric orbits. Orbits are assumed to be isotropic, with eccentricities drawn from a beta distribution with parameters that match the observed planets in the Kepler data [30]. In addition, the star’s density is measured with an uncertainty on the order of 15%, causing uncertainty in $\tau_0$. The transit duration prior is obtained by marginalizing over the orbit inclination, eccentricity, and the star’s parameters; a Monte Carlo sketch of this construction is given below the list. The resulting distribution is a broadened version of the delta function at $\tau_0$.
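A hedged Monte Carlo sketch of how such a broadened prior can be built. The geometric factors used are the standard leading-order ones (an approximately uniform impact parameter for transiting isotropic orbits, the $\sqrt{1-e^2}/(1+e\sin\omega)$ eccentricity factor, and a $\rho_\star^{-1/3}$ density scaling); the Beta parameters and the example value of the circular-orbit duration are placeholders, not the values fitted to the Kepler sample in [30].

```python
import numpy as np

def transit_duration_samples(tau_circ, n_samples=100_000, rho_frac_err=0.15,
                             beta_a=1.0, beta_b=3.0, rng=None):
    """Monte Carlo samples of the transit duration around the circular, edge-on value
    tau_circ, broadened by (i) impact parameter (approximately uniform for transiting
    planets with isotropic orbits), (ii) eccentricity drawn from a Beta distribution
    (placeholder parameters), and (iii) a ~15% stellar-density uncertainty. This is a
    sketch of the construction, not the paper's exact prior."""
    rng = rng or np.random.default_rng()
    b = rng.uniform(0.0, 1.0, n_samples)                  # impact parameter of transiting orbits
    e = rng.beta(beta_a, beta_b, n_samples)               # orbital eccentricity
    omega = rng.uniform(0.0, 2 * np.pi, n_samples)        # argument of periastron
    rho_factor = (1.0 + rho_frac_err * rng.standard_normal(n_samples)) ** (-1.0 / 3.0)
    ecc_factor = np.sqrt(1.0 - e ** 2) / (1.0 + e * np.sin(omega))
    return tau_circ * np.sqrt(1.0 - b ** 2) * ecc_factor * rho_factor

# A histogram of these samples approximates the broadened prior pi(tau) around tau_circ.
samples = transit_duration_samples(tau_circ=0.3)   # tau_circ = 0.3 days is illustrative
print(np.percentile(samples, [16, 50, 84]))
```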
All three choices are visualized in Figure 3. In Appendix C, we derive the approximate Jeffreys prior for the first two scenarios from the scaling of the Fisher information matrix. One can take the transit template and calculate the Fisher information matrix numerically for more accurate results. In Figure 4, we show that the peaks of the null hypothesis are distributed according to the Jeffreys prior, in agreement with our predictions.
Taking $\tau$ as a completely unconstrained parameter will in general lead to detecting planets that are not physically plausible and will therefore increase the false-positive rate and force us to reject more real planets than necessary. In the present example, shown in Figure 2, this effect is not very large if the prior is chosen such that it still introduces a reasonable cutoff on the transit duration. However, choosing such a cutoff is a choice that must be based on physical arguments: the prior plays an essential role.
A prior that is too narrow (case 1) is also suboptimal because it will reject some real planets. Using a search template with a transit duration that differs from the planet’s true transit duration will reduce the planet’s signal-to-noise ratio by the mismatch factor of Equation (26), where this is an expected reduction over the noise realizations.
In Figure 5, we show the ROC (receiver operating characteristic) curves for all three prior choices, that is, the true-positive rate as a function of the false-positive rate, both parametrized by the detection threshold. The false-positive probabilities are taken from Figure 2. The true-positive probability is the probability that a signal with some signal-to-noise ratio is detected above the threshold. The detected signal-to-noise ratio can be approximated as a Gaussian-distributed variable with unit variance. Its mean is the injected signal-to-noise ratio with the realistic and wide priors. It is additionally reduced with the circular orbit prior because the search template differs from the true template (Equation (26)). Using a realistic prior improves the true detection probability at a fixed false-positive probability relative to a prior that is too narrow (a circular orbit prior) because it improves the fit. It also improves the ROC relative to a prior that is too broad because it does not include templates that rarely occur in the data and would lead to a larger multiplicity and hence a larger false-positive rate. This figure thus demonstrates that hypothesis testing with a realistic-prior Bayes factor gives an optimal ROC (power versus p-value).
4.4. Non-Gaussian Likelihood
Our final example is a single exoplanet transit search in Student’s t-distributed noise:

$q(x) = \dfrac{\Gamma\!\left(\frac{\nu+1}{2}\right)}{\sqrt{\nu\pi}\,\Gamma\!\left(\frac{\nu}{2}\right)} \left(1 + \dfrac{x^2}{\nu}\right)^{-\frac{\nu+1}{2}}$,

where $\nu$ is a parameter of the distribution (degrees of freedom). $\nu \to \infty$ corresponds to Gaussian noise, while lower values of $\nu$ result in strong power-law tails. This noise is an exaggerated version of the outlier distribution in the actual Kepler data [27]. The alternative model is a single U-shaped transit with a fixed transit duration. The parameters of the model are the location of the transit $t_0$ and its amplitude A.
As in the main exoplanet example, we have a time series of flux measurements, equally spaced every 30 min, and the total length here is 200 days. We do not consider time-domain noise correlations here. Matched filtering is not possible due to the non-Gaussian noise, so we scan over the phase in 30 min increments and find the MLE amplitude solution at each phase using Newton’s method. We identify the five highest MLR peaks and compute the Bayes factor around each peak. We find that the Laplace approximation is accurate for the amplitude parameter, but the transit-location integral has to be taken numerically. We take the peak with the highest Bayes factor, compute the associated asymptotic false-positive rate of Equation (11), and see excellent agreement with the p-value prediction (13), shown in Figure 6.
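The per-phase amplitude fit under Student-t noise can be carried out with standard tools. The sketch below uses a bounded scalar optimizer instead of Newton’s method and a box-shaped dip instead of the U-shaped template, both illustrative simplifications of the setup described above.

```python
import numpy as np
from scipy import stats, optimize

def amplitude_mle(x, template, nu):
    """Maximum-likelihood amplitude of a fixed-phase transit template under IID
    Student-t noise with nu degrees of freedom, together with the corresponding
    maximum log-likelihood ratio E relative to the A = 0 null."""
    def negloglike(A):
        return -np.sum(stats.t.logpdf(x - A * template, df=nu))
    res = optimize.minimize_scalar(negloglike, bounds=(-10.0, 10.0), method="bounded")
    E = negloglike(0.0) - res.fun
    return res.x, E

# Illustrative scan over transit phases on a coarse grid, box-shaped dip template
rng = np.random.default_rng(5)
nu, n, width = 3.0, 2000, 10
x = stats.t.rvs(df=nu, size=n, random_state=rng)     # noise-only (null) time series
best_E, best_phase = -np.inf, 0
for p in range(0, n - width, width):
    tmpl = np.zeros(n); tmpl[p:p + width] = -1.0
    A_hat, E = amplitude_mle(x, tmpl, nu)
    if E > best_E:
        best_E, best_phase = E, p
print("highest E =", best_E, "at phase index", best_phase)
```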
5. Statistical Mechanics Interpretation
Here, we interpret Bayesian and frequentist hypothesis testing as an energy versus entropy competition, where the energy is the maximum log-likelihood ratio, which favors the alternative model, and the entropy encodes the influence of the look-elsewhere effect.
In a continuous parameter space, the nearby models are not independent. In fact, we can consider the models which cannot be distinguished after seeing the posterior as one independent unit, which we call a state. This is analogous to the shift from classical statistical mechanics to quantum statistical mechanics, where the discrete states are counted in units of their uncertainty volume. Similar ideas were used in [13]. In this context, the generalized Jeffreys prior is non-informative in the sense that it assigns an equal probability to each state [13], i.e., to each effectively indistinguishable model.
5.1. Bayes Factor
Using the Jeffreys prior, the logarithm of the Bayes factor (3) becomes

$F = E - \ln N$,

which is reminiscent of the thermodynamic relation for the free energy. We identify E with the energy and $S = \ln N$ with the entropy. Note that N is independent of $\hat\theta$ and equals the total number of states that fit in the prior volume [31].
For a general prior, the thermodynamic relation has to be generalized by an additional potential energy term, which measures the extent to which the prior is informative, meaning that it favors some states over others.
Note that the entropy is always positive and therefore always favors $H_0$, because the posterior is narrower than the prior. The energy has to surpass the entropy for the alternative hypothesis to prevail. This is the Occam’s razor penalty, which is built into the Bayes factor.
Our definitions of energy and entropy should be viewed from the hypothesis testing point of view, where the energy E is the only “macroscopic” parameter that influences the outcome of the test. The other parameters are “microscopic” in the sense that we do not care about their values in the test. Entropy is the logarithm of the number of microstates, given the macrostate, as usual. To be precise, the entropy should only count the states which give rise to the same macrostate, i.e., the states with the same energy. Such a count corresponds to a Bayes factor which ignores the look-elsewhere effect associated with the amplitude parameter. This makes little sense from the Bayesian perspective, as the prior is determined only after seeing the data, such that the amplitude is fixed to its MAP value under the original prior. However, we have seen that it is exactly the reduced Bayes factor $\bar{B}$ from Equation (10) that appears in the frequentist analysis.
5.2. p-Value
We recognize that the integral $\int d\hat\theta / V(\hat\theta)$ in Equation (9) is the continuous version of the sum over states. The asymptotic p-value is therefore the sum over all states which generate false positives, each weighted with the Boltzmann factor $e^{-E}$:

$p(F^*) \approx \sum_{\text{states}:\; F > F^*} e^{-E}$.

However, this is exactly the partition function of the canonical ensemble at unit temperature, so $-\ln p$ is the frequentist definition of the free energy. The p-value approximation (11) implies that the Bayesian free energy F and the frequentist free energy $-\ln p$ are one and the same thing, up to logarithmic corrections in the energy. Note that in physics, the thermodynamic and statistical-mechanical free energies are also the same in the thermodynamic limit.
Note that in the frequentist analysis, not all the states contribute to the look-elsewhere effect equally but according to their Boltzmann factor. In other words, trials with parameters which are very unlikely under the null hypothesis do not increase the multiplicity. We show an illustration of this phenomenon in Figure 7.
6. Discussion
This paper compares frequentist and Bayesian significance testing between hypotheses of different dimensionality, where the null hypothesis is a well-defined, accepted model of the data, while the alternative hypothesis tries to replace it. Both the Bayesian and frequentist methodologies have advantages and disadvantages in the setting where the alternative hypothesis has not been observed with sufficient frequency to develop a reliable prior. We argue that for optimal significance testing, the Bayes factor between the two hypotheses should be used as the test statistic with the highest power. However, the Bayes factor should not be used to quantify the test significance when the prior of the alternative hypothesis is poorly known. Instead, the frequentist false-positive rate of the null hypothesis (Type I error or p-value) can be used, which only depends on the properties of the null hypothesis, which is assumed to be well understood. The sensitivity of the Bayes factor to the choice of prior is known in the context of Lindley’s paradox [32]. While there is no actual paradox, it highlights the dependence of the Bayes factor on the choice of prior, which is undesirable because we often do not know it. Our solution to this paradox is to relate the Bayes factor to the p-value, which is independent of the alternative hypothesis, as it only tests the distribution of the null hypothesis. In this way, one can use the p-value for hypothesis testing even when using Bayesian methods, by using the Bayes factor as a test statistic.
While it is common to use the MLR as a test statistic, this is not prior independent but corresponds to the generalized Jeffreys prior. If some prior information is available, as in our exoplanet example of the transit duration being determined by the period P via Kepler’s law, one should use it to reduce the Type II error. We note that the Jeffreys prior can be unreasonable, even in unknown prior situations: if, in a given experiment, the posterior volume varies strongly across the parameter space, the Jeffreys prior is very experiment-specific. A prior that is smooth across the parameter space is undoubtedly a better prior, even if we do not know what its specific form should be. Nevertheless, when the Jeffreys prior is reasonable, it simplifies the analytic calculation of the p-value.
We show that both the p-value and the Bayes factor can be expressed as an energy versus entropy competition. We define the energy as the maximum of the logarithm of the likelihood ratio and the Bayesian entropy as the logarithm of the number of posterior volumes that fit in the prior volume. The constant-energy Bayes factor is analogous to the thermodynamical free energy. Conversely, the p-value in the asymptotic regime corresponds to the canonical partition function, which is a Boltzmann-factor-weighted sum over the posterior states with the test statistic above the observed one. In the low p-value regime, only the states close to the observed test statistic contribute to the sum. Therefore, the constant-energy Bayes factor and the asymptotic p-value are related. This also happens in physics, where the statistical and thermodynamical definitions of the free energy coincide in the thermodynamic limit. As an example, we show that the p-value of the standard $\chi^2$ distribution with d degrees of freedom can be interpreted as an energy versus entropy competition, with the latter defined as the logarithm of the number of states on the constant-energy shell. The entropy grows as the log of the area of a sphere in d dimensions with a radius proportional to the square root of the energy.
The formalism developed here extends Wilks’ theorem [15] in several ways. First, the connection to the Bayes factor allows us to define the posterior volume beyond the Gaussian approximation inherent in the asymptotic limit assumed for Wilks’ theorem. Second, Wilks’ theorem assumes the parameter values are inside the boundaries; we show that counting the states as a function of energy provides a proper generalization that gives better results. Third, Wilks’ theorem does not account for the multiplicity of the look-elsewhere effect. Our method correctly handles these situations.
As an example application, we apply the formalism to the exoplanet transit search in stellar-variability-polluted data. We search for exoplanets by scanning over the period, phase, and transit duration, and we quantify the resulting multiplicity from the look-elsewhere effect. We find that the Laplace approximation for the uncertainty volume is inaccurate, while the numerical integration of the Bayesian evidence gives very accurate results when compared to simulations. We emphasize the role of informative priors, such as the planet transit time prior, which reduce the Type II error, leading to a higher fraction of true planets discovered at a fixed Type I error (p-value) threshold. The method enables a fast evaluation of the false-positive rate for every exoplanet candidate without running expensive simulations.
There are other practical implications that follow from our analysis. For example, the multiplicity depends not only on the prior range but also on the posterior error on the scanning parameters. If this error is small in one part of the parameter space but large in others, then the MLR test statistic leads to a large multiplicity that will increase the p-value for all events. We show that analytic predictions reproduce the distribution of false positives as a function of the period and transit duration. An informed choice of the prior guided by what we know about the problem and what our goals are may change this balance and reduce the multiplicity penalty, thus reducing the Type II error: the Bayes factor can be a better test statistic for the Type II error than the MLR. One is of course not allowed to pick and choose the prior a posteriori: we must choose it prior to the data analysis.
In many situations, it is possible to analytically obtain the false-positive rate as a function of the Bayes factor test statistic from Equation (11), which gives a p-value estimate that is more reliable for hypothesis testing than the corresponding Bayes factor in situations of a new discovery where the prior is not yet known. As a general recommendation, we thus advocate that Bayesian analyses report the frequentist p-value using the Bayes factor test statistic against the null hypothesis as a way to quantify the significance of a new discovery, and that frequentist analyses use the Bayes factor as the optimal test statistic for hypothesis testing while using frequentist methods to quantify its significance.