The Fermi–Dirac distribution provides a calibrated probabilistic output for binary classifiers

Significance. While it would be desirable for the output of a binary classification algorithm to be the probability that the classification is correct, most algorithms do not provide a method to calculate such a probability. We propose a probabilistic output for binary classifiers based on an unexpected mapping of the probability of correct classification to the occupation probability of a fermion in a quantum system, which is given by the Fermi–Dirac distribution. This mapping allows us to compute the optimal threshold to separate the predicted classes and to calculate the statistical parameters needed to estimate confidence intervals of performance metrics. Using this mapping we also propose an ensemble learning algorithm. In short, the Fermi–Dirac distribution provides a calibrated probabilistic output for binary classification.

A. Notation. With the above problem setup, we let Prob(R = r|Y = y) denote the probability, over all realizations from Ω(N, N1), that a sample of class y ∈ {0, 1} has been assigned rank r ∈ {1, · · · , N } by classifier g. Analogously, let Prob(Y = y|R = r) denote the probability, over all realizations from Ω(N, N1), that the true label of an item placed at rank r by classifier g is y.
We will call P(r|y) the class-conditioned rank probability and P(y|r) the rank-conditioned class probability. The conditional expected value of R given the class in the (N, N1)-ensemble will be denoted by ⟨r|y⟩ and is computed from

⟨r|y⟩ = Σ_{r=1}^{N} r P(r|y).

B. Calculation of the rank-conditioned class and class-conditioned rank probabilities. Theoretically, at least, it is possible to design a direct way to calculate P(y|r) for a classifier. To do so, we create a finite (N, N1)-ensemble with L ≫ 1 instances {T_i}_{i=1}^{L}. We apply the classifier to each instance, and for instance T_i we create a ranked list r_{ik} ∈ {1, 2, ..., N} as discussed in 1B. The rank-conditioned positive-class probability P(1|r) in this case is

P(1|r) = (1/L) Σ_{i=1}^{L} y_{i,k(r)},   [1]

where y_{i,k(r)} is the label of the item k(r) that was placed at rank r in test-set instance T_i. In other words, P(1|r) is the frequency with which classifier g places positive-class items at rank r. Eq. (1) is conceptual, as in most cases we will not have the labels y_{ik} of the test sets. However, when we do have the labels, as in the simulations shown in Fig. 1, P(1|r) is computed as prescribed by Eq. (1).

We can calculate the class-conditioned rank probability P(r|y) from P(y|r) using Bayes' theorem, P(r|y) P(Y = y) = P(y|r) P(R = r), where P(R = r) = 1/N, P(Y = 1) = ρ and P(Y = 0) = 1 − ρ. Therefore, P(r|1) = P(1|r)/N1 and P(r|0) = P(0|r)/(N − N1).
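As an illustration of Eq. (1) and of the Bayes' theorem relation above, the following Python sketch estimates P(1|r) by brute force from a synthetic (N, N1)-ensemble. The binormal scoring model, the helper name and all parameter values are illustrative assumptions, not part of the FiDEL code.

```python
import numpy as np

rng = np.random.default_rng(0)

def estimate_rank_conditioned_probability(N=50, N1=20, L=20000, separation=1.0):
    """Estimate P(1|r) as in Eq. (1): the frequency with which a synthetic scoring
    classifier places a positive item at rank r, over L test sets drawn from the
    (N, N1)-ensemble. Also return P(r|1) and P(r|0) obtained via Bayes' theorem."""
    y = np.array([1] * N1 + [0] * (N - N1))
    counts = np.zeros(N)                      # counts[r-1]: times rank r holds a positive item
    for _ in range(L):
        scores = rng.normal(loc=separation * y, scale=1.0)   # positives score higher on average
        order = np.argsort(-scores)           # rank 1 = highest score -> positives get low ranks
        counts += y[order]                    # label of the item occupying each rank
    P1_given_r = counts / L                   # Eq. (1)
    P_r_given_1 = P1_given_r / N1             # Bayes: P(r|1) = P(1|r)/N1
    P_r_given_0 = (1.0 - P1_given_r) / (N - N1)
    return P1_given_r, P_r_given_1, P_r_given_0

P1r, Pr1, Pr0 = estimate_rank_conditioned_probability()
print(P1r[:5].round(3), round(Pr1.sum(), 3), round(Pr0.sum(), 3))   # Pr1 and Pr0 each sum to 1
```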

Performance metrics and the rank-conditioned class probability
We showed in the main text that the average values, over the (N, N1)-ensemble, of the traditional binary classification performance metrics, namely the true positive rate (TPR), false positive rate (FPR), precision (Prec) and balanced accuracy (bac), can be expressed in terms of the rank-conditioned class probability:

TPR(k) = (1/N1) Σ_{r=1}^{k} P(1|r),   [2]
FPR(k) = (1/N0) Σ_{r=1}^{k} P(0|r),   [3]
Prec(k) = (1/k) Σ_{r=1}^{k} P(1|r),   [4]
bac(k) = (1/2) (TPR(k) + 1 − FPR(k)),   [5]

where k is the threshold rank and N0 = N − N1. In this section we show that the area under the ROC curve (AUC) can be expressed in terms of the class-conditioned averages of the ranks. These relationships are known in statistics through the Wilcoxon–Mann–Whitney U test for a sample. Here we re-derive them using rank-conditioned class probabilities as averages in the (N, N1)-ensemble. We also show that the ⟨AUC⟩ of a classifier can be expressed in terms of an average over all ranks of the average balanced accuracy in the (N, N1)-ensemble. Finally, we characterize the average area under the precision-recall curve, ⟨AUPRC⟩, in terms of an average over all possible thresholds involving the second moment of the precision.
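The sketch below evaluates Eqs. [2]–[5] for a given P(1|r). The Fermi–Dirac-shaped toy P(1|r), the function name and the parameter values are illustrative assumptions; the last lines anticipate the trapezoidal ROC integration used in the next subsection.

```python
import numpy as np

def metrics_from_rank_probability(P1r, N1):
    """Threshold-by-threshold ensemble-average metrics (Eqs. [2]-[5]) from the
    rank-conditioned positive-class probability P(1|r), r = 1..N."""
    N = len(P1r)
    N0 = N - N1
    k = np.arange(1, N + 1)
    cum_pos = np.cumsum(P1r)            # sum_{r <= k} P(1|r)
    cum_neg = np.cumsum(1.0 - P1r)      # sum_{r <= k} P(0|r)
    tpr = cum_pos / N1                  # Eq. [2]
    fpr = cum_neg / N0                  # Eq. [3]
    prec = cum_pos / k                  # Eq. [4]
    bac = 0.5 * (tpr + 1.0 - fpr)       # Eq. [5]
    return tpr, fpr, prec, bac

# toy Fermi-Dirac-shaped P(1|r) on N = 50 ranks; its sum plays the role of N1
N = 50
r = np.arange(1, N + 1)
P1r = 1.0 / (1.0 + np.exp(0.25 * (r - 20)))
tpr, fpr, prec, bac = metrics_from_rank_probability(P1r, N1=P1r.sum())

# trapezoidal area under the average ROC (cf. the next subsection)
fpr_full, tpr_full = np.concatenate(([0.0], fpr)), np.concatenate(([0.0], tpr))
auc = np.sum(0.5 * (tpr_full[1:] + tpr_full[:-1]) * np.diff(fpr_full))
print(round(auc, 3))
```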

A. Expressing the ⟨AUC⟩ in terms of class-conditioned averages. To compute the ⟨AUC⟩ we will use the trapezoidal rule to integrate the ROC curve. Using k as a parameter, the ROC curve can be expressed as the set of points (FPR(k), TPR(k)) with k = 0, ..., N and TPR(0) = FPR(0) = 0:

⟨AUC⟩ = Σ_{k=0}^{N−1} (1/2) [TPR(k+1) + TPR(k)] ΔFPR(k),   [6]

where we used the forward difference operator defined as Δf(k) = f(k+1) − f(k). Before stating the main result of this subsection (Theorem 1), it will be useful to prove a few auxiliary results.

Lemma 1. The sums of TPR(k) and FPR(k) over all thresholds k are related to the class-conditioned expected ranks by

Σ_{k=0}^{N} TPR(k) = N + 1 − ⟨r|1⟩,   [7]
Σ_{k=0}^{N} FPR(k) = N + 1 − ⟨r|0⟩.   [8]

Proof. Rewriting the sum in Eq. (7) using Eq. (2), exchanging the order of summation, and recalling that Σ_{r=1}^{N} P(1|r) = N1 and Σ_{r=1}^{N} r P(1|r) = N1 ⟨r|1⟩, we get

Σ_{k=0}^{N} TPR(k) = (1/N1) Σ_{k=1}^{N} Σ_{r=1}^{k} P(1|r) = (1/N1) Σ_{r=1}^{N} (N + 1 − r) P(1|r) = N + 1 − ⟨r|1⟩,

which proves Eq. (7). We can prove Eq. (8) following a similar derivation. However, let us prove it by way of a symmetry argument. Let us swap the names of the classes such that what we called the positive class before is now our negative class, and vice versa. The items 1 ≤ r ≤ k that were positives are now negatives, and therefore TPR(k) becomes FPR(k). Therefore, by changing TPR for FPR and 1 for 0 in Eq. (7) we get Eq. (8).

We can now state the following result.
Theorem 1. The ⟨AUC⟩ can be expressed in terms of the class-conditioned expected ranks as

⟨AUC⟩ = 1/2 + (⟨r|0⟩ − ⟨r|1⟩)/N,

and the average sum of the ranks of the positive class, N1 ⟨r|1⟩, can be expressed in terms of the ⟨AUC⟩ as

N1 ⟨r|1⟩ = N1 (N1 + 1)/2 + N1 N0 (1 − ⟨AUC⟩).   [13]

Proof. This derivation extends to averages in the (N, N1)-ensemble a similar result proved in (1) for one realization of the test set. We rewrite Eq. (6), using from Eq. (2) and Eq. (3) that ΔTPR(k) = P(1|k+1)/N1 and ΔFPR(k) = P(0|k+1)/N0 = (1 − P(1|k+1))/N0, as follows:

⟨AUC⟩ = (1/(2 N0)) Σ_{k=0}^{N−1} [TPR(k+1) + TPR(k)] [1 − P(1|k+1)].

After expanding the product inside the sum, the last equation can be parsed into two halves. Operating over the first half we get

(1/(2 N0)) Σ_{k=0}^{N−1} [TPR(k+1) + TPR(k)] = (1/(2 N0)) [2 Σ_{k=0}^{N} TPR(k) − TPR(0) − TPR(N)] = (1/N0) Σ_{k=0}^{N} TPR(k) − 1/(2 N0),

where we used that TPR(0) = 0 and TPR(N) = 1. Operating over the second half we get

−(1/(2 N0)) Σ_{k=0}^{N−1} [TPR(k+1) + TPR(k)] N1 ΔTPR(k) = −(N1/(2 N0)) Σ_{k=0}^{N−1} [TPR(k+1)² − TPR(k)²] = −N1/(2 N0),

where we used that the sum over the difference of squares in the first line of the last equation is a telescopic sum, and that TPR(0) = 0 and TPR(N) = 1. Adding the two halves results in the following expression for the ⟨AUC⟩:

⟨AUC⟩ = (1/N0) Σ_{k=0}^{N} TPR(k) − (N1 + 1)/(2 N0).

Using Lemma 1 we obtain

⟨AUC⟩ = [N + 1 − ⟨r|1⟩ − (N1 + 1)/2] / N0,   [14]

which, after rearranging, is Eq. (13). Similarly, we can re-write ⟨AUC⟩ in terms of ⟨r|0⟩ by a symmetry argument. If the items that were called positives are now called negatives (and vice versa), then 1 changes to 0, 0 changes to 1, the ⟨AUC⟩ changes to 1 − ⟨AUC⟩, N1 changes to N0 and N0 changes to N1. Doing those changes in Eq. (14) and after some algebra we get

⟨AUC⟩ = [⟨r|0⟩ − (N0 + 1)/2] / N1.   [15]

Multiplying Eq. (14) times N0 and Eq. (15) times N1 and adding the resulting equations we get

N ⟨AUC⟩ = N/2 + ⟨r|0⟩ − ⟨r|1⟩,

from where the first equation of the theorem follows, which proves the theorem.

The ⟨AUC⟩ can also be written in terms of the average balanced accuracy over all thresholds. Subtracting ⟨r|1⟩ from ⟨r|0⟩ using Lemma 1 and Eq. (5) we get

⟨r|0⟩ − ⟨r|1⟩ = Σ_{k=0}^{N} [TPR(k) − FPR(k)] = Σ_{k=0}^{N} (2 bac(k) − 1).

Using Theorem 1 we can express ⟨r|0⟩ − ⟨r|1⟩ in terms of ⟨AUC⟩ in the previous equation, from where

⟨AUC⟩ = 1/2 + (1/N) Σ_{k=0}^{N} (2 bac(k) − 1),

which proves the relation between the ⟨AUC⟩ and the average balanced accuracy announced at the beginning of this section.
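A quick numerical check of the rank–AUC relation in Theorem 1: for simulated test sets, the pairwise (Mann–Whitney) AUC and 1/2 + (⟨r|0⟩ − ⟨r|1⟩)/N agree. The binormal score model and all parameter values are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
N, N1, L = 40, 15, 5000
y = np.array([1] * N1 + [0] * (N - N1))

auc_pairwise, rank_pos, rank_neg = [], [], []
for _ in range(L):
    scores = rng.normal(loc=1.2 * y, scale=1.0)
    ranks = np.empty(N, dtype=int)
    ranks[np.argsort(-scores)] = np.arange(1, N + 1)     # rank 1 = highest score
    rank_pos.append(ranks[y == 1].mean())
    rank_neg.append(ranks[y == 0].mean())
    # Mann-Whitney form: fraction of (positive, negative) pairs won by the positive item
    auc_pairwise.append(np.mean(scores[y == 1][:, None] > scores[y == 0][None, :]))

auc_mw = np.mean(auc_pairwise)
auc_from_ranks = 0.5 + (np.mean(rank_neg) - np.mean(rank_pos)) / N   # Theorem 1
print(round(auc_mw, 4), round(auc_from_ranks, 4))                    # the two estimates coincide
```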

Next, we use the rank-conditioned class probability to characterize the average area under the precision-recall curve (AUPRC) in terms of the average squared precision over all thresholds. The precision at threshold rank k is defined in Eq. (4). We will integrate the precision vs. recall curve (PRC) using the trapezoidal rule:

⟨AUPRC⟩ = Σ_{k=0}^{N−1} (1/2) [Prec(k+1) + Prec(k)] ΔTPR(k),

where by definition Prec(0) = 1 and the recall at threshold k is TPR(k). We can now prove the following:
where we used the notation ⟨g⟩ to denote the average of g(k) over all threshold ranks k. This proves the first part of the theorem.
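The trapezoidal PRC integration above is easy to evaluate for any given P(1|r); the sketch below does so for a Fermi–Dirac-shaped toy P(1|r). The helper name and the parameter values are illustrative assumptions.

```python
import numpy as np

def auprc_trapezoid(P1r):
    """Trapezoidal integration of the precision-recall curve built from P(1|r):
    precision from Eq. [4], recall = TPR from Eq. [2], with Prec(0) = 1 and TPR(0) = 0."""
    N1 = P1r.sum()
    k = np.arange(1, len(P1r) + 1)
    cum_pos = np.cumsum(P1r)
    prec = np.concatenate(([1.0], cum_pos / k))
    recall = np.concatenate(([0.0], cum_pos / N1))
    return np.sum(0.5 * (prec[1:] + prec[:-1]) * np.diff(recall))

r = np.arange(1, 201)
P1r = 1.0 / (1.0 + np.exp(0.08 * (r - 60)))     # toy Fermi-Dirac-shaped P(1|r)
print(round(auprc_trapezoid(P1r), 3))
```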

When N ≫ 1, we can approximate the finite sums by integrals; doing so, and again using ⟨g⟩ for the average of g over all thresholds (now an integral over the rescaled threshold variable), proves the second part of the theorem.

We made the point in the main text that the mapping of fermionic systems to binary classification, and the fact that the FD distribution is the maximum entropy distribution in the physical sense for the fermionic system, justified the choice of the FD distribution as the least biased distribution of the rank-conditioned class probability given the constraints. That the maximum entropy from a physics point of view corresponds to a distribution that has the maximum uncertainty given the information at hand was argued in the seminal 1957 paper by Jaynes (2). Jaynes argued that statistical mechanics can be viewed as a form of statistical inference rather than as a physical theory, and that all of statistical mechanics can be derived by maximizing the Shannon entropy. Therefore, it should be possible to derive that the rank-conditioned class distribution introduced here coincides with the FD distribution when we maximize the Shannon entropy with the appropriate constraints. This would justify using the FD distribution in binary classification, not as the true rank-conditioned class probability, but as the distribution that is maximally non-committal with regard to missing information. In this section we derive such a distribution.

Consider a realization of the (N, N1)-ensemble in which the item at rank 1 is positive (y1 = 1), the item at rank 2 is negative (y2 = 0), the item at rank 3 is positive (y3 = 1), and so on, with the item at rank N being negative (yN = 0). For this realization, the class vector will be (1, 0, 1, ..., 0). We will call Q(y1, y2, ..., yN) the probability of class vector (y1, y2, ..., yN). We will assume that the probability that there is an item of class yr at rank r is independent of the probability that there is an item of class yk at rank k. Therefore the probability of the class vector factorizes as

Q(y1, y2, ..., yN) = Π_{r=1}^{N} P(yr|r),

where P(yr|r) is the rank-conditioned class probability at rank r. The entropy of the probability of the class vector is

S = − Σ_{y1,...,yN} Q(y1, ..., yN) ln Q(y1, ..., yN),

and using its factorization, we obtain

S = − Σ_{r=1}^{N} Σ_{y=0,1} P(y|r) ln P(y|r).

If we did not know anything else about the problem (except for the obvious fact that P(0|r) + P(1|r) = 1 for each r), maximizing the above entropy would yield P(0|r) = P(1|r) = 1/2, in agreement with Laplace's intuition in his "Principle of Insufficient Reason", which states that two events are to be assigned equal probabilities if there is no reason to think otherwise. But in our case there is reason to think otherwise. Besides the normalization constraint P(0|r) + P(1|r) = 1 for each r, we know that the average number of class-1 items in the (N, N1)-ensemble is N1. This is a property of the prevalence ρ of class-1 items in the dataset, as N1 = Nρ. There is another constraint we can include, which is a property of the classifier's performance and is related to the average rank of the class-1 items. These two conditions (one expressing a property of the dataset and the other a property of the classifier), together with the normalization conditions, are our constraints on P(y|r). They can be written as follows:

Σ_{y=0,1} P(y|r) = 1 for r = 1, ..., N,   [21]
Σ_{r=1}^{N} P(1|r) = N1,   [22]
Σ_{r=1}^{N} r P(1|r) = R.   [23]

R is the average sum of the ranks r in the (N, N1)-ensemble for which yr = 1 and is equal to N1 ⟨r|1⟩. We already showed (Eq. (13) in Theorem 1) that R can be written in terms of the ⟨AUC⟩.
These constraints need to be added, with their corresponding Lagrange multipliers, to the expression to be maximized:

C(Q) = − Σ_{r=1}^{N} Σ_{y=0,1} P(y|r) ln P(y|r) − Σ_{r=1}^{N} αr ( Σ_{y} P(y|r) − 1 ) − λ ( Σ_{r} Σ_{y} y P(y|r) − N1 ) − β ( Σ_{r} Σ_{y} r y P(y|r) − R ).

To find the set of P(y|r)'s that maximizes the entropy with the given constraints, we take the variations of C(Q) by perturbing each P(y|r) with an arbitrary variation δPr := δP(y|r), which yields

δPr [ln P(y|r) + 1 + αr + λ y + β y r] = 0.
As the variations δPr are arbitrary, the terms in brackets have to be 0, that is,

P(y|r) = e^{−1 − αr − λ y − β y r}.

The constants αr are determined from the normalization constraints in Eq. (21), from where

P(1|r) = 1 / (1 + e^{β(r−µ)}),

where µ = −λ/β. The constants β and µ are determined using Eq. (22) and Eq. (23), respectively. It needs to be re-stated that we do not claim that the FD distribution is the true rank-conditioned class distribution. However, as is the case with distributions resulting from a maximum entropy argument, it is the most non-committal, and therefore least biased, distribution that we can use when only the given constraints are known.
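The maximum entropy argument can also be checked numerically: maximizing −Σ_r Σ_y P(y|r) ln P(y|r) under the constraints of Eqs. (22)–(23) with a generic constrained optimizer returns a P(1|r) whose logit is linear in r, i.e., the Fermi–Dirac form. The solver choice, the small problem size and the value of R are illustrative assumptions; Eq. (21) is enforced by construction.

```python
import numpy as np
from scipy.optimize import minimize

N, N1, R = 20, 8, 60.0          # R = N1 * <r|1>; R = 60 means <r|1> = 7.5 (better than random)
r = np.arange(1, N + 1)

def neg_entropy(p):
    # p[r-1] = P(1|r); Eq. (21) holds automatically because P(0|r) = 1 - p[r-1]
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return np.sum(p * np.log(p) + (1 - p) * np.log(1 - p))

constraints = (
    {"type": "eq", "fun": lambda p: p.sum() - N1},         # Eq. (22)
    {"type": "eq", "fun": lambda p: (r * p).sum() - R},    # Eq. (23)
)
res = minimize(neg_entropy, x0=np.full(N, N1 / N), method="SLSQP",
               bounds=[(1e-9, 1 - 1e-9)] * N, constraints=constraints)
p_maxent = res.x

# the logit of a Fermi-Dirac distribution is linear in r: ln(p/(1-p)) = -beta*(r - mu)
slope, intercept = np.polyfit(r, np.log(p_maxent / (1 - p_maxent)), 1)
beta = -slope
mu = intercept / beta
print(round(beta, 3), round(mu, 2))   # recovered Fermi-Dirac parameters
```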

Calculation of β and µ
As discussed in the main text, the parameters β and µ are calculated from the two constraints on the rank-conditioned class probability:

Σ_{r=1}^{N} P(1|r) = N1,
Σ_{r=1}^{N} r P(1|r) = N1 ⟨r|1⟩.

We now re-scale r by N to generate the new variable ξ = r/N, and call β' = βN and µ' = µ/N. Dividing the above equations by N and N², respectively, calling dξ = 1/N, and approximating the sums by integrals (which is justified for sufficiently large N ≫ 1), we can rewrite Eqs. (48) and (49) as

∫_0^1 dξ / (1 + e^{β'(ξ−µ')}) = ρ,   [28]
∫_0^1 dξ ξ / (1 + e^{β'(ξ−µ')}) = ρ ⟨r|1⟩ / N.   [29]

Integrating Eq. (28) we get

ρ = 1 − (1/β') ln [ (1 + e^{β'(1−µ')}) / (1 + e^{−β'µ'}) ],

from where µ' can be expressed as a function of ρ and β':

µ' = (1/β') ln [ (e^{ρβ'} − 1) / (1 − e^{−(1−ρ)β'}) ].   [30]

In like manner, Eq. (29) can be integrated in closed form; the result (Eq. (32)) involves the dilogarithm function Li2(z). A general analytical expression for β' and µ' in terms of ρ and ⟨AUC⟩ does not exist. Therefore, given ρ and ⟨AUC⟩, we numerically solve for β' using Eq. (32) after expressing µ' in terms of β' and ρ using Eq. (30). After solving for β', we use Eq. (30) to find µ'. The sought parameters β and µ can then be obtained from β' and µ' with the knowledge of N: β = β'/N and µ = µ'N.
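The numerical solution described above can be sketched as follows: µ' is written as a function of β' and ρ through Eq. (30), and β' is then found by matching the second constraint, whose right-hand side is obtained from ⟨r|1⟩ expressed through ⟨AUC⟩ (Theorem 1) after dropping O(1/N) terms. The quadrature, the root-bracketing interval and the function names are illustrative assumptions; this is a generic re-implementation, not the authors' code.

```python
import numpy as np
from scipy.integrate import quad
from scipy.optimize import brentq

def solve_fd_parameters(rho, auc):
    """Solve the rescaled constraints (Eqs. 28-29) for beta' and mu' given the
    prevalence rho and the target <AUC>."""
    def mu_prime(beta_p):
        # Eq. (30): exp(beta'*mu') = (exp(rho*beta') - 1) / (1 - exp(-(1-rho)*beta'))
        return np.log(np.expm1(rho * beta_p) / -np.expm1(-(1 - rho) * beta_p)) / beta_p

    def gap(beta_p):
        mp = mu_prime(beta_p)
        # numerically stable Fermi-Dirac factor: 1/(1+e^z) = (1 - tanh(z/2))/2
        integral, _ = quad(lambda x: x * 0.5 * (1.0 - np.tanh(0.5 * beta_p * (x - mp))), 0.0, 1.0)
        target = rho * (rho / 2.0 + (1.0 - rho) * (1.0 - auc))   # rho*<r|1>/N via Theorem 1
        return integral - target

    beta_p = brentq(gap, 1e-6, 200.0)
    return beta_p, mu_prime(beta_p)

beta_p, mu_p = solve_fd_parameters(rho=0.3, auc=0.8)
print(round(beta_p, 2), round(mu_p, 3))
# for a test set of N items: beta = beta_p / N and mu = mu_p * N
```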
In some limiting cases, such as β' ≪ 1, β' ≫ 1 and ρ = 1/2 (see next subsections), it is possible to express β and µ explicitly in terms of ρ and ⟨AUC⟩.

A. Case β' ≪ 1. The case β' ≪ 1 corresponds to weak classifiers (high temperature), for which the rank-conditioned class probability has a very weak dependence on the rank r or, in the re-scaled version, on ξ. In this case we can expand

1 / (1 + e^{β'(ξ−µ')}) ≈ σ0 − σ0 (1 − σ0) β' ξ, with σ0 = 1 / (1 + e^{−β'µ'}).   [33]

Replacing this expansion in Eq. (28) we get

σ0 − σ0 (1 − σ0) β'/2 = ρ.   [34]

Replacing the expansion in Eq. (29) we get

σ0/2 − σ0 (1 − σ0) β'/3 = ρ [ρ/2 + (1 − ρ)(1 − ⟨AUC⟩)].   [35]

Combining Eqs. (34) and (35) we get the desired results:

β' = 12 (⟨AUC⟩ − 1/2),   [36]
β'µ' = 6 (⟨AUC⟩ − 1/2) − ln((1 − ρ)/ρ).   [37]

It is not difficult to verify that Eq. (37) results from Eq. (30) and Eq. (36) in the limit β' ≪ 1. Note that when ⟨AUC⟩ → 1/2 (limit of random classifiers), the parameter β → 0 (limit of infinite temperature). In this same limit, µ is undefined. However, the parameter βµ = 6(⟨AUC⟩ − 1/2) − ln((1 − ρ)/ρ) is well defined and tends to −ln((1 − ρ)/ρ) = ln(ρ/(1 − ρ)) as ⟨AUC⟩ → 1/2.

B. Case β' ≫ 1. As β increases (or, equivalently, as the temperature decreases), the ⟨AUC⟩ approaches 1. In this limit, the rank-conditioned class probability is a step function, taking the value 1 when r ≤ µ and 0 when µ < r ≤ N. It is clear from Eq. (30) that in the limit β' ≫ 1 it is µ' = ρ. Therefore, in this limit, µ = N1 and the rank-conditioned class probability close to the perfect classifier is

P(1|r) = 1 / (1 + e^{β(r−N1)}).   [38]

The slope of this distribution at r = N1 is

dP(1|r)/dr |_{r=N1} = −β/4.

Given that in this limit P(1|r) is a steep logistic function, we will approximate it with a linear function, where the slope of the line interpolating between 1 and 0 is determined by the slope of the actual distribution at r = N1:

P(1|r) ≈ 1 for r ≤ r−,   1/2 − (β/4)(r − N1) for r− < r < r+,   0 for r ≥ r+,   [40]

where r− = N1 − 2/β and r+ = N1 + 2/β. Now we apply the constraint of Eq. (49), Σ_r r P(1|r) = N1 ⟨r|1⟩, where we approximate the sum with an integral and use Eq. (40) for P(1|r). Solving for β from the resulting equation we find the value of β close to the perfect classifier.

In the main text we recovered a result previously advanced in (3) stating that the AUC computed from the scores sj output by a classifier on a finite test set, using a rectangular integration rule, can be written as

AUC = (1/(N1 N0)) Σ_{j: yj=1} Σ_{i: yi=0} H(sj − si),   [45]

where H is the Heaviside step function, which is 1 if its argument is positive, 0 if it is negative, and 1/2 if it is 0, and N1 and N0 are the numbers of positive and negative items in the test set. The double sum in Eq. (45) is known as the Mann–Whitney U statistic. As already discussed in the main text, if we take the average over the (N, N1)-ensemble of test sets on both sides of Eq. (45), we get

⟨AUC⟩ = Prob(sj > si | yj = 1; yi = 0).   [46]

Next we show that Eq. (46) can be directly derived from the expressions of TPR(k) and FPR(k) in terms of the rank-conditioned class probability, Eq. (2) and Eq. (3). Writing the ⟨AUC⟩ as the area of the average ROC in TPR vs. FPR space using a rectangular integration rule, and then using Eqs. (2) and (3), we find that

⟨AUC⟩ = Σ_{r=1}^{N} P(r|1) Σ_{i=r+1}^{N} P(i|0).   [47]

Eq. (47) is equal to the probability that the rank of a negative item is larger than the rank of a positive item. This is so because, if a positive item is at rank r, the probability that a negative item has a rank higher than r is Σ_{i=r+1}^{N} P(i|0). As the positive item can be at any rank r, to compute Prob(rj < ri | yj = 1; yi = 0) we need to add the previous sum over all the possible ranks r where the positive item could be, weighted by the probability P(r|1) that there is a positive item at rank r.
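The equivalence between Eq. (47) and the pairwise (Mann–Whitney) form of Eq. (45) can be checked numerically: both are estimated below from the same simulated (N, N1)-ensemble and agree up to Monte Carlo noise and the O(1/N) effects discussed in the next section. The score model and parameter values are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
N, N1, L = 30, 10, 4000
N0 = N - N1
y = np.array([1] * N1 + [0] * N0)

counts, auc_mw = np.zeros(N), []
for _ in range(L):
    scores = rng.normal(loc=1.0 * y, scale=1.0)
    counts += y[np.argsort(-scores)]                 # rank 1 = highest score
    auc_mw.append(np.mean(scores[y == 1][:, None] > scores[y == 0][None, :]))   # Eq. (45)

P1r = counts / L
Pr1, Pi0 = P1r / N1, (1.0 - P1r) / N0                # P(r|1) and P(i|0)
tail_neg = np.concatenate((np.cumsum(Pi0[::-1])[::-1][1:], [0.0]))   # sum_{i > r} P(i|0)
auc_eq47 = np.sum(Pr1 * tail_neg)                    # Eq. (47)
print(round(np.mean(auc_mw), 4), round(auc_eq47, 4))
```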

As argued throughout this paper, the maximum entropy distribution for the rank-conditioned class probability P(1|r) is given by the FD statistics, P(y = 1|r) = 1/(1 + e^{β(r−µ)}), where the parameters β and µ are determined by the constraints given in Eqs. (48) and (49). Therefore it may seem that Eq. (47), which also links P(1|r) and ⟨AUC⟩, would over-determine β and µ. We next show that this is not the case. Indeed, Eq. (47) is valid for any rank-conditioned class probability function P(y|r) that verifies the following constraints:

Σ_{r=1}^{N} P(1|r) = N1,   [48]
N1 ⟨r|1⟩ = Σ_{r=1}^{N} r P(1|r).   [49]

For simplicity we will assume that N ≫ 1, and rewrite the previous sums as integrals over the rescaled variable ξ = r/N:

∫_0^1 P(1|ξ) dξ = ρ,   [50]
∫_0^1 ξ P(1|ξ) dξ = ρ ⟨r|1⟩/N.   [51]

The replacement of sums by integrals implies that the results should be correct to within an error of the order of 1/N. Let us now compute Prob(sj > si | yj = 1; yi = 0) in terms of the probabilities of positive class given the rank. The continuum version of Eq. (47) reads

Prob(sj > si | yj = 1; yi = 0) = (1/(ρ(1 − ρ))) ∫_0^1 dξ P(1|ξ) ∫_ξ^1 dξ' (1 − P(1|ξ'))   [52]
  = (1/(ρ(1 − ρ))) [ ∫_0^1 dξ (1 − ξ) P(1|ξ) − ∫_0^1 dξ P(1|ξ) ∫_ξ^1 dξ' P(1|ξ') ].   [53]

The first integral in Eq. (53) can be easily found from Eq. (50) and Eq. (51): it equals ρ − ρ⟨r|1⟩/N. The second, double integral in Eq. (53) can be solved by multiplying Eq. (50) by itself and doing some algebra: it equals ρ²/2, where we used that, because of symmetry, the double integral over the region ξ' > ξ is one half of the double integral over the full unit square. Putting these results together,

Prob(sj > si | yj = 1; yi = 0) = (1/(1 − ρ)) [1 − ⟨r|1⟩/N − ρ/2],

which, by Eq. (14), coincides with ⟨AUC⟩ up to terms of order 1/N. Therefore Eq. (47) imposes no additional condition beyond the constraints (48) and (49).

As discussed in (4), the variance of the AUC of a classifier can be computed using Eq. (45). To do so we recall that for any random variable X, σ²_X = ⟨X²⟩ − ⟨X⟩², which we apply to the AUC. We first compute ⟨AUC²⟩ by squaring Eq. (45), from where we get

AUC² = (1/(N1² N0²)) Σ_{i,j=1}^{N1} Σ_{k,m=1}^{N0} H(sP,i − sN,k) H(sP,j − sN,m),

where sP,i and sN,k are the scores assigned by the classifier to the i-th positive and the k-th negative example, respectively, and H(s) is the Heaviside function, which takes the value 1 for s > 0 and 0 for s < 0. For the sake of simplicity we will assume that there are no ties in the scores, which is a reasonable assumption if the scores lie on a continuous scale. Therefore H(sP,i − sN,k)² = H(sP,i − sN,k). Taking the expected value of the above equation over the (N, N1)-ensemble of test sets we get

⟨AUC²⟩ = (1/(N1² N0²)) Σ_{i,j=1}^{N1} Σ_{k,m=1}^{N0} ⟨H(sP,i − sN,k) H(sP,j − sN,m)⟩.   [62]

Note that ⟨H(sP,i − sN,k)⟩ is the probability that a positive item has a score larger than that of a negative item, which was earlier shown to be equal to ⟨AUC⟩ (see Eq. (46)). We will call P100 the expected value of H(sP,i − sN,k) H(sP,i − sN,m), which is 1 when the two randomly sampled negative items sN,k and sN,m have scores that are lower than that of a randomly sampled positive item sP,i, and 0 otherwise. Therefore P100 is the probability that the classifier assigns two randomly sampled negative items scores that are smaller than the score assigned to a randomly sampled positive item. Similarly, P110 is the expected value of H(sP,i − sN,k) H(sP,j − sN,k), which is 1 when the two randomly sampled positive items sP,i and sP,j have scores that are larger than that of a randomly sampled negative item sN,k, and 0 otherwise. Therefore P110 is the probability that the classifier assigns two randomly sampled positive items scores that are larger than the score assigned to a randomly sampled negative item. Assembling these results, σ²_AUC can be written as

σ²_AUC = [⟨AUC⟩(1 − ⟨AUC⟩) + (N1 − 1)(P110 − ⟨AUC⟩²) + (N0 − 1)(P100 − ⟨AUC⟩²)] / (N1 N0).   [63]

Eq. (63) is the expression for the variance of the AUC that we used in the main text.
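A sketch of Eq. (63) in use: ⟨AUC⟩, P110 and P100 are estimated from simulated test sets exactly as described above (pairwise Heaviside comparisons, excluding repeated indices), and the resulting variance is compared with the empirical variance of the per-test-set AUC. The score model, the sample sizes and the assumption of independently drawn scores are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)
N1, N0, L = 15, 25, 6000

auc_s, p110_s, p100_s = [], [], []
for _ in range(L):
    sp = rng.normal(1.0, 1.0, N1)                    # positive-class scores
    sn = rng.normal(0.0, 1.0, N0)                    # negative-class scores
    wins = (sp[:, None] > sn[None, :]).astype(float) # H(sP,i - sN,k), no ties
    auc_s.append(wins.mean())
    c = wins.sum(axis=0)                             # positives beating negative k
    d = wins.sum(axis=1)                             # negatives beaten by positive i
    p110_s.append((c**2 - c).sum() / (N1 * (N1 - 1) * N0))   # two positives beat one negative
    p100_s.append((d**2 - d).sum() / (N0 * (N0 - 1) * N1))   # one positive beats two negatives

auc, p110, p100 = np.mean(auc_s), np.mean(p110_s), np.mean(p100_s)
var_eq63 = (auc * (1 - auc) + (N1 - 1) * (p110 - auc**2)
            + (N0 - 1) * (p100 - auc**2)) / (N1 * N0)        # Eq. (63)
print(round(var_eq63, 6), round(np.var(auc_s), 6))           # formula vs empirical variance
```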
Scores for the synthetic base classifiers were generated so that each classifier i has a prescribed target AUC, as defined in (9), with Φ⁻¹ being the inverse of the standard normal cumulative distribution function. As in the main paper, we use the convention that ranks assigned by better-than-random classifiers (i.e., ⟨AUC⟩ > 0.5) are such that positive-class items get ranks that are lower on average than the ranks of the negative-class items. Ranks were produced from the scores of the base classifiers using the rank transform algorithm (within the stats module of the scipy package (7)), assigning low and high sample ranks to low and high sample scores, respectively.

To determine the effect of conditional dependence on the performance of FiDEL, we limit the class-conditioned covariance matrices of the base classifier scores to the form

ΣY = (1 − r) I + r J,   [65]

where I is the identity matrix and J is the matrix of all ones, i.e., a matrix with 1's on the diagonal and r in every off-diagonal entry. As Eq. (65) specifies that the diagonal entries are 1 and that 0 ≤ r < 1, we may interpret the class-conditioned covariance matrix ΣY as a class-conditioned correlation matrix, and r as the class-conditioned correlation coefficient of the classifier scores. While we set a target AUCi for each base classifier i, the empirical ÂUCi will be slightly different from its intended target due to sampling variations.
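The simulation setup described in this section can be sketched as follows: class-conditioned multivariate normal scores for M base classifiers with the covariance of Eq. (65), per-classifier mean shifts chosen from target AUCs via the binormal relation √2 Φ⁻¹(AUCi), and ranks obtained with scipy's rankdata so that positives receive low ranks (the paper's convention). The function name, the use of a common mean shift applied to the positive class, and all numerical values are illustrative assumptions rather than the authors' exact pipeline.

```python
import numpy as np
from scipy.stats import norm, rankdata

rng = np.random.default_rng(4)

def simulate_base_classifiers(target_aucs, r, N=1000, rho=0.3):
    """Class-conditioned scores for M base classifiers with covariance
    (1 - r) I + r J (Eq. (65)); positives are shifted to LOWER scores so that,
    with low score -> low rank, they end up with lower ranks on average."""
    M = len(target_aucs)
    N1 = int(round(rho * N))
    y = np.array([1] * N1 + [0] * (N - N1))
    cov = (1 - r) * np.eye(M) + r * np.ones((M, M))
    delta = np.sqrt(2.0) * norm.ppf(target_aucs)      # binormal mean separation per classifier
    scores = rng.multivariate_normal(np.zeros(M), cov, size=N)
    scores[y == 1] -= delta
    ranks = np.apply_along_axis(rankdata, 0, scores)  # low scores get low ranks
    return y, scores, ranks

y, scores, ranks = simulate_base_classifiers([0.60, 0.65, 0.70, 0.75], r=0.2)
emp_auc = [np.mean(ranks[y == 1, m][:, None] < ranks[y == 0, m][None, :]) for m in range(4)]
print([round(a, 3) for a in emp_auc])   # empirical AUCs fluctuate around the targets
```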

Similarly, while we set a target class-conditioned score covariance matrix with off-diagonal terms equal to r, we will report the empirically estimated class-conditioned correlation, denoted r̄, in Figures S1 and S2.

We first establish the relative performance of FiDEL, WOC, and the best individual base classifier when r = 0 (and therefore r̄ = 0), a case in which our assumption of class-conditional independence holds. Figure S1A shows that an increase in the number of base classifiers corresponds to a monotonic increase in the empirical performance of the FiDEL and WOC ensemble classifiers, with FiDEL performing slightly better than the WOC ensemble. The trend in these empirical results suggests that, for relatively large numbers of base classifiers, the performance of FiDEL and WOC is far superior to that of any one of their base classifier constituents, converging asymptotically to an AUC of 1.

Next, we examine the influence that conditionally dependent base classifier predictions have on the performance of the FiDEL and WOC ensemble classifiers. Figure S1B (r̄ = 0.19) suggests that the AUC of FiDEL and WOC reach distinct performance maxima, each of which is less than the maximum AUC reported for conditionally independent base classifier predictions in Figure S1A. In addition, these results show that the performance of the best individual method is greater than or equal to that of the WOC ensemble classifier for M ≤ 7. Distinctly, the performance of the FiDEL ensemble classifier is greater than that of the WOC, and it is clearly better than the best individual classifier for all M > 2. With an increase in the class-conditional dependence to r̄ = 0.38, the performances of the ensemble classifiers continue to decline. Figure S1C suggests that the performance of FiDEL is close to that of the best individual base classifier, but better than that of the WOC. The WOC performance, on the other hand, is worse than that of the best individual base classifier for all M. A further increase of the conditional correlation to r̄ = 0.58 in Figure S1D results in an empirical AUC for each ensemble that is less than that of the best individual base classifier for all M tested. These results are summarized for M = 10 in Figure S2, and together suggest that the performance of FiDEL is robust to mild violations of the assumption of class-conditional independence: its performance is greater than that of the best individual classifiers up to a class-conditioned correlation of r̄ ≈ 0.4, and it is better than the WOC ensemble for all values of r̄ tested.

The real datasets used in Figure 4 of the main text are summarized in Table S1. For the WNV dataset we used all 10,506 items available, but deleted several features that contained outliers; this reduced the feature set from 85 to 75. For the SMR dataset we sub-selected 22,000 items chosen randomly out of the original training set of 145,231 items. This was done to reduce the computational burden and memory usage required to train some of the algorithms. We also removed features with missing values (N/A or specific codes marking unavailable values, such as 9X, 99X, 999X, 99999999X). This led to a reduced feature set of 259 features, compared to the original 1,934 features in the SMR data.

The processing of the datasets as well as the results of Figure 4 can be reproduced using the code in https://github.com/sungcheolkim78/FiDEL/tree/master/kaggle and https://github.com/sungcheolkim78/FiDEL, respectively. This code has the random seed fixed at 200 (set.seed(200)) to ensure reproducibility. However, it should be mentioned that if the random seed were changed, different training and test set partitions would be created, and different algorithms would be assigned to the new partitions, resulting in slightly different performances for FiDEL than the ones reported in Figure 4 of the main paper.

The classification methods used to create the FiDEL ensembles are shown in Table S2. There are a total of 23 methods, which were selected to cover a broad range of algorithms. Not all methods were compatible with the features of each dataset.