# Information-theoretic model comparison unifies saliency metrics

Edited by Wilson S. Geisler, The University of Texas at Austin, Austin, TX, and approved October 27, 2015 (received for review May 28, 2015)

## Significance

Where do people look in images? Predicting eye movements from images is an active field of study, with more than 50 quantitative prediction models competing to explain scene viewing behavior. Yet the rules for this competition are unclear. Using a principled metric for model comparison (information gain), we quantify progress in the field and show how formulating the models probabilistically resolves discrepancies in other metrics. We have also developed model assessment tools to reveal where models fail on the database, image, and pixel levels. These tools will facilitate future advances in saliency modeling and are made freely available in an open source software framework (www.bethgelab.org/code/pysaliency).

## Abstract

Learning the properties of an image associated with human gaze placement is important both for understanding how biological systems explore the environment and for computer vision applications. There is a large literature on quantitative eye movement models that seeks to predict fixations from images (sometimes termed “saliency” prediction). A major problem known to the field is that existing model comparison metrics give inconsistent results, causing confusion. We argue that the primary reason for these inconsistencies is that different metrics and models use different definitions of what a “saliency map” entails. For example, some metrics expect a model to account for image-independent central fixation bias whereas others will penalize a model that does. Here we bring saliency evaluation into the domain of information by framing fixation prediction models probabilistically and calculating information gain. We jointly optimize the scale, the center bias, and spatial blurring of all models within this framework. Evaluating existing metrics on these rephrased models produces almost perfect agreement in model rankings across the metrics. Model performance is separated from center bias and spatial blurring, avoiding the confounding of these factors in model comparison. We additionally provide a method to show where and how models fail to capture information in the fixations on the pixel level. These methods are readily extended to spatiotemporal models of fixation scanpaths, and we provide a software package to facilitate their use.

Humans move their eyes about three times per second when exploring the environment, fixating areas of interest with the high-resolution fovea. How do we determine where to fixate to learn about the scene in front of us? This question has been studied extensively from the perspective of “bottom–up” attentional guidance (1), often in a “free-viewing” task in which a human observer explores a static image for some seconds while his or her eye positions are recorded (Fig. 1*A*). Eye movement prediction is also applied in domains from advertising to efficient object recognition. In computer vision the problem of predicting fixations from images is often referred to as “saliency prediction,” while to others “saliency” refers explicitly to some set of low-level image features (such as edges or contrast). In this paper we are concerned with predicting fixations from images, taking no position on whether the features that guide eye movements are “low” or “high” level.

The field of eye movement prediction is quite mature: Beginning with the influential model of Itti et al. (1), there are now over 50 quantitative fixation prediction models, including around 10 models that seek to incorporate “top–down” effects (see refs. 2⇓–4 for recent reviews and analyses of this extensive literature). Many of these models are designed to be biologically plausible whereas others aim purely at prediction (e.g., ref. 5). Progress is measured by comparing the models in terms of their prediction performance, under the assumption that better-performing models must capture more information that is relevant to human behavior.

How close are the best models to explaining fixation distributions in static scene eye guidance? How close is the field to understanding image-based fixation prediction? To answer this question requires a principled distance metric, yet no such metric exists. There is significant uncertainty about how to compare saliency models (3, 6⇓–8). A visit to the well-established MIT Saliency Benchmark (saliency.mit.edu) allows the reader to order models by seven different metrics. These metrics can vastly change the ranking of the models, and there is no principled reason to prefer one metric over another. Indeed, a recent paper (7) compared 12 metrics, concluding that researchers should use 3 of them to avoid the pitfalls of any one. Following this recommendation would mean comparing fixation prediction models is inherently ambiguous, because it is impossible to define a unique ranking if any two of the considered rankings are inconsistent.

Because no comparison of existing metrics can tell us how close we are, we instead advocate a return to first principles. We show that evaluating fixation prediction models in a probabilistic framework can reconcile ranking discrepancies between many existing metrics. By measuring information directly we show that the best model evaluated here (state of the art as of October 2014) explains only 34% of the explainable information in the dataset we use.

## Results

### Information Gain.

Fixation prediction is operationalized by measuring fixation densities. If different people view the same image, they will place their fixations in different locations. Similarly, the same person viewing the same image again will make different eye movements than they did the first time. It is therefore natural to consider fixation placement as a probabilistic process.

The performance of a probabilistic model can be assessed using information theory. As originally shown by Shannon (9), information theory provides a measure, information gain, to quantify how much better a posterior predicts the data than a prior. In the context of fixation prediction, this quantifies how much better an image-based model predicts the fixations on a given image than an image-independent baseline.

Information gain is measured in bits. To understand this metric intuitively, imagine a game of 20 questions in which a model is asking yes/no questions about the location of a fixation in the data. The model’s goal is to specify the location of the fixation. If model A needs one question less than model B on average, then model A’s information gain exceeds model B’s information gain by one bit. If a model needs exactly as many questions as the baseline, then its information gain is zero bits. The number of questions the model needs is related to the concept of code length: Information gain is the difference in the average code length between a model and the baseline. Finally, information gain can also be motivated from the perspective of model comparison: It is the logarithm of the Bayes factor of the model and the baseline, divided by the number of data points. That is, if the information gain exceeds zero, then the model is more likely than the baseline.

Formally, information gain is the difference in average log-likelihood per fixation between the model and the baseline; its expected value is related to the Kullback–Leibler divergence between the true fixation distribution and the model (*SI Text, Kullback*–*Leibler Divergence*).

For image-based fixation prediction, information gain quantifies the reduction in uncertainty (intuitively, the scatter of predicted fixations) in where people look, given knowledge of the image they are looking at. To capture the image-independent structure in the fixations in a baseline model, we use a 2D histogram of all fixations cross-validated between images: How well can the fixations on one image be predicted from fixations on all other images?

In addition to being principled, information gain is an intuitive model comparison metric because it is a ratio scale. First, like the distance between two points, “zero” in a ratio-scaled metric means the complete absence of the quantity (in this case, no difference in code length from baseline). Second, a given change in the scale means the same thing no matter the absolute values. That is, it is meaningful to state relationships such as “the difference in information gain between models A and B is twice as big as the difference between models C and D.” Many existing metrics, such as the area under the ROC curve (AUC), do not meet these criteria.

To know how well models predict fixation locations, relative to how they could perform given intersubject variability, we want to compare model information gain to some upper bound. To estimate the information gain of the true fixation distribution, we use a nonparametric gold standard model: How well can the fixations of one subject be predicted by all other subjects’ fixations? This gold standard captures the explainable information gain for image-dependent fixation patterns for the subjects in our dataset, ignoring additional task- and subject-specific information (we examine this standard further in *SI Text, Gold Standard Convergence* and Fig. S1). By comparing the information gain of models to this explainable information gain, we determine the proportion of explainable information gain explained. Like variance explained in linear Gaussian regression, this quantity tells us how much of the explainable information gain a model captures. Negative values mean that a model performs even worse than the baseline.
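To make the computation concrete, here is a minimal numerical sketch. The densities and fixation locations are made up for illustration, and the uniform density stands in for the cross-validated histogram baseline described in *Materials and Methods*:

```python
import numpy as np

def information_gain(model_density, baseline_density, rows, cols):
    """Average difference in log2 probability (bits per fixation) that the
    model vs. the baseline assigns to the observed fixations."""
    gains = (np.log2(model_density[rows, cols])
             - np.log2(baseline_density[rows, cols]))
    return float(gains.mean())

# Toy 2x2 "image": the model puts most of its mass where fixations land.
model = np.array([[0.7, 0.1],
                  [0.1, 0.1]])            # hypothetical model density
baseline = np.full((2, 2), 0.25)          # uniform stand-in for the baseline
rows = np.array([0, 0, 0])                # three observed fixations
cols = np.array([0, 0, 1])

ig = information_gain(model, baseline, rows, cols)   # positive: model wins
```

“Information gain explained” then divides a model’s gain by the gold standard’s gain, computed the same way from the gold-standard density.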

### Reconciling the Metrics.

Now that we have defined a principled and intuitive scale on which to compare models we can assess to what extent existing metrics align with this scale. In Fig. 1*B* we show the relative performance on all metrics for all saliency models listed on the MIT Saliency Benchmark website as of February 25, 2015. If all metrics gave consistent rankings, all colored lines would monotonically increase. They clearly do not, highlighting the problem with existing metrics.

Fig. 1*C* shows how the fixation prediction models we evaluate in this paper perform on eight popular fixation prediction metrics (colors) and information gain explained. As in Fig. 1*B*, the metrics are inconsistent with one another. This impression is confirmed in Fig. 1*C*, *Inset*, showing Pearson (below the diagonal) and Spearman (above the diagonal) correlation coefficients. If the metrics agreed perfectly, this plot matrix would be red. When considered relative to information gain explained, the other metrics are generally nonmonotonic and inconsistently scaled.

Why is this the case? The primary reason for the inconsistencies in Fig. 1 *B* and *C* is that both the models and the metrics use different definitions of the meaning of a saliency map (the spatial fixation prediction). For example, the “AUC wrt. uniform” metric expects the model to account for the center bias (a bias in free-viewing tasks to fixate near the center of the image), whereas “AUC wrt. center bias” expects the model to ignore the center bias (10). Therefore, a model that accounts for the center bias is penalized by AUC wrt. center bias whereas a model that ignores the center bias is penalized by AUC wrt. uniform. The rankings of these models will likely change between the metrics, even if they had identical knowledge about the image features that drive fixations.

To overcome these inconsistencies we phrased all models probabilistically, fitting three independent factors. We transformed the (often arbitrary) model scale into a density, accounted for the image-independent center bias in the dataset, and compensated for overfitting by applying spatial blurring. We then reevaluated all metrics on these probabilistic models. This yields far more consistent outcomes between the metrics (Fig. 1*D*). The metrics are now monotonically related to information gain explained, creating mostly consistent model rankings (compare the correlation coefficient matrices in Fig. 1 *C* and *D*, *Insets*).

Nevertheless, Fig. 1*D* also highlights one additional, critical point. All model relative performances must reconverge to the gold standard performance at (1, 1). That all existing metrics diverge from the unity diagonal means that these metrics remain nonlinear in information gain explained. This creates problems in comparing model performance. If we are interested in the information that is explained, then information gain is the only metric that can answer this question in an undistorted way.

### How Close Is the Field to Understanding Image-Based Fixation Prediction?

We have shown above that a principled definition of fixation prediction serves to reconcile ranking discrepancies between existing metrics. Information gain explained also tells us how much of the information in the data is accounted for by the models. That is, we can now provide a principled answer to the question, “How close is the field to understanding image-based fixation prediction?”.

Fig. 1*E* shows that the best-performing model we evaluate here, ensemble of deep networks (eDN), accounts for about 34% of the explainable information gain, which is 1.21 bits per fixation (bits/fix) in this dataset (*SI Text, Model Performances as Log-Likelihoods* and Fig. S2). These results highlight the importance of using an intuitive evaluation metric: As of October 2014, there remained a significant amount of information that image-based fixation prediction models could explain but did not.

### Information Gain in the Pixel Space.

The probabilistic framework for model comparison we propose above has an additional advantage over existing metrics: The information gain of a model can be evaluated at the level of pixels (Table S1). We can examine where and by how much model predictions fail.

This procedure is schematized in Fig. 2. For an example image, the model densities show where the model predicts fixations to occur in the given image (Fig. 2*A*). This prediction is then divided by the baseline density, yielding a map showing where and by how much the model believes the fixation distribution in a given image is different from the baseline (“image-based prediction”). If the ratio is greater than one, the model predicts there should be more fixations than the center bias expects. The “information gain” images in Fig. 2 quantify how much a given pixel contributes to the model’s performance relative to the baseline (code length saved in bits/fix). Finally, the difference between the model’s information gain and the possible information gain, estimated by the gold standard, is shown in “difference to real information gain”: It shows where and how much (bits) the model wastes information that could be used to describe the fixations more efficiently.

The advantage of this approach is that we can see not only how much a model fails (on an image or dataset level), but also exactly where it fails, in individual images. This can be used to make informed decisions about how to improve fixation prediction models. In Fig. 2*B*, we show an example image and the performance of the three best-performing models [eDN, Boolean map-based saliency (BMS), and attention based on information maximization (AIM)]. The pixel space information gains show that the eDN model correctly assigns large density to the boat, whereas the other models both underestimate the saliency of the boat.

To extend this pixel-based analysis to the level of the entire dataset, we display each image in the dataset according to its possible information gain and the percentage of that information gain explained by the eDN model (Fig. 3). In this space, points to the bottom right represent images that contain a lot of explainable information in the fixations that the model fails to capture. Points show all images in the dataset, and for a subset of these we have displayed the image itself. The images in the bottom right of the plot tend to contain human faces. See *SI Text, Pixel-Based Analysis on Entire Dataset* for an extended version of this analysis including pixel-space information gain plots and a model comparison.

## Discussion

Predicting where people look in images is an important problem, yet progress has been hindered by model comparison uncertainty. We have shown that phrasing fixation prediction models probabilistically and appropriately evaluating their performance cause the disagreement between many existing metrics to disappear. Furthermore, bringing the model comparison problem into the principled domain of information allows us to assess the progress of the field, using an intuitive distance metric. The best-performing model we evaluate here (eDN) explains about 34% of the explainable information gain. More recent model submissions to the MIT Benchmark have significantly improved on this number (e.g., ref. 11). This highlights one strength of information gain as a metric: As model performance begins to approach the gold standard, the nonlinear nature of other metrics (e.g., AUC) causes even greater distortion of apparent progress. The utility of information gain is clear.

To improve models it is useful to know where in images this unexplained information is located. We developed methods not only to assess model performance on a database level, but also to show where and by how much model predictions fail in individual images, on the pixel level (Figs. 2 and 3). We expect these tools will be useful for the model development community, and we provide them in our free software package.

Many existing metrics can be understood as evaluating model performance on a specific task. For example, the AUC is the performance of a model in a two-alternative forced-choice (2AFC) task, “Which of these two points was fixated?”. If this is the task of interest to the user, then AUC is the right metric. Our results do not show that any existing metric is wrong. The metrics do not differ because they capture fundamentally different properties of fixation prediction, but mainly because they do not agree on the definition of “saliency map.” The latter case requires only minor adjustments to move the field forward. This also serves to explain the three metric groups found by Riche et al. (7): One group contains among others AUC with uniform nonfixation distribution (called AUC-Judd by Riche), another group contains AUC with center bias nonfixation distribution (AUC-Borji), and the last group contains image-based KL divergence (KL-Div). We suggest that the highly uncorrelated results of these three groups are due to the fact that one group penalizes models without center bias, another group penalizes models with center bias, and the last group depends on the absolute saliency values. Compensating for these factors appropriately makes the metric results correlate almost perfectly.

Although existing metrics are appropriate for certain use cases, the biggest practical advantage in using a probabilistic framework is its generality. First, once a model is formulated in a probabilistic way many kinds of “task performance” can be calculated, depending on problems of applied interest. For example, we might be interested in whether people will look at an advertisement on a website or whether the top half of an image is more likely to be fixated than the bottom half. These predictions are a simple matter of integrating over the probability distribution. This type of evaluation is not well defined for other metrics that do not define the scale of saliency values. Second, a probabilistic model allows the examination of any statistical moments of the probability distribution that might be of practical interest. For example, Engbert et al. (12) examine the properties of second-order correlations between fixations in scanpaths. Third, information gain allows the contribution of different factors in explaining data variance to be quantified. For example, it is possible to show how much the center bias contributes to explaining fixation data independent of image-based saliency contributions (10) (*SI Text, Model Performances as Log-Likelihoods* and Fig. S2). Fourth, the information gain is differentiable in the probability density, allowing models to be numerically optimized using gradient techniques. In fact, the optimization is equivalent to maximum-likelihood estimation, which is ubiquitously used for density estimation and fulfills a few simple desiderata for density metrics (13). In some cases other loss functions may be preferable.
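Once a model outputs a proper density, such applied questions reduce to sums (integrals) over image regions. A minimal sketch with a hypothetical density grid and an arbitrary region of interest:

```python
import numpy as np

rng = np.random.default_rng(0)
density = rng.random((48, 64))
density /= density.sum()                  # hypothetical model fixation density

# Applied predictions are sums over regions of the probability distribution:
p_top_half = density[:24, :].sum()        # P(fixation lands in the top half)
p_region = density[10:20, 40:60].sum()    # P(fixation inside an ad's box)
```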

If we are interested in understanding naturalistic eye movement behavior, free viewing static images is not the most representative condition (14⇓⇓⇓–18). Understanding image-based fixation behavior is not only a question of “where?”, but of “when?” and “in what order?”. It is the spatiotemporal pattern of fixation selection that is increasingly of interest to the field, rather than purely spatial predictions of fixation locations. The probabilistic framework we use in this paper (10, 19) is easily extended to study spatiotemporal effects, by modeling the conditional probability of a fixation given previous fixations (*Materials and Methods* and ref. 12).

Accounting for the entirety of human eye movement behavior in naturalistic settings will require incorporating information about the task, high-level scene properties, and mechanistic constraints on the eye movement system (12, 15⇓–17, 20⇓–22). Our gold standard contains the influence of high-level (but still purely image-dependent) factors to the extent that they are consistent across observers. Successful image-based fixation prediction models will therefore need to use such higher-level features, combined with task-relevant biases, to explain how image features are associated with the spatial distribution of fixations over scenes.

## Materials and Methods

### Image Dataset and Fixation Prediction Models.

We use a subset of a popular benchmarking dataset (MIT-1003) (23) to compare and evaluate fixation prediction models, using only the most common image size (1,024 × 768 pixels). We additionally repeated the evaluation on the dataset of Kienzle et al. (24) (*SI Text, Kienzle Dataset* and Fig. S3).

We evaluated all models considered in ref. 25 and the top-performing models added to the MIT Saliency Benchmarking website (saliency.mit.edu) up to October 2014. For all models, the original source code and default parameters have been used unless stated otherwise. The included models are Itti et al. (1) [here, two implementations have been used: one from the Saliency Toolbox and the variant specified in the graph-based visual saliency (GBVS) paper], Torralba et al. (26), GBVS (27), saliency using natural statistics (SUN) (28) (for “SUN, original” we used a scale parameter of 0.64), and the remaining models shown in Fig. 1*C*.

### Information Gain and Comparison Models.

Given fixations (*x*₁, *y*₁), …, (*x*ₙ, *y*ₙ) on images *I*₁, …, *I*ₙ, the information gain of a model with density *p*(*x*, *y* | *I*) over the baseline *p*₀(*x*, *y*) is the difference in average log-likelihood per fixation: IG = (1/*n*) Σᵢ [log *p*(*x*ᵢ, *y*ᵢ | *I*ᵢ) − log *p*₀(*x*ᵢ, *y*ᵢ)].

The explainable information gain is the information gain of the gold standard model, computed in the same way with the gold-standard density in place of the model density.

In this paper we use the logarithm to base 2, meaning that information gain is measured in bits. Model comparison within the framework of likelihoods is well defined and standard practice in statistical model comparison.

The baseline model is a 2D histogram model with a uniform regularization (to avoid zero bin counts) cross-validated between images (trained on all fixations for all observers on all other images). That is, reported baseline performance used all fixations from other images to predict the fixations for a specific image: It captures the image-independent spatial information in the fixations. Bin width and regularization parameters were optimized by grid search. If a saliency model captured all of the behavioral fixation biases but nothing about what causes parts of an image to attract fixations, it would do as well as the baseline model.
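The baseline construction can be sketched as follows. The bin size and regularizer here are illustrative defaults rather than the grid-searched values:

```python
import numpy as np

def baseline_density(fixations_by_image, held_out, shape=(64, 64),
                     bins=(8, 8), eps=1.0):
    """Cross-validated baseline for one image: a 2D histogram of the
    fixations on all *other* images, with a uniform regularizer eps so no
    bin is empty. (Bin width and eps are illustrative; the paper chooses
    them by grid search.)"""
    h = np.full(bins, eps)                          # uniform regularization
    for image, fixations in fixations_by_image.items():
        if image == held_out:                       # leave current image out
            continue
        for y, x in fixations:
            h[int(y * bins[0] / shape[0]), int(x * bins[1] / shape[1])] += 1
    h /= h.sum()                                    # bins -> probability mass
    by, bx = shape[0] // bins[0], shape[1] // bins[1]
    # spread each bin's mass uniformly over its pixels -> per-pixel density
    return np.kron(h, np.ones((by, bx))) / (by * bx)

fixations = {0: [(5, 5)], 1: [(10, 10), (40, 40)], 2: [(60, 3)]}
density = baseline_density(fixations, held_out=0)
```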

Fixation preferences that are inconsistent between observers are by definition unpredictable from fixations alone. If we have no additional knowledge about interobserver differences, the best predictor of an observer’s fixation pattern on a given image is therefore to average the fixation patterns from all other observers and add regularization. This is our gold standard model. It was created by blurring the fixations with a Gaussian kernel and including a multiplicative center bias (*Phrasing Saliency Maps Probabilistically*), learned by leave-one-out cross-validation between subjects. That is, the reported gold standard performance (for information gain and AUCs) always used only fixations from other subjects to predict the fixations of a specific subject, therefore giving a conservative estimate of the explainable information. It accounts for the amount of information in the spatial structure of fixations to a given image that can be explained while averaging over the biases of individual observers. This model is the upper bound on prediction in the dataset (see ref. 8 for a thorough comparison of this gold standard and other upper bounds capturing different constraints).
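A simplified sketch of the leave-one-subject-out gold standard follows. The multiplicative center bias is omitted for brevity, the blur width and regularizer are illustrative rather than cross-validated, and the pure-NumPy blur stands in for a library routine such as `scipy.ndimage.gaussian_filter`:

```python
import numpy as np

def gaussian_blur(img, sigma):
    """Separable Gaussian blur in pure NumPy (a stand-in for, e.g.,
    scipy.ndimage.gaussian_filter)."""
    r = max(1, int(3 * sigma))
    kernel = np.exp(-0.5 * (np.arange(-r, r + 1) / sigma) ** 2)
    kernel /= kernel.sum()
    rows = np.apply_along_axis(np.convolve, 1, img, kernel, mode="same")
    return np.apply_along_axis(np.convolve, 0, rows, kernel, mode="same")

def gold_standard(fixation_counts, subject, sigma=3.0, eps=1e-8):
    """Leave-one-subject-out gold standard: blurred fixation counts from all
    *other* subjects, regularized and renormalized. The paper additionally
    applies a multiplicative center bias, omitted here; sigma and eps would
    be learned by cross-validation."""
    others = fixation_counts.sum(axis=0) - fixation_counts[subject]
    density = gaussian_blur(others.astype(float), sigma) + eps
    return density / density.sum()

rng = np.random.default_rng(1)
counts = (rng.random((15, 32, 32)) < 0.02).astype(int)  # 15 toy subjects
density = gold_standard(counts, subject=0)
```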

### Existing Metrics.

We evaluate the models on several prominent metrics (Fig. 1*C*): AUC wrt. uniform, AUC wrt. center bias, image-based KL divergence, fixation-based KL divergence, normalized scanpath saliency, and correlation coefficient. For details on these metrics, their implementation, and their relationship to information gain see *SI Text, Existing Metrics*, Fig. S4, and Table S3.

### Phrasing Saliency Maps Probabilistically.

We treat the normalized saliency map *s*(*x*, *y* | *I*) as the predicted gaze density for the fixations: *p*(*x*, *y* | *I*) = *s*(*x*, *y* | *I*).

Because many of the models were optimized for AUC, and because AUC is invariant to monotonic transformations whereas information gain is not, we cannot simply compare the models’ raw saliency maps to one another. The saliency map for each model was therefore transformed by a pointwise monotonic nonlinearity that was optimized to give the best log-likelihood for that model. This corresponds to picking the model with the best log-likelihood from all models that are equivalent (under AUC) to the original model.

Every saliency map was jointly rescaled to range from 0 to 1 (i.e., over all images at once, not per image, keeping contrast changes from image to image intact).

Then a Gaussian blur with radius *σ* was applied, allowing us to compensate for models that make overly precise, confident predictions of fixation locations (25).

Next, the pointwise monotonic nonlinearity was applied. This nonlinearity was modeled as a continuous piecewise linear function supported on 20 equidistant points.
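One way to sketch such a nonlinearity is below. The exact parameterization in the paper may differ; here monotonicity is guaranteed by building the knot heights from a cumulative sum of positive increments, which is convenient for joint optimization:

```python
import numpy as np

def monotonic_nonlinearity(saliency, log_increments):
    """Continuous piecewise linear map with 20 equidistant knots on [0, 1].
    Knot heights are a cumulative sum of positive increments, so every
    parameter vector yields a monotonic function. (The paper's exact
    parameterization may differ.)"""
    knots_x = np.linspace(0.0, 1.0, 20)
    knots_y = np.cumsum(np.exp(log_increments))    # strictly increasing
    return np.interp(saliency, knots_x, knots_y)

params = np.zeros(20)                  # hypothetical parameter vector
s = np.linspace(0.0, 1.0, 101)         # a saliency map rescaled to [0, 1]
out = monotonic_nonlinearity(s, params)
```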

Finally, we included a center bias term (accounting for the fact that human observers tend to look toward the center of the screen) (25).

The center bias was modeled as a multiplicative factor whose strength is controlled by a weighting parameter *α*, fitted jointly with the model’s other parameters.

All parameters were optimized jointly, using the SLSQP algorithm from scipy.optimize (38).

### Evaluating the Metrics on Probabilistic Models.

To evaluate metrics described above on the probabilistic models (the results shown in Fig. 1*D*), we used the log-probability maps as saliency maps. All other computations were as described above. An exception is the image-based KL divergence. Because this metric operates on probability distributions, our model predictions were used directly.

The elements of Fig. 2 are calculated as follows: First, we plot the model density for each model (column “density” in Fig. 2). This is the model’s predicted fixation density *p*(*x*, *y* | *I*).

Now we separate the expected information gain (an integral over space) into its constituent pixels: each pixel contributes the (estimated) true fixation density at that pixel multiplied by the difference between the model’s and the baseline’s log densities there.

Note that this detailed evaluation is not possible with existing saliency metrics (Table S1).
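The per-pixel decomposition can be sketched as follows, with hypothetical densities standing in for the model, the baseline, and the gold standard; summing the map over pixels recovers the expected information gain:

```python
import numpy as np

def pixel_information_gain(model_density, baseline_density, true_density):
    """Per-pixel contribution to the expected information gain (bits/fix):
    the (estimated) true fixation density times the log density ratio of
    model to baseline. Summing over all pixels gives the expected gain."""
    return true_density * (np.log2(model_density) - np.log2(baseline_density))

rng = np.random.default_rng(0)
normalize = lambda a: a / a.sum()
model = normalize(rng.random((16, 16)))       # hypothetical model density
baseline = normalize(np.ones((16, 16)))       # uniform stand-in for baseline
truth = normalize(rng.random((16, 16)))       # e.g., gold-standard density

ig_map = pixel_information_gain(model, baseline, truth)  # where model helps
expected_ig = ig_map.sum()
```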

### Generalization to Spatiotemporal Scanpaths.

The models we consider in this paper are purely spatial: They do not include any temporal dependencies. A complete understanding of human fixation selection would require an understanding of spatiotemporal behavior, that is, scanpaths. The model adaptation and optimization procedure we describe above can be easily generalized to account for temporal effects. For details see *SI Text, Generalization to Spatiotemporal Scanpaths*.

## SI Text

## Model Performances as Log-Likelihoods

In Fig. S2, we report the average log-likelihoods of the tested models. All reported log-likelihoods are relative to the maximum entropy model predicting a uniform fixation distribution.

The gold standard model shows that the total mutual information between the image and the spatial structure of the fixations amounts to 2.1 bits/fix. To give another intuition for this number, a model that would for every fixation always correctly predict the quadrant of the image in which it falls would also have a log-likelihood of 2 bits/fix.
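The quadrant intuition above is a one-line calculation:

```python
import math

# Always predicting the correct quadrant = uniform density over one quarter
# of the image, i.e. 4x the uniform baseline at every fixated location.
gain_bits_per_fix = math.log2(4)   # log-likelihood gain vs. uniform model
```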

The lower-bound model is able to explain 0.89 bits/fix of this mutual information. That is, 42% of the information in spatial fixation distributions can be accounted for by behavioral biases (e.g., the bias of human observers to look at the center of the image).

The eDN model performs best of all of the saliency models compared, with 1.29 bits/fix, capturing 62% of the total mutual information. It accounts for 19% more than the lower-bound model or 34% of the possible information gain (1.21 bits/fix) between baseline and gold standard.

Fig. S2 also shows performances where only a subset of our optimization procedure was performed, allowing the contribution of different stages of our optimization to be assessed. Considering only model performance (i.e., without also including center bias and blur factors; the pink sections in Fig. S2) shows that many of the models perform worse than the lower-bound model. This means that the center bias is more important than the portion of image-based saliency that these models do capture (39). Readers will also note that the center bias and blurring factors account for very little of the performance of the Judd model and the eDN model relative to most other models. This is because these models already include a center bias that is optimized for the Judd dataset.

## Gold Standard Convergence

The absolute performance level of the gold standard (the estimate of explainable information gain) depends on the size of the dataset. With fewer data points, the true gold standard performance will be underestimated because more regularization is required to generalize across subjects. With enough data, our estimate of the gold standard will converge to the true gold standard performance.

To examine the convergence of our gold standard estimate in the dataset we use, we repeated our cross-validation procedure using, for each subject, only a subset of the other 14 subjects. Fig. S1 shows the average gold standard performance (in bits per fixation) as a function of the number of other subjects used for cross-validation. The curve rapidly increases and then begins to flatten as we reach the full dataset size. This result indicates that more data would be required to gain a precise estimate of the true gold standard performance. Nevertheless, that the curve begins to saturate indicates that more data are unlikely to qualitatively change the results we report here. If anything, the gold standard performance would increase, reducing our estimate of the explainable information gain explained (34%) even further.

## Kienzle Dataset

We repeated the full evaluation on the dataset of Kienzle et al. (24), which consists of 200 grayscale natural images.

In this dataset, even less of the possible information gain (22%) is captured by the best model (here, GBVS; we were not able to include eDN in this comparison because its source code had not yet been released at the time of the analysis). Removing the photographer bias leads to a smaller contribution (34%) of the nonparametric model compared with the increase in log-likelihood from saliency map-based models. The possible information gain (0.92 bits/fix) is smaller than for the Judd dataset (1.21 bits/fix). There are multiple possible reasons for this: primarily, this dataset contains no pictures of people but many natural images; in addition, the images are grayscale.

## Pixel-Based Analysis on Entire Dataset

In Fig. S5, we display each image in the dataset according to its possible information gain and the percentage of that information gain explained by the model. In this space, points to the bottom right represent images that contain a lot of explainable information in the fixations that the model fails to capture. Points show all images in the dataset, and for a subset of these we have displayed the image itself (Fig. S5 *A* and *C*) and the information gain difference to the gold standard (Fig. S5 *B* and *D*). For the eDN model (Fig. S5 *A* and *B*), the images in the bottom right of the plot tend to contain human faces. The Judd model contains an explicit face detection module, and as can be seen in Fig. S5 *C* and *D*, it tends to perform better on these images. In terms of the whole dataset, however, the eDN model performs better on images with a moderate level of explainable information (around 3 bits/fix).

## Existing Metrics

We evaluate the models on several prominent metrics. The area under the curve (AUC) metrics are the most widely used. They calculate the performance of the model when using the saliency map as classifier score in a two-alternative forced-choice (2AFC) task where the model has to separate fixations from nonfixations. There are several variants of AUC scores, differing by the nonfixation distribution used and in approximations to speed up computation. We use all sample values as thresholds, therefore using no approximation. AUC wrt. uniform uses a uniform nonfixation distribution, i.e., the full saliency map as nonfixations [this corresponds to “AUC-Judd” in the MIT Benchmark (25)]. AUC wrt. center bias uses the fixations from all other images as nonfixations, thus capturing structure unrelated to the image [behavioral biases, primarily center bias (3, 4, 39)]. This corresponds to “sAUC” in the MIT benchmark (“shuffled AUC”).
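As a concrete illustration of how these AUC variants are computed, here is a minimal NumPy sketch with a synthetic saliency map and synthetic fixations (not the benchmark implementation); using every saliency value as a threshold makes the score equivalent to a Mann–Whitney U statistic:

```python
import numpy as np

def auc(saliency_map, fixations, nonfixations):
    """AUC: probability that a fixated location receives a higher saliency
    value than a location drawn from the nonfixation distribution.
    `fixations`/`nonfixations` are arrays of (row, col) pixel indices."""
    pos = saliency_map[fixations[:, 0], fixations[:, 1]]
    neg = saliency_map[nonfixations[:, 0], nonfixations[:, 1]]
    # Using all sample values as thresholds equals the Mann-Whitney U
    # statistic (ties count one-half).
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))

rng = np.random.default_rng(0)
smap = rng.random((48, 64))
fix = np.argwhere(smap > 0.8)  # synthetic fixations on high-saliency pixels
# "AUC wrt. uniform": every pixel serves as a nonfixation
nonfix_uniform = np.argwhere(np.ones_like(smap, dtype=bool))
auc_uniform = auc(smap, fix, nonfix_uniform)
# For the shuffled variant ("AUC wrt. center bias"), `nonfixations` would
# instead be the fixation locations recorded on all other images.
```

Because the synthetic fixations were placed on the most salient pixels, `auc_uniform` comes out well above chance (0.5).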

Confusingly, there are two completely independent measures referred to as “Kullback–Leibler divergence” used in the saliency literature. We discuss the precise definitions of these metrics and their relationship to information gain as used in this paper in *SI Text, KL Divergence*. What we refer to as image-based Kullback–Leibler (KL) divergence treats the saliency maps as 2D probability distributions and calculates the KL divergence between the model distribution and an approximated true distribution (8, 39). To compute this metric, the saliency maps were rescaled to have a maximum of 1 and a minimum of at least

The other variant of KL divergence, here called fixation-based KL divergence, calculates the KL divergence between the distribution of saliency values at fixations and the distribution of saliency values at some choice of nonfixations (40). We use histograms with 10 bins to calculate the KL divergence. For the nonfixations, we use all saliency values [fixation-based (f.b.)

Normalized scanpath saliency (NSS) normalizes each saliency map to have zero mean and unit variance and then takes the mean saliency value over all fixations.

The correlation coefficient (CC) metric normalizes the saliency maps of the model and the saliency maps of the approximated true distribution (gold standard) to have zero mean and unit variance and then calculates the correlation coefficient of these maps over all pixels.
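The NSS and CC computations described above are short enough to sketch directly in NumPy (the map and fixations below are illustrative stand-ins, not data from the paper):

```python
import numpy as np

def nss(saliency_map, fixations):
    """Normalized scanpath saliency: z-score the map, average at fixations."""
    z = (saliency_map - saliency_map.mean()) / saliency_map.std()
    return z[fixations[:, 0], fixations[:, 1]].mean()

def cc(model_map, gold_map):
    """Correlation coefficient between z-scored model and gold standard maps."""
    a = (model_map - model_map.mean()) / model_map.std()
    b = (gold_map - gold_map.mean()) / gold_map.std()
    return (a * b).mean()

rng = np.random.default_rng(1)
smap = rng.random((48, 64))
fix = np.argwhere(smap > 0.9)  # synthetic fixations on the most salient pixels
nss_val = nss(smap, fix)       # well above 0: fixated values exceed the mean
cc_val = cc(smap, smap)        # a map correlates perfectly with itself
```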

## Detailed Comparison of Log-Likelihoods, AUC, and KL Divergence

Here we consider the relationship between log-likelihoods and prominent existing saliency metrics: AUC and KL divergence.

### AUC.

The most prominent metric used in the saliency literature is the area under the receiver operating characteristic curve (AUC). The AUC is the area under a curve of model hit rate against false positive rate for each threshold. It is equivalent to the performance in a 2AFC task where the model is “presented” with two image locations: one at which an observer fixated and another from a nonfixation distribution. The thresholded saliency value is the model’s decision, and the percentage correct of the model in this task across all possible thresholds is the AUC score. The different versions of AUC used in saliency research differ primarily in the nonfixation distribution used. This is usually either a uniformly selected distribution of not-fixated points across the image (e.g., in ref. 25) or the distribution of fixations for other images in the database [the shuffled AUC (3, 4, 39)]. The latter provides an effective control against center bias (a tendency for humans to look in the center of the screen, irrespective of the image content), by ensuring that both fixation and nonfixation distributions have the same image-independent bias. It is important to bear in mind that this measure will penalize models that explicitly try to model the center bias. The AUC therefore depends critically on the definition of the nonfixation distribution. In the case of the uniform nonfixation distribution, AUC is tightly related to area counts: Optimizing for AUC with uniform nonfixation distribution is equivalent to finding for each percentage

One characteristic of the AUC that is often considered an advantage is that it is sensitive only to the rank order of saliency values, not their scale (i.e., it is invariant under monotonic pointwise transformations) (39). This allows the modeling process to focus on the shape (i.e., the geometry of iso-saliency points) of the distribution of saliency without worrying about the scale, which is argued to be less important for understanding saliency than the contour lines (39). However, in certain circumstances the insensitivity of AUC to differences in saliency can lead to counterintuitive behavior, if we accept that higher saliency values are intuitively associated with more fixations.

By using the likelihood of points as a classifier score, one can compute the AUC for a probabilistic model just as for saliency maps. This has a principled connection with the probabilistic model itself: If the model performed the 2AFC task outlined above using maximum-likelihood classification, then the model’s performance is exactly the AUC. Given the real fixation distribution, it can also be shown that the best saliency map in terms of AUC with uniform nonfixation distribution is exactly the gaze density of the real fixation. However, this does not imply that a better AUC score will yield a better log-likelihood or vice versa. For more details and a precise derivation of these claims, see ref. 10.

### Kullback–Leibler Divergence.

KL divergence is tightly related to log-likelihoods. However, KL divergence as used in practice in the saliency literature is not the same as the approach we advocate.

In general, the KL divergence between two probability distributions *p* and *q* is given by

We now precisely show the relationship between these measures and our information-theoretic approach. Very generally, information theory can be derived from the task of assigning code words to different events that occur with different probabilities such that their average code word length becomes minimal. It turns out that the negative log-probability is a good approximation to the shortest achievable code word length, which gives rise to the definition of the log-loss
\[ \ell_q(x) = -\log_2 q(x), \]
the code length assigned to the event \(x\) under the model distribution \(q\). If \(q\) equals the true distribution \(p\), the optimal code word length for \(x\) is simply \(-\log_2 p(x)\). Accordingly, the more ambiguous the possible values of a variable are, the larger its average log-loss, which is also known as its entropy
\[ H[p] = -\sum_x p(x) \log_2 p(x). \]
If events occur according to \(p(x)\) and we have a model \(q(x)\), the code words based on the model are of length \(-\log_2 q(x)\), and the expected excess code length is the Kullback–Leibler divergence
\[ D_{KL}[p \,\|\, q] = \sum_x p(x) \log_2 \frac{p(x)}{q(x)}, \]
which measures how much information is gained when replacing \(q\) by \(p\). Here \(p(x)\) denotes a posterior distribution that correctly describes the variability of \(x\) after the observation has been made, whereas \(q(x)\) denotes the prior distribution. In a completely analogous fashion, we can measure how much more or less information one model distribution \(q_1(x)\) provides about \(x\) than an alternative model \(q_2(x)\), using the expected log-likelihood ratio (as in likelihood-ratio or \(\chi^2\) tests):
\[ \mathrm{ELLR} = \sum_x p(x) \log_2 \frac{q_1(x)}{q_2(x)}. \]
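These quantities (entropy, KL divergence, expected log-likelihood ratio) can be checked numerically on a toy discrete distribution; in particular, the ELLR decomposes into a difference of two KL divergences from the true distribution. A minimal sketch with made-up probabilities:

```python
import numpy as np

# Toy "image" with four locations: true fixation distribution p,
# a good model q1, and a uniform baseline q2 (all values made up).
p  = np.array([0.50, 0.30, 0.15, 0.05])
q1 = np.array([0.45, 0.30, 0.20, 0.05])
q2 = np.full(4, 0.25)

def entropy(p):
    return -(p * np.log2(p)).sum()        # average log-loss under the true code

def kl(p, q):
    return (p * np.log2(p / q)).sum()     # extra bits for coding with q, not p

def ellr(p, q1, q2):
    return (p * np.log2(q1 / q2)).sum()   # expected log-likelihood ratio

# The ELLR is a difference of KL divergences from the true distribution:
assert np.isclose(ellr(p, q1, q2), kl(p, q2) - kl(p, q1))
```

Because `q1` is closer to `p` than the uniform baseline, the ELLR comes out positive: the model provides information beyond the baseline.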

Our information gain metric reported in the main text is exactly the ELLR, where the two distributions compared are the model under evaluation and the baseline model.

It is crucial to note that in the past the scale used for saliency maps was only a rank scale. This was the case because AUC was the predominant performance measure and is invariant under monotonically increasing transformations. That is, two saliency maps that differ only by a monotonically increasing pointwise transformation are equivalent in terms of AUC.

Fixation-based KL divergence is the more common variant in the literature: Researchers wanted to apply information theoretic measures to saliency evaluation while remaining consistent with the rank-based scale of AUC (40). Therefore, they did not interpret saliency maps themselves as probability distributions, but applied the KL divergence to the distribution of saliency values obtained when using the fixations to that obtained when using nonfixations. We emphasize that this measure has an important conceptual caveat: Rather than being invariant under only monotonic increasing transformations, KL divergence is invariant under any reparameterization. This implies that the measure cares only about which areas are of equal saliency, but does not care about which of any two areas is actually the more salient one. For illustration, for any saliency map
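The reparameterization invariance has a striking consequence: negating a saliency map, which reverses all of its predictions, leaves fixation-based KL divergence unchanged. A small NumPy sketch (synthetic map and fixations; 10 histogram bins as above):

```python
import numpy as np

def fixation_based_kl(smap, fixations, nonfixations, bins=10):
    """KL divergence between histograms of saliency values at fixations
    and at nonfixations, with shared bin edges over the map's value range."""
    edges = np.histogram_bin_edges(smap.ravel(), bins=bins)
    p, _ = np.histogram(smap[fixations[:, 0], fixations[:, 1]], bins=edges)
    q, _ = np.histogram(smap[nonfixations[:, 0], nonfixations[:, 1]], bins=edges)
    p = (p + 1e-12) / (p + 1e-12).sum()   # tiny regularizer avoids log(0)
    q = (q + 1e-12) / (q + 1e-12).sum()
    return (p * np.log2(p / q)).sum()

rng = np.random.default_rng(2)
smap = rng.random((48, 64))
fix = np.argwhere(smap > 0.8)             # fixations on the most salient pixels
nonfix = np.argwhere(np.ones_like(smap, dtype=bool))
kl_orig = fixation_based_kl(smap, fix, nonfix)
kl_flipped = fixation_based_kl(-smap, fix, nonfix)  # predictions reversed
# kl_flipped equals kl_orig, although -smap predicts the opposite of smap
```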

Image-based KL divergence requires that the saliency maps be interpreted as probability distributions. Previous studies using this method (Table S3) simply divided the saliency values by their sum to obtain such probability distributions. However, they did not consider that this measure is sensitive to the scale used for the saliency maps. Optimizing the pointwise nonlinearity (i.e., the scale) has a large effect on the performance of the different models. More generally, realizing that image-based KL divergence treats saliency maps as probability distributions means that other aspects of density estimation, such as center bias and regularization strategies (blurring), must also be taken into account.
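The scale sensitivity is easy to demonstrate: a monotonic pointwise transformation leaves AUC unchanged but alters image-based KL divergence. A NumPy sketch with random stand-in maps (not real model output):

```python
import numpy as np

def image_based_kl(gold_map, model_map, eps=1e-6):
    """KL divergence after renormalizing both maps to probability
    distributions over pixels (the image-based variant)."""
    p = gold_map + eps
    q = model_map + eps
    p = p / p.sum()
    q = q / q.sum()
    return (p * np.log2(p / q)).sum()

rng = np.random.default_rng(4)
gold = rng.random((48, 64))          # stand-in for an empirical gold standard
smap = rng.random((48, 64))          # stand-in model saliency map
kl_raw = image_based_kl(gold, smap)
kl_pow = image_based_kl(gold, smap ** 4)  # same rank order, different scale
# AUC is identical for smap and smap**4; image-based KL is not.
```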

The only conceptual difference between image-based KL divergence and log-likelihoods is that for estimating expected log-likelihood ratios, it is not necessary to have a gold standard. One can simply use the unbiased sample mean estimator (*SI Text, Estimation Considerations*). Furthermore, by conceptualizing saliency in an information-theoretic way, we can not only assign meaning to expected values (such as ELLR or DKL) but also know how to measure the information content of an individual event (here, a single fixation), using the notion of its log-loss (see our application on the individual pixel level in the main text). Thus, although on a theoretical level log-likelihoods and image-based KL divergence are tightly linked, on a practical level a fundamental reinterpretation of saliency maps as probability distributions is necessary.

## Estimation Considerations

One principal advantage of using log-likelihoods instead of image-based KL divergence is that for all model comparisons except the comparison against the gold standard, we do not have to rely on the assumptions made for the gold standard but can simply use the unbiased sample mean estimator
\[ \widehat{\mathrm{ELLR}} = \frac{1}{N} \sum_{i=1}^{N} \log_2 \frac{q_1(x_i)}{q_2(x_i)}, \]
where the \(x_i\) are the \(N\) observed fixation locations.

However, estimating the upper limit on information gain still requires a gold standard [an estimate of the true distribution].

For our dataset, the optimal cross-validated kernel size was 27 pixels, which is relatively close to the commonly used kernel size of

Because we conclude that our understanding of image-based saliency is surprisingly limited, we used a conservative strategy for estimating the information gain of the gold standard, one that is biased downward so that we obtain a conservative upper bound on the fraction of image-based saliency we understand. To this end, we not only used the unbiased sample estimator for averaging over the true distribution but also resorted to a cross-validation strategy for estimating the gold standard that takes into account how well the distributions generalize across subjects: the fixations of each subject \(j\) are predicted from a gold standard estimated on all subjects other than \(j\). For comparison, if one were simply to use the plain sample mean estimator for the gold standard, the fraction explained would drop to an even smaller value of only 22%. Our approach makes it very likely that the true value falls between 22% and 34%.
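The leave-one-subject-out strategy can be sketched as follows, using a crude regularized histogram density in place of the paper's blurred, cross-validation-optimized kernel density estimate (all data here are synthetic):

```python
import numpy as np

def gold_standard(fix_maps, leave_out, eps=1e-3):
    """Leave-one-subject-out gold standard: a regularized fixation density
    from all subjects except `leave_out` (a crude histogram stand-in for
    the paper's blurred, cross-validated density estimate)."""
    mask = np.arange(len(fix_maps)) != leave_out
    density = fix_maps[mask].sum(axis=0) + eps
    return density / density.sum()

def information_gain(density, baseline, fixations):
    """Unbiased sample-mean estimate (bits/fixation) of the log-likelihood
    ratio of `density` over `baseline` at the held-out fixations."""
    ratios = (density[fixations[:, 0], fixations[:, 1]] /
              baseline[fixations[:, 0], fixations[:, 1]])
    return np.log2(ratios).mean()

rng = np.random.default_rng(3)
h, w, n_subjects = 24, 32, 15
fix_maps = np.zeros((n_subjects, h, w))
subject_fix = []
for s in range(n_subjects):
    # All synthetic subjects fixate within a shared "interesting" region.
    rows = rng.integers(8, 16, size=100)
    cols = rng.integers(10, 22, size=100)
    np.add.at(fix_maps[s], (rows, cols), 1)
    subject_fix.append(np.stack([rows, cols], axis=1))

baseline = np.full((h, w), 1.0 / (h * w))  # uniform baseline
gains = [information_gain(gold_standard(fix_maps, s), baseline, subject_fix[s])
         for s in range(n_subjects)]
```

Because the density is always estimated on the *other* subjects, the averaged `gains` reflect only structure that generalizes across subjects, mirroring the conservative estimate described above.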

## Generalization to Spatiotemporal Scanpaths

The models we consider in this paper are purely spatial: They do not include any temporal dependencies. A complete understanding of human fixation selection would require an understanding of spatiotemporal behavior, that is, scanpaths. The model adaptation and optimization procedure we describe above can be easily generalized to account for temporal effects, as follows.

A scanpath consists of \(N\) fixations with positions \(x_1, \dots, x_N\) and times \(t_1, \dots, t_N\), where \(N\) is part of the data distribution (not a fixed parameter). The quantity to model is the likelihood \(p(x_1, \dots, x_N, t_1, \dots, t_N \mid I)\), where \(I\) denotes the image for which the fixations should be predicted. By the chain rule, this is decomposed into conditional likelihoods \(p(x_i, t_i \mid x_1, \dots, x_{i-1}, t_1, \dots, t_{i-1}, I)\).

The above holds true for any 3D point process. In this way, the model comparison framework we propose in this paper is general in that it can account for spatiotemporal fixation dependencies (see ref. 12 for a recent application of spatiotemporal point processes to the study of scanpaths).
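A minimal sketch of the chain-rule decomposition, using a hypothetical conditional model that merely prefers locations near the previous fixation (purely illustrative; a real scanpath model would also condition on the image and fixation times):

```python
import numpy as np

def conditional_density(grid_shape, history):
    """Hypothetical conditional model p(x_i | x_1..x_{i-1}): uniform for the
    first fixation, then peaked around the previous fixation."""
    h, w = grid_shape
    rows, cols = np.mgrid[0:h, 0:w]
    if not history:
        density = np.ones((h, w))
    else:
        r0, c0 = history[-1]
        density = np.exp(-((rows - r0) ** 2 + (cols - c0) ** 2) / 50.0)
    return density / density.sum()

def scanpath_log_likelihood(scanpath, grid_shape=(24, 32)):
    """Chain rule: the scanpath log-likelihood is the sum of the
    conditional log-likelihoods log2 p(x_i | x_1..x_{i-1})."""
    ll = 0.0
    for i, (r, c) in enumerate(scanpath):
        density = conditional_density(grid_shape, scanpath[:i])
        ll += np.log2(density[r, c])
    return ll

ll = scanpath_log_likelihood([(12, 16), (13, 17), (11, 15)])
```

Under this toy model, a scanpath of short saccades is more likely (higher log-likelihood) than one containing a long jump.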

## Acknowledgments

We thank Lucas Theis for his suggestions and Eleonora Vig and Benjamin Vincent for helpful comments on an earlier draft of this manuscript. We acknowledge funding from the Deutsche Forschungsgemeinschaft (DFG) through the priority program 1527, research Grant BE 3848/2-1. T.S.A.W. was supported by a Humboldt Postdoctoral Fellowship from the Alexander von Humboldt Foundation. We further acknowledge support from the DFG through the Werner-Reichardt Centre for Integrative Neuroscience (EXC307) and from the BMBF through the Bernstein Center for Computational Neuroscience (FKZ: 01GQ1002).

## Footnotes

- ↵
^{1}To whom correspondence should be addressed. Email: matthias.kuemmerer@bethgelab.org.

Author contributions: M.K., T.S.A.W., and M.B. designed research; M.K. performed research; M.K. analyzed data; and M.K., T.S.A.W., and M.B. wrote the paper.

The authors declare no conflict of interest.

This article is a PNAS Direct Submission.

This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.1073/pnas.1510393112/-/DCSupplemental.

## References

- ↵
- ↵
- ↵
- ↵.
- Borji A,
- Tavakoli HR,
- Sihite DN,
- Itti L

*Proceedings of the 2013 IEEE International Conference on Computer Vision*(IEEE Computer Society, Washington, DC), pp 921–928 - ↵.
- Zhang J,
- Sclaroff S

- ↵.
- Bruce ND,
- Wloka C,
- Frosst N,
- Rahman S,
- Tsotsos JK

- ↵.
- Riche N,
- Duvinage M,
- Mancas M,
- Gosselin B,
- Dutoit T

*Proceedings of the 2013 IEEE International Conference on Computer Vision*(IEEE Computer Society, Washington, DC), pp 1153–1160 - ↵
- ↵.
- Shannon CE,
- Weaver W

- ↵.
- Barthelmé S,
- Trukenbrod H,
- Engbert R,
- Wichmann F

- ↵.
- Kümmerer M,
- Theis L,
- Bethge M

- ↵.
- Engbert R,
- Trukenbrod HA,
- Barthelmé S,
- Wichmann FA

- ↵.
- Bernardo JM

- ↵.
- Tatler BW,
- Vincent BT

- ↵
- ↵.
- Tatler BW,
- Hayhoe MM,
- Land MF,
- Ballard DH

- ↵
- ↵
- ↵
- ↵
- ↵
- ↵
- ↵.
- Judd T,
- Ehinger K,
- Durand F,
- Torralba A

- ↵
- ↵.
- Judd T,
- Durand F,
- Torralba A

*A Benchmark of Computational Models of Saliency to Predict Human Fixations*. Cambridge, MA: MIT Computer Science and Artificial Intelligence Laboratory; 2012. Report No.: MIT-CSAIL-TR-2012-001 - ↵
- ↵.
- Harel J,
- Koch C,
- Perona P

- ↵
- ↵.
- Kienzle W,
- Wichmann FA,
- Schölkopf B,
- Franz MO

- ↵.
- Hou X,
- Zhang L

- ↵.
- Bruce N,
- Tsotsos J

*J Vis*9(3): 1–24 - ↵.
- Goferman S,
- Zelnik-Manor L,
- Tal A

- ↵
- ↵.
- Erdem E,
- Erdem A

- ↵
- ↵.
- Zhang J,
- Sclaroff S

- ↵.
- Vig E,
- Dorr M,
- Cox D

*Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition*(IEEE Computer Society, Washington, DC), pp 2798–2805 - ↵Jones E, Oliphant T, Peterson P, others (2001) SciPy: Open source scientific tools for Python. Available at www.scipy.org/. Accessed November 24, 2014.
- ↵
- ↵.
- Itti L,
- Baldi PF

- .
- Itti L,
- Baldi P

*Proceedings of the 2005 IEEE Conference on Computer Vision and Patter Recognition*(IEEE Computer Society, Washington, DC) Vol 1, pp 631–637 - .
- Baldi P,
- Itti L

- .
- Wang W,
- Wang Y,
- Huang Q,
- Gao W

- .
- Rajashekar U,
- Cormack LK,
- Bovik AC

- .
- Engbert R,
- Trukenbrod HA,
- Barthelmé S,
- Wichmann FA
