An ecologically motivated image dataset for deep learning yields better models of human vision

Significance Inspired by core principles of information processing in the brain, deep neural networks (DNNs) have demonstrated remarkable success in computer vision applications. At the same time, networks trained on the task of object classification exhibit similarities to representations found in the primate visual system. This result is surprising because the datasets commonly used for training are designed to be engineering challenges. Here, we use linguistic corpus statistics and human concreteness ratings as guiding principles to design a resource that more closely mirrors categories that are relevant to humans. The result is ecoset, a collection of 1.5 million images from 565 basic-level categories. We show that ecoset-trained DNNs yield better models of human higher-level visual cortex and human behavior.


Fig. S1 | Training on ecoset rather than ILSVRC 2012 improves the alignment between DNN representations and human HVC. Same data as in Figure 2 of the main manuscript; results for dataset 1 are shown on the left, dataset 2 on the right. Here we show the performance of individual network instances and focus on relative network depth rather than on individual layers. AlexNet and vNet results are shown separately for better visibility.

Supplementary text: Category-specific RDM analysis
To better understand why ecoset-trained networks exhibit improved alignment with representations in HVC, we extended our previous analyses by running them in an image-specific way. For this, we separately computed the Spearman correlation between each column of the brain RDMs and the corresponding column of the DNN RDMs. This column-based analysis tests the agreement between the representational distances of a given stimulus and the set of all other stimuli. As a summary statistic, we averaged the correlation coefficients of all stimuli belonging to a given superordinate category (animate/inanimate), and subsequently of the subcategories within the superordinate ones (human/animal within animate, and natural/man-made within inanimate). We then tested whether ecoset-trained DNNs yield higher overall correlations for stimuli of a given category by performing a permutation test in which we shuffled the dataset labels across network instances (10,000 iterations for vNet and all 252 possible permutations for AlexNet v2). To control the family-wise error rate, we used a Bonferroni correction for the number of tests performed per network and dataset. This analysis is inspired by Kriegeskorte et al. 2008 (see Figure S2). Note that the diagonal of the RDMs was excluded, as it is zero by definition.
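As a minimal sketch of the column-wise comparison described above: the function below Spearman-correlates matching RDM columns while excluding the zero diagonal, then averages the per-stimulus coefficients within each category. The variable names (brain_rdm, dnn_rdm, category_labels) are illustrative assumptions, not the code used for the paper.

```python
import numpy as np
from scipy.stats import spearmanr

def columnwise_rdm_correlation(brain_rdm, dnn_rdm):
    """Spearman-correlate matching columns of two (n x n) RDMs.

    The diagonal entry of each column is excluded, as it is zero
    by definition. Returns one correlation per stimulus.
    """
    n = brain_rdm.shape[0]
    rhos = np.empty(n)
    for i in range(n):
        mask = np.arange(n) != i  # drop the diagonal entry
        rhos[i], _ = spearmanr(brain_rdm[mask, i], dnn_rdm[mask, i])
    return rhos

def category_means(rhos, category_labels):
    """Average per-stimulus correlations within each superordinate category."""
    return {c: rhos[category_labels == c].mean()
            for c in np.unique(category_labels)}
```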
To test whether the effects of ecoset training were larger for animate than for inanimate objects, we computed the effect size of ecoset training for each pair of networks (each pair initialized with the same random seed, but with one network trained on ecoset and the other on ILSVRC), separately for stimuli showing animate and inanimate objects. This provided an effect-size estimate for each network pair and superordinate object category (animate/inanimate), enabling us to statistically test for an interaction effect. The latter was assessed using a permutation test, as detailed above.
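The interaction test could be sketched as follows, under the assumption that shuffling dataset labels within a seed-matched pair amounts to flipping the sign of that pair's animate-minus-inanimate effect difference. All names and the exact permutation scheme are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def interaction_permutation_test(effect_animate, effect_inanimate, n_iter=10_000):
    """One-sided permutation test for an animacy-by-dataset interaction.

    effect_animate, effect_inanimate : per-pair effect sizes of ecoset
    training, shape (n_pairs,), one value per seed-matched network pair.
    """
    diffs = effect_animate - effect_inanimate  # interaction per network pair
    observed = diffs.mean()
    null = np.empty(n_iter)
    for i in range(n_iter):
        # Swapping ecoset/ILSVRC labels within a pair flips both effects,
        # and hence the sign of their difference.
        signs = rng.choice([-1, 1], size=diffs.size)
        null[i] = (signs * diffs).mean()
    return (np.sum(null >= observed) + 1) / (n_iter + 1)
```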

Fig. S3 | Category-specific RDM analysis. (A) We tested several superordinate object categories for the effect of training the networks on ecoset rather than ILSVRC. Instead of correlating whole RDMs, as in the main analysis, we computed the Spearman correlation coefficient for each RDM column separately (image-specific RDM analysis). We then averaged the coefficients across all stimuli belonging to a given superordinate category. This analysis revealed that the benefits of training on ecoset are strongest for animate objects (including their relations among themselves, as well as to inanimate objects). This was true for both network architectures and both datasets tested (results for dataset 2 are shown in panel (B)).

Fig. S4 | DNN alignment with perceptual similarity judgments.
To ensure that the benefits of ecoset training are not due to differences between the datasets in the number of categories or images, we trained two sets of networks on trimmed versions of ILSVRC and ecoset, respectively, and compared the network RDMs with RDMs obtained from perceptual similarity judgments. Consistent with our main results, we observe, in higher-level network layers, significantly better alignment for DNNs trained on trimmed ecoset (shown in black) than for DNNs trained on trimmed ILSVRC (shown in grey; permutation test, p < 0.01, Bonferroni corrected for the number of network layers).
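A hedged sketch of the layer-wise comparison: given per-instance correlations of network RDMs with the behavioral RDM (shape n_instances x n_layers, one array per training dataset), the test below exhaustively relabels instances across datasets per layer, matching the 252 permutations available for two groups of five instances, and applies a Bonferroni correction across layers. Array names and shapes are assumptions.

```python
import numpy as np
from itertools import combinations

def layerwise_permutation_test(ecoset_corrs, ilsvrc_corrs):
    """Exhaustive label-shuffling test per layer, Bonferroni corrected.

    ecoset_corrs, ilsvrc_corrs : RDM correlations with the behavioral RDM,
    shape (n_instances, n_layers) each.
    """
    n_e, n_layers = ecoset_corrs.shape
    pooled = np.vstack([ecoset_corrs, ilsvrc_corrs])
    n_total = pooled.shape[0]
    p_values = np.empty(n_layers)
    for layer in range(n_layers):
        observed = ecoset_corrs[:, layer].mean() - ilsvrc_corrs[:, layer].mean()
        null = []
        # Enumerate every way of assigning n_e instances to the "ecoset" group.
        for idx in combinations(range(n_total), n_e):
            sel = np.zeros(n_total, dtype=bool)
            sel[list(idx)] = True
            null.append(pooled[sel, layer].mean() - pooled[~sel, layer].mean())
        p_values[layer] = np.mean(np.asarray(null) >= observed)
    return p_values * n_layers  # Bonferroni correction over network layers
```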