A feedforward architecture accounts for rapid categorization

Communicated by Richard M. Held, Massachusetts Institute of Technology, Cambridge, MA, January 26, 2007 (received for review November 11, 2006)
Serre et al. 10.1073/pnas.0700622104.
Supporting Information
Files in this Data Supplement:
SI Text
SI Table 1
SI Figure 4
SI Figure 5
SI Figure 6
SI Figure 7
SI Figure 8
SI Figure 9
SI Table 2
SI Figure 4
Fig. 4. The role of units in different areas. Comparison between different layers of the model on the animal vs. nonanimal categorization task. The poor performance of the model after lesioning V4 is likely due to the resulting decrease in invariance to position and scale. The "bypass route only" condition corresponds to an implementation of the model in which V4 was lesioned; the "direct route" condition corresponds to an implementation of the model in which the route from V2 to the posterior inferotemporal cortex (bypassing V4) was lesioned. The performance of each model implementation was obtained with n = 10 random splits.
SI Figure 5
Fig. 5. The effect of image orientation. A comparison between the performance (d') of the human observers (Left, n = 14) and the model (Right) in three experimental conditions: upright, at 90° rotation, and inverted (180° rotation). Human observers and the model are similarly robust to image rotations.
SI Figure 6
Fig. 6. A comparison between the model and human observers (hit rates) with different mask conditions. The upper and lower bounds on human-level performance (n = 21) are given by the no-mask and the immediate-mask conditions, respectively. The average accuracy (percent correct) of human observers in the 20-ms SOA, 50-ms SOA, 80-ms SOA, and no-mask conditions was 59%, 79%, 86%, and 91%, respectively (all significantly above chance; t test, P < 0.01), compared to 82% for the model (18% false alarms). The model matches human observers for SOAs between 50 ms and 80 ms. Error bars indicate the standard error and are not directly comparable for the model (computed over n = 20 random runs) and for humans (computed over n = 21 observers).
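The sensitivity index d' reported in Fig. 5 is conventionally computed from a hit rate and a false-alarm rate as d' = z(H) - z(F), where z is the inverse of the standard normal cumulative distribution. The following is a minimal sketch of that standard formula, not the authors' analysis code. The function name `d_prime` is hypothetical, and using 82% as the model's hit rate alongside the 18% false-alarm rate quoted in the Fig. 6 caption is an assumption made purely for illustration.

```python
from statistics import NormalDist


def d_prime(hit_rate: float, false_alarm_rate: float) -> float:
    """Sensitivity index d' = z(H) - z(F).

    z is the inverse CDF of the standard normal distribution.
    Both rates must lie strictly between 0 and 1.
    """
    z = NormalDist().inv_cdf
    return z(hit_rate) - z(false_alarm_rate)


# Illustrative values only (assumption): 82% hits, 18% false alarms.
model_d = d_prime(0.82, 0.18)
print(round(model_d, 2))
```

Because d' separates sensitivity from response bias, it allows conditions with different false-alarm rates to be compared on a common scale, which raw percent correct does not.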
SI Figure 7
Fig. 7. An estimate of the timing of feedback loops in the ventral stream of primates (based on refs. 47 and 48). We assume that typical latency from one stage to the next is ≈10–20 ms and that feedforward and back projections have similar conduction times (45). The first number corresponds to latencies for monkeys and is assumed to constitute a lower bound on the latencies for humans. The second number corresponds to an additional 50% and is assumed to constitute a "typical" number for humans (S. Thorpe, personal communication).
SI Figure 8
Fig. 8. A close-up view of the model from the S_{1} to C_{2} stages. The input image (gray value) is first analyzed by an array of functionally organized S_{1} units at all locations and several scales. At the next C_{1} stage, a local max-pooling operation is taken over retinotopically organized S_{1} units with the same preferred orientation and at neighboring positions and scales to increase invariance to 2D transformations. S_{2} units then combine the response of several C_{1} units at different preferred orientations to increase the complexity of the optimal stimulus with a tuning operation and are selective for features of moderate complexity (53) (examples shown in yellow). Although only one type of S_{2} unit is shown, by considering different combinations (learned from natural images) of C_{1} units in the model, we obtain K_{S2} ≈ 2,000 different types of S_{2} units. S_{2} units are also organized in feature maps such that every location in the visual field is analyzed by all K_{S2} types of S_{2} units at different scales. A local max-pooling operation is performed over S_{2} units with the same selectivity over neighboring positions and scales to yield the C_{2} unit responses. C_{2} units have been shown to match well with the tuning and invariance properties of cells in V4 (see ref. 4, pp. 28–36) in response to different stimulus sets (17–19).
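The C_{1} pooling step described in the Fig. 8 caption (a local maximum over S_{1} responses of the same orientation at neighboring positions and scales) can be sketched as follows. This is an illustrative toy, not the authors' implementation: the function name `c1_max_pool`, the 2 × 2 pooling window, the stride, and the response values are all assumptions chosen to keep the example small.

```python
def c1_max_pool(s1_maps, pool_size, stride):
    """Max-pool S1 responses over neighboring scales and positions.

    s1_maps: list of 2D lists, one per adjacent scale, all for the same
             preferred orientation and all of the same shape.
    Returns a smaller 2D list of pooled (C1-like) responses.
    """
    rows, cols = len(s1_maps[0]), len(s1_maps[0][0])

    # Step 1: maximum across scales at every position.
    across_scales = [
        [max(m[r][c] for m in s1_maps) for c in range(cols)]
        for r in range(rows)
    ]

    # Step 2: maximum over each local spatial neighborhood.
    pooled = []
    for r0 in range(0, rows - pool_size + 1, stride):
        pooled_row = []
        for c0 in range(0, cols - pool_size + 1, stride):
            pooled_row.append(max(
                across_scales[r][c]
                for r in range(r0, r0 + pool_size)
                for c in range(c0, c0 + pool_size)
            ))
        pooled.append(pooled_row)
    return pooled


# Two toy S1 maps (same orientation, adjacent scales); values are made up.
scale_a = [[0.1, 0.2, 0.0, 0.0],
           [0.0, 0.9, 0.0, 0.3],
           [0.0, 0.0, 0.5, 0.0],
           [0.4, 0.0, 0.0, 0.0]]
scale_b = [[0.0, 0.0, 0.7, 0.0],
           [0.0, 0.1, 0.0, 0.0],
           [0.0, 0.0, 0.0, 0.0],
           [0.0, 0.6, 0.0, 0.2]]

c1 = c1_max_pool([scale_a, scale_b], pool_size=2, stride=2)
print(c1)  # [[0.9, 0.7], [0.6, 0.5]]
```

Taking a maximum rather than a sum means the pooled response does not change when the strongest S_{1} response moves within the pooling neighborhood or switches to the adjacent scale, which is the sense in which this stage increases invariance to position and scale.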
SI Figure 9
Fig. 9. The population of S_{1} units (corresponding to simple cells in primary visual cortex): 2 phases × 4 orientations × 17 sizes (or, equivalently, peak frequencies). Only units at one phase are shown, but the population also includes filters of the opposite phase. Receptive field sizes range between 0.2° and 1.1° (typical values for cortex range between ≈0.1° and 1°; see refs. 10 and 11). Peak frequencies are in the range between 1.6 and 9.8 cycles/deg.
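As a quick check on the size of the S_{1} population described in the Fig. 9 caption, the sketch below enumerates one unit type per phase, orientation, and size combination. Only the counts (2, 4, 17) and the 0.2° to 1.1° receptive-field range come from the caption. The specific orientation angles, the phase labels, and the linear spacing of sizes are assumptions made for illustration.

```python
from itertools import product

PHASES_DEG = (0, 180)                 # assumption: labels for the two opposite phases
ORIENTATIONS_DEG = (0, 45, 90, 135)   # assumption: four evenly spaced orientations
N_SIZES = 17                          # from the caption
MIN_RF_DEG, MAX_RF_DEG = 0.2, 1.1     # receptive-field range from the caption

# Assumption: sizes are spaced linearly between the two stated extremes.
step = (MAX_RF_DEG - MIN_RF_DEG) / (N_SIZES - 1)
rf_sizes_deg = [MIN_RF_DEG + i * step for i in range(N_SIZES)]

# One entry per S1 unit type: (phase, orientation, receptive-field size).
s1_bank = list(product(PHASES_DEG, ORIENTATIONS_DEG, rf_sizes_deg))
print(len(s1_bank))  # 2 * 4 * 17 = 136
```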