A mathematical theory of semantic development in deep neural networks
Edited by Terrence J. Sejnowski, Salk Institute for Biological Studies, La Jolla, CA, and approved April 9, 2019 (received for review December 6, 2018)

Significance
Over the course of development, humans learn myriad facts about items in the world, and naturally group these items into useful categories and structures. This semantic knowledge is essential for diverse behaviors and inferences in adulthood. How is this richly structured semantic knowledge acquired, organized, deployed, and represented by neuronal networks in the brain? We address this question by studying how the nonlinear learning dynamics of deep linear networks acquires information about complex environmental structures. Our results show that this deep learning dynamics can self-organize emergent hidden representations in a manner that recapitulates many empirical phenomena in human semantic development. Such deep networks thus provide a mathematically tractable window into the development of internal neural representations through experience.
Abstract
An extensive body of empirical research has revealed remarkable regularities in the acquisition, organization, deployment, and neural representation of human semantic knowledge, thereby raising a fundamental conceptual question: What are the theoretical principles governing the ability of neural networks to acquire, organize, and deploy abstract knowledge by integrating across many individual experiences? We address this question by mathematically analyzing the nonlinear dynamics of learning in deep linear networks. We find exact solutions to this learning dynamics that yield a conceptual explanation for the prevalence of many disparate phenomena in semantic cognition, including the hierarchical differentiation of concepts through rapid developmental transitions, the ubiquity of semantic illusions between such transitions, the emergence of item typicality and category coherence as factors controlling the speed of semantic processing, changing patterns of inductive projection over development, and the conservation of semantic similarity in neural representations across species. Thus, surprisingly, our simple neural model qualitatively recapitulates many diverse regularities underlying semantic development, while providing analytic insight into how the statistical structure of an environment can interact with nonlinear deep-learning dynamics to give rise to these regularities.
Human cognition relies on a rich reservoir of semantic knowledge, enabling us to organize and reason about our complex sensory world (1–4). This semantic knowledge allows us to answer basic questions from memory (e.g., “Do birds have feathers?”) and relies fundamentally on neural mechanisms that can organize individual items, or entities (e.g., canary or robin), into higher-order conceptual categories (e.g., birds) that include items with similar features, or properties. This knowledge of individual entities and their conceptual groupings into categories or other ontologies is not present in infancy, but develops during childhood (1, 5), and in adults, it powerfully guides inductive generalizations.
The acquisition, organization, deployment, and neural representation of semantic knowledge has been intensively studied, yielding many well-documented empirical phenomena. For example, during acquisition, broader categorical distinctions are generally learned before finer-grained distinctions (1, 5), and periods of relative stasis can be followed by abrupt conceptual reorganization (6, 7). Intriguingly, during these periods of developmental stasis, children can believe illusory, incorrect facts about the world (2).
Also, many psychophysical studies of performance in semantic tasks have revealed empirical regularities governing the organization of semantic knowledge. In particular, category membership is a graded quantity, with some items being more or less typical members of a category (e.g., a sparrow is a more typical bird than a penguin). Item typicality is reproducible across individuals (8, 9) and correlates with performance on a diversity of semantic tasks (10–14). Moreover, certain categories themselves are thought to be highly coherent (e.g., the set of things that are dogs), in contrast to less coherent categories (e.g., the set of things that are blue). More coherent categories play a privileged role in the organization of our semantic knowledge; coherent categories are the ones that are most easily learned and represented (8, 15, 16). Also, the organization of semantic knowledge powerfully guides its deployment in novel situations, where one must make inductive generalizations about novel items and properties (2, 3). Indeed, studies of children reveal that their inductive generalizations systematically change over development, often becoming more specific with age (2, 3, 17–19).
Finally, recent neuroscientific studies have probed the organization of semantic knowledge in the brain. The method of representational similarity analysis (20, 21) revealed that the similarity structure of cortical activity patterns often reflects the semantic similarity structure of stimuli (22, 23). And, strikingly, such neural similarity structure is preserved across humans and monkeys (24, 25).
This wealth of empirical phenomena raises a fundamental conceptual question about how neural circuits, upon experiencing many individual encounters with specific items, can, over developmental time scales, extract abstract semantic knowledge consisting of useful categories that can then guide our ability to reason about the world and inductively generalize. While several theories have been advanced to explain semantic development, there is currently no analytic, mathematical theory of neural circuits that can account for the diverse phenomena described above. Interesting nonneural accounts for the discovery of abstract semantic structure include the conceptual “theory-theory” (2, 17, 18) and computational Bayesian (26) approaches. However, neither currently proposes a neural implementation that can infer abstract concepts from a stream of examples. Also, they hold that specific domain theories or a set of candidate structural forms must be available a priori for learning to occur. In contrast, much prior work has shown, through simulations, that neural networks can gradually extract semantic structure by incrementally adjusting synaptic weights via error-corrective learning (4, 27–32). However, the theoretical principles governing how even simple artificial neural networks extract semantic knowledge from their ongoing stream of experience, embed this knowledge in their synaptic weights, and use these weights to perform inductive generalization, remain obscure.
In this work, our goal is to fill this gap by using a simple class of neural networks—namely, deep linear networks. Surprisingly, this model class can learn a wide range of distinct types of structure without requiring either initial domain theories or a prior set of candidate structural forms, and accounts for a diversity of phenomena involving semantic cognition described above. Indeed, we build upon a considerable neural network literature (27–32) addressing such phenomena through simulations of more complex nonlinear networks. We build particularly on the integrative, simulation-based treatment of semantic cognition in ref. 4, often using the same modeling strategy in a simpler linear setting, to obtain similar results but with additional analytical insight. In contrast to prior work, whether conceptual, Bayesian, or connectionist, our simple model permits exact analytical solutions describing the entire developmental trajectory of knowledge acquisition and organization and its subsequent impact on the deployment and neural representation of semantic structure. In the following, we describe semantic knowledge acquisition, organization, deployment, and neural representation in sequence, and we summarize our main findings in Discussion.
A Deep Linear Neural Network Model
Here, we consider a framework for analyzing how neural networks extract semantic knowledge by integrating across many individual experiences of items and their properties, across developmental time. In each experience, given an item as input, the network is trained to produce its associated properties or features as output. Consider, for example, the network’s interaction with the semantic domain of living things, schematized in Fig. 1A. If the network encounters an item, such as a canary, perceptual neural circuits produce an activity vector x that identifies the item and serves as the input to the semantic system.
(A) During development, the network experiences sequential episodes with items and their properties. (B) After each episode, the network adjusts its synaptic weights to reduce the discrepancy between actual observed properties y and predicted properties ŷ.
The network’s task is to predict an item’s properties y from its perceptual representation x. These predictions are generated by propagating activity through a three-layer linear neural network (Fig. 1B). The input activity pattern x in the first layer propagates through a first synaptic weight matrix to a hidden layer, and then through a second weight matrix to the output layer, yielding the predicted properties ŷ.
To study the impact of depth, we will contrast the learning dynamics of this deep linear network to that of a shallow network that has just a single weight matrix mapping the input directly to the output features.
To illustrate the power of deep linear networks to capture learning dynamics, even in nonlinear networks, we compare the two learning dynamics in Fig. 2. Fig. 2A shows a low-dimensional visualization of the simulated learning dynamics of a multilayered nonlinear neural network trained to predict the properties of a set of items in a semantic domain of animals and plants (for details of the neural architecture and training data see ref. 4). The nonlinear network exhibits a striking, hierarchical, progressive differentiation of structure in its internal hidden representations, in which animals vs. plants are first distinguished, then birds vs. fish and trees vs. flowers, and, finally, individual items. This remarkable phenomenon raises important questions about the theoretical principles governing the hierarchical differentiation of structure in neural networks: How and why do the network’s dynamics and the statistical structure of the input conspire to generate this phenomenon? In Fig. 2B, we mathematically derive this phenomenon by finding analytic solutions to the nonlinear dynamics of learning in a deep linear network, when that network is exposed to a hierarchically structured semantic domain, thereby shedding theoretical insight onto the origins of hierarchical differentiation in a deep network. We present the derivation below, but for now, we note that the resemblance in Fig. 2 suggests that deep linear networks can form an excellent, analytically tractable model for shedding conceptual insight into the learning dynamics, if not the expressive power, of their nonlinear counterparts.
(A) A two-dimensional multidimensional scaling (MDS) visualization of the temporal evolution of internal representations, across developmental time, of a deep nonlinear neural network studied in ref. 4. Reprinted with permission from ref. 4. (B) An MDS visualization of analytically derived learning trajectories of the internal representations of a deep linear network exposed to a hierarchically structured domain.
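To make this comparison concrete, the following minimal NumPy sketch (our illustration; not the simulation code of ref. 4) trains a deep linear network by gradient descent on a small, hypothetical living-things dataset and tracks how the hidden representations of items pull apart over training. The items, features, network size, and learning rate are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy domain (not the dataset of ref. 4): 4 items x 7 features.
items = ["canary", "robin", "rose", "oak"]
Y = np.array([[1, 1, 1, 1],   # grow
              [1, 1, 0, 0],   # move
              [0, 0, 1, 1],   # roots
              [1, 0, 0, 0],   # sing
              [0, 1, 0, 0],   # red breast
              [0, 0, 1, 0],   # petals
              [0, 0, 0, 1]],  # bark
             dtype=float)
X = np.eye(4)                                             # one-hot item inputs

n_hidden, lr, steps = 16, 0.05, 4000
W1 = rng.normal(scale=1e-3, size=(n_hidden, 4))           # items -> hidden
W2 = rng.normal(scale=1e-3, size=(Y.shape[0], n_hidden))  # hidden -> features

d_within, d_between = [], []
for _ in range(steps):
    H = W1 @ X                          # hidden representations of all items
    E = Y - W2 @ H                      # prediction error on the features
    dW1, dW2 = W2.T @ E @ X.T, E @ H.T  # gradient-descent updates (squared error)
    W1 += lr * dW1
    W2 += lr * dW2
    d_within.append(np.linalg.norm(H[:, 0] - H[:, 1]))    # canary vs. robin
    d_between.append(np.linalg.norm(H[:, 0] - H[:, 2]))   # canary vs. rose

d_within, d_between = np.array(d_within), np.array(d_between)
print("animal/plant split half-complete at step",
      int(np.argmax(d_between > 0.5 * d_between[-1])))
print("canary/robin split half-complete at step",
      int(np.argmax(d_within > 0.5 * d_within[-1])))
# The broad animal/plant distinction emerges in the hidden layer well before
# the finer canary/robin distinction, as in the progressive differentiation
# of Fig. 2.
```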
Acquiring Knowledge
We now outline the derivation that leads to Fig. 2B. The incremental error-corrective process described above can be formalized as online stochastic gradient descent; each time an example i is presented, the weights of both layers are adjusted by a small amount in the direction that most rapidly reduces the squared error between the observed properties and the network’s predictions.
Explicit Solutions from Tabula Rasa.
These nonlinear dynamics are difficult to solve for arbitrary initial conditions. However, we are interested in a particular limit: learning from a state of essentially no knowledge, which we model as small random synaptic weights. To further ease the analysis, we shall assume that the influence of perceptual correlations is minimal (i.e., that the correlation matrix of the input patterns is close to the identity), so that learning is driven by the correlations between items and their properties.
(A) SVD of input–output correlations. Associations between items and their properties are decomposed into modes. Each mode links a set of properties (a column of U) with a set of items (a row of Vᵀ), with an overall association strength given by the corresponding singular value.
For example, the α’th column of V is an object-analyzer vector for mode α: its i’th element indicates how strongly item i participates in that semantic distinction.
The corresponding α’th column of U is a feature-synthesizer vector: its m’th element indicates how strongly feature m is associated with that distinction.
Given the SVD of the training set’s input–output correlation matrix in [4], we can now explicitly describe the network’s learning dynamics. The network’s overall input–output map at time t is a time-dependent version of this SVD (Fig. 3B); it shares the object-analyzer and feature-synthesizer matrices of the SVD of the input–output correlation matrix, but each singular value is replaced by an effective singular value that evolves over learning according to Eq. 6, rising from near zero to its final value.
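The structure of this decomposition is easy to inspect directly on a toy dataset. The sketch below (items, features, and their values are our illustrative assumptions) forms the input–output correlation matrix for one-hot item inputs, takes its SVD, and prints each mode's object analyzer and feature synthesizer; according to the solution above, the trained deep network's map shares exactly these singular vectors, with only the singular values evolving over time.

```python
import numpy as np

# Illustrative items and features (an assumption for this sketch, not the
# paper's exact training set).
items    = ["canary", "robin", "rose", "oak"]
features = ["grow", "move", "roots", "sing", "red breast", "petals", "bark"]
Y = np.array([[1, 1, 1, 1],      # grow
              [1, 1, 0, 0],      # move
              [0, 0, 1, 1],      # roots
              [1, 0, 0, 0],      # sing
              [0, 1, 0, 0],      # red breast
              [0, 0, 1, 0],      # petals
              [0, 0, 0, 1]],     # bark
             dtype=float)
X = np.eye(4)                    # one-hot item inputs
P = X.shape[1]                   # number of training examples

Sigma_yx = Y @ X.T / P           # input-output correlation matrix
U, S, Vt = np.linalg.svd(Sigma_yx, full_matrices=False)

# Each mode alpha links items (a row of Vt, the object analyzer) to features
# (a column of U, the feature synthesizer) with strength S[alpha].
for alpha, s in enumerate(S):
    print(f"mode {alpha}: singular value {s:.2f}")
    print("  object analyzer    :", dict(zip(items, np.round(Vt[alpha], 2))))
    print("  feature synthesizer:", dict(zip(features, np.round(U[:, alpha], 2))))

# Sanity check: the modes reconstruct the correlation matrix exactly.
assert np.allclose(U @ np.diag(S) @ Vt, Sigma_yx)
```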
This solution also gives insight into how the internal representations in the hidden layer of the deep network evolve. An exact solution for the hidden representation of each item can also be obtained (Eq. 8), and it is this solution that generates the developmental trajectories shown in Fig. 2B.
The shallow network has an analogous solution for its input–output map (Eq. 9), which we contrast with the deep solution below.
Rapid Stage-Like Transitions Due to Depth.
We first compare the time course of learning in deep vs. shallow networks as revealed in Eqs. 6 and 9. For the deep network, beginning from a small initial condition, each effective singular value remains close to zero for an extended period and then rises rapidly along a sigmoidal trajectory to its final value, so that each semantic distinction is acquired in a rapid, stage-like transition whose timing is set by the corresponding singular value: stronger distinctions, with larger singular values, are learned earlier.
By contrast, the shallow-network learning timescale depends only weakly on the singular values: every mode is acquired along the same gradual, exponential time course, with no dormant period and no stage-like transition.
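The contrast can be visualized by numerically integrating reduced per-mode dynamics. The sketch below uses a decoupled, balanced-weight reduction of the gradient-descent dynamics described above (a simplification we adopt for illustration; τ, the initial strength, and the singular values are assumed), in which each deep mode obeys tau·da/dt = 2a(s − a) and each shallow mode obeys tau·db/dt = s − b.

```python
import numpy as np

# Reduced per-mode learning dynamics (our illustrative simplification):
#   deep, balanced two-layer mode : tau * da/dt = 2 a (s - a)
#   shallow, one-layer mode       : tau * db/dt = s - b
tau, a0, dt, T = 1.0, 1e-4, 1e-3, 30.0
ts = np.arange(0.0, T, dt)

def integrate(deriv):
    """Simple Euler integration of a scalar mode strength from a small start."""
    a = np.empty_like(ts)
    a[0] = a0
    for i in range(1, len(ts)):
        a[i] = a[i - 1] + dt / tau * deriv(a[i - 1])
    return a

for s in (3.0, 1.0):
    deep = integrate(lambda a: 2 * a * (s - a))   # sigmoidal, long dormant phase
    shallow = integrate(lambda b: s - b)          # simple exponential approach
    t_deep = ts[np.argmax(deep >= s / 2)]
    t_shallow = ts[np.argmax(shallow >= s / 2)]
    print(f"s = {s}: deep half-learned at t = {t_deep:.2f}, "
          f"shallow half-learned at t = {t_shallow:.2f}")
# The deep network's learning time shrinks as the singular value s grows,
# while the shallow network's time course is nearly unchanged.
```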
Progressive Differentiation of Hierarchical Structure.
We are now almost ready to explain how we analytically derive the result in Fig. 2B. The only remaining ingredient is a mathematical description of the training data. Indeed, the numerical results in Fig. 2A arise from a toy training set, making it difficult to understand how structure in data drives learning. Here, we introduce a probabilistic generative model to reveal general principles of how statistical structure impacts learning. We begin with hierarchical structure, but subsequently show how diverse structures come to be learned by the network (compare Fig. 9).
We use a generative model corresponding to a branching diffusion process that mimics evolutionary dynamics to create an explicitly hierarchical dataset (SI Appendix). In the model, each feature diffuses down an evolutionary tree (Fig. 4A), with a small probability of mutating along each branch. The items lie at the leaves of the tree, and the generative process creates a hierarchical similarity matrix between items such that items with a more recent common ancestor on the tree are more similar to each other (Fig. 4B). We analytically compute the SVD of this dataset, and find that the object-analyzer vectors, which can be viewed as functions on the leaves of the tree in Fig. 4C, respect the hierarchical branches of the tree, with the larger singular values corresponding to broader distinctions. In Fig. 4A, we artificially label the leaves and branches of the evolutionary tree with organisms and categories that might reflect a natural realization of this evolutionary process.
Hierarchy and the SVD. (A) A domain of eight items with an underlying hierarchical structure. (B) The correlation matrix of the features of the items. (C) SVD of the correlations reveals semantic distinctions that mirror the hierarchical taxonomy. This is a general property of the SVD of hierarchical data. (D) The singular values of each semantic distinction reveal its strength in the dataset and control when it is learned.
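A small simulation of this kind of generative process makes the structure of the SVD visible. The sketch below (tree depth, mutation probability, and feature count are our assumptions, standing in for the branching diffusion process described in the text) diffuses random ±1 features down a three-level binary tree and then inspects the singular values and object analyzers of the resulting data.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative stand-in for the branching diffusion process in the text:
# each feature starts at the root as +1 or -1 and flips sign with probability
# eps on each branch of a 3-level binary tree whose 8 leaves are the items.
n_features, eps, levels = 5000, 0.1, 3

def sample_feature():
    vals = np.array([rng.choice([-1.0, 1.0])])   # root value of this feature
    for _ in range(levels):
        vals = np.repeat(vals, 2)                # pass value to both children
        flips = rng.random(vals.size) < eps      # occasional mutations
        vals[flips] *= -1
    return vals                                  # one value per leaf item

Y = np.stack([sample_feature() for _ in range(n_features)])   # features x items
U, S, Vt = np.linalg.svd(Y / np.sqrt(n_features), full_matrices=False)

print("singular values:", np.round(S, 2))        # a hierarchy of strengths
print("leading object analyzers:")
print(np.round(Vt[:4], 2))
# The strongest mode weights all eight items nearly uniformly, the second
# splits the two main branches 4 vs. 4, and subsequent modes differentiate
# items only within progressively smaller branches, mirroring Fig. 4C.
```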
Now, inserting the singular values in Fig. 4D (and SI Appendix) into the deep-learning dynamics in Eq. 6 to obtain the time-dependent singular values, and inserting these in turn into the solution for the hidden representations, yields the learning trajectories shown in Fig. 2B: distinctions with larger singular values, corresponding to the broader splits near the top of the tree, are learned before finer ones, producing the progressive, hierarchical differentiation of internal representations.
Illusory Correlations.
Another intriguing aspect of semantic development is that children sometimes attest to false beliefs [e.g., worms have bones (2)] they could not have learned through experience. These errors challenge simple associationist accounts of semantic development that predict a steady, monotonic accumulation of information about individual properties (2, 16, 17, 34). Yet, the network’s knowledge of individual properties exhibits complex, nonmonotonic trajectories over time (Fig. 5). The overall prediction for a property is a sum of contributions from each mode, where the contribution of mode α to an individual feature m for item i is the product of that mode’s feature-synthesizer component for m, its current effective singular value, and its object-analyzer component for i.
Illusory correlations during learning. (A) Predicted value (blue) of feature “can fly” for item “salmon” during learning in a deep network (dataset as in Fig. 3). The contributions from each mode are shown in red. (B) The predicted value and modes for the same feature in a shallow network.
Indeed, any property–item combination for which different modes contribute with opposite signs will follow a nonmonotonic trajectory: an early, strongly learned mode can transiently push the prediction toward an incorrect value, yielding an illusory correlation, before later, weaker modes correct it. In the shallow network, by contrast, all modes are learned on the same time course, so predictions approach their final values monotonically and no such transient illusions arise (Fig. 5B).
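This rise and fall of a specific belief can be reproduced in the toy setting used earlier. In the sketch below (dataset, tracked item-feature pair, and hyperparameters are illustrative assumptions), the deep network transiently asserts that an oak has petals, a fact absent from its training data, before later modes correct the error.

```python
import numpy as np

rng = np.random.default_rng(0)

# Same illustrative toy domain as above (our assumption, not the paper's data).
items = ["canary", "robin", "rose", "oak"]
Y = np.array([[1, 1, 1, 1],      # grow
              [1, 1, 0, 0],      # move
              [0, 0, 1, 1],      # roots
              [1, 0, 0, 0],      # sing
              [0, 1, 0, 0],      # red breast
              [0, 0, 1, 0],      # petals
              [0, 0, 0, 1]],     # bark
             dtype=float)
X = np.eye(4)
PETALS, OAK = 5, 3               # indices of the tracked feature and item

n_hidden, lr, steps = 16, 0.05, 4000
W1 = rng.normal(scale=1e-3, size=(n_hidden, 4))
W2 = rng.normal(scale=1e-3, size=(Y.shape[0], n_hidden))

trajectory = []
for _ in range(steps):
    H = W1 @ X
    E = Y - W2 @ H
    W1, W2 = W1 + lr * (W2.T @ E @ X.T), W2 + lr * (E @ H.T)
    trajectory.append((W2 @ W1 @ X)[PETALS, OAK])    # current belief: oak has petals?

trajectory = np.array(trajectory)
print("true value  :", Y[PETALS, OAK])               # 0.0
print("peak belief :", round(float(trajectory.max()), 2))
print("final belief:", round(float(trajectory[-1]), 2))
# Early, broad modes (living thing, plant) push the prediction upward before
# the later rose-vs-oak mode pulls it back to zero: a transient illusory
# correlation, as in Fig. 5A.
```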
Organizing and Encoding Knowledge
We now turn from the dynamics of learning to its final outcome. When exposed to a variety of items and features interlinked by an underlying hierarchy, for instance, what categories naturally emerge? Which items are particularly representative of a categorical distinction? And how is the structure of the domain internally represented?
Category Membership, Typicality, and Prototypes.
A long-observed empirical finding is that category membership is not simply a binary variable, but rather a graded quantity, with some objects being more or less typical members of a category (e.g., a sparrow is a more typical bird than a penguin) (8, 9). Indeed, graded judgments of category membership correlate with performance on a range of tasks: Subjects more quickly verify the category membership of more typical items (10, 11), more frequently recall typical examples of a category (12), and more readily extend new information about typical items to all members of a category (13, 14). Our theory begins to provide a natural mathematical definition of item typicality that explains how it emerges from the statistical structure of the environment and improves task performance. We note that all results in this section apply specifically to data generated by binary trees as in Fig. 4A, which exhibit a one-to-one correspondence between singular dimensions of the SVD and individual categorical distinctions (Fig. 4C). The theory characterizes the typicality of items with respect to individual categorical distinctions.
A natural notion of the typicality of an item i for a categorical distinction α is simply the corresponding component of that distinction’s object-analyzer vector: the more strongly an item participates in a semantic mode, the more typical it is of the associated category, and the better the network’s performance on semantic tasks involving that item (Eq. 12).
Several previous attempts at defining the typicality of an item involved computing a weighted sum of category-specific features present or absent in the item (8, 15, 35–37). For instance, a sparrow is a more typical bird than a penguin because it shares more relevant features (can fly) with other birds. However, the specific choice of which features are relevant—the weights in the weighted sum of features—has often been heuristically chosen and relied on prior knowledge of which items belong to each category (8, 36). Our definition of typicality can also be described in terms of a weighted sum of an object’s features, but the weightings are uniquely fixed by the statistics of the entire environment through the SVD, as expressed in [13] (SI Appendix).
The geometry of item typicality. For a semantic distinction α (in this case, α is the bird–fish distinction) the object-analyzer vector assigns each item a coordinate along that distinction, and an item’s typicality corresponds to the projection of its feature vector onto the associated category prototype (the feature synthesizer).
In many theories of typicality, the particular weighting of object features corresponds to a prototypical object (3, 15), or the best example of a particular category. Such object prototypes are often obtained by a weighted average over the feature vectors for the objects in a category (averaging together the features of all birds, for instance, will give a set of features they share). However, such an average relies on prior knowledge of the extent to which an object belongs to a category. Our theory also yields a natural notion of object prototypes as a weighted average of object feature vectors, but, unlike many other frameworks, it yields a unique prescription for the object weightings in terms of environmental statistics through the SVD, as expressed in [14] (SI Appendix).
In summary, a beautifully simple duality between item typicality and category prototypes arises as an emergent property of the learned internal representations of the neural network. The typicality of an item with respect to dimension α is the projection of that item’s feature vector onto the category prototype in [13]. And the category prototype is an average over all object feature vectors, weighted by their typicality in [14]. Also, in any categorical distinction α, the most typical items i and the most important features m are determined by the extremal values of the object-analyzer and feature-synthesizer vectors, respectively.
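Both directions of this duality can be verified numerically in a few lines. The sketch below (toy data assumed as before) checks that the object-analyzer entries are recovered by projecting item feature vectors onto the prototype, and that the prototype is recovered as the typicality-weighted average of item feature vectors, in each case up to division by the singular value.

```python
import numpy as np

# Toy item-feature matrix as above (illustrative assumption).
Y = np.array([[1, 1, 1, 1], [1, 1, 0, 0], [0, 0, 1, 1], [1, 0, 0, 0],
              [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]], dtype=float)
P = Y.shape[1]
Sigma_yx = Y / P                          # one-hot inputs: Sigma_yx = Y / P
U, S, Vt = np.linalg.svd(Sigma_yx, full_matrices=False)

alpha = 1                                 # the animal/plant-like distinction here

# Typicality of each item for distinction alpha = its entry in the object
# analyzer, which equals the projection of its feature vector onto the
# feature synthesizer (the category prototype), divided by the singular value.
typicality = Vt[alpha]
assert np.allclose(typicality, (U[:, alpha] @ Sigma_yx) / S[alpha])

# Dually, the prototype is the typicality-weighted average of the items'
# feature vectors (again divided by the singular value).
prototype = U[:, alpha]
assert np.allclose(prototype, (Sigma_yx @ Vt[alpha]) / S[alpha])

print("typicalities:", np.round(typicality, 2))
print("prototype   :", np.round(prototype, 2))
```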
Category Coherence.
The categories we naturally learn are not arbitrary; they are coherent and efficiently represent the structure of the world (8, 15, 16). For example, the set of things that are red and cannot swim is a well-defined category but, intuitively, is not as coherent as the category of dogs; we naturally learn, and even name, the latter category, but not the former. When is a category learned at all, and what determines its coherence? An influential proposal (8, 15) suggested that coherent categories consist of tight clusters of items that share many features and that moreover are highly distinct from other categories with different sets of shared features. Such a definition, as noted in refs. 3, 16, and 17, can be circular: To know which items are category members, one must know which features are important for that category, and, conversely, to know which features are important, one must know which items are members. Thus, a mathematical definition of category coherence that is provably related to the learnability of categories by neural networks has remained elusive. Here, we provide such a definition for a simple model of disjoint categories and demonstrate how neural networks can cut through the Gordian knot of circularity. Our definition and theory are motivated by, and consistent with, prior network simulations exploring notions of category coherence through the coherent covariation of features (4).
Consider, for example, a dataset consisting of a large number of items that fall into a few disjoint categories, in which items belonging to the same category share weakly correlated features buried in substantial featural noise (Fig. 7A). For this setting, we define a category coherence C (Eq. 15) from the statistics of the category’s items and features and show (SI Appendix) that the network detects and learns the category only when C exceeds a threshold.
The discovery of disjoint categories in noise. (A) A dataset of N0 = 1,000 items and Nf = 1,600 features, with no discernible visible structure. (B) Yet when a deep linear network learns to predict the features of items, an MDS visualization of the evolution of its internal representations reveals three clusters. (C) By computing the SVD of the product of synaptic weights learned by the network, the recovered object analyzers reveal each item’s category membership.
This threshold is strikingly permissive: category structure far too weak to be discerned in the raw data (Fig. 7A) can still be reliably detected and learned by the network.
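The qualitative point, that category structure invisible to the eye can still be extracted, is easy to demonstrate; the quantitative threshold itself is derived in SI Appendix. The sketch below plants three hidden categories among 1,000 items and 1,600 noisy features (the sizes follow Fig. 7; the signal strength is our assumption) and shows that the leading object analyzers still separate the categories.

```python
import numpy as np

rng = np.random.default_rng(2)

# Planted-category dataset in the spirit of Fig. 7 (signal strength assumed):
# 1,000 items in 3 hidden categories, 1,600 features that each carry only a
# faint trace of category identity on top of unit-variance noise.
n_items, n_feats, n_cats, signal = 1000, 1600, 3, 0.15
labels = rng.integers(n_cats, size=n_items)              # hidden category of each item
cat_means = rng.normal(scale=signal, size=(n_feats, n_cats))
Y = cat_means[:, labels] + rng.normal(size=(n_feats, n_items))   # features x items

# Raw item-item correlations look like noise...
C = Y.T @ Y / n_feats
print("mean |off-diagonal correlation|:",
      round(float(np.abs(C - np.diag(np.diag(C))).mean()), 3))

# ...but the leading object analyzers (right singular vectors) still pick out
# the hidden categories.
_, _, Vt = np.linalg.svd(Y / np.sqrt(n_feats), full_matrices=False)
Z = Vt[:n_cats].T                          # 3-D embedding of the 1,000 items

within = np.mean([np.var(Z[labels == c], axis=0).sum() for c in range(n_cats)])
between = np.var([Z[labels == c].mean(axis=0) for c in range(n_cats)], axis=0).sum()
print("between/within-category spread in the learned embedding:",
      round(float(between / within), 1))   # substantially > 1: categories recovered
```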
Basic Categories.
Closely related to category coherence, a variety of studies have revealed a privileged role for basic categories at an intermediate level of specificity (e.g., bird), compared with superordinate (e.g., animal) or subordinate (e.g., robin) levels. At this basic level, people are quicker at learning names (38, 39), prefer to generate names (39), and are quicker to verify the presence of named items in images (11, 39). We note that basic-level advantages typically involve naming tasks done at an older age, and so need not be inconsistent with progressive differentiation of categorical structure from superordinate to subordinate levels as revealed in preverbal cognition (1, 4–7, 40). Moreover, some items are named more frequently than others, and these frequency effects could contribute to a basic-level advantage (4). However, in artificial category-learning experiments with tightly controlled frequencies, basic-level categories are still often learned first (41). What environmental statistics could lead to this effect? While several properties have been proposed (11, 35, 37, 41), a mathematical function of environmental structure that provably confers a basic-level advantage to neural networks has remained elusive.
Here, we provide such a function by generalizing the notion of category coherence C in the previous section to hierarchically structured categories. Indeed, in any dataset containing strong categorical structure, so that its singular vectors are in one-to-one correspondence with categorical distinctions, we simply propose to define the coherence of a category by the associated singular value. This definition has the advantage of obeying the theorem that more coherent categories are learned faster, through Eq. 6. Moreover, we show in SI Appendix that this definition is consistent with that of category coherence C defined in Eq. 15 for the special case of disjoint categories. However, for hierarchically structured categories, as in Fig. 4, this singular value definition always predicts an advantage for superordinate categories, relative to basic or subordinate. Is there an alternate statistical structure for hierarchical categories that confers high category coherence at lower levels in the hierarchy? We exhibit two such structures in Fig. 8. More generally, in SI Appendix, we analytically compute the singular values at each level of the hierarchy in terms of the similarity structure of items. We find that these singular values are a weighted sum of within-cluster similarity minus between-cluster similarity for all levels below, weighted by the fraction of items that are descendants of that level. If at any level, between-cluster similarity is negative, that detracts from the coherence of superordinate categories, contributes strongly to the coherence of categories at that level, and does not contribute to subordinate categories.
From similarity structure to category coherence. (A) A hierarchical similarity structure over objects in which categories at the basic level are very different from each other due to a negative similarity. (B) For this structure, basic-level categorical distinctions acquire larger singular values, or category coherence, and therefore gain an advantage in both learning and task performance. (C) Now, subordinate categories are very different from each other through negative similarity. (D) Consequently, subordinate categories gain a coherence advantage. See SI Appendix for formulas relating similarity structure to category coherence.
Thus, the singular value-based definition of category coherence is qualitatively consistent with prior intuitive notions. For instance, paraphrasing Keil (17), coherent categories are clusters of tight bundles of features separated by relatively empty spaces. Also, consistent with refs. 3, 16, and 17, we note that we cannot judge the coherence of a category without knowing about its relations to all other categories, as singular values are a complex emergent property of the entire environment. But going beyond past intuitive notions, our quantitative definition of category coherence based on singular values enables us to prove that coherent categories are most easily and quickly learned and also provably provide the most accurate and efficient linear representation of the environment, due to the global optimality properties of the SVD (see SI Appendix for details).
Capturing Diverse Domain Structures.
While we have focused thus far on hierarchical structure, the world may contain many different structure types. Can a wide range of such structures be learned and encoded by neural networks? To study this question, we formalize structure through probabilistic graphical models (PGMs), defined by a graph over items (Fig. 9, top) that can express a variety of structural forms, including clusters, trees, rings, grids, orderings, and hierarchies. Features are assigned to items by independently sampling from the PGM (ref. 26 and SI Appendix), such that nearby items in the graph are more likely to share features. For each of these forms, in the limit of a large number of features, we compute the item–item covariance matrices (Fig. 9, second row), object-analyzer vectors (Fig. 9, third row), and singular values of the input–output correlation matrix, and we use them in Eq. 6 to compute the development of the network’s internal representations through Eq. 8, as shown in Fig. 9, bottom. Overall, this approach yields several insights into how distinct structural forms, through their different statistics, drive learning in a deep network, as summarized below.
Representation of explicit structural forms in a neural network. Each column shows a different structure. The first four columns correspond to pure structural forms, while the final column has cross-cutting structure. First row: The structure of the data generating PGM. Second row: The resulting item-covariance matrix arising from either data drawn from the PGM (first four columns) or designed by hand (final column). Third row: The input-analyzing singular vectors that will be learned by the linear neural network. Each vector is scaled by its singular value, showing its importance to representing the covariance matrix. Fourth row: MDS view of the development of internal representations.
Clusters.
Graphs that break items into distinct clusters give rise to block-diagonal constant matrices, yielding object-analyzer vectors that pick out cluster membership.
Trees.
Tree graphs give rise to ultrametric covariance matrices, yielding object-analyzer vectors that are tree-structured wavelets that mirror the underlying hierarchy (42, 43).
Rings and grids.
Ring-structured graphs give rise to circulant covariance matrices, yielding object-analyzer vectors that are Fourier modes ordered from lowest to highest frequency (44).
Orderings.
Graphs that transitively order items yield highly structured, but nonstandard, covariance matrices whose object analyzers encode the ordering.
Cross-cutting structure.
Real-world domains need not have a single underlying structural form. For instance, while some features of animals and plants generally follow a hierarchical structure, other features, like male and female, can link together hierarchically disparate items. Such cross-cutting structure can be orthogonal to the hierarchical structure, yielding object-analyzer vectors that span hierarchical distinctions.
These results reflect an analytic link between two popular, but different, methods of capturing structure: PGMs and deep networks. This analysis transcends the particulars of any one dataset and shows how different abstract structures can become embedded in the internal representations of a deep neural network. Strikingly, the same generic network can accommodate all of these structure types, without requiring the set of possible candidate structures a priori.
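As one worked example of this correspondence, the sketch below builds the circulant covariance matrix of a ring-structured domain (ring size and similarity decay are our assumptions) and confirms that its leading eigenvectors, the object analyzers a linear network would acquire in the many-features limit, are Fourier modes ordered from low to high frequency.

```python
import numpy as np

# Ring-structured domain (illustrative): similarity between items falls off
# with circular distance, giving a circulant item-item covariance matrix.
N, rho = 16, 0.6
idx = np.arange(N)
d = np.abs(idx[:, None] - idx[None, :])
d = np.minimum(d, N - d)                     # distance around the ring
Sigma = rho ** d                             # circulant covariance

# Its leading eigenvectors (the object analyzers a linear network would learn
# in the many-features limit) are Fourier modes on the ring, ordered from low
# to high spatial frequency.
evals, evecs = np.linalg.eigh(Sigma)
order = np.argsort(evals)[::-1]              # strongest mode first
for rank in range(6):
    v = evecs[:, order[rank]]
    k = int(np.argmax(np.abs(np.fft.fft(v))))
    k = min(k, N - k)                        # fold aliased frequencies
    print(f"mode {rank}: eigenvalue {evals[order[rank]]:.2f}, frequency {k}")
# Expected pattern: frequency 0 first, then the paired frequency-1 modes,
# then frequency 2, and so on, matching the ring column of Fig. 9.
```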
Deploying Knowledge: Inductive Projection
Over the course of development, the knowledge children acquire powerfully reshapes their inductions upon encountering novel items and properties (2, 3). For instance, upon learning a novel fact (e.g., “a canary is warm-blooded”) children extend this new knowledge to related items, as revealed by their answers to questions like “is a robin warm-blooded?” Studies have shown that children’s answers to such questions change over the course of development (2, 3, 17–19), generally becoming more specific. For example, young children may project the novel property of warm-blooded to distantly related items, while older children will only project it to more closely related items. How could such changing patterns arise in a neural network? Here, building upon previous network simulations (4, 28), we show analytically that deep networks exposed to hierarchically structured data naturally yield progressively narrowing patterns of inductive projection across development.
Consider the act of learning that a familiar item has a novel feature (e.g., “a pine has property x”). To accommodate this knowledge, new synaptic weights must be chosen between the familiar item pine and the novel property x (Fig. 10A), without disrupting prior knowledge of items and their properties already stored in the network. This may be accomplished by adjusting only the weights from the hidden layer to the novel feature so as to activate it appropriately. With these new weights established, inductive projections of the novel feature to other familiar items (e.g., “does a rose have property x?”) naturally arise by querying the network with other inputs. If a novel property m is ascribed to a familiar item i, the inductive projection of this property to any other item j is determined by the overlap between the hidden representations of items i and j (Eq. 16 and SI Appendix).
The neural geometry of inductive generalization. (A) A novel feature (property x) is observed for a familiar item (e.g., “a pine has property x”). (B) Learning assigns the novel feature a neural representation in the hidden layer of the network that places it in semantic similarity space near the object which possesses the novel feature. The network then inductively projects that novel feature to other familiar items (e.g., “Does a rose have property x?”) only if their hidden representation is close in neural space. (C) A novel item (a blick) possesses a familiar feature (e.g., “a blick can move”). (D) Learning assigns the novel item a neural representation in the hidden layer that places it in semantic similarity space near the feature possessed by the novel item. Other features are inductively projected to that item (e.g., “Does a blick have wings?”) only if their hidden representation is close in neural space. (E) Inductive projection of a novel property (“a pine has property x”) over learning. As learning progresses, the neural representations of items become progressively differentiated, yielding progressively restricted projection of the novel feature to other items. Here, the pine can be thought of as the left-most item node in the tree.
A parallel situation arises upon learning that a novel item possesses a familiar feature (e.g., “a blick can move”; Fig. 10C). Encoding this knowledge requires new synaptic weights between the item and the hidden layer. Appropriate weights may be found through standard gradient descent learning of the item-to-hidden weights for this novel item, while holding the hidden-to-output weights fixed to preserve prior knowledge about features. The network can then project other familiar properties to the novel item (e.g., “Does a blick have legs?”) by simply generating a feature output vector given the novel item as input. A novel item i with a familiar feature m will be assigned another familiar feature n according to the overlap between the hidden representations of features m and n (Eq. 17 and SI Appendix).
Thus, the hidden layer of the deep network furnishes a common, semantic representational space into which both features and items can be placed. When a novel feature m is assigned to a familiar item i, that novel feature is placed close to the familiar item in the hidden layer, and so the network will inductively project this novel feature to other items close to i in neural space. In parallel, when a novel item i is assigned a familiar feature m, that novel item is placed close to the familiar feature, and so the network will inductively project other features close to m in neural space onto the novel item.
This principle of similarity-based generalization encapsulated in Eqs. 16 and 17, when combined with the progressive differentiation of internal representations as the network learns from hierarchically structured data, as illustrated in Fig. 2B, then naturally explains the developmental shift in patterns of inductive projection from broad to specific, as shown in Fig. 10E. For example, consider specifically the inductive projection of a novel feature to familiar items (Fig. 10 A and B). Earlier (later) in developmental time, neural representations of all items are more similar to (different from) each other, and so the network’s similarity-based inductive projection will extend the novel feature to many (fewer) items, thereby exhibiting progressively narrower patterns of projection that respect the hierarchical structure of the environment (Fig. 10E). Thus, remarkably, even a deep linear network can provably exhibit the same broad to specific changes in patterns of inductive projection, empirically observed in many works (2, 3, 17, 18).
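The narrowing of inductive projection can be reproduced in the toy setting used earlier. In the sketch below (dataset, the minimum-norm choice of new output weights, and all hyperparameters are our assumptions, with the canary standing in for the pine of Fig. 10A), a novel property taught for one item spreads to every item early in training but mainly to that item's close relative after training.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy hierarchical domain as before (illustrative assumption); the canary here
# plays the role of the pine in Fig. 10A.
items = ["canary", "robin", "rose", "oak"]
Y = np.array([[1, 1, 1, 1], [1, 1, 0, 0], [0, 0, 1, 1], [1, 0, 0, 0],
              [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]], dtype=float)
X = np.eye(4)
n_hidden, lr, steps = 16, 0.05, 4000
W1 = rng.normal(scale=1e-3, size=(n_hidden, 4))
W2 = rng.normal(scale=1e-3, size=(Y.shape[0], n_hidden))

W1_early = None
for _ in range(steps):
    H = W1 @ X
    E = Y - W2 @ H
    W1, W2 = W1 + lr * (W2.T @ E @ X.T), W2 + lr * (E @ H.T)
    if W1_early is None and np.linalg.norm(W2 @ W1) > 0.3 * np.linalg.norm(Y):
        W1_early = W1.copy()                 # an early developmental snapshot

def project_novel_property(W1_snapshot, known_item=0):
    """Assign a novel property to one familiar item using minimum-norm
    hidden-to-output weights, then read out its projection to every item."""
    H = W1_snapshot @ X
    w_new = H[:, known_item] / (H[:, known_item] @ H[:, known_item])
    return w_new @ H

for label, W in (("early", W1_early), ("late", W1)):
    proj = project_novel_property(W)         # "a canary has property x"
    print(label, dict(zip(items, np.round(proj, 2))))
# Early in development the novel property is projected broadly to all items;
# after learning it is projected mainly to the canary's close relative (the
# robin), mirroring the progressive narrowing in Fig. 10E.
```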
Linking Behavior and Neural Representations
Compared with previous models, which have primarily made behavioral predictions, our theory has a clear neural interpretation. Here, we discuss implications for the neural basis of semantic cognition.
Similarity Structure Is an Invariant of Optimal Learning.
An influential method for probing neural codes for semantic knowledge in empirical measurements of neural activity is the representational similarity approach (20, 21, 25, 45), which examines the similarity structure of neural population vectors in response to different stimuli. This technique has identified rich structure in high-level visual cortices, where, for instance, inanimate objects are differentiated from animate objects (22, 23). Strikingly, studies have found remarkable constancy between neural similarity structures across human subjects and even between humans and monkeys (24, 25). This highly conserved similarity structure emerges, despite considerable variability in neural activity patterns across subjects (46, 47). Indeed, exploiting similarity structure enables more effective across-subject decoding of fMRI data relative to transferring a decoder based on careful anatomical alignment (48). Why is representational similarity conserved, both across individuals and species, despite highly variable tuning of individual neurons and anatomical differences?
Remarkably, we show that two networks trained in the same environment must have identical representational similarity matrices, despite having detailed differences in neural tuning patterns, provided that the learning process is optimal, in the sense that it yields the smallest norm weights that solve the task (see SI Appendix). One way to get close to the optimal manifold of smallest norm synaptic weights after learning is to start learning from small random initial weights. We show in Fig. 11 A and B that two networks, each starting from different sets of small random initial weights, will learn very different internal representations (Fig. 11 A and B, Top), but will have nearly identical representational similarity matrices (Fig. 11 A and B, Middle). Such a result is, however, not obligatory. Two networks starting from large random initial weights not only learn different internal representations, but also learn different representational similarity matrices (Fig. 11 C and D, Top and Middle). This pair of networks both learns the same composite input–output map, but with suboptimal large-norm weights. Hence, our theory, combined with the empirical finding that similarity structure is preserved across humans and species, suggests these disparate neural circuits may be implementing approximately optimal learning in a common environment.
Neural representations and invariants of learning. A and B depict two networks trained from small norm random weights. C and D depict two networks trained from large norm random weights. (Top) Neural tuning curves of individual hidden units across items, which differ across all four networks. (Middle) Representational similarity matrices of the hidden layers, which are nearly identical for the two small-norm networks but differ for the two large-norm networks. (Bottom) Behavioral similarity matrices derived from the networks’ outputs, which mirror the neural similarity matrices only for the small-norm networks.
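This prediction can be checked with a small experiment. The sketch below (toy data, network sizes, and initialization scales are our assumptions) trains two deep linear networks from small random weights and two from large random weights, and compares their hidden-layer similarity matrices: the small-initialization pair agree almost exactly despite having different individual tuning, while the large-initialization pair do not.

```python
import numpy as np

# Toy domain as before (an illustrative assumption).
Y = np.array([[1, 1, 1, 1], [1, 1, 0, 0], [0, 0, 1, 1], [1, 0, 0, 0],
              [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]], dtype=float)
X = np.eye(4)

def train(seed, init_scale, n_hidden=16, lr=0.02, steps=20000):
    """Train a deep linear network; return the hidden representations of all items."""
    rng = np.random.default_rng(seed)
    W1 = rng.normal(scale=init_scale, size=(n_hidden, 4))
    W2 = rng.normal(scale=init_scale, size=(Y.shape[0], n_hidden))
    for _ in range(steps):
        H = W1 @ X
        E = Y - W2 @ H
        W1, W2 = W1 + lr * (W2.T @ E @ X.T), W2 + lr * (E @ H.T)
    return W1 @ X

def rsm(H):
    return H.T @ H        # item-by-item (Gram-matrix) representational similarity

H_a, H_b = train(0, 1e-3), train(1, 1e-3)   # two networks, small random weights
H_c, H_d = train(2, 1.0), train(3, 1.0)     # two networks, large random weights

print("hidden patterns identical across the small-init pair?",
      bool(np.allclose(H_a, H_b, atol=1e-2)))                    # False
print("max RSM difference, small-init pair:",
      round(float(np.abs(rsm(H_a) - rsm(H_b)).max()), 3))         # ~0
print("max RSM difference, large-init pair:",
      round(float(np.abs(rsm(H_c) - rsm(H_d)).max()), 3))         # large
```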
When the Brain Mirrors Behavior.
In addition to matching neural similarity patterns across subjects, experiments using fMRI and single-unit responses have also documented a correspondence between neural similarity patterns and behavioral similarity patterns (21). When does neural similarity mirror behavioral similarity? We show this correspondence again emerges only in optimal networks. In particular, comparing the similarity matrix of the network’s hidden representations with the similarity matrix of its output behavior, we show in SI Appendix that, when learning finds the minimal-norm synaptic weights, the two are built from the same underlying modes and therefore mirror one another; away from this optimal solution, the correspondence breaks down.
Given that optimal learning is a prerequisite for neural similarity mirroring behavioral similarity, as in the previous section, there is a match between the two for pairs of networks trained from small random initial weights (Fig. 11 A and B, Middle and Bottom), but not for pairs of networks trained from large random initial weights (Fig. 11 C and D, Middle and Bottom). Thus, again, speculatively, our theory suggests that the experimental observation of a link between behavioral and neural similarity may in fact indicate that learning in the brain is finding optimal network solutions that efficiently implement the requisite transformations with minimal synaptic strengths.
Discussion
In summary, the main contribution of our work is the analysis of a simple model—namely, a deep linear neural network—that can, surprisingly, qualitatively capture a diverse array of phenomena in semantic development and cognition. Our exact analytical solutions of nonlinear learning phenomena in this model yield conceptual insights into why such phenomena also occur in more complex nonlinear networks (4, 28–32) trained to solve semantic tasks. In particular, we find that the hierarchical differentiation of internal representations in a deep, but not a shallow, network (Fig. 2) is an inevitable consequence of the fact that singular values of the input–output correlation matrix drive the timing of rapid developmental transitions (Fig. 3 and Eqs. 6 and 10), and hierarchically structured data contain a hierarchy of singular values (Fig. 4). In turn, semantic illusions can be highly prevalent between these rapid transitions simply because global optimality in predicting all features of all items necessitates sacrificing correctness in predicting some features of some items (Fig. 5). And, finally, this hierarchical differentiation of concepts is intimately tied to the progressive sharpening of inductive generalizations made by the network (Fig. 10).
The encoding of knowledge in the neural network after learning also reveals precise mathematical definitions of several aspects of semantic cognition. Basically, the synaptic weights of the neural network extract from the statistical structure of the environment a set of paired object analyzers and feature synthesizers associated with every categorical distinction. The bootstrapped, simultaneous learning of each pair solves the apparent Gordian knot of knowing both which items belong to a category and which features are important for that category: The object analyzers determine category membership, while the feature synthesizers determine feature importance, and the set of extracted categories are uniquely determined by the statistics of the environment. Moreover, by defining the typicality of an item for a category as the strength of that item in the category’s object analyzer, we can prove that typical items must enjoy enhanced performance in semantic tasks relative to atypical items (Eq. 12). Also, by defining the category prototype to be the associated feature synthesizer, we can prove that the most typical items for a category are those that have the most extremal projections onto the category prototype (Fig. 6 and Eq. 13). Finally, by defining the coherence of a category to be the associated singular value, we can prove that more coherent categories can be learned more easily and rapidly (Fig. 7) and explain how changes in the statistical structure of the environment determine what level of a category hierarchy is the most basic or important (Fig. 8). All our definitions of typicality, prototypes, and category coherence are broadly consistent with intuitions articulated in a wealth of psychology literature, but our definitions imbue these intuitions with enough mathematical precision to prove theorems connecting them to aspects of category learnability, learning speed, and semantic task performance in a neural network model.
More generally, beyond categorical structure, our analysis provides a principled framework for explaining how the statistical structure of diverse structural forms associated with different PGMs gradually becomes encoded in the weights of a neural network (Fig. 9). Remarkably, the network learns these structures without knowledge of the set of candidate structural forms, demonstrating that such forms need not be built in. Regarding neural representation, our theory reveals that, across different networks trained to solve a task, while there may be no correspondence at the level of single neurons, the similarity structure of internal representations of any two networks will both be identical to each other and closely related to the similarity structure of behavior, provided that both networks solve the task optimally, with the smallest possible synaptic weights (Fig. 11).
While our simple neural network captures this diversity of semantic phenomena in a mathematically tractable manner, because of its linearity, the phenomena it can capture still barely scratch the surface of semantic cognition. Some fundamental semantic phenomena that require complex nonlinear processing include context-dependent computations, dementia in damaged networks, theory of mind, the deduction of causal structure, and the binding of items to roles in events and situations. While it is inevitably the case that biological neural circuits exhibit all of these phenomena, it is not clear how our current generation of artificial nonlinear neural networks can recapitulate all of them. However, we hope that a deeper mathematical understanding of even the simple network presented here can serve as a springboard for the theoretical analysis of more complex neural circuits, which, in turn, may eventually shed much-needed light on how the higher-level computations of the mind can emerge from the biological wetware of the brain.
Acknowledgments
We thank Juan Gao and Madhu Advani for useful discussions. S.G. was supported by the Burroughs-Wellcome, Sloan, McKnight, James S. McDonnell, and Simons foundations. J.L.M. was supported by the Air Force Office of Scientific Research. A.M.S. was supported by Swartz, National Defense Science and Engineering Graduate, and Mind, Brain, & Computation fellowships.
Footnotes
- 1To whom correspondence should be addressed. Email: andrew.saxe@psy.ox.ac.uk.
Author contributions: A.M.S., J.L.M., and S.G. designed research; A.M.S. and S.G. performed research; and A.M.S., J.L.M., and S.G. wrote the paper.
The authors declare no conflict of interest.
This article is a PNAS Direct Submission.
This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.1073/pnas.1820226116/-/DCSupplemental.
Published under the PNAS license.
References
1. Keil F
2. Carey S
3. Murphy G
4. Rogers TT, McClelland JL
5. Mandler J, McDonough L
6. Inhelder B, Piaget J
12. Mervis C, Catlin J, Rosch E
15. Rosch E
17. Keil F
23. Connolly A, et al.
26. Kemp C, Tenenbaum J
27. Hinton G
28. Rumelhart D, Todd P
29. McClelland J
30. Plunkett K, Sinha C
33. Poole B, Lahiri S, Raghu M, Sohl-Dickstein J, Ganguli S
38. Anglin J
42. Khrennikov A, Kozyrev S
44. Gray R
45. Laakso A, Cottrell G
46. Haxby J, et al.