Simple transformations capture auditory input to cortex

Significance Sensory systems are extremely complex, with diverse neurons and connections. However, this does not necessarily imply that the computations these systems perform are equally complex. Here we examine the impact of processing in the ear and subcortical pathway on neural responses to natural sounds in the auditory cortex. We find that this impact can be described more consistently using simple spectral models than using detailed models of the auditory periphery. This suggests that there may be an underlying simplicity to the signal transformation from ear to cortex that is hidden among the detail. This hidden simplicity may be a feature of other sensory systems too.

were single units. Seventy-three of the single units had noise ratios (5,6) of < 40 for the stimulus set, and these were the units used in this study.

Natural sound dataset 2 (NS2)
The second natural sound dataset was published by other authors (7) and is publicly available (https://zenodo.org/record/3445557#.XyNEy-d7lEY). A description of the experimental setting and stimuli can be found in (7). Briefly, extracellular neurophysiological recordings were obtained from the auditory cortex (A1 and AAF) of six awake ferrets using 1-4 independently positioned tungsten microelectrodes while the animals were passively listening. Sound was delivered through free-field speakers (Manger W05) positioned at ±30° azimuth, 80 cm from the animal, at a sampling rate of 44.1 kHz. The sound stimuli consisted of diverse clips of natural sounds. Most clips were presented only once, but some were repeated a number of times that varied over experiments. In a subset of experiments, 18 natural sound stimulus clips (each 4 s in duration) were repeated 10 (or sometimes 20) times. We selected this subset of experiments and used the neural responses to these 18 natural sounds for our analysis, as a sufficient number of repeats is required for calculating the noise ratio and the PSTH, and a sufficient number of sound clips is needed for model fitting. This set of 18 natural sound clips included human speech, ferret and other species' vocalizations, natural environmental sounds, and sounds from the animals' laboratory environment. The RMS sound intensity over the full 4 s duration of each clip was 54 dB SPL. The RMS sound intensity within 4 ms time bins varied from -16 dB SPL to 76 dB SPL over the whole dataset. The neural dataset for these 18 stimuli consisted of 336 single units.
We used 235 of these units, selected because they had a noise ratio < 40 for the 18 natural sound stimuli.

The DRC dataset
The dynamic random chord (DRC) dataset comes from the same anesthetized ferret experiments as natural sound dataset 1. The DRC dataset has not previously been analyzed in publications, but is publicly available at https://osf.io/ayw2p/. The DRC stimulus clips were randomly interleaved with the clips in natural sound dataset 1; hence the units in the DRC dataset are the same 73 single units as in that natural sound dataset. The sound stimuli in the DRC dataset consisted of 12 sound clips, each of 5 s duration, and each clip was presented 5 times. Each clip was a dynamic random chord sequence (4,5,8,9), which consisted of a succession of complex tones, each tone being 25 ms long, presented with no gap between the tones. Each complex tone was composed of 34 superposed pure tones, whose levels were independently and randomly chosen. The frequencies of the 34 pure tones were log-spaced over 5.5 octaves, from 500 Hz to 22,627 Hz, with 1/6-octave spacing. The tone levels were picked from a uniform distribution spanning either 10, 20, 30, or 40 dB, with a mean of 44 dB SPL. The RMS intensity over the full 5 s duration of each clip ranged from 79 to 88 dB SPL. The RMS sound intensity within 4 ms time bins varied from 71 dB SPL to 93 dB SPL over the whole dataset.
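
To make the stimulus construction concrete, the following is a minimal numpy sketch of generating one DRC clip under the parameters above. The function name, the random phases, and the dB-to-amplitude calibration are our own illustrative choices, not the original stimulus-generation code.

```python
import numpy as np

def make_drc_clip(fs=44100, dur_s=5.0, tone_ms=25, n_tones=34,
                  f_lo=500.0, mean_db=44.0, span_db=20.0, seed=0):
    """Sketch of one dynamic random chord (DRC) clip: gapless 25-ms
    complex tones, each a sum of 34 pure tones log-spaced at 1/6-octave
    steps from 500 Hz (spanning 5.5 octaves, up to ~22,627 Hz), with
    tone levels drawn independently and uniformly around the mean."""
    rng = np.random.default_rng(seed)
    freqs = f_lo * 2.0 ** (np.arange(n_tones) / 6.0)  # 1/6-octave spacing
    n_chord = int(fs * tone_ms / 1000)                # samples per chord
    n_chords = int(dur_s * 1000 / tone_ms)            # chords per clip
    t = np.arange(n_chord) / fs
    clip = np.zeros(n_chords * n_chord)
    for c in range(n_chords):
        # independent random level (dB) for each tone in this chord
        levels = rng.uniform(mean_db - span_db / 2,
                             mean_db + span_db / 2, n_tones)
        amps = 10.0 ** (levels / 20.0)                # dB -> linear amplitude
        phases = rng.uniform(0.0, 2 * np.pi, n_tones) # phases: our assumption
        tones = amps[:, None] * np.sin(2 * np.pi * freqs[:, None] * t
                                       + phases[:, None])
        clip[c * n_chord:(c + 1) * n_chord] = tones.sum(axis=0)
    return clip

drc = make_drc_clip(span_db=20.0)  # span_db was 10, 20, 30, or 40 dB
```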

Cochlear models
Figure S1 illustrates the similarities and differences between the cochlear models that we used. Below we describe each of these models.
WSR model: This model was originally proposed by Wang and Shamma and was later described by Powen Ru (10,11,12,13); the Matlab code we used was adapted from https://github.com/tel/NSLtools. We refer to it as the WSR (Wang Shamma Ru) model, after the names of its developers. It makes use of a series of filters whose center frequencies were spaced logarithmically, followed by a sigmoid compression of the filter outputs. The sigmoid-compressed outputs were then passed through a lowpass filtering stage, a lateral inhibition stage (between neighboring frequency channels), and a temporal integration stage. The source of these parameters is not strictly physiological; rather, they are abstracted from animal experiments to match perceptual processing. The filterbank of this model comprised a set of 129 fixed frequency channels. To make this model comparable to the other models, appropriately spaced and sized subsets of the frequency channels were selected.
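
To make the stage ordering concrete, a schematic numpy sketch follows. The filter shapes, sigmoid slope, and time constants are placeholders rather than the NSLtools values, and the lateral-inhibition step is reduced to a rectified difference between neighboring channels.

```python
import numpy as np

def wsr_like_stages(filter_out, slope=1.0, tau_lp=2, tau_int=8):
    """Schematic of the WSR stage ordering (placeholder parameters).
    filter_out: array (n_channels, n_samples) from a log-spaced
    cochlear filterbank."""
    # 1. sigmoid compression of each filter output
    y = 1.0 / (1.0 + np.exp(-slope * filter_out))
    # 2. lowpass filtering (simple exponential smoother as a stand-in)
    a = np.exp(-1.0 / tau_lp)
    for t in range(1, y.shape[1]):
        y[:, t] = a * y[:, t - 1] + (1 - a) * y[:, t]
    # 3. lateral inhibition: first difference across neighboring
    #    frequency channels, half-wave rectified
    y = np.maximum(np.diff(y, axis=0, prepend=y[:1, :]), 0.0)
    # 4. temporal integration over a short window
    kernel = np.ones(tau_int) / tau_int
    return np.apply_along_axis(
        lambda r: np.convolve(r, kernel, mode="same"), 1, y)
```
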
Lyon model: This model uses a cascade of filters with half-wave rectification and adaptive gain control, whose center frequencies are spaced logarithmically, except at low frequencies, where the spacing is linear, to simulate the behavior of the cochlea (14,15). The parameters of the filters have largely been obtained from human psychophysics experiments. We adapted this model from https://github.com/google/carfac. This implementation has several free parameters, allowing a flexible choice of spacing and bandwidth for each frequency channel in the filterbank. We set these parameters to values that would generate the desired number of frequency channels between 0.5 kHz and 20 kHz.
BEZ model: This is a phenomenological model of cat auditory nerve fibers (16,17,18), which we refer to as the BEZ (Bruce Erfani Zilany) model [the model described in (16)], named after its developers (we adapted this model from https://www.urmc.rochester.edu/MediaLibraries/URMCMedia/labs/carneylab/codes/UR_EAR_v2_1.zip). This model has various processing stages that match the processing stages in the cochlea, the synapses with the auditory nerve, and the spiking properties of the auditory nerve. Sound input is first processed through a middle ear filter, followed by a bank of filters with various complexities and nonlinearities to account for inner hair cell and outer hair cell properties; the activity of the hair cells is then processed through a nonlinear model of synaptic neurotransmitter release, neurotransmitter uptake, and spiking of the auditory nerve. Because the spiking is a stochastic process, for each center frequency, each of three auditory nerve fiber types was allowed to spike over many trials (200 for plotting cochleagrams, and 20 for predicting neural responses, to reduce running time) and the average was taken over these trials. After that, we took an average over the different fiber types, weighted by the proportion of each fiber type in the nerve fiber population.
MSS model: This is a model of the auditory periphery (19,20,21,22,23,24), the parameters of which are based on measurements from the guinea pig (23,24). The MSS model includes several stages of processing. The first stage, an outer and middle ear (OME) model, uses a series of linear filters to approximate stapes motion at the oval window. The second stage, the frequency decomposition stage, uses a dual resonance nonlinear (DRNL) filter architecture to model the properties of the basilar membrane. A DRNL filter consists of two parallel pathways, each comprising a series of bandpass and lowpass filters, one with a nonlinear compression and the other without. The outputs of the two parallel pathways are then combined to give the velocity of the basilar membrane displacement. Finally, transduction by the inner hair cells is modelled using a differential equation for their membrane potential, the synapse is modelled using probabilistic release of neurotransmitter, and the activity of the auditory nerve (AN) fibers incorporates refractoriness, with the spiking of the AN fibers providing the output (Supplementary Fig. 1.1A-C). We set the center frequencies of the nonlinear filterbank to be log-spaced between 508 Hz and 19,912 Hz. As with the BEZ model, for each center frequency each of three auditory nerve fiber types was allowed to spike over many trials (200 for plotting cochleagrams, and 20 for predicting neural responses, to reduce running time) and the average was taken over these trials. For the main MSS model, the output was averaged over the three fiber types, whereas for the multi-fiber MSS model this was not done. Finally, the output was downsampled into 4-ms time bins to give a windowed time-frequency representation.
Spec-log model: A spectrogram was produced from the sound waveform by taking the amplitude spectrum using 8-ms Hanning windows, overlapping by 4 ms. The amplitudes of adjacent frequency channels were summed using overlapping triangular windows (using code adapted from melbank.m, http://www.ee.ic.ac.uk/hp/staff/dmb/voicebox/voicebox.html), with log-spaced center frequencies ranging from 508 Hz to 19,912 Hz.
The amplitude in each time-frequency bin was converted to log values (using 20 * log(⋅)), and any value below a threshold was set to that threshold (the threshold was -40 for NS1 and DRC and -50 for NS2, set by cross-validation).
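
A minimal Python sketch of the spec-log cochleagram follows. The original used melbank.m for the triangular filterbank; here we approximate it with numpy/scipy, and the channel count, the edge handling, and the use of log base 10 (the dB convention) are our assumptions.

```python
import numpy as np
from scipy.signal import stft

def spec_log_cochleagram(wav, fs, n_freq=32, f_lo=508.0, f_hi=19912.0,
                         thresh_db=-40.0):
    """Sketch of the spec-log model: 8-ms Hanning windows with 4-ms
    overlap, triangular summing into log-spaced channels, then
    thresholded 20*log(.) compression."""
    nperseg = int(0.008 * fs)                        # 8-ms window
    f, t, Z = stft(wav, fs, window="hann", nperseg=nperseg,
                   noverlap=nperseg // 2)            # 4-ms hop
    amp = np.abs(Z)                                  # amplitude spectrogram
    # overlapping triangular windows with log-spaced center frequencies
    edges = np.geomspace(f_lo, f_hi, n_freq + 2)
    fb = np.zeros((n_freq, f.size))
    for i in range(n_freq):
        lo, c, hi = edges[i], edges[i + 1], edges[i + 2]
        fb[i] = np.clip(np.minimum((f - lo) / (c - lo),
                                   (hi - f) / (hi - c)), 0.0, None)
    coch = 20.0 * np.log10(fb @ amp + 1e-12)         # log compression
    return np.maximum(coch, thresh_db)               # floor at threshold
```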

Spec-log1plus model:
This model is the same as the spec-log model but uses log(1+(⋅)) as the compression function instead of log(⋅).

Spec-power model:
This model is also the same as the spec-log model, but it takes the square of the spectrogram, resulting in a power spectrogram instead of an amplitude spectrogram. This model uses a logarithmic compression function (10 * log(⋅)).
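
The three spectrogram-based models differ only in the pointwise compression applied around the triangular channel summation; a short side-by-side sketch follows. Whether the squaring precedes the channel summation, and whether the factor of 20 is retained in the log(1+·) variant, are our reading of the descriptions above.

```python
import numpy as np

def spec_log(fb, amp, thresh=-40.0):
    # amplitude spectrogram summed into channels, 20*log(.), then floor
    return np.maximum(20.0 * np.log10(fb @ amp + 1e-12), thresh)

def spec_log1plus(fb, amp):
    # log(1 + .) instead of log(.): finite at silence, so no floor needed
    return 20.0 * np.log10(1.0 + fb @ amp)

def spec_power(fb, amp, thresh=-40.0):
    # square first (power spectrogram), sum into channels, then 10*log(.)
    return np.maximum(10.0 * np.log10(fb @ amp ** 2 + 1e-12), thresh)
```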

Spec-Hill model:
This model is an extension of the spec-power model, in which the output of the spec-power model is further compressed using a nonlinear Hill function (25). Before the spec-power output was passed into the Hill function, the magnitude of the spec-power model's threshold was added to this output to ensure that it was non-negative. In the Hill function, c is the scaling factor, which was set to 0.01, and the saturation parameter was set to 0.16.
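
A minimal sketch of a Hill-type compression consistent with the parameters named above (a scaling factor c and a saturation parameter, here called sigma); the exponent and the exact placement of the parameters are illustrative assumptions, and the precise form used should be taken from (25).

```python
import numpy as np

def hill_compress(x, c=0.01, sigma=0.16, n=2.0):
    """Generic Hill compression applied to the shifted (non-negative)
    spec-power output. Exponent n and the placement of c and sigma are
    illustrative, not taken from (25)."""
    cxn = (c * x) ** n              # scaled, exponentiated input
    return cxn / (cxn + sigma ** n) # saturates toward 1 for large x
```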

The PSTH and normalized cochleagram
For each neural unit in a dataset and for each sound stimulus clip n, a peri-stimulus time histogram (PSTH) was made. To construct the PSTH, the number of spikes was counted in consecutive 4-ms time bins over the course of the clip and averaged over stimulus repeats. The PSTH to stimulus n is henceforth denoted $y_n[t]$, where t indicates the time bin and runs from t = 1 to T. During the beginning of a clip presentation, the neuron is adapting from silence, which can be a challenge to model. It is therefore common practice in fitting STRF and LN models to not use this period of sound (2,4,27).
Hence, the first 800 ms of the neural response to each clip was removed from the data for natural sound dataset 1 (although including the first 800 ms made very little difference to the results; see Figure S12).

This clipping was not done for natural sound dataset 2 or the DRC dataset as these clips were shorter than those of natural sound dataset 1 and this clipping would have shortened them too much.
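
To make the construction explicit, here is a minimal numpy sketch of the PSTH computation; the spike-time input format and the zero-based bin edges are our assumptions.

```python
import numpy as np

def make_psth(spike_times, clip_dur_s, bin_s=0.004, drop_s=0.8):
    """PSTH sketch: spike_times is a list with one array of spike times
    (seconds from clip onset) per stimulus repeat. Counts spikes in 4-ms
    bins, averages over repeats, and drops the first 800 ms (NS1 only;
    pass drop_s=0.0 for NS2 and the DRC dataset)."""
    edges = np.arange(0.0, clip_dur_s + bin_s, bin_s)
    counts = np.zeros(edges.size - 1)
    for rep in spike_times:                   # accumulate over repeats
        counts += np.histogram(rep, bins=edges)[0]
    psth = counts / len(spike_times)          # mean spike count per bin
    return psth[int(drop_s / bin_s):]         # remove adaptation period
```
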
The output of each cochlear model is called a cochleagram: the frequency-decomposed transformation of sound over specific time windows (4 ms in our case). To provide input to an encoding model of an auditory cortical neuron, the cochleagrams were normalized to zero mean and unit variance, and for each snippet n and every time bin t, a time-lagged matrix was extracted from the cochleagram. This was the vector $\mathbf{x}_n[t]$, containing the normalized cochleagram values for all frequency channels at time bins $t-H+1, \ldots, t$, where H is the number of history steps.

The encoding models

The spectro-temporal receptive field and the linear-nonlinear model
The linear-nonlinear (LN) model consists of a linear stage, the spectrotemporal receptive field (STRF), and a nonlinear stage. The linear part of the model is:
$$z_n[t] = \mathbf{w} \cdot \mathbf{x}_n[t] + b,$$
where $\mathbf{w}$ is a vector of the input weights and b is the background activity of the neuron. The second stage of the model was a logistic activation function (sigmoid nonlinearity), which was fitted after fitting the STRF. This function is given by:
$$\hat{y}_n[t] = r_0 + \frac{r_{\max}}{1 + \exp\left(-(z_n[t] - z_0)/\sigma\right)}.$$
The four parameters of this function ($r_0$, $r_{\max}$, $z_0$, $\sigma$) were fitted by minimizing the squared error between the nonlinear estimate of the firing rate $\hat{y}_n[t]$ and the measured firing rate $y_n[t]$.
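
As a concrete sketch of the input construction and the two LN stages, a minimal numpy version follows; the variable names and the zero-padding at clip onset are our choices.

```python
import numpy as np

def lagged_design(coch, n_h):
    """Build the time-lagged input: row t holds the normalized cochleagram
    at bins t-n_h+1 ... t for all frequency channels, flattened. Bins
    before clip onset are zero-padded (our choice)."""
    n_f, T = coch.shape
    X = np.zeros((T, n_f * n_h))
    for h in range(n_h):                      # h = lag in time bins
        X[h:, h * n_f:(h + 1) * n_f] = coch[:, :T - h].T
    return X

def ln_predict(X, w, b, r0, rmax, z0, sigma):
    """LN model sketch: linear STRF stage z = X.w + b, followed by the
    four-parameter logistic output stage."""
    z = X @ w + b
    return r0 + rmax / (1.0 + np.exp(-(z - z0) / sigma))
```
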
The network receptive field model
The network receptive field (NRF) model describes the neural response using a neural network, and is the same model as reported in (1). The parameters of the NRF model are fitted by minimizing the objective function with respect to the free parameters using the sum-of-functions optimizer algorithm (29). In using this algorithm, we take one clip to be one minibatch. The optimization algorithm requires calculation of the error gradients with respect to each of the parameters. For the NRF model, the error gradients are calculated using the standard chain rule (backpropagation).
Before training, the weights were initialized by modified Glorot initialization, drawing from a uniform distribution ranging from $-\sqrt{6/(n_{\mathrm{in}} + n_{\mathrm{out}})}$ to $+\sqrt{6/(n_{\mathrm{in}} + n_{\mathrm{out}})}$, where $n_{\mathrm{in}}$ is the number of input weights to a hidden unit (HU) and $n_{\mathrm{out}} = 1$ is the number of output weights from a HU. The biases were initialized similarly (30).
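
A short sketch of this initialization; the bound $\sqrt{6/(n_{\mathrm{in}}+n_{\mathrm{out}})}$ is the standard Glorot uniform limit, which the description above appears to follow, and the example dimensions are hypothetical.

```python
import numpy as np

def glorot_uniform(n_in, n_out=1, size=None, rng=None):
    """Modified Glorot initialization as described above: uniform on
    [-limit, +limit] with limit = sqrt(6 / (n_in + n_out)); here each
    hidden unit (HU) has a single output weight, so n_out = 1."""
    if rng is None:
        rng = np.random.default_rng()
    limit = np.sqrt(6.0 / (n_in + n_out))
    return rng.uniform(-limit, limit,
                       size=size if size is not None else n_in)

# e.g. (hypothetical sizes): input weights for 20 HUs, each seeing a
# 32-channel x 25-lag cochleagram segment
W = glorot_uniform(n_in=32 * 25, size=(20, 32 * 25))
```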

Cross-validation and testing of the encoding models
Each sound dataset was divided into a cross-validation set and a test set. The cross-validation set was used for training the weight matrices of the encoding models and for setting the strength of their regularization hyperparameters by cross-validation. Also, for all cochlear models, both biologically complex and simple, cross-validation (mostly using NS1) was used to explore some of their settings, to avoid using any settings that gave poor predictions. The hyperparameters and settings were selected using only the cross-validation set, and the separate test set was then used to assess the predictive capacity of the models.
More specifically, for natural sound dataset 1, which contained 20 sound clips, 4 were chosen as a test set that was not used during training and cross-validation. The cross-validation set (the remaining