A mixed-scale dense convolutional neural network for image analysis
Contributed by James A. Sethian, November 3, 2017 (sent for review September 11, 2017; reviewed by Scott E. Fraser and Ron Kimmel)

Significance
Popular neural networks for image-processing problems often contain many different operations, multiple layers of connections, and a large number of trainable parameters, often exceeding several million. They are typically tailored to specific applications, making it difficult to transfer a network that is successful in one application to other applications. Here, we introduce a neural network architecture that is less complex than existing networks, is easy to train, and achieves accurate results with relatively few trainable parameters. The network automatically adapts to a specific problem, allowing the same network to be applied to a wide variety of different problems.
Abstract
Deep convolutional neural networks have been successfully applied to many image-processing problems in recent works. Popular network architectures often add additional operations and connections to the standard architecture to enable training deeper networks. To achieve accurate results in practice, a large number of trainable parameters are often required. Here, we introduce a network architecture based on using dilated convolutions to capture features at different image scales and densely connecting all feature maps with each other. The resulting architecture is able to achieve accurate results with relatively few parameters and consists of a single set of operations, making it easier to implement, train, and apply in practice, and automatically adapts to different problems. We compare results of the proposed network architecture with popular existing architectures for several segmentation problems, showing that the proposed architecture is able to achieve accurate results with fewer parameters, with a reduced risk of overfitting the training data.
Machine learning is successful in many imaging applications, such as image classification (1–3) and semantic segmentation (4–6). Many applications of machine learning to imaging problems use deep convolutional neural networks (DCNNs), in which the input image and intermediate images are convolved with learned kernels in a large number of successive layers, allowing the network to learn highly nonlinear features. The popularity of machine learning has grown significantly due to (i) recent developments that allow for effective training of deeper networks, e.g., the introduction of rectified linear units (7) and dropout layers (8); (ii) the public availability of highly optimized software to both train and apply deep networks, e.g., TensorFlow (9) and Caffe (10); and (iii) the public availability of large pretrained networks and large training datasets, e.g., VGG (2) and ImageNet (11). Machine learning for imaging will likely continue to be an active area of research (12).
To achieve accurate results for difficult image-processing problems, DCNNs typically rely on combinations of additional operations and connections including, for example, downscaling and upscaling operations to capture features at various image scales (4, 5). To train deeper and more powerful networks, additional layer types (8, 13) and connections (14, 15) are often required. Finally, DCNNs typically use a large number of intermediate images and trainable parameters [e.g., more than 100 million (2)] to achieve results for difficult problems.
The large size and complicated nature of many DCNNs bring significant challenges. For example, the chosen combination of layers and connections can significantly influence the accuracy of the trained network, but which combination is best for a given problem is difficult to predict a priori. Consequently, a network that works well for one problem is not guaranteed to work well for a different problem and may require significant changes to achieve accurate results. Furthermore, the large number of parameters to learn during training requires careful choices of hyperparameters (e.g., learning rates and initialization values) to avoid problems such as overfitting (8) and vanishing gradients (13) that result in inaccurately trained networks. As a result, image analysis often relies on problem-specific traditional methods instead.
Here, we introduce a network architecture specifically designed to be easy to implement, train, and use. All layers of the network use the same set of operations and are connected to each other in the same way, removing the need to choose which operations and connections to use for each specific problem. Our proposed network architecture achieves accurate results with relatively few intermediate images and parameters, eliminating the need to tune hyperparameters and the need for additional layers or connections to enable training. The network uses dilated convolutions instead of scaling operations to capture features at various image scales, employing multiple scales within a single layer and densely connecting all intermediate images with each other. During training, the network learns which combinations of dilations to use for the given problem, allowing the same network to be applied to different problems.
This paper is structured as follows. We first introduce notation and discuss the general structure of existing deep convolutional networks. We then introduce the proposed network architecture. We explain the experiments we performed to investigate the performance of the architecture, comparing them with popular existing architectures, and discuss their results. Finally, we conclude with a summary and final remarks.
Notation and Concepts
Problem Definition.
In this paper, we apply our approach to real-valued 2D images. We define an image as a set of pixels with one or more channels of real values, denoting the network input by x and the corresponding output by y. The aim is to learn an unknown function f with y = f(x) from a set of training images for which the correct output is known.
Convolutional Neural Networks.
Convolutional neural networks (CNNs) model the unknown function f by using several layers that are connected to each other in succession. Each layer i produces an output image z^i, called a feature map, consisting of one or more channels; the first layer takes the input image x, and the output of the final layer gives the network output y.
Each individual layer can consist of multiple operations. A common layer architecture first convolves each channel of the input feature map with a different filter, then sums the resulting convolved images pixel by pixel, adds a constant value (the bias) to the resulting image, and finally applies a nonlinear operation to each pixel. These operations can be repeated using different filters and biases to produce multiple channels for the output feature map. Thus, channel j of the output feature map of layer i can be written as $z^i_j = \sigma\left(\sum_k h_{ijk} \ast z^{i-1}_k + b_{ij}\right)$, where the $h_{ijk}$ are the convolution filters, $b_{ij}$ is the bias, and $\sigma$ is the nonlinear operation.
Fig. 1. A schematic representation of a two-layer CNN with input x, output y, and intermediate feature maps z^1 and z^2.
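To make these operations concrete, the following NumPy sketch computes one such layer with reflective boundary handling. This is our illustration rather than the authors' implementation; the ReLU nonlinearity and the array shapes are assumed choices.

```python
import numpy as np
from scipy.ndimage import convolve

def cnn_layer(z_in, filters, biases, sigma=lambda t: np.maximum(t, 0.0)):
    """One CNN layer: convolve each input channel with its own filter,
    sum the convolved images pixel by pixel, add a bias, and apply a
    nonlinearity (ReLU here, an assumed choice).

    z_in:    input feature map, shape (c_in, m, n)
    filters: shape (c_out, c_in, k, k), one filter per in/out channel pair
    biases:  shape (c_out,)
    """
    c_out = filters.shape[0]
    out = np.empty((c_out,) + z_in.shape[1:])
    for j in range(c_out):
        acc = np.zeros(z_in.shape[1:])
        for c, channel in enumerate(z_in):
            # 2D convolution with reflective boundary handling
            acc += convolve(channel, filters[j, c], mode="reflect")
        out[j] = sigma(acc + biases[j])
    return out
```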
The goal of training a CNN is to find filters
DCNNs.
DCNNs use a network architecture similar to standard CNNs, but consist of a larger number of layers, which enables them to model more complicated functions. In addition, DCNNs often include downscaling and upscaling operations between layers, decreasing and increasing the dimensions of feature maps to capture features at different image scales. Many DCNNs incrementally downscale feature maps in the first half of the layers, called the encoder part of the network, and subsequently upscale in the second half, called the decoder part. Skip connections are often included between feature maps of the decoder and encoder at identical scales (5). A schematic representation of a common encoder–decoder DCNN architecture is shown in Fig. 2.
Fig. 2. A schematic representation of a common DCNN architecture with scaling operations. Downward arrows represent downscaling operations, upward arrows represent upscaling operations, and dashed arrows represent skip connections.
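As an illustration of the scaling operations in such an encoder–decoder network, here is a minimal skeleton with mean-pooling downscaling, nearest-neighbor upscaling, and one additive skip connection. Real architectures such as those of refs. 4 and 5 use learned convolutions at every scale and often concatenate rather than add skip connections; this sketch only shows how the image dimensions flow.

```python
import numpy as np

def downscale(z):
    """2 x 2 mean pooling: halves both image dimensions (assumes even sizes)."""
    return 0.25 * (z[::2, ::2] + z[1::2, ::2] + z[::2, 1::2] + z[1::2, 1::2])

def upscale(z):
    """Nearest-neighbor upsampling: doubles both image dimensions."""
    return np.repeat(np.repeat(z, 2, axis=0), 2, axis=1)

def encoder_decoder(x):
    """Skeleton of Fig. 2: downscale in the encoder, upscale in the
    decoder, with a same-scale skip connection (dashed arrow)."""
    e1 = downscale(x)      # encoder: capture coarser-scale features
    e2 = downscale(e1)     # deepest, coarsest scale
    d1 = upscale(e2) + e1  # decoder: upscale and apply skip connection
    return upscale(d1)     # back to the input resolution
```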
In general, the increased depth of DCNNs compared with shallow CNNs makes training more difficult. The increased depth makes it more likely that training gets stuck in a local minimum of the error function and can result in gradients that become either too large or too small (13). Furthermore, DCNNs typically contain many parameters (e.g., filters and biases), often several million or more, that have to be learned during training. The large parameter space can make training more difficult by increasing both training time (18) and the likelihood of overfitting the network to the training data (8), thereby necessitating large training sets. Several additions to standard DCNN architectures have been proposed, including batch normalization layers (13), which rescale feature maps between layers to improve the scaling of gradients during training; highway connections (14), residual connections (15), and fractal networks (19), which allow information to flow more easily through deep networks by skipping layers; and dropout layers (8), in which feature maps are randomly removed from the network during training, reducing the problem of overfitting large networks.
Although these additions have advanced image processing in several fields (12), they can be difficult to routinely apply in areas such as biomedical imaging and materials science. Instead, traditional imaging algorithms are used, such as the Hough transform (20) and template matching (21), or manual processing [e.g., biological image segmentation (22)].
Theory and Algorithms
Our goal is to enable easier application of DCNNs to many imaging problems by introducing a less complicated network architecture with significantly fewer parameters to learn, one that is able to automatically adapt to different problems. To do so, we introduce the “mixed-scale dense” (MS-D) network architecture, which (i) mixes scales within each layer and (ii) densely connects all feature maps.
Mixing Scales.
Instead of using downscaling and upscaling operations to capture features at different scales, the MS-D architecture uses dilated convolutions. A dilated convolution with dilation s uses a dilated filter h that is nonzero only at distances that are a multiple of s pixels from the center,∗ allowing the network to probe image features at scale s without reducing the resolution of the feature maps. During training, the network can learn which dilations, and hence which scales, are most useful for the given problem.
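A minimal sketch of a dilated convolution, built by spreading out the entries of an ordinary filter (the function names are ours):

```python
import numpy as np
from scipy.ndimage import convolve

def dilate_filter(h, s):
    """Spread a k x k filter so its entries sit s pixels apart: a 3 x 3
    filter with dilation s then covers a (2s + 1) x (2s + 1) neighborhood
    while keeping only nine trainable values."""
    k = h.shape[0]
    dilated = np.zeros(((k - 1) * s + 1, (k - 1) * s + 1))
    dilated[::s, ::s] = h
    return dilated

def dilated_conv(image, h, s):
    """Convolve with the dilated filter; s = 1 is a standard convolution."""
    return convolve(image, dilate_filter(h, s), mode="reflect")
```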
Dense Connections.
When using convolutions with reflective boundaries, the mixed-scale approach has an additional advantage compared with standard scaling: All network feature maps have the same number of rows and columns as the input and output image, i.e., identical dimensions throughout the network. As a result, when computing a feature map, we are not restricted to using only the output of the previous layer: We can densely connect the layers, computing each feature map from the feature maps of all earlier layers and the input image.
In a densely connected network, all feature maps are maximally (re)used: If a certain useful feature is detected in a feature map, it does not have to be replicated in other layers to be used deeper in the network, as in other DCNN architectures. As a result, significantly fewer feature maps and trainable parameters are required to achieve the same accuracy in densely connected networks compared with standard networks. The smaller number of maps and parameters makes it easier to train densely connected networks, reducing the risk of overfitting and enabling effective training with relatively small training sets. Recently, a similar dense-connection architecture was proposed which relied on a relatively small number of parameters (25); however, in ref. 25 the dense connections were used only within small sets of layers at a single scale, with traditional downscaling and upscaling operations to acquire information at different scales. Here, we combine dense connections with the mixed-scale approach, enabling dense connections between the feature maps of the entire network, resulting in more efficient use of all feature maps and an even larger reduction of the number of required parameters.
MS-D Neural Networks.
By combining mixed-scale dilated convolutions and dense connections, we can define a DCNN architecture that we call the MS-D network architecture. Similar to existing architectures, an MS-D network consists of several layers of feature maps. Each feature map is the result of applying the same set of operations described above to all previous feature maps: dilated convolutions with layer- and channel-dependent dilations, followed by pixel-wise summation, the addition of a bias, and a nonlinear operation. The final network output is computed by applying standard (nondilated) 1 × 1 convolutions to all feature maps, followed by a task-dependent final nonlinearity.
Fig. 3. Schematic representation of an MS-D network. Colored lines represent dilated convolutions, with each color corresponding to a different dilation; all feature maps are densely connected.
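Combining both ingredients, a forward pass of an MS-D network with one channel per layer (w = 1, the configuration used in the experiments below) can be sketched as follows, reusing dilated_conv from the sketch above. The parameter layout and the ReLU nonlinearity are our assumptions.

```python
import numpy as np

def msd_forward(x, filters, biases, dilations, out_weights, out_bias):
    """Forward pass of an MS-D network with one channel per layer (w = 1).

    x:           input image, shape (m, n)
    filters:     filters[i] has shape (i + 1, k, k) -- one filter for each
                 earlier feature map, since all layers are densely connected
    dilations:   dilations[i] is the dilation used by layer i
    out_weights: one 1 x 1 weight per feature map for the final output
    """
    maps = [x]  # the input image itself is densely connected onward
    for i in range(len(filters)):
        acc = np.zeros(x.shape)
        for h, z in zip(filters[i], maps):  # use ALL earlier feature maps
            acc += dilated_conv(z, h, dilations[i])
        maps.append(np.maximum(acc + biases[i], 0.0))  # bias + ReLU
    # Final output: nondilated 1 x 1 combination of every feature map,
    # followed by a task-dependent nonlinearity (not shown here).
    return sum(w * z for w, z in zip(out_weights, maps)) + out_bias
```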
Compared with existing DCNN architectures, the MS-D network architecture has several advantages. Due to the mixing of scales through dilated convolutions and dense connections, MS-D networks can produce accurate results with relatively few feature maps and trainable parameters. Furthermore, an MS-D network learns which combination of dilations to use during training, allowing the same network to be effectively applied to a wide variety of problems. Finally, all layers are connected to each other in the same way and computed using the same set of standard operations, making MS-D networks easier to implement, train, and use in practice. MS-D networks do not include learned scaling operations or advanced layer types to facilitate training and do not require architecture changes when being applied to different problems. These advantages can make MS-D networks applicable beyond semantic segmentation, with potential value in classification, detection, instance segmentation, and adversarial networks (16).
Experiments
Setup.
We implemented the MS-D architecture in Python, using PyCUDA (26) to enable GPU acceleration of computationally expensive parts such as convolutional operations. We note that existing frameworks such as TensorFlow (9) or Caffe (10) typically do not support the proposed mixed-scale approach well, since they assume that all channels of a certain feature map are computed in the same way. Furthermore, existing frameworks are mostly optimized for processing large numbers of relatively small images by efficiently implementing convolutions using large matrix multiplications (27). To allow the application of MS-D networks to problems with large images, we implemented the architecture using direct convolutions. Computations were performed on two workstations, with an NVidia GeForce GTX 1080 GPU and four NVidia Tesla K80 GPUs, respectively, all running CUDA 8.0.
In general, deeper networks tend to produce more accurate results than shallower networks (2). Because of the dense connections in MS-D networks, it is possible to effectively use networks that have many layers and few channels per layer, resulting in very deep networks with relatively few channels. Such very deep networks might be more difficult to train than shallower networks, as explained above. However, we did not observe such problems and were able to use the extreme case of each layer consisting of only one channel (w = 1), controlling the capacity of the network solely through the number of layers d.
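This single-channel configuration also makes the parameter count easy to work out. Under the assumption of 3 × 3 filters, a counting sketch follows (our bookkeeping convention, which may differ in minor details from the paper's):

```python
def msd_param_count(d, c_in=1, c_out=1, k=3):
    """Trainable parameters of an MS-D network with w = 1 and depth d."""
    # Layer i (0-based) convolves the c_in input channels plus the i
    # earlier layer outputs with k x k filters and adds a single bias.
    hidden = sum(k * k * (c_in + i) + 1 for i in range(d))
    # Output layer: one 1 x 1 weight per feature map and output channel,
    # plus one bias per output channel.
    output = c_out * (c_in + d) + c_out
    return hidden + output

print(msd_param_count(200, c_in=3, c_out=11))  # 186944, i.e., under 0.2M
```

The count grows only quadratically with the number of layers, which is why even very deep MS-D networks remain small compared with architectures containing millions of parameters.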
In segmentation problems with L labels, we represent correct outputs by images with L channels, with channel j set to 1 for pixels that are assigned to label j and set to 0 for other pixels. We use the soft-max activation function in the final output layer and use the ADAM optimization method (17) during training to minimize the cross-entropy between correct outputs and network outputs (5). To compare results of MS-D networks with existing architectures for segmentation problems, we use the global accuracy metric (4), defined as the percentage of correctly labeled pixels in the network output, and the class accuracy metric (4), computed by taking the average of the true positive rates for each individual label.
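Both metrics follow directly from their definitions; a straightforward sketch, with pred and true taken as integer label images:

```python
import numpy as np

def global_accuracy(pred, true):
    """Percentage of pixels whose predicted label matches the true label."""
    return 100.0 * np.mean(pred == true)

def class_accuracy(pred, true, n_labels):
    """Average of the per-label true positive rates."""
    rates = [np.mean(pred[true == j] == j)
             for j in range(n_labels) if np.any(true == j)]
    return 100.0 * np.mean(rates)

# pred and true are integer label images; for an L-channel soft-max
# output y, the predicted labels are pred = y.argmax(axis=0).
```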
Simulated Data.
In a first experiment, network input consists of single-channel images of randomly generated objects, with the goal of labeling each pixel according to the type of object it belongs to; an example is shown in Fig. 4.
Fig. 4. (A–C) Example of the segmentation problem of the simulated dataset, with (A) the single-channel input image, (B) the correct segmentation with labels indicated by color, and (C) the output of a trained MS-D network with 200 layers.
In Fig. 5, class accuracy for an independent test set of 100 images is shown as a function of the number of trainable parameters for MS-D networks with w = 1 and increasing depth d and for U-Net networks (5) with different numbers of scaling operations. The MS-D networks achieve high accuracy with significantly fewer trainable parameters than the U-Net networks.
Fig. 5. The class accuracy of a set of 100 simulated images (Fig. 4) as a function of the number of trainable parameters for the proposed MS-D network architecture and the popular U-Net architecture. For each U-Net network (U-Net-q), q indicates the number of scaling operations used. For the MS-D architecture, results are shown for dilations evenly distributed between 1 and 10.
CamVid Dataset.
Next, we compare results for the CamVid dataset (29), using 367 training, 101 validation, and 233 testing color images of road scenes with 360 × 480 pixels, with each pixel labeled with 1 of 11 classes, following the setup of ref. 4. We trained MS-D networks with 100 and 200 layers on this dataset and compare their results with those reported for popular existing architectures.
Table 1 shows global and class accuracies. The MS-D network achieves the highest global and class accuracy while using roughly 10 times fewer parameters than the best competing architecture. Furthermore, an MS-D network with 100 layers achieves similar accuracies to other network architectures while using 30–40 times fewer parameters.† Fig. 6 shows global accuracy during training for validation and training sets, for both the U-Net network and an MS-D network. The lack of improvement in validation set accuracy for the U-Net network, and its difference from the training set accuracy, indicates overfitting of the chosen training set. Due to its smaller number of trainable parameters, the MS-D network continues to improve validation set accuracy over more training iterations, with a significantly smaller difference from the training set accuracy, showing the reduced risk of overfitting of MS-D networks and their ability to train accurately with relatively small training sets. In addition, MS-D networks achieve accurate results without pretraining on additional large datasets, e.g., ImageNet (11), or relying on large pretrained networks, e.g., VGG (2).
Table 1. The number of trainable parameters (Pars) in millions (M), global accuracy (GA), and class accuracy (CA) for the CamVid test set
Fig. 6. The global accuracy of a U-Net network and an MS-D network as a function of the training epoch for the CamVid dataset. Given are the accuracies for the validation set (solid lines) and the training set (dashed lines).
Segmenting Biomedical Images.
To test whether an MS-D network can be easily applied to a new problem without adjustments, we use the same network parameters as above, with a 100-layer network, and train it to segment tomographic reconstructions of individual cells into distinct cellular structures, using manually segmented cells as training data; an example slice is shown in Fig. 7.
Fig. 7. (A–C) A tomographic slice of the test cell (A), with the corresponding manual segmentation (B) and output of an MS-D network with 100 layers (C).
To learn limited 3D features, we use five channels in the input image of the MS-D network: the current slice to be segmented and four adjacent slices. Of eight manual cell segmentations, we randomly chose six for training and one for validation and report results for the remaining cell. During training, we used a batch size of 10 images and stopped when the global accuracy for the validation cell no longer improved, retaining the network parameters with the best validation accuracy. Fig. 7C shows network output for the slice in Fig. 7A, showing high similarity to the manual segmentation. The remaining differences between network output and manual segmentation, indicated by an arrow in Fig. 7, typically occur at ambiguous cell structures (see Figs. S1 and S2 for additional results). The final global accuracy and class accuracy of the trained network for the test cell are 94.1% and 93.1%, respectively, indicating that identical MS-D networks can be successfully trained for different problems. Results for two other challenging problems are given in Figs. S3 and S4.
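One way to assemble such a five-channel input from a reconstructed volume is sketched below; the symmetric two-slices-per-side layout and the clamping at the volume boundaries are our assumptions.

```python
import numpy as np

def slice_stack(volume, i, radius=2):
    """Five-channel input for segmenting slice i of a 3D volume (radius=2):
    the slice itself plus `radius` neighbors on each side. Indices are
    clamped at the volume boundary -- an assumption, since the source does
    not specify how edge slices are handled."""
    idx = np.clip(np.arange(i - radius, i + radius + 1), 0, len(volume) - 1)
    return volume[idx]  # shape (2 * radius + 1, m, n)
```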
Denoising Large Tomographic Images.
Finally, we use the above architecture, changing only the nonlinear function of the final layer from the soft-max function to the identity, and train on the different task of denoising tomographic reconstructions of a fiber-reinforced minicomposite. A total of 2,160 images of the sample were reconstructed from 1,024 acquired projections, yielding relatively noise-free images that serve as training targets; reconstructions of the same slices from only 128 projections, which contain significant noise, serve as network input (Fig. 8). The trained network removes most of the noise while preserving small structures, as shown in Fig. 8C.
Fig. 8. (A–C) Tomographic images of a fiber-reinforced minicomposite, reconstructed using 1,024 projections (A) and 128 projections (B). In C, the output of an MS-D network with image B as input is shown. Bottom Right Insets in A–C show enlarged images of small regions indicated by red squares.
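The only architectural change between the segmentation and denoising tasks is thus the final nonlinearity; a sketch of this output stage, with a numerically stable soft-max (the surrounding training code is assumed):

```python
import numpy as np

def softmax(z):
    """Numerically stable soft-max over the channel axis (axis 0)."""
    e = np.exp(z - z.max(axis=0, keepdims=True))
    return e / e.sum(axis=0, keepdims=True)

def output_stage(z, task):
    # Segmentation: per-pixel label probabilities, trained with
    # cross-entropy. Denoising: identity, trained to reproduce the
    # high-quality (1,024-projection) reconstructions.
    return softmax(z) if task == "segmentation" else z
```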
Conclusions
We have presented a deep convolutional MS-D network architecture for image-processing problems, using dilated convolutions instead of traditional scaling operations to learn features at different scales, using multiple scales in each layer, and computing the feature map of each layer using all feature maps of earlier layers, resulting in a densely connected network. By combining dilated convolutions and dense connections, the MS-D network architecture can achieve accurate results with significantly fewer feature maps and trainable parameters than existing architectures, enabling accurate training with relatively small training sets. MS-D networks are able to automatically adapt by learning which combination of dilations to use, allowing identical MS-D networks to be applied to a wide range of different problems.
Acknowledgments
Supporting contributions were performed by A. Ekman, C. Larabell [supported by the National Institutes of Health–National Institute on Drug Abuse (NIH-NIDA) Grant U01DA040582], S. Mo, O. Jain, D. Parkinson, A. MacDowell, and D. Ushizima [Center for Advanced Mathematics for Energy Research Applications (CAMERA)]. Computer calculations were performed at the Lawrence Berkeley National Laboratory under Contract DE-AC02-05CH11231. Tomographic images of the fiber-reinforced minicomposite were provided by N. Larson and collected at the US Department of Energy’s (DOE) Advanced Light Source (ALS) Beamline 8.3.2. This work was supported by CAMERA, jointly funded by the Office of Advanced Scientific Computing Research (ASCR) and the Office of Basic Energy Sciences (BES) within the DOE’s Office of Science. Soft X-ray tomography data were collected and segmented at the National Center for X-ray Tomography, supported by the National Institutes of Health–National Institute of General Medical Sciences (NIH-NIGMS) Grant P41GM103445 and the Department of Energy, Office of Biological and Environmental Research, Contract DE-AC02-05CH11231.
Footnotes
1To whom correspondence should be addressed. Email: sethian@math.berkeley.edu.
Author contributions: D.M.P. and J.A.S. designed research, performed research, and wrote the paper.
Reviewers: S.E.F., University of Southern California; and R.K., Technion–Israel Institute of Technology.
The authors declare no conflict of interest.
∗Alternatively, dilated convolutions can be defined without using dilated filters by changing the convolution operation itself; see ref. 23 for a detailed explanation.
†The authors of ref. 4 report improved results for the SegNet architecture, with 90.4% global accuracy, by training with a significantly larger set of around 3,500 images. However, since this larger set is not publicly available, we cannot directly compare this result with the MS-D network architecture.
This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.1073/pnas.1715832114/-/DCSupplemental.
Published under the PNAS license.
References
1. Agrawal P, Girshick R, Malik J (2014) Analyzing the performance of multilayer neural networks for object recognition. Proceedings of the European Conference on Computer Vision (ECCV).
2. Simonyan K, Zisserman A (2014) Very deep convolutional networks for large-scale image recognition. arXiv:1409.1556.
3. He K, Zhang X, Ren S, Sun J (2015) Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. Proceedings of the IEEE International Conference on Computer Vision (ICCV).
4. Badrinarayanan V, Kendall A, Cipolla R (2017) SegNet: A deep convolutional encoder–decoder architecture for image segmentation. IEEE Trans Pattern Anal Mach Intell 39:2481–2495.
5. Ronneberger O, Fischer P, Brox T (2015) U-Net: Convolutional networks for biomedical image segmentation. Medical Image Computing and Computer-Assisted Intervention (MICCAI 2015), eds Navab N, Hornegger J, Wells W, Frangi A (Springer).
6. Shelhamer E, Long J, Darrell T (2017) Fully convolutional networks for semantic segmentation. IEEE Trans Pattern Anal Mach Intell 39:640–651.
7. Nair V, Hinton GE (2010) Rectified linear units improve restricted Boltzmann machines. Proceedings of the 27th International Conference on Machine Learning (ICML), eds Fürnkranz J, Joachims T.
8. Srivastava N, Hinton G, Krizhevsky A, Sutskever I, Salakhutdinov R (2014) Dropout: A simple way to prevent neural networks from overfitting. J Mach Learn Res 15:1929–1958.
9. Abadi M, et al. (2016) TensorFlow: A system for large-scale machine learning. Proceedings of the 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI), eds Keeton K, Roscoe T.
10. Jia Y, et al. (2014) Caffe: Convolutional architecture for fast feature embedding. Proceedings of the 22nd ACM International Conference on Multimedia.
11. Deng J, et al. (2009) ImageNet: A large-scale hierarchical image database. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
12. LeCun Y, Bengio Y, Hinton G (2015) Deep learning. Nature 521:436–444.
13. Ioffe S, Szegedy C (2015) Batch normalization: Accelerating deep network training by reducing internal covariate shift. Proceedings of the 32nd International Conference on Machine Learning (ICML).
14. Srivastava RK, Greff K, Schmidhuber J (2015) Highway networks. arXiv:1505.00387.
15. He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
16. Isola P, Zhu JY, Zhou T, Efros AA (2017) Image-to-image translation with conditional adversarial networks. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
17. Kingma D, Ba J (2014) Adam: A method for stochastic optimization. arXiv:1412.6980.
18. Krizhevsky A, Sutskever I, Hinton GE (2012) ImageNet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems 25, eds Pereira F, Burges CJC, Bottou L, Weinberger KQ.
19. Larsson G, Maire M, Shakhnarovich G (2016) FractalNet: Ultra-deep neural networks without residuals. arXiv:1605.07648.
20. [Entry on the Hough transform; authors and title not recoverable from the source.]
21. [Entry on template matching; authors and title not recoverable from the source.]
22. Gerig G, Jomier M, Chakos M (2001) Valmet: A new validation tool for assessing and improving 3D object segmentation. Medical Image Computing and Computer-Assisted Intervention (MICCAI 2001), eds Niessen WJ, Viergever MA (Springer).
23. Yu F, Koltun V (2015) Multi-scale context aggregation by dilated convolutions. arXiv:1511.07122.
24. Ke TW, Maire M, Yu SX (2017) Multigrid neural architectures. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
25. Huang G, Liu Z, van der Maaten L, Weinberger KQ (2017) Densely connected convolutional networks. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
26. Klöckner A, et al. (2012) PyCUDA and PyOpenCL: A scripting-based approach to GPU run-time code generation. Parallel Comput 38:157–174.
27. Chetlur S, et al. (2014) cuDNN: Efficient primitives for deep learning. arXiv:1410.0759.
28. Akeret J, Chang C, Lucchi A, Refregier A (2017) Radio frequency interference mitigation using deep convolutional neural networks. Astron Comput 18:35–39.
29. Brostow GJ, Fauqueur J, Cipolla R (2009) Semantic object classes in video: A high-definition ground truth database. Pattern Recognit Lett 30:88–97.
30. Jarrett K, et al. (2009) What is the best multi-stage architecture for object recognition? Proceedings of the IEEE International Conference on Computer Vision (ICCV).
31. Ladický L, Sturgess P, Alahari K, Russell C, Torr P (2010) What, where and how many? Combining object detectors and CRFs. Proceedings of the European Conference on Computer Vision (ECCV).
32. Tighe J, Lazebnik S [title and venue not recoverable from the source].