Table 2.

Summary of datasets

| Dataset | Classes | Input | Description |
|---|---|---|---|
| CIFAR10 (26) | 10 | 32 row × 32 column × 3 RGB | Natural and manufactured objects in their environment |
| CIFAR100 (26) | 100 | 32 row × 32 column × 3 RGB | Natural and manufactured objects in their environment |
| SVHN (27) | 10 | 32 row × 32 column × 3 RGB | Single digits of house addresses from Google’s Street View |
| GTSRB (28) | 43 | 32 row × 32 column × 3 RGB | German traffic signs in multiple environments |
| Flickr-Logos32 (29) | 32 | 32 row × 32 column × 3 RGB | Localized corporate logos in their environment |
| VAD (30, 31) | 2 | 16 sample × 26 MFCC | Voice activity present or absent, with noise (TIMIT + NOISEX) |
| TIMIT class (30) | 39 | 32 sample × 16 MFCC × 3 delta | Phonemes from English speakers, with phoneme boundaries |
| TIMIT frame (30) | 39 | 16 sample × 39 MFCC | Phonemes from English speakers, without phoneme boundaries |
  • GTSRB and Flickr-Logos32 are cropped and/or downsampled from larger images. VAD and TIMIT datasets have Mel-frequency cepstral coefficients (MFCC) computed from 16-kHz audio data.
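
As a rough illustration of the input sizes implied by Table 2, the sketch below multiplies out each dataset's input dimensions. The dictionary `DATASETS` and the helper `input_size` are hypothetical names introduced only for this example. The class counts and shapes are copied from the table, and nothing here reflects how the authors load or preprocess the data.

```python
from math import prod

# (number of classes, input shape) copied from Table 2.
# The dictionary layout and names are assumptions made for illustration.
DATASETS = {
    "CIFAR10": (10, (32, 32, 3)),
    "CIFAR100": (100, (32, 32, 3)),
    "SVHN": (10, (32, 32, 3)),
    "GTSRB": (43, (32, 32, 3)),
    "Flickr-Logos32": (32, (32, 32, 3)),
    "VAD": (2, (16, 26)),
    "TIMIT class": (39, (32, 16, 3)),
    "TIMIT frame": (39, (16, 39)),
}


def input_size(name: str) -> int:
    """Return the number of scalar values in one input for the named dataset."""
    _, shape = DATASETS[name]
    return prod(shape)


if __name__ == "__main__":
    for name, (classes, shape) in DATASETS.items():
        dims = " x ".join(str(d) for d in shape)
        print(f"{name}: {classes} classes, input {dims} = {input_size(name)} values")
```

All five image datasets share the same 32 × 32 × 3 input, so each image carries 3,072 values, while the audio inputs are smaller: 416 values for VAD, 1,536 for TIMIT class, and 624 for TIMIT frame.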