The Unreasonable Effectiveness of Fully-Connected Layers for Low-Data Regimes

NeurIPS 2022

Peter Kocsis
TU Munich
Peter Súkeník
IST Austria
Guillem Brasó
TU Munich
Matthias Nießner
TU Munich
Laura Leal-Taixé
TU Munich
Ismail Elezi
TU Munich


Convolutional neural networks were the standard for solving many computer vision tasks until recently, when Transformers or MLP-based architectures started to show competitive performance. These architectures typically have a vast number of weights and need to be trained on massive datasets; hence, they are not suitable for use in low-data regimes. In this work, we propose a simple yet effective framework to improve generalization from small amounts of data. We augment modern CNNs with fully-connected (FC) layers and show the massive impact this architectural change has in low-data regimes. We further present an online joint knowledge-distillation method that uses the extra FC layers at train time but avoids them at test time. This allows us to improve the generalization of a CNN-based model without any increase in the number of weights at test time. We perform classification experiments for a large range of network backbones and several standard datasets on supervised learning and active learning. Our method significantly outperforms the networks without fully-connected layers, reaching a relative improvement of up to 16% in validation accuracy in the supervised setting without adding any extra parameters during inference.


Feature Refiner

We propose a simple yet effective framework for improving generalization from a small amount of data. In our work, we bring back fully-connected layers at the end of CNN-based architectures. We show that by adding as little as 0.37% extra parameters during training, we can significantly improve generalization in the low-data regime. Our network architecture consists of two main parts: a convolutional backbone network and our proposed Feature Refiner (FR) based on multi-layer perceptrons. Our method is task- and model-agnostic and can be applied to many convolutional networks. In our method, we extract features with the convolutional backbone network. Then, we apply our FR followed by a task-specific head. More precisely, we first reduce the feature dimension from d_bbf to d_frf with a single linear layer to limit the number of extra parameters. Then we apply a symmetric two-layer multi-layer perceptron wrapped in normalization layers.
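The data flow described above (linear reduction, normalization, two-layer MLP, normalization) can be sketched in a few lines of NumPy. This is only an illustrative forward pass, not the authors' implementation. The class name `FeatureRefiner`, the choice of layer normalization without learned scale and shift, the ReLU activation, the hidden width equal to d_frf, and the dimensions 512 and 64 are all assumptions made for the example.

```python
import numpy as np


def layer_norm(x, eps=1e-5):
    """Normalize each row to zero mean and unit variance (no learned scale/shift)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


class FeatureRefiner:
    """Hypothetical sketch: reduce d_bbf -> d_frf, then norm -> 2-layer MLP -> norm."""

    def __init__(self, d_bbf, d_frf, seed=0):
        rng = np.random.default_rng(seed)

        def init(fan_in, fan_out):
            # He-style initialization; an assumption, not taken from the source.
            return rng.normal(0.0, np.sqrt(2.0 / fan_in), size=(fan_in, fan_out))

        # Single linear layer that shrinks the backbone feature dimension.
        self.w_reduce, self.b_reduce = init(d_bbf, d_frf), np.zeros(d_frf)
        # "Symmetric" MLP: assumed here to mean input, hidden and output widths all equal d_frf.
        self.w1, self.b1 = init(d_frf, d_frf), np.zeros(d_frf)
        self.w2, self.b2 = init(d_frf, d_frf), np.zeros(d_frf)

    def __call__(self, backbone_features):
        x = backbone_features @ self.w_reduce + self.b_reduce
        x = layer_norm(x)                           # normalization before the MLP
        x = np.maximum(x @ self.w1 + self.b1, 0.0)  # first MLP layer + ReLU (assumed)
        x = x @ self.w2 + self.b2                   # second MLP layer
        return layer_norm(x)                        # normalization after the MLP

    def num_parameters(self):
        params = (self.w_reduce, self.b_reduce, self.w1, self.b1, self.w2, self.b2)
        return sum(p.size for p in params)


# Usage with illustrative sizes: a batch of 8 backbone feature vectors of width 512.
refiner = FeatureRefiner(d_bbf=512, d_frf=64)
features = np.random.default_rng(1).normal(size=(8, 512))
refined = refiner(features)
print(refined.shape)             # (8, 64)
print(refiner.num_parameters())  # 41152
```

The reduction layer dominates the parameter count in this sketch (512 x 64 weights versus 64 x 64 per MLP layer), which shows why shrinking the feature dimension first keeps the extra training-time parameters small. The refined features would then be passed to a task-specific head, which is omitted here.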

Online Joint Knowledge Distillation

One could argue that using more parameters can improve the performance just because of the increased expressivity of the network. To disprove this argument, we develop an online joint knowledge distillation (OJKD) method. Our OJKD enables us to use the exact same architecture as our baseline networks during inference and utilizes our FR solely during training.
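This section does not spell out the OJKD objective, so the following is only a generic sketch of how a joint distillation loss of this kind could look, under stated assumptions. A "teacher" head sits on top of the FR features and a "student" head sits directly on the backbone features; the teacher is trained with cross-entropy while the student is pulled toward the teacher's predictions with a KL term. The function names, the equal weighting of the two terms, and the temperature parameter are hypothetical choices for illustration.

```python
import numpy as np


def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax over the last axis."""
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))


def cross_entropy(logits, labels):
    """Mean cross-entropy between logits and integer class labels."""
    log_probs = log_softmax(logits)
    return -log_probs[np.arange(len(labels)), labels].mean()


def distillation_kl(student_logits, teacher_logits, temperature=1.0):
    """Mean KL(teacher || student) between the two predicted distributions."""
    log_p_teacher = log_softmax(teacher_logits, temperature)
    log_p_student = log_softmax(student_logits, temperature)
    kl_per_sample = (np.exp(log_p_teacher) * (log_p_teacher - log_p_student)).sum(axis=-1)
    return kl_per_sample.mean()


def joint_distillation_loss(student_logits, teacher_logits, labels, temperature=1.0):
    """Assumed objective: supervise the FR (teacher) head, distill into the student head.

    In an autodiff framework the teacher logits would typically be treated as
    constants inside the KL term; that detail is an assumption, not from the source.
    """
    supervised = cross_entropy(teacher_logits, labels)
    distill = distillation_kl(student_logits, teacher_logits, temperature)
    return supervised + distill


# Illustrative batch: 4 samples, 10 classes, random logits for both heads.
rng = np.random.default_rng(0)
teacher_logits = rng.normal(size=(4, 10))  # head on top of the Feature Refiner
student_logits = rng.normal(size=(4, 10))  # head directly on backbone features
labels = np.array([3, 1, 7, 0])
loss = joint_distillation_loss(student_logits, teacher_logits, labels)

# At inference only the student head is evaluated, so the FR adds no test-time weights.
predictions = student_logits.argmax(axis=-1)
```

Because both heads are trained in the same run, there is no separate teacher pre-training stage, which is what "online" and "joint" suggest here.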


We compare the results of our method with those of ResNet18. On the first training cycle (1000 labels), our method outperforms ResNet18 by 7.6 percentage points (pp). On the second cycle, we outperform ResNet18 by more than 10 pp. We keep outperforming ResNet18 until the seventh cycle, where our improvement is half a percentage point. For the remaining iterations, both methods reach the same accuracy. A common tendency for all datasets is that with an increasing number of labeled samples, the gap between our method and the baseline shrinks. Therefore, dropping the fully-connected layers when a large labeled dataset is available does not cause any disadvantage, as was found in [6]. However, that work did not analyze this question in the low-data regime, where using FC layers after CNN architectures is clearly beneficial.
We also check whether our method can be used with backbones other than ResNet18. The goal of the experiment is to show that our method is backbone-agnostic and generalizes both to different versions of ResNet and to other types of convolutional neural networks. In these experiments, our method significantly outperforms the baselines on both datasets and for all three types of backbones.


author = {Peter Kocsis
and Peter S\'{u}ken\'{i}k
and Guillem Bras\'{o}
and Matthias Nie{\ss}ner
and Laura Leal-Taix\'{e}
and Ismail Elezi},
title = {The Unreasonable Effectiveness of Fully-Connected Layers for Low-Data Regimes},
booktitle = {Proc. NeurIPS},