ConViT¶
Introduction¶
ConViT combines the strengths of convolutional architectures and Vision Transformers (ViTs). ConViT introduces gated positional self-attention (GPSA), a form of positional self-attention that can be equipped with a “soft” convolutional inductive bias. ConViT initializes the GPSA layers to mimic the locality of convolutional layers, then gives each attention head the freedom to escape locality by adjusting a gating parameter regulating the attention paid to position versus content information. ConViT, outperforms the DeiT (Touvron et al., 2020) on ImageNet, while offering a much improved sample efficiency.[1]
Figure 1. Architecture of ConViT [1]
Results¶
Our reproduced model performance on ImageNet-1K is reported as follows.
| Model | Context | Top-1 (%) | Top-5 (%) | Params (M) | Recipe | Download |
|---|---|---|---|---|---|---|
| convit_tiny | D910x8-G | 73.66 | 91.72 | 5.71 | yaml | weights |
| convit_tiny_plus | D910x8-G | 77.00 | 93.60 | 9.97 | yaml | weights |
| convit_small | D910x8-G | 81.63 | 95.59 | 27.78 | yaml | weights |
| convit_small_plus | D910x8-G | 81.80 | 95.42 | 48.98 | yaml | weights |
| convit_base | D910x8-G | 82.10 | 95.52 | 86.54 | yaml | weights |
| convit_base_plus | D910x8-G | 81.96 | 95.04 | 153.13 | yaml | weights |
Notes¶
Context: Training context denoted as {device}x{pieces}-{MS mode}, where mindspore mode can be G - graph mode or F - pynative mode with ms function. For example, D910x8-G is for training on 8 pieces of Ascend 910 NPU using graph mode.
Top-1 and Top-5: Accuracy reported on the validation set of ImageNet-1K.
Quick Start¶
Preparation¶
Installation¶
Please refer to the installation instruction in MindCV.
Dataset Preparation¶
Please download the ImageNet-1K dataset for model training and validation.
Training¶
Distributed Training
It is easy to reproduce the reported results with the pre-defined training recipe. For distributed training on multiple Ascend 910 devices, please run
# distrubted training on multiple GPU/Ascend devices
mpirun -n 8 python train.py --config configs/convit/convit_tiny_ascend.yaml --data_dir /path/to/imagenet
If the script is executed by the root user, the
--allow-run-as-rootparameter must be added tompirun.
Similarly, you can train the model on multiple GPU devices with the above mpirun command.
For detailed illustration of all hyper-parameters, please refer to config.py.
Note: As the global batch size (batch_size x num_devices) is an important hyper-parameter, it is recommended to keep the global batch size unchanged for reproduction or adjust the learning rate linearly to a new global batch size.
Standalone Training
If you want to train or finetune the model on a smaller dataset without distributed training, please run:
# standalone training on a CPU/GPU/Ascend device
python train.py --config configs/convit/convit_tiny_ascend.yaml --data_dir /path/to/dataset --distribute False
Validation¶
To validate the accuracy of the trained model, you can use validate.py and parse the checkpoint path with --ckpt_path.
python validate.py -c configs/convit/convit_tiny_ascend.yaml --data_dir /path/to/imagenet --ckpt_path /path/to/ckpt
Deployment¶
Please refer to the deployment tutorial in MindCV.
References¶
[1] d’Ascoli S, Touvron H, Leavitt M L, et al. Convit: Improving vision transformers with soft convolutional inductive biases[C]//International Conference on Machine Learning. PMLR, 2021: 2286-2296.