Visformer
Introduction
Visformer, or Vision-friendly Transformer, is an architecture that combines features of Transformer-based models with those of convolutional neural networks. Visformer adopts a stage-wise design for higher base performance, but uses self-attention only in the last two stages, since self-attention in the high-resolution stage is relatively inefficient even when the FLOPs are balanced. It employs bottleneck blocks in the first stage, with grouped 3 × 3 convolutions inside those blocks inspired by ResNeXt. It also adds BatchNorm to the patch embedding modules, as in CNNs. [2]
Figure 1. Network Configuration of Visformer [1]
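To get a feel for why self-attention is reserved for the later, lower-resolution stages, it can help to count attention-matrix entries per stage. The sketch below is illustrative only: the feature-map sides of 28, 14, and 7 are assumed values for a 224 × 224 input rather than figures taken from this page, and `attention_entries` is a hypothetical helper.

```python
def attention_entries(side: int) -> int:
    """Entries in one attention matrix when each position of a side x side map is a token."""
    tokens = side * side
    return tokens * tokens


# Assumed feature-map sides for three stages (illustrative, not from this page).
for side in (28, 14, 7):
    tokens = side * side
    print(f"{side}x{side} map: {tokens} tokens, {attention_entries(side)} attention entries")
```

Under these assumptions, each halving of the side length shrinks the attention matrix by a factor of 16, which is one reason attention is comparatively expensive at the highest resolution.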
Results
ImageNet-1k
Our reproduced model performance on ImageNet-1K is reported as follows.
| Model | Context | Top-1 (%) | Top-5 (%) | Params (M) | Recipe | Download |
|---|---|---|---|---|---|---|
| visformer_tiny | D910x8-G | 78.28 | 94.15 | 10.33 | | |
| visformer_tiny_v2 | D910x8-G | 78.82 | 94.41 | 9.38 | | |
| visformer_small | D910x8-G | 81.73 | 95.88 | 40.25 | | |
| visformer_small_v2 | D910x8-G | 82.17 | 95.90 | 23.52 | | |
Notes
Context: Training context denoted as {device}x{pieces}-{MS mode}, where the MindSpore mode can be G (graph mode) or F (pynative mode with ms function). For example, D910x8-G denotes training on 8 pieces of Ascend 910 NPU using graph mode.
Top-1 and Top-5: Accuracy reported on the validation set of ImageNet-1K.
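The Context notation can be unpacked mechanically. The following is a minimal sketch, assuming the string always follows the {device}x{pieces}-{MS mode} pattern described above; `TrainingContext` and `parse_context` are hypothetical names introduced for illustration and are not part of MindCV.

```python
import re
from typing import NamedTuple


class TrainingContext(NamedTuple):
    device: str  # e.g. "D910" for Ascend 910
    pieces: int  # number of devices used
    mode: str    # human-readable MindSpore mode


# Mode letters as described in the note above.
_MODES = {"G": "graph", "F": "pynative with ms function"}


def parse_context(context: str) -> TrainingContext:
    """Split a string such as 'D910x8-G' into device, piece count, and mode."""
    match = re.fullmatch(r"(?P<device>\w+?)x(?P<pieces>\d+)-(?P<mode>[GF])", context)
    if match is None:
        raise ValueError(f"unrecognised context string: {context!r}")
    return TrainingContext(
        device=match["device"],
        pieces=int(match["pieces"]),
        mode=_MODES[match["mode"]],
    )


print(parse_context("D910x8-G"))
```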
Quick Start
Preparation
Installation
Please refer to the installation instructions in MindCV.
Dataset Preparation
Please download the ImageNet-1K dataset for model training and validation.
Training
Distributed Training
It is easy to reproduce the reported results with the pre-defined training recipe. For distributed training on multiple Ascend 910 devices, please run:
# distributed training on multiple GPU/Ascend devices
mpirun -n 8 python train.py --config configs/visformer/visformer_tiny_ascend.yaml --data_dir /path/to/imagenet
Similarly, you can train the model on multiple GPU devices with the above mpirun command.
For a detailed description of all hyper-parameters, please refer to config.py.
Note: The global batch size (batch_size x num_devices) is an important hyper-parameter. To reproduce the results, keep the global batch size unchanged, or scale the learning rate linearly with the new global batch size.
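The linear adjustment mentioned in the note can be sketched as follows. The helper `scale_learning_rate` is hypothetical, and the numbers in the example (a learning rate of 0.001 and a per-device batch size of 128) are placeholders rather than values from the Visformer recipe.

```python
def scale_learning_rate(
    base_lr: float,
    base_global_batch: int,
    batch_size: int,
    num_devices: int,
) -> float:
    """Scale the learning rate in proportion to the new global batch size."""
    new_global_batch = batch_size * num_devices
    return base_lr * new_global_batch / base_global_batch


# Placeholder values: a recipe tuned for 128 x 8 = 1024, rerun on 4 devices.
new_lr = scale_learning_rate(base_lr=0.001, base_global_batch=128 * 8, batch_size=128, num_devices=4)
print(new_lr)
```

Halving the number of devices at a fixed per-device batch size halves the global batch size, so the learning rate is halved as well.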
Standalone Training
If you want to train or fine-tune the model on a smaller dataset without distributed training, please run:
# standalone training on a CPU/GPU/Ascend device
python train.py --config configs/visformer/visformer_tiny_ascend.yaml --data_dir /path/to/dataset --distribute False
Validation
To validate the accuracy of the trained model, use validate.py and pass the checkpoint path with --ckpt_path.
python validate.py -c configs/visformer/visformer_tiny_ascend.yaml --data_dir /path/to/imagenet --ckpt_path /path/to/ckpt
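For readers unfamiliar with the Top-1 and Top-5 metrics reported in the results, the toy sketch below shows how a top-k accuracy can be computed from per-class scores. It is not how validate.py is implemented; `topk_accuracy` and the scores and labels are made up for illustration.

```python
from typing import Sequence


def topk_accuracy(scores: Sequence[Sequence[float]], labels: Sequence[int], k: int) -> float:
    """Percentage of samples whose true label is among the k highest-scoring classes."""
    hits = 0
    for row, label in zip(scores, labels):
        # Class indices ordered from highest to lowest score.
        ranked = sorted(range(len(row)), key=lambda i: row[i], reverse=True)
        if label in ranked[:k]:
            hits += 1
    return 100.0 * hits / len(labels)


# Made-up scores for four samples over three classes.
scores = [
    [0.10, 0.70, 0.20],
    [0.50, 0.30, 0.20],
    [0.20, 0.30, 0.50],
    [0.40, 0.35, 0.25],
]
labels = [1, 1, 2, 2]

print(topk_accuracy(scores, labels, k=1))  # 50.0
print(topk_accuracy(scores, labels, k=2))  # 75.0
```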
Deployment
To deploy online inference services with the trained model efficiently, please refer to the deployment tutorial.
References
[1] Chen Z, Xie L, Niu J, et al. Visformer: The vision-friendly transformer. Proceedings of the IEEE/CVF International Conference on Computer Vision. 2021: 589-598.
[2] Visformer, https://paperswithcode.com/method/visformer