Vision Transformer

Introduced by Dosovitskiy et al. in An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

The Vision Transformer, or ViT, is a model for image classification that employs a Transformer-like architecture over patches of the image. An image is split into fixed-size patches, each of them are then linearly embedded, position embeddings are added, and the resulting sequence of vectors is fed to a standard Transformer encoder. In order to perform classification, the standard approach of adding an extra learnable “classification token” to the sequence is used.

Source: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

Read Paper See Code

Papers

Paper	Code	Results	Date	Stars

Tasks

Task	Papers	Share
Semantic Segmentation	85	9.55%
Image Classification	63	7.08%
Object Detection	34	3.82%
Self-Supervised Learning	33	3.71%
Decoder	26	2.92%
Image Segmentation	23	2.58%
Classification	18	2.02%
Instance Segmentation	16	1.80%
Retrieval	15	1.69%

Usage Over Time

This feature is experimental; we are continuously improving our matching algorithm.

Components

Component	Type	Add Remove
Dense Connections	Feedforward Networks
Layer Normalization	Normalization
Multi-Head Attention	Attention Modules
Residual Connection	Skip Connections
Scaled Dot-Product Attention	Attention Mechanisms

Categories

Add Remove

Image Models

Vision Transformers