Abstract: In this study, we propose a micro-expression recognition framework that combines a Vision Transformer (ViT) network, which is built on the multi-head self-attention mechanism, with a pre-trained model.