Visual Representation
- Masked Autoencoders Are Scalable Vision Learners-2021<Paper>
- Efficient Self-supervised Vision Transformers for Representation Learning-2021<Paper>
- Revisiting Contrastive Methods for Unsupervised Learning of Visual Representations-2021<Paper><Code-PyTorch>
Video Understanding
- ViViT: A Video Vision Transformer-Arxiv2021<Paper><Code-PyTorch>
- Video Swin Transformer-Arxiv2021<Paper><Code-PyTorch>
- An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale-ICLR2021<Paper><Code-PyTorch>
- A Survey of Transformers-Arxiv2021<Paper>
- Towards Efficient Cross-Modal Visual Textual Retrieval using Transformer-Encoder Deep Features-CBMI2021<Paper>
- ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision-ICML2021<Paper><Code-PyTorch>
- Thinking Fast and Slow: Efficient Text-to-Visual Retrieval with Transformers-CVPR2021<Paper>
- TEACHTEXT: CrossModal Generalized Distillation for Text-Video Retrieval-2021<Paper><Code-PyTorch>
- Cross-Modal Retrieval Augmentation for Multi-Modal Classification-2021<Paper>
- Continual learning in cross-modal retrieval-CVPR2021<Paper>
- Seeing Out of tHe bOx: End-to-End Pre-training for Vision-Language Representation Learning-CVPR2021<Paper><Code>
- Visual Semantic Role Labeling for Video Understanding-CVPR2021<Paper><Code-PyTorch>
- Perceiver: General Perception with Iterative Attention-2021<Paper>
- Revamping Cross-Modal Recipe Retrieval with Hierarchical Transformers and Self-supervised Learning-CVPR2021<Paper><Code-PyTorch>
- Hyperbolic Visual Embedding Learning for Zero-Shot Recognition-CVPR2020<Paper><Code-PyTorch>
- Retrieve Fast, Rerank Smart: Cooperative and Joint Approaches for Improved Cross-Modal Retrieval-2021<Paper><Code-PyTorch>
- What is Multimodality?<Paper>
- Multi-modal Transformer for Video Retrieval-ECCV2020<Paper><Code-PyTorch>
- Support-set Bottlenecks for Video-text Representation Learning-ICLR2021<Paper>
- Dual Encoding for Video Retrieval by Text-TPAMI2021<Paper><Code-PyTorch>
- Less is More: ClipBERT for Video-and-Language Learning via Sparse Sampling-CVPR2021<Paper><Code>
- VL-BERT: Pre-training of Generic Visual-Linguistic Representations-ICLR2020<Paper><Code-PyTorch>
- Transformer is All You Need: Multimodal Multitask Learning with a Unified Transformer-2021<Paper><Code>
- COOT: Cooperative Hierarchical Transformer for Video-Text Representation Learning-NeurIPS2020<Paper><Code-PyTorch>
- Fine-grained Video-Text Retrieval with Hierarchical Graph Reasoning-CVPR2020<Paper><Code-PyTorch>
- LXMERT: Learning Cross-Modality Encoder Representations from Transformers-EMNLP2019<Paper><Code-PyTorch>
- VisualBERT: A Simple and Performant Baseline for Vision and Language-2019<Paper><Code-PyTorch>
- Video SemNet: Memory-Augmented Video Semantic Network-NIPS2017<Paper>
- Self-Supervised Video Representation Learning by Pace Prediction-ECCV2020<Paper><Code-PyTorch>
- SalSum: Saliency-based Video Summarization using Generative Adversarial Networks-2020<Paper>
- Self-Supervised Temporal-Discriminative Representation Learning for Video Action Recognition-2020<Paper><Code-PyTorch><Zhihu>
- Classification of Important Segments in Educational Videos using Multimodal Features-CIKM2020<Paper><Code-Keras>
- Attentive and Adversarial Learning for Video Summarization-WACV2019<Paper><Code-PyTorch>
- Digital Video Summarization Techniques: A Survey-2020<Paper>
- Emerging Trends of Multimodal Research in Vision and Language-2020<Paper>
- Exploring global diverse attention via pairwise temporal relation for video summarization-2020<Paper>
- Multi-modal Dense Video Captioning-CVPR Workshops 2020<Paper><Code-PyTorch><Project>
- Accuracy and Performance Comparison of Video Action Recognition Approaches-HPEC2020<Paper>
- What Makes Training Multi-Modal Classification Networks Hard?-CVPR2020<Paper>
- [DMASum] Query Twice: Dual Mixture Attention Meta Learning for Video Summarization-ACMMM2020<Paper>
- [CHAN] Convolutional Hierarchical Attention Network for Query-Focused Video Summarization<Paper><Code-PyTorch>
- [ILS-SUMM] ILS-SUMM: Iterated Local Search for Unsupervised Video Summarization-ICPR2020<Paper><Code>
- Deep Reinforcement Learning for Unsupervised Video Summarization with Diversity-Representativeness Reward-AAAI2018<Paper><Code-PyTorch>
- [SUM-GAN] Unsupervised Video Summarization with Adversarial LSTM Networks-CVPR2017<Paper><Code-PyTorch>
- Enhancing Video Summarization via Vision-Language Embedding-CVPR2017<Paper>
- Query-adaptive Video Summarization via Quality-aware Relevance Estimation-ICCV2017<Paper><Code-Theano>
- Temporal Tessellation: A Unified Approach for Video Analysis-ICCV2017<Paper><Code-Tensorflow>
- Video Summarization with Long Short-term Memory-ECCV2016<Paper><Code-Theano><Code-Keras>
- Video Summarization using Deep Semantic Features-ACCV2016<Paper><Code-Chainer>
- [SA-LSTM] Describing Videos by Exploiting Temporal Structure-ICCV2015<Paper><Code-Theano><Code-PyTorch>
- [3D-ResNet] Can Spatiotemporal 3D CNNs Retrace the History of 2D CNNs and ImageNet?-CVPR2018<Paper><Code-PyTorch>
- [Hidden Two-Stream] Hidden Two-Stream Convolutional Networks for Action Recognition-ACCV2018<Paper><Code-PyTorch>
- [FlowNet2] FlowNet 2.0: Evolution of Optical Flow Estimation with Deep Networks-CVPR2017<Paper(https://arxiv.org/pdf/1612.01925.pdf)><Code-PyTorch>
- [TSN] Temporal Segment Networks: Towards Good Practices for Deep Action Recognition-ECCV2016<Paper><Code-Caffe><Code-PyTorch>
- Towards Good Practices for Very Deep Two-Stream ConvNets-Arxiv2015<Paper><Code-Caffe>
- [Two-Stream] Two-Stream Convolutional Networks for Action Recognition in Videos-NIPS2014<Paper>
- [C3D] Learning Spatiotemporal Features with 3D Convolutional Networks-ICCV2015<Paper><Code-Caffe><Code-Tensorflow><Code-PyTorch>
- [NetVLAD] NetVLAD: CNN architecture for weakly supervised place recognition-CVPR2016<Paper><Code-Matlab><Code-PyTorch>
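The entries above link to the authors' reference implementations. As a quick orientation for the pooling idea behind the [NetVLAD] entry that closes this list, below is a minimal, illustrative PyTorch sketch of a NetVLAD layer (soft cluster assignment via a 1x1 convolution, followed by aggregated residuals to learnable centres). The cluster count, feature dimension, and parameter names are assumptions for the example and do not mirror the authors' released Matlab/PyTorch code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NetVLAD(nn.Module):
    """Minimal NetVLAD pooling layer: soft-assignment + residual aggregation."""
    def __init__(self, num_clusters=64, dim=512):
        super().__init__()
        self.num_clusters = num_clusters
        self.dim = dim
        # 1x1 conv produces per-location soft-assignment logits for each cluster
        self.assign = nn.Conv2d(dim, num_clusters, kernel_size=1, bias=True)
        # Learnable cluster centres (illustrative initialisation)
        self.centroids = nn.Parameter(torch.randn(num_clusters, dim) * 0.01)

    def forward(self, x):
        # x: (N, D, H, W) local descriptors from a CNN backbone
        n, d, h, w = x.shape
        soft_assign = F.softmax(self.assign(x), dim=1)            # (N, K, H, W)
        soft_assign = soft_assign.view(n, self.num_clusters, -1)  # (N, K, HW)
        x_flat = x.view(n, d, -1)                                  # (N, D, HW)

        # Assignment-weighted residuals to every centre:
        # vlad[n, k, d] = sum_i a[n, k, i] * (x[n, d, i] - c[k, d])
        vlad = torch.einsum('nki,ndi->nkd', soft_assign, x_flat) \
             - soft_assign.sum(dim=-1).unsqueeze(-1) * self.centroids.unsqueeze(0)

        vlad = F.normalize(vlad, p=2, dim=2)   # intra-normalisation per cluster
        vlad = vlad.view(n, -1)                # flatten to (N, K*D)
        return F.normalize(vlad, p=2, dim=1)   # final L2 normalisation

# Usage: pool a (batch, 512, 7, 7) feature map into a (batch, 64*512) descriptor
features = torch.randn(2, 512, 7, 7)
pooled = NetVLAD(num_clusters=64, dim=512)(features)
print(pooled.shape)  # torch.Size([2, 32768])
```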