Visual Representation

  • Masked Autoencoders Are Scalable Vision Learners-2021<Paper>
  • Efficient Self-supervised Vision Transformers for Representation Learning-2021<Paper>
  • Revisiting Contrastive Methods for Unsupervised Learning of Visual Representations-2021<Paper><Code-PyTorch>

Video Understanding

  • ViViT: A Video Vision Transformer-Arxiv2021<Paper><Code-PyTorch>
  • Video Swin Transformer-Arxiv2021<Paper><Code-PyTorch>
  • An Image Is Worth 16x16 Words: Transformers for Image Recognition at Scale-ICLR2021<Paper><Code-PyTorch>
  • A Survey of Transformers-Arxiv2021<Paper>
  • Towards Efficient Cross-Modal Visual Textual Retrieval using Transformer-Encoder Deep Features-CBMI2021<Paper>
  • ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision-ICML2021<Paper><Code-PyTorch>
  • Thinking Fast and Slow: Efficient Text-to-Visual Retrieval with Transformers-CVPR2021<Paper>
  • TEACHTEXT: CrossModal Generalized Distillation for Text-Video Retrieval-2021<Paper><Code-PyTorch>
  • Cross-Modal Retrieval Augmentation for Multi-Modal Classification-2021<Paper>
  • Continual learning in cross-modal retrieval-CVPR2021<Paper>
  • Seeing Out of tHe bOx: End-to-End Pre-training for Vision-Language Representation Learning-CVPR2021<Paper><Code>
  • Visual Semantic Role Labeling for Video Understanding-CVPR2021<Paper><Code-PyTorch>
  • Perceiver: General Perception with Iterative Attention-2021<Paper>
  • Revamping Cross-Modal Recipe Retrieval with Hierarchical Transformers and Self-supervised Learning-CVPR2021<Paper><Code-PyTorch>
  • Hyperbolic Visual Embedding Learning for Zero-Shot Recognition-CVPR2020<Paper><Code-PyTorch>
  • Retrieve Fast, Rerank Smart: Cooperative and Joint Approaches for Improved Cross-Modal Retrieval-2021<Paper><Code-PyTorch>
  • What is Multimodality?<Paper>
  • Multi-modal Transformer for Video Retrieval-ECCV2020<Paper><Code-PyTorch>
  • Support-set Bottlenecks for Video-text Representation Learning-ICLR2021<Paper>
  • Dual Encoding for Video Retrieval by Text-TPAMI2021<Paper><Code-PyTorch>
  • Less is More: ClipBERT for Video-and-Language Learning via Sparse Sampling-CVPR2021<Paper><Code>
  • VL-BERT: Pre-training of Generic Visual-Linguistic Representations-ICLR2020<Paper><Code-PyTorch>
  • Transformer is All You Need: Multimodal Multitask Learning with a Unified Transformer-2021<Paper><Code>
  • COOT: Cooperative Hierarchical Transformer for Video-Text Representation Learning-NeurIPS2020<Paper><Code-PyTorch>
  • Fine-grained Video-Text Retrieval with Hierarchical Graph Reasoning-CVPR2020<Paper><Code-PyTorch>
  • LXMERT: Learning Cross-Modality Encoder Representations from Transformers-EMNLP2019<Paper><Code-PyTorch>
  • VisualBERT: A Simple and Performant Baseline for Vision and Language-2019<Paper><Code-PyTorch>
  • Video SemNet: Memory-Augmented Video Semantic Network-NIPS2017<Paper>
  • Self-Supervised Video Representation Learning by Pace Prediction-ECCV2020<Paper><Code-PyTorch>
  • SalSum: Saliency-based Video Summarization using Generative Adversarial Networks-2020<Paper>
  • Self-Supervised Temporal-Discriminative Representation Learning for Video Action Recognition-2020<Paper><Code-PyTorch><Zhihu>
  • Classification of Important Segments in Educational Videos using Multimodal Features-CIKM2020<Paper><Code-Keras>
  • Attentive and Adversarial Learning for Video Summarization-WACV2019<Paper><Code-PyTorch>
  • Digital Video Summarization Techniques: A Survey-2020<Paper>
  • Emerging Trends of Multimodal Research in Vision and Language-2020<Paper>
  • Exploring global diverse attention via pairwise temporal relation for video summarization-2020<Paper>
  • Multi-modal Dense Video Captioning-CVPR Workshops 2020<Paper><Code-PyTorch><Project>
  • Accuracy and Performance Comparison of Video Action Recognition Approaches-HPEC2020<Paper>
  • What Makes Training Multi-Modal Classification Networks Hard?-CVPR2020<Paper>
  • [DMASum] Query Twice: Dual Mixture Attention Meta Learning for Video Summarization-ACM2020<Paper>
  • [CHAN] Convolutional Hierarchical Attention Network for Query-Focused Video Summarization<Paper><Code-PyTorch>
  • [ILS-SUMM] ILS-SUMM: Iterated Local Search for Unsupervised Video Summarization-ICPR2020<Paper><Code>
  • Deep Reinforcement Learning for Unsupervised Video Summarization with Diversity-Representativeness Reward-AAAI2018<Paper><Code-PyTorch>
  • [SUM-GAN] Unsupervised video summarization with adversarial lstm networks-CVPR2017<Paper><Code-PyTorch>
  • Enhancing Video Summarization via Vision-Language Embedding-CVPR2017<Paper>
  • Query-adaptive Video Summarization via Quality-aware Relevance Estimation-ICCV2017<Paper><Code-Theano>
  • Temporal Tessellation: A Unified Approach for Video Analysis-ICCV2017<Paper><Code-Tensorflow>
  • Video Summarization with Long Short-term Memory-ECCV2016<Paper><Code-Theano><Code-Keras>
  • Video Summarization using Deep Semantic Features-ACCV2016<Paper><Code-Chainer>
  • [SA-LSTM] Describing Videos by Exploiting Temporal Structure-ICCV2015<Paper><Code-Theano><Code-PyTorch>
  • [3D-ResNet] Can Spatiotemporal 3D CNNs Retrace the History of 2D CNNs and ImageNet?-CVPR2018<Paper><Code-PyTorch>
  • [Hidden Two-Stream] Hidden Two-Stream Convolutional Networks for Action Recognition-ACCV2018<Paper><Code-PyTorch>
  • [FlowNet2] FlowNet 2.0: Evolution of Optical Flow Estimation with Deep Networks<Paper(https://arxiv.org/pdf/1612.01925.pdf)><Code-PyTorch>
  • [TSN] Temporal Segment Networks: Towards Good Practices for Deep Action Recognition-ECCV2016<Paper><Code-Caffe><Code-PyTorch>
  • Towards Good Practices for Very Deep Two-Stream ConvNets<Paper><Code-Caffe>
  • [Two-Stream] Two-Stream Convolutional Networks for Action Recognition in Videos-NIPS2014<Paper>
  • [C3D] Learning Spatiotemporal Features with 3D Convolutional Networks-ICCV2015<Paper><Code-Caffe><Code-Tensorflow><Code-PyTorch>
  • [NetVLAD] NetVLAD: CNN architecture for weakly supervised place recognition-CVPR2016<Paper><Code-Matlab><Code-PyTorch>

TODO

  • Video-based temporal modeling and action recognition<Talk>
    Keywords: ARTNet, TSN, UntrimmedNet

  • Squeeze-and-Excitation Networks<Talk>

  • Optical-flow-based video semantic segmentation and object detection<Talk>
    Keywords: Optical flow, Feature propagation, Feature aggregation

  • From Faster R-CNN to Mask R-CNN<Talk>
    Key functions: Classification, Localization, Mask (per-pixel) classification, Landmark localization

  • Understanding and leveraging the internal representations of CNNs<Talk>
    Outline: Visualizing the internal units & Weakly supervised localization and class-specific saliency
    Keywords: neuron responses, internal representations, object representations, visualization, interpretability


Using Audio Information in Video Understanding

Reading audio & visualizing the waveform

import matplotlib
matplotlib.use('agg')  # use a non-interactive backend so plots render on a headless server
import matplotlib.pyplot as plt
import soundfile as sf
import librosa.display

wav_file = 'audio.wav'  # placeholder path to the input wav file
wav_data, sr = sf.read(wav_file, dtype='int16')
samples = wav_data / 32768.0  # convert int16 samples to floats in [-1.0, +1.0]

plt.figure()
plt.title('Waveform')
librosa.display.waveplot(samples, sr=sr)  # use librosa.display.waveshow() on librosa >= 0.10
plt.savefig('wav.png')


The waveform plot merely gives a direct view of the one-dimensional time-domain signal; by itself it carries little useful information.

Spectrogram

A spectrogram, simply put, is obtained by applying the short-time Fourier transform (STFT) to the audio signal, yielding a heat map that encodes three dimensions of information: time, frequency, and energy.

import librosa
import librosa.display
import matplotlib
matplotlib.use('agg')
import matplotlib.pyplot as plt
import numpy as np

# magnitude spectrogram via STFT; `samples` and `sr` come from the loading snippet above
# (librosa.stft expects a mono, 1-D signal)
spectrogram = np.abs(librosa.stft(samples))

plt.figure()
# convert amplitude to dB so the colorbar's dB format is meaningful
librosa.display.specshow(librosa.amplitude_to_db(spectrogram, ref=np.max),
                         sr=sr, x_axis='time', y_axis='linear')
plt.colorbar(format='%+2.0f dB')
plt.title('Spectrogram')
plt.savefig('spec.png')

Details of torchvision.transforms.ToTensor()

What it does: converts a PIL.Image with values in [0, 255], or a numpy.ndarray of shape (H, W, C), into a torch.FloatTensor of shape (C, H, W) with values in [0, 1.0].

Note: pixel values are scaled to [0, 1.0] only when the numpy.ndarray has dtype=uint8.
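
A minimal sketch illustrating both points (the arrays are random examples):

import numpy as np
from torchvision import transforms

to_tensor = transforms.ToTensor()

img_uint8 = np.random.randint(0, 256, size=(4, 4, 3), dtype=np.uint8)
t1 = to_tensor(img_uint8)
print(t1.shape, t1.dtype, t1.max())  # torch.Size([3, 4, 4]) torch.float32, max <= 1.0

img_float = img_uint8.astype(np.float32)
t2 = to_tensor(img_float)
print(t2.max())  # values remain in [0, 255]: no scaling for non-uint8 dtypes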


Why does this page exist?

Since I started working, I rarely have time to maintain my GitHub projects, yet I can't bring myself to ignore everyone's issues. To resolve them faster, I've decided to post my WeChat here. Add me as a friend if any of the following applies:

  • You've hit a bug in one of the projects that you can't resolve
  • You think a project is flawed and have suggestions for improving it
  • You have questions from your research or work that you'd like to discuss with me
  • Or you simply have some wild ideas to share

In some video understanding tasks, such as classification and VQA, audio and text are usually brought in alongside the visual frames. To incorporate title information into a video task (whether or not it helps, it is worth trying), I decided to use BERT to extract title embeddings.

After looking at Google's official bert and huggingface's PyTorch implementation, transformers, the sheer number of BERT variants can be dazzling for an NLP beginner, so the goal of this post is simply to end up with an interface that extracts feature vectors from Chinese text.
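
As a minimal sketch of such an interface using huggingface transformers (assuming the pretrained bert-base-chinese checkpoint; the input sentence is just an example):

import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-chinese')
model = BertModel.from_pretrained('bert-base-chinese')
model.eval()

inputs = tokenizer('新垣结衣是日本女演员。', return_tensors='pt')
with torch.no_grad():
    outputs = model(**inputs)

# last_hidden_state: (1, seq_len, 768); older transformers versions return a tuple,
# in which case use outputs[0] instead
sentence_vec = outputs.last_hidden_state[:, 0]  # the [CLS] vector as a sentence feature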

When I approach a new task, I habitually consider three things: the data, the model, and the loss function. This post studies BERT along the same lines.

What is BERT's input?

This part follows the official tokenization.py to explain how the input is preprocessed; other frameworks differ in the details, but all derive from it. tokenization.py contains two tokenizers, BasicTokenizer and WordpieceTokenizer; the example below shows how they process text.

Given a piece of text: text = “新垣结衣(Aragaki Yui),1988年6月11日出生于日本冲绳县那霸市。日本女演员、歌手、模特\r\n” (a short Chinese biography of Aragaki Yui, ending with a carriage return and newline).

BasicTokenizer

BasicTokenizer is a first-pass tokenizer. For an input text, the pipeline is: convert to unicode -> remove invalid characters -> handle Chinese characters -> split on whitespace -> strip accents and split on punctuation -> split on whitespace again, as sketched below.
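
For reference, a sketch of BasicTokenizer.tokenize following the official tokenization.py (whitespace_tokenize and the underscore-prefixed helpers are defined in the same file):

def tokenize(self, text):
    """Runs basic tokenization on a piece of text."""
    text = convert_to_unicode(text)
    text = self._clean_text(text)
    # adds spaces around CJK characters so each is treated as its own token
    text = self._tokenize_chinese_chars(text)
    orig_tokens = whitespace_tokenize(text)
    split_tokens = []
    for token in orig_tokens:
        if self.do_lower_case:
            token = token.lower()
            token = self._run_strip_accents(token)
        split_tokens.extend(self._run_split_on_punc(token))
    # join and re-split so the punctuation pieces become separate tokens
    output_tokens = whitespace_tokenize(" ".join(split_tokens))
    return output_tokens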

Converting to unicode

The details are in the convert_to_unicode(text) function. Roughly: under Python 3, a str input is returned as-is and a bytes input is decoded to unicode; under Python 2, a str input is decoded to unicode, while a unicode input is returned as-is.
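
A sketch following the official implementation (six handles the Python 2/3 branching):

import six

def convert_to_unicode(text):
    """Converts `text` to unicode (if it's not already), assuming utf-8 input."""
    if six.PY3:
        if isinstance(text, str):
            return text
        elif isinstance(text, bytes):
            return text.decode("utf-8", "ignore")
        else:
            raise ValueError("Unsupported string type: %s" % (type(text)))
    elif six.PY2:
        if isinstance(text, str):
            return text.decode("utf-8", "ignore")
        elif isinstance(text, unicode):  # noqa: F821 (Python 2 only)
            return text
        else:
            raise ValueError("Unsupported string type: %s" % (type(text)))
    else:
        raise ValueError("Not running on Python2 or Python 3?")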

After this conversion, text is identical to the original:

>>> text=convert_to_unicode(text)
>>> text
'新垣结衣(Aragaki Yui),1988年6月11日出生于日本冲绳县那霸市。日本女演员、歌手、模特\r\n'

Removing invalid characters and whitespace

This step corresponds to the _clean_text(self, text) function, which removes invalid characters and redundant whitespace.

The ord() function used in the source takes a single character (a string of length 1) and returns its ASCII value or Unicode code point.

The overall flow is as follows (sketched in code below):

  • When the code point is 0, the character is dropped
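
For reference, a sketch of the cleaning logic following the official tokenization.py (_is_control and _is_whitespace are helpers defined in the same file):

def _clean_text(self, text):
    """Removes invalid characters and cleans up whitespace."""
    output = []
    for char in text:
        cp = ord(char)
        # drop NUL, the unicode replacement character (0xFFFD), and control characters
        if cp == 0 or cp == 0xFFFD or _is_control(char):
            continue
        if _is_whitespace(char):
            output.append(" ")  # map all whitespace (\t, \n, \r, ...) to a single space
        else:
            output.append(char)
    return "".join(output)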

Notes on reinforcement learning; the study material is Zhou Bolei's introRL course.

Foundation

  • Lecture 1: Overview (course overview and RL basics)
    • Keywords: Agent, Environment, Action, Reward
    • Supervised learning: annotated images; the data are assumed i.i.d. (independent and identically distributed)
    • Reinforcement learning: data are not i.i.d. but a correlated time series (samples are temporally ordered and correlated); there is no instant feedback or label for the correct action
    • Features of RL: trial-and-error exploration; delayed reward; time matters (sequential, non-i.i.d. data); the agent's actions change the environment.
    • Rewards: a scalar feedback signal; indicates how well the agent is doing at step t; RL is based on the maximization of rewards.
    • Sequential decision making: trade-off between immediate reward and long-term reward (see the discounted-return formula after this list).
    • Components of an RL agent: Policy, Value function, Model.
    • Agent types: Value-based agent, Policy-based agent, Actor-Critic agent.
    • Exploration and Exploitation.
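
To make "maximization of rewards" concrete, the quantity an RL agent maximizes is the expected discounted return; the standard definition (a textbook formula, not one quoted from the course slides above) is

$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}, \qquad \gamma \in [0, 1],$$

where the discount factor $\gamma$ sets the trade-off between immediate and long-term reward.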

Advanced Topics

Foreword

The idea of writing something has been around for a long time, but when I graduated last year I lost everything related to my blog, so it sat unmaintained for over a year. Now, exactly one year into my job and with a few things on my mind, I finally made up my mind to tinker with it again. The blog is more or less restored to what it was, and jotting something down here also serves as an account of the past year.


Parallel training (data parallelism and model parallelism) and distributed training are two common ways to accelerate deep learning training. Compared with single-process parallel training, distributed training is the better acceleration scheme, and it is the approach officially recommended by PyTorch:
Multi-Process Single-GPU
This is the highly recommended way to use DistributedDataParallel, with multiple processes, each of which operates on a single GPU. This is currently the fastest approach to do data parallel training using PyTorch and applies to both single-node(multi-GPU) and multi-node data parallel training. It is proven to be significantly faster than torch.nn.DataParallel for single-node multi-GPU data parallel training.
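
A minimal sketch of this multi-process single-GPU pattern (assuming a launch via torchrun --nproc_per_node=NUM_GPUS ddp_demo.py, or torch.distributed.launch on older PyTorch; the model and data are placeholders):

import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # one process per GPU; the launcher sets LOCAL_RANK and the rendezvous env vars
    dist.init_process_group(backend='nccl')
    local_rank = int(os.environ['LOCAL_RANK'])
    torch.cuda.set_device(local_rank)

    model = nn.Linear(10, 10).cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

    x = torch.randn(32, 10).cuda(local_rank)  # each process works on its own shard of data
    loss = model(x).sum()
    loss.backward()  # gradients are all-reduced across processes during backward
    optimizer.step()

    dist.destroy_process_group()

if __name__ == '__main__':
    main()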


When a machine has multiple GPUs, we usually adopt multi-GPU parallel training to speed things up; the common schemes are data parallelism and model parallelism. PyTorch provides DataParallel as its data-parallel interface, and this post gives a brief summary of how that parallel process works; a minimal usage sketch follows.
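
A minimal sketch of DataParallel usage (the model and data are placeholders):

import torch
import torch.nn as nn

model = nn.Linear(10, 10)
if torch.cuda.device_count() > 1:
    # replicates the model on each GPU and scatters the input batch along dim 0
    model = nn.DataParallel(model)
model = model.cuda()

x = torch.randn(32, 10).cuda()
y = model(x)  # forward runs on all GPUs; outputs are gathered on the default device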
