[20160423]Human Activity Recognition and Prediction (C-6)

Human Activity Recognition and Prediction By Yun Fu
Chapter 6 Activity Prediction

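Activity prediction vs. action recognition: in action prediction, the action video data arrive sequentially, so the label has to be inferred before the action has been fully observed.
The chapter proposes the multiple temporal scale support vector machine (MTSSVM) for action prediction. MTSSVM captures the dynamic, evolving information in an action and can predict the action from a partially observed video.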

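Related Work: In action recognition, the bag-of-words approach is commonly used, but it performs poorly when appearance and pose vary widely. Improvements include:
1. describing actions with semantic descriptions or attributes;
2. using key frames.
However, all of these methods are designed for fully observed videos, whereas this chapter addresses prediction, i.e. partially observed videos.
The temporal evolution of pose or appearance can be modeled with sequential state models (HMM-like methods), but these do not relate the evolution to the observation ratio; that is, they ignore how far the pose or appearance has evolved relative to how much of the action has been observed.
Earlier work proposed the integral bag-of-words (IBoW) and dynamic bag-of-words (DBoW) methods for action prediction. They model each action by the mean of the features from the same class at the same stage; to handle intra-class (within-class) appearance variation, sparse coding has also been used to represent the action model.
The model in this chapter accounts for the accumulation of information as new observations arrive, for label consistency, and for action dynamics at multiple temporal scales.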

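Method:
Data organization: a video of length T (in frames) is divided uniformly into K segments, each of length T/K. After t frames have been observed, the progress level is $k = \frac{t}{T/K}$, i.e. the number of segments observed so far, and the observation ratio is $\frac tT = \frac kK$.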
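As a rough illustration of this bookkeeping, the sketch below splits T frames into K equal segments and reports the progress level and observation ratio after t observed frames. The function names `segment_bounds` and `progress` are hypothetical. Two things are assumptions of this sketch rather than statements from the chapter: the progress level is counted as the number of segments observed so far (matching the index k in $x_{(1,k)}$), and a partly observed segment is rounded up to the next level.

```python
import math

def segment_bounds(T, K):
    """Return K (start, end) frame ranges, end exclusive, covering T frames."""
    if T % K != 0:
        raise ValueError("this sketch assumes T is a multiple of K")
    seg_len = T // K
    return [(i * seg_len, (i + 1) * seg_len) for i in range(K)]

def progress(t, T, K):
    """Progress level k and observation ratio t/T after t observed frames."""
    seg_len = T // K
    k = math.ceil(t / seg_len)  # assumption: a partly seen segment counts
    return k, t / T

bounds = segment_bounds(T=100, K=10)
k, ratio = progress(t=30, T=100, K=10)
print(bounds[:3], k, ratio)
```

With T = 100 and K = 10, observing 30 frames corresponds to three complete segments, so the partial video would be written $x_{(1,3)}$.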
Action Representations: interest points and trajectories are extracted and clustered into visual words, and the entire partial video is then represented by a histogram of visual words.
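The histogram-of-visual-words step can be sketched as follows. This is a minimal illustration, not the chapter's pipeline: the codebook is assumed to have been obtained beforehand by clustering (for example with k-means), the descriptors are made-up 2-D toy values, and the L1 normalization and the helper names `nearest_word` and `bow_histogram` are choices of this sketch.

```python
def nearest_word(descriptor, codebook):
    """Index of the codebook centre closest to descriptor (squared L2)."""
    dists = [sum((d - c) ** 2 for d, c in zip(descriptor, centre))
             for centre in codebook]
    return dists.index(min(dists))

def bow_histogram(descriptors, codebook):
    """L1-normalised histogram of visual-word counts."""
    counts = [0] * len(codebook)
    for desc in descriptors:
        counts[nearest_word(desc, codebook)] += 1
    total = sum(counts)
    if total == 0:
        return [0.0] * len(codebook)
    return [c / total for c in counts]

# Toy 2-D descriptors and a 3-word codebook (made-up numbers).
codebook = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
descriptors = [(0.1, -0.1), (0.9, 1.2), (1.1, 0.8), (4.8, 5.1)]
hist = bow_histogram(descriptors, codebook)
print(hist)
```

For a partial video, only the descriptors extracted from the frames observed so far would be passed in, so the histogram changes as more of the action is seen.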
Let $D = \{(x_i,y_i)\}_{i=1}^N$ be the training data,
where $x_i$ is the i-th fully observed action video and $y_i$ is its action label.
Action prediction can be cast as learning a function $f:\mathcal{X} \rightarrow \mathcal{Y}$ that maps a partially observed video $x_{(1,k)}\in\mathcal{X}$ to an action label $y\in\mathcal{Y}$, with $k\in\{1,\dots,K\}$.
Rather than searching for $f$ directly, the chapter learns a discriminant function $F:\mathcal{X}\times\mathcal{Y} \rightarrow \mathbb{R}$ that scores each pair $(x,y)$.
$F$ is defined as a linear function $F(x_{(1,k)},y;\mathbf{w}) = \langle\mathbf{w},\Phi(x_{(1,k)},y)\rangle$, where $\Phi(x_{(1,k)},y)$ is the spatio-temporal feature of the partial video $x_{(1,k)}$ paired with label $y$.
The predicted action label is then: $y^\ast = \arg\max_{y\in\mathcal{Y}}F(x_{(1,k)},y;\mathbf{w}^\ast) = \arg\max_{y\in\mathcal{Y}}\langle\mathbf{w}^\ast,\Phi(x_{(1,k)},y)\rangle$
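The chapter decomposes $\mathbf{w}^T\Phi(x_{(1,k)},y)$ into a global progress model (GPM) and a local progress model (LPM):
$\mathbf{w}^T\Phi(x_{(1,k)},y) = \boldsymbol{\alpha}_k^T\psi_1(x_{(1,k)},y) + \sum_{l=1}^K[\mathbf{1}(l\leq k)\cdot\boldsymbol{\beta}_l^T\psi_2(x_{(1,k)},y)]$
Global progress model (GPM):
$\boldsymbol{\alpha}_k^T\psi_1(x_{(1,k)},y) = \sum_{a\in\mathcal{Y}}\boldsymbol{\alpha}_k^T\mathbf{1}(y=a)g(x_{(1,k)},1:k)$
Here $g(x_{(1,k)},1:k)$ is the feature extracted from segments 1 through k, and $\boldsymbol{\alpha}_k$ holds the weights for each combination of progress level and action label.
Local progress model (LPM):
$\boldsymbol{\beta}_l^T\psi_2(x_{(1,k)},y) = \sum_{a\in\mathcal{Y}}\boldsymbol{\beta}_l^T\mathbf{1}(y=a)g(x_{(1,k)},l)$
Here $g(x_{(1,k)},l)$ is the feature extracted from the l-th segment alone. Because each observed segment is scored separately and in order, the LPM takes the sequential nature of the video into account.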
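To make the GPM + LPM decomposition concrete, here is a small sketch of how such a score could be evaluated and used for prediction. It is an illustration under stated assumptions, not the author's implementation: the global feature $g(x_{(1,k)},1:k)$ is approximated by the element-wise sum of the observed segment features, the weights are stored per progress level and per label in plain dictionaries, and all numbers and the names `score` and `predict` are made up.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def global_feature(segments):
    """Stand-in for g(x, 1:k): element-wise sum over the observed segments."""
    return [sum(column) for column in zip(*segments)]

def score(segments, y, alpha, beta):
    """GPM + LPM score of label y; k is the number of observed segments."""
    k = len(segments)
    # Indexing the weights by y plays the role of the indicator 1(y = a).
    gpm = dot(alpha[k - 1][y], global_feature(segments))
    # Only segments l <= k contribute, mirroring the indicator 1(l <= k).
    lpm = sum(dot(beta[l][y], segments[l]) for l in range(k))
    return gpm + lpm

def predict(segments, labels, alpha, beta):
    """Return the label with the highest score."""
    return max(labels, key=lambda y: score(segments, y, alpha, beta))

# Toy setup: K = 2 segments, 2-D features, made-up weights.
labels = ["wave", "punch"]
alpha = [{"wave": [1.0, 0.0], "punch": [0.0, 1.0]},
         {"wave": [1.0, 0.0], "punch": [0.0, 1.0]}]
beta = [{"wave": [0.5, 0.0], "punch": [0.0, 0.5]},
        {"wave": [0.0, 0.0], "punch": [0.0, 2.0]}]
video = [[0.6, 0.4], [0.1, 0.9]]

early = predict(video[:1], labels, alpha, beta)  # after one segment
late = predict(video, labels, alpha, beta)       # after both segments
print(early, late)
```

In this toy case the prediction changes once the second segment arrives, which is the kind of segment-level evidence the LPM term is meant to pick up.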

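Structured Learning Formulation:
The chapter formulates MTSSVM in a way similar to the structured SVM (SSVM). The constraints express that:
1. the score should increase monotonically as more segments are observed;
2. label consistency is encouraged, while outlier segments are tolerated. See the chapter for details.
The author also shows that the terms arising from these constraints form an upper bound on the empirical risk, so the formulation minimizes an upper bound on that risk.
The resulting problem is solved iteratively and approximately with the regularized bundle algorithm.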
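For orientation, the generic structured SVM that MTSSVM resembles can be written as below. This is the textbook margin-rescaling form, not the MTSSVM objective itself; the loss $\Delta$, the constant $C$ and the slack variables $\xi_i$ are standard notation assumed here and do not appear in these notes.

$$\min_{\mathbf{w},\,\xi\geq 0}\ \frac12\|\mathbf{w}\|^2 + \frac{C}{N}\sum_{i=1}^N \xi_i \quad \text{s.t.}\quad \mathbf{w}^T\Phi(x_i,y_i) - \mathbf{w}^T\Phi(x_i,y) \geq \Delta(y_i,y) - \xi_i \quad \forall i,\ \forall y\in\mathcal{Y}$$

Taking $y$ to be the highest-scoring label $\hat y_i = \arg\max_{y\in\mathcal{Y}} \mathbf{w}^T\Phi(x_i,y)$ gives $\xi_i \geq \Delta(y_i,\hat y_i) + \mathbf{w}^T\Phi(x_i,\hat y_i) - \mathbf{w}^T\Phi(x_i,y_i) \geq \Delta(y_i,\hat y_i)$, so $\frac1N\sum_i\xi_i$ upper-bounds the empirical risk $\frac1N\sum_i\Delta(y_i,\hat y_i)$. This is the usual sense in which formulations of this kind bound the empirical risk.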

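Finally, in the experiments the author evaluates prediction on three datasets: UTI-Set1, UTI-Set2 and BIT.
The experiment on the relative importance of GPM and LPM finds that LPM helps capture discriminative information from individual video segments, while GPM captures history information that complements LPM and improves performance.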

Reposts, comments and suggestions are welcome; please credit the source when reposting. http://yongyitang92.github.io