
M Sai Raghu Ram, SRM Institute of Science & Technology, Kattankulathur, Chennai
M Raghavendra, SRM Institute of Science & Technology, Kattankulathur, Chennai
S Iniyan, Asst. Professor, SRM Institute of Science & Technology




With the rapid advance of the Internet and smartphones, action recognition in personal videos produced by users has become an important research topic due to its wide applications, such as automatic video tracking [1], [2] and video annotation [3]. Consumer videos on the Web are uploaded by users and produced by hand-held cameras or smartphones, and may contain considerable camera shake, occlusion, and cluttered background. Thus, these videos contain large intra-class variations within the same semantic category, which makes recognizing human actions in them a challenging task.

Many action recognition methods follow the conventional framework. First, a large number of local motion features (e.g., space-time interest points (STIP) [4], [5] and motion scale-invariant feature transform (MoSIFT) [6]) are extracted from videos. Then, all local features are quantized into a histogram vector using the bag-of-words (BoW) representation [7], [8]. Finally, vector-based classifiers (e.g., the support vector machine [9]) are used to perform recognition on testing videos. When the videos are simple, these methods achieve promising results. However, noise and uncorrelated information may be incorporated into the BoW during the extraction and quantization of the local features [10]. Therefore, these methods are usually not robust and do not generalize well when the videos contain considerable camera shake, occlusion, cluttered background, and so on.

To improve recognition accuracy, meaningful components of actions, such as related objects, human appearance, and posture, should be used to form a clearer semantic interpretation of human actions. Recent efforts [11], [12] have demonstrated the effectiveness of leveraging related objects or human poses. However, these methods may require training on a large number of videos to obtain good performance, especially for real-world videos, and it is quite challenging to collect enough labeled videos that cover a diverse range of action poses.
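The quantization step of the conventional framework described above can be illustrated with a minimal sketch. It assumes a codebook of visual words is already available (in practice it is usually learned by clustering training descriptors); the function names, the toy two-dimensional descriptors, and the nearest-neighbor assignment rule are illustrative assumptions, not details taken from the cited methods.

```python
import math

def nearest_word(descriptor, codebook):
    """Return the index of the codebook entry closest to the descriptor."""
    best_index, best_dist = 0, math.inf
    for index, word in enumerate(codebook):
        # Squared Euclidean distance between the descriptor and this word.
        dist = sum((d - w) ** 2 for d, w in zip(descriptor, word))
        if dist < best_dist:
            best_index, best_dist = index, dist
    return best_index

def bow_histogram(descriptors, codebook):
    """Quantize local descriptors into an L1-normalized word histogram."""
    counts = [0] * len(codebook)
    for descriptor in descriptors:
        counts[nearest_word(descriptor, codebook)] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * len(codebook)

# Toy data (hypothetical): three visual words and four local descriptors.
codebook = [(0.0, 0.0), (1.0, 1.0), (4.0, 4.0)]
descriptors = [(0.1, -0.1), (0.9, 1.2), (1.1, 0.8), (3.9, 4.2)]
histogram = bow_histogram(descriptors, codebook)
print(histogram)  # [0.25, 0.5, 0.25]
```

The resulting fixed-length vector is what a vector-based classifier such as an SVM would consume. Because every descriptor is forced into some word, descriptors caused by camera shake or background clutter are counted as well, which is one way noise enters the representation.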




Most knowledge adaptation algorithms require sufficient labeled data in the target domain. In real-world applications, however, most videos are unlabeled or weakly labeled, and collecting well-labeled videos is time consuming and labor intensive. For example, 111 researchers from 23 institutes spent more than 220 h to collect only 63 h of the TRECVID 2003 development corpus [18]. Previous studies [19], [20] have shown that simultaneously utilizing labeled and unlabeled data is beneficial for video action recognition. To enhance the performance of action recognition, we explore how to use semi-supervised learning to leverage unlabeled data and thus learn a more accurate classifier.


MoSIFT: Recognizing Human Actions in Surveillance Videos, CMU-CS-09-161, Ming-yu Chen and Alex Hauptmann:

The goal of this paper is to build robust human action recognition for real-world surveillance videos. Local spatio-temporal features around interest points provide compact but descriptive representations for video analysis and motion recognition. Older approaches tend to extend spatial descriptors by adding a temporal component to the appearance descriptor, which captures motion information only implicitly. In our approach, we propose the MoSIFT algorithm, which detects interest points and explicitly models local motion in addition to encoding their local appearance. The main idea is to detect distinctive local features through both local appearance and motion. We construct MoSIFT feature descriptors in the spirit of the well-known SIFT descriptors to be robust to small deformations through grid aggregation. To capture more of the global structure of actions, we introduce a bigram model to construct correlations between local features. The method advances the result on the KTH dataset to an accuracy of 95.8%. We also applied our approach to nearly 100 hours of surveillance data as part of the TRECVID Event Detection task, with very promising results on recognizing human actions in real-world surveillance videos.


Observing Human-Object Interactions: Using Spatial and Functional Compatibility for Recognition, Abhinav Gupta, Member, IEEE, Aniruddha Kembhavi, Member, IEEE, and Larry S. Davis, Fellow, IEEE:

      Interpretation of images and videos
containing humans interacting with different objects is a daunting task. It
involves understanding scene/event, analyzing human movements, recognizing
manipulable objects, and observing the effect of the human movement on those
objects. While each of these perceptual tasks can be conducted independently,
recognition rate improves when interactions between them are considered.
Motivated by psychological studies of human perception, we present a Bayesian
approach which integrates various perceptual tasks involved in understanding
human-object interactions. Previous approaches to object and action recognition
rely on static shape/appearance feature matching and motion analysis,
respectively. Our approach goes beyond these traditional approaches and applies
spatial and functional constraints on each of the perceptual elements for
coherent semantic interpretation. Such constraints allow us to recognize
objects and actions when the appearances are not discriminative enough. We also
demonstrate the use of such constraints in recognition of actions from static
images without using any motion information.

Visual Event Recognition in Videos by Learning from Web Data, Lixin Duan, Dong Xu, and Ivor W. Tsang, School of Computer Engineering, Nanyang Technological University ({S080003, DongXu, IvorTsang}@ntu.edu.sg), and Jiebo Luo, Kodak Research Labs, Eastman Kodak Company, Rochester, NY, USA:

      We propose a visual event recognition
framework for consumer domain videos by leveraging a large amount of loosely
labeled web videos (e.g., from YouTube). First, we propose a new aligned
space-time pyramid matching method to measure the distances between two video
clips, where each video clip is divided into space-time volumes over multiple
levels. We calculate the pair-wise distances between any two volumes and
further integrate the information from different volumes with Integer-flow
Earth Mover’s Distance (EMD) to explicitly align the volumes. Second, we
propose a new cross-domain learning method in order to 1) fuse the information
from multiple pyramid levels and features (i.e., space-time feature and static
SIFT feature) and 2) cope with the considerable variation in feature
distributions between videos from two domains (i.e., web domain and consumer
domain). For each pyramid level and each type of local features, we train a set
of SVM classifiers based on the combined training set from two domains using
multiple base kernels of different kernel types and parameters, which are fused
with equal weights to obtain an average classifier. Finally, we propose a
cross-domain learning method, referred to as Adaptive Multiple Kernel Learning
(A-MKL), to learn an adapted classifier based on multiple base kernels and the
prelearned average classifiers by minimizing both the structural risk
functional and the mismatch between data distributions from two domains.
Extensive experiments demonstrate the effectiveness of our proposed framework
that requires only a small number of labeled consumer videos by leveraging web

Action Recognition Using Nonnegative
Action Component Representation and Sparse Basis Selection Haoran Wang,
Chunfeng Yuan, Weiming Hu, Haibin Ling, Wankou Yang, and Changyin Sun.

      In this paper, we propose using high-level
action units to represent human actions in videos and, based on such units, a
novel sparse model is developed for human action recognition. There are three
interconnected components in our approach. First, we propose a new
context-aware spatial-temporal descriptor, named locally weighted word context,
to improve the discriminability of the traditionally used local
spatial-temporal descriptors. Second, from the statistics of the context-aware
descriptors, we learn action units using the graph regularized nonnegative
matrix factorization, which leads to a part-based representation and encodes
the geometrical information. These units effectively bridge the semantic gap in
action recognition. Third, we propose a sparse model based on a joint ℓ2,1-norm
to preserve the representative items and suppress noise in the action units.
Intuitively, when learning the dictionary for action representation, the sparse
model captures the fact that actions from the same class share similar units.
The proposed approach is evaluated on several publicly available data sets. The
experimental results and analysis clearly demonstrate the effectiveness of the
proposed approach.
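To make the role of the joint ℓ2,1-norm concrete, the sketch below computes it under the common convention that it is the sum of the Euclidean norms of the rows of a matrix. Which axis plays the role of "rows" (units or samples) is an assumption here and may differ from the convention used in the paper; the two toy matrices are hypothetical.

```python
import math

def l21_norm(matrix):
    """Sum of the Euclidean (l2) norms of the rows of a matrix."""
    return sum(math.sqrt(sum(v * v for v in row)) for row in matrix)

# Two toy coefficient matrices with the same total energy (Frobenius norm 5).
row_sparse = [[3.0, 4.0],
              [0.0, 0.0]]   # all weight concentrated in one row
spread = [[3.0, 0.0],
          [0.0, 4.0]]       # weight spread over both rows

print(l21_norm(row_sparse))  # 5.0
print(l21_norm(spread))      # 7.0
```

Minimizing this quantity therefore favors solutions in which entire rows are zero, which matches the intuition stated above: samples from the same class end up sharing a small set of units while the remaining, noisier units are suppressed.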

Recognizing human actions in still
images: a study of bag-of-features and part-based representations.

      Recognition of human actions is usually
addressed in the scope of video interpretation. Meanwhile, common human actions
such as “reading a book”, “playing a guitar” or “writing notes” also provide a
natural description for many still images. In addition, some actions in video
such as “taking a photograph” are static by their nature and may require
recognition methods based on static cues only. Motivated by the potential
impact of recognizing actions in still images and the little attention this
problem has received in computer vision so far, we address recognition of human
actions in consumer photographs. We construct a new dataset with seven classes
of actions in 968 Flickr images representing natural variations of human
actions in terms of camera view-point, human pose, clothing, occlusions and
scene background. We study action recognition in still images using the
state-of-the-art bag-of-features methods as well as their combination with the
part-based Latent SVM approach of Felzenszwalb et al. [6]. In particular, we
investigate the role of background scene context and demonstrate that improved
action recognition performance can be achieved by (i) combining the statistical
and part-based representations, and (ii) integrating person-centric description
with the background scene context. We show results on our newly collected
dataset of seven common actions as well as demonstrate improved performance
over existing methods on the datasets of Gupta et al. [8] and Yao and Fei-Fei


1. Image Feature Extraction
2. Video Feature Extraction
3. PCA
4. Classifier

Image Feature Extraction:

In our method, we extract the image (static) feature from both images and key frames of videos. For computational efficiency, we extract key frames with a shot boundary detection algorithm [36]. An example of key frame extraction is shown in Fig. 3. The main steps are as follows. First, the color histogram of every fifth frame is calculated. Second, this histogram is subtracted from that of the previous frame. Third, the frame is marked as a shot boundary if the difference is larger than an empirically set threshold. Once a shot has been found, the frame in the middle of the shot is taken as the key frame.
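The key frame extraction steps above can be sketched as follows. This is an illustration under stated assumptions, not the algorithm of [36]: a frame is represented as a non-empty list of (r, g, b) pixel tuples, "previous frame" is read as the previously sampled frame, the difference is measured as an L1 distance between normalized histograms, and the bin count and threshold values are arbitrary placeholders.

```python
def color_histogram(frame, bins=4):
    """Normalized joint RGB histogram of a frame given as (r, g, b) tuples."""
    hist = [0.0] * (bins ** 3)
    step = 256 // bins
    for r, g, b in frame:
        hist[(r // step) * bins * bins + (g // step) * bins + (b // step)] += 1
    return [h / len(frame) for h in hist]

def histogram_difference(h1, h2):
    """L1 distance between two histograms (0 = identical, 2 = disjoint)."""
    return sum(abs(a - b) for a, b in zip(h1, h2))

def key_frames(frames, step=5, threshold=0.5):
    """Return the index of the middle frame of each detected shot."""
    if not frames:
        return []
    boundaries = [0]
    previous = color_histogram(frames[0])
    # Only every `step`-th frame is examined, to keep the cost low.
    for index in range(step, len(frames), step):
        current = color_histogram(frames[index])
        if histogram_difference(current, previous) > threshold:
            boundaries.append(index)  # a new shot starts around this frame
        previous = current
    boundaries.append(len(frames))
    return [(start + end - 1) // 2
            for start, end in zip(boundaries, boundaries[1:])]

# Toy video (hypothetical): 10 all-red frames followed by 10 all-blue frames.
frames = ([[(255, 0, 0)] * 4 for _ in range(10)]
          + [[(0, 0, 255)] * 4 for _ in range(10)])
print(key_frames(frames))  # [4, 14]
```

Because only every fifth frame is inspected, a boundary is located with a precision of `step` frames; that is the trade-off made for computational efficiency.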

Video Feature Extraction:

The video (motion) feature is extracted from the video domain and is combined with the image feature. Therefore, the image feature is a subset of the combined feature.


PCA:

Principal component analysis (PCA) is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components (or, sometimes, principal modes of variation). The number of principal components is less than or equal to the smaller of the number of original variables and the number of observations. The transformation is defined so that the first principal component has the largest possible variance (that is, it accounts for as much of the variability in the data as possible), and each succeeding component in turn has the highest variance possible under the constraint that it is orthogonal to the preceding components. The resulting vectors form an uncorrelated orthogonal basis set. PCA is sensitive to the relative scaling of the original variables.

In the proposed framework, we use the kernel principal component analysis (KPCA) method [37] to map the combined features and the image features. KPCA can explore the principal knowledge of the mapped Hilbert spaces, so the training process is more efficient, which makes image-to-video adaptation (IVA) more suitable for real-world applications. We obtain the common feature A by mapping the image features into one Hilbert space. To obtain the heterogeneous feature AB, the combined features are mapped into another Hilbert space. The knowledge can be adapted based on this shared space of the common features and then used to optimize the classifier A.
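The defining property of the first principal component (largest possible variance) can be checked with a small sketch. Note that this shows plain linear PCA, not the KPCA mapping of [37]; the power-iteration solver, the function name, and the toy data are assumptions chosen to keep the example dependency-free.

```python
import math
import random

def first_principal_component(data, iterations=200, seed=0):
    """Estimate the leading principal direction and the variance along it."""
    n, dim = len(data), len(data[0])
    means = [sum(row[j] for row in data) / n for j in range(dim)]
    centered = [[row[j] - means[j] for j in range(dim)] for row in data]
    # Sample covariance matrix (dim x dim).
    cov = [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(dim)]
           for i in range(dim)]
    # Power iteration from a random start converges to the top eigenvector.
    rng = random.Random(seed)
    vector = [rng.uniform(-1.0, 1.0) for _ in range(dim)]
    for _ in range(iterations):
        product = [sum(cov[i][j] * vector[j] for j in range(dim))
                   for i in range(dim)]
        norm = math.sqrt(sum(p * p for p in product))
        if norm == 0.0:
            break
        vector = [p / norm for p in product]
    # Variance of the data projected onto the estimated direction.
    variance = sum(vector[i] * cov[i][j] * vector[j]
                   for i in range(dim) for j in range(dim))
    return vector, variance

# Toy data (hypothetical): two strongly correlated variables.
data = [(-2.0, -1.9), (-1.0, -1.1), (0.0, 0.1), (1.0, 0.9), (2.0, 2.0)]
direction, variance = first_principal_component(data)
print(direction, variance)
```

For this data the total variance of the two variables is 2.5 + 2.41 = 4.91, and nearly all of it lies along the single direction found, which is why a few components can stand in for many correlated features. KPCA applies the same idea after an implicit nonlinear mapping defined by a kernel.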


Classifier:

To make use of unlabeled videos, a semi-supervised classifier AB is trained on the heterogeneous features in the video domain. We integrate the two classifiers, A and AB, into a joint optimization framework. The final recognition results on testing videos are improved by fusing the results of these two classifiers.
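The text does not specify how the two classifiers' results are fused. As a purely illustrative sketch, the example below assumes late fusion by a weighted average of per-class scores; the function names, the `weight` parameter, and the score values are all hypothetical and are not taken from the proposed framework.

```python
def fuse_scores(scores_a, scores_ab, weight=0.5):
    """Weighted average of per-class scores from classifiers A and AB."""
    if scores_a.keys() != scores_ab.keys():
        raise ValueError("both classifiers must score the same classes")
    return {label: weight * scores_a[label] + (1.0 - weight) * scores_ab[label]
            for label in scores_a}

def predict(scores):
    """Return the class label with the highest score."""
    return max(scores, key=scores.get)

# Hypothetical scores for one test video.
scores_a = {"run": 0.6, "jump": 0.4}    # classifier A (common features)
scores_ab = {"run": 0.3, "jump": 0.7}   # classifier AB (heterogeneous features)

fused = fuse_scores(scores_a, scores_ab)
print(predict(scores_a), predict(fused))  # run jump
```

In this toy case classifier A alone would answer "run", while the fused scores favor "jump", showing how motion-aware evidence from classifier AB can change the final decision.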


- MATLAB 7.14 (R2012a)

The MATLAB high-performance language for technical computing integrates computation, visualization, and programming in an easy-to-use environment where problems and solutions are expressed in familiar mathematical notation.

- Data exploration and acquisition
- Engineering and scientific drawing
- Analysis of algorithm design and development
- Mathematical and computational functions
- Problem simulation and prototyping
- Application development using the GUI-building environment

Using MATLAB, you can solve technical computing problems faster than with traditional programming languages, such as C, C++, and FORTRAN.




To achieve good performance in video action recognition, we propose an IVA classifier, which can borrow the knowledge adapted from images based on the common visual features. Meanwhile, it can fully utilize the heterogeneous features of unlabeled videos to enhance the performance of action recognition in videos. In our experiments, we validate that the knowledge learned from images can influence the recognition accuracy of videos and that different recognition results are obtained by using different visual cues. Experimental results show that the proposed IVA performs better in video action recognition than the state-of-the-art methods. Moreover, the performance of IVA is promising when only a few labeled training videos are available.


[1] B. Ma, L. Huang, J. Shen, and L. Shao, “Discriminative tracking using tensor pooling,” IEEE Trans. Cybern., to be published, doi: 10.1109/TCYB.2015.2477879.

[2] L. Liu, L. Shao, X. Li, and K. Lu, “Learning spatio-temporal representations for action recognition: A genetic programming approach,” IEEE Trans. Cybern., vol. 46, no. 1, pp. 158–170, Jan. 2016.

[3] A. Khan, D. Windridge, and J. Kittler, “Multilevel Chinese takeaway process and label-based processes for rule induction in the context of automated sports video annotation,” IEEE Trans. Cybern., vol. 44, no. 10, pp. 1910–1923, Oct. 2014.

[4] H. Wang, M. M. Ullah, A. Klaser, I. Laptev, and C. Schmid, “Evaluation of local spatio-temporal features for action recognition,” in Proc. Brit. Mach. Vis. Conf., London, U.K., 2009, pp. 124.1–124.11.

[5] L. Shao, X. Zhen, D. Tao, and X. Li, “Spatio-temporal Laplacian pyramid coding for action recognition,” IEEE Trans. Cybern., vol. 44, no. 6, pp. 817–827, Jun. 2014.

[6] M.-Y. Chen and A. Hauptmann, “MoSIFT: Recognizing human actions in surveillance videos,” School Comput. Sci., Carnegie Mellon Univ., Pittsburgh, PA, USA, Tech. Rep. CMU-CS-09-161, 2009.

[7] M. Yu, L. Liu, and L. Shao, “Structure-preserving binary representations for RGB-D action recognition,” IEEE Trans. Pattern Anal. Mach. Intell., to be published, doi:

[8] L. Shao, L. Liu, and M. Yu, “Kernelized multiview projection for robust action recognition,” Int. J. Comput. Vis., 2015, doi: 10.1007/s11263-015-0861-6.

[9] C.-C. Chang and C.-J. Lin, “LIBSVM: A library for support vector machines,” ACM Trans. Intell. Syst. Technol., vol. 2, no. 3, pp. 1–27, Apr. 2011.

[10] Y. Han et al., “Semisupervised feature selection via spline regression for video semantic recognition,” IEEE Trans. Neural Netw. Learn. Syst., vol. 26, no. 2, pp. 252–264, Feb. 2015.

