A PhD dissertation entitled
Learning Semantic Features for Visual Recognition
by
JINGEN LIU
M.S., University of Central Florida, 2008
M.E., Huazhong University of Science and Technology, 2003
B.S., Huazhong University of Science and Technology, 2000
A dissertation submitted in partial fulfillment of the requirements
for the degree of Doctor of Philosophy
in the School of Electrical Engineering and Computer Science
in the College of Engineering and Computer Science
at the University of Central Florida
Orlando, Florida
TABLE OF CONTENTS
LIST OF FIGURES xvii
LIST OF TABLES 1
CHAPTER 1: INTRODUCTION 2
1.1 Motivations 6
1.2 Proposed Work and Contributions 13
1.2.1 Action recognition via maximization of mutual information 14
1.2.2 Scene recognition using MMI co-clustering 15
1.2.3 Visual recognition using multiple features 17
1.2.4 Learning semantic visual vocabularies using diffusion distance 18
1.3 Organization of the Thesis 19
CHAPTER 2: LITERATURE REVIEW 21
2.1 Object and Scene Recognition 21
2.1.1 Geometry-Based Models 22
2.1.2 Appearance-based Models 23
2.2 Action Recognition 31
2.2.1 Holistic Approaches 32
2.2.2 Part-based Approaches 34
2.3 Semantic Visual Vocabulary 37
CHAPTER 3: LEARNING OPTIMAL NUMBER OF VISUAL WORDS
FOR ACTION RECOGNITION 40
3.1 Introduction 40
3.2 Bag of Video-words Model 43
3.2.1 Feature Detection and Representation 43
3.2.2 Action Descriptor 45
3.3 Clustering of Video-words by MMI 46
3.3.0.1 Mutual Information 46
3.3.0.2 MMI Clustering Algorithm 47
3.4 Spatiotemporal Structural Information 49
3.5 Experiments and Discussion 51
3.5.1 Experiments on KTH data set 53
3.5.1.1 Action recognition using orderless features 53
3.5.1.2 Classification using spatiotemporal structural information 56
3.5.2 IXMAS Multiview dataset 59
3.6 Conclusion 62
CHAPTER 4: LEARNING SEMANTIC VISUAL-WORDS BY CO-CLUSTERING
FOR SCENE CLASSIFICATION 63
4.1 Introduction 63
4.2 Co-clustering by Maximization of Mutual Information 65
4.2.1 Co-clustering Algorithm 67
4.3 Spatial Correlogram Kernel Matching 69
4.3.1 Spatial Correlogram 70
4.3.2 Spatial Correlogram Kernel 71
4.4 Experiments 72
4.4.1 Classification of Fifteen Scene Categories 73
4.4.1.1 Classification using orderless features 74
4.4.1.2 Classification using intermediate concepts and their spatial
information 78
4.4.2 Classification of LSCOM Dataset 81
4.5 Conclusion 84
CHAPTER 5: VISUAL RECOGNITION USING MULTIPLE HETEROGENEOUS FEATURES 85
5.1 Introduction 85
5.2 Fiedler Embedding 87
5.2.1 Fiedler Embedding and LSA 90
5.3 Object Classification Framework 91
5.3.1 Constructing Feature Laplacian Matrix 95
5.3.2 Embedding 97
5.4 Action Recognition Framework 97
5.4.1 Feature Extraction and Representation 98
5.4.2 Construction of the Laplacian Matrix 101
5.5 Experiments and Discussion 104
5.5.1 Synthetic Data Set 104
5.5.2 Caltech Data Set: Object Recognition 106
5.5.2.1 Qualitative Results 108
5.5.2.2 Quantitative Results 110
5.5.3 Weizmann Data Set: Action Recognition 115
5.6 Conclusion 119
CHAPTER 6: LEARNING SEMANTIC VOCABULARIES USING DIFFUSION DISTANCE 121
6.1 Introduction 121
6.2 Diffusion Maps 123
6.2.1 Diffusion Distances in a Graph 123
6.2.2 Diffusion Maps Embedding 125
6.2.3 Robustness to Noise 127
6.2.4 Feature Extraction 129
6.3 Experiments and Discussion 130
6.3.1 Experiments on KTH data set 134
6.3.2 Experiments on YouTube data set 137
6.3.3 Experiments on Scene data set 139
6.4 Conclusion 139
CHAPTER 7: CONCLUSION AND FUTURE WORK 140
7.1 Summary of Contributions 140
7.2 Future Work 142
7.2.1 Refine the output of information bottleneck 142
7.2.2 Semi-supervised method 142
7.2.3 Multi-scale matching 143
7.2.4 Efficient shape model 143
REFERENCES 157
LIST OF FIGURES
1.1 Example object images selected from Caltech-6 dataset showing the variation
in scales, viewpoints, and illumination changes. Each row corresponds to one
category 3
1.2 Example scene images selected from the fifteen-scene data set. The number
of images contained in each category is shown under the images 4
1.3 Example actions selected from the KTH dataset. Each column shows two
action examples from one category. It has 6 categories with about 600 action
videos in total 5
1.4 Example actions from the Weizmann action data set. It contains 9 actions
with about 81 action videos in total 5
1.5 Four views of five selected action examples from IXMAS dataset. It has 13
action categories with 5 camera views and about 2,000 video sequences in
total 6
1.6 Example actions from the UCF YouTube action data set. It contains 11 action
categories (here, eight categories are listed). Each category contains more
than 100 video clips 7
1.7 The Maximum Response filters (this figure is taken from [87]). They include
two anisotropic filters with 3 scales at 6 orientations capturing the edges (the
dark images of the first three rows) and bars (the light images of the first
three rows), and 2 rotationally symmetric filters (a Gaussian and a Laplacian
of Gaussian) 9
1.8 Two visual words demonstrate the polysemy and synonym problems in visual
vocabulary learning 12
1.9 Representation of an image in terms of multiple features. (a) The original
image. (b) Interest points (SIFT) representing local features. (c) Contours
representing shape features. (d) Segments representing region features 17
2.1 The demonstration of the SIFT descriptor (this figure is taken from [79]). The
left panel shows the gradients of an image patch that is divided into 2×2
subregions. The overlaid circle is the Gaussian window weighting the gradients. These gradients are accumulated into orientation histograms, as shown
on the right panel. The length of the arrow represents the sum of the gradient
magnitudes in the corresponding direction bin 28
2.2 Two images having similar color histograms (the images are originally from [51]). 29
2.3 Examples showing arbitrary-view action recognition (this figure is from [30]). The
fourth and third rows are the observed image sequences and their corresponding silhouettes. The second and first rows are the matched silhouettes and
their corresponding 3-D exemplars 33
2.4 (A) Motion energy images (MEI) and motion history image (MHI) (this figure
is taken from [3]); (B) Space-time interest points are detected by a 3-D Harris corner detector (this figure is taken from [46]); (C) Space-time interest points
are detected by a 1D Gabor detector in the time direction (this figure is taken
from [93]) 34
3.1 Illustration of the procedure of representing an action as a bag of video-words
(histogram of bag of video-words) 42
3.2 (a) The classification performance comparison between the initial vocabulary
and the optimal vocabulary with different initial vocabulary sizes. (b) The
performance comparison between using MMI clustering and directly applying
the k-means algorithm. MMI clustering reduces the initial dimension of 1,000 to
the corresponding number 52
3.3 (a) Confusion table for the classification using the optimal number of VWCs
(Nc=177, average accuracy is 91.31%). (b) Confusion table for the classification using the VWC correlogram. The number of VWCs is 60, and 3 quantized
distances are used (average accuracy is 94.15%) 53
3.4 The first row shows the examples of six actions. The following two rows respectively demonstrate the distribution of the optimal 20 video-word clusters
using our approach and 20 video-words using k-means. We superimpose the
3D interest points in all frames into one image. Different clusters are represented by different color codes. Note that our model is more compact, e.g.
see the "waving" and "running" actions (Best viewed in color) 56
3.5 Example histograms of the VWCs (Nc=20) for two selected testing actions
from each action category. These demonstrate that actions from the same
category have similar VWC distribution, which means each category has some
dominating VWCs 57
3.6 (a) Performance (%) comparison between the original 1,000 video-words and
the optimal 189 video-word-clusters. (b) Average accuracy (%) using three
views for training and a single view for testing 59
3.7 The recognition performance when four views are used for training and a
single view is used for testing. The average accuracy is 82.8% 60
3.8 The recognition performance when four views are used for training and a single view is used
for testing 61
4.1 An illustration of hierarchical scene understanding 64
4.2 Work flow of the proposed scene classification framework 66
4.3 The graphical explanation of MMI co-clustering. The goal of MMI co-clustering
is to find one clustering of X and Y that minimizes the distance between the
distribution matrices p(x,y) and q(x,y) 66
4.4 An example showing the autocorrelograms of three synthetic images 73
4.5 Example histograms of intermediate concepts for 2 selected testing images
from each scene category 77
4.6 Confusion table of the best performance for the SCC+BOC model. The average performance is 81.72% 80
4.7 Example key frames selected from LSCOM Data set 80
4.8 The AP for the 28 categories. BOV-O and BOV-D represent the BOV models
with Nv = 3,000 and Nv = 250 respectively. CC-BOC and pLSA-BOC
denote the BOC models created by co-clustering and pLSA respectively 83
5.1 An illustration of the graph containing multiple entities as nodes. This includes images (red), SIFT descriptors (green), contours (purple) and regions
(yellow). The goal of our algorithm is to embed this graph in a k-dimensional
space so that semantically related nodes have geometric coordinates which are
closer to each other. (Please print in color) 88
5.2 The figure shows two visual words each from the interest point, contour and
region vocabularies. (a)-(b) Two words belonging to the interest point vocabulary. (c)-(d) Two words belonging to the contour vocabulary. (e)-(f) Two
words belonging to the region vocabulary 93
5.3 An illustration of the graph containing multiple entities as nodes. This includes ST features (red), Spin-Image features (yellow) and action videos (green).
The goal of our algorithm is to embed this graph in a k-dimensional space so
that similar nodes have geometric coordinates which are closer to each other. 98
5.4 Left: the (α, β) coordinates of a surface point relative to the oriented point
O. Right: the spin-image centered at O 100
5.5 Some 3D (x,y,t) action volumes (the first column) with some of their sampled
spin-images (red points are the oriented points). 102
5.6 Clustering of entities in the k-dimensional embedding space. The entities
are three image categories D1, D2, and D3, and five feature types T1, T2,
T3, T4, and T5. The synthetically generated co-occurrence table between
features and images is shown on the left side, while the graph represents
the assignments of feature types and image categories to the clusters in the
3-dimensional embedding space 104
5.7 Qualitative results when the query is a feature and the output is a set of
images. (a)-(b) Query: Interest point, Output: Ten nearest images. (c)-(d)
Query: Contour, Output: Ten nearest images. (e)-(f) Query: Region, Output:
Ten nearest images 107
5.8 The results of different combinations of the query-result entities. (a) Query:
Interest Point, Output: Five nearest interest points and contours. (b) Query:
Contour, Output: Five nearest interest points and regions. (c) Query: Regions, Output: Five nearest interest points, contours, and regions. (d) Query:
Image, Output: Five nearest interest points and contours. (e) Query: Image, Output: Five nearest interest points, contours, and regions. (f) Query:
Image, Output: Five nearest images 109
5.9 Figure summarizes results of different experiments. (a) A comparison of the BOW
approach with our method by using only the interest point features. (b) A
comparison of the BOW approach with our method by using both interest point
and contour features together. (c) A comparison of the BOW approach with our
method by using all three features. (d) A comparison of Fiedler embedding
with LSA by using all three feature types. (e) A comparison of performance
of our framework for different values of embedding dimension using only the
interest point features. (f) A comparison of contributions of different features
towards classification 111
5.10 Figure shows different combinations of query-result that we used for qualitative verification of the constructed k-dimensional space. Each rectangle
represents one entity (e.g. action video or a video-word (a group of features)).
In (a)-(c), the features in blue which are from one video-word are used as
query, and the 4 nearest videos in yellow from the k-dimensional space are
returned. Under each video-word, the category component percentage is also
shown (e.g. "wave2: 99%, wave1: 1%" means 99% of features in this video-word
are from the "wave2" action). In (d) and (e), we respectively used ST features
and Spin-Image features as query, and retrieved the nearest features in the
k-dimensional space. In (f) and (g), two action videos are used as query, and
the nearest features are returned 116
5.11 The comparison of the BOW approach with our weighted BOW method 117
5.12 (a) Confusion table for Fiedler embedding with k=20. (b) Confusion table for
LSA with k=25 118
5.13 The variation of embedding dimension affects the performance. All experiments are carried out on Nsi = Nip = 1,000 118
5.14 The contributions of different features to the classification. (a) Nsi = Nip =
200, k=20, 30 and 20 for ST features, Spin-Image features and the combination
respectively. (b) Nsi = Nip = 1,000, k=20, 70 and 20 for ST features, Spin-Image features and the combination respectively 119
6.1 Flowchart of learning the semantic visual vocabulary 123
6.2 Demonstration of robustness to noise. (a) Two-dimensional spiral points. (b-c) The distribution of the diffusion distance and geodesic distance between
points A and B. (d) KTH data set. (e-f) The distribution of the diffusion
distance and geodesic distance between two points on the KTH data set 128
6.3 (a) and (b) show the influence of diffusion time and sigma value, respectively,
on the recognition performance. The three curves correspond to three visual
vocabularies of size 100, 200, and 300 respectively. The sigma value is 3 in
(a) and the diffusion time is 5 in (b); (c) The comparison of recognition rate
between mid-level and high-level features 131
6.4 (a) Comparison of performance between different manifold learning schemes.
(b) Comparison of performance between DM and IB 132
6.5 (a) Confusion table of KTH data set when the size of the semantic visual vocabulary is 100. The average accuracy is 92.3%. (b) Performance comparison
between DM and other manifold learning schemes on the YouTube action data
set. (c) Confusion table of the YouTube data set when the size of semantic
visual vocabulary is 250. The average accuracy is 76.1% 133
6.6 The decay of the eigenvalues of P^t on the YouTube data set when sigma is 14 134
6.7 Some examples of mid-level and high-level features with their corresponding
real image patches. Each row lists one mid-level or high-level feature followed
by its image patches. The three mid-level features are selected from 40 mid-level features. The four high-level features are selected from 40 high-level
features generated by DM from 1,000 mid-level features 135
LIST OF TABLES
3.1 Major steps for the training phase of our framework 43
3.2 The number of training examples vs. the average performance 55
3.3 The performance comparison between different models. VW and VWC respectively denote video-words and video-word-clusters based methods, and
VW Correl and VWC Correl are their corresponding correlogram models.
STPM denotes the Spatiotemporal Pyramid Matching approach. The dimension denotes the number of VWs and VWCs 58
3.4 The performance of the different bag of video-words related approaches. pLSA ISM
is the major contribution of [111] 59
4.1 The average accuracy (%) achieved using strong and weak classifiers 75
4.2 The results achieved under different sampling spaces 76
4.3 The average accuracy (%) achieved using strong and weak classifiers 76
4.4 The performance (average accuracy %) of SPM using visual-words and intermediate concepts. SPM IC and SPM V denote SPM using intermediate
concepts and visual-words respectively 79
4.5 The average classiflcation accuracy (%) obtained by various models (SCC,
BOC, and SCC+BOC) 79
4.6 The MAP for the 28 LSCOM categories achieved by different approaches.
BOV-O and BOV-D represent the BOV models with Nv = 3,000 and Nv =
250 respectively. CC-BOC and pLSA-BOC denote the BOC models created
by co-clustering and pLSA respectively 82
5.1 Main steps of the action recognition framework 99
6.1 Procedure of diffusion maps embedding 127
6.2 Performance comparisons between two vocabularies learnt from mid-level features with and without DM embedding 137
6.3 Performance comparison between two different mid-level feature representations: PMI vs. Frequency 137
6.4 Best results of different manifold learning techniques 137

تسجيل | تسجيل الدخول