Visual Categorization with Bags of Keypoints

Gabriella Csurka, Christopher R. Dance, Lixin Fan, Jutta Willamowski, Cédric Bray
Xerox Research Centre Europe
6, chemin de Maupertuis
38240 Meylan, France
gcsurka,

Abstract. We present a novel method for generic visual categorization: the problem of identifying the object content of natural images while generalizing across variations inherent to the object class. This bag of keypoints method is based on vector quantization of affine invariant descriptors of image patches. We propose and compare two alternative implementations using different classifiers: Naïve Bayes and SVM. The main advantages of the method are that it is simple, computationally efficient and intrinsically invariant. We present results for simultaneously classifying seven semantic visual categories. These results clearly demonstrate that the method is robust to background clutter and produces good categorization accuracy even without exploiting geometric information.

1. Introduction

The proliferation of digital imaging sensors in mobile phones and consumer-level cameras is producing a growing number of large digital image collections. To manage such collections it is useful to have access to high-level information about objects contained in the image. Given an appropriate categorization of image contents, one may efficiently search, recommend, react to or reason with new image instances.

We are thus confronted with the problem of generic visual categorization. We should like to identify processes that are sufficiently generic to cope with many object types simultaneously and which are readily extended to new object types. At the same time, these processes should handle the variations in view, imaging, lighting and occlusion typical of the real world, as well as the intra-class variations typical of semantic classes of everyday objects.

The task-dependent and evolving nature of visual categories motivates an example-based machine learning approach. This paper presents a bag of keypoints approach to visual categorization. A bag of keypoints corresponds to a histogram of the number of occurrences of particular image patterns in a given image. The main advantages of the method are its simplicity, its computational efficiency and its invariance to affine transformations, as well as occlusion, lighting and intra-class variations.
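To illustrate the representation itself (our example, not the paper's): once each patch descriptor in an image has been assigned to one of k vocabulary clusters, the bag of keypoints is simply the k-bin histogram of those assignments. A minimal NumPy sketch with made-up assignments:

```python
import numpy as np

# Hypothetical output of vector quantization: the vocabulary-cluster index
# assigned to each patch descriptor found in one image (vocabulary size k = 8).
k = 8
assignments = np.array([3, 1, 3, 7, 0, 3, 1, 5])

# The bag of keypoints: the number of patches assigned to each cluster.
bag_of_keypoints = np.bincount(assignments, minlength=k)
print(bag_of_keypoints)  # [1 2 0 3 0 1 0 1]
```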
It is important to understand the distinction of visual categorization from three related problems:

Recognition: This concerns the identification of particular object instances. For instance, recognition would distinguish between images of two structurally distinct cups, while categorization would place them in the same class.

Content Based Image Retrieval: This refers to the process of retrieving images on the basis of low-level image features, given a query image or a manually constructed description of these low-level features. Such descriptions frequently have little relation to the semantic content of the image.

Detection: This refers to deciding whether or not a member of one visual category is present in a given image. Most previous work on detection has centered on machine learning approaches to detecting faces, cars or pedestrians [1-6]. While it would be possible to perform generic categorization by applying a detector for each class of interest to a given image, this approach becomes inefficient given a large number of classes. In contrast to the technique proposed in this paper, most existing detection techniques require precise manual alignment of the training images and the segregation of these images into different views [5], neither of which is necessary in our approach.

Our bag of keypoints approach can be motivated by an analogy to learning methods using the bag-of-words representation for text categorization [7-10]. The idea of adapting text categorization approaches to visual categorization is not new. Zhu et al. [11] investigated the vector quantization of small square image windows, which they called keyblocks. They showed that these features produced more "semantics-oriented" results than color and texture based approaches, when combined with analogues of the well-known vector-, histogram- and n-gram-models of text retrieval. In contrast to our approach, their keyblocks do not possess any invariance properties.

The idea of clustering invariant descriptors of image patches has previously been applied to the problem of texture classification [12-14]. Clearly the problem of texture classification is rather different from that of generic categorization, so it is natural that these approaches differ from ours. While our method uses clustering to obtain quite high-dimensional feature vectors for a classifier, these texture recognizers use clustering to obtain relatively low-dimensional histograms and evaluate the similarity of these histograms to previously seen probability densities. In [12-13] filter responses are clustered and recognition is done using the closest model measured by a χ² test. Lazebnik et al. [14] cluster affine invariant interest points in each image individually and summarize the distribution of the descriptors in the form of a signature composed of representative cluster members and weights proportional to cluster sizes. Signatures of different images are compared using the Earth Mover's Distance [15].

Recently Fergus et al. [16] proposed a visual categorization method based on invariant descriptors of image patches. Their method exploits a probabilistic model that combines likelihoods for appearance, relative scale and position, as well as a model of the statistics of their patch detector. This elegant approach has a number of limitations. Firstly, the method is not efficient: even when models are restricted to 6 image patches and training images only contain up to 30 patches, days of CPU time are required to learn several categories. Secondly, views of objects used for training must be segregated, for instance into cars (rear) and cars (side). This is unsurprising given the use of an explicit 2D model of relative positions.

In section 2 we explain the categorization algorithms and the choice of their components. In section 3 we present results from applying the algorithms to the dataset of Fergus et al. and to a more challenging seven class dataset. We demonstrate that our approach is robust to the presence of background clutter and produces state-of-the-art recognition performance.
2. The method

The main steps of our method are:
• Detection and description of image patches
• Assigning patch descriptors to a set of predetermined clusters (a vocabulary) with a vector quantization algorithm
• Constructing a bag of keypoints, which counts the number of patches assigned to each cluster
• Applying a multi-class classifier, treating the bag of keypoints as the feature vector, and thus determining which category or categories to assign to the image.

Ideally these steps are designed to maximize classification accuracy while minimizing computational effort. Thus, the descriptors extracted in the first step should be invariant to variations that are irrelevant to the categorization task (image transformations, lighting variations and occlusions) but rich enough to carry enough information to be discriminative at the category level. The vocabulary used in the second step should be large enough to distinguish relevant changes in image parts, but not so large as to distinguish irrelevant variations such as noise.

We refer to the quantized feature vectors (cluster centres) as "keypoints" by analogy with "keywords" in text categorization. However, in our case "words" do not necessarily have a repeatable meaning such as "eyes" or "car wheels", nor is there an obvious best choice of vocabulary. Rather, our goal is to use a vocabulary that allows good categorization performance on a given training dataset. Therefore the steps involved in training the system allow consideration of multiple possible vocabularies:
• Detection and description of image patches for a set of labeled training images
• Constructing a set of vocabularies: each is a set of cluster centres, with respect to which descriptors are vector quantized
• Extracting bags of keypoints for these vocabularies
• Training multi-class classifiers using the bags of keypoints as feature vectors
• Selecting the vocabulary and classifier giving the best overall classification accuracy.
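The following is a minimal sketch of this pipeline (ours, not the paper's code), assuming OpenCV and scikit-learn. OpenCV's built-in SIFT detector stands in for the Harris affine detector described in section 2.1, a linear SVM stands in for the classifiers compared in the paper, and the file names and vocabulary size are illustrative:

```python
import cv2
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import LinearSVC

def extract_descriptors(image_paths):
    """Step 1: detect interest points and compute a 128-d SIFT descriptor for each."""
    sift = cv2.SIFT_create()
    per_image = []
    for path in image_paths:
        img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
        _, desc = sift.detectAndCompute(img, None)
        per_image.append(desc if desc is not None else np.empty((0, 128), np.float32))
    return per_image

def build_vocabulary(per_image_desc, k):
    """Step 2 (training): vector quantize all training descriptors into k clusters."""
    return KMeans(n_clusters=k, n_init=3).fit(np.vstack(per_image_desc))

def bag_of_keypoints(vocab, desc):
    """Step 3: count the number of patches assigned to each cluster."""
    return np.bincount(vocab.predict(desc), minlength=vocab.n_clusters).astype(float)

# Hypothetical labeled training set; replace with real image paths and labels.
train_paths = ["cars_001.jpg", "faces_001.jpg"]
train_labels = ["cars", "faces"]

train_desc = extract_descriptors(train_paths)
vocab = build_vocabulary(train_desc, k=1000)   # vocabulary size is a design choice
X_train = np.array([bag_of_keypoints(vocab, d) for d in train_desc])
clf = LinearSVC().fit(X_train, train_labels)   # step 4: multi-class classifier

# Categorizing a new image.
query_desc = extract_descriptors(["query.jpg"])[0]
print(clf.predict([bag_of_keypoints(vocab, query_desc)]))
```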
We now discuss the choices made for each step in more detail.

2.1 Feature extraction

In computer vision, local descriptors (i.e. features computed over limited spatial support) have proved well-adapted to matching and recognition tasks, as they are robust to partial visibility and clutter. Such tasks require descriptors that are repeatable. Here, we mean repeatable in the sense that if there is a transformation between two instances of an object, corresponding points are detected and (ideally) identical descriptor values are obtained around each. This has motivated the development of several scale and affine invariant point detectors, as well as descriptors that are resistant to geometric and illumination variations [17-21].

It was shown in [21] that when there is an affine transformation between two images, a scale invariant point detector is not sufficient to ensure stable point locations. We therefore preferred to work with the Harris affine detector described in [21]. However, the reader should be aware that the benefits of this choice are not clear cut: firstly because most real world objects have three-dimensional structures whose variations are not well captured by affine transformations; secondly since attempts to increase the invariance of a feature typically result in a loss of discriminative information.

Harris affine points are detected by an iterative process. Firstly, positions and scales of interest points are determined as local maxima (in position) of a scale-adapted Harris function, and as local extrema in scale of the Laplacian operator. Then an elliptical (i.e. affine) neighborhood is determined. This has a size given by the selected scale and a shape given by the eigenvalues of the image's second moment matrix. The selection of position/scale and the elliptical neighborhood estimation are then iterated, and the point is kept only if the process converges within a fixed number of iterations.

The affine region is then mapped to a circular region, so normalizing it for affine transformations. Scale Invariant Feature Transform (SIFT) descriptors [18] are computed on that region. SIFT descriptors are multi-image representations of an image neighborhood. They are Gaussian derivatives computed at 8 orientation planes over a 4x4 grid of spatial locations, giving a 128-dimension vector. Fig. 1 shows an example of the maps of gradient magnitude corresponding to the 8 orientations.

Fig. 1. (From left to right) A Harris affine region; the normalized region; and the 8 maps of gradient magnitude constituting the SIFT descriptor.

We prefer SIFT descriptors to alternatives such as steered Gaussian derivatives or differential invariants of the local jet for the following reasons:

1. They are simple linear Gaussian derivatives. Hence we expect them to be more stable to typical image perturbations such as noise than higher Gaussian derivatives or differential invariants.

2. The use of a simple Euclidean metric in the feature space seems justified. In the case of differential invariants obtained by a combination of the components of the local jet, the use of a Mahalanobis distance is more appropriate. For instance, one would expect a second derivative feature to have a higher variance than a first derivative. Selecting an appropriate Mahalanobis distance a priori seems challenging. It would not be appropriate to use the covariance matrix of SIFT descriptors over the entire dataset, since this is predominantly influenced by inter-class variations (or more precisely, by variations between keypoints that we do not wish to ignore). Measuring a Mahalanobis distance would probably require manual specification of multiple homologous matching points between different images of objects of the same category, seriously working against our objective of producing a simple and automated categorization system.

3. There are far more components to these feature vectors (128 rather than 12 to 16). Hence we have a richer and potentially more discriminative representation.

Recently Mikolajczyk et al. [22] have compared several descriptors for matching and found that SIFT descriptors perform best.
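To make the 4x4 grid of 8 orientation planes concrete, the following toy sketch (our illustration, not the paper's implementation) pools gradient magnitude into such a layout. It deliberately omits the Gaussian weighting, interpolation and normalization of real SIFT; it only shows where the 128 components come from:

```python
import numpy as np

def toy_sift_layout(patch):
    """Pool gradient magnitude into 8 orientation planes over a 4x4 grid.

    patch: a square 2-D float array (e.g. a normalized 16x16 region).
    Returns a 4 * 4 * 8 = 128 component vector.
    """
    gy, gx = np.gradient(patch)
    mag = np.hypot(gx, gy)
    ori = np.arctan2(gy, gx)                          # angle in (-pi, pi]
    plane = ((ori + np.pi) / (2 * np.pi) * 8).astype(int) % 8
    n = patch.shape[0] // 4                           # cell side length
    desc = np.zeros((4, 4, 8))
    for i in range(4):
        for j in range(4):
            cell = np.s_[i * n:(i + 1) * n, j * n:(j + 1) * n]
            for o in range(8):                        # one map per orientation
                desc[i, j, o] = mag[cell][plane[cell] == o].sum()
    return desc.ravel()

print(toy_sift_layout(np.random.rand(16, 16)).shape)  # (128,)
```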
2.2 Visual vocabulary construction

In our method, the vocabulary is a way of constructing a feature vector for classification that relates "new" descriptors in query images to descriptors previously seen in training. One extreme of this approach would be to compare each query descriptor to all training descriptors: this seems impractical given the huge number of training descriptors involved (over 600 000 for our data set). Another extreme would be to try to identify a small number of large "clusters" that are good at discriminating a given class: for instance [16] operates with 6 parts per category. In practice we find that the best tradeoffs of accuracy and computational efficiency are obtained for intermediate sizes of clustering.

Most clustering or vector quantization algorithms are based on iterative square-error partitioning or on hierarchical techniques. Square-error partitioning algorithms attempt to obtain the partition which minimizes the within-cluster scatter or maximizes the between-cluster scatter. Hierarchical techniques organize data in a nested sequence of groups which can be displayed in the form of a dendrogram or a tree. They need some heuristics to form clusters and hence are less frequently used than square-error partitioning techniques in pattern recognition.

We chose to use the simplest square-error partitioning method: k-means [23]. This algorithm proceeds by iterated assignments of points to their closest cluster centers and recomputation of the cluster centers. Two difficulties are that the k-means algorithm converges only to local optima of the squared distortion, and that it does not determine the parameter k. There exist methods for automatically estimating the number of clusters. For example, Pelleg et al. [24] use cluster splitting, where the splitting decision is made by computing the Bayesian Information Criterion. However, in the present case we do not really know anything about the density or the compactness of our clusters. Moreover, we are not even interested in a "correct clustering" in the sense of feature distributions, but rather in accurate categorization. We therefore run k-means several times with different numbers of desired representative vectors (k) and different sets of initial cluster centers, and select the final clustering giving the lowest empirical risk in categorization [25].
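One way this selection could be realized (a sketch under assumptions, not the paper's protocol): run k-means for several values of k and several random initializations, and keep the vocabulary whose bags of keypoints give the lowest cross-validated error. The candidate sizes, the 5-fold cross-validation and the linear SVM below are illustrative choices, assuming scikit-learn:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

def select_vocabulary(per_image_desc, labels, ks=(100, 300, 1000), n_restarts=3):
    """Pick the k-means vocabulary with the lowest empirical categorization risk."""
    all_desc = np.vstack(per_image_desc)
    best_err, best_vocab = np.inf, None
    for k in ks:
        for seed in range(n_restarts):            # different initial cluster centers
            vocab = KMeans(n_clusters=k, n_init=1, random_state=seed).fit(all_desc)
            X = np.array([np.bincount(vocab.predict(d), minlength=k)
                          for d in per_image_desc], dtype=float)
            # Cross-validated error rate as the estimate of empirical risk.
            err = 1.0 - cross_val_score(LinearSVC(), X, labels, cv=5).mean()
            if err < best_err:
                best_err, best_vocab = err, vocab
    return best_vocab, best_err
```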
2.3 Categorization