| ProfileLETOR: Learning to Rank ...BlogLists | Help |
LETOR: Learning to Rank for Information Retrieval |
|||||
|
July 16 Learning a Ranking from Pairwise Preferences
Two models are proposed for rank aggregation in the paper. One is logistic regression model, and the other is SVM based model. For logistic regression model, it defines a likelihood function for a pairwise preference, and then learns the scores of the documents to maximize the likelihood. The definition of likelihood of a pairwise preference is very similar to that in SoftRank [1]. First, assume the score \theta_i of document i is Gaussian distributed. Then, the probability of P(θ_i>θ_j) is defined as For SVM model, it is very similar to Ranking SVM. The only difference is that there is no supervision here, and the pairwise constraints come from the multiple ranking lists. This model is somehow related to the work on supervised rank aggregation [2]. Overall, this work is simple and easy to understand, since its content is limited by the size of poster. I think some of its ideas are helpful for further considerations: References Ranking with Large Margin Principle: Two ApproachesPaper Authors: Amnon Shashua, Anat Levin As for the paper, I have the following discussions 1) The assumption that all the hyper-planes are parallel is a little bit strong, so I think it would be better to discuss the applications which are suitable for the assumption. 2) The paper did not explain why the proposed methods are better than the baseline. Is the improvement caused by “large margin principle”? 3) It would be better to have a comparison between the two approaches, for example, in which situation the “large margin principle” is better. Direct Maximization of Rank-Based Metrics for Information RetrievalPaper Authors: Donald A. Metzler, W. Bruce Croft, and Andrew McCallum Published at CIIR 2005 Notes written by Yin He (CAS) Edited by Tie-Yan Liu (Microsoft Research Asia) This paper formulates the Learning to Rank problem as an optimization problem that maximizes IR performance measure by exploring the scoring function parameter space. The scoring function is defined as a real-valued function that gives a score for each query-document pair. Since the original formulated optimization problem is difficult to tackle, this paper first narrows the scoring functions to be strictly increasing linear one. And then, it transforms the original unconstrained problem to a constrained one by augmenting the feature vector with an additional component. More specifically, the feature vector augmentation is to make the sum of all elements of the new feature vector to be a fixed given value. And the induced constraint is to ensure the weight associated with the additional component to be zero. Next, this paper reduces the scoring function parameter space from a d+1 dimension Euclidean space to a d-simplex multinomial manifold, which preserves the global optimum. The correctness of this reducing process is ensured by a theorem. Finally, this paper utilizes simple coordinate ascent algorithm to explore the parameter space.
Since this paper wants to directly optimize IR performance measures, it cannot seek help from any gradient based optimization technique. As a result, it utilizes coordinate ascent algorithm to resolve the optimization problem. However, as we know, the coordinate ascent algorithm is not a "good" optimization algorithm because it cannot guarantee even local optimum. Another problem with the paper is that the coordinate ascent is performed in Euclidean space while the parameter is constrained to be in a multinomial manifold. Thus, after each update of the parameter, it is required to transform the parameter back to the manifold to check whether the new parameter is valid. If we can find some guidance for parameter update that ensures the validity of parameter without transforming it back, the algorithm will be more efficient.
And overall speaking, even after the transformation to the manifold, it seems the difficulty of directly optimizing the IR measure has not been reduced. So, the technical proposal in the paper has not really solved the target problem. High Accuracy Retrieval with Multiple Nested Ranker
This approach is started with the assumption that the results list returned by the search engine already produces a sufficiently good ranking so that some relevant documents are placed somewhere near the top of the ranked list. Then, the multiple nested ranker performs re-ranking of the results in stages, at each stage generating a new distribution of the results. The training set for each subsequent stage is pruned to include only the results that are ranked high by the previous ranker. Such pruning procedure is referred as telescoping in this paper. Telescoping splits the problem into smaller and easier sub-tasks to learn the ranking for each of the stages separately. The new approach is evaluated using different settings on a data set labeled with several degrees of relevance. NDCG is used to measure the performance. The experimental results show that making the learning algorithm concentrate on the top scoring results improves precision at the top ten documents in terms of the NDCG score. I found some interesting points of the paper as follows. 1) As opposed to boosting method, this approach splits the problem into smaller pieces so that each net has a smaller and simpler task. Telescoping removes presumably difficult relevant documents at the bottom of the ranked list from the training set and forces the algorithm to concentrate on the ranking of the high scoring relevant documents. In addition, as decreasing the size of the training set roughly exponentially, more sophisticated algorithms can be used to learn the ranking at later stages. 2) In the experiments, the author used all labeled documents and added at random a certain number of unlabeled results for the same query. They found that adding randomly chosen unlabeled documents as additional examples of low relevance documents to the training set helps to improve the performance.
1) The experimental results show that the exact number of the telescoping stages did not appear to be crucial for the performance of the approach. It may need further discussion on why the multiple nested ranker with different telescoping stages can give the same improvement. 2) The role of the unlabeled examples needs further investigation. They are considered to be labeled as not relevant during the training phase. However, the fact that they are placed among the top 100 and 10 at the last stages of telescoping suggests that they may also be relevant. 3) Since the training set is pruned after each stage, it is possible that some of the relevant documents are excluded from the following re-ranking. It may need further discussion on how to avoid this problem or whether this problem can cause a big affect on the performance. 4) The fact that the low scoring documents are removed from the training set at later stages of training, can be viewed as an attempt to introduce the information about the rank of the documents into the training procedure of the RankNet algorithm. It also can be viewed as another type of relevance feedback. It may be a very interesting problem to explore the relationship between such rank information and traditional relevance feedback, and to investigate further how such rank information can enhance learning to rank. Cranking: Combining Rankings Using conditional Probability Models on PermutationsPaper Authors: Guy Lebanon, John Lafferty Published at ICML 2002 Notes written by Yanyan Lan (CAS) Edited by Tie-Yan Liu (Microsoft Research Asia)
This paper adopts models on the symmetric permutation group and its cosets to ensemble learning for ranking. It uses a generalized Mallows model on permutations to combine multiple input rankings. Applications include the task of combining the outputs of multiple search engines and multiclass or multi-label classification. Overall speaking, the approach to ensemble learning on permutation is novel. But it just generalizes Mallows model to solve the combining problem in ranking, but has not provide discussions on why the Mallows model is chosen. And it has not offered a unified framework to explain the use of generative (probability) model in this setting. I think this could be one future research direction. For example, proposing a framework, and considering all those probabilistic models on permutations, such as Mallows model, Luce model, etc. in a unified way. And the Bayesian interpretation on using Mallows model of exponential form is quite nice. This motivates me to think about whether other probabilistic models (e.g. the Luce Model) can be interpreted in similar ways. |
|
||||
|
|