ProfileLETOR: Learning to Rank ...BlogLists Tools Help

LETOR: Learning to Rank for Information Retrieval

July 16

Learning a Ranking from Pairwise Preferences


Paper Authors: Ben Carterette, Desislava Petkova
Published at SIGIR 2006

Notes written by Tao Qin (Microsoft Research Asia)
Edited by Tie-Yan Liu (Microsoft Research Asia)



This is a poster about ranking aggregation. The goal is to learn a ranking from multiple ranking lists. Since this is the standard setting of rank aggregation, I am curious why they did not use some term like “rank aggregation” in the title.

Two models are proposed for rank aggregation in the paper. One is logistic regression model, and the other is SVM based model.

For logistic regression model, it defines a likelihood function for a pairwise preference, and then learns the scores of the documents to maximize the likelihood. The definition of likelihood of a pairwise preference is very similar to that in SoftRank [1]. First, assume the score \theta_i of document i is Gaussian distributed. Then, the probability of P(θ_i>θ_j) is defined as
P(θ_i>θ_j )=ϕ(θ_i>θ_j)
Where ϕ is the normal cumulative distribution. Third, the likelihood of pairwised preference is defined as
L(θ_i>θ_j )=P(θ_i>θ_j )^(y_(i,j) ) (1-P(θ_i>θ_j ) )^(1-y_(i,j) )
Where y_(i,j)=1  if d_i > d_j, and y_(i,j)=0 if d_j>d_i.

For SVM model, it is very similar to Ranking SVM. The only difference is that there is no supervision here, and the pairwise constraints come from the multiple ranking lists. This model is somehow related to the work on supervised rank aggregation [2].

Overall, this work is simple and easy to understand, since its content is limited by the size of poster. I think some of its ideas are helpful for further considerations:
1) How to define the likelihood of pairwise preference.
2) Treat the ranking lists to be aggregated as supervision.

References
[1] Michael J Taylor, John Guiver, Stephen Robertson and Tom Minka. SoftRank: Optimising Non-Smooth Ranking Metrics, WSDM2008
[2] Yu-Ting Liu, Tie-Yan Liu, Tao Qin, Zhi-Ming Ma, Hang Li, Supervised Rank Aggregation, WWW 2007 .

Ranking with Large Margin Principle: Two Approaches

Paper Authors: Amnon Shashua, Anat Levin
Published at NIPS 2002.

Notes written by Xiubo Geng (CAS)

Edited by Tie-Yan Liu (Microsoft Research Asia)

This paper focuses on the problem of classifying an object to one of several ordinal categories. The hyper-planes to separate different classes are parallel to each other. In order to learn the hyper-planes, it proposes two kinds of “large margin principle”: the “fixed margin” strategy and the “sum of margins” strategy. The proposed two approach’s time complexity is linear to the size of training instances, which makes them feasible.

As for the paper, I have the following discussions

1) The assumption that all the hyper-planes are parallel is a little bit strong, so I think it would be better to discuss the applications which are suitable for the assumption.

2) The paper did not explain why the proposed methods are better than the baseline. Is the improvement caused by “large margin principle”?

3) It would be better to have a comparison between the two approaches, for example, in which situation the “large margin principle” is better.

Direct Maximization of Rank-Based Metrics for Information Retrieval


Paper Authors: Donald A. Metzler, W. Bruce Croft, and Andrew McCallum
Published at CIIR 2005

Notes written by Yin He (CAS)

Edited by Tie-Yan Liu (Microsoft Research Asia)

This paper formulates the Learning to Rank problem as an optimization problem that maximizes IR performance measure by exploring the scoring function parameter space. The scoring function is defined as a real-valued function that gives a score for each query-document pair.
Since the original formulated optimization problem is difficult to tackle, this paper first narrows the scoring functions to be strictly increasing linear one. And then, it transforms the original unconstrained problem to a constrained one by augmenting the feature vector with an additional
component. More specifically, the feature vector augmentation is to make the sum of all elements of the new feature vector to be a fixed given value. And the induced constraint is to ensure the weight associated with the additional component to be zero. Next, this paper reduces the scoring function parameter space from a d+1 dimension Euclidean space to a d-simplex multinomial manifold, which preserves the global optimum. The correctness of this reducing process is ensured by a theorem. Finally, this paper utilizes simple coordinate ascent algorithm to explore the parameter space.
 
Since this paper wants to directly optimize IR performance measures, it cannot seek help from any gradient based optimization technique. As a result, it utilizes coordinate ascent algorithm to resolve the optimization problem. However, as we know, the coordinate ascent algorithm is not a "good" optimization algorithm because it cannot guarantee even local optimum. Another problem with the paper is that the coordinate ascent is performed in Euclidean space while the parameter is constrained to be in a multinomial manifold. Thus, after each update of the parameter, it is required to transform the parameter back to the manifold to check whether the new parameter is valid. If we can find some guidance for parameter update that ensures the validity of parameter without transforming it back, the algorithm will be more efficient.
 
And overall speaking, even after the transformation to the manifold, it seems the difficulty of directly optimizing the IR measure has not been reduced. So, the technical proposal in the paper has not really solved the target problem.

High Accuracy Retrieval with Multiple Nested Ranker


Paper Authors: I. Matveeva and C. Burges and T. Burkard and A. Laucius and L. Wong
Published at SIGIR 2006.

Notes written by Jiang Bian (Georgia Institute of Technology)
Edited by Tie-Yan Liu (Microsoft Research Asia)


This paper presents the problem of achieving high accuracy as learning the re-ranking of the results at the top of the results list. The authors propose the multiple nested ranker approach which is applied to the list of results returned by the search engine. This approach re-ranks the documents on the results list in stages, at each stage applying the RankNet algorithm to learn a new ranking.

This approach is started with the assumption that the results list returned by the search engine already produces a sufficiently good ranking so that some relevant documents are placed somewhere near the top of the ranked list. Then, the multiple nested ranker performs re-ranking of the results in stages, at each stage generating a new distribution of the results. The training set for each subsequent stage is pruned to include only the results that are ranked high by the previous ranker. Such pruning procedure is referred as telescoping in this paper. Telescoping splits the problem into smaller and easier sub-tasks to learn the ranking for each of the stages separately.

The new approach is evaluated using different settings on a data set labeled with several degrees of relevance. NDCG is used to measure the performance. The experimental results show that making the learning algorithm concentrate on the top scoring results improves precision at the top ten documents in terms of the NDCG score.

I found some interesting points of the paper as follows.

1) As opposed to boosting method, this approach splits the problem into smaller pieces so that each net has a smaller and simpler task. Telescoping removes presumably difficult relevant documents at the bottom of the ranked list from the training set and forces the algorithm to concentrate on the ranking of the high scoring relevant documents. In addition, as decreasing the size of the training set roughly exponentially, more sophisticated algorithms can be used to learn the ranking at later stages.

2) In the experiments, the author used all labeled documents and added at random a certain number of unlabeled results for the same query. They found that adding randomly chosen unlabeled documents as additional examples of low relevance documents to the training set helps to improve the performance.


And as for the paper, I have the following discussions.

1) The experimental results show that the exact number of the telescoping stages did not appear to be crucial for the performance of the approach. It may need further discussion on why the multiple nested ranker with different telescoping stages can give the same improvement.

2) The role of the unlabeled examples needs further investigation. They are considered to be labeled as not relevant during the training phase. However, the fact that they are placed among the top 100 and 10 at the last stages of telescoping suggests that they may also be relevant.

3) Since the training set is pruned after each stage, it is possible that some of the relevant documents are excluded from the following re-ranking. It may need further discussion on how to avoid this problem or whether this problem can cause a big affect on the performance.

4) The fact that the low scoring documents are removed from the training set at later stages of training, can be viewed as an attempt to introduce the information about the rank of the documents into the training procedure of the RankNet algorithm. It also can be viewed as another type of relevance feedback. It may be a very interesting problem to explore the relationship between such rank information and traditional relevance feedback, and to investigate further how such rank information can enhance learning to rank.

Cranking: Combining Rankings Using conditional Probability Models on Permutations


Paper Authors: Guy Lebanon, John Lafferty
Published at ICML 2002

Notes written by Yanyan Lan (CAS)
Edited by Tie-Yan Liu (Microsoft Research Asia)

This paper adopts models on the symmetric permutation group and its cosets to ensemble learning for ranking. It uses a generalized Mallows model on permutations to combine multiple input rankings. Applications include the task of combining the outputs of multiple search engines and multiclass or multi-label classification. 

Overall speaking, the approach to ensemble learning on permutation is novel. But it just generalizes Mallows model to solve the combining problem in ranking, but has not provide discussions on why the Mallows model is chosen. And it has not offered a unified framework to explain the use of generative (probability) model in this setting. I think this could be one future research direction. For example, proposing a framework, and considering all those probabilistic models on permutations, such as Mallows model, Luce model, etc. in a unified way.

And the Bayesian interpretation on using Mallows model of exponential form is quite nice. This motivates me to think about whether other probabilistic models (e.g. the Luce Model) can be interpreted in similar ways.
 
No list items have been added yet.