Using pairwise comparisons

Pairwise comparisons in social sciences, clinical studies, and machine learning

In this blog post we introduced pairwise comparisons with the Bradley–Terry model and the Bayesian Spatial Bradley–Terry model. On this page, we give a systematic overview of how pairwise comparisons are used in the social sciences, clinical studies, and machine learning.

Machine Learning

In machine learning, pairwise objectives unify modern ranking, retrieval and representation methods by directly comparing examples to shape a model’s output space. Ranking models such as RankNet and LambdaMART enforce that if item $i$ should outrank $j$, then $f(x_i)>f(x_j)$. For example, the hinge loss

\[\max\bigl(0,\,m - f(x_i) + f(x_j)\bigr)\]

with margin $m>0$, and the logistic loss

\[-\bigl[y_{ij}\log\sigma(f(x_i)-f(x_j)) + (1-y_{ij})\log(1-\sigma(f(x_i)-f(x_j)))\bigr],\]

where $y_{ij}=1$ if $i$ should rank above $j$ and $\sigma$ is the sigmoid, are the standard pairwise objectives in information retrieval and recommendation. LambdaMART augments the logistic objective by computing lambdas—pairwise swap gradients weighted by metrics such as NDCG—so that listwise metrics are optimized directly.
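
To make the two losses concrete, here is a minimal NumPy sketch of both pairwise objectives for a single pair of scores; the names `f_xi`, `f_xj`, the margin and the example values are illustrative, and this is not a full RankNet or LambdaMART implementation.

```python
import numpy as np

def hinge_pair_loss(f_xi, f_xj, margin=1.0):
    """Hinge loss max(0, m - f(x_i) + f(x_j)): zero once i beats j by the margin."""
    return np.maximum(0.0, margin - f_xi + f_xj)

def logistic_pair_loss(f_xi, f_xj, y_ij):
    """RankNet-style cross-entropy on sigma(f(x_i) - f(x_j))."""
    p = 1.0 / (1.0 + np.exp(-(f_xi - f_xj)))  # predicted Pr(i ranks above j)
    return -(y_ij * np.log(p) + (1 - y_ij) * np.log(1 - p))

# Item i is scored 1.2 above item j and should indeed rank above it (y_ij = 1).
print(hinge_pair_loss(2.3, 1.1))        # 0.0 -- the margin of 1.0 is cleared
print(logistic_pair_loss(2.3, 1.1, 1))  # ~0.26 -- small but nonzero cross-entropy
```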

Contrastive and self-supervised frameworks (e.g., SimCLR, MoCo) learn representations by maximizing agreement between positives—two augmentations of the same instance—and repelling negatives. They use the InfoNCE loss

\[-\frac{1}{N}\sum_{i=1}^N \log\frac{\exp\!\bigl(\mathrm{sim}(z_i,z_i^+)/\tau\bigr)}{\sum_{k\neq i}\exp\!\bigl(\mathrm{sim}(z_i,z_k)/\tau\bigr)},\]

where $\mathrm{sim}(u,v)=u^\top v/(\lVert u\rVert\,\lVert v\rVert)$ is the cosine similarity and $\tau$ is a temperature. MoCo stabilizes training by maintaining a momentum-updated encoder and a large FIFO queue of negatives, while hard-negative mining and learnable weighting schemes accelerate convergence and improve discriminability.
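
A compact sketch of this loss for one batch, assuming `z` and `z_pos` are the two augmented views of the same `N` instances as `(N, d)` arrays; unlike MoCo there is no momentum queue, and the negatives are simply the other in-batch positives (a SimCLR-style simplification), so treat it as illustrative rather than a reference implementation.

```python
import numpy as np

def info_nce(z, z_pos, tau=0.1):
    """InfoNCE over a batch: each row of z should match the same row of z_pos."""
    z = z / np.linalg.norm(z, axis=1, keepdims=True)              # unit-normalize so that
    z_pos = z_pos / np.linalg.norm(z_pos, axis=1, keepdims=True)  # dot product = cosine sim
    logits = z @ z_pos.T / tau                                    # (N, N) similarities / temperature
    # Row i's softmax should place its mass on column i (the positive pair).
    log_softmax = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_softmax))

rng = np.random.default_rng(0)
z = rng.normal(size=(8, 16))
print(info_nce(z, z + 0.05 * rng.normal(size=(8, 16))))  # small loss: the two views nearly agree
```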

Triplet and hierarchical metric learning extend pairwise comparisons to three examples—anchor $a$, positive $p$, and negative $n$—via

\[\max\bigl(0,\;d\bigl(f(a),f(p)\bigr)-d\bigl(f(a),f(n)\bigr)+m\bigr),\]

where $d$ is typically squared Euclidean or cosine distance. Cutting-edge variants sample negatives at multiple granularity levels, enforcing distinct margins for coarse and fine clusters, and adaptively schedule the margin $m$ during training based on batch statistics to balance intra-class compactness against inter-class separation.
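
As a reference point, a bare-bones version of this triplet loss with squared Euclidean distance follows; `f_a`, `f_p`, `f_n` stand for already-computed embeddings, and the multi-granularity sampling and adaptive margin scheduling mentioned above are deliberately left out.

```python
import numpy as np

def triplet_loss(f_a, f_p, f_n, margin=0.2):
    """max(0, d(a,p) - d(a,n) + m) with squared Euclidean distance."""
    d_ap = np.sum((f_a - f_p) ** 2)   # anchor-positive distance (should be small)
    d_an = np.sum((f_a - f_n) ** 2)   # anchor-negative distance (should be large)
    return max(0.0, d_ap - d_an + margin)

a = np.array([0.0, 0.0]); p = np.array([0.1, 0.0]); n = np.array([1.0, 1.0])
print(triplet_loss(a, p, n))  # 0.0: the negative is already far enough away
```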

By integrating ranking losses, contrastive objectives and higher-order triplet constraints—and by carefully designing sampling strategies, temperature scaling, margin dynamics and encoder architectures—state-of-the-art systems converge faster and yield representations that robustly generalize across downstream tasks.

Social Sciences

Pairwise comparisons are fundamental in the social sciences for eliciting latent preferences: judges choose between pairs of stimuli (e.g. policy options, products or artworks), and probabilistic models such as Bradley–Terry or Thurstone’s Case V infer continuous utility or ability parameters from the resulting ordinal data. In the classic Bradley–Terry model each item $i$ has latent ability $\theta_i$, so

\[\Pr(i \succ j) = \frac{e^{\theta_i}}{e^{\theta_i} + e^{\theta_j}},\]

and one estimates $\{\theta_k\}$ by maximizing the log-likelihood

\[\ell = \sum_{i<j}\bigl[n_{ij}\ln \Pr(i\succ j) + n_{ji}\ln \Pr(j\succ i)\bigr]\]

subject to $\sum_k\theta_k=0$, where $n_{ij}$ denotes the number of comparisons in which $i$ was preferred to $j$.
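
A short sketch of fitting these abilities with the classical minorization–maximization (Zermelo) updates; `wins` is assumed to be a $K\times K$ count matrix with `wins[i, j]` $= n_{ij}$, each item is assumed to have at least one win, and the iteration count is illustrative.

```python
import numpy as np

def fit_bradley_terry(wins, n_iter=200):
    """Estimate theta_k from a K x K win-count matrix via MM updates."""
    K = wins.shape[0]
    p = np.ones(K)                                  # p_k = exp(theta_k), up to scaling
    total = wins + wins.T                           # n_ij + n_ji comparisons per pair
    for _ in range(n_iter):
        for i in range(K):
            denom = np.sum(total[i] / (p[i] + p))   # the i-vs-i term is 0/(2 p_i) = 0
            p[i] = wins[i].sum() / denom            # MM update for item i
        p /= np.exp(np.mean(np.log(p)))             # enforce sum_k theta_k = 0
    return np.log(p)                                # theta_k estimates

# Three items: 0 usually beats 1, and 1 usually beats 2.
wins = np.array([[0, 8, 9], [2, 0, 7], [1, 3, 0]])
print(fit_bradley_terry(wins))                      # theta_0 > theta_1 > theta_2
```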

Recent extensions push beyond unidimensional scales, for example by giving the latent abilities spatial or multidimensional structure, as in the Bayesian Spatial Bradley–Terry model introduced in the blog post.

Clinical Studies

Head-to-head trials—randomizing subjects to treatments A or B and often employing crossover or sequential designs—remain the gold standard for assessing treatment effects and diagnostic accuracy. For binary endpoints (e.g. response vs. failure) in paired or crossover designs, one directly estimates $\pi = \Pr(A\text{ better than }B) = \Pr(X_A=1,\,X_B=0)$ and tests symmetry with McNemar’s statistic $\chi^2 = \frac{(n_{10}-n_{01})^2}{n_{10}+n_{01}}\sim\chi^2_1$, where $n_{10}$ and $n_{01}$ count the discordant pairs (success on A with failure on B, and vice versa). For continuous outcomes (e.g. symptom scores), let $d_i = Y_{A,i}-Y_{B,i}$ and compute the paired $t$-statistic $t = \frac{\bar d}{s_d/\sqrt n}$, where $s_d^2 = \frac{1}{n-1}\sum_i(d_i-\bar d)^2$.
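
Both paired analyses fit in a few lines; the sketch below assumes `n10` and `n01` are the observed discordant-pair counts and `y_a`, `y_b` are paired continuous outcomes, and it uses SciPy only for the reference distributions.

```python
import numpy as np
from scipy import stats

def mcnemar_chi2(n10, n01):
    """McNemar's test for paired binary outcomes (1 degree of freedom)."""
    chi2 = (n10 - n01) ** 2 / (n10 + n01)
    return chi2, stats.chi2.sf(chi2, df=1)           # statistic, p-value

def paired_t(y_a, y_b):
    """Paired t-test on within-subject differences d_i = Y_A,i - Y_B,i."""
    d = np.asarray(y_a, dtype=float) - np.asarray(y_b, dtype=float)
    t = d.mean() / (d.std(ddof=1) / np.sqrt(len(d)))
    return t, 2 * stats.t.sf(abs(t), df=len(d) - 1)  # two-sided p-value

print(mcnemar_chi2(30, 12))             # chi2 ~ 7.7, p ~ 0.005: asymmetry between A and B
print(paired_t([5, 6, 7, 9], [4, 6, 5, 8]))
```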

Beyond these classical designs, modern trials integrate adaptive and network-meta-analytic techniques. In a Bayesian adaptive design one places a prior $\delta\sim N(0,\tau^2)$ on the treatment effect, updates the posterior after each cohort, and stops when $\Pr(\delta>0\mid\mathrm{data})>0.975$. Network meta-analysis unifies direct and indirect log-odds ratios $d_{ij}$ by estimating effects $\{\Delta_k\}$ that minimize $\sum_{(i,j)\in\mathcal E}w_{ij}(\Delta_i-\Delta_j-d_{ij})^2$ subject to consistency constraints on closed loops. Group-sequential methods employ an α-spending function $\alpha(t)$ (e.g. O’Brien–Fleming) to set interim $Z$-boundaries while preserving the overall type I error. Finally, utility-elicitation frameworks ask patients to compare health states pairwise, deriving QALY weights under probabilistic models. These approaches extend the reach of head-to-head trials when direct comparisons are sparse, improving evidence-based decision-making.
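
To illustrate just the first of these, here is a sketch of the conjugate normal–normal monitoring rule described above; the prior scale `tau`, the known outcome standard deviation `sigma`, and the cohorts of per-patient treatment differences are all illustrative assumptions rather than a specific trial's design.

```python
import numpy as np
from scipy.stats import norm

def adaptive_monitor(cohorts, tau=1.0, sigma=2.0, threshold=0.975):
    """Stop early once Pr(delta > 0 | data) exceeds the threshold."""
    prec, mean = 1.0 / tau**2, 0.0                        # prior N(0, tau^2) on delta
    p_sup = 0.0
    for k, diffs in enumerate(cohorts, start=1):
        d = np.asarray(diffs, dtype=float)                # per-patient A - B differences
        data_prec = len(d) / sigma**2                     # precision added by this cohort
        mean = (prec * mean + data_prec * d.mean()) / (prec + data_prec)
        prec += data_prec                                 # conjugate normal-normal update
        p_sup = norm.sf(0.0, loc=mean, scale=prec**-0.5)  # Pr(delta > 0 | data so far)
        if p_sup > threshold:
            return f"stop for efficacy after cohort {k} (Pr = {p_sup:.3f})"
    return f"continue past final cohort (Pr = {p_sup:.3f})"

# Stops after the second cohort in this toy example.
print(adaptive_monitor([[1.2, 0.8, 1.5, 0.9, 1.3, 0.7, 1.4, 1.0],
                        [1.1, 1.4, 0.7, 1.6, 1.2, 1.3, 0.9, 1.4]]))
```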