Accepted Papers

ID: 3

Selective Blocking for Message-Passing Neural Networks on Heterophilic Graphs

Yoonhyuk Choi, Taewook Ko, Jiho Choi, Chong-Kwon Kim
[OpenReview]

Abstract

Graph Neural Networks (GNNs) thrive on message passing (MP) but are vulnerable when the graph carries many heterophilic or misclassified edges. Prior analyses suggest that signed propagation can mitigate over‑smoothing under low edge–error rates, yet they implicitly assume perfect edge labels and the presence of self‑loops. We revisit this setting and show that, under high edge uncertainty, propagating any information may harm node separability even with signed weights. Our key insight is to decide not to propagate along uncertain edges adaptively. Concretely, we intentionally omit self‑loops to isolate pure neighbor influence for a clearer theoretical analysis, adopt a row-stochastic (asymmetric) operator that matches the Markov–chain view of MP and simplifies spectral‑radius proofs, and dynamically estimate the local homophily $b_i$ and edge‑classification error $e_t$ during training via an EM procedure. We prove that our selective blocking yields a sub‑stochastic propagation matrix whose joint spectral radius exceeds that of signed GNNs under high $e_t$, guaranteeing reduced over‑smoothing, and we supply a lemma showing that class‑discriminative signals survive even when the operator is rank‑deficient. Extensive experiments on seven homophilic and heterophilic benchmarks confirm that the proposed adaptive blocking outperforms strong baselines.

ID: 5

Over the Top-1: Uncertainty-Aware Cross-Modal Retrieval with CLIP

Lluis Gomez
[OpenReview]

Abstract

State-of-the-art vision-language models, such as CLIP, achieve remarkable performance in cross-modal retrieval tasks, yet estimating their predictive uncertainty remains an open challenge. While recent works have explored probabilistic embeddings to quantify retrieval uncertainty, these approaches often require model retraining or fine-tuning adapters, making them computationally expensive and dataset-dependent. In this work, we propose a training-free framework for uncertainty estimation in cross-modal retrieval. We start with a simple yet effective baseline that uses the cosine similarity between a query and its top-ranked match as an uncertainty measure. Building on this, we introduce a method that estimates uncertainty by analyzing the consistency of the top-1 retrieved item across samples drawn from the posterior predictive distribution using Monte Carlo Dropout (MCD) or Deep Ensembles. Finally, we propose an adversarial perturbation-based method, where the minimal perturbation required to alter the top-1 retrieval serves as a robust indicator of uncertainty. Empirical results in two standard cross-modal retrieval benchmarks demonstrate that our approach achieves superior calibration compared to learned probabilistic methods, all while incurring zero additional training cost.

ID: 10

Group-Agent Reinforcement Learning with Heterogeneous Agents

Kaiyue Wu, Xiao-Jun Zeng, Tingting Mu
[OpenReview]

Abstract

Group-agent reinforcement learning (GARL) is a newly arising learning scenario, where multiple reinforcement learning agents study together in a group, sharing knowledge in an asynchronous fashion. The goal is to improve the learning performance of each individual agent. Under a more general heterogeneous setting where different agents learn using different algorithms, we advance GARL by designing novel and effective group-learning mechanisms. They guide the agents on whether and how to learn from action choices from the others, and allow the agents to adopt available policy and value function models sent by another agent if they perform better. We have conducted extensive experiments on a total of 43 different Atari 2600 games to demonstrate the superior performance of the proposed method. After the group learning, among the 129 agents examined, 96% are able to achieve a learning speed-up, and 72% are able to learn ≥100 times faster. Also, around 41% of those agents have achieved a higher accumulated reward score by learning in ≤5% of the time steps required by a single agent when learning on its own.

ID: 11

MutualNeRF: Improve the Performance of NeRF under Limited Samples with Mutual Information Theory

Zifan Wang, Jingwei Li, Yitang Li, Yunze Liu
[OpenReview]

Abstract

This paper introduces MutualNeRF, a framework enhancing Neural Radiance Field (NeRF) performance under limited samples using Mutual Information Theory. While NeRF excels in 3D scene synthesis, challenges arise with limited data and existing methods that aim to introduce prior knowledge lack theoretical support in a unified framework. We introduce a simple but theoretically robust concept, Mutual Information, as a metric to uniformly measure the correlation between images, considering both macro (semantic) and micro (pixel) levels. For sparse view sampling, we strategically select additional viewpoints containing more non-overlapping scene information by minimizing mutual information without knowing ground truth images beforehand. Our framework employs a greedy algorithm, offering a near-optimal solution. For few-shot view synthesis, we maximize the mutual information between inferred images and ground truth, expecting inferred images to gain more relevant information from known images. This is achieved by incorporating efficient, plug-and-play regularization terms. Experiments under limited samples show consistent improvement over state-of-the-art baselines in different settings, affirming the efficacy of our framework.

ID: 13

Best Possible Q-Learning

Jiechuan Jiang, Zongqing Lu
[OpenReview]

Abstract

Fully decentralized learning, where the global information, *i.e.*, the actions of other agents, is inaccessible, is a fundamental challenge in cooperative multi-agent reinforcement learning. However, the convergence and optimality of most decentralized algorithms are not theoretically guaranteed, since the transition probabilities are non-stationary as all agents are updating policies simultaneously. To tackle this challenge, we propose \textit{best possible operator}, a novel decentralized operator, and prove that the policies of cooperative agents will converge to the optimal joint policy if each agent independently updates its individual state-action value by the operator when there is only one optimal joint policy. Further, to make the update more efficient and practical, we simplify the operator and prove that the convergence and optimality still hold with the simplified one. By instantiating the simplified operator, the derived fully decentralized algorithm, *best possible Q-learning* (BQL), does not suffer from non-stationarity. Empirically, we show that BQL achieves remarkable improvement over baselines in a variety of cooperative multi-agent tasks.

ID: 15

Creative Agents: Empowering Agents with Imagination for Creative Tasks

Penglin Cai, Chi Zhang, Yuhui Fu, Haoqi Yuan, Zongqing Lu
[OpenReview]

Abstract

We study building embodied agents for open-ended creative tasks. While existing methods build instruction-following agents that can perform diverse open-ended tasks, none of them demonstrates creativity -- the ability to give novel and diverse solutions implicit in the language instructions. This limitation comes from their inability to convert abstract language instructions into concrete goals and perform long-horizon planning for such complicated goals. Given the observation that humans perform creative tasks with imagination, we propose a class of solutions, where the controller is enhanced with an imaginator generating detailed imaginations of task outcomes conditioned on language instructions. We introduce several approaches to implementing the components of creative agents. We implement the imaginator with either a large language model for textual imagination or a diffusion model for visual imagination. The controller can either be a behavior-cloning policy or a pre-trained foundation model generating executable codes in the environment. We benchmark creative tasks with the challenging open-world game Minecraft, where the agents create diverse buildings given free-form language instructions. We propose novel evaluation metrics for open-ended creative tasks utilizing GPT-4V, which holds many advantages over existing metrics. We perform a detailed experimental analysis of creative agents, showing that creative agents are the first AI agents accomplishing diverse building creation in the survival mode of Minecraft. Our benchmark and models are open-source for future research on creative agents (https://github.com/PKU-RL/Creative-Agents).

ID: 23

When Extragradient Meets PAGE: Bridging Two Giants to Boost Variational Inequalities

Gleb Molodtsov, Valery Parfenov, Egor Petrov, Evseev Grigoriy, Daniil Medyakov, Aleksandr Beznosikov
[OpenReview]

Abstract

Variational inequalities (VIs) have emerged as a universal framework for solving a wide range of problems. A broad spectrum of applications include optimization, equilibrium analysis, reinforcement learning, and the rapidly evolving field of generative adversarial networks (GANs). Stochastic methods have proven to be powerful tools for addressing such problems, but they often suffer from irreducible variance, necessitating the development of variance reduction techniques. Among these, SARAH-based algorithms have demonstrated remarkable practical effectiveness. In this work, we propose a new stochastic variance reduced algorithm for solving stochastic variational inequalities. We push the boundaries of existing methodologies by leveraging PAGE method to solve VIs. Unlike prior studies, which lacked theoretical guarantees under general assumptions, we establish rigorous convergence rates, thus closing a crucial gap in the literature. Our contributions extend both theoretical understanding and practical advancements in solving variational inequalities. To substantiate our claims, we conduct extensive experiments across diverse benchmarks including widely studied denoising task. The results consistently showcase the superior efficiency of our approach, underscoring its potential for real-world applications.

ID: 27

Enhanced Equilibria-Solving via Private Information Pre-Branch Structure in Adversarial Team Games

Chen Qiu, Haobo Fu, Kai Li, Jiajia Zhang, Xuan Wang
[OpenReview]

Abstract

In ex ante coordinated adversarial team games (ATGs), a team competes against an adversary, and team members can only coordinate their strategies before the game starts. The team-maxmin equilibrium with correlation (TMECor) is a suitable solution concept for extensive-form sequential ATGs. One class of TMECor-solving methods transforms the problem into solving NE in two-player zero-sum games, leveraging well-established tools for the latter. However, existing methods are fundamentally action-based, resulting in poor generalizability and low solving efficiency due to the exponential growth in the size of the transformed game. To address the above issues, we propose an efficient game transformation method based on private information, where all team members are represented by a single coordinator. We designed a structure called private information pre-branch, which makes decisions considering all possible private information from teammates. We prove that the size of the game transformed by our method is exponentially reduced compared to the current state-of-the-art. Moreover, we demonstrate equilibria equivalence. Experimentally, our method achieves a significant speedup of 182.89$\times$ to 694.44$\times$ in scenarios where the current state-of-the-art method can work, such as small-scale Kuhn poker and Leduc poker. Furthermore, our method is applicable to larger games and those with dynamically changing private information, such as Goofspiel.

ID: 33

Nearly Optimal Differentially Private ReLU Regression

Meng Ding, Mingxi Lei, Shaowei Wang, Tianhang Zheng, Di Wang, Jinhui Xu
[OpenReview]

Abstract

In this paper, we investigate one of the most fundamental non-convex learning problems—ReLU regression—in the Differential Privacy (DP) model. Previous studies on private ReLU regression heavily rely on stringent assumptions, such as constant-bounded norms for feature vectors and labels. We relax these assumptions to a more standard setting, where data can be i.i.d. sampled from $O(1)$-sub-Gaussian distributions. We first show that when $\varepsilon = \tilde{O}(\sqrt{\frac{1}{N}})$ and there is some public data, it is possible to achieve an upper bound of $\Tilde{O}(\frac{d^2}{N^2 \varepsilon^2})$ for the excess population risk in $(\epsilon, \delta)$-DP, where $d$ is the dimension and $N$ is the number of data samples. Moreover, we relax the requirement of $\epsilon$ and public data by proposing and analyzing a one-pass mini-batch Generalized Linear Model Perceptron algorithm (DP-MBGLMtron). Additionally, using the tracing attack argument technique, we demonstrate that the minimax rate of the estimation error for $(\varepsilon, \delta)$-DP algorithms is lower bounded by $\Omega(\frac{d^2}{N^2 \varepsilon^2})$. This shows that DP-MBGLMtron achieves the optimal utility bound up to logarithmic factors. Experiments further support our theoretical results.

ID: 40

Improving Adversarial Transferability via Decision Boundary Adaptation

Jiayu Zhang, Zhiyu Zhu, Zhibo Jin, Xinyi Wang, Huaming Chen, Kim-Kwang Raymond Choo
[OpenReview]

Abstract

Black-box attacks play a pivotal role in adversarial attacks. However, existing approaches often focus predominantly on attacking from a data-centric perspective, neglecting crucial aspects of the models. To address this issue, we propose a novel approach in this paper, coined Decision Boundary Adaptation (DBA). Our approach innovatively adopts a model-centric viewpoint, leveraging operations on the model to attain properties that enhance transferability. We observe that a flatter curvature of the statistical manifold, influenced by both samples and model parameters, leads to stronger transferability of the adversarial attacks. To leverage this, we introduce the concept of local flatness, providing an evaluation method for local flatness property along with a detailed mathematical proof. Additionally, we demonstrate a consistent relationship between local flatness, the model's decision boundary, and the gradient descent process, showing how flatness can be achieved through gradient descent at the model parameter level. Through extensive evaluation using state-of-the-art adversarial attack techniques, our DBA approach significantly enhances the black-box attack capabilities of all the tested adversarial attack methods. The implementation of our method is available at https://github.com/LMBTough/DBA.

ID: 47

NRFlow: Towards Noise-Robust Generative Modeling via High-Order Mechanism

Bo Chen, Chengyue Gong, Xiaoyu Li, Yingyu Liang, Zhizhou Sha, Zhenmei Shi, Zhao Song, Mingda Wan, Xugang Ye
[OpenReview]

Abstract

Flow-based generative models have shown promise in various machine learning applications, but they often face challenges in handling noise and ensuring robustness in trajectory estimation. In this work, we propose NRFlow, a novel extension to flow-based generative modeling that incorporates second-order dynamics through acceleration fields. We develop a comprehensive theoretical framework to analyze the regularization effects of high-order terms and derive noise robustness guarantees. Our method leverages a two-part loss function to simultaneously train first-order velocity fields and high-order acceleration fields, enhancing both smoothness and stability in learned transport trajectories. These results highlight the potential of high-order flow matching for robust generative modeling in complex and noisy environments.

ID: 53

Trading Off Voting Axioms for Privacy

Zhechen Li, Ao Liu, Lirong Xia, Yongzhi Cao, Hanpin Wang
[OpenReview]

Abstract

In this paper, we investigate tradeoffs among differential privacy (DP) and several important voting axioms: Pareto efficiency, SD-efficiency, PC-efficiency, Condorcet criterion, and Condorcet loser criterion. We provide upper and lower bounds on the two-way tradeoffs between DP and each axiom. We also provide upper and lower bounds on three-way tradeoffs among DP and every pairwise combination of all the axioms, showing that, while the axioms are compatible without DP, their upper bounds cannot be achieved simultaneously under DP. Our results illustrate the effect of DP on the satisfaction and compatibility of voting axioms.

ID: 61

Distributional Reinforcement Learning with Dual Expectile-Quantile Regression

Sami Jullien, Romain Deffayet, Jean-Michel Renders, Paul Groth, Maarten de Rijke
[OpenReview]

Abstract

Distributional reinforcement learning (RL) has proven useful in multiple benchmarks as it enables approximating the full distribution of returns and extracts a rich feedback from environment samples. The commonly used quantile regression approach to distributional RL -- based on asymmetric $L_1$ losses -- provides a flexible and effective way of learning arbitrary return distributions. In practice, it is often improved by using a more efficient, asymmetric hybrid $L_1$-$L_2$ Huber loss for quantile regression. However, by doing so, distributional estimation guarantees vanish, and we empirically observe that the estimated distribution rapidly collapses to its mean. Indeed, asymmetric $L_2$ losses, corresponding to expectile regression, cannot be readily used for distributional temporal difference learning. Motivated by the efficiency of $L_2$-based learning, we propose to jointly learn expectiles and quantiles of the return distribution in a way that allows efficient learning while keeping an estimate of the full distribution of returns. We prove that our proposed operator converges to the distributional Bellman operator in the limit of infinite estimated quantile and expectile fractions, and we benchmark a practical implementation on a toy example and at scale. On the Atari benchmark, our approach matches the performance of the Huber-based IQN-1 baseline after $200$M training frames but avoids distributional collapse and keeps estimates of the full distribution of returns.

ID: 63

Generative Uncertainty in Diffusion Models

Metod Jazbec, Eliot Wong-Toi, Guoxuan Xia, Dan Zhang, Eric Nalisnick, Stephan Mandt
[OpenReview]

Abstract

Diffusion models have recently driven significant breakthroughs in generative modeling. While state-of-the-art models produce high-quality samples on average, individual samples can still be low quality. Detecting such samples without human inspection remains a challenging task. To address this, we propose a Bayesian framework for estimating generative uncertainty of synthetic samples. We outline how to make Bayesian inference practical for large, modern generative models and introduce a new semantic likelihood (evaluated in the latent space of a feature extractor) to address the challenges posed by high-dimensional sample spaces. Through our experiments, we demonstrate that the proposed generative uncertainty effectively identifies poor-quality samples and significantly outperforms existing uncertainty-based methods. Notably, our Bayesian framework can be applied post-hoc to any pretrained diffusion or flow matching model (via the Laplace approximation), and we propose simple yet effective techniques to minimize its computational overhead during sampling.

ID: 73

FlightPatchNet: Multi-Scale Patch Network with Differential Coding for Short-Term Flight Trajectory Prediction

Lan Wu, Xuebin Wang, Ruijuan Chu, Guangyi Liu, Jing Zhang, Linyu Wang
[OpenReview]

Abstract

Accurate multi-step flight trajectory prediction plays an important role in Air Traffic Control, which can ensure the safety of air transportation. Two main issues limit the flight trajectory prediction performance of existing works. The first issue is the negative impact on prediction accuracy caused by the significant differences in data range. The second issue is that real-world flight trajectories involve underlying temporal dependencies, and most existing methods fail to reveal the hidden complex temporal variations and extract features from one single time scale. To address the above issues, we propose FlightPatchNet, a multi-scale patch network with differential coding for flight trajectory prediction. Specifically, FlightPatchNet first utilizes differential coding to encode the original values of longitude and latitude into first-order differences and generates embeddings for all variables at each time step. Then, global temporal attention is introduced to explore the dependencies between different time steps. To fully explore the diverse temporal patterns in flight trajectories, a multi-scale patch network is delicately designed to serve as the backbone. The multi-scale patch network exploits stacked patch mixer blocks to capture inter- and intra-patch dependencies under different time scales, and further integrates multi-scale temporal features across different scales and variables. Finally, FlightPatchNet ensembles multiple predictors to make direct multi-step prediction. Extensive experiments on ADS-B datasets demonstrate that our model outperforms the competitive baselines.

ID: 76

Statistical Significance of Feature Importance Rankings

Jeremy Goldwasser, Giles Hooker
[OpenReview]

Abstract

Feature importance scores are ubiquitous tools for understanding the predictions of machine learning models. However, many popular attribution methods suffer from high instability due to random sampling. Leveraging novel ideas from hypothesis testing, we devise techniques that ensure the most important features are correct with high-probability guarantees. These are capable of assessing both the set of $K$ top-ranked features as well as the order of its elements. Given local or global importance scores, we demonstrate how to retrospectively verify the stability of the highest ranks. We then introduce two efficient sampling algorithms that identify the $K$ most important features, perhaps in order, with probability at least $1-\alpha$. The theoretical justification for these procedures is validated empirically on SHAP and LIME.

ID: 78

Flat Posterior Does Matter For Bayesian Model Averaging

Sungjun Lim, Jeyoon Yeom, Sooyon Kim, Hoyoon Byun, Jinho Kang, Yohan Jung, Jiyoung Jung, Kyungwoo Song
[OpenReview]

Abstract

Bayesian neural networks (BNNs) estimate the posterior distribution of model parameters and utilize posterior samples for Bayesian Model Averaging (BMA) in prediction. However, despite the crucial role of flatness in the loss landscape in improving the generalization of neural networks, its impact on BMA has been largely overlooked. In this work, we explore how posterior flatness influences BMA generalization and empirically demonstrate that \emph{(1) most approximate Bayesian inference methods fail to yield a flat posterior} and \emph{(2) BMA predictions, without considering posterior flatness, are less effective at improving generalization}. To address this, we propose Flat Posterior-aware Bayesian Model Averaging (FP-BMA), a novel training objective that explicitly encourages flat posteriors in a principled Bayesian manner. We also introduce a Flat Posterior-aware Bayesian Transfer Learning scheme that enhances generalization in downstream tasks. Empirically, we show that FP-BMA successfully captures flat posteriors, improving generalization performance.

ID: 99

Near-Optimal Regret Bounds for Federated Multi-armed Bandits with Fully Distributed Communication

Haoran Zhang, Xuchuang Wang, Hao-Xu Chen, Hao Qiu, Lin Yang, Yang Gao
[OpenReview]

Abstract

In this paper, we focus on the research of federated multi-armed bandit (FMAB) problems where agents can only communicate with their neighbors. All agents aim to solve a common multi-armed bandit~(MAB) problem to minimize individual regrets, while group regret can also be minimized. In a federated bandit problem, an agent fails to estimate the global reward means of arms by only using local observations, and hence, the bandit learning algorithm usually adopts a consensus estimation strategy to address the heterogeneity. However, up to now, the existing algorithms with fully distributed communication graphs only achieved a suboptimal result for the problem. To address that, a fully distributed online consensus estimation algorithm (\texttt{CES}) is proposed to estimate the global mean without bias. Integrating this consensus estimator into a distributed successive elimination bandit algorithm framework yields our federated bandit algorithm. Our algorithm significantly improves both individual and group regrets over previous approaches, and we provide an in-depth analysis of the lower bound for this problem.

ID: 105

Revisiting the Equivalence of Bayesian Neural Networks and Gaussian Processes: On the Importance of Learning Activations

Marcin Sendera, Amin Sorkhei, Tomasz Kuśmierczyk
[OpenReview]

Abstract

Gaussian Processes (GPs) provide a convenient framework for specifying function-space priors, making them a natural choice for modeling uncertainty. In contrast, Bayesian Neural Networks (BNNs) offer greater scalability and extendability but lack the advantageous properties of GPs. This motivates the development of BNNs capable of replicating GP-like behavior. However, existing solutions are either limited to specific GP kernels or rely on heuristics. We demonstrate that trainable activations are crucial for effective mapping of GP priors to wide BNNs. Specifically, we leverage the closed-form 2-Wasserstein distance for efficient gradient-based optimization of reparameterized priors and activations. Beyond learned activations, we also introduce trainable periodic activations that ensure global stationarity by design, and functional priors conditioned on GP hyperparameters to allow efficient model selection. Empirically, our method consistently outperforms existing approaches or matches performance of the heuristic methods, while offering stronger theoretical foundations.

ID: 109

Contaminated Multivariate Time-Series Anomaly Detection with Spatio-Temporal Graph Conditional Diffusion Models

Thi Kieu Khanh Ho, Narges Armanfard
[OpenReview]

Abstract

Mainstream unsupervised anomaly detection algorithms often excel in academic datasets, yet their real-world performance is restricted due to the controlled experimental conditions involving clean training data. Addressing the challenge of training with noise, a prevalent issue in practical anomaly detection, is frequently overlooked. In a pioneering endeavor, this study delves into the realm of label-level noise within sensory time-series anomaly detection (TSAD). This paper presents a novel and practical TSAD when the training data is contaminated with anomalies. The introduced approach, called TSAD-C, is devoid of access to abnormality labels during the training phase. TSAD-C encompasses three modules: a Decontaminator to rectify anomalies present during training and swiftly prepare the decontaminated data for subsequent modules; a Long-range Variable Dependency Modeling module to capture long-range intra- and inter-variable dependencies within the decontaminated data that is considered as a surrogate of the pure normal data; and an Anomaly Scoring module that leverages insights of the first two modules to detect all types of anomalies. Our extensive experiments conducted on four reliable, diverse, and challenging datasets conclusively demonstrate that TSAD-C surpasses existing methods, thus establishing a new state-of-the-art in the TSAD field.

ID: 112

Informative Synthetic Data Generation for Thorax Disease Classification

Yancheng Wang, Rajeev Goel, Marko Jojic, Alvin C Silva, Teresa Wu, Yingzhen Yang
[OpenReview]

Abstract

Deep Neural Networks (DNNs), including architectures such as Vision Transformers (ViTs), have achieved remarkable success in medical imaging tasks. However, their performance typically hinges on the availability of large-scale, high-quality labeled datasets—resources that are often scarce or infeasible to obtain in medical domains. Generative Data Augmentation (GDA) offers a promising remedy by supplementing training sets with synthetic data generated via generative models like Diffusion Models (DMs). Yet, this approach introduces a critical challenge: synthetic data often contains significant noise, which can degrade the performance of classifiers trained on such augmented datasets. Prior solutions, including data selection and re-weighting techniques, often rely on access to clean metadata or pretrained external classifiers. In this work, we propose \emph{Informative Data Selection} (IDS), a principled sample re-weighting framework grounded in the Information Bottleneck (IB) principle. IDS assigns higher weights to more informative synthetic samples, thereby improving classifier performance in GDA-enhanced training for thorax disease classification. Extensive experiments demonstrate that IDS significantly outperforms existing data selection and re-weighting baselines. Our code is publicly available at \url{https://github.com/Statistical-Deep-Learning/IDS}.

ID: 113

Temperature Optimization for Bayesian Deep Learning

Kenyon Ng, Chris van der Heide, Liam Hodgkinson, Susan Wei
[OpenReview]

Abstract

The Cold Posterior Effect (CPE) is a phenomenon in Bayesian Deep Learning (BDL), where tempering the posterior to a cold temperature often improves the predictive performance of the posterior predictive distribution (PPD). Although the term `CPE' suggests colder temperatures are inherently better, the BDL community increasingly recognizes that this is not always the case. Despite this, there remains no systematic method for finding the optimal temperature beyond grid search. In this work, we propose a data-driven approach to select the temperature that maximizes test log-predictive density, treating the temperature as a model parameter and estimating it directly from the data. We empirically demonstrate that our method performs comparably to grid search, at a fraction of the cost, across both regression and classification tasks. Finally, we highlight the differing perspectives on CPE between the BDL and Generalized Bayes communities: while the former primarily emphasizes the predictive performance of the PPD, the latter prioritizes the utility of the posterior under model misspecification; these distinct objectives lead to different temperature preferences.

ID: 115

Critical Influence of Overparameterization on Sharpness-aware Minimization

Sungbin Shin, Dongyeop Lee, Maksym Andriushchenko, Namhoon Lee
[OpenReview]

Abstract

Training overparameterized neural networks often yields solutions with varying generalization capabilities, even when achieving similar training losses. Recent evidence indicates a strong correlation between the sharpness of a minimum and its generalization error, leading to increased interest in optimization methods that explicitly seek flatter minima for improved generalization. Despite its contemporary relevance to overparameterization, however, this sharpness-aware minimization (SAM) strategy has not been studied much yet as to exactly how it is affected by overparameterization. In this work, we analyze SAM under varying degrees of overparameterization, presenting both empirical and theoretical findings that reveal its critical influence on SAM's effectiveness. First, we conduct extensive numerical experiments across diverse domains and show that SAM consistently improves with overparameterization. Next, we attribute this phenomenon to the interplay between the enlarged solution space and increased implicit bias resulting from overparameterization. Furthermore, we show that this effect is particularly pronounced in practical settings involving label noise and sparsity, and yet, sufficient regularization is necessary. Last but not least, we provide other theoretical insights into how overparameterization helps SAM achieve minima with more uniform Hessian moments compared to SGD, and much faster convergence at a linear rate.

ID: 123

Valid Bootstraps for Network Embeddings with Applications to Network Visualisation

Emerald Dilworth, Ed Davis, Daniel John Lawson
[OpenReview]

Abstract

Quantifying uncertainty in networks is an important step in modelling relationships and interactions between entities. We consider the challenge of bootstrapping an inhomogeneous random graph when only a single observation of the network is made and the underlying data generating function is unknown. We address this problem by considering embeddings of the observed and bootstrapped network that are statistically indistinguishable. We utilise an exchangeable network test that can empirically validate bootstrap samples generated by any method. Existing methods fail this test, so we propose a principled, distribution-free network bootstrap using k-nearest neighbour smoothing, that can pass this exchangeable network test in many synthetic and real-data scenarios. We demonstrate the utility of this work in combination with the popular data visualisation method t-SNE, where uncertainty estimates from bootstrapping are used to explain whether visible structures represent real statistically sound structures.

ID: 127

How Likely Are Two Voting Rules Different?

Ziqi Yu, Lirong Xia, Qishen Han, Chengkai Zhang
[OpenReview]

Abstract

We characterize the maximum likelihood that two voting rule outcomes are different and that the winner of one voting rule is the loser of another (implying that they are {\em drastically different}) on positional scoring rules, Condorcet winner/loser, Copeland, Ranked Pairs, and STV (Single Transferable Vote) under any fixed number of alternatives. The most famous problem in this scope is strong Borda's paradox, in which the winner of the plurality rule is the Condorcet loser. Under mild assumptions, we show that the maximum likelihood that different rules are drastically different is $\Theta(1)$ except for a few special cases, demonstrating the difference between these rules. We also prove that two scoring rules with linear independent scoring vectors have different winners with probability $\Theta(1)$, no matter how similar they are. Our analysis adopts the {\em smoothed social choice framework}~\cite{xia2020smoothed} and can be applied to a variety of statistical models, including the standard impartial culture (IC).

ID: 128

Multiple Wasserstein Gradient Descent Algorithm for Multi-Objective Distributional Optimization

Dai Hai Nguyen, Hiroshi Mamitsuka, Atsuyoshi Nakamura
[OpenReview]

Abstract

We address the optimization problem of simultaneously minimizing multiple objective functionals over a family of probability distributions. This type of Multi-Objective Distributional Optimization commonly arises in machine learning and statistics, with applications in areas such as multiple target sampling, multi-task learning, and multi-objective generative modeling. To solve this problem, we propose an iterative particle-based algorithm, which we call Muliple Wasserstein Gradient Descent (MWGraD), which constructs a flow of intermediate empirical distributions, each being represented by a set of particles, which gradually minimize the multiple objective functionals simultaneously. Specifically, MWGraD consists of two key steps at each iteration. First, it estimates the Wasserstein gradient for each objective functional based on the current particles. Then, it aggregates these gradients into a single Wasserstein gradient using dynamically adjusted weights and updates the particles accordingly. In addition, we provide theoretical analysis and present experimental results on both synthetic and real-world datasets, demonstrating the effectiveness of MWGraD.

ID: 129

Beyond Invisibility: Learning Robust Visible Watermarks for Stronger Copyright Protection

Tianci Liu, Tong Yang, Quan Zhang, Qi Lei
[OpenReview]

Abstract

As AI advances, copyrighted content faces growing risk of unauthorized use, whether through model training or direct misuse. Building upon invisible adversarial perturbation, recent works developed copyright protections against specific AI techniques such as unauthorized personalization through DreamBooth that are misused. However, these methods offer only short-term security, as they require retraining whenever the underlying model architectures change. To establish long-term protection aiming at better robustness, we go beyond invisible perturbation, and propose a universal approach that embeds \textit{visible} watermarks that are \textit{hard-to-remove} into images. Grounded in a new probabilistic and inverse problem-based formulation, our framework maximizes the discrepancy between the \textit{optimal} reconstruction and the original content. We develop an effective and efficient approximation algorithm to circumvent a intractable bi-level optimization. Experimental results demonstrate superiority of our approach across diverse scenarios.

ID: 131

Adaptive Threshold Sampling for Pure Exploration in Submodular Bandits

Wenjing Chen, Shuo Xing, Victoria G. Crawford
[OpenReview]

Abstract

We address the problem of submodular maximization under bandit feedback, where the objective function $f:2^U\to\mathbb{R}_{\geq 0}$ can only be accessed through noisy, i.i.d. sub-Gaussian queries. This problem arises in many applications including influence maximization, diverse recommendation systems, and large-scale facility location optimization. In this paper, we focus on the pure-exploration setting, where the goal is to identify a high-quality solution set using as few noisy queries as possible. We propose an efficient adaptive sampling strategy, called Confident Sample (CS) that can serve as a versatile subroutine to propose approximation algorithms for many submodular maximization problems. Our algorithms achieve approximation guarantees arbitrarily close to the standard value oracle setting and are highly sample-efficient. We propose and analyze algorithms for monotone submodular maximization with cardinality and matroid constraints, as well as unconstrained non-monotone submodular maximization. Our theoretical analysis is complemented by empirical evaluation on real instances, demonstrating the superior sample efficiency of our proposed algorithm relative to alternative approaches.

ID: 141

Lower Bound on Howard Policy Iteration for Deterministic Markov Decision Processes

Ali Asadi, Krishnendu Chatterjee, Jakob de Raaij
[OpenReview]

Abstract

Deterministic Markov Decision Processes (DMDPs) are a mathematical framework for decision-making where the outcomes and future possible actions are deterministically determined by the current action taken. DMDPs can be viewed as a finite directed weighted graph, where in each step, the controller chooses an outgoing edge. An objective is a measurable function on runs (or infinite trajectories) of the DMDP, and the value for an objective is the maximal cumulative reward (or weight) that the controller can guarantee. We consider the classical mean-payoff (aka limit-average) objective, which is a basic and fundamental objective. Howard's policy iteration algorithm is a popular method for solving DMDPs with mean-payoff objectives. Although Howard's algorithm performs well in practice, as experimental studies suggested, the best known upper bound is exponential and the current known lower bound is as follows: For the input size $I$, the algorithm requires $\widetilde{\Omega}(\sqrt{I})$ iterations, where $\widetilde{\Omega}$ hides the poly-logarithmic factors, i.e., the current lower bound on iterations is sub-linear with respect to the input size. Our main result is an improved lower bound for this fundamental algorithm where we show that for the input size $I$, the algorithm requires $\widetilde{\Omega}(I)$ iterations.

ID: 143

Correlated Quantization for Faster Nonconvex Distributed Optimization

Andrei Panferov, Yury Demidovich, Ahmad Rammal, Peter Richtárik
[OpenReview]

Abstract

Quantization [Alistarh et al., 2017] is an important (stochastic) compression technique that reduces the volume of transmitted bits during each communication round in distributed model training. Suresh et al. [2022] introduce correlated quantizers and show their advantages over independent counterparts by analyzing distributed SGD communication complexity. We analyze the fore- front distributed non-convex optimization algorithm MARINA [Gorbunov et al., 2022] utilizing the proposed correlated quantizers and show that it outperforms the original MARINA and distributed SGD of Suresh et al. [2022] with regard to the communication complexity. We significantly re- fine the original analysis of MARINA without any additional assumptions using the weighted Hessian variance [Tyurin et al., 2022], and then we expand the theoretical framework of MARINA to accommodate a substantially broader range of potentially correlated and biased compressors, thus dilating the applicability of the method beyond the conventional independent unbiased compressor setup. Extensive experimental results corroborate our theoretical findings.

ID: 144

Learning to Sample in Stochastic Optimization

Sijia Zhou, Yunwen Lei, Ata Kaban
[OpenReview]

Abstract

We consider a PAC-Bayes analysis of stochastic optimization algorithms, and devise a new SGDA algorithm inspired from our bounds. Our algorithm learns a data-dependent sampling scheme along with model parameters, which may be seen as assigning a probability to each training point. We demonstrate that learning the sampling scheme increases robustness against misleading training points, as our algorithm learns to avoid bad examples during training. We conduct experiments in both standard and adversarial learning problems on several benchmark datasets, and demonstrate various applications including interpretability upon visual inspection, and robustness to the ill effects of bad training points. We also extend our analysis to pairwise SGD to demonstrate the generalizability of our methodology.

ID: 146

Moments of Causal Effects

Yuta Kawakami, Jin Tian
[OpenReview]

Abstract

The moments of random variables are fundamental statistical measures for characterizing the shape of a probability distribution, encompassing metrics such as mean, variance, skewness, and kurtosis. Additionally, the product moments, including covariance and correlation, reveal the relationships between multiple random variables. On the other hand, the primary focus of causal inference is the evaluation of causal effects, which are defined as the difference between two potential outcomes. While traditional causal effect assessment focuses on the average causal effect, this work provides definitions, identification theorems, and bounds for moments and product moments of causal effects to analyze their distribution and relationships. We conduct experiments to illustrate the estimation of the moments of causal effects from finite samples and demonstrate their practical application using a real-world medical dataset.

ID: 147

Optimal Zero-shot Regret Minimization for Selective Classification with Out-of-Distribution Detection

Eduardo Dadalto Câmara Gomes, Marco Romanelli
[OpenReview]

Abstract

Selective Classification with Out-of-Distribution Detection (SCOD) is a general framework that combines the detection of incorrectly classified in-distribution samples and out-of-distribution samples. Previous solutions for SCOD heavily rely on the choice of Selective Classification (SC) and Out-of-Distribution (OOD) detectors selected at test time. Notably, the performance of these detectors varies across different underlying data distributions. Hence, a poor choice can affect the efficacy of the SCOD framework. On the other hand, making an informed choice is impossible without samples from both in- and out-distribution. We propose an optimal zero-shot black-box method for SCOD that aggregates off-the-shelf detectors, is based on the principle of regret minimization, and therefore provides guarantees on the worst-case performance. We demonstrate that our method achieves performance comparable to state-of-the-art methods in several benchmarks while also shielding the user from the burden of blindly selecting the SC and OOD detectors, optimally reducing the worst-case rejection risk.

ID: 155

Finding Interior Optimum of Black-box Constrained Objective with Bayesian Optimization

Fengxue Zhang, Yuxin Chen
[OpenReview]

Abstract

Optimizing objectives under constraints, where both the objectives and constraints are black box functions, is a common challenge in real-world applications such as medical therapy design, industrial process optimization, and hyperparameter optimization. Bayesian Optimization (BO) is a popular approach for tackling these complex scenarios. However, constrained Bayesian Optimization (CBO) often relies on heuristics, approximations, or relaxation of objectives, which can lead to weaker theoretical guarantees compared to canonical BO. In this paper, we address this gap by focusing on identifying the interior optimum of the constrained objective, deliberately excluding boundary candidates susceptible to noise perturbations. Our approach leverages the insight that jointly optimizing the objective and learning the constraints can help pinpoint high-confidence **regions of interest** (ROI) likely to contain the interior optimum. We introduce an efficient CBO framework, which intersects these ROIs within a discretized search space to determine a general ROI. Within this ROI, we optimize the acquisition functions, balancing constraints learning and objective optimization. We showcase the efficiency and robustness of our proposed framework by deriving high-probability regret bounds and validating its performance through extensive empirical evaluations.

ID: 162

A Mirror Descent Perspective of Smoothed Sign Descent

Shuyang Wang, Diego Klabjan
[OpenReview]

Abstract

The optimization dynamics of gradient descent for overparameterized problems can be viewed as low-dimensional dual dynamics induced by a mirror map, providing a mirror descent perspective on the implicit regularization phenomenon. However, the dynamics of adaptive gradient descent methods that are widely used in practice remain less understood. Meanwhile, empirical evidence of performance gaps suggests fundamental differences in their underlying dynamics. In this work, we introduce the dual dynamics of smoothed sign descent with stability constant $\varepsilon$ for regression problems, formulated using the mirror descent framework. Unlike prior methods, our approach applies to algorithms where update directions deviate from true gradients such as ADAM. We propose a mirror map that reveals the equivalent dual dynamics under some assumptions. By studying dual dynamics, we characterize the convergent solution as approximately minimizing a Bregman divergence style function closely related to the $l_{3/2}$ norm. Furthermore, we demonstrate the role of the stability constant $\varepsilon$ in shaping the convergent solution. Our analyses offer new insights into the distinct properties of the smoothed sign descent algorithm, and show the potential of applying the mirror descent framework to study complex dynamics beyond gradient descent.

ID: 166

Optimal Submanifold Structure in Log-linear Models

Zhou Derun, Mahito Sugiyama
[OpenReview]

Abstract

In the modeling of discrete distributions using log-linear models, the model selection process is equivalent to imposing zero-value constraints on a subset of natural parameters, which is an established concept in information geometry. This zero-value constraint has been implicitly employed, from classic Boltzmann machines to recent many-body approximations of tensors. However, in theory, any constant value other than zero can be used for these constraints, leading to different submanifolds onto which the empirical distribution is projected, a possibility that has not been explored. Here, we investigate the asymptotic behavior of these constraint values from the perspective of information geometry. Specifically, we prove that the optimal value converges to zero as the size of the support of the empirical distribution increases, which corresponds to the size of the input tensors in the context of tensor decomposition. While our primary focus is on many-body approximation of tensors, it is straightforward to extend this analysis to a wide range of log-linear modeling applications.

ID: 175

DyGMAE: A Novel Dynamic Graph Masked Autoencoder for Link Prediction

Weixiong Liu, Junwei Cheng, Zhongyu Pan, Chaobo He, Quanlong Guan
[OpenReview]

Abstract

Dynamic link prediction (DLP) is a crucial task in graph learning, aiming to predict future links between nodes at subsequent time in dynamic graphs. Recently, graph masked autoencoders (GMAEs) have shown promising performance in self-supervised learning. However, their application to DLP is under-explored. Existing GMAEs struggle to capture temporal dependencies, and their random masking causes crucial information loss for DLP. Moreover, most existing DLP methods rely on local information, ignoring global information and failing to capture complex features in real-world dynamic graphs. To address these issues, we propose DyGMAE, a novel dynamic GMAE method specifically designed for DLP. DyGMAE introduces a Multi-Scale Masking Strategy (MSMS), which generates multiple graph views by masking parts of the edges and tries to reconstruct them. Additionally, a multi-scale masking representation alignment module with a contrastive learning objective is employed to align representations which are encoded by unmasked edges across these views. Through this design, different masked views can provide diverse information to alleviate the drawbacks of random masking, and contrastive learning can align different views to mitigate the problem of exploiting local and global information simultaneously. Experiments on benchmark datasets show DyGMAE achieves superior performance in the DLP task.

ID: 176

Truthful Elicitation of Imprecise Forecasts

Anurag Singh, Siu Lun Chau, Krikamol Muandet
[OpenReview]

Abstract

The quality of probabilistic forecasts is crucial for decision-making under uncertainty. While proper scoring rules incentivize truthful reporting of precise forecasts, they fall short when forecasters face epistemic uncertainty about their beliefs, limiting their use in safety-critical domains where decision-makers~(DMs) prioritize proper uncertainty management. To address this, we propose a framework for scoring \emph{imprecise forecasts}---forecasts given as a set of beliefs. Despite existing impossibility results for deterministic scoring rules, we enable truthful elicitation by drawing connection to social choice theory and introducing a two-way communication framework where DMs first share their aggregation rules (e.g., averaging or min-max) used in downstream decisions for resolving forecast ambiguity. This, in turn, helps forecasters resolve indecision during elicitation. We further show that truthful elicitation of imprecise forecasts is achievable using proper scoring rules randomized over the aggregation procedure. Our approach allows DM to elicit and integrate the forecaster's epistemic uncertainty into their decision-making process, improving the credibility of downstream decisions.

ID: 180

Probabilistic Semantics Guided Discovery of Approximate Functional Dependencies

Liang Duan, Xinran Wu, Xinhui Li, Lixing Yu, Kun Yue
[OpenReview]

Abstract

As the general description of relationships between attributes, approximate functional dependencies (AFDs) almost hold for a given dataset with a few violations. Most of existing methods for AFD discover are insufficient to balance the efficiency and accuracy due to the massive search space and permission of violations. To address these issues, we propose an efficient method of probabilistic semantics guided discovery of AFDs based on Bayesian network (BN). Firstly, we learn a BN structure and conduct conditional independence tests on the learned structure rather than the entire search space, such that candidate AFDs could be obtained. Secondly, we fulfill search space reduction and structure pruning by making use of probabilistic semantics of graphical models in terms of BN. Consequently, we provide a branch-and-bound algorithm to discover the AFDs with the highest smoothed mutual information scores. Experimental results illustrate that our proposed method is more effective and efficient than the comparison methods. Our code is available at [https://github.com/DKE-Code/BNAFD](https://github.com/DKE-Code/BNAFD).

ID: 181

Multi-group Uncertainty Quantification for Long-form Text Generation

Terrance Liu, Steven Wu
[OpenReview]

Abstract

While past works have shown how uncertainty quantification can be applied to large language model (LLM) outputs, the question of whether resulting uncertainty guarantees still hold within sub-groupings of data remains open. In our work, given some long-form text generated by an LLM, we study uncertainty at both the level of individual claims contained within the output (via calibration) and across the entire output itself (via conformal prediction). Using biography generation as a testbed for this study, we derive a set of (demographic) attributes (e.g., whether some text describes a man or woman) for each generation to form such "subgroups" of data. We find that although canonical methods for both types of uncertainty quantification perform well when measuring across the entire dataset, such guarantees break down when examining particular subgroups. Having established this issue, we invoke group-conditional methods for uncertainty quantification---multicalibration and multivalid conformal prediction---and find that across a variety of approaches, additional subgroup information consistently improves calibration and conformal prediction within subgroups (while crucially retaining guarantees across the entire dataset). As the problems of calibration, conformal prediction, and their multi-group counterparts have not been extensively explored in the context of long-form text generation, we consider these results to form a benchmark for this setting.

ID: 193

Limit-sure Reachability for Small Memory Policies in POMDPs is NP-complete

Ali Asadi, Krishnendu Chatterjee, Raimundo Saona, Ali Shafiee
[OpenReview]

Abstract

A standard model that arises in several applications in sequential decision-making is partially observable Markov decision processes (POMDPs) where a decision-making agent interacts with an uncertain environment. A basic objective in POMDPs is the reachability objective, where given a target set of states, the goal is to eventually arrive at one of them. The limit-sure problem asks whether reachability can be ensured with probability arbitrarily close to 1. In general, the limit-sure reachability problem for POMDPs is undecidable. However, in many practical cases, the most relevant question is the existence of policies with a small amount of memory. In this work, we study the limit-sure reachability problem for POMDPs with a fixed amount of memory. We establish that the computational complexity of the problem is NP-complete.

ID: 194

Discriminative ordering through ensemble consensus

Louis Ohl, Fredrik Lindsten
[OpenReview]

Abstract

Evaluating the performance of clustering models is a challenging task where the outcome depends on the definition of what constitutes a cluster. Due to this design, current existing metrics rarely handle multiple clustering models with diverse cluster definitions, nor do they comply with the integration of constraints when available. In this work, we take inspiration from consensus clustering and assume that a set of clustering models is able to uncover hidden structures in the data. We propose to construct a discriminative ordering through ensemble clustering based on the distance between the connectivity of a clustering model and the consensus matrix. We first validate the proposed method with synthetic scenarios, highlighting that the proposed score ranks the models that best match the consensus first. We then show that this simple ranking score significantly outperforms other scoring methods when comparing sets of different clustering algorithms that are not restricted to a fixed number of clusters and is compatible with clustering constraints.

ID: 202

Stochastic Embeddings : A Probabilistic and Geometric Analysis of Out-of-Distribution Behavior

Anthony Nguyen, Emanuel Aldea, Sylvie Le Hégarat-Mascle, Renaud LUSTRAT
[OpenReview]

Abstract

Deep neural networks perform well in many applications but often fail when exposed to out-of-distribution (OoD) inputs. We identify a geometric phenomenon in the embedding space: in-distribution (ID) data show higher variance than OoD data under stochastic perturbations. Using high-dimensional geometry and statistics, we explain this behavior and demonstrate its application in improving OoD detection. Unlike traditional post-hoc methods, our approach integrates uncertainty-aware tools, such as Bayesian approximations, directly into the detection process. Then, we show how considering the unit hypersphere enhances the separation of ID and OoD samples. Our mathematically sound method achieves competitive performance while remaining simple.

ID: 205

Symbiotic Local Search for Small Decision Tree Policies in MDPs

Roman Andriushchenko, Milan Ceska, Debraj Chakraborty, Sebastian Junges, Jan Kretinsky, Filip Macák
[OpenReview]

Abstract

We study decision making policies in Markov decision processes (MDPs). Two key performance indicators of such policies are their value and their interpretability. On the one hand, policies that optimize value can be efficiently computed via a plethora of standard methods. However, the representation of these policies may prevent their interpretability. On the other hand, policies with good interpretability, such as policies represented by a small decision tree, are computationally hard to obtain. This paper contributes a local search approach to find policies with good value, represented by small decision trees. Our local search symbiotically combines learning decision trees from value-optimal policies with symbolic approaches that optimize the size of the decision tree within a constrained neighborhood. Our empirical evaluation shows that this combination provides drastically smaller decision trees for MDPs that are significantly larger than what can be handled by optimal decision tree learners.

ID: 213

Causal Eligibility Traces for Confounding Robust Off-Policy Evaluation

Junzhe Zhang, Elias Bareinboim
[OpenReview]

Abstract

A unifying theme in Artificial Intelligence is learning an effective policy to control an agent in an unknown environment in order to optimize a certain performance measure. Off-policy methods can significantly improve sample efficiency during training, since they allow an agent to learn from observed trajectories generated by different behavior policies, without directly deploying target policies in the underlying environment. This paper studies off-policy evaluation from biased offline data where (1) unobserved confounding bias cannot be ruled out a priori; or (2) the observed trajectories do not overlap with intended behaviors of the learner, i.e., the target and behavior policies do not share a common support. Specifically, we extend Bellman's equation to derive effective closed-form bounds over value functions from the observational distribution contaminated with unobserved confounding and no overlap. Second, we propose two novel algorithms that use eligibility traces to estimate these bounds from finite observational data. Compared to other methods for robust off-policy evaluation in sequential environments, these methods are model-free and extend, for the first time, the well-celebrated temporal difference algorithms (Sutton, 1988) to biased offline data with unobserved confounding and no overlap.

ID: 221

Label Distribution Learning using the Squared Neural Family on the Probability Simplex

Daokun Zhang, Russell Tsuchida, Dino Sejdinovic
[OpenReview]

Abstract

Label distribution learning (LDL) provides a framework wherein a distribution over categories rather than a single category is predicted, with the aim of addressing ambiguity in labeled data. Existing research on LDL mainly focuses on the task of point estimation, i.e., finding an optimal distribution in the probability simplex conditioned on the given sample. In this paper, we propose a novel label distribution learning model SNEFY-LDL, which estimates a probability distribution of all possible label distributions over the simplex, by unleashing the expressive power of the recently introduced Squared Neural Family (SNEFY), a new class of tractable probability models. As a way to summarize the fitted model, we derive the closed-form label distribution mean, variance and covariance conditioned on the given sample, which can be used to predict the ground-truth label distributions, construct label distribution confidence intervals, and measure the correlations between different labels. Moreover, more information about the label distribution prediction uncertainties can be acquired from the modeled probability density function. Extensive experiments on conformal prediction, active learning and ensemble learning are conducted, verifying SNEFY-LDL’s great effectiveness in LDL uncertainty quantification. The source code of this paper is available at https://github.com/daokunzhang/SNEFY-LDL.

ID: 226

Variational Learning of Gaussian Process Latent Variable Models through Stochastic Gradient Annealed Importance Sampling

JIAN XU, Shian Du, Junmei Yang, Qianli Ma, Delu Zeng, John Paisley
[OpenReview]

Abstract

Gaussian Process Latent Variable Models (GPLVMs) have become increasingly popular for unsupervised tasks such as dimensionality reduction and missing data recovery due to their flexibility and non-linear nature. An importance-weighted version of the Bayesian GPLVMs has been proposed to obtain a tighter variational bound. However, this version of the approach is primarily limited to analyzing simple data structures, as the generation of an effective proposal distribution can become quite challenging in high-dimensional spaces or with complex data sets. In this work, we propose VAIS-GPLVM, a variational Annealed Importance Sampling method that leverages time-inhomogeneous unadjusted Langevin dynamics to construct the variational posterior. By transforming the posterior into a sequence of intermediate distributions using annealing, we combine the strengths of Sequential Monte Carlo samplers and VI to explore a wider range of posterior distributions and gradually approach the target distribution. We further propose an efficient algorithm by reparameterizing all variables in the evidence lower bound (ELBO). Experimental results on both toy and image datasets demonstrate that our method outperforms state-of-the-art methods in terms of tighter variational bounds, higher log-likelihoods, and more robust convergence.

ID: 235

Online Learning with Stochastically Partitioning Experts

Puranjay Datta, Sharayu Moharir, Jaya Prakash Champati
[OpenReview]

Abstract

We study a variant of the experts problem in which new experts are revealed over time according to a stochastic process. The experts are represented by partitions of a hypercube $\mathbb{B}$ in $d$-dimensional Euclidean space. In each round, a point is drawn from $\mathbb{B}$ in an independent and identically distributed manner using an unknown distribution. For each chosen point, we draw $d$ orthogonal hyperplanes parallel to the $d$ faces of $\mathbb{B}$ passing through the point. The set of experts available in a round is the set of partitions of $\mathbb{B}$ created by all the hyperplanes drawn up to that point. Losses are adversarial, and the performance metrics of interest include expected regret and high probability bounds on the sample-path regret. We propose a suitably adapted version of the Hedge algorithm called Hedge-G, which uses a constant learning rate and has $O(\sqrt{2^d T \log T})$ expected regret, which is order-optimal. Further, we show that for Hedge-G, there exists a trade-off between choosing a learning rate that has optimal expected regret and a learning rate that leads to a high probability sample-path regret bound. We address this limitation by proposing AdaHedge-G, a variant of Hedge-G that uses an adaptive learning rate by tracking the loss of the experts revealed up to that round. AdaHedge-G simultaneously achieves $O(\log(\log T)\sqrt{ T \log T})$ expected regret and $O(\log T \sqrt{T \log T})$ sample-path regret, with probability at least $1-T^{-c}$, where $c > 0$ is a constant dependent on $d$.

ID: 237

Periodical Moving Average Accelerates Gradient Accumulation for Post-Training

Yumou Liu, An Li, Chaojie Li, Fei Yu, Benyou Wang
[OpenReview]

Abstract

High gradient variance presents a significant obstacle to efficient post-training of large language models (LLMs) on memory-constrained devices. Existing practical strategies—such as reducing batch sizes or adopting gradient accumulation (GA)—suffer from an inherent trade-off: smaller batches exacerbate convergence issues due to increased gradient noise, while GA substantially prolongs training time owing to its sequential processing. In this work, we reveal that the Exponential Moving Average (EMA) in momentum-based optimizers exponentially discounts historical gradients, thereby limiting their effectiveness in stabilizing parameter updates, especially during post-training when parameter drift is minimal. Motivated by this, we propose integrating the core idea of GA directly into momentum updates via a novel Periodical Moving Average (PMA) mechanism, which structures training into fixed periods and replaces EMA with a uniform moving average within each period. We instantiate PMA within AdamW and Lion, resulting in the AdamW-PMA and Lion-PMA optimizers. Theoretical analysis establishes that AdamW-PMA matches the convergence guarantees of standard Adam. Extensive empirical evaluation on supervised fine-tuning and direct preference optimization tasks demonstrates that PMA-based methods achieve approximately $2\times$ faster training compared to GA, while yielding consistently better performance on downstream evaluations.

ID: 244

Can a Bayesian Oracle Prevent Harm from an Agent?

Yoshua Bengio, Michael K. Cohen, Nikolay Malkin, Matt MacDermott, Damiano Fornasiere, Younesse Kaddar
[OpenReview]

Abstract

Is there a way to design powerful AI systems based on machine learning methods that would satisfy probabilistic safety guarantees? With the long-term goal of obtaining a probabilistic guarantee that would apply in every context, we consider estimating a context-dependent bound on the probability of violating a given safety specification. Such a risk evaluation would need to be performed at run-time to provide a guardrail against dangerous actions of an AI. Noting that different plausible hypotheses about the world could produce very different outcomes, and because we do not know which one is right, we derive bounds on the safety violation probability predicted under the true but unknown hypothesis. Such bounds could be used to reject potentially dangerous actions. Our main results involve searching for cautious but plausible hypotheses, obtained by a maximization that involves Bayesian posteriors over hypotheses. We consider two forms of this result, in the i.i.d. case and in the non-i.i.d. case, and conclude with open problems towards turning such theoretical results into practical AI guardrails.

ID: 245

The Causal Information Bottleneck and Optimal Causal Variable Abstractions

Francisco N. F. Q. Simoes, Mehdi Dastani, Thijs van Ommen
[OpenReview]

Abstract

To effectively study complex causal systems, it is often useful to construct abstractions of parts of the system by discarding irrelevant details while preserving key features. The Information Bottleneck (IB) method is a widely used approach to construct variable abstractions by compressing random variables while retaining predictive power over a target variable. Traditional methods like IB are purely statistical and ignore underlying causal structures, making them ill-suited for causal tasks. We propose the Causal Information Bottleneck (CIB), a causal extension of the IB, which compresses a set of chosen variables while maintaining causal control over a target variable. This method produces abstractions of (sets of) variables which are causally interpretable, give us insight about the interactions between the abstracted variables and the target variable, and can be used when reasoning about interventions. We present experimental results demonstrating that the learned abstractions accurately capture causal relations as intended.

ID: 246

RL, but don't do anything I wouldn't do

Michael K. Cohen, Marcus Hutter, Yoshua Bengio, Stuart Russell
[OpenReview]

Abstract

In reinforcement learning (RL), if the agent's reward differs from the designers' true utility, even only rarely, the state distribution resulting from the agent's policy can be very bad, in theory and in practice. When RL policies would devolve into undesired behavior, a common countermeasure is KL regularization to a trusted policy ("Don't do anything I wouldn't do"). All current cutting-edge language models are RL agents that are KL-regularized to a "base policy" that is purely predictive. Unfortunately, we demonstrate that when this base policy is a Bayesian predictive model of a trusted policy, the KL constraint is no longer reliable for controlling the behavior of an advanced RL agent. We demonstrate this theoretically using algorithmic information theory, and while systems today are too weak to exhibit this theorized failure precisely, we RL-finetune a language model and find evidence that our formal results are plausibly relevant in practice. We also propose a theoretical alternative that avoids this problem by replacing the "Don't do anything I wouldn't do" principle with "Don't do anything I mightn't do".

ID: 252

Adaptive Human-Robot Collaboration using Type-Based IRL

Prasanth Sengadu Suresh, Prashant Doshi, Bikramjit Banerjee
[OpenReview]

Abstract

Human-robot collaboration (HRC) integrates the consistency and precision of robotic systems with the dexterity and cognitive abilities of humans to create synergy. However, human performance may degrade due to various factors (e.g., fatigue, trust) which can manifest unpredictably, and typically results in diminished output and reduced quality. To address this challenge toward successful HRCs, we present a human-aware approach to collaboration using a novel multi-agent decision-making framework. Type-based decentralized Markov decision processes (TB-DecMDP) additionally model latent, causal decision-making factors influencing agent behavior (e.g., fatigue), leading to dynamic agent types. In this framework, agents can switch between types and each maintains a belief about others' current type based on observed actions while aiming to achieve a shared objective. We introduce a new inverse reinforcement learning (IRL) algorithm, TB-DecAIRL, which uses TB-DecMDP to model complex HRCs. TB-DecAIRL learns a type-contingent reward function and corresponding vector of policies from team demonstrations. Our evaluations in a realistic HRC problem setting establish that modeling human types in TB-DecAIRL improves robot behavior on the default of ignoring human factors, by increasing throughput in a human-robot produce sorting task.

ID: 258

The Consistency Hypothesis in Uncertainty Quantification for Large Language Models

Quan Xiao, Debarun Bhattacharjya, Balaji Ganesan, Radu Marinescu, Katsiaryna Mirylenka, Nhan H Pham, Michael Glass, Junkyu Lee
[OpenReview]

Abstract

Estimating the confidence of large language model (LLM) outputs is essential for real-world applications requiring high user trust. Black-box uncertainty quantification (UQ) methods, relying solely on model API access, have gained popularity due to their practical benefits. In this paper, we examine the implicit assumption behind several UQ methods, which use generation consistency as a proxy for confidence—an idea we formalize as the consistency hypothesis. We introduce three mathematical statements with corresponding statistical tests to capture variations of this hypothesis and metrics to evaluate LLM output conformity across tasks. Our empirical investigation, spanning 8 benchmark datasets and 3 tasks (question answering, text summarization, and text-to-SQL), highlights the prevalence of the hypothesis under different settings. Among the statements, we highlight the `Sim-Any' hypothesis as the most actionable, and demonstrate how it can be leveraged by proposing data-free black-box UQ methods that aggregate similarities between generations for confidence estimation. These approaches can outperform the closest baselines, showcasing the practical value of the empirically observed consistency hypothesis.

ID: 259

Corruption-Robust Variance-aware Algorithms for Generalized Linear Bandits under Heavy-tailed Rewards

Qingyuan Yu, Euijin Baek, Xiang Li, Qiang Sun
[OpenReview]

Abstract

Stochastic linear bandits have recently received significant attention in sequential decision-making. However, real-world challenges such as heavy-tailed noise, reward corruption, and nonlinear reward functions remain difficult to address. To tackle these difficulties, we propose GAdaOFUL, a novel algorithm that leverages adaptive Huber regression to achieve robustness in generalized linear models (GLMs), where rewards can be nonlinear functions of features. GAdaOFUL achieves a state-of-the-art variance-aware regret bound, scaling with the square root of the cumulative reward variance over time, plus an additional term proportional to the level of corruption. The algorithm adapts to problem complexity, yielding improved regret when the cumulative variance is small. Simulation results demonstrate the robustness and effectiveness of GAdaOFUL in practice. The code is available at \url{https://github.com/NeXAIS/GAdaOFUL}.

ID: 261

Black-box Optimization with Unknown Constraints via Overparameterized Deep Neural Networks

Dat Phan Trong, Hung The Tran, Sunil Gupta
[OpenReview]

Abstract

Optimizing expensive black-box functions under unknown constraints is a fundamental challenge across a range of real-world domains, such as hyperparameter tuning in machine learning, safe control in robotics, and material or drug discovery. In these settings, each function evaluation may be costly or time-consuming, and the system may need to operate within unknown or difficult-to-specify safety boundaries. We apply the Expected Improvement (EI) acquisition function to select the next samples within a feasible region, determined by Lower Confidence Bound (LCB) conditions for all constraints. The LCB approach guarantees constraint feasibility, while EI efficiently balances exploration and exploitation, especially when the feasible regions are much smaller than the overall search space. To model both the objective function and constraints, we use Deep Neural Networks (DNNs) instead of Gaussian Processes (GPs) to improve scalability and handle complex structured data. We provide a theoretical analysis showing our method's convergence using recent Neural Tangent Kernel (NTK) theory. Under regularity conditions, both cumulative regret and constraint violation are bounded by the maximum information gain, with equivalent upper bounds to GP-based methods. To validate our algorithm, we conduct experiments on synthetic and real-world benchmarks, showing its benefit over recent methods in black-box optimization with unknown constraints.

ID: 262

Learning to Stabilize Unknown LTI Systems on a Single Trajectory under Stochastic Noise

Ziyi Zhang, yorie nakahira, Guannan Qu
[OpenReview]

Abstract

We study the problem of learning to stabilize unknown noisy Linear Time-Invariant (LTI) systems on a single trajectory. The state-of-the-art guarantees that the system is stabilized before the system state reaches $2^{O(k \log n)}$ in $L^2$-norm, where $n$ is the state dimension, and $k$ is the dimension of the unstable subspace. However, this bound only holds in *noiseless* LTI systems that have a control input dimension at least as large as the dimension of unstable subspace, making it impractical in many real-life scenarios. In noisy systems, unknown noise is not only amplified by unstable system modes but also imposes significant difficulty in estimating the system dynamics or bounding the estimation errors. Furthermore, the aforementioned complexity is only achievable when the system has a number of control inputs that are at least as many as the dimension of the unstable subspace. To address these issues, we develop a novel algorithm with a singular-value-decomposition(SVD)-based analytical framework and show that the system is stabilized with the same complexity guarantee with the state-of-the-art in a noisy environment. With the SVD-based framework, we can bound the error of system identification with Davis-Kahan Theorem and design a controller that does not require the invertibility of the control matrix, making it possible to apply this algorithm in under-actuated settings. To the best of our knowledge, this paper is the first to achieve learning-to-stabilize unknown LTI system without exponential blow-up in noisy and under-actuated systems. We further demonstrate the advantage of the proposed algorithm in under-actuated settings.

ID: 264

COS-DPO: Conditioned One-Shot Multi-Objective Fine-Tuning Framework

Yinuo Ren, Tesi Xiao, Michael Shavlovsky, Lexing Ying, Holakou Rahmanian
[OpenReview]

Abstract

In LLM alignment and many other ML applications, one often faces the *Multi-Objective Fine-Tuning* (MOFT) problem, *i.e.*, fine-tuning an existing model with datasets labeled w.r.t. different objectives simultaneously. To address the challenge, we propose a *Conditioned One-Shot* fine-tuning framework (COS-DPO) that extends the Direct Preference Optimization technique, originally developed for efficient LLM alignment with preference data, to accommodate the MOFT settings. By direct conditioning on the weight across auxiliary objectives, our Weight-COS-DPO method enjoys an efficient one-shot training process for profiling the Pareto front and is capable of achieving comprehensive trade-off solutions even in the post-training stage. Based on our theoretical findings on the linear transformation properties of the loss function, we further propose the Temperature-COS-DPO method that augments the temperature parameter to the model input, enhancing the flexibility of post-training control over the trade-offs between the main and auxiliary objectives. We demonstrate the effectiveness and efficiency of the COS-DPO framework through its applications to various tasks, including the Learning-to-Rank (LTR) and LLM alignment tasks, highlighting its viability for large-scale ML deployments.

ID: 265

A Fast Optimization View: Reformulating Single Layer Attention in LLM Based on Tensor and SVM Trick, and Solving It in Matrix Multiplication Time

Yeqi Gao, Zhao Song, Weixin Wang, Junze Yin
[OpenReview]

Abstract

Large language models (LLMs) have played a pivotal role in revolutionizing various facets of our daily existence. Solving attention regression is a fundamental task in optimizing LLMs. In this work, we focus on providing a provable guarantee for the one-layer attention network objective function: given input matrices of a layer, $A_1, A_2, A_3 \in \mathbb{R}^{n \times d}$, our goal is to minimize the loss function: \begin{align*} L(X,Y) = \sum\_{j_0=1}^n \sum\_{i_0=1}^d ( \langle \langle \exp( \mathsf{A}\_{j_0} x ) , {\bf 1}\_n \rangle^{-1} \cdot \exp( \mathsf{A}\_{j_0} x ), A\_{3} Y\_{\*,i_0} \rangle - b\_{j_0,i_0} )^2, \end{align*} where $\mathsf{A}_{j_0} \in \mathbb{R}^{n \times d^2}$ is the $j_0$-th block of the Kronecker product of $A_1$ and $A_2$. The matrix $B \in \mathbb{R}^{n \times d}$ represents the output of a layer, and $b\_{j_0,i_0} \in \mathbb{R}$ is the $(j_0, i_0)$-th entry of $B$. $Y\_{*,i_0} \in \mathbb{R}^d$ is the $i_0$-th column vector of $Y$, and $x \in \mathbb{R}^{d^2}$ is the vectorization of $X$. In self-attention, $Q, K, V \in \mathbb{R}^{d \times d}$ represent the weight matrices for the query, key, and value, respectively. Unlike prior works that relied on simplified and single-variable versions of the attention computation problem, our multivariate loss function analyzes a complete and unsimplified attention layer, treating all these weights as variables, where $X = QK^\top \in \mathbb{R}^{d \times d}$ and $Y = V \in \mathbb{R}^{d \times d}$. We propose an iterative greedy algorithm to train a neural network using the loss function $L(X,Y)$, achieving an error bound of $\epsilon \in (0, 0.1)$. The algorithm runs in $O( ({\cal T}\_{\mathrm{mat}}(n,n,d) + {\cal T}\_{\mathrm{mat}}(n,d,d) + d^{2\omega}) \log(1/\epsilon) )$ time, where ${\cal T}\_{\mathrm{mat}}(a,b,c)$ denotes the time complexity of multiplying an $a \times b$ matrix with a $b \times c$ matrix, and $\omega \approx 2.37$ is the exponent of matrix multiplication.

ID: 269

Full Network Capacity Framework for Sample-Efficient Deep Reinforcement Learning

Wentao Yang, Xinyue Liu, Yunlong Gao, Wenxin Liang, Linlin Zong, Guanglu Wang, Xianchao Zhang
[OpenReview]

Abstract

In deep reinforcement learning (DRL), the presence of dormant neurons leads to a significant reduction in network capacity, which results in sub-optimal performance and limited sample efficiency. Existing training techniques, especially those relying on periodic resetting (PR), exacerbate this issue. We propose the Full Network Capacity (FNC) framework based on PR, which consists of two novel modules: Dormant Neuron Reactivation (DNR) and Stable Policy Update (SPU). DNR continuously reactivates dormant neurons, thereby enhancing network capacity. SPU mitigates perturbation from DNR and PR and stabilizes the Q-values for the actor, ensuring smooth training and reliable policy updates. Our experimental evaluations on the Atari 100K and DMControl 100K benchmarks demonstrate the remarkable sample efficiency of FNC. On Atari 100K, FNC achieves a superhuman IQM HNS of 107.3\%, outperforming the previous state-of-the-art method BBF by 13.3\%. On DMControl 100K, FNC excels in 5 out of 6 tasks in terms of episodic return and attains the highest median and mean aggregated scores. FNC not only maximizes network capacity but also provides a practical solution for real-world applications where data collection is costly and time-consuming. Our implementation is publicly accessible at \url{https://github.com/tlyy/FNC}.

ID: 271

Optimal Transport Alignment of User Preferences from Ratings and Texts

Nhu-Thuat Tran, Hady W. Lauw
[OpenReview]

Abstract

Modeling hidden factors driving user preferences is crucial for recommendation yet challenging due to sparse rating data. While aligning preference factors from ratings and texts, as a solution, shows improvements, existing methods impose restrictive one-to-one factor correspondences and underutilize cross-modal interest signals. We propose an optimal transport (OT) approach to address these gaps. By modeling rating- and text-based preference factors as distributions, we compute an OT plan that captures their probabilistic relationships. This plan serves dual roles: 1) to regularize cross-modal preference factors without rigid correspondence assumptions, and 2) to blend preference signals across modalities through barycentric mapping. Experiments on real-world datasets validate our method’s effectiveness over competitive baselines, highlighting its novel use of OT for adaptive preference factor alignment, an underexplored direction in recommender system research.

ID: 273

Enhancing Uncertainty Quantification in Large Language Models through Semantic Graph Density

Zhaoye Li, Siyuan Shen, Wenjing Yang, Ruochun Jin, Huan Chen, ligong cao, Jing Ren
[OpenReview]

Abstract

Large Language Models (LLMs) excel in language understanding but are susceptible to "confabulation," where they generate arbitrary, factually incorrect responses to uncertain questions. Detecting confabulation in question answering often relies on Uncertainty Quantification (UQ), which measures semantic entropy or consistency among sampled answers. While several methods have been proposed for UQ in LLMs, they suffer from key limitations, such as overlooking fine-grained semantic relationships among answers and neglecting answer probabilities. To address these issues, we propose Semantic Graph Density (SGD). SGD quantifies semantic consistency by evaluating the density of a semantic graph that captures fine-grained semantic relationships among answers. Additionally, it integrates answer probabilities to adjust the contribution of each edge to the overall uncertainty score. We theoretically prove that SGD generalizes the previous state-of-the-art method, Deg, and empirically demonstrate its superior performance across four LLMs and four free-form question-answering datasets. In particular, in experiments with Llama3.1-8B, SGD outperformed the best baseline by 1.52% in AUROC on the CoQA dataset and by 1.22% in AUARC on the TriviaQA dataset.

ID: 274

FALCON: Adaptive Cross-Domain APT Attack Investigation with Federated Causal Learning

Jialu Tang, Yali Gao, Xiaoyong Li, Jiawei Li, Shui Yu, Binxing Fang
[OpenReview]

Abstract

With the extensive deployment and application of Internet of Things (IoT) devices, vulnerable edge nodes have emerged as primary targets for Advanced Persistent Threat (APT) attacks. Attackers compromise IoT terminal devices to establish an initial foothold and subsequently exploit lateral movement techniques to progressively infiltrate core business networks. Prior investigation methods struggle with fragmented threat intelligence and sparse attack samples in heterogeneous audit logs, resulting in incomplete attack chain reconstruction and high false positives. We propose a novel approach to APT attack investigation, FALCON, which captures complex causal relationships between entities from discrete audit logs and constructs cross-domain provenance graphs, enabling rapid and accurate identification of potential APT activities. FALCON trains an adaptive edge-side local model with cross-domain behavior sequences containing extensive and remote contextual information, and employs a bidirectional transformer pre-trained model to learn latent representations from unlabeled sequences. To the best of our knowledge, FALCON is the first APT investigation method to conduct causal provenance based on cross-domain audit logs while ensuring privacy protection. The experimental results demonstrate that FALCON effectively detects APT attacks with accuracy 99.71% and reconstructs attack scenarios with accuracy 87.4%.

ID: 275

A Multivariate Unimodality Test Harnessing the Dip Statistic of Mahalanobis Distances Over Random Projections

Prodromos Kolyvakis, Aristidis Likas
[OpenReview]

Abstract

Unimodality, pivotal in statistical analysis, offers insights into dataset structures and drives sophisticated analytical procedures. While unimodality's confirmation is straightforward for one-dimensional data using methods like Silverman's approach and Hartigans' dip statistic, its generalization to higher dimensions remains challenging. By extrapolating one-dimensional unimodality principles to multi-dimensional spaces through linear random projections and leveraging point-to-point distancing, our method, rooted in $\alpha$-unimodality assumptions, presents a novel multivariate unimodality test named $\textit{mud-pod}$. Both theoretical and empirical studies confirm the efficacy of our method in unimodality assessment of multidimensional datasets as well as in estimating the number of clusters.

ID: 277

Metric Learning in an RKHS

Gokcan Tatli, Yi Chen, Blake Mason, Robert D Nowak, Ramya Korlakai Vinayak
[OpenReview]

Abstract

This paper investigates metric learning in a Reproducing Kernel Hilbert Space (RKHS) based on a set of random triplet comparisons in the form of *"Do you think item h is more similar to item i or item j?"* indicating similarity and differences between various items. The goal is to learn a metric in the RKHS that reflects the comparisons. Nonlinear metric learning using kernel methods and neural networks has shown great empirical promise. While previous works have addressed certain aspects of this problem, there is little or no theoretical understanding of such methods. The exception is the special (linear) case in which the RKHS is the standard $d$-dimensional Euclidean space; there is a comprehensive theory for metric learning in the $d$-dimensional Euclidean space. This paper develops a general RKHS framework for metric learning and provides novel generalization guarantees and sample complexity bounds. We validate our findings through a set of simulations and experiments on real datasets. Our code is publicly available at https://github.com/RamyaLab/metric-learning-RKHS.

ID: 279

Probabilistic Embeddings for Frozen Vision-Language Models: Uncertainty Quantification with Gaussian Process Latent Variable Models

Aishwarya Venkataramanan, Paul Bodesheim, Joachim Denzler
[OpenReview]

Abstract

Vision-Language Models (VLMs) learn joint representations by mapping images and text into a shared latent space. However, recent research highlights that deterministic embeddings from standard VLMs often struggle to capture the uncertainties arising from the ambiguities in visual and textual descriptions and the multiple possible correspondences between images and texts. Existing approaches tackle this by learning probabilistic embeddings during VLM training, which demands large datasets and does not leverage the powerful representations already learned by large-scale VLMs like CLIP. In this paper, we propose GroVE, a post-hoc approach to obtaining probabilistic embeddings from frozen VLMs. GroVE builds on Gaussian Process Latent Variable Model (GPLVM) to learn a shared low-dimensional latent space where image and text inputs are mapped to a unified representation, optimized through single-modal embedding reconstruction and cross-modal alignment objectives. Once trained, the Gaussian Process model generates uncertainty-aware probabilistic embeddings. Evaluation shows that GroVE achieves state-of-the-art uncertainty calibration across multiple downstream tasks, including cross-modal retrieval, visual question answering, and active learning.

ID: 282

Sparse Structure Exploration and Re-optimization for Vision Transformer

Sangho An, Jinwoo Kim, Keonho Lee, Jingang Huh, Chanwoong Kwak, Yujin Lee, Moonsub Jin, Jangho Kim
[OpenReview]

Abstract

Vision Transformers (ViTs) achieve outstanding performance by effectively capturing long-range dependencies between image patches (tokens). However, the high computational cost and memory requirements of ViTs present challenges for model compression and deployment on edge devices. In this study, we introduce a new framework, Sparse Structure Exploration and Re-optimization (SERo), specifically designed to maximize pruning efficiency in ViTs. Our approach focuses on (1) hardware-friendly pruning that fully compresses pruned parameters instead of zeroing them out, (2) separating the exploration and re-optimization phases \red{in order to find the optimal structure among various possible sparse structures}, and (3) using a simple gradient magnitude-based criterion for pruning a pre-trained model. SERo iteratively refines pruning masks to identify optimal sparse structures and then re-optimizes the pruned structure, reducing computational costs while maintaining model performance. Experimental results indicate that SERo surpasses existing pruning methods across various ViT models in both performance and computational efficiency. For example, SERo achieves a 69\% reduction in computational cost and a 2.4x increase in processing speed for DeiT-Base model, with only a 1.55\% drop in accuracy. Implementation code: https://github.com/Ahnho/SERo/

ID: 287

The Relativity of Causal Knowledge

Gabriele D'Acunto, Claudio Battiloro
[OpenReview]

Abstract

Recent advances in *artificial intelligence* reveal the limits of purely predictive systems and call for a shift toward causal *and* collaborative reasoning. Drawing inspiration from the revolution of Grothendieck in mathematics, we introduce the *relativity of causal knowledge*, which posits structural causal models (SCMs) are inherently imperfect, subjective representations embedded within networks of relationships. By leveraging category theory, we arrange SCMs into a functor category and show that their observational and interventional probability measures naturally form convex structures. This result allows us to encode non-intervened SCMs with convex spaces of probability measures. Next, using sheaf theory, we construct the *network sheaf and cosheaf of causal knowledge*. These structures enable the transfer of causal knowledge across the network while incorporating interventional consistency and the perspective of the subjects, ultimately leading to the formal, mathematical definition of *relative causal knowledge*.

ID: 289

RDI: An adversarial robustness evaluation metric for deep neural networks based on model statistical features

Jialei Song, Xingquan Zuo, Feiyang Wang, Hai Huang, Tianle Zhang
[OpenReview]

Abstract

Deep neural networks (DNNs) are highly susceptible to adversarial samples, raising concerns about their reliability in safety-critical tasks. Currently, methods of evaluating adversarial robustness are primarily categorized into attack-based and certified robustness evaluation approaches. The former not only relies on specific attack algorithms but also is highly time-consuming, while the latter due to its analytical nature, is typically difficult to implement for large and complex models. A few studies evaluate model robustness based on the model's decision boundary, but they suffer from low evaluation accuracy. To address the aforementioned issues, we propose a novel adversarial robustness evaluation metric, Robustness Difference Index (RDI), which is based on model statistical features. RDI draws inspiration from clustering evaluation by analyzing the intra-class and inter-class distances of feature vectors separated by the decision boundary to quantify model robustness. It is attack-independent and has high computational efficiency. Experiments show that, RDI demonstrates a stronger correlation with the gold-standard adversarial robustness metric of attack success rate (ASR). The average computation time of RDI is only 1/30 of the evaluation method based on the PGD attack. Our open-source code is available at: https://github.com/BUPTAIOC/RDI.

ID: 290

Well-Defined Function-Space Variational Inference in Bayesian Neural Networks via Regularized KL-Divergence

Tristan Cinquin, Robert Bamler
[OpenReview]

Abstract

Bayesian neural networks (BNN) promise to combine the predictive performance of neural networks with principled uncertainty modeling crucial for safety-critical systems and decision making. However, posterior uncertainties depend on the choice of prior, and finding informative priors in weight-space has proven difficult. This has motivated variational inference (VI) methods that pose priors directly on the function represented by the BNN rather than on weights. In this paper, we address a fundamental issue with such function-space VI approaches pointed out by Burt et al. (2020), who showed that the objective function (ELBO) is negative infinite for most priors of interest. Our solution builds on generalized VI with the regularized KL divergence and is, to the best of our knowledge, the first well-defined variational objective for inference in BNNs with Gaussian process (GP) priors. Experiments show that our method successfully incorporates the properties specified by the GP prior, and that it provides competitive uncertainty estimates for regression, classification and out-of-distribution detection compared to BNN baselines with both function and weight-space priors.

ID: 293

Geodesic Slice Sampler for Multimodal Distributions with Strong Curvature

Bernardo Williams, Hanlin Yu, Hoang Phuc Hau Luu, Georgios Arvanitidis, Arto Klami
[OpenReview]

Abstract

Traditional Markov Chain Monte Carlo sampling methods often struggle with sharp curvatures, intricate geometries, and multimodal distributions. Slice sampling can resolve local exploration inefficiency issues, and Riemannian geometries help with sharp curvatures. Recent extensions enable slice sampling on Riemannian manifolds, but they are restricted to cases where geodesics are available in a closed form. We propose a method that generalizes Hit-and-Run slice sampling to more general geometries tailored to the target distribution, by approximating geodesics as solutions to differential equations. Our approach enables the exploration of the regions with strong curvature and rapid transitions between modes in multimodal distributions. We demonstrate the advantages of the approach over challenging sampling problems.

ID: 296

Enumerating Optimal Cost-Constrained Adjustment Sets

Batya Kenig
[OpenReview]

Abstract

Estimating causal effects from observational data is a key problem in causal inference, often addressed through covariate adjustment sets that enable unbiased estimation of interventional means. This paper tackles the challenge of finding optimal covariate adjustment sets under budget constraints, a practical concern in many applications. We present algorithms for enumerating valid and minimal adjustment sets up to a specified cost, ordered by their proximity to outcome variables, which coincides with estimator variance. Our approach builds on existing graphical criteria and extends them to accommodate budgetary considerations, providing a useful tool for addressing resource limitations.

ID: 297

Adapting Prediction Sets to Distribution Shifts Without Labels

Kevin Kasa, Zhiyu Zhang, Heng Yang, Graham W. Taylor
[OpenReview]

Abstract

Recently there has been a surge of interest to deploy confidence set predictions rather than point predictions in machine learning. Unfortunately, the effectiveness of such prediction sets is frequently impaired by distribution shifts in practice, and the challenge is often compounded by the lack of ground truth labels at test time. Focusing on a standard set-valued prediction framework called conformal prediction (CP), this paper studies how to improve its practical performance using only unlabeled data from the shifted test domain. This is achieved by two new methods called $\texttt{ECP}$ and $\texttt{E{\small A}CP}$, whose main idea is to adjust the score function in CP according to its base model's own uncertainty evaluation. Through extensive experiments on a number of large-scale datasets and neural network architectures, we show that our methods provide consistent improvement over existing baselines and nearly match the performance of fully supervised methods.

ID: 298

Conformal Prediction Sets for Deep Generative Models via Reduction to Conformal Regression

Hooman Shahrokhi, Devjeet Raj Roy, Yan Yan, Venera Arnaoudova, Jana Doppa
[OpenReview]

Abstract

We consider the problem of generating valid and small prediction sets by sampling outputs (e.g., software code and natural language text) from a black-box deep generative model for a given input (e.g., textual prompt). The validity of a prediction set is determined by a user-defined binary admissibility function depending on the target application. For example, requiring at least one program in the set to pass all test cases in code generation application. To address this problem, we develop a simple and effective conformal inference algorithm referred to as {\em Generative Prediction Sets (GPS)}. Given a set of calibration examples and black-box access to a deep generative model, GPS can generate prediction sets with provable guarantees. The key insight behind GPS is to exploit the inherent structure within the distribution over the minimum number of samples needed to obtain an admissible output to develop a simple conformal regression approach over the minimum number of samples. Unlike prior work , the sets generated by GPS do not require iterative sampling at test time, while maintaining strict marginal coverage guarantees. Experiments on multiple datasets for code and math word problems using different large language models demonstrate the efficacy of GPS over state-of-the-art methods.

ID: 299

Sample and Computationally Efficient Continuous-Time Reinforcement Learning with General Function Approximation

Runze Zhao, Yue Yu, Adams Yiyue Zhu, Chen Yang, Dongruo Zhou
[OpenReview]

Abstract

Continuous-time reinforcement learning (CTRL) provides a principled framework for sequential decision-making in environments where interactions evolve continuously over time. Despite its empirical success, the theoretical understanding of CTRL remains limited, especially in settings with general function approximation. In this work, we propose a model-based CTRL algorithm that achieves both sample and computational efficiency. Our approach leverages optimism-based confidence sets to establish the first sample complexity guarantee for CTRL with general function approximation, showing that a near-optimal policy can be learned with a suboptimality gap of $\tilde{O}(\sqrt{d_{\mathcal{R}} + d_{\mathcal{F}}}N^{-1/2})$ using $N$ measurements, where $d_{\mathcal{R}}$ and $d_{\mathcal{F}}$ denote the distributional Eluder dimensions of the reward and dynamic functions, respectively, capturing the complexity of general function approximation in reinforcement learning. Moreover, we introduce structured policy updates and an alternative measurement strategy that significantly reduce the number of policy updates and rollouts while maintaining competitive sample efficiency. Our proposed algorithms are validated through experiments on continuous control tasks and diffusion model fine-tuning, demonstrating comparable performance with significantly fewer policy updates and rollouts.

ID: 302

Are You Doing Better Than Random Guessing? A Call for Using Negative Controls When Evaluating Causal Discovery Algorithms

Anne Helby Petersen
[OpenReview]

Abstract

New proposals for causal discovery algorithms are typically evaluated using simulations and a few selected real data examples with known data generating mechanisms. However, there does not exist a general guideline for how such evaluation studies should be designed, and therefore, comparing results across different studies can be difficult. In this article, we propose to use negative controls as a common evaluation baseline by posing the question: Are we doing better than random guessing? For the task of graph skeleton estimation, we derive exact distributional results under random guessing for the expected behavior of a range of typical causal discovery evaluation metrics, including precision and recall. We show that these metrics can achieve very favorable values under random guessing in certain scenarios, and hence warn against using them without also reporting negative control results, i.e., performance under random guessing. We also propose an exact test of overall skeleton fit, and showcase its use on a real data application. Finally, we propose a general pipeline for using negative controls beyond the skeleton estimation task, and apply it both in a simulated example and a real data application.

ID: 308

Efficiently Escaping Saddle Points for Policy Optimization

Mohammadsadegh Khorasani, Saber Salehkaleybar, Negar Kiyavash, Niao He, Matthias Grossglauser
[OpenReview]

Abstract

Policy gradient (PG) is widely used in reinforcement learning due to its scalability and good performance. In recent years, several variance-reduced PG methods have been proposed with a theoretical guarantee of converging to an approximate first-order stationary point (FOSP) with the sample complexity of $O(\epsilon^{-3})$. However, FOSPs could be bad local optima or saddle points. Moreover, these algorithms often use importance sampling (IS) weights which could impair the statistical effectiveness of variance reduction. In this paper, we propose a variance-reduced second-order method that uses second-order information in the form of Hessian vector products (HVP) and converges to an approximate second-order stationary point (SOSP) with sample complexity of $\tilde{O}(\epsilon^{-3})$. This rate improves the best-known sample complexity for achieving approximate SOSPs by a factor of $O(\epsilon^{-0.5})$. Moreover, the proposed variance reduction technique bypasses IS weights by using HVP terms. Our experimental results show that the proposed algorithm outperforms the state of the art and is more robust to changes in random seeds.

ID: 309

Simulation-based Inference for High-dimensional Data using Surjective Sequential Neural Likelihood Estimation

Simon Dirmeier, Carlo Albert, Fernando Perez-Cruz
[OpenReview]

Abstract

Neural likelihood estimation methods for simulation-based inference can suffer from performance degradation when the modeled data is very high-dimensional or lies along a lower-dimensional manifold, which is due to the inability of the density estimator to accurately estimate a density function. We present Surjective Sequential Neural Likelihood (SSNL) estimation, a novel member in the family of methods for simulation-based inference (SBI). SSNL fits a dimensionality-reducing surjective normalizing flow model and uses it as a surrogate likelihood function, which allows for computational inference via Markov chain Monte Carlo or variational Bayes methods. Among other benefits, SSNL avoids the requirement to manually craft summary statistics for inference of high-dimensional data sets, since the lower-dimensional representation is computed simultaneously with learning the likelihood and without additional computational overhead. We evaluate SSNL on a wide variety of experiments, including two challenging real-world examples from the astrophysics and neuroscience literatures, and show that it either outperforms or is on par with state-of-the-art methods, making it an excellent off-the-shelf estimator for SBI for high-dimensional data sets.

ID: 311

A Parallel Network for LRCT Segmentation and Uncertainty Mitigation with Fuzzy Sets

Shiyi Wang, Yang Nan, Xiaodan Xing, Yingying Fang, SIMON LF WALSH, Guang Yang
[OpenReview]

Abstract

Accurate segmentation of airways in Low-Resolution CT (LRCT) scans is vital for diagnostics in scenarios such as reduced radiation exposure, emergency response, or limited resources. Yet manual annotation is labor-intensive and prone to variability, while existing automated methods often fail to capture small airway branches in lower-resolution 3D data. To address this, we introduce \textbf{FuzzySR}, a parallel framework that merges super-resolution (SR) and segmentation. By concurrently producing high-resolution reconstructions and precise airway masks, it enhances anatomic fidelity and captures delicate bronchi. FuzzySR employs a deep fuzzy set mechanism, leveraging learnable $t$-distribution and triangular membership functions via cross-attention. Through parameters $\mu$, $\sigma$, and $d_f$, it preserves uncertain features and mitigates boundary noise. Extensive evaluations on lung cancer, COVID-19, and pulmonary fibrosis datasets confirm FuzzySR's superior segmentation accuracy on LRCT, surpassing even high-resolution baselines. By uniting fuzzy-logic-driven uncertainty handling with SR-based resolution enhancement, FuzzySR effectively bridges the gap for robust airway delineation from LRCT data.

ID: 312

STIMULUS: Achieving Fast Convergence and Low Sample Complexity in Stochastic Multi-Objective Learning

Zhuqing Liu, Chaosheng Dong, Michinari Momma, Simone Shao, Shaoyuan Xu, Yan Gao, Haibo Yang, Jia Liu
[OpenReview]

Abstract

Recently, multi-objective optimization (MOO) has gained attention for its broad applications in ML, operations research, and engineering. However, MOO algorithm design remains in its infancy and many existing MOO methods suffer from unsatisfactory convergence rate and sample complexity performance. To address this challenge, in this paper, we propose an algorithm called STIMULUS (**st**ochastic path-**i**ntegrated **mul**ti-gradient rec**u**rsive e**s**timator), a new and robust approach for solving MOO problems. Different from the traditional methods, STIMULUS introduces a simple yet powerful recursive framework for updating stochastic gradient estimates to improve convergence performance with low sample complexity. In addition, we introduce an enhanced version of \algns, termed \algmns, which incorporates a momentum term to further expedite convergence. We establish $\mathcal{O}(1/T)$ convergence rates of the proposed methods for non-convex settings and $\mathcal{O}(\exp{-\mu T})$ for strongly convex settings, where $T$ is the total number of iteration rounds. Additionally, we achieve the state-of-the-art $O\left(n+\sqrt{n}\epsilon^{-1}\right)$ sample complexities for non-convex settings and $\mathcal{O}\left(n+ \sqrt{n} \ln ({\mu/\epsilon})\right)$ for strongly convex settings, where $\epsilon>0$ is a desired stationarity error. Moreover, to alleviate the periodic full gradient evaluation requirement in STIMULUS and STIMULUS-M, we further propose enhanced versions with adaptive batching called STIMULUS$^+$/ STIMULUS-M$^+$ and provide their theoretical analysis.

ID: 313

Multi-Label Bayesian Active Learning with Inter-Label Relationships

YUANYUAN QI, Jueqing Lu, Xiaohao Yang, Joanne Enticott, Lan Du
[OpenReview]

Abstract

The primary challenge of multi-label active learning, differing it from multi-class active learning, lies in assessing the informativeness of an indefinite number of labels while also accounting for the inherited label correlation. Existing studies either require substantial computational resources to leverage correlations or fail to fully explore label dependencies. Additionally, real-world scenarios often require addressing intrinsic biases stemming from imbalanced data distributions. In this paper, we propose a new multi-label active learning strategy to address both challenges. Our method incorporates progressively updated positive and negative correlation matrices to capture co-occurrence and disjoint relationships within the label space of annotated samples, enabling a holistic assessment of uncertainty rather than treating labels as isolated elements. Furthermore, alongside diversity, our model employs ensemble pseudo labeling and beta scoring rules to address data imbalances. Extensive experiments on four realistic datasets demonstrate that our strategy consistently achieves more reliable and superior performance, compared to several established methods.

ID: 317

MSCGrapher: Learning Multi-Scale Dynamic Correlations for Multivariate Time Series Forecasting

Xian Yang, Zhenguo Zhang, Shihao Lu
[OpenReview]

Abstract

Efficient learning intra-series and inter-series correlations is essential for multivariate time series forecasting (MTSF). However, in real-world scenarios, persistent and significant inter-series correlations are challenging to be represented in a static way and the strength of correlations varies across different time scales. In this paper, we address this challenge by modeling the complex inter-series relationships through dynamical correlations, considering the varying strengths of correlations. We propose a novel MTSF model: MSCGrapher, which leverages an adaptive correlation learning block to uncover inter-series correlations across different scales. Concretely, time series are first decomposed into different scales based on their periodicities. The graph representation of MTS is then constructed and an adaptive correlation learning method is introduced to capture the inter-series correlations across different scales. To quantify the strength of these correlations, we compute correlation scores based on the characteristics of the graph edges and classify correlations as either $\textit{Strong}$ or $\textit{Weak}$. Finally, we employ a self-attention module to capture intra-series correlations and then fuse features from all scales to obtain the final representation. Extensive experiments on 12 real-world datasets show that MSCGrapher gains significant forecasting performance, highlighting the critical role of inter-series correlations in capturing implicit patterns for MTS.

ID: 325

A Trust-Region Method for Graphical Stein Variational Inference

Liam Pavlovic, David M Rosen
[OpenReview]

Abstract

Stein variational inference (SVI) is a sample-based approximate Bayesian inference technique that generates a sample set by jointly optimizing the samples’ locations to minimize an information-theoretic measure of discrepancy with the target probability distribution. SVI thus provides a fast and significantly more sample-efficient approach to Bayesian inference than traditional (random-sampling-based) alternatives. However, the optimization techniques employed in existing SVI methods struggle to address problems in which the target distribution is high-dimensional, poorly-conditioned, or non-convex, which severely limits the range of their practical applicability. In this paper, we propose a novel trust-region optimization approach for SVI that successfully addresses each of these challenges. Our method builds upon prior work in SVI by leveraging conditional independences in the target distribution (to achieve high-dimensional scaling) and second-order information (to address poor conditioning), while additionally providing an effective adaptive step control procedure, which is essential for ensuring convergence on challenging non-convex optimization problems. Experimental results show our method achieves superior numerical performance, both in convergence rate and sample accuracy, and scales better in high-dimensional distributions, than previous SVI techniques.

ID: 326

Minimax Optimal Nonsmooth Nonparametric Regression via Fractional Laplacian Eigenmaps

Zhaoyang Shi, Krishna Balasubramanian, Wolfgang Polonik
[OpenReview]

Abstract

We develop minimax optimal estimators for nonparametric regression methods when the true regression function lies in an $L_2$-fractional Sobolev space with order $s\in (0,1)$. This function class is a Hilbert space lying between the space of square-integrable functions and the first-order Sobolev space consisting of differentiable functions. It contains fractional power functions, piecewise constant or piecewise polynomial functions and bump function as canonical examples. We construct an estimator based on performing Principal Component Regression using Fractional Laplacian Eigenmaps and show that the in-sample mean-squared estimation error of this estimator is of order $n^{-\frac{2s}{2s+d}}$, where $d$ is the dimension, $s$ is the order parameter and $n$ is the number of observations. We next prove a minimax lower bound of the same order, thereby establishing that no other estimator can improve upon the proposed estimator, up to context factors. We also provide preliminary empirical results validating the practical performance of the developed estimators.

ID: 327

Towards Provably Efficient Learning of Imperfect Information Extensive-Form Games with Linear Function Approximation

Canzhe Zhao, Shuze Chen, Weiming Liu, Haobo Fu, QIANG FU, Shuai Li
[OpenReview]

Abstract

Despite significant advances in learning imperfect information extensive-form games (IIEFGs), most existing theoretical guarantees are limited to IIEFGs in the tabular case. To permit efficient learning of large-scale IIEFGs, we take the first step in studying two-player zero-sum IIEFGs with linear function approximation. In particular, we consider linear IIEFGs in the formulation of partially observable Markov games (POMGs) with linearly parameterized rewards. To address the challenge that the underlying function approximation structure is difficult to directly apply due to the imperfect information of states, we construct the composite "feature vectors" for information set-action pairs. Based on this, we further propose a "least-squares loss estimator", which we call the *fictitious* least-squares loss estimator. Through integrating this estimator with the follow-the-regularized-leader (FTRL) framework, we propose the *fictitious* least-squares follow-the-regularized-leader ($\text{F}^2\text{TRL}$) algorithm, which achieves a provable $\widetilde{\mathcal{O}}(\lambda\sqrt{d H^2 T})$ regret guarantee in the large $T$ regime, where $d$ is the ambient dimension of the feature mapping, $H$ is the horizon length, $\lambda$ is a "balance coefficient" and $T$ is the number of episodes. At the core of the analysis of $\text{F}^2\text{TRL}$ is the leverage of our proposed new "balanced transition" over information set-action space. Additionally, we complement our results with an $\Omega(\sqrt{d\min(d,H)T})$ regret lower bound for this problem and conduct empirical evaluations across various environments, which corroborate the effectiveness of our algorithm.

ID: 333

Coevolutionary Emergent Systems Optimization with Applications to Ultra-High-Dimensional Metasurface Design : OAM Wave Manipulation

Zhengxuan Jiang, Guowen Ding, Wen Jiang
[OpenReview]

Abstract

Optimization problems in electromagnetic wave manipulation and metasurface design are becoming increasingly high-dimensional, often involving thousands of variables that need precise control. Traditional optimization algorithms face significant challenges in maintaining both accuracy and computational efficiency when dealing with such ultra-high-dimensional problems. This paper presents a novel Coevolutionary Emergent Systems Optimization (CESO) algorithm that integrates coevolutionary dynamics, emergent behavior, and adaptive mechanisms to address these challenges. CESO features a unique multi-subsystem architecture that enables parallel exploration of solution spaces while maintaining interactive influences between subsystems. The algorithm incorporates an efficient adaptive mechanism for parameter adjustment and a distinctive emergent behavior simulation mechanism that prevents local optima traps through periodic subsystem reorganization. Experimental results on the CEC2017 benchmark suite demonstrate CESO's superior performance. The algorithm's practical effectiveness is validated through a challenging application in electromagnetic wave manipulation - OAM wave demultiplexing system (10,000 dimensions). In this application, CESO achieves superior mode separation for OAM wave demultiplexing compared to traditional algorithms. These results demonstrate CESO's significant advantages in solving practical high-dimensional optimization problems.

ID: 334

Self-tuned Robust Mean Estimators

Qiang Sun
[OpenReview]

Abstract

This paper introduces an empirical risk minimization based approach with concomitant scaling, which eliminates the need for tuning a robustification parameter in the presence of heavy-tailed data. This method leverages a new loss function that concurrently optimizes both the mean and robustification parameters. Through this dual-parameter optimization, the robustification parameter automatically adjusts to the unknown data variance, rendering the method self-tuning. Our approach surpasses previous models in both computational and asymptotic efficiency. Notably, it avoids the reliance on cross-validation or Lepski's method for tuning the robustification parameter, and the variance of our estimator attains the Cram\'{e}r-Rao lower bound, demonstrating optimal efficiency. In essence, our approach demonstrates optimal performance across both finite-sample and large-sample scenarios, a feature we describe as \textit{algorithmic adaptivity to both asymptotic and finite-sample regimes}. Numerical studies lend strong support to our methodology. The code is available at \url{https://github.com/NeXAIS/automean}.

ID: 337

Federated R\'enyi Fair Inference in Federated Heterogeneous System

Zhiyong Ma, Yuanjie Shi, Yan Yan, Jian Chen
[OpenReview]

Abstract

Federated Learning (FL) is a prominent distributed learning approach that addresses two major challenges: statistical heterogeneity (i.e., non-identically distributed data) and system heterogeneity (i.e., variability in the communication and computation on each client). As FL is commonly applied in sectors such as commercial and financial, group disparities may emerge and cause harm. However, current fairness algorithms assume homogeneous data, which does not align with the FL context. The main challenge is estimating global fairness measures (e.g., R\'enyi or Pearson correlation) in an asynchronous, heterogeneous system. To address this, we propose the FedR\'enyi algorithm, which regularizes fairness by R\'enyi correlation. For statistical heterogeneity, FedR\'enyi aggregates local fairness statistics to estimate the global R\'enyi correlation with an estimation error bound of $O(1/\sqrt{n})$, where $n$ is the total number of data samples. This theoretical result improves significantly over the prior result $O(1/\sqrt{K})$ with $K$ clients. We further prove that FedR\'enyi converges at the same rate as in the homogeneous setting. For system heterogeneity, FedR\'enyi approximates missing client updates through weighted averaging over a nearest neighbor region, ensuring a non-expansive approximation error under non-convex conditions. Extensive experiments demonstrate that FedR\'enyi achieves a promising fairness-accuracy trade-off, with at least 2\\% improvement over baselines.

ID: 341

Order-Optimal Global Convergence for Actor-Critic with General Policy and Neural Critic Parametrization

Swetha Ganesh, Jiayu Chen, Washim Uddin Mondal, Vaneet Aggarwal
[OpenReview]

Abstract

This paper addresses the challenge of achieving order-optimal sample complexity in reinforcement learning for discounted Markov Decision Processes (MDPs) with general policy parameterization and multi-layer neural network critics. Existing approaches either fail to achieve the optimal rate or assume a linear critic. We introduce Natural Actor-Critic with Data Drop (NAC-DD) algorithm, which integrates Natural Policy Gradient methods with a Data Drop technique to mitigate statistical dependencies inherent in Markovian sampling. NAC-DD achieves an optimal sample complexity of $\tilde{\mathcal{O}}(1/\epsilon^2)$, marking a significant improvement over the previous state-of-the-art guarantee of $\tilde{O}(1/\epsilon^3)$. The algorithm employs a multi-layer neural network critic with differentiable activation functions, aligning with real-world applications where tabular policies and linear critics are insufficient. Our work represents the first to achieve order-optimal sample complexity for actor-critic methods with neural function approximation, continuous state and action spaces, and Markovian sampling. Empirical evaluations on benchmark tasks confirm the theoretical findings, demonstrating the practical efficacy of the proposed method.

ID: 342

Online Generalized Magician's Problem with Multiple Workers

Ruoyu Wu, Wei Bao, Ben Liang, Liming Ge
[OpenReview]

Abstract

We study the online Generalized Magician's Problem with Multiple Workers (GMPMW), where tasks arrive sequentially and must be assigned to one of several workers for processing, with each worker consuming a stochastic amount of resources and generating an unknown reward. The system must decide on the acceptance of each task and its assignment to a worker, in order to maximize the accumulated reward within the budget. To address this problem, we propose the Online Worker Assignment (OWA) Algorithm. It optimally solves an optimization problem to balance resource allocation across workers and maintains virtual resource utilization according to the joint evolution of different workers. The competitive ratio of OWA is lower bounded by the closed-form expression $\max${${1}/{L},c$}$\cdot(1-K^{-\frac{1}{2}})$, where $L$ is the number of workers, $K$ is the resource budget, and $c$ is a constant derived from the problem instance. We perform trace-driven experiments with real-time video analytics, demonstrating the excellent capability of OWA to accommodate multiple workers in GMPMW.

ID: 345

Letting Uncertainty Guide Your Multimodal Machine Translation

Wuyi Liu, Yue Gao, Yige Mao, Jing Zhao
[OpenReview]

Abstract

Multimodal Machine Translation (MMT) leverages additional modalities, such as visual data, to enhance translation accuracy and resolve linguistic ambiguities inherent in text-only approaches. Recent advancements predominantly focus on integrating image information via attention mechanisms or feature fusion techniques. However, current approaches lack explicit mechanisms to quantify and manage the uncertainty during translation process, resulting in the utilization of image information being a black box. This makes it difficult to effectively address the issues of incomplete utilization of visual information and even potential degradation of translation quality when using visual information.To address these challenges, we introduce a novel Uncertainty-Guided Multimodal Machine Translation (UG-MMT) framework that redefines how translation systems handle ambiguity through systematic uncertainty reduction. Designed with plug-and-play flexibility, our framework enables seamless integration into existing MMT systems, requiring minimal modification while delivering significant performance gains.

ID: 348

Targeted Learning for Variable Importance

Xiaohan Wang, Yunzhe Zhou, Giles Hooker
[OpenReview]

Abstract

Variable importance is one of the most widely used measures for interpreting machine learning with significant interest from both statistics and machine learning communities. Recently, increasing attention has been directed toward uncertainty quantification in these metrics. Current approaches largely rely on one-step procedures, which, while asymptotically efficient, can present higher sensitivity and instability in finite sample settings. To address these limitations, we propose a novel method by employing the targeted learning (TL) framework, designed to enhance robustness in inference for variable importance metrics. Our approach is particularly suited for conditional permutation variable importance. We show that it (i) retains the asymptotic efficiency of traditional methods, (ii) maintains comparable computational complexity, and (iii) delivers improved accuracy, especially in finite sample contexts. We further support these findings with numerical experiments that illustrate the practical advantages of our method and validate the theoretical results.

ID: 350

Probabilistic Explanations for Regression Models

Frédéric Koriche, Jean-Marie Lagniez, Chi Tran
[OpenReview]

Abstract

Formal explainability is an emerging field that aims to provide mathematically guaranteed explanations for the predictions made by machine learning models. Recent work in this area focuses on computing “probabilistic explanations” for the predictions made by classifiers based on specific data instances. The goal of this paper is to extend the concept of probabilistic explanations to the regression setting, treating the target regressor as a black box function. The class of probabilistic explanations consists of linear functions that meet a sparsity constraint, alongside a hyperplane constraint defined for the data instance being explained. While minimizing the precision error of such explanations is generally $\text{NP}^{\text{PP}}$-hard, we demonstrate that it can be approximated by substituting the precision measure with a fidelity measure. Optimal explanations based on this fidelity objective can be effectively approached using Mixed Integer Programming (MIP). Moreover, we show that for certain distributions used to define the precision measure, explanations with approximation guarantees can be computed in polynomial time using a variant of Iterative Hard Thresholding (IHT). Experiments conducted on various datasets indicate that both the MIP and IHT approaches outperform the state-of-the-art LIME and MAPLE explainers.

ID: 351

Probability-Raising Causality for Uncertain Parametric Markov Decision Processes with PAC Guarantees

Ryohei Oura, Yuji Ito
[OpenReview]

Abstract

Recent decision-making systems are increasingly complicated, making it crucial to verify and understand their behavior for a given specification. A promising approach is to comprehensively explain undesired behavior in the systems modeled by Markov decision processes (MDPs) through formal verification and causal reasoning. However, the reliable explanation using model-based probabilistic causal analysis has not been explored when the MDP's transition probabilities are uncertain. This paper proposes a method to identify potential causes of undesired behaviors in an uncertain parametric MDP (upMDP) using parameter sampling, model checking, and a set covering for the samples. A cause is defined as a subset of states based on a probability-raising principle. We show that the probability of each identified subset being a cause exceeds a specified threshold. Further, a lower bound of the probability that the undesired paths visit the subsets is maximized as much as possible while satisfying a nonredundancy condition. While computing these probabilities is complicated, this study derives probabilistically approximately correct lower bounds of both probabilities by the sampling. We demonstrate the effectiveness of the proposed method through a path-planning scenario.

ID: 356

i$^2$VAE: Interest Information Augmentation with Variational Regularizers for Cross-Domain Sequential Recommendation

Xuying Ning, Wujiang Xu, Tianxin Wei, Xiaolei Liu
[OpenReview]

Abstract

Cross-Domain Sequential Recommendation (CDSR) leverages user behaviors across multiple domains to mitigate data sparsity and cold-start challenges in Single-Domain Sequential Recommendation. Existing methods primarily rely on shared users (overlapping users) to learn transferable interest representations. However, these approaches have limited information propagation, benefiting mainly overlapping users and those with rich interaction histories while neglecting non-overlapping (cold-start) and long-tailed users, who constitute the majority in real-world scenarios. To address this issue, we propose i$^2$VAE, a novel variational autoencoder (VAE)-based framework that enhances user interest learning with mutual information-based regularizers. i$^2$VAE improves recommendations for cold-start and long-tailed users while maintaining strong performance across all user groups. Specifically, cross-domain and disentangling regularizers extract transferable features for cold-start users, while a pseudo-sequence generator synthesizes interactions for long-tailed users, refined by a denoising regularizer to filter noise and preserve meaningful interest signals. Extensive experiments demonstrate that i$^2$VAE outperforms state-of-the-art methods, underscoring its effectiveness in real-world CDSR applications. Code and datasets are available at https://github.com/WujiangXu/IM-VAE.

ID: 358

Learning Multi-interest Embedding with Dynamic Graph Cluster for Sequention Recommendation

XiaoChunjing, Guoranhao, Zhang Yongwang, Xiaoming Wu
[OpenReview]

Abstract

Multi-interest recommendation is to predict the next item by representing diversity of a user preference with multiple interest embeddings. Although existing methods have achieved convincing results in recommendation tasks, they ignore the continuously changing relations of no-adjacent items in a sequence. In this paper, we focus on how to fully capture the changing relations when capturing the user multi-interest representations. Specifically, we propose a novel dynamic graph cluster-based multi-interest model named MDGR, which not only comprehensively explores the real changing item relations between no-adjacent items by iteratively constructing and continuously optimizing interest sub-graph to update the multiple interest embeddings but also collaborates temporal information and interest weight to model the interactive behaviors of users and items.Our model iteratively constructs and continuously optimizes the interest sub-graph by comprehensively adopting dynamic graph cluster to explore the item relations in sequences. That is beneficial to dynamically model user multiple interests. Furthermore, we employ the attention module to extract different influence of various interest embeddings. Finally, we use the refined item embedding and the final multi-interest embeddings to retrieval the next item that a user is most likely to interact with. To the best of our knowledge, this is the first attempt to explore multi-interest embeddings by iteratively constructing and continuously optimizing the interest sub-graph. Extensive experiments on three popular benchmark datasets demonstrate that MDGR outperforms several state-of-the-art methods.

ID: 361

Nonparametric Bayesian inference of item-level features in classifier combination

Patrick Stinson, Nikolaus Kriegeskorte
[OpenReview]

Abstract

In classification tasks, examples belonging to the same class can often still differ substantially from one another, and being able to capture such heterogeneity and its impact on classification can be important for aggregating estimates across multiple classifiers. Bayesian models developed so far have relied on a fixed set of latent variables to model these causal factors, which not only introduces the need for model selection but also assumes that each item is governed by the same set of causal factors. We develop a Bayesian model that can infer generic item features by modeling item feature membership as distributed according to an Indian Buffet Process. Despite its flexibility, our model is scalable to a large number of classifiers and examples. We compare our method with models from item response theory and Bayesian classifier combination on black-box crowdsourcing tasks and with neural network instance-dependent models in white-box classifier combination tasks.

ID: 362

VADIS: Investigating Inter-View Representation Biases for Multi-View Partial Multi-Label Learning

Jie Wang, Ning Xu, Xin Geng
[OpenReview]

Abstract

Multi-view partial multi-label learning (MVPML) deals with training data where each example is represented by multiple feature vectors and associated with a set of candidate labels, only a subset of which are correct. The diverse representation biases present in different views complicate the annotation process in MVPML, leading to the inclusion of incorrect labels in the candidate label set. Existing methods typically merge features from different views to identify the correct labels in the training data without addressing the representation biases inherent in different views. In this paper, we propose a novel MVPML method called \textsc{Vadis}, which investigates view-aware representations for disambiguation and predictive model learning. Specifically, we exploit the global common representation shared by all views, aligning it with a local semantic similarity matrix to estimate ground-truth labels via a low-rank mapping matrix. Additionally, to identify incorrect labels, the view-specific inconsistent representation is recovered by leveraging the sparsity assumption. Experiments on real-world datasets validate the superiority of our approach over other state-of-the-art methods.

ID: 365

Learning Causal Response Representations through Direct Effect Analysis

Homer Durand, Gherardo Varando, Gustau Camps-Valls
[OpenReview]

Abstract

We propose a novel approach for learning causal response representations. Our method aims to extract directions in which a multidimensional outcome is most directly caused by a treatment variable. By bridging conditional independence testing with causal representation learning, we formulate an optimisation problem that maximises the evidence against conditional independence between the treatment and outcome, given a conditioning set. This formulation employs flexible regression models tailored to specific applications, creating a versatile framework. The problem is addressed through a generalised eigenvalue decomposition. We show that, under mild assumptions, the distribution of the largest eigenvalue can be bounded by a known $F$-distribution, enabling testable conditional independence. We also provide theoretical guarantees for the optimality of the learned representation in terms of signal-to-noise ratio and Fisher information maximisation. Finally, we demonstrate the empirical effectiveness of our approach in simulation and real-world experiments. Our results underscore the utility of this framework in uncovering direct causal effects within complex, multivariate settings.

ID: 366

$\sigma$-Maximal Ancestral Graphs

Binghua Yao, Joris Mooij
[OpenReview]

Abstract

Maximal Ancestral Graphs (MAGs) provide an abstract representation of Directed Acyclic Graphs (DAGs) with latent (selection) variables. These graphical objects encode information about ancestral relations and d-separations of the DAGs they represent. This abstract representation has been used amongst others to prove the soundness and completeness of the FCI algorithm for causal discovery, and to derive a do-calculus for its output. One significant inherent limitation of MAGs is that they rule out the possibility of cyclic causal relationships. In this work, we address that limitation. We introduce and study a class of graphical objects that we coin "$\sigma$-Maximal Ancestral Graphs" ("$\sigma$-MAGs"). We show how these graphs provide an abstract representation of (possibly cyclic) Directed Graphs (DGs) with latent (selection) variables, analogously to how MAGs represent DAGs. We study the properties of these objects and provide a characterization of their Markov equivalence classes.

ID: 374

Quantum Speedups for Bayesian Network Structure Learning

Juha Harviainen, Kseniya Rychkova, Mikko Koivisto
[OpenReview]

Abstract

The Bayesian network structure learning (BNSL) problem asks for a directed acyclic graph that maximizes a given score function. For networks with $n$ nodes, the fastest known algorithms run in time $O(2^n n^2)$ in the worst case, with no improvement in the asymptotic bound for two decades. Inspired by recent advances in quantum computing, we ask whether BNSL admits a polynomial quantum speedup, that is, whether the problem can be solved by a quantum algorithm in time $O(c^n)$ for some constant $c$ less than $2$. We answer the question in the affirmative by giving two algorithms achieving $c \leq 1.817$ and $c \leq 1.982$ assuming the number of potential parent sets is, respectively, subexponential and $O(1.453^n)$. Both algorithms assume the availability of a quantum random access memory. We also prove that one presumably cannot lower the base $2$ for any classical algorithm, as that would refute the strong exponential time hypothesis.

ID: 382

Cutting Through Privacy: A Hyperplane-Based Data Reconstruction Attack in Federated Learning

Francesco Diana, André Nusser, Chuan Xu, Giovanni Neglia
[OpenReview]

Abstract

Federated Learning (FL) enables collaborative training of machine learning models across distributed clients without sharing raw data, ostensibly preserving data privacy. Nevertheless, recent studies have revealed critical vulnerabilities in FL, showing that a malicious central server can manipulate model updates to reconstruct clients' private training data. Existing data reconstruction attacks have important limitations: they often rely on assumptions about the clients' data distribution or their efficiency significantly degrades when batch sizes exceed just a few tens of samples. In this work, we introduce a novel data reconstruction attack that overcomes these limitations. Our method leverages a new geometric perspective on fully connected layers to craft malicious model parameters, enabling the perfect recovery of arbitrarily large data batches in classification tasks without any prior knowledge of clients' data. Through extensive experiments on both image and tabular datasets, we demonstrate that our attack outperforms existing methods and achieves perfect reconstruction of data batches two orders of magnitude larger than the state of the art.

ID: 384

Explaining Negative Classifications of AI Models in Tumor Diagnosis

David A. Kelly, Hana Chockler, Nathan Blake
[OpenReview]

Abstract

Using AI models in healthcare is gaining popularity. To improve clinician confidence in the results of automated triage and to provide further information about the suggested diagnosis, an explanation produced by a separate post-hoc explainability tool often accompanies the classification of an AI model. If no abnormalities are detected, however, it is not clear what an explanation should be. A human clinician might be able to describe certain salient features of tumors that are not in scan, but existing Explainable AI tools cannot do that, as they cannot point to features that are absent from the input. In this paper, we present a definition of and algorithm for providing explanations of absence; that is, explanations of negative classifications in the context of healthcare AI. Our approach is rooted in the concept of explanations in actual causality. It uses the model as a black-box and is hence portable and works with proprietary models. Moreover, the computation is done in the preprocessing stage, based on the model and the dataset. During the execution, the algorithm only projects the precomputed explanation template on the current image. We implemented this approach in a tool, nito, and trialed it on a number of medical datasets to demonstrate its utility on the classification of solid tumors. We discuss the differences between the theoretical approach and the implementation in the domain of classifying solid tumors and address the additional complications posed by this domain. Finally, we discuss the assumptions we make in our algorithm and its possible extensions to explanations of absence for general image classifiers.

ID: 386

Simulation-Free Differential Dynamics Through Neural Conservation Laws

Mengjian Hua, Eric Vanden-Eijnden, Ricky T. Q. Chen
[OpenReview]

Abstract

We present a novel simulation-free framework for training continuous-time diffusion processes over very general objective functions. Existing methods typically involve either prescribing the optimal diffusion process---which only works for heavily restricted problem formulations---or require expensive simulation to numerically obtain the time-dependent densities and sample from the diffusion process. In contrast, we propose a coupled parameterization which jointly models a time-dependent density function, or probability path, and the dynamics of a diffusion process that generates this probability path. To accomplish this, our approach directly bakes in the Fokker-Planck equation and density function requirements as hard constraints, by extending and greatly simplifying the construction of Neural Conservation Laws. This enables simulation-free training for a large variety of problem formulations, from data-driven objectives as in generative modeling and dynamical optimal transport, to optimality-based objectives as in stochastic optimal control, with straightforward extensions to mean-field objectives due to the ease of accessing exact density functions. We validate our method in a diverse range of application domains from modeling spatio-temporal events, to learning optimal dynamics from population data.

ID: 387

Tuning-Free Coreset Markov Chain Monte Carlo via Hot DoG

Naitong Chen, Jonathan H. Huggins, Trevor Campbell
[OpenReview]

Abstract

A Bayesian coreset is a small, weighted subset of a data set that replaces the full data during inference to reduce computational cost. The state-of-the-art coreset construction algorithm, Coreset Markov chain Monte Carlo (Coreset MCMC), uses draws from an adaptive Markov chain targeting the coreset posterior to train the coreset weights via stochastic gradient optimization. However, the quality of the constructed coreset, and thus the quality of its posterior approximation, is sensitive to the stochastic optimization learning rate. In this work, we propose a learning-rate-free stochastic gradient optimization procedure, Hot-start Distance over Gradient (Hot DoG), for training coreset weights in Coreset MCMC without user tuning effort. We provide a theoretical analysis of the convergence of the coreset weights produced by Hot DoG. We also provide empirical results demonstrate that Hot DoG provides higher quality posterior approximations than other learning-rate-free stochastic gradient methods, and performs competitively to optimally-tuned ADAM.

ID: 388

Generalised Probabilistic Modelling and Improved Uncertainty Estimation in Comparative LLM-as-a-judge

Yassir Fathullah, Mark Gales
[OpenReview]

Abstract

This paper explores generalised probabilistic modelling and uncertainty estimation in comparative LLM-as-a-judge frameworks. We show that existing Product-of-Experts methods are specific cases of a broader framework, enabling diverse modelling options. Furthermore, we propose improved uncertainty estimates for individual comparisons, enabling more efficient selection and achieving strong performance with fewer evaluations. We also introduce a method for estimating overall ranking uncertainty. Finally, we demonstrate that combining absolute and comparative scoring improves performance. Experiments show that the specific expert model has a limited impact on final rankings but our proposed uncertainty estimates, especially the probability of reordering, significantly improve the efficiency of systems reducing the number of needed comparisons by $\sim$50%. Furthermore, ranking-level uncertainty metrics can be used to identify low-performing predictions, where the nature of the probabilistic model has a notable impact on the quality of the overall uncertainty.

ID: 389

Do Vendi Scores Converge with Finite Samples? Truncated Vendi Score for Finite-Sample Convergence Guarantees

Azim Ospanov, Farzan Farnia
[OpenReview]

Abstract

Evaluating the diversity of generative models without reference data poses methodological challenges. The reference-free Vendi and RKE scores address this by quantifying the diversity of generated data using matrix-based entropy measures. Among these two, the Vendi score is typically computed via the eigendecomposition of an $n \times n$ kernel matrix constructed from n generated samples. However, the prohibitive computational cost of eigendecomposition for large $n$ often limits the number of samples used to fewer than 20,000. In this paper, we investigate the statistical convergence of the Vendi and RKE scores under restricted sample sizes. We numerically demonstrate that, in general, the Vendi score computed with standard sample sizes below 20,000 may not converge to its asymptotic value under infinite sampling. To address this, we introduce the $t$-truncated Vendi score by truncating the eigenspectrum of the kernel matrix, which is provably guaranteed to converge to its population limit with $n=\mathcal{O}(t)$ samples. We further show that existing Nyström and FKEA approximation methods converge to the asymptotic limit of the truncated Vendi score. In contrast to the Vendi score, we prove that the RKE score enjoys universal convergence guarantees across all kernel functions. We conduct several numerical experiments to illustrate the concentration of Nyström and FKEA computed Vendi scores around the truncated Vendi score, and we analyze how the truncated Vendi and RKE scores correlate with the diversity of image and text data. The code is available at \url{https://github.com/aziksh-ospanov/truncated-vendi}.

ID: 390

EERO: Early Exit with Reject Option for Efficient Classification with limited budget

Florian Valade, Mohamed Hebiri, Paul Gay
[OpenReview]

Abstract

The increasing complexity of advanced machine learning models requires innovative approaches to manage computational resources effectively. One such method is the Early Exit strategy, which allows for adaptive computation by providing a mechanism to shorten the processing path for simpler data instances. In this paper, we propose EERO, a new methodology to translate the problem of early exiting to a problem of using multiple classifiers with reject option in order to better select the exiting head for each instance. We calibrate the probabilities of exiting at the different heads using aggregation with exponential weights to guarantee a fixed budget. We consider factors such as Bayesian risk, budget constraints, and head-specific budget consumption. Experimental results demonstrate that our method achieves competitive compromise between budget allocation and accuracy.

ID: 391

Revisiting the Berkeley Admissions data: Statistical Tests for Causal Hypotheses

Sourbh Bhadane, Joris Mooij, Philip Boeken, Onno Zoeter
[OpenReview]

Abstract

Reasoning about fairness through correlation-based notions is rife with pitfalls. The 1973 University of California, Berkeley graduate school admissions case from \citet{BickelHO75} is a classic example of one such pitfall, namely Simpson’s paradox. The discrepancy in admission rates among male and female applicants, in the aggregate data over all departments, vanishes when admission rates per department are examined. We reason about the Berkeley graduate school admissions case through a causal lens. In the process, we introduce a statistical test for causal hypothesis testing based on Pearl's instrumental-variable inequalities \citep{Pearl95}. We compare different causal notions of fairness that are based on graphical, counterfactual and interventional queries on the causal model, and develop statistical tests for these notions that use only observational data. We study the logical relations between notions, and show that while notions may not be equivalent, their corresponding statistical tests coincide for the case at hand. We believe that a thorough case-based causal analysis helps develop a more principled understanding of both causal hypothesis testing and fairness.

ID: 393

Testing Generalizability in Causal Inference

Daniel de Vassimon Manela, Linying Yang, Robin J. Evans
[OpenReview]

Abstract

Ensuring robust model performance in diverse real-world scenarios requires addressing generalizability across domains with covariate shifts. However, no formal procedure exists for statistically evaluating generalizability in machine learning algorithms. Existing methods often rely on arbitrary proxy predictive metrics like mean squared error, but do not directly answer whether a model can or cannot generalize. To address this gap in the domain of causal inference, we propose a systematic framework for statistically evaluating the generalizability of high-dimensional causal inference models. Our approach uses the frugal parameterization to flexibly simulate from fully and semi-synthetic causal benchmarks, offering a comprehensive evaluation for both mean and distributional regression methods. Grounded in real-world data, our method ensures more realistic evaluations, which is often missing in current work relying on simplified datasets. Furthermore, using simulations and statistical testing, our framework is robust and avoids over-reliance on conventional metrics, providing statistical safeguards for decision making.

ID: 397

BELIEF - Bayesian Sign Entropy Regularization for LIME Framework

Revoti Prasad Bora, Philipp Terhörst, Raymond Veldhuis, Raghavendra Ramachandra, Kiran Raja
[OpenReview]

Abstract

Explanations of Local Interpretable Model-agnostic Explanations (LIME) are often inconsistent across different runs making them unreliable for eXplainable AI (XAI). The inconsistency stems from sign flips and variability in ranks of the segments for each different run. We propose a Bayesian Regularization approach to reduce sign flips, which in turn stabilizes feature rankings and ensures significantly higher consistency in explanations. The proposed approach enforces sparsity by incorporating a Sign Entropy prior on the coefficient distribution and dynamically eliminates features during optimization. Our results demonstrate that the explanations from the proposed method exhibit significantly better consistency and fidelity than LIME (and its earlier variants). Further, our approach exhibits comparable consistency and fidelity with a significantly lower execution time than the latest LIME variant, i.e., SLICE (CVPR 2024).

ID: 398

An Information-theoretic Perspective of Hierarchical Clustering on Graphs

Yicheng Pan, Bingchen Fan, Pengyu Long, Feng Zheng
[OpenReview]

Abstract

The seminal work of \citep{dasgupta2016cost} has introduced a combinatorial cost function for hierarchical graph clustering that has inspired numerous follow-up studies adopting similar combinatorial approaches. In this paper, we investigate this problem from the \emph{information-theoretic} perspective. We formulate a new cost function that is fully explainable and establish the relationship between combinatorial and information-theoretic perspectives. We present two algorithms for expander-like and well-clustered cardinality weighted graphs, respectively, and show that both of them achieve $O(1)$-approximation for our new cost function. Addressing practical needs, we consider non-binary hierarchical clustering problem, and propose a hyperparameter-free framework HCSE that recursively stratifies cluster trees through sparsity-aware partitioning, automatically determining the optimal hierarchy depth via an interpretable mechanism. Extensive experimental results demonstrate the superiority of our cost function and algorithms in binary clustering performance, hierarchy level identification, and reconstruction accuracy compared to existing approaches.

ID: 399

Hybrid Bernstein Normalizing Flows for Flexible Multivariate Density Regression with Interpretable Marginals

Marcel Arpogaus, Thomas Kneib, Thomas Nagler, David Rügamer
[OpenReview]

Abstract

Density regression models allow a comprehensive understanding of data by modeling the complete conditional probability distribution. While flexible estimation approaches such as normalizing flows (NFs) work particularly well in multiple dimensions, interpreting the input-output relationship of such models is often difficult, due to the black-box character of deep learning models. In contrast, existing statistical methods for multivariate outcomes such as multivariate conditional transformation models (MCTMs) are restricted in flexibility and are often not expressive enough to represent complex multivariate probability distributions. In this paper, we combine MCTMs with state-of-the-art and autoregressive NFs to leverage the transparency of MCTMs for modeling interpretable feature effects on the marginal distributions in the first step and the flexibility of neural-network-based NFs techniques to account for complex and non-linear relationships in the joint data distribution. We demonstrate our method's versatility in various numerical experiments and compare it with MCTMs and other NF models on both simulated and real-world data.

ID: 403

Moment Alignment: Unifying Gradient and Hessian Matching for Domain Generalization

Yuen Chen, Haozhe Si, Guojun Zhang, Han Zhao
[OpenReview]

Abstract

Domain generalization (DG) seeks to develop models that generalize well to unseen target domains, addressing distribution shifts in real-world applications. One line of research in DG focuses on aligning domain-level gradients and Hessians to enhance generalization. However, existing methods are computationally inefficient and the underlying principles of these approaches are not well understood. In this paper, we develop a theory of moment alignment for DG. Grounded in transfer measures, a principled framework for quantifying generalizability between domains, we prove that aligning derivatives across domains improves transfer measures. Moment alignment provides a unifying understanding of Invariant Risk Minimization, gradient matching, and Hessian matching, three previously disconnected approaches. We further establish the duality between feature moments and derivatives of the classifier head. Building upon our theory, we introduce Closed-Form Moment Alignment (CMA), a novel DG algorithm that aligns domain-level gradients and Hessians in closed-form. Our method overcomes the computational inefficiencies of existing gradient and Hessian-based techniques by eliminating the need for repeated backpropagation or sampling-based Hessian estimation. We validate our theory and algorithm through quantitative and qualitative experiments.

ID: 406

SpinSVAR: Estimating Structural Vector Autoregression Assuming Sparse Input

Panagiotis Misiakos, Markus Püschel
[OpenReview]

Abstract

We introduce SpinSVAR, a novel method for estimating a (linear) structural vector autoregression (SVAR) from time-series data under a sparse input assumption. Unlike prior approaches using Gaussian noise, we model the input as independent and identically distributed (i.i.d.) Laplacian variables, enforcing sparsity and yielding a maximum likelihood estimator (MLE) based on least absolute error regression. We provide theoretical consistency guarantees for the MLE under mild assumptions. SpinSVAR is efficient: it can leverage GPU acceleration to scale to thousands of nodes. On synthetic data with Laplacian or Bernoulli-uniform inputs, SpinSVAR outperforms state-of-the-art methods in accuracy and runtime. When applied to S&P 500 data, it clusters stocks by sectors and identifies significant structural shocks linked to major price movements, demonstrating the viability of our sparse input assumption.

ID: 414

Experimentation under Treatment Dependent Network Interference

Shiv Shankar, Ritwik Sinha, Madalina Fiterau
[OpenReview]

Abstract

Randomized Controlled Trials (RCTs) are a fundamental aspect of data-driven decision-making. RCTs often assume that the units are not influenced by each other. Traditional approaches addressing such effects assume a fixed network structure between the interfering units. However, real-world networks are rarely static, and treatment assignments can actively reshape the interference structure itself, as seen in financial access interventions that alter informal lending networks or healthcare programs that modify peer influence dynamics. This creates a novel and unexplored problem: estimating treatment effects when the interference network is determined by treatment allocation. In this work, we address this gap by proposing two single-experiment estimators for scenarios where network edges depend on nodal treatments constructed from instrumental variables derived from neighbourhood treatments. We prove their unbiasedness and experimentally validate the proposed estimators both on synthetic and real data.

ID: 415

On Constant Regret for Low-Rank MDPs

Alexander Sturm, Sebastian Tschiatschek
[OpenReview]

Abstract

Although there exist instance-dependent regret bounds for linear Markov decision processes (MDPs) and low-rank bandits, extensions to low-rank MDPs remain unexplored. In this work, we close this gap and provide regret bounds for low-rank MDPs in an instance-dependent setting. Specifically, we introduce an algorithm, called UniSREP-UCB, which utilizes a constrained optimization objective to learn features with good spectral properties. Furthermore, we demonstrate that our algorithm enjoys constant regret if the minimal sub-optimality gap and the occupancy distribution of the optimal policy are well-defined and known. To the best of our knowledge, these are the first instance-dependent regret results for low-rank MDPs.

ID: 418

Divide and Orthogonalize: Efficient Continual Learning with Local Model Space Projection

Jin Shang, Simone Shao, Tian Tong, Fan Yang, Yetian Chen, Yang Jiao, Jia Liu, Yan Gao
[OpenReview]

Abstract

Continual learning (CL) has gained increasing interest in recent years due to the need for models that can continuously learn new tasks while retaining knowledge from previous ones. However, existing CL methods often require either computationally expensive layer-wise gradient projections or large-scale storage of past task data, making them impractical for resource-constrained scenarios. To address these challenges, we propose a local model space projection (LMSP)-based continual learning framework that significantly reduces computational complexity from $\mathcal{O}(n^3)$ to $\mathcal{O}(n^2)$ while preserving both forward and backward knowledge transfer with minimal performance trade-offs. We establish a theoretical analysis of the error and convergence properties of LMSP compared to conventional global approaches. Extensive experiments on multiple public datasets demonstrate that our method achieves competitive performance while offering substantial efficiency gains, making it a promising solution for scalable continual learning.

ID: 419

MOHITO: Multi-Agent Reinforcement Learning using Hypergraphs for Task-Open Systems

Gayathri Anil, Prashant Doshi, Daniel Alan Redder, Adam Eck, Leen-Kiat Soh
[OpenReview]

Abstract

Open agent systems are prevalent in the real world, where the sets of agents and tasks change over time. In this paper, we focus on task-open multi-agent systems, exemplified by applications such as ridesharing, where passengers (tasks) appear spontaneously over time and disappear if not attended to promptly. Task-open settings challenge us with an action space which changes dynamically. This renders existing reinforcement learning (RL) methods--intended for fixed state and action spaces--inapplicable. Whereas multi-task learning approaches learn policies generalized to multiple known and related tasks, they struggle to adapt to previously unseen tasks. Conversely, lifelong learning adapts to new tasks over time, but generally assumes that tasks come sequentially from a static and known distribution rather than simultaneously and unpredictably. We introduce a novel category of RL for addressing task openness, modeled using a task-open Markov game. Our approach, MOHITO, is a multi-agent actor-critic schema which represents knowledge about the relationships between agents and changing tasks and actions as dynamically evolving 3-uniform hypergraphs. As popular multi-agent RL testbeds do not exhibit task openness, we evaluate MOHITO on two realistic and naturally task-open domains to establish its efficacy and provide a benchmark for future work in this setting.

ID: 420

Scaling Probabilistic Circuits via Data Partitioning

Jonas Seng, Florian Peter Busch, Pooja Prasad, Devendra Singh Dhami, Martin Mundt, Kristian Kersting
[OpenReview]

Abstract

Probabilistic circuits (PCs) enable us to learn joint distributions over a set of random variables and to perform various probabilistic queries in a tractable fashion. Though the tractability property allows PCs to scale beyond non-tractable models such as Bayesian Networks, scaling training and inference of PCs to larger, real-world datasets remains challenging. To remedy the situation, we show how PCs can be learned across multiple machines by recursively partitioning a distributed dataset, thereby unveiling a deep connection between PCs and federated learning (FL). This leads to federated circuits (FCs)---a novel and flexible federated learning (FL) framework that (1) allows one to scale PCs on distributed learning environments (2) train PCs faster and (3) unifies for the first time horizontal, vertical, and hybrid FL in one framework by re-framing FL as a density estimation problem over distributed datasets. We demonstrate FC's capability to scale PCs on various large-scale datasets. Also, we show FC's versatility in handling horizontal, vertical, and hybrid FL within a unified framework on multiple classification tasks.

ID: 423

Toward Universal Laws of Outlier Propagation

Aram Ebtekar, Yuhao Wang, Dominik Janzing
[OpenReview]

Abstract

When a variety of anomalous features motivate flagging different samples as *outliers*, Algorithmic Information Theory (AIT) offers a principled way to unify them in terms of a sample's *randomness deficiency*. Subject to the Independence of Mechanisms Principle, we show that for a joint sample on the nodes of a causal Bayesian network, the randomness deficiency decomposes into a sum of randomness deficiencies at each causal mechanism. Consequently, anomalous observations can be attributed to their root causes, i.e., the mechanisms that behaved anomalously. As an extension of Levin's law of randomness conservation, we show that weak outliers cannot cause strong ones. We show how these information theoretic laws clarify our understanding of outlier detection and attribution, in the context of more specialized outlier scores from prior literature.

ID: 432

Bayesian Optimization over Bounded Domains with the Beta Product Kernel

Huy Hoang Nguyen, Han Zhou, Matthew B. Blaschko, Aleksei Tiulpin
[OpenReview]

Abstract

Bayesian optimization with Gaussian processes (GP) is commonly used to optimize black-box functions. The Matérn and the Radial Basis Function (RBF) covariance functions are used frequently, but they do not make any assumptions about the domain of the function, which may limit their applicability in bounded domains. To address the limitation, we introduce the Beta kernel, a non-stationary kernel induced by a product of Beta distribution density functions. Such a formulation allows our kernel to naturally model functions on bounded domains. We present statistical evidence supporting the hypothesis that the kernel exhibits an exponential eigendecay rate, based on empirical analyses of its spectral properties across different settings. Our experimental results demonstrate the robustness of the Beta kernel in modeling functions with optima located near the faces or vertices of the unit hypercube. The experiments show that our kernel consistently outperforms a wide range of kernels, including the well-known Matérn and RBF, in different problems, including synthetic function optimization and the compression of vision and language models.

ID: 436

A Quantum Information Theoretic Approach to Tractable Probabilistic Models

Pedro Zuidberg Dos Martires
[OpenReview]

Abstract

By recursively nesting sums and products, probabilistic circuits have emerged in recent years as an attractive class of generative models as they enjoy, for instance, polytime marginalization of random variables. In this work we study these machine learning models using the framework of quantum information theory, leading to the introduction of positive unital circuits (PUnCs), which generalize circuit evaluations over positive real-valued probabilities to circuit evaluations over positive semi-definite matrices. As a consequence, PUnCs strictly generalize probabilistic circuits as well as recently introduced circuit classes such as PSD circuits.

ID: 437

Building Conformal Prediction Intervals with Approximate Message Passing

Lucas Clarté, Lenka Zdeborova
[OpenReview]

Abstract

Conformal prediction has emerged as a powerful tool for building prediction intervals that are valid in a distribution-free way. However, its evaluation may be computationally costly, especially in the high-dimensional setting where the dimensionality and sample sizes are both large and of comparable magnitudes. To address this challenge in the context of generalized linear regression, we propose a novel algorithm based on Approximate Message Passing (AMP) to accelerate the computation of prediction intervals using full conformal prediction, by approximating the computation of conformity scores. Our work bridges a gap between modern uncertainty quantification techniques and tools for high-dimensional problems involving the AMP algorithm. We evaluate our method on both synthetic and real data, and show that it produces prediction intervals that are close to the baseline methods, while being orders of magnitude faster. Additionally, in the high-dimensional limit and under assumptions on the data distribution, the conformity scores computed by AMP converge to the one computed exactly, which allows theoretical study and benchmarking of conformal methods in high dimensions.

ID: 445

Unsupervised Attributed Dynamic Network Embedding with Stability Guarantees

Emma Ceccherini, Ian Gallagher, Andrew Jones, Daniel John Lawson
[OpenReview]

Abstract

Stability for dynamic network embeddings ensures that nodes behaving the same at different times receive the same embedding, allowing comparison of nodes in the network across time. We present attributed unfolded adjacency spectral embedding (AUASE), a stable unsupervised representation learning framework for dynamic networks in which nodes are attributed with time-varying covariate information. To establish stability, we prove uniform convergence to an associated latent position model. We quantify the benefits of our dynamic embedding by comparing with state-of-the-art network representation learning methods on three real attributed networks. To the best of our knowledge, AUASE is the only attributed dynamic embedding that satisfies stability guarantees without the need for ground truth labels, which we demonstrate provides significant improvements for link prediction and node classification.

ID: 449

Calibrated Regression Against An Adversary Without Regret

Shachi Deshpande, Charles Marx, Volodymyr Kuleshov
[OpenReview]

Abstract

We are interested in probabilistic prediction in online settings in which data does not follow a probability distribution. Our work seeks to achieve two goals: (1) producing valid probabilities that accurately reflect model confidence; (2) ensuring that traditional notions of performance (e.g., high accuracy) still hold. We introduce online algorithms guaranteed to achieve these goals on arbitrary streams of data-points, including data chosen by an adversary. Specifically, our algorithms produce forecasts that are (1) calibrated i.e., an 80% confidence interval contains the true outcome 80% of the time—and (2) have low regret relative to a user-specified baseline model. We implement a post-hoc recalibration strategy that provably achieves these goals in regression; previous algorithms applied to classification or achieved (1) but not (2). In the context of Bayesian optimization, an online model-based decision-making task in which the data distribution shifts over time, our method yields accelerated convergence to improved optima.

ID: 464

Epistemic Uncertainty in Conformal Scores: A Unified Approach

Luben Miguel Cruz Cabezas, Vagner Silva Santos, Thiago Ramos, Rafael Izbicki
[OpenReview]

Abstract

Conformal prediction methods create prediction bands with distribution-free guarantees but do not explicitly capture epistemic uncertainty, which can lead to overconfident predictions in data-sparse regions. Although recent conformal scores have been developed to address this limitation, they are typically designed for specific tasks, such as regression or quantile regression. Moreover, they rely on particular modeling choices for epistemic uncertainty, restricting their applicability. We introduce $\texttt{EPICSCORE}$, a model-agnostic approach that enhances any conformal score by explicitly integrating epistemic uncertainty. Leveraging Bayesian techniques such as Gaussian Processes, Monte Carlo Dropout, or Bayesian Additive Regression Trees, $\texttt{EPICSCORE}$ adaptively expands predictive intervals in regions with limited data while maintaining compact intervals where data is abundant. As with any conformal method, it preserves finite-sample marginal coverage. Additionally, it also achieves asymptotic conditional coverage. Experiments demonstrate its good performance compared to existing methods. Designed for compatibility with any Bayesian model, but equipped with distribution-free guarantees, $\texttt{EPICSCORE}$ provides a general-purpose framework for uncertainty quantification in prediction problems.

ID: 467

Complete Characterization for Adjustment in Summary Causal Graphs of Time Series

Clément Yvernes, Emilie Devijver, Eric Gaussier
[OpenReview]

Abstract

The identifiability problem for interventions aims at assessing whether the total causal effect can be written with a do-free formula, and thus be estimated from observational data only. We study this problem, considering multiple interventions, in the context of time series when only an abstraction of the true causal graph, in the form of a summary causal graph, is available. We propose in particular both necessary and sufficient conditions for the adjustment criterion, which we show is complete in this setting, and provide a pseudo-linear algorithm to decide whether the query is identifiable or not.

ID: 469

Decomposition of Probabilities of Causation with Two Mediators

Yuta Kawakami, Jin Tian
[OpenReview]

Abstract

Mediation analysis for probabilities of causation (PoC) provides a fundamental framework for evaluating the necessity and sufficiency of treatment in provoking an event through different causal pathways. One of the primary objectives of causal mediation analysis is to decompose the total effect into path-specific components. In this study, we investigate the path-specific probability of necessity and sufficiency (PNS) to decompose the total PNS into path-specific components along distinct causal pathways between treatment and outcome, incorporating two mediators. We define the path-specific PNS for decomposition and provide an identification theorem. Furthermore, we conduct numerical experiments to assess the properties of the proposed estimators from finite samples and demonstrate their practical application using a real-world educational dataset.

ID: 470

Out-of-distribution Robust Optimization

Zhongze Cai, Hansheng Jiang, Xiaocheng Li
[OpenReview]

Abstract

In this paper, we consider the contextual robust optimization problem under an out-of-distribution setting. The contextual robust optimization problem considers a risk-sensitive objective function for an optimization problem with the presence of a context vector (also known as covariates or side information) capturing related information. While the existing works mainly consider the in-distribution setting, and the resultant robustness achieved is in an out-of-sample sense, our paper studies an out-of-distribution setting where there can be a difference between the test environment and the training environment where the data are collected. We propose methods that handle this out-of-distribution setting, and the key relies on a density ratio estimation for the distribution shift. We show that additional structures such as covariate shift and label shift are not only helpful in defending distribution shift but also necessary in avoiding non-trivial solutions compared to other principled methods such as distributionally robust optimization. We also illustrate how the covariates can be useful in this procedure. Numerical experiments generate more intuitions and demonstrate that the proposed methods can help avoid over-conservative solutions.

ID: 477

MSP-SR: Multi-Stage Probabilistic Generative Super Resolution with Scarce High-Resolution Data

Ruike Zhu, Matthew Charles Weston, Hanwen Zhang, Arindam Banerjee
[OpenReview]

Abstract

Several application domains, especially in science and medicine, benefit tremendously from acquiring high-resolution images of objects and phenomena of interest. Recognizing this need, generative models for super-resolution (SR) have emerged as a promising approach for such data generation. However, when training data are scarce due to high acquisition costs, such models struggle and often fail to capture the true data distribution due to insufficient data and domain knowledge. While transfer learning, domain adaptation, or few shot learning of such generative models can be a reasonable approach, most existing large scale generative models have been (pre)trained on natural images and it is unclear if such models can be seamlessly transferred to say medical images. In this paper, we propose Multi-Stage Probabilistic Super Resolution (MSP-SR), a cascaded few-shot learning framework for super-resolution through multi-stage transfer learning. At a high level, MSP-SR first transfers a generative model from out-of-domain to in-domain, e.g., from natural to medical images, and then from in-domain to the target application. We present the details based on conditional diffusion models and validate MSP-SR on multiple Magnetic Resonance Imaging (MRI) datasets, demonstrating that MSP-SR persistently and usually significantly outperforms direct fine-tuning (DFT) approaches as well as SR baselines. Further, MSP-SR empirically provides more accurate characterization of uncertainty in SR compared to DFT.

ID: 482

Causal Effect Identification in Heterogeneous Environments from Higher-Order Moments

Yaroslav Kivva, Sina Akbari, Saber Salehkaleybar, Negar Kiyavash
[OpenReview]

Abstract

We investigate the estimation of the causal effect of a treatment variable on an outcome in the presence of a latent confounder. We first show that the causal effect is identifiable under certain conditions when data is available from multiple environments, provided that the target causal effect remains invariant across these environments. Secondly, we propose a moment-based algorithm for estimating the causal effect as long as only a single parameter of the data-generating mechanism varies across environments -- whether it be the exogenous noise distribution or the causal relationship between two variables. Conversely, we prove that identifiability is lost if both exogenous noise distributions of both the latent and treatment variables vary across environments. Finally, we propose a procedure to identify which parameter of the data-generating mechanism has varied across the environments and evaluate the performance of our proposed methods through experiments on synthetic data.

ID: 483

Conformal Prediction for Federated Graph Neural Networks with Missing Neighbor Information

Ömer Faruk Akgül, Rajgopal Kannan, Viktor Prasanna
[OpenReview]

Abstract

Uncertainty quantification is essential for reliable federated graph learning, yet existing methods struggle with decentralized and heterogeneous data. In this work, we first extend Conformal Prediction (CP), a well-established method for uncertainty quantification, to federated graph learning, formalizing conditions for CP validity under partial exchangeability across distributed subgraphs. We prove that our approach maintains rigorous coverage guarantees even with client-specific data distributions. Building on this foundation, we address a key challenge in federated graph learning: missing neighbor information, which inflates CP set sizes and reduces efficiency. To mitigate this, we propose a variational autoencoder (VAE)-based architecture that reconstructs missing neighbors while preserving data privacy. Empirical evaluations on real-world datasets demonstrate the effectiveness of our method: our theoretically grounded federated training strategy reduces CP set sizes by 15.4\%, with the VAE-based reconstruction providing an additional 4.9\% improvement, all while maintaining rigorous coverage guarantees.

ID: 486

A Probabilistic Neuro-symbolic Layer for Algebraic Constraint Satisfaction

Leander Kurscheidt, Paolo Morettin, Roberto Sebastiani, Andrea Passerini, Antonio Vergari
[OpenReview]

Abstract

In safety-critical applications, guaranteeing the satisfaction of constraints over continuous environments is crucial, e.g., an autonomous agent should never crash over obstacles or go off-road. Neural models struggle in the presence of these constraints, especially when they involve intricate algebraic relationships. To address this, we introduce a differentiable probabilistic layer that guarantees the satisfaction of non-convex algebraic constraints over continuous variables. This probabilistic algebraic layer (PAL) can be seamlessly plugged into any neural architecture and trained via maximum likelihood without requiring approximations. PAL defines a distribution over conjunctions and disjunctions of linear inequalities, parametrized by polynomials. This formulation enables efficient and exact renormalization via symbolic integration, which can be amortized across different data points and easily parallelized on a GPU. We showcase PAL and our integration scheme on a number of benchmarks for algebraic constraint integration and on real-world trajectory data.

ID: 487

Probabilistic Graph Circuits: Deep Generative Models for Tractable Probabilistic Inference over Graphs

Milan Papez, Martin Rektoris, Vaclav Smidl, Tomáš Pevný
[OpenReview]

Abstract

Deep generative models (DGMs) have recently demonstrated remarkable success in capturing complex probability distributions over graphs. Although their excellent performance is attributed to powerful and scalable deep neural networks, it is, at the same time, exactly the presence of these highly non-linear transformations that makes DGMs intractable. Indeed, despite representing probability distributions, intractable DGMs deny probabilistic foundations by their inability to answer even the most basic inference queries without approximations or design choices specific to a very narrow range of queries. To address this limitation, we propose probabilistic graph circuits (PGCs), a framework of tractable DGMs that provide exact and efficient probabilistic inference over (arbitrary parts of) graphs. Nonetheless, achieving both exactness and efficiency is challenging in the permutation-invariant setting of graphs. We design PGCs that are inherently invariant and satisfy these two requirements, yet at the cost of low expressive power. Therefore, we investigate two alternative strategies to achieve the invariance: the first sacrifices the efficiency, and the second sacrifices the exactness. We demonstrate that ignoring the permutation invariance can have severe consequences in anomaly detection, and that the latter approach is competitive with, and sometimes better than, existing intractable DGMs in the context of molecular graph generation.

ID: 488

Dependent Randomized Rounding for Budget Constrained Experimental Design

Khurram Yamin, Edward Kennedy, Bryan Wilder
[OpenReview]

Abstract

Policymakers in resource-constrained settings require experimental designs that satisfy strict budget limits while ensuring precise estimation of treatment effects. We propose a framework that applies a dependent randomized rounding procedure to convert assignment probabilities into binary treatment decisions. Our proposed solution preserves the marginal treatment probabilities while inducing negative correlations among assignments, leading to improved estimator precision through variance reduction. We establish theoretical guarantees for the inverse propensity weighted and general linear estimators, and demonstrate through empirical studies that our approach yields efficient and accurate inference under fixed budget constraints.

ID: 490

An Optimal Algorithm for Strongly Convex Min-Min Optimization

Dmitry Kovalev, Alexander Gasnikov, Grigory Malinovsky
[OpenReview]

Abstract

We consider the problem of minimizing a function $ f(x, y) $, where $ f $ is a smooth and strongly convex function with respect to both variables, being $ \mu_x $-strongly convex in $ x $ and $ \mu_y $-strongly convex in $ y $. The optimal accelerated gradient method of Yurii Nesterov achieves a convergence rate that requires approximately $ \mathcal{O}((\min(\mu_x, \mu_y))^{-1/2}) $ evaluations of the partial gradients $ \nabla_x f $ and $ \nabla_y f $. In this paper, we propose a novel optimization algorithm that improves upon this complexity by requiring only $ \mathcal{O}(\mu_x^{-1/2}) $ computations of $ \nabla_x f $ and $ \mathcal{O}(\mu_y^{-1/2}) $ computations of $ \nabla_y f $. This improvement is particularly advantageous in scenarios where there is a significant disparity between the strong convexity parameters, specifically when $ \mu_x \gg \mu_y $. Furthermore, in practical applications where the computation of $ \nabla_y f $ is considerably more efficient than that of $ \nabla_x f $, the proposed method leads to a substantial reduction in the overall wall-clock time required for optimization. As a key application, we consider Partially Local Federated Learning, a setting in which the model is partitioned into a local component and a global component. We demonstrate how our proposed method can be effectively applied in this framework, highlighting its practical advantages in improving computational efficiency.

ID: 493

On Information-Theoretic Measures of Predictive Uncertainty

Kajetan Schweighofer, Lukas Aichberger, Mykyta Ielanskyi, Sepp Hochreiter
[OpenReview]

Abstract

Reliable estimation of predictive uncertainty is crucial for machine learning applications, particularly in high-stakes scenarios where hedging against risks is essential. Despite its significance, there is no universal agreement on how to best quantify predictive uncertainty. In this work, we revisit core concepts to propose a framework for information-theoretic measures of predictive uncertainty. Our proposed framework categorizes predictive uncertainty measures according to two factors: (I) The predicting model (II) The approximation of the true predictive distribution. Examining all possible combinations of these two factors, we derive a set of predictive uncertainty measures that includes both known and newly introduced ones. We extensively evaluate these measures across a broad set of tasks, identifying conditions under which certain measures excel. Our findings show the importance of aligning the choice of uncertainty measure with the predicting model on in-distribution (ID) data, the limitations of epistemic uncertainty measures for out-of-distribution (OOD) data, and that the disentanglement between measures varies substantially between ID and OOD data. Together, these insights provide a more comprehensive understanding of predictive uncertainty measures, revealing their implicit assumptions and relationships.

ID: 495

Improved Variational Inference in Discrete VAEs using Error Correcting Codes

María Martínez-García, Grace Villacrés, David Mitchell, Pablo M. Olmos
[OpenReview]

Abstract

Despite advances in deep probabilistic models, learning discrete latent representations remains challenging. This work introduces a novel method to improve inference in discrete Variational Autoencoders by reframing the inference problem through a generative perspective. We conceptualize the model as a communication system, and propose to leverage Error-Correcting Codes (ECCs) to introduce redundancy in latent representations, allowing the variational posterior to produce more accurate estimates and reduce the variational gap. We present a proof-of-concept using a Discrete Variational Autoencoder with binary latent variables and low-complexity repetition codes, extending it to a hierarchical structure for disentangling global and local data features. Our approach significantly improves generation quality, data reconstruction, and uncertainty calibration, outperforming the uncoded models even when trained with tighter bounds such as the Importance Weighted Autoencoder objective. We also outline the properties that ECCs should possess to be effectively utilized for improved discrete variational inference.

ID: 501

Root Cause Analysis of Failures from Partial Causal Structures

Azam Ikram, Kenneth Lee, Shubham Agarwal, Shiv Kumar Saini, Saurabh Bagchi, Murat Kocaoglu
[OpenReview]

Abstract

Finding the root cause of failures is a prominent problem in many complex networks. Causal inference provides us with tools to address this problem algorithmically to automate this process and solve it efficiently. The existing methods either use a known causal structure to identify root cause by backtracking the changes, or ignore the causal structure but relies on invariance tests to identify the changing causal mechanisms after the failure. Assuming a single, unknown root cause, we first establish a novel connection between root cause analysis and the \textit{Interactive Graph Search (IGS)} problem. This mapping highlights the importance of causal knowledge: we demonstrate that any algorithm relying solely on marginal invariance tests to identify the root cause must perform at least $\Omega(\log_{2}(n) + d\log_{1+d}n)$ many tests, where $n$ represents the number of components and $d$ denotes the maximum out-degree of the graph. We then present an optimal algorithm that achieves this bound by reducing the root cause identification problem as an instance of IGS. Beyond the single root cause scenario, we propose a practical extension for settings with multiple root causes and partial causal knowledge. More specifically, we show that even if the causal graph is partially known, we can identify the root-causes with a linear number of invariance tests. This is the first known result on incorporating a partial causal structure for root cause analysis. Our experiments on a production-level application demonstrate that, even in the absence of complete causal information, our approach accurately identifies the root causes of failures.

ID: 507

Privacy-Preserving Neural Processes for Probabilistic User Modeling

Amir Sonee, HARIPRIYA HARIKUMAR, Alex Hämäläinen, Lukas Prediger, Samuel Kaski
[OpenReview]

Abstract

Uncertainty-aware user modeling is crucial for designing AI systems that adapt to users in real-time while addressing privacy concerns. This paper proposes a novel framework for privacy-preserving probabilistic user modeling that integrates uncertainty quantification and differential privacy (DP). Building on neural processes (NPs), a scalable latent variable probabilistic model, we enable meta-learning for user behaviour prediction under privacy constraints. By employing differentially private stochastic gradient descent (DP-SGD), our method achieves rigorous privacy guarantees while preserving predictive accuracy. Unlike prior work, which primarily addresses privacy-preserving learning for convex or smooth functions, we establish theoretical guarantees for non-convex objectives, focusing on the utility-privacy trade-offs inherent in uncertainty-aware models. Through extensive experiments, we demonstrate that our approach achieves competitive accuracy under stringent privacy budgets. Our results showcase the potential of privacy-preserving probabilistic user models to enable trustworthy AI systems in real-world interactive applications.

ID: 511

Proxy-informed Bayesian transfer learning with unknown sources

Sabina J. Sloman, Julien Martinelli, Samuel Kaski
[OpenReview]

Abstract

Generalization outside the scope of one's training data requires leveraging prior knowledge about the effects that transfer, and the effects that don't, between different data sources. Transfer learning is a framework for specifying and refining this knowledge about sets of source (training) and target (prediction) data. A challenging open problem is addressing the empirical phenomenon of negative transfer, whereby the transfer learner performs worse on the target data after taking the source data into account than before. We first introduce a Bayesian perspective on negative transfer, and then a method to address it. The key insight from our formulation is that negative transfer can stem from misspecified prior information about non-transferable causes of the source data. Our proposed method, proxy-informed robust method for probabilistic transfer learning (PROMPT), does not require prior knowledge of the source data (the data sources may be "unknown"). PROMPT is thus applicable when differences between tasks are unobserved, such as in the presence of latent confounders. Moreover, the learner need not have access to observations in the target task (may not have the ability to "fine-tune"), and instead makes use of proxy (indirect) information. Our theoretical results show that the threat of negative transfer does not depend on the informativeness of the proxy information, highlighting the usefulness of PROMPT in cases where only noisy indirect information, such as human feedback, is available.

ID: 512

Nonparametric Bayesian Multi-Facet Clustering for Longitudinal Data

Luwei Wang, Kieran Richards, Sohan Seth
[OpenReview]

Abstract

Complex real-world time series data are inherently multi-faceted, e.g., temporal data can be described by seasonality and trend. Popular clustering methods typically aggregate information from all facets, treating them collectively rather than individually. This aggregation may diminish the interpretability of clusters by obscuring the specific contributions of individual facets to the clustering outcome. This limitation can be addressed by multi-facet clustering that builds a separate clustering model for each facet simultaneously. In this paper, we explore Bayesian multi-facet clustering modelling for temporal data using nonparametric priors to select an appropriate number of clusters automatically and using variational inference to efficiently explore the parameter space. We apply this framework to nonlinear growth models and vector autoregressive models and observe their performance through simulation studies. We apply these models to real-world time series data from the English Longitudinal Study of Ageing (ELSA), highlighting its utility in identifying meaningful and interpretable clusters. These findings underscore the potential of the framework for advancing the analysis of multi-faceted longitudinal data in diverse fields. Code is available at \href{https://github.com/Demi-wlw/Nonparametric-Bayesian-Multi-Facet-Clustering-for-Longitudinal-Data.git}{GitHub}.

ID: 514

Causal Discovery for Linear Non-Gaussian Models with Disjoint Cycles

Mathias Drton, Marina Garrote-López, Niko Nikov, Elina Robeva, Y. Samuel Wang
[OpenReview]

Abstract

The paradigm of linear structural equation modeling readily allows one to incorporate causal feedback loops in the model specification. These appear as directed cycles in the common graphical representation of the models. However, the presence of cycles entails difficulties such as the fact that models need no longer be characterized by conditional independence relations. As a result, learning cyclic causal structures remains a challenging problem. In this paper, we offer new insights on this problem in the context of linear non-Gaussian models. First, we precisely characterize when two directed graphs determine the same linear non-Gaussian model. Next, we take up a setting of cycle-disjoint graphs, for which we are able to show that simple quadratic and cubic polynomial relations among low-order moments of a non-Gaussian distribution allow one to locate source cycles. Complementing this with a strategy of decorrelating cycles and multivariate regression allows one to infer a block-topological order among the directed cycles, which leads to a consistent and computationally efficient algorithm for learning causal structures with disjoint cycles.

ID: 515

Guaranteed Prediction Sets for Functional Surrogate Models

Ander Gray, Vignesh Gopakumar, Sylvain Rousseau, Sebastien Destercke
[OpenReview]

Abstract

We propose a method for obtaining statistically guaranteed prediction sets for functional machine learning methods: surrogate models which map between function spaces, motivated by the need to build reliable PDE emulators. The method constructs nested prediction sets on a low-dimensional representation (an SVD) of the surrogate model's error, and then maps these sets to the prediction space using set-propagation techniques. This results in prediction sets for functional surrogate models with conformal prediction coverage guarantees. We use zonotopes as basis of the set construction, which allow an exact linear propagation and are closed under Cartesian products, making them well-suited to this high-dimensional problem. The method is model agnostic and can thus be applied to complex Sci-ML models, including Neural Operators, but also in simpler settings. We also introduce a technique to capture the truncation error of the SVD, preserving the guarantees of the method.

ID: 516

Fast Calculation of Feature Contributions in Boosting Trees

Zhongli Jiang, Min Zhang, Dabao Zhang
[OpenReview]

Abstract

Recently, several fast algorithms have been proposed to decompose predicted value into Shapley values, enabling individualized feature contribution analysis in tree models. While such local decomposition offers valuable insights, it underscores the need for a global evaluation of feature contributions. Although coefficients of determination ($R^2$) allow for comparative assessment of individual features, individualizing $R^2$ is challenged by the underlying quadratic losses. To address this, we propose Q-SHAP, an efficient algorithm that reduces the computational complexity of calculating Shapley values for quadratic losses to polynomial time. Our extensive simulations show that Q-SHAP not only improves computational efficiency but also enhances the accuracy of feature-specific $R^2$ estimates.

ID: 517

Adaptive Reward Design for Reinforcement Learning

Minjae Kwon, Ingy ElSayed-Aly, Lu Feng
[OpenReview]

Abstract

There is a surge of interest in using formal languages such as Linear Temporal Logic (LTL) to precisely and succinctly specify complex tasks and derive reward functions for Reinforcement Learning (RL). However, existing methods often assign sparse rewards (e.g., giving a reward of 1 only if a task is completed and 0 otherwise). By providing feedback solely upon task completion, these methods fail to encourage successful subtask completion. This is particularly problematic in environments with inherent uncertainty, where task completion may be unreliable despite progress on intermediate goals. To address this limitation, we propose a suite of reward functions that incentivize an RL agent to complete a task specified by an LTL formula as much as possible, and develop an adaptive reward shaping approach that dynamically updates reward functions during the learning process. Experimental results on a range of benchmark RL environments demonstrate that the proposed approach generally outperforms baselines, achieving earlier convergence to a better policy with higher expected return and task completion rate.

ID: 523

Multi-Cost-Bounded Reachability Analysis of POMDPs

Alexander Bork, Joost-Pieter Katoen, Tim Quatmann, Svenja Stein
[OpenReview]

Abstract

We consider multi-dimensional cost-bounded reachability probability objectives for partially observable Markov decision processes (POMDPs). The goal is to compute the maximal probability to reach a set of target states while simultaneously satisfying specified bounds on incurred costs. Such objectives generalise well-studied POMDP objectives by allowing multiple upper and lower bounds on different cost or reward measures, e.g. to naturally model scenarios where an agent acts under limited resources. We present a reduction of the multi-cost-bounded problem to unbounded reachability probabilities on an unfolding of the original POMDP. We employ a refined approach in case the agent is cost-aware—i.e., collected costs are fully observed—and also consider a setting where only partial information about the collected costs is known. Our approaches elegantly lift existing results from the fully observable MDP case to POMDPs. An empirical evaluation shows the potential of analysing POMDPs under multi-cost-bounded reachability objectives in practical settings.

ID: 529

Guiding Time-Varying Generative Models with Natural Gradients on Exponential Family Manifold

Song Liu, Leyang Wang, Yakun Wang
[OpenReview]

Abstract

Optimising probabilistic models is a well-studied field in statistics. However, its connection with the training of generative models remains largely under-explored. In this paper, we show that the evolution of time-varying generative models can be projected onto an exponential family manifold, naturally creating a link between the parameters of a generative model and those of a probabilistic model. We then train the generative model by moving its projection on the manifold according to the natural gradient descent scheme. This approach also allows us to efficiently approximate the natural gradient of the KL divergence without relying on MCMC for intractable models. Furthermore, we propose particle versions of the algorithm, which feature closed-form update rules for any parametric model within the exponential family. Through toy and real-world experiments, we validate the effectiveness of the proposed algorithms. The code of the proposed algorithms can be found at \url{https://github.com/anewgithubname/iNGD}.

ID: 534

Scalable Bayesian Low-Rank Adaptation of Large Language Models via Stochastic Variational Subspace Inference

Colin Samplawski, Adam D. Cobb, Manoj Acharya, Ramneet Kaur, Susmit Jha
[OpenReview]

Abstract

Despite their widespread use, large language models (LLMs) are known to hallucinate incorrect information and be poorly calibrated. This makes the uncertainty quantification of these models of critical importance, especially in high-stakes domains, such as autonomy and healthcare. Prior work has made Bayesian deep learning-based approaches to this problem more tractable by performing inference over the low-rank adaptation (LoRA) parameters of a fine-tuned model. While effective, these approaches struggle to scale to larger LLMs due to requiring further additional parameters compared to LoRA. In this work we present $\textbf{Scala}$ble $\textbf{B}$ayesian $\textbf{L}$ow-Rank Adaptation via Stochastic Variational Subspace Inference (ScalaBL). We perform Bayesian inference in an $r$-dimensional subspace, for LoRA rank $r$. By repurposing the LoRA parameters as projection matrices, we are able to map samples from this subspace into the full weight space of the LLM. This allows us to learn all the parameters of our approach using stochastic variational inference. Despite the low dimensionality of our subspace, we are able to achieve competitive performance with state-of-the-art approaches while only requiring ${\sim}1000$ additional parameters. Furthermore, it allows us to scale up to the largest Bayesian LLM to date, with four times as a many base parameters as prior work.

ID: 540

Instance-Wise Monotonic Calibration by Constrained Transformation

Yunrui Zhang, Gustavo Enrique Batista, Salil S. Kanhere
[OpenReview]

Abstract

Deep neural networks often produce miscalibrated probability estimates, leading to overconfident predictions. A common approach for calibration is fitting a post-hoc calibration map on unseen validation data that transforms predicted probabilities. A key desirable property of the calibration map is instance-wise monotonicity (i.e., preserving the ranking of probability outputs). However, most existing post-hoc calibration methods do not guarantee monotonicity. Previous monotonic approaches either use an under-parameterized calibration map with limited expressive ability or rely on black-box neural networks, which lack interpretability and robustness. In this paper, we propose a family of novel monotonic post-hoc calibration methods, which employs a constrained calibration map parameterized linearly with respect to the number of classes. Our proposed approach ensures expressiveness, robustness, and interpretability while preserving the relative ordering of the probability output by formulating the proposed calibration map as a constrained optimization problem. Our proposed methods achieve state-of-the-art performance across datasets with different deep neural network models, outperforming existing calibration methods while being data and computation-efficient. Our code is available at https://github.com/YunruiZhang/Calibration-by-Constrained-Transformation

ID: 542

Distributionally and Adversarially Robust Logistic Regression via Intersecting Wasserstein Balls

Aras Selvi, Eleonora Kreacic, Mohsen Ghassemi, Vamsi K. Potluru, Tucker Balch, Manuela Veloso
[OpenReview]

Abstract

Adversarially robust optimization (ARO) has emerged as the *de facto* standard for training models that hedge against adversarial attacks in the test stage. While these models are robust against adversarial attacks, they tend to suffer severely from overfitting. To address this issue, some successful methods replace the empirical distribution in the training stage with alternatives including *(i)* a worst-case distribution residing in an ambiguity set, resulting in a distributionally robust (DR) counterpart of ARO; *(ii)* a mixture of the empirical distribution with a distribution induced by an auxiliary (*e.g.*, synthetic, external, out-of-domain) dataset. Inspired by the former, we study the Wasserstein DR counterpart of ARO for logistic regression and show it admits a tractable convex optimization reformulation. Adopting the latter setting, we revise the DR approach by intersecting its ambiguity set with another ambiguity set built using the auxiliary dataset, which offers a significant improvement whenever the Wasserstein distance between the data generating and auxiliary distributions can be estimated. We study the underlying optimization problem, develop efficient solution algorithms, and demonstrate that the proposed method outperforms benchmark approaches on standard datasets.

ID: 543

Just Trial Once: Ongoing Causal Validation of Machine Learning Models

Jacob M. Chen, Michael Oberst
[OpenReview]

Abstract

Machine learning (ML) models are increasingly used as decision-support tools in high-risk domains. Evaluating the causal impact of deploying such models can be done with a randomized controlled trial (RCT) that randomizes users to ML vs. control groups and assesses the effect on relevant outcomes. However, ML models are inevitably updated over time, and we often lack evidence for the causal impact of these updates. While the causal effect could be repeatedly validated with ongoing RCTs, such experiments are expensive and time-consuming to run. In this work, we present an alternative solution: using only data from a prior RCT, we give conditions under which the causal impact of a new ML model can be precisely bounded or estimated, even if it was not included in the RCT. Our assumptions incorporate two realistic constraints: ML predictions are often deterministic, and their impacts depend on user trust in the model. Based on our analysis, we give recommendations for trial designs that maximize our ability to assess future versions of an ML model. Our hope is that our trial design recommendations will save practitioners time and resources while allowing for quicker deployments of updates to ML models.

ID: 544

Transparent Trade-offs between Properties of Explanations

Hiwot Belay Tadesse, Alihan Hüyük, Yaniv Yacoby, Weiwei Pan, Finale Doshi-Velez
[OpenReview]

Abstract

When explaining machine learning models, it is important for explanations to have certain properties like faithfulness, robustness, smoothness, low complexity, etc. However, many properties are in tension with each other, making it challenging to achieve them simultaneously. For example, reducing the complexity of an explanation can make it less expressive, compromising its faithfulness. The ideal balance of trade-offs between properties tends to vary across different tasks and users. Motivated by these varying needs, we aim to find explanations that make optimal trade-offs while allowing for transparent control over the balance between different properties. Unlike existing methods that encourage desirable properties implicitly through their design, our approach optimizes explanations explicitly for a linear mixture of multiple properties. By adjusting the mixture weights, users can control the balance between those properties and create explanations with precisely what is needed for their particular task.

ID: 546

Evasion Attacks Against Bayesian Predictive Models

Pablo G. Arce, Roi Naveiro, David Ríos Insua
[OpenReview]

Abstract

There is an increasing interest in analyzing the behavior of machine learning systems against adversarial attacks. However, most of the research in adversarial machine learning has focused on studying weaknesses against evasion or poisoning attacks to predictive models in classical setups, with the susceptibility of Bayesian predictive models to attacks remaining underexplored. This paper introduces a general methodology for designing optimal evasion attacks against such models. We investigate two adversarial objectives: perturbing specific point predictions and altering the entire posterior predictive distribution. For both scenarios, we propose novel gradient-based attacks and study their implementation and properties in various computational setups.

ID: 552

Off-policy Predictive Control with Causal Sensitivity Analysis

Myrl G Marmarelis, Ali Hasan, Kamyar Azizzadenesheli, R. Michael Alvarez, Anima Anandkumar
[OpenReview]

Abstract

Predictive models are often deployed for decision-making tasks for which they were not explicitly trained. When only partial observations of the relevant state are available, as in most real-world applications, there is a strong possibility of hidden confounding. Therefore, partial observability often makes the outcome of an action unidentifiable, and could render a model's predictions unreliable for action planning. We present an identification bound and propose an algorithm to account for hidden confounding during model-predictive control. To that end, we introduce a generalized causal sensitivity model for action-state dynamics. We place a constraint on the hidden confounding between trajectories of future actions and states, enabling sharp bounds on interventional outcomes. Unlike previous sensitivity models, ours accommodates hidden confounding with memory, while maintaining computational and statistical tractability. We benchmark on a wide variety of multivariate stochastic differential equations with arbitrary confounding. The results suggest that a calibrated sensitivity model helps controllers achieve higher rewards.

ID: 554

Accurate and Scalable Stochastic Gaussian Process Regression via Learnable Coreset-based Variational Inference

Mert Ketenci, Adler J Perotte, Noémie Elhadad, Iñigo Urteaga
[OpenReview]

Abstract

We introduce a novel stochastic variational inference method for Gaussian process ($\mathcal{GP}$) regression, by deriving a posterior over a learnable set of coresets: i.e., over pseudo-input/output, weighted pairs. Unlike former free-form variational families for stochastic inference, our coreset-based variational $\mathcal{GP}$ (CVGP) is defined in terms of the $\mathcal{GP}$ prior and the (weighted) data likelihood. This formulation naturally incorporates inductive biases of the prior, and ensures its kernel and likelihood dependencies are shared with the posterior. We derive a variational lower-bound on the log-marginal likelihood by marginalizing over the latent $\mathcal{GP}$ coreset variables, and show that CVGP's lower-bound is amenable to stochastic optimization. CVGP reduces the dimensionality of the variational parameter search space to linear $\mathcal{O}(M)$ complexity, while ensuring numerical stability at $\mathcal{O}(M^3)$ time complexity and $\mathcal{O}(M^2)$ space complexity. Evaluations on real-world and simulated regression problems demonstrate that CVGP achieves superior inference and predictive performance than state-of-the-art, stochastic sparse $\mathcal{GP}$ approximation methods.

ID: 557

Learning Robust XGBoost Ensembles for Regression Tasks

Atri Vivek Sharma, Panagiotis Kouvaros, Alessio Lomuscio
[OpenReview]

Abstract

Methods to improve the adversarial robustness of tree-based ensemble models for classification tasks have received significant attention in recent years. In this work, we propose a novel method for training robust tree-based boosted ensembles applicable to any task that employs a differentiable loss function, leveraging the XGBoost framework. Our work introduces an analytical solution to the upper-bound of the robust loss function, that can be computed in constant time, enabling the construction of robust splits without sacrificing computational efficiency. Although our method is general, we focus its application on regression tasks, extending conventional regression metrics to better quantify model robustness. An extensive evaluation on 19 regression datasets from a widely-used tabular data benchmark demonstrates that in the face of adversarial perturbations in the input space, our proposed method results in ensembles that are up to 44% more robust compared to the present SoA and 113% more robust than the conventional XGBoost model when considering norm bounded attacks of radius 0.05.

ID: 573

FDR-SVM: A Federated Distributionally Robust Support Vector Machine via a Mixture of Wasserstein Balls Ambiguity Set

Michael Ibrahim, Heraldo Rozas, Nagi Gebraeel, Weijun Xie
[OpenReview]

Abstract

We study a federated classification problem over a network of multiple clients and a central server, in which each client’s local data remains private and is subject to uncertainty in both the features and labels. To address these uncertainties, we develop a novel Federated Distributionally Robust Support Vector Machine (FDR-SVM), robustifying the classification boundary against perturbations in local data distributions. Specifically, the data at each client is governed by a unique true distribution that is unknown. To handle this heterogeneity, we develop a novel Mixture of Wasserstein Balls (MoWB) ambiguity set, naturally extending the classical Wasserstein ball to the federated setting. We then establish theoretical guarantees for our proposed MoWB, deriving an out-of-sample performance bound and showing that its design preserves the separability of the FDR-SVM optimization problem. Next, we rigorously derive two algorithms that solve the FDR-SVM problem and analyze their convergence behavior as well as their worst-case time complexity. We evaluate our algorithms on industrial data and various UCI datasets, whereby we demonstrate that they frequently outperform existing state-of-the-art approaches.

ID: 576

Adversarial Training May Induce Deteriorating Distributions

Runzhi Tian, Yongyi Mao
[OpenReview]

Abstract

The interactions between the update of model parameters and the update of perturbation operators complicate the dynamics of adversarial training (AT). This paper reveals a surprising behavior in AT, namely that the distribution induced by adversarial perturbations during AT becomes progressively more difficult to learn. We derived a generalization bound to theoretically attribute this behavior to the increasing of a quantity associated with the perturbation operator, namely, its local dispersion. We corroborate this explanation with concrete experimental validations and show that this deteriorating behavior of the induced distributions is correlated with robust overfitting of AT.

ID: 590

Relational Causal Discovery with Latent Confounders

Matteo Negro, Andrea Piras, Ragib Ahsan, David Arbour, Elena Zheleva
[OpenReview]

Abstract

Estimating causal effects from real-world relational data can be challenging when the underlying causal model and potential confounders are unknown. While several causal discovery algorithms exist for learning causal models with latent confounders from data, they assume that the data is independent and identically distributed (i.i.d.) and are not well-suited for learning from relational data. Similarly, existing relational causal discovery algorithms assume causal sufficiency, which is unrealistic for many real-world datasets. To address this gap, we propose RelFCI, a sound and complete causal discovery algorithm for relational data with latent confounders. Our work builds upon the Fast Causal Inference (FCI) and Relational Causal Discovery (RCD) algorithms and it defines new graphical models, necessary to support causal discovery in relational domains. We also establish soundness and completeness guarantees for relational d-separation with latent confounders. We present experimental results demonstrating the effectiveness of RelFCI in identifying the correct causal structure in relational causal models with latent confounders.

ID: 595

DF$^2$: Distribution-Free Decision-Focused Learning

Lingkai Kong, Wenhao Mu, Jiaming Cui, Yuchen Zhuang, B. Aditya Prakash, Bo Dai, Chao Zhang
[OpenReview]

Abstract

Decision-focused learning (DFL), which differentiates through the KKT conditions, has recently emerged as a powerful approach for predict-then-optimize problems. However, under probabilistic settings, DFL faces three major bottlenecks: model mismatch error, sample average approximation error, and gradient approximation error. Model mismatch error stems from the misalignment between the model's parameterized predictive distribution and the true probability distribution. Sample average approximation error arises when using finite samples to approximate the expected optimization objective. Gradient approximation error occurs when the objectives are non-convex and KKT conditions cannot be directly applied. In this paper, we present \ours—the first \textit{distribution-free} decision-focused learning method designed to mitigate these three bottlenecks. Rather than depending on a task-specific forecaster that requires precise model assumptions, our method directly learns the expected optimization function during training. To efficiently learn the function in a data-driven manner, we devise an attention-based model architecture inspired by the distribution-based parameterization of the expected objective. We evaluate \ours on two synthetic problems and three real-world problems, demonstrating the effectiveness of \ours. Our code can be found at: https://github.com/Lingkai-Kong/DF2.

ID: 602

Tuning Algorithmic and Architectural Hyperparameters in Graph-Based Semi-Supervised Learning with Provable Guarantees

Ally Yalei Du, Eric Huang, Dravyansh Sharma
[OpenReview]

Abstract

Graph-based semi-supervised learning is a powerful paradigm in machine learning for modeling and exploiting the underlying graph structure that captures the relationship between labeled and unlabeled data. A large number of classical as well as modern deep learning based algorithms have been proposed for this problem, often having tunable hyperparameters. We initiate a formal study of tuning algorithm hyperparameters from parameterized algorithm families for this problem. We obtain novel $O(\log n)$ pseudo-dimension upper bounds for hyperparameter selection in three classical label propagation-based algorithm families, where $n$ is the number of nodes, implying bounds on the amount of data needed for learning provably good parameters. We further provide matching $\Omega(\log n)$ pseudo-dimension lower bounds, thus asymptotically characterizing the learning-theoretic complexity of the parameter tuning problem. We extend our study to selecting architectural hyperparameters in modern graph neural networks. We bound the Rademacher complexity for tuning the self-loop weighting in recently proposed Simplified Graph Convolution (SGC) networks. We further propose a tunable architecture that interpolates graph convolutional neural networks (GCN) and graph attention networks (GAT) in every layer, and provide Rademacher complexity bounds for tuning the interpolation coefficient.

ID: 612

Best Arm Identification with Possibly Biased Offline Data

Le Yang, Vincent Y. F. Tan, Wang Chi Cheung
[OpenReview]

Abstract

We study the best arm identification (BAI) problem with potentially biased offline data in the fixed confidence setting, which commonly arises in real-world scenarios such as clinical trials. We prove an impossibility result for adaptive algorithms without prior knowledge of the bias bound between online and offline distributions. To address this, we propose the LUCB-H algorithm, which introduces adaptive confidence bounds by incorporating an auxiliary bias correction to balance offline and online data within the LUCB framework. Theoretical analysis shows that LUCB-H matches the sample complexity of standard LUCB when offline data is misleading and significantly outperforms it when offline data is helpful. We also derive an instance-dependent lower bound that matches the upper bound of LUCB-H in certain scenarios. Numerical experiments further demonstrate the robustness and adaptability of LUCB-H in effectively incorporating offline data.

ID: 615

FedSPD: A Soft-clustering Approach for Personalized Decentralized Federated Learning

I-Cheng Lin, Osman Yagan, Carlee Joe-Wong
[OpenReview]

Abstract

Federated learning has recently gained popularity as a framework for distributed clients to collaboratively train a machine learning model using local data. While traditional federated learning relies on a central server for model aggregation, recent advancements adopt a decentralized framework, enabling direct model exchange between clients and eliminating the single point of failure. However, existing decentralized frameworks often assume all clients train a shared model. Personalizing each client's model can enhance performance, especially with heterogeneous client data distributions. We propose FedSPD, an efficient personalized federated learning algorithm for the decentralized setting, and show that it learns accurate models in low-connectivity networks. To provide theoretical guarantees on convergence, we introduce a clustering-based framework that enables consensus on models for distinct data clusters while personalizing to unique mixtures of these clusters at different clients. This flexibility, allowing selective model updates based on data distribution, substantially reduces communication costs compared to prior work on personalized federated learning in decentralized settings. Experimental results on real-world datasets show that FedSPD outperforms multiple decentralized variants of existing personalized federated learning algorithms in scenarios with low-connectivity networks.

ID: 616

CATE Estimation With Potential Outcome Imputation From Local Regression

Ahmed Aloui, Juncheng Dong, Cat P. Le, Vahid Tarokh
[OpenReview]

Abstract

One of the most significant challenges in Conditional Average Treatment Effect (CATE) estimation is the statistical discrepancy between distinct treatment groups. To address this issue, we propose a model-agnostic data augmentation method for CATE estimation. First, we derive regret bounds for general data augmentation methods suggesting that a small imputation error may be necessary for accurate CATE estimation. Inspired by this idea, we propose a contrastive learning approach that reliably imputes missing potential outcomes for a selected subset of individuals formed using a similarity measure. We augment the original dataset with these reliable imputations to reduce the discrepancy between different treatment groups while inducing minimal imputation error. The augmented dataset can subsequently be employed to train standard CATE estimation models. We provide both theoretical guarantees and extensive numerical studies demonstrating the effectiveness of our approach in improving the accuracy and robustness of numerous CATE estimation models.

ID: 618

Optimal Transport for Probabilistic Circuits

Adrian Ciotinga, YooJung Choi
[OpenReview]

Abstract

We introduce a novel optimal transport framework for probabilistic circuits (PCs). While it has been shown recently that divergences between distributions represented as certain classes of PCs can be computed tractably, to the best of our knowledge, there is no existing approach to compute the Wasserstein distance between probability distributions given by PCs. We propose a Wasserstein-type distance that restricts the coupling measure of the associated optimal transport problem to be a probabilistic circuit. We then develop an algorithm for computing this distance by solving a series of small linear programs and derive the circuit conditions under which this is tractable. Furthermore, we show that we can easily retrieve the optimal transport plan between the PCs from the solutions to these linear programs. Lastly, we study the empirical Wasserstein distance between a PC and a dataset, and show that we can estimate the PC parameters to minimize this distance through an efficient iterative algorithm.

ID: 619

Collapsing Sequence-Level Data-Policy Coverage via Poisoning Attack in Offline Reinforcement Learning

Xue Zhou, Dapeng Man, Chen Xu, Fanyi Zeng, Tao Liu, Huan Wang, Shucheng He, Chaoyang Gao, Wu Yang
[OpenReview]

Abstract

Offline reinforcement learning (RL) heavily relies on the coverage of pre-collected data over the target policy's distribution. Existing studies aim to improve data-policy coverage to mitigate distributional shifts, but overlook security risks from insufficient coverage, and the single-step analysis is not consistent with the multi-step decision-making nature of offline RL. To address this, we introduce the sequence-level concentrability coefficient to quantify coverage, and reveal its exponential amplification on the upper bound of estimation errors through theoretical analysis. Building on this, we propose the Collapsing Sequence-Level Data-Policy Coverage (CSDPC) poisoning attack. Considering the continuous nature of offline RL data, we convert state-action pairs into decision units, and extract representative decision patterns that capture multi-step behavior. We identify rare patterns likely to cause insufficient coverage, and poison them to reduce coverage and exacerbate distributional shifts. Experiments show that poisoning just 1% of the dataset can degrade agent performance by 90%. This finding provides new perspectives for analyzing and safeguarding the security of offline RL.

ID: 621

Conditional Average Treatment Effect Estimation Under Hidden Confounders

Ahmed Aloui, Juncheng Dong, Ali Hasan, Vahid Tarokh
[OpenReview]

Abstract

One of the major challenges in estimating conditional potential outcomes and conditional average treatment effects (CATE) is the presence of hidden confounders. Since testing for hidden confounders cannot be accomplished only with observational data, conditional unconfoundedness is commonly assumed in the literature of CATE estimation. Nevertheless, under this assumption, CATE estimation can be significantly biased due to the effects of unobserved confounders. In this work, we consider the case where in addition to a potentially large observational dataset, a small dataset from a randomized controlled trial (RCT) is available. Notably, we make no assumptions on the existence of any covariate information for the RCT dataset, we only require the outcomes to be observed. We propose a CATE estimation method based on a pseudo-confounder generator and a CATE model that aligns the learned potential outcomes from the observational data with those observed from the RCT. Our method is applicable to many practical scenarios of interest, particularly those where privacy is a concern (e.g., medical applications). Extensive numerical experiments are provided demonstrating the effectiveness of our approach for both synthetic and real-world datasets.

ID: 628

Learning from Label Proportions and Covariate-shifted Instances

Sagalpreet Singh, Navodita Sharma, Shreyas Havaldar, Rishi Saket, Aravindan Raghuveer
[OpenReview]

Abstract

In many applications, especially due to lack of supervision or privacy concerns, the training data is grouped into bags of instances (feature-vectors) and for each bag we have only an aggregate label derived from the instance-labels in the bag. In learning from label proportions (LLP) the aggregate label is the average of the instance-labels in a bag, and a significant body of work has focused on training models in the LLP setting to predict instance-labels. In practice however, the training data may have fully supervised albeit covariate-shifted source data, along with the usual target data with bag-labels, and we wish to train a good instance-level predictor on the target domain. We call this the covariate-shifted hybrid LLP problem. Fully supervised covariate shifted data often has useful training signals and the goal is to leverage them for better predictive performance in the hybrid LLP setting. To achieve this, we develop methods for hybrid LLP which naturally incorporate the target bag-labels along with the source instance-labels, in the domain adaptation framework. Apart from proving theoretical guarantees bounding the target generalization error, we also conduct experiments on several publicly available datasets showing that our methods outperform LLP and domain adaptation baselines as well techniques from previous related work.

ID: 630

Mixup Regularization: A Probabilistic Perspective

Yousef El-Laham, Niccolo Dalmasso, Svitlana Vyetrenko, Vamsi K. Potluru, Manuela Veloso
[OpenReview]

Abstract

In recent years, mixup regularization has gained popularity as an effective way to improve the generalization performance of deep learning models by training on convex combinations of training data. While many mixup variants have been explored, the proper adoption of the technique to conditional density estimation and probabilistic machine learning remains relatively unexplored. This work introduces a novel framework for mixup regularization based on probabilistic fusion that is better suited for conditional density estimation tasks. For data distributed according to a member of the exponential family, we show that likelihood functions can be analytically fused using log-linear pooling. We further propose an extension of probabilistic mixup, which allows for fusion of inputs at an arbitrary intermediate layer of the neural network. We provide a theoretical analysis comparing our approach to standard mixup variants. Empirical results on synthetic and real datasets demonstrate the benefits of our proposed framework compared to existing mixup variants.

ID: 633

Measuring IIA Violations in Similarity Choices with Bayesian Models

Hugo Sales Correa, Suryanarayana Sankagiri, Daniel R. Figueiredo, Matthias Grossglauser
[OpenReview]

Abstract

Similarity choice data occur when humans make choices among alternatives based on their similarity to a target, \emph{e.g.}, in the context of information retrieval and in embedding learning settings. Classical metric-based models of similarity choice assume independence of irrelevant alternatives (IIA), a property that allows for a simpler formulation. While IIA violations have been detected in many discrete choice settings, the similarity choice setting has received scant attention. This is because the target-dependent nature of the choice complicates IIA testing. We propose two statistical methods to test for IIA: a classical goodness-of-fit test and a Bayesian counterpart based on the framework of Posterior Predictive Checks (PPC). This Bayesian approach, our main technical contribution, quantifies the degree of IIA violation beyond its mere significance. We curate two datasets: one with choice sets designed to elicit IIA violations, and another with randomly generated choice sets from the same item universe. Our tests confirmed significant IIA violations on both datasets, and notably, we find a comparable degree of violation between them. Further, we devise a new PPC test for population homogeneity. Results show that the population is indeed homogenous, suggesting that the IIA violations are driven by context effects---specifically, interactions within the choice sets. These results highlight the need for new similarity choice models that account for such context effects.

ID: 634

Budget Allocation Exploiting Label Correlation between Instances

Adithya Kulkarni, Mohna Chakraborty, Sihong Xie, Qi Li
[OpenReview]

Abstract

In this study, we introduce an innovative budget allocation method for graph instance annotation in crowdsourcing environments, where both the labels of instances and their correlations are unknown and need to be estimated simultaneously. We model the budget allocation task as a Markov Decision Process (MDP) and develop an optimization framework that minimizes the uncertainties associated with instance labeling and correlation estimation while adhering to budget constraints. To quantify uncertainty, we employ entropy and derive two strategies: OPTUENT-EXP and OPTUENT-OPT. Our reward function further considers the impact of a worker's label on the entire graph. We conducted extensive experiments using four real-world graph datasets, simulating worker labeling behavior to showcase the effectiveness of our approach. Experimental results demonstrate that our proposed approach can accurately estimate correlations between adjacent nodes while significantly reducing labeling costs. Moreover, across four real-world datasets, our proposed approach consistently outperforms existing baselines in moderate and high budget scenarios, highlighting its robustness and practical scalability.

ID: 635

HDP-Flow: Generalizable Bayesian Nonparametric Model for Time Series State Discovery

Sana Tonekaboni, Tina Behrouzi, Addison Weatherhead, Emily Fox, David Blei, Anna Goldenberg
[OpenReview]

Abstract

We introduce HDP-Flow, a Bayesian nonparametric (BNP) model for unsupervised state discovery in dynamic, non-stationary time series data. Unlike prior work that assumes fixed states, HDPFlow models evolving datasets with unknown and variable latent states. By integrating the adaptability of BNP models with the expressive power of normalizing flows, HDP-Flow effectively models dynamic, non-stationary patterns, while learning transferable states across datasets with wellcalibrated uncertainty. We propose a scalable variational algorithm to enable efficient inference, addressing the limitations of traditional sampling-based BNP methods. HDP-Flow outperforms existing approaches in latent state identification and provides probabilistic insight into state distributions and transition dynamics. Evaluating HDP-Flow across two wearable datasets demonstrates transferability of states across diverse sub-populations, validating its robustness and generalizability.

ID: 651

Improved Uncertainty Quantification in Physics-Informed Neural Networks Using Error Bounds and Solution Bundles

Pablo Flores, Olga Graf, Pavlos Protopapas, Karim Pichara
[OpenReview]

Abstract

Physics-Informed Neural Networks (PINNs) have been widely used to obtain solutions to various physical phenomena modeled as Differential Equations. As PINNs are not naturally equipped with mechanisms for Uncertainty Quantification, some work has been done to quantify the different uncertainties that arise when dealing with PINNs. In this paper, we use a two-step procedure to train Bayesian Neural Networks that provide uncertainties over the solutions to differential equation systems provided by PINNs. We use available error bounds over PINNs to formulate a heteroscedastic variance that improves the uncertainty estimation. Furthermore, we solve forward problems and utilize the obtained uncertainties when doing parameter estimation in inverse problems in cosmology.

ID: 657

Robust Optimization with Diffusion Models for Green Security

Lingkai Kong, Haichuan Wang, Yuqi Pan, Cheol Woo Kim, Mingxiao Song, Alayna Nguyen, Tonghan Wang, Haifeng Xu, Milind Tambe
[OpenReview]

Abstract

In green security, defenders must forecast adversarial behavior—such as poaching, illegal logging, and illegal fishing—to plan effective patrols. These behavior are often highly uncertain and complex. Prior work has leveraged game theory to design robust patrol strategies to handle uncertainty, but existing adversarial behavior models primarily rely on Gaussian processes or linear models, which lack the expressiveness needed to capture intricate behavioral patterns. To address this limitation, we propose a conditional diffusion model for adversary behavior modeling, leveraging its strong distribution-fitting capabilities. To the best of our knowledge, this is the first application of diffusion models in the green security domain. Integrating diffusion models into game-theoretic optimization, however, presents new challenges, including a constrained mixed strategy space and the need to sample from an unnormalized distribution to estimate utilities. To tackle these challenges, we introduce a mixed strategy of mixed strategies and employ a twisted Sequential Monte Carlo (SMC) sampler for accurate sampling. Theoretically, our algorithm is guaranteed to converge to an \(\epsilon\)-equilibrium with high probability using a finite number of iterations and samples. Empirically, we evaluate our approach on both synthetic and real-world poaching datasets, demonstrating its effectiveness.

ID: 662

Bayesian Optimization with Inexact Acquisition: Is Random Grid Search Sufficient?

Hwanwoo Kim, Chong Liu, Yuxin Chen
[OpenReview]

Abstract

Bayesian optimization (BO) is a widely used iterative algorithm for optimizing black-box functions. Each iteration requires maximizing an acquisition function, such as the upper confidence bound (UCB) or a sample path from the Gaussian process (GP) posterior, as in Thompson sampling (TS). However, finding an exact solution to these maximization problems is often intractable and computationally expensive. Reflecting such realistic situations, in this paper, we delve into the effect of inexact maximizers of the acquisition functions. Defining a measure of inaccuracy in acquisition solutions, we establish cumulative regret bounds for both GP-UCB and GP-TS without requiring exact solutions of acquisition function maximization. Our results show that under appropriate conditions on accumulated inaccuracy, inexact BO algorithms can still achieve sublinear cumulative regret. Motivated from such findings, we provide both theoretical justification and numerical validation for random grid search as an effective and computationally efficient acquisition function solver.

ID: 664

A Unified Data Representation Learning for Non-parametric Two-sample Testing

Xunye Tian, Liuhua Peng, Zhijian Zhou, Mingming Gong, Arthur Gretton, Feng Liu
[OpenReview]

Abstract

Learning effective data representations has been crucial in non-parametric two-sample testing. Common approaches will first split data into training and test sets and then learn data representations purely on the training set. However, recent theoretical studies have shown that, as long as the sample indexes are not used during the learning process, the whole data can be used to learn data representations, meanwhile ensuring control of Type-I errors. The above fact motivates us to use the test set (but without sample indexes) to facilitate the data representation learning in the testing. To this end, we propose a representation-learning two-sample testing (RL-TST) framework. RL-TST first performs purely self-supervised representation learning on the entire dataset to capture inherent representations (IRs) that reflect the underlying data manifold. A discriminative model is then trained on these IRs to learn discriminative representations (DRs), enabling the framework to leverage both the rich structural information from IRs and the discriminative power of DRs. Extensive experiments demonstrate that RL-TST outperforms representative approaches by simultaneously using data manifold information in the test set and enhancing test power via finding the DRs with the training set.

ID: 668

What is the Right Notion of Distance between Predict-then-Optimize Tasks?

Paula Rodriguez-Diaz, Lingkai Kong, Kai Wang, David Alvarez-Melis, Milind Tambe
[OpenReview]

Abstract

Comparing datasets is a fundamental task in machine learning, essential for various learning paradigms—from evaluating train and test datasets for model generalization to using dataset similarity for detecting data drift. While traditional notions of dataset distances offer principled measures of similarity, their utility has largely been assessed through prediction error minimization. However, in Predict-then-Optimize (PtO) frameworks, where predictions serve as inputs for downstream optimization tasks, model performance is measured through decision regret rather than prediction error. In this work, we propose OTD$^3$ (Optimal Transport Decision-aware Dataset Distance), a novel dataset distance that incorporates downstream decisions in addition to features and labels. We show that traditional feature-label distances lack informativeness in PtO settings, while OTD$^3$ more effectively captures adaptation success. We also derive a PtO-specific adaptation bound based on this distance. Empirically, we show that our proposed distance accurately predicts model transferability across three different PtO tasks from the literature. Code is available at https://github.com/paularodr/OTD3

ID: 672

Constraint-based Causal Discovery from a Collection of Conditioning Sets

Kenneth Lee, Bruno Ribeiro, Murat Kocaoglu
[OpenReview]

Abstract

In constraint-based causal discovery, the existing algorithms systematically use a series of conditional independence (CI) relations observed in the data to recover an equivalence class of causal graphs in the large sample limit. One limitation of these algorithms is that CI tests lose statistical power as conditioning set size increases with finite samples. Recent research proposes to limit the conditioning set size for robust causal discovery. However, the existing algorithms require exhaustive testing of all CI relations with conditioning set sizes up to a certain integer $k$. This becomes problematic in practice when variables with large support are present, as it makes CI tests less reliable due to near-deterministic relationships, thereby violating the faithfulness assumption. To address this issue, we propose a causal discovery algorithm that only uses CI tests where the conditioning sets are restricted to a given set of conditioning sets including the empty set $\mathcal{C}$. We call such set of CI relations ${\mathcal{I}}_{\mathcal{C}}$ conditionally closed. We define the notion of $\mathcal{C}$-Markov equivalence: two causal graphs are $\mathcal{C}$-Markov equivalent if they entail the same set of CI constraints from ${\mathcal{I}}_\mathcal{C}$. We propose a graphical representation of $\mathcal{C}$-Markov equivalence and characterize such equivalence between two causal graphs. Our proposed algorithm called the $\mathcal{C}$-PC algorithm is sound for learning the $\mathcal{C}$-Markov equivalence class. We demonstrate the utility of the proposed algorithm via synthetic and real-world experiments in scenarios where variables with large support or high correlation are present in the data.

ID: 674

Augmenting Online RL with Offline Data is All You Need: A Unified Hybrid RL Algorithm Design and Analysis

Ruiquan Huang, Donghao Li, Chengshuai Shi, Cong Shen, Jing Yang
[OpenReview]

Abstract

This paper investigates a hybrid learning framework for reinforcement learning (RL) in which the agent can leverage both an offline dataset and online interactions to learn the optimal policy. We present a unified algorithm and analysis and show that augmenting confidence-based online RL algorithms with the offline dataset outperforms any pure online or offline algorithm alone and achieves state-of-the-art results under two learning metrics, i.e., sub-optimality gap and online learning regret. Specifically, we show that our algorithm achieves a sub-optimality gap $\tilde{O}( \sqrt{1/(N\_0/ \mathtt{C}(\pi^\star| \rho)+N\_1} ) )$, where $\mathtt{C}(\pi^\star|\rho)$ is a new concentrability coefficient, $N\_0$ and $N\_1$ are the numbers of offline and online samples, respectively. For regret minimization, we show that it achieves a constant $\tilde{O}( \sqrt{N\_1/(N\_0/\mathtt{C}(\pi^{-}|\rho)+N\_1)} )$ speed-up compared to pure online learning, where $\mathtt{C}(\pi^-|\rho)$ is the concentrability coefficient over all sub-optimal policies. Our results also reveal an interesting separation on the desired coverage properties of the offline dataset for sub-optimality gap minimization and regret minimization. We further validate our theoretical findings in several experiments in special RL models such as linear contextual bandits and Markov decision processes (MDPs).

ID: 676

Weak to Strong Learning from Aggregate Labels

Yukti Makhija, Rishi Saket
[OpenReview]

Abstract

In learning from aggregate labels, the training data consists of sets or ``bags'' of feature-vectors (instances) along with an aggregate label for each bag derived from the (usually $\\{0,1\\}$-valued) labels of its constituent instances. In *learning from label proportions* (LLP), the aggregate label of a bag is the average of the instance labels, whereas in *multiple instance learning* (MIL) it is the OR i.e., disjunction. The goal is to train an instance-level predictor that maximizes the accuracy which is the fraction of *satisfied* bags i.e., those on which the model's induced labels are consistent with the target aggregate label. A weak learner in this context is one which has at a constant accuracy $ < 1$ on the training bags, while a strong learner's accuracy can be arbitrarily close to $1$. We study the problem of using a weak learner on such training bags with aggregate labels to obtain a strong learner. In a novel result, our work proves the impossibility of boosting in the LLP setting using weak learners of any accuracy $< 1$ by constructing a collection of bags for which such weak learners (for any weight assignment) exist, while not admitting any strong learner. A variant of this construction also rules out boosting in MIL for a non-trivial range of weak learner accuracy. In the LLP setting however, we show that a weak learner (with small accuracy) on large enough bags can in fact be used to obtain a strong learner for small bags, in polynomial time. We also provide more efficient, sampling based variant of our procedure with probabilistic guarantees which are empirically validated on three real and two synthetic datasets.

ID: 679

Collaborative Prediction: To Join or To Disjoin Datasets

Kyung Rok Kim, Yansong Wang, Xiaocheng Li, Guanting Chen
[OpenReview]

Abstract

With the recent rise of generative Artificial Intelligence (AI), the need of selecting high-quality dataset to improve machine learning models has garnered increasing attention. However, some part of this topic remains underexplored, even for simple prediction models. In this work, we study the problem of developing practical algorithms that select appropriate dataset to minimize population loss of our prediction model with high probability. Broadly speaking, we investigate when datasets from different sources can be effectively merged to enhance the predictive model's performance, and propose a practical algorithm with theoretical guarantees. By leveraging an oracle inequality and data-driven estimators, the algorithm reduces population loss with high probability. Numerical experiments demonstrate its effectiveness in both standard linear regression and broader deep learning applications.

ID: 687

Using Submodular Optimization to Approximate Minimum-Size Abductive Path Explanations for Tree-Based Models

Louenas Bounia
[OpenReview]

Abstract

One of the key challenges of Explainable Artificial Intelligence (XAI) is providing concise and understandable explanations for classification model predictions. An abductive explanation for a given instance is a minimal set of features that justify the prediction. These minimal explanations are valuable for their interpretability, as they eliminate redundant or irrelevant information. However, computing these explanations is difficult, even for simpler classifiers like decision trees. Finding a minimum-size abductive explanation in decision trees is an NP-complete problem, and this complexity extends to random forests for minimum-size majoritary reasons. In this work, we focus on finding minimal sets of features along the paths leading to the decision, called path-abductive explanations. We show that the problem of finding minimum-size path-abductive explanations in decision trees and minimum-size path-majoritary reasons in random forests is also NP-complete. To address this, we reformulate the problem as a submodular optimization task and propose a greedy algorithm with optimality guarantees. Our experiments demonstrate that this algorithm produces near-optimal explanations efficiently and offers a strong alternative for difficult instances, where exact methods based on SAT encodings are computationally expensive. This approach is especially useful in resource-limited environments where modern SAT solvers are not feasible.

ID: 700

Asymptotically Optimal Linear Best Feasible Arm Identification with Fixed Budget

Jie Bian, Vincent Y. F. Tan
[OpenReview]

Abstract

The challenge of identifying the optimal feasible arm within a fixed budget has attracted considerable interest in recent years. However, a notable gap remains in the literature: the exact exponential rate at which the error probability approaches zero has yet to be established, even in the relatively simple setting of $K$-armed bandits with Gaussian noise. In this paper, we address this gap by examining the problem within the context of linear bandits. We introduce a novel algorithm for best feasible arm identification that guarantees an exponential decay in the error probability. Remarkably, the decay rate—characterized by the exponent—matches the theoretical lower bound derived using information-theoretic principles. Our approach leverages a posterior sampling framework embedded within a game-based sampling rule involving a min-learner and a max-learner. This strategy shares its foundations with Thompson sampling, but is specifically tailored to optimize the identification process under fixed-budget constraints. Furthermore, we validate the effectiveness of our algorithm through comprehensive empirical evaluations across various problem instances with different levels of complexity. The results corroborate our theoretical findings and demonstrate that our method outperforms several benchmark algorithms in terms of both accuracy and efficiency.

ID: 703

A Trajectory-Based Bayesian Approach to Multi-Objective Hyperparameter Optimization with Epoch-Aware Trade-Offs

Wenyu Wang, Zheyi Fan, Szu Hui Ng
[OpenReview]

Abstract

Training machine learning models inherently involves a resource-intensive and noisy iterative learning procedure that allows epoch-wise monitoring of the model performance. However, the insights gained from the iterative learning procedure typically remain underutilized in multi-objective hyperparameter optimization scenarios. Despite the limited research in this area, existing methods commonly identify the trade-offs only at the end of model training, overlooking the fact that trade-offs can emerge at earlier epochs in cases such as overfitting. To bridge this gap, we propose an enhanced multi-objective hyperparameter optimization problem that treats the number of training epochs as a decision variable, rather than merely an auxiliary parameter, to account for trade-offs at an earlier training stage. To solve this problem and accommodate its iterative learning, we then present a trajectory-based multi-objective Bayesian optimization algorithm characterized by two features: 1) a novel acquisition function that captures the improvement along the predictive trajectory of model performances over epochs for any hyperparameter setting and 2) a multi-objective early stopping mechanism that determines when to terminate the training to maximize epoch efficiency. Experiments on synthetic simulations and hyperparameter tuning benchmarks demonstrate that our algorithm can effectively identify the desirable trade-offs while improving tuning efficiency.

ID: 711

Computationally Efficient Methods for Invariant Feature Selection with Sparsity

Jane Du, Arindam Banerjee
[OpenReview]

Abstract

Invariant Risk Minimization (IRM) (Arjovsky et al., 2020) proposes an optimization scheme that uses causal features to improve generalization. However, in most realizations, it does not have an explicit feature selection strategy. Prior investigation (Rosenfeld et al., 2020; Zhang et al., 2023) reveals failure cases when searching for causal features, and in light of these concerns, recent work has demonstrated the promise of using sparsity (Zhou et al., 2022; Fan et al., 2024) in IRM, and we make two specific contributions on that theme. First, for the original sparse IRM formulation, we present the first correct non-asymptotic analysis of the effectiveness of sparsity for selecting invariant features. We show that sparse IRM with $L_0$ constraints can select invariant features and ignore spurious and random features. We show that sample complexity depends polynomially on the number of invariant features and otherwise logarithmically on the ambient dimensionality. Second, we present the first invariant feature recovery guarantees with a computationally-efficient implementation of such sparse IRM based on iterative hard thresholding. Prior methods are limited to combinatorially searching over the space of all sparse models, but we present a different loss function. We show this new optimization implies recovery of invariant features under standard assumptions. We present empirical results on standard benchmark datasets to demonstrate the effectiveness and efficiency of the proposed sparse IRM models.

ID: 713

Pure and Strong Nash Equilibrium Computation in Compactly Representable Aggregate Games

Jared Soundy, Mohammad T. Irfan, Hau Chan
[OpenReview]

Abstract

Aggregate games model interdependent decision making when an agent’s utility depends on their own choice and the aggregation of everyone's choices. We define a compactly representable subclass of aggregate games we call additive aggregate games, which encompasses popular games like congestion games, anonymous games, Schelling games, etc. We study computational questions on pure Nash equilibrium (PNE) and pure strong Nash equilibrium (SNE). We show that PNE existence is NP-complete for very simple cases of additive aggregate games. We devise an efficient algorithmic scheme for deciding the existence of a PNE and computing one (if it exists) for bounded aggregate space. We also give an approximation algorithm for a special type of additive aggregate games. For SNE, we show that SNE recognition is co-NP-complete and SNE existence is $\Sigma^P_2$-complete, even for simple types of additive aggregate games. For broad classes, we provide several novel and efficient aggregate-space algorithms for recognizing an SNE and deciding the existence of an SNE. Finally, we connect our results to several popular classes of games and show how our computational schemes can shed new light on these games.

ID: 717

On Continuous Monitoring of Risk Violations under Unknown Shift

Alexander Timans, Rajeev Verma, Eric Nalisnick, Christian A. Naesseth
[OpenReview]

Abstract

Machine learning systems deployed in the real world must operate under dynamic and often unpredictable distribution shifts. This challenges the validity of statistical safety assurances on the system's risk established beforehand. Common risk control frameworks rely on fixed assumptions and lack mechanisms to continuously monitor deployment reliability. In this work, we propose a general framework for the real-time monitoring of risk violations in evolving data streams. Leveraging the `testing by betting' paradigm, we propose a sequential hypothesis testing procedure to detect violations of bounded risks associated with the model's decision-making mechanism, while ensuring control on the false alarm rate. Our method operates under minimal assumptions on the nature of encountered shifts, rendering it broadly applicable. We illustrate the effectiveness of our approach by monitoring risks in outlier detection and set prediction under a variety of shifts.

ID: 718

ODD: Overlap-aware Estimation of Model Performance under Distribution Shift

Aayush Mishra, Anqi Liu
[OpenReview]

Abstract

Reliable and accurate estimation of the error of an ML model in unseen test domains is an important problem for safe intelligent systems. Prior work uses \textit{disagreement discrepancy} (\disdis) to derive practical error bounds under distribution shifts. It optimizes for a maximally disagreeing classifier on the target domain to bound the error of a given source classifier. Although this approach offers a reliable and competitively accurate estimate of the target error, we identify a problem in this approach which causes the disagreement discrepancy objective to compete in the overlapping region between source and target domains. With an intuitive assumption that the target disagreement should be no more than the source disagreement in the overlapping region due to high enough support, we devise Overlap-aware Disagreement Discrepancy (\odd). Our \odd-based bound uses domain-classifiers to estimate domain-overlap and better predicts target performance than \disdis. We conduct experiments on a wide array of benchmarks to show that our method improves the overall performance-estimation error while remaining valid and reliable. Our code and results are available on \href{https://github.com/aamixsh/odd}{GitHub}.

ID: 719

Improving Graph Contrastive Learning with Community Structure

Xiang Chen, Kun Yue, Liang Duan, Lixing Yu
[OpenReview]

Abstract

Graph contrastive learning (GCL) has demonstrated remarkable success in training graph neural networks (GNNs) by distinguishing positive and negative node pairs without human labeling. However, existing GCL methods often suffer from two limitations: the repetitive message-passing mechanism in GNNs and the quadratic computational complexity of exhaustive node pair sampling in loss function. To address these issues, we propose an efficient and effective GCL framework that leverages community structure rather than relying on the intricate node-to-node adjacency information. Inspired by the concept of sparse low-rank approximation of graph diffusion matrices, our model delivers node messages to the corresponding communities instead of individual neighbors. By exploiting community structures, our method significantly improves GCL efficiency by reducing the number of node pairs needed for contrastive loss calculation. Furthermore, we theoretically prove that our model effectively captures essential structure information for downstream tasks. Extensive experiments conducted on real-world datasets illustrate that our method not only achieves the state-of-the-art performance but also substantially reduces time and memory consumption compared with other GCL methods. Our code is available at [https://github.com/chenx-hi/IGCL-CS](https://github.com/chenx-hi/IGCL-CS).

ID: 720

Reparameterizing Hybrid Markov Logic Networks to handle Covariate-Shift in Representations

Anup Shakya, Abisha Thapa Magar, Somdeb Sarkhel, Deepak Venugopal
[OpenReview]

Abstract

We utilize Hybrid Markov Logic Networks (HMLNs) to combine embeddings learned from a Deep Neural Network (DNN) with symbolic relational knowledge. Since a DNN may not always learn optimal embeddings, we develop a mixture model to reduce variance in the HMLN parameterization. Further, we perform inference in our model that is robust to covariate shifts that may occur in the DNN embeddings by reparameterizing the HMLN. We evaluate our approach on Graph Neural Networks and show that our approach outperforms state-of-the-art methods that combine relational knowledge with DNN embeddings when we introduce covariate shifts in the embeddings. Further, we demonstrate the utility of our approach in inferring latent student knowledge in a cognitive model called Deep Knowledge Tracing.

ID: 721

Aggregating Data for Optimal Learning

Sushant Agarwal, Yukti Makhija, Rishi Saket, Aravindan Raghuveer
[OpenReview]

Abstract

Multiple Instance Regression (MIR) and Learning from Label Proportions (LLP) are useful learning frameworks, where the training data is partitioned into disjoint sets or bags, and only an aggregate label, i.e., bag-label for each bag is available to the learner. In the case of MIR, the bag-label is the label of an undisclosed instance from the bag, while in LLP, the bag-label is the mean of the bag's labels. In this paper, we study for various loss functions in MIR and LLP, what is the optimal way to partition the dataset into bags such that the utility for downstream tasks like linear regression is maximized. We theoretically provide utility guarantees, and show that in each case, the optimal bagging strategy (approximately) reduces to finding an optimal clustering of the feature vectors and/or the labels with respect to natural objectives such as $k$-means. We also show that our bagging mechanisms can be made label-differentially private, incurring an additional utility error. We then generalize our results to the setting of Generalized Linear Models (GLMs). Finally, we experimentally validate our theoretical results.

ID: 724

Concept Forgetting via Label Annealing

Subhodip Panda, Ananda Theertha Suresh, Atri Guha, Prathosh AP
[OpenReview]

Abstract

The effectiveness of current machine learning models relies on their ability to grasp diverse concepts present in datasets. However, biased and noisy data can inadvertently cause these models to learn certain undesired concepts, undermining their ability to generalize and provide utility. Consequently, modifying a trained model to forget these concepts becomes imperative for their responsible deployment. We refer to this problem as *concept forgetting*. Our goal is to develop techniques for forgetting specific undesired concepts from a pre-trained classification model's prediction. To achieve this goal, we present an algorithm called **L**abel **AN**nealing (**LAN**). This iterative algorithm employs a two-stage method for each iteration. In the first stage, pseudo-labels are assigned to all the samples by annealing or redistributing the original labels based on the predictions of the model in the current iteration. During the second stage, the model is fine-tuned on this pseudo-labeled dataset generated from the first stage. We illustrate the effectiveness of the proposed algorithms across various models and datasets. Our method reduces *concept violation*, a metric that measures how much the model forgets specific concepts, by about 85.35% on the MNIST dataset, 73.25% on the CIFAR-10 dataset, and 69.46% on the CelebA dataset while maintaining high model accuracy.

ID: 725

Dynamic Maintenance of Kernel Density Estimation Data Structure: From Practice to Theory

Jiehao Liang, Zhao Song, Zhaozhuo Xu, Junze Yin, Danyang Zhuo
[OpenReview]

Abstract

Kernel density estimation (KDE) stands out as a challenging task in machine learning. The problem is defined in the following way: given a kernel function $f(x,y)$ and a set of points $\{x_1, x_2, \cdots, x_n \} \subset \mathbb{R}^d$, we would like to compute $\frac{1}{n}\sum_{i=1}^{n} f(x_i,y)$ for any query point $y \in \mathbb{R}^d$. Recently, there has been a growing trend of using data structures for efficient KDE. However, the proposed KDE data structures focus on static settings. The robustness of KDE data structures over dynamic changing data distributions is not addressed. In this work, we focus on the dynamic maintenance of KDE data structures with robustness to adversarial queries. Especially, we provide a theoretical framework of KDE data structures. In our framework, the KDE data structures only require subquadratic spaces. Moreover, our data structure supports the dynamic update of the dataset in sublinear time. Furthermore, we can perform adaptive queries with the potential adversary in sublinear time.

ID: 734

Lower Bounds on the Size of Markov Equivalence Classes

Erik L Jahn, Frederick Eberhardt, Leonard Schulman
[OpenReview]

Abstract

Causal discovery algorithms typically recover causal graphs only up to their Markov equivalence classes unless additional parametric assumptions are made. The sizes of these equivalence classes reflect the limits of what can be learned about the underlying causal graph from purely observational data. Under the assumptions of acyclicity, causal sufficiency, and a uniform model prior, Markov equivalence classes are known to be small on average. In this paper, we show that this is no longer the case when any of these assumptions is relaxed. Specifically, we prove exponentially large lower bounds for the expected size of Markov equivalence classes in three settings: sparse random directed acyclic graphs, uniformly random acyclic directed mixed graphs, and uniformly random directed cyclic graphs.

ID: 736

ELBO, regularized maximum likelihood, and their common one-sample approximation for training stochastic neural networks

Sina Däubener, Simon Damm, Asja Fischer
[OpenReview]

Abstract

Monte Carlo approximations are central to the training of stochastic neural networks in general, and Bayesian neural networks (BNNs) in particular. We observe that the common one-sample approximation of the standard training objective can be viewed both as maximizing the Evidence Lower Bound (ELBO) and as maximizing a regularized log-likelihood of a compound distribution. This latter approach differs from the ELBO only in the order of the logarithm and expectation, and is theoretically grounded in PAC-Bayes theory. We argue theoretically and demonstrate empirically that training with the regularized maximum likelihood increases prediction variance, enhancing performance in misspecified settings, adversarial robustness, and strengthening out-of-distribution (OOD) detection. Our findings help reconcile previous contradictions in the literature by providing a detailed analysis of how training objectives and Monte Carlo sample sizes affect uncertainty quantification in stochastic neural networks.

ID: 740

Stein Variational Evolution Strategies

Cornelius V. Braun, Robert Tjarko Lange, Marc Toussaint
[OpenReview]

Abstract

Efficient global optimization and sampling are fundamental challenges, particularly in fields such as robotics and reinforcement learning, where gradients may be unavailable or unreliable. In this context, jointly optimizing multiple solutions is a promising approach to avoid local optima. While Stein Variational Gradient Descent (SVGD) provides a powerful framework for sampling diverse solutions, its reliance on first-order information limits its applicability to differentiable objectives. Existing gradient-free SVGD variants often suffer from slow convergence, and poor scalability. To improve gradient-free sampling and optimization, we propose Stein Variational CMA-ES, a novel gradient-free SVGD-like method that combines the efficiency of evolution strategies with SVGD-based repulsion forces. We perform an extensive empirical evaluation across several domains, which shows that the integration of the ES update in SVGD significantly improves the performance on multiple challenging benchmark problems. Our findings establish SV-CMA-ES as a scalable method for zero-order sampling and blackbox optimization, bridging the gap between SVGD and evolution strategies.

ID: 741

Proximal Interacting Particle Langevin Algorithms

Paula Cordero Encinar, Francesca Romana Crucinio, O. Deniz Akyildiz
[OpenReview]

Abstract

We introduce a class of algorithms, termed proximal interacting particle Langevin algorithms (PIPLA), for inference and learning in latent variable models whose joint probability density is non-differentiable. Leveraging proximal Markov chain Monte Carlo techniques and interacting particle Langevin algorithms, we propose three algorithms tailored to the problem of estimating parameters in a non-differentiable statistical model. We prove non-asymptotic bounds for the parameter estimates produced by the different algorithms in the strongly log-concave setting and provide comprehensive numerical experiments on various models to demonstrate the effectiveness of the proposed methods. In particular, we demonstrate the utility of our family of algorithms for sparse Bayesian logistic regression, training of sparse Bayesian neural networks or neural networks with non-differentiable activation functions, image deblurring, and sparse matrix completion. Our theory and experiments together show that PIPLA family can be the de facto choice for parameter estimation problems in non-differentiable latent variable models.

ID: 742

SPvR: Structured Pruning via Ranking

Atif Hassan, Jiaul H. Paik, Swanand Khare
[OpenReview]

Abstract

Deep neural networks have achieved state-of-the-art performance in multiple domains but are increasingly resource-intensive, limiting their deployment on constrained devices. We introduce Structured Pruning via Ranking (SPvR), a novel structured pruning approach to address this challenge for classification tasks. SPvR prunes pre-trained networks in terms of function composition and network width while adhering to a user-specified parameter budget. Our method leverages local grouping and global ranking modules to generate smaller yet effective networks tailored to a given dataset and model. Finally, we train the pruned networks from scratch, instead of fine-tuning. Our evaluations demonstrate that SPvR significantly surpasses existing state-of-the-art pruning methods on benchmark datasets, using standard architectures. Even with a $90$% reduction in size, SPvR's sub-networks experience a minimal drop in test accuracy $(<1$%$)$ while on ImageNet1K, we outperform all baselines by achieving $<1$% Top-5 accuracy drop when pruning $70$% of ResNet50 parameters. Additionally, when compared to MobileNetV3, an SPvR pruned network improves the Top-1 accuracy by $3.3$% with $20$% less parameters. Furthermore, we empirically show that SPvR achieves reduced inference latency, underscoring its practical benefits for deploying neural networks on resource-constrained devices.

ID: 752

CP$^2$: Leveraging Geometry for Conformal Prediction via Canonicalization

Putri A Van der Linden, Alexander Timans, Erik J Bekkers
[OpenReview]

Abstract

We study the problem of *conformal prediction* (CP) under geometric data shifts, where data samples are susceptible to transformations such as rotations or flips. While CP endows prediction models with *post-hoc* uncertainty quantification and formal coverage guarantees, their practicality breaks under distribution shifts that deteriorate model performance. To address this issue, we propose integrating geometric information—such as geometric pose—into the conformal procedure to reinstate its guarantees and ensure robustness under geometric shifts. In particular, we explore recent advancements on pose *canonicalization* as a suitable information extractor for this purpose. Evaluating the combined approach across discrete and continuous shifts and against equivariant and augmentation-based baselines, we find that integrating geometric information with CP yields a principled way to address geometric shifts while maintaining broad applicability to black-box predictors.

ID: 754

Causal Inference amid Missingness-Specific Independences and Mechanism Shifts

Johan de Aguas, Leonard Henckel, Johan Pensar, Guido Biele
[OpenReview]

Abstract

The recovery of causal effects in structural models with missing data often relies on $m$-graphs, which assume that missingness mechanisms do not directly influence substantive variables. Yet, in many real-world settings, missing data can alter decision-making processes, as the absence of key information may affect downstream actions and states. To overcome this limitation, we introduce $lm$-SCMs and $lm$-graphs, which extend $m$-graphs by integrating a label set that represents relevant context-specific independencies (CSI), accounting for mechanism shifts induced by missingness. We define two causal effects within these systems: the Full Average Treatment Effect (FATE), which reflects the effect in a hypothetical scenario had no data been missing, and the Natural Average Treatment Effect (NATE), which captures the effect under the unaltered CSIs in the system. We propose recovery criteria for these queries and present doubly-robust estimators for a graphical model inspired by a real-world application. Simulations highlight key differences between these estimands and estimation methods. Findings from the application case suggest a small effect of ADHD treatment upon test achievement among Norwegian children, with a slight effect shift due to missing pre-tests scores.

ID: 755

SALSA: A Secure, Adaptive and Label-Agnostic Scalable Algorithm for Machine Unlearning

Owais Makroo, Atif Hassan, Swanand Khare
[OpenReview]

Abstract

Machine Learning as a Service (MLaaS) has simplified access to powerful machine learning models but faces challenges in complying with the “right to be forgotten” while resisting adversarial threats. Machine Unlearning (MU) addresses these issues by enabling selective data removal from models. However, existing methods are slow, label-dependent, vulnerable to black-box attacks, and computationally impractical for large-scale MLaaS deployments. We introduce SALSA, a Secure, Adaptive, Label-Agnostic, Scalable Algorithm for efficient and robust machine unlearning tailored to classification tasks in MLaaS. SALSA redistributes the class-wise predicted probabilities of data to be forgotten and optimizes a novel loss function that minimizes the divergence between redistributed and predicted probabilities while anchoring model parameters near their initialization. This ensures simultaneous unlearning and generalization. SALSA requires neither labels nor access to the remaining data, making it ideal for MLaaS environments. It is exceptionally fast, achieving at least $25\times$ faster unlearning, on average, than the fastest baseline, while consistently outperforming five state-of-the-art MU techniques across eight metrics on benchmark datasets. Experiments on synthetic data show that SALSA’s altered decision boundaries closely approximate exact unlearning. Rigorous evaluations against state-of-the-art black-box attacks demonstrate its resilience to security threats. Thus, SALSA redefines practical machine unlearning, offering a scalable and resilient solution for safeguarding privacy in modern MLaaS systems.

ID: 758

Hindsight Merging: Diverse Data Generation with Language Models

Veniamin Veselovsky, Benedikt Stroebl, Gianluca Bencomo, Dilip Arumugam, Lisa Schut, Arvind Narayanan, Thomas L. Griffiths
[OpenReview]

Abstract

Pre-training a language model equips it with a broad understanding of the world, while fine- tuning refines it into a helpful assistant. However, fine-tuning does not exclusively enhance task- specific behaviors but also suppresses some of the beneficial variability from pre-training. This reduction in diversity is partly due to the optimization process, which theoretically decreases model entropy in exchange for task performance. To counteract this, we introduce hindsight merging, a technique that combines a fine-tuned model with a previous training checkpoint using linear interpolation to restore entropy and improve performance. Hindsight-merged models retain strong instruction-following capabilities and alignment while displaying increased diversity present in the base model. Additionally, this results in improved inference scaling, achieving a consistent 20-50% increase in pass@10 relative to the instruction tuned model across a coding benchmark and series of models. Our findings suggest that hindsight merging is an effective strategy for generating diverse generations that follow instructions.

ID: 763

InfoDPCCA: Information-Theoretic Dynamic Probabilistic Canonical Correlation Analysis

Shiqin Tang, Shujian Yu
[OpenReview]

Abstract

Extracting meaningful latent representations from high-dimensional sequential data is a crucial challenge in machine learning, with applications spanning natural science and engineering. We introduce InfoDPCCA, a dynamic probabilistic Canonical Correlation Analysis (CCA) framework designed to model two interdependent sequences of observations. InfoDPCCA leverages a novel information-theoretic objective to extract a shared latent representation that captures the mutual structure between the data streams and balances representation compression and predictive sufficiency while also learning separate latent components that encode information specific to each sequence. Unlike prior dynamic CCA models, such as DPCCA, our approach explicitly enforces the shared latent space to encode only the mutual information between the sequences, improving interpretability and robustness. We further introduce a two-step training scheme to bridge the gap between information-theoretic representation learning and generative modeling, along with a residual connection mechanism to enhance training stability. Through experiments on synthetic and medical fMRI data, we demonstrate that InfoDPCCA excels as a tool for representation learning. Code of InfoDPCCA is available at https://github.com/marcusstang/InfoDPCCA.

ID: 773

On the Privacy Risks of Spiking Neural Networks: A Membership Inference Analysis

Junyi Guan, Abhijith Sharma, Chong Tian, Salem Lahlou
[OpenReview]

Abstract

Spiking Neural Networks (SNNs) are increasingly explored for their energy efficiency and robustness in real-world applications, yet their privacy risks remain largely unexamined. In this work, we investigate the susceptibility of SNNs to Membership Inference Attacks (MIAs)—a major privacy threat where an adversary attempts to determine whether a given sample was part of the training dataset. While prior work suggests that SNNs may offer inherent robustness due to their discrete, event-driven nature, we find that its resilience diminishes as latency (T) increases. Furthermore, we introduce an input dropout strategy under black box setting, that significantly enhances membership inference in SNNs. Our findings challenge the assumption that SNNs are inherently more secure, and even though they are expected to be better, our results reveal that SNNs exhibit privacy vulnerabilities that are equally comparable to Artificial Neural Networks (ANNs).

ID: 777

Expert-In-The-Loop Causal Discovery: Iterative Model Refinement Using Expert Knowledge

Ankur Ankan, Johannes Textor
[OpenReview]

Abstract

Many researchers construct directed acyclic graph (DAG) models manually based on domain knowledge. Although numerous causal discovery algorithms were developed to automatically learn DAGs and other causal models from data, these remain challenging to use due to their tendency to produce results that contradict domain knowledge, among other issues. Here we propose a hybrid, iterative structure learning approach that combines domain knowledge with data-driven insights to assist researchers in constructing DAGs. Our method leverages conditional independence testing to iteratively identify variable pairs where an edge is either missing or superfluous. Based on this information, we can choose to add missing edges with appropriate orientation based on domain knowledge or remove unnecessary ones. We also give a method to rank these missing edges based on their impact on the overall model fit. In a simulation study, we find that this iterative approach to leverage domain knowledge already starts outperforming purely data-driven structure learning if the orientation of new edge is correctly determined in at least two out of three cases. We present a proof-of-concept implementation using a large language model as a domain expert and a graphical user interface designed to assist human experts with DAG construction.

ID: 779

Error Bounds for Physics-Informed Neural Networks in Fokker-Planck PDEs

Chun-Wei Kong, Luca Laurenti, Jay McMahon, Morteza Lahijanian
[OpenReview]

Abstract

Stochastic differential equations are commonly used to describe the evolution of stochastic processes. The state uncertainty of such processes is best represented by the probability density function (PDF), whose evolution is governed by the Fokker-Planck partial differential equation (FP-PDE). However, it is generally infeasible to solve the FP-PDE in closed form. In this work, we show that physics-informed neural networks (PINNs) can be trained to approximate the solution PDF. Our main contribution is the analysis of PINN approximation error: we develop a theoretical framework to construct tight error bounds using PINNs. In addition, we derive a practical error bound that can be efficiently constructed with standard training methods. We discuss that this error-bound framework generalizes to approximate solutions of other linear PDEs. Empirical results on nonlinear, high-dimensional, and chaotic systems validate the correctness of our error bounds while demonstrating the scalability of PINNs and their significant computational speedup in obtaining accurate PDF solutions compared to the Monte Carlo approach.

ID: 780

Nonlinear Causal Discovery for Grouped Data

Konstantin Göbler, Tobias Windisch, Mathias Drton
[OpenReview]

Abstract

Inferring cause-effect relationships from observational data has gained significant attention in recent years, but most methods are limited to scalar random variables. In many important domains, including neuroscience, psychology, social science, and industrial manufacturing, the causal units of interest are groups of variables rather than individual scalar measurements. Motivated by these applications, we extend nonlinear additive noise models to handle random vectors, establishing a two-step approach for causal graph learning: First, infer the causal order among random vectors. Second, perform model selection to identify the best graph consistent with this order. We introduce effective and novel solutions for both steps in the vector case, demonstrating strong performance in simulations. Finally, we apply our method to real-world assembly line data with partial knowledge of causal ordering among variable groups.

ID: 782

Fast Non-convex Matrix Sensing with Optimal Sample Complexity

Jian-Feng Cai, Tong Wu, Ruizhe Xia
[OpenReview]

Abstract

We study the problem of recovering an unknown $d_1 \times d_2$ rank-$r$ matrix from $m$ random linear measurements. Convex methods achieve the optimal sample complexity $m = \Omega(r(d_1 + d_2))$ but are computationally expensive. Non-convex approaches, while more computationally efficient, often require suboptimal sample complexity $m = \Omega(r^2(d_1 + d_2))$. Recent advance achieves $m = \Omega(rd_1)$ for a non-convex approach but relies on the restrictive assumption of positive semidefinite (PSD) matrices and suffers from slow convergence in ill-conditioned settings. Bridging this gap, we show that Riemannian gradient descent (RGD) achieves both optimal sample complexity and computational efficiency without requiring the PSD assumption. Specifically, for Gaussian measurements, RGD exactly recovers the low-rank matrix with $m = \Omega(r(d_1 + d_2))$, matching the information-theoretic lower bound, and converges linearly to the global minimum with an arbitrarily small convergence rate.

ID: 786

ELF: Federated Langevin Algorithms with Primal, Dual and Bidirectional Compression

Avetik Karagulyan, Peter Richtárik
[OpenReview]

Abstract

Federated sampling algorithms have recently gained great popularity in the community of machine learning and statistics. This paper proposes a new federated sampling algorithm called Error Feedback Langevin algorithms (ELF). In particular, we analyze the combinations of EF21 and EF21-P with the federated Langevin Monte-Carlo. We propose three algorithms, P-ELF, D-ELF, and B-ELF, that use primal, dual, and bidirectional compressors. We analyze the proposed methods under Log-Sobolev inequality and provide non-asymptotic convergence guarantees. Simple experimental results support our theoretical findings.

ID: 790

Contrast-CAT: Contrasting Activations for Enhanced Interpretability in Transformer-based Text Classifiers

Sungmin Han, Jeonghyun Lee, Sangkyun Lee
[OpenReview]

Abstract

Transformers have profoundly influenced AI research, but explaining their decisions remains challenging -- even for relatively simpler tasks such as classification -- which hinders trust and safe deployment in real-world applications. Although activation-based attribution methods effectively explain transformer-based text classification models, our findings reveal that these methods can be undermined by class-irrelevant features within activations, leading to less reliable interpretations. To address this limitation, we propose Contrast-CAT, a novel activation contrast-based attribution method that refines token-level attributions by filtering out class-irrelevant features. By contrasting the activations of an input sequence with reference activations, Contrast-CAT generates clearer and more faithful attribution maps. Experimental results across various datasets and models confirm that Contrast-CAT consistently outperforms state-of-the-art methods. Notably, under the MoRF setting, it achieves average improvements of $\times 1.30$ in AOPC and $\times 2.25$ in LOdds over the most competing methods, demonstrating its effectiveness in enhancing interpretability for transformer-based text classification.

ID: 798

Exploring Exploration in Bayesian Optimization

Leonard Papenmeier, Nuojin Cheng, Stephen Becker, Luigi Nardi
[OpenReview]

Abstract

A well-balanced exploration-exploitation trade-off is crucial for successful acquisition functions in Bayesian optimization. However, there is a lack of quantitative measures for exploration, making it difficult to analyze and compare different acquisition functions. This work introduces two novel approaches – observation traveling salesman distance and observation entropy – to quantify the exploration characteristics of acquisition functions based on their selected observations. Using these measures, we examine the explorative nature of several well-known acquisition functions across a diverse set of black-box problems, uncover links between exploration and empirical performance, and reveal new relationships among existing acquisition functions. Beyond enabling a deeper understanding of acquisition functions, these measures also provide a foundation for guiding their design in a more principled and systematic manner.

ID: 799

Partial-Label Learning with Conformal Candidate Cleaning

Tobias Fuchs, Florian Kalinke
[OpenReview]

Abstract

Real-world data is often ambiguous; for example, human annotation produces instances with multiple conflicting class labels. Partial-label learning (PLL) aims at training a classifier in this challenging setting, where each instance is associated with a set of candidate labels and one correct, but unknown, class label. A multitude of algorithms targeting this setting exists and, to enhance their prediction quality, several extensions that are applicable across a wide range of PLL methods have been introduced. While many of these extensions rely on heuristics, this article proposes a novel enhancing method that incrementally prunes candidate sets using conformal prediction. To work around the missing labeled validation set, which is typically required for conformal prediction, we propose a strategy that alternates between training a PLL classifier to label the validation set, leveraging these predicted class labels for calibration, and pruning candidate labels that are not part of the resulting conformal sets. In this sense, our method alternates between empirical risk minimization and candidate set pruning. We establish that our pruning method preserves the conformal validity with respect to the unknown ground truth. Our extensive experiments on artificial and real-world data show that the proposed approach significantly improves the test set accuracies of several state-of-the-art PLL classifiers.

ID: 801

Offline Changepoint Detection With Gaussian Processes

Janneke Verbeek, Tom Heskes, Yuliya Shapovalova
[OpenReview]

Abstract

This work proposes Segmenting changepoint Gaussian process regression (SegCPGP), an offline changepoint detection method that integrates Gaussian process regression with the changepoint kernel, the likelihood ratio test and binary search. We use the spectral mixture kernel to detect various types of changes without prior knowledge of their type. SegCPGP outperforms state-of-the-art methods when detecting various change types in synthetic datasets; in real world changepoint detection datasets, it performs on par with its competitors. While its hypothesis test shows slight miscalibration, we find SegCPGP remains reasonably reliable.

ID: 803

Causal Models for Growing Networks

Gecia Bravo-Hermsdorff, Kayvan Sadeghi, Lee M. Gunderson
[OpenReview]

Abstract

Real-world networks grow over time; statistical models based on node exchangeability are not appropriate. Instead of constraining the structure of the *distribution* of edges, we propose that the relevant symmetries refer to the *causal structure* between them. We first enumerate the 96 causal directed acyclic graph (DAG) models over pairs of nodes (dyad variables) in a growing network with finite ancestral sets that are invariant to node deletion. We then partition them into 21 classes with ancestral sets that are closed under node marginalization. Several of these classes are remarkably amenable to parallelization. As an example, we highlight a simple model that exhibits flexible power-law degree distributions and emergent phase transitions in sparsity, which we characterize analytically. With few parameters and much conditional independence, our proposed framework provides natural baseline models for causal inference in relational data.

ID: 807

MindFlayer SGD: Efficient Parallel SGD in the Presence of Heterogeneous and Random Worker Compute Times

Arto Maranjyan, Omar Shaikh Omar, Peter Richtárik
[OpenReview]

Abstract

We investigate the problem of minimizing the expectation of smooth nonconvex functions in a distributed setting with multiple parallel workers that are able to compute stochastic gradients. A significant challenge in this context is the presence of arbitrarily heterogeneous and stochastic compute times among workers, which can severely degrade the performance of existing parallel stochastic gradient descent (SGD) methods. While some parallel SGD algorithms achieve optimal performance under deterministic but heterogeneous delays, their effectiveness diminishes when compute times are random—a scenario not explicitly addressed in their design. To bridge this gap, we introduce MindFlayer SGD, a novel parallel SGD method specifically designed to handle stochastic and heterogeneous compute times. Through theoretical analysis and empirical evaluation, we demonstrate that MindFlayer SGD consistently outperforms existing baselines, particularly in environments with heavy-tailed noise. Our results highlight its robustness and scalability, making it a compelling choice for large-scale distributed learning tasks.

ID: 811

Multi-armed Bandits with Missing Outcomes

Ilia Mahrooghi, Mahshad Moradi, Sina Akbari, Negar Kiyavash
[OpenReview]

Abstract

While significant progress has been made in designing algorithms that minimize regret in online decision-making, real-world scenarios often introduce additional complexities, with missing outcomes perhaps among the most challenging ones. Overlooking this aspect or simply assuming random missingness invariably leads to biased estimates of the rewards and may result in linear regret. Despite the practical relevance of this challenge, no rigorous methodology currently exists for systematically handling missingness, especially when the missingness mechanism is not random. In this paper, we address this gap in the context of multi-armed bandits (MAB) with missing outcomes by analyzing the impact of different missingness mechanisms on achievable regret bounds. We introduce algorithms that account for missingness under both missing at random (MAR) and missing not at random (MNAR) models. Through both analytical and simulation studies, we demonstrate the drastic improvements in decision-making by accounting for missingness in these settings.

ID: 818

Approximate Bayesian Inference via Bitstring Representations

Aleksanteri Sladek, Martin Trapp, Arno Solin
[OpenReview]

Abstract

The machine learning community has recently put effort into quantized or low-precision arithmetics to scale large models. This paper proposes performing probabilistic inference in the quantized, discrete parameter space created by these representations, effectively enabling us to learn a continuous distribution using discrete parameters. We consider both 2D densities and quantized neural networks, where we introduce a tractable learning approach using probabilistic circuits. This method offers a scalable solution to manage complex distributions and provides clear insights into model behavior. We validate our approach with various models, demonstrating inference efficiency without sacrificing accuracy. This work advances scalable, interpretable machine learning by utilizing discrete approximations for probabilistic computations.

ID: 819

RCAP: Robust, Class-Aware, Probabilistic Dynamic Dataset Pruning

Atif Hassan, Swanand Khare, Jiaul H. Paik
[OpenReview]

Abstract

Dynamic data pruning techniques aim to reduce computational cost while minimizing information loss, by periodically selecting representative subsets of input data during model training. However, existing methods often struggle to maintain strong worst-group accuracy, particularly at high pruning rates, across balanced and imbalanced datasets. To address this challenge, we propose RCAP, a Robust, Class-Aware, Probabilistic dynamic dataset pruning algorithm for classification tasks. RCAP applies a closed-form solution to estimate the fraction of samples to be included in the training subset for each individual class. This fraction is adaptively adjusted in every epoch using class-wise aggregated loss. Thereafter, it employs an adaptive sampling strategy that prioritizes samples having high loss for populating the class-wise subsets. We evaluate RCAP on six diverse datasets ranging from class-balanced to highly imbalanced using five distinct models across three training paradigms: training from scratch, transfer learning, and fine-tuning. Our approach consistently outperforms state-of-the-art dataset pruning methods, achieving superior worst-group accuracy at all pruning rates. Remarkably, with only $10$% data, RCAP delivers $>1$% improvement in performance on class-imbalanced datasets compared to full data training, while providing an average $8.69\times$ speedup. The code can be accessed at https://github.com/atif-hassan/RCAP-dynamic-dataset-pruning

ID: 820

Beyond Sin-Squared Error: Linear Time Entrywise Uncertainty Quantification for Streaming PCA

Syamantak Kumar, Shourya Pandey, Purnamrita Sarkar
[OpenReview]

Abstract

We propose a novel statistical inference framework for streaming principal component analysis (PCA) using Oja’s algorithm, enabling the construction of confidence intervals for individual entries of the estimated eigenvector. Most existing works on streaming PCA focus on providing sharp sin-squared error guarantees. Recently, there has been some interest in uncertainty quantification for the sin-squared error. However, uncertainty quantification or sharp error guarantees for _entries of the estimated eigenvector_ in the streaming setting remains largely unexplored. We derive a sharp Bernstein-type concentration bound for elements of the estimated vector matching the optimal error rate up to logarithmic factors. We also establish a Central Limit Theorem for a suitably centered and scaled subset of the entries. To efficiently estimate the coordinate-wise variance, we introduce a provably consistent subsampling algorithm that leverages the median-of-means approach, empirically achieving similar accuracy to multiplier bootstrap methods while being significantly more computationally efficient. Numerical experiments demonstrate its effectiveness in providing reliable uncertainty estimates with a fraction of the computational cost of existing methods.

ID: 827

Efficient Algorithms for Logistic Contextual Slate Bandits with Bandit Feedback

Tanmay Goyal, Gaurav Sinha
[OpenReview]

Abstract

We study the Logistic Contextual Slate Bandit problem, where, at each round, an agent selects a slate of $N$ items from an exponentially large set (of size $2^{\Omega(N)}$) of candidate slates provided by the environment. A single binary reward, determined by a logistic model, is observed for the chosen slate. Our objective is to develop algorithms that maximize cumulative reward over $T$ rounds while maintaining low per-round computational costs. We propose two algorithms, Slate-GLM-OFU and Slate-GLM-TS, that accomplish this goal. These algorithms achieve $N^{O(1)}$ per-round time complexity via local planning (independent slot selections), and low regret through global learning (joint parameter estimation). We provide theoretical and empirical evidence supporting these claims. Under a well-studied diversity assumption, we prove that Slate-GLM-OFU incurs only $\tilde{O}(\sqrt{T})$ regret. Extensive experiments across a wide range of synthetic settings demonstrate that our algorithms consistently outperform state-of-the-art baselines, achieving both the lowest regret and the fastest runtime. Furthermore, we apply our algorithm to select in-context examples in prompts of Language Models for solving binary classification tasks such as sentiment analysis. Our approach achieves competitive test accuracy, making it a viable alternative in practical scenarios.

ID: 830

Learning Algorithms for Multiple Instance Regression

Aaryan Gupta, Rishi Saket
[OpenReview]

Abstract

Multiple instance regression, introduced by Ray and Page [2001], is a generalisation of supervised regression in which the training data is available as a bag of feature-vectors (instances) and for each bag there is a bag-label which matches the label of one (unknown) primary instance from that bag. The goal is to compute a hypothesis regressor consistent with the underlying instance-labels. While most works on MIR focused on training models on such training data, computational learnability of MIR was only recently explored by Chauhan et al. [UAI 2024] who showed worst case intractability of properly learning *linear regressors* in MIR by showing a inapproximability bound. However, their work did not rule out efficient algorithms for this problem on natural distributions and randomly chosen labels. In this work we show that it is indeed possible to efficiently learn linear regressors in MIR when given access to random bags of uniformly randomly sampled primary instance chosen as the bag-label in which the feature vectors are independently sampled from Gaussian distributions. This is achieved by optimizing a certain bag-level loss which, via concentration bounds, yields a close approximation to the target linear regressor. Lastly, we show that the bag-level loss is also useful for learning general concepts (e.g. neural networks) in this setting: an optimizer of the loss on sampled bags is, w.h.p. a close approximation of a scaled version of the target function. We include experimental evaluations of our learning algorithms on synthetic and real-world datasets showing that our method outperforms the baseline MIR methods.

ID: 831

Provably Adaptive Average Reward Reinforcement Learning for Metric Spaces

Avik Kar, Rahul Singh
[OpenReview]

Abstract

We study infinite-horizon average-reward reinforcement learning (RL) for Lipschitz MDPs, a broad class that subsumes several important classes such as linear and RKHS MDPs, function approximation frameworks, and develop an adaptive algorithm $\text{ZoRL}$ with regret bounded as $\mathcal{O}\big(T^{1 - d_{\text{eff.}}^{-1}}\big)$, where $d_{\text{eff.}}= 2d_\mathcal{S} + d_z + 3$, $d_\mathcal{S}$ is the dimension of the state space and $d_z$ is the zooming dimension. In contrast, algorithms with fixed discretization yield $d_{\text{eff.}} = 2(d_\mathcal{S} + d_\mathcal{A}) + 2$, $d_\mathcal{A}$ being the dimension of action space. $\text{ZoRL}$ achieves this by discretizing the state-action space adaptively and zooming into ''promising regions'' of the state-action space. $d_z$, a problem-dependent quantity bounded by the state-action space's dimension, allows us to conclude that if an MDP is benign, then the regret of $\text{ZoRL}$ will be small. The zooming dimension and $\text{ZoRL}$ are truly adaptive, i.e., the current work shows how to capture adaptivity gains for infinite-horizon average-reward RL. $\text{ZoRL}$ outperforms other state-of-the-art algorithms in experiments, thereby demonstrating the gains arising due to adaptivity.

ID: 832

Flow-Based Delayed Hawkes Process

Chao Yang, Wendi Ren, Shuang Li
[OpenReview]

Abstract

Multivariate Hawkes processes are classic temporal point process models for event data. These models are simple and parametric in nature, offering interpretability by capturing the triggering effects between event types. However, these parametric models often struggle with low model capacity, limiting their expressive power to capture heterogeneous data patterns influenced by latent variables. In this paper, we propose a simple yet powerful extension: the Flow-based Delayed Hawkes Process, which integrates Normalizing Flows as a generative model to parameterize the Hawkes process. By generating all model parameters through the flow-based network, our approach significantly improves flexibility and expressiveness while preserving interpretability. We provide theoretical guarantees by proving the identifiability of the model parameters and the consistency of the maximum likelihood estimator under mild assumptions. Extensive experiments on both synthetic and real-world datasets show that our model outperforms existing baselines in capturing intricate and heterogeneous event dynamics.

ID: 836

LoSAM: Local Search in Additive Noise Models with Mixed Mechanisms and General Noise for Global Causal Discovery

Sujai Hiremath, PROMIT GHOSAL, Kyra Gan
[OpenReview]

Abstract

Inferring causal relationships from observational data is crucial when experiments are costly or infeasible. Additive noise models (ANMs) enable unique directed acyclic graph (DAG) identification, but existing sample-efficient ANM methods often rely on restrictive assumptions on the data generating process, limiting their applicability to real-world settings. We propose local search in additive noise models, LoSAM, a topological ordering method for learning a unique DAG in ANMs with mixed causal mechanisms and general noise distributions. We introduce new causal substructures and criteria for identifying roots and leaves, enabling efficient top-down learning. We prove asymptotic consistency and polynomial runtime, ensuring scalability and sample efficiency. We test LoSAM on synthetic and real-world data, demonstrating state-of-the-art performance across all mixed mechanism settings.

ID: 837

Learning with Confidence

Oliver Ethan Richardson
[OpenReview]

Abstract

We characterize a notion of confidence that arises in learning or updating beliefs: the amount of trust one has in incoming information and its impact on the belief state. This *learner's confidence* can be used alongside (and is easily mistaken for) probability or likelihood, but it is fundamentally a different concept---one that captures many familiar concepts in the literature, including learning rates and number of training epochs, Shafer's weight of evidence, and Kalman gain. We formally axiomatize what it means to learn with confidence, give two canonical ways of measuring confidence on a continuum, and prove that confidence can always be represented in this way. Under additional assumptions, we derive more compact representations of confidence-based learning in terms of vector fields and loss functions. These representations induce an extended language of compound "parallel" observations. We characterize *Bayesian* learning as the special case of an optimizing learner whose loss representation is a linear expectation.

ID: 842

Conformal Prediction without Nonconformity Scores

Jonas Hanselle, Alireza Javanmardi, Tobias Florin Oberkofler, Yusuf Sale, Eyke Hüllermeier
[OpenReview]

Abstract

Conformal prediction (CP) is an uncertainty quantification framework that allows for constructing statistically valid prediction sets. Key to the construction of these sets is the notion of a nonconformity function, which assigns a real-valued score to individual data points: only those (hypothetical) data points contribute to a prediction set that sufficiently conform to the data. The point of departure of this work is the observation that CP predictions are invariant against (strictly) monotone transformations of the nonconformity function. In other words, it is only the ordering of the scores that matters, not their quantitative values. Consequently, instead of scoring individual data points, a conformal predictor only needs to be able to compare pairs of data points, deciding which of them is the more conforming one. This suggests an interesting connection between CP and preference learning, in particular learning-to-rank methods, and makes CP amenable to training data in the form of (qualitative) preferences. Elaborating on this connection, we propose methods for preference-based CP and show their usefulness in real-world classification tasks.

ID: 844

Residual Reweighted Conformal Prediction for Graph Neural Networks

Zheng Zhang, Jie Bao, Zhixin Zhou, nicolo colombo, Lixin Cheng, Rui Luo
[OpenReview]

Abstract

Graph Neural Networks (GNNs) excel at modeling relational data but face significant challenges in high-stakes domains due to unquantified uncertainty. Conformal prediction (CP) offers statistical coverage guarantees, but existing methods often produce overly conservative prediction intervals that fail to account for graph heteroscedasticity and structural biases. While residual reweighting CP variants address some of these limitations, they neglect graph topology, cluster-specific uncertainties, and risk data leakage by reusing training sets. To address these issues, we propose Residual Reweighted GNN (RR-GNN), a framework designed to generate minimal prediction sets with provable marginal coverage guarantees. RR-GNN introduces three major innovations to enhance prediction performance. First, it employs Graph-Structured Mondrian CP to partition nodes or edges into communities based on topological features, ensuring cluster-conditional coverage that reflects heterogeneity. Second, it uses Residual-Adaptive Nonconformity Scores by training a secondary GNN on a held-out calibration set to estimate task-specific residuals, dynamically adjusting prediction intervals according to node or edge uncertainty. Third, it adopts a Cross-Training Protocol, which alternates the optimization of the primary GNN and the residual predictor to prevent information leakage while maintaining graph dependencies. We validate RR-GNN on 15 real-world graphs across diverse tasks, including node classification, regression, and edge weight prediction. Compared to CP baselines, RR-GNN achieves improved efficiency over state-of-the-art methods, with no loss of coverage.

ID: 856

FeDCM: Federated Learning of Deep Causal Generative Models

Md Musfiqur Rahman, Murat Kocaoglu
[OpenReview]

Abstract

In many real-world settings, such as medicine and finance causal effect is a valuable metric for decision making. For many predictive tasks, causal mechanisms provide robust estimators while existing ML-driven predictors might be vulnerable to spurious correlations. In such settings, when data is decentralized and privacy must be preserved, federated learning plays an important role. However, causal inference in a federated learning setup is a largely unexplored research area. In this paper, we learn a proxy of the underlying structural causal model (SCM) with deep generative models from decentralized observational data sources possibly containing high-dimensional variables. Based on client preference or high dimensionality of variables, we modularize the SCM mechanisms and find the minimal subset appropriate for federated learning while having rest of the mechanisms trained on individual client’s local data. When all connected together, the proxy SCM, named as the federated deep causal generative model (FeDCM ), offers estimation of any identifiable causal effect. We perform extensive experiments to illustrate the utility and performance of our approach.


Conference proceedings will be published by PMLR


Sponsors