UAI 2024 - Accepted Papers

## ID: 3 |
## Learning Topological Representations with Bidirectional Graph Attention Network for Solving Job Shop Scheduling ProblemCong Zhang, Zhiguang Cao, Yaoxin Wu, Wen Song, Jing SunAbstract
Existing learning-based methods for solving job shop scheduling problems (JSSP) usually use off-the-shelf GNN models tailored to undirected graphs and neglect the rich and meaningful topological structures of disjunctive graphs (DGs). This paper proposes the topology-aware bidirectional graph attention network (TBGAT), a novel GNN architecture based on the attention mechanism, to embed the DG for solving JSSP in a local search framework. Specifically, TBGAT embeds the DG from a forward and a backward view, respectively, where the messages are propagated by following the different topologies of the views and aggregated via graph attention. Then, we propose a novel operator based on the message-passing mechanism to calculate the forward and backward topological sorts of the DG, which are the features for characterizing the topological structures and exploited by our model. In addition, we theoretically and experimentally show that TBGAT has linear computational complexity to the number of jobs and machines, respectively, strengthening our method's practical value. Besides, extensive experiments on five synthetic datasets and seven classic benchmarks show that TBGAT achieves new SOTA results by outperforming a wide range of neural methods by a large margin. All the code and data are publicly available online at https://github.com/zcaicaros/TBGAT. |

## ID: 11 |
## RE-SORT: Removing Spurious Correlation in Multilevel Interaction for CTR PredictionSongli Wu, Liang Du, Jia-Qi Yang, Yuai Wang, De-Chuan Zhan, SHUANG ZHAO, Zixun SunAbstract
Click-through rate (CTR) prediction is a critical task in recommendation systems, serving as the ultimate filtering step to sort items for a user. Most recent cutting-edge methods primarily focus on investigating complex implicit and explicit feature interactions; however, these methods neglect the spurious correlation issue caused by confounding factors, thereby diminishing the model's generalization ability. We propose a CTR prediction framework that REmoves Spurious cORrelations in mulTilevel feature interactions, termed RE-SORT, which has two key components. I. A multilevel stacked recurrent (MSR) structure enables the model to efficiently capture diverse nonlinear interactions from feature spaces at different levels. II. A spurious correlation elimination (SCE) module further leverages Laplacian kernel mapping and sample reweighting methods to eliminate the spurious correlations concealed within the multilevel features, allowing the model to focus on the true causal features. Extensive experiments conducted on four challenging CTR datasets, our production dataset, and an online A/B test demonstrate that the proposed method achieves state-of-the-art performance in both accuracy and speed. The utilized codes, models, and dataset will be released at https://github.com/RE-SORT. |

## ID: 12 |
## DistriBlock: Identifying adversarial audio samples by leveraging characteristics of the output distributionMatias Patricio Pizarro Bustamante, Dorothea Kolossa, Asja FischerAbstract
Adversarial attacks can mislead automatic speech recognition (ASR) systems into predicting an arbitrary target text, thus posing a clear security threat. To prevent such attacks, we propose DistriBlock, an efficient detection strategy applicable to any ASR system that predicts a probability distribution over output tokens in each time step. We measure a set of characteristAdversarial attacks can mislead automatic speech recognition (ASR) systems into predicting an arbitrary target text, thus posing a clear security threat. To prevent such attacks, we propose DistriBlock, an efficient detection strategy applicable to any ASR system that predicts a probability distribution over output tokens in each time step. We measure a set of characteristics of this distribution: the median, maximum, and minimum over the output probabilities, the entropy of the distribution, as well as the Kullback-Leibler and the Jensen-Shannon divergence with respect to the distributions of the subsequent time step. Then, by leveraging the characteristics observed for both benign and adversarial data, we apply binary classifiers, including simple threshold-based classification, ensembles of such classifiers, and neural networks. Through extensive analysis across different state-of-the-art ASR systems and language data sets, we demonstrate the supreme performance of this approach, with a mean area under the receiver operating characteristic curve for distinguishing target adversarial examples against clean and noisy data of 99\% and 97\%, respectively. To assess the robustness of our method, we show that adaptive adversarial examples that can circumvent DistriBlock are much noisier, which makes them easier to detect through filtering and creates another avenue for preserving the system's robustness.ics of this distribution: the median, maximum, and minimum over the output probabilities, the entropy of the distribution, as well as the Kullback-Leibler and the Jensen-Shannon divergence with respect to the distributions of the subsequent time step. Then, by leveraging the characteristics observed for both benign and adversarial data, we apply binary classifiers, including simple threshold-based classification, ensembles of such classifiers, and neural networks. Through extensive analysis across different state-of-the-art ASR systems and language data sets, we demonstrate the supreme performance of this approach, with a mean area under the receiver operating characteristic for distinguishing target adversarial examples against clean and noisy data of 99\% and 97\%, respectively. To assess the robustness of our method, we show that adaptive adversarial examples that can circumvent DistriBlock are much noisier, which makes them easier to detect through filtering and creates another avenue for preserving the system's robustness. |

## ID: 15 |
## Consistency Regularization for Domain Generalization with Logit Attribution MatchingHan Gao, Kaican Li, Weiyan Xie, Zhi LIN, Yongxiang Huang, Luning Wang, Caleb Chen Cao, Nevin L. ZhangAbstract
Domain generalization (DG) is about training models that generalize well under domain shift. Previous research on DG has been conducted mostly in single-source or multi-source settings. In this paper, we consider a third lesser-known setting where a training domain is endowed with a collection of pairs of examples that share the same semantic information. Such semantic sharing (SS) pairs can be created via data augmentation and then utilized for consistency regularization (CR). We present a theory showing CR is conducive to DG and propose a novel CR method called Logit Attribution Matching (LAM). We conduct experiments on five DG benchmarks and four pretrained models with SS pairs created by both generic and targeted data augmentation methods. LAM outperforms representative single/multi-source DG methods and various CR methods that leverage SS pairs. The code and data of this project are available at https://github.com/Gaohan123/LAM. |

## ID: 19 |
## Decentralized Online Learning in General-Sum Stackelberg GamesYaolong Yu, Haipeng ChenAbstract
We study an online learning problem in general-sum Stackelberg games, where players act in a decentralized and strategic manner. We study two settings depending on the type of information for the follower: (1) the $\textit{limited information}$ setting where the follower only observes its own reward, and (2) the $\textit{side information}$ setting where the follower has extra side information about the leader's reward. We show that for the follower, myopically best responding to the leader's action is the best strategy for the limited information setting, but not necessarily so for the side information setting -- the follower can manipulate the leader's reward signals with strategic actions, and hence induce the leader's strategy to converge to an equilibrium that is better off for itself. Based on these insights, we study decentralized online learning for both players in the two settings. Our main contribution is to derive $\textit{last iterate}$ convergence and sample complexity results in both settings. Notably, we design a new manipulation strategy for the follower in the latter setting, and show that it has an intrinsic advantage against the best response strategy. Our theories are also supported by empirical results. |

## ID: 25 |
## Graph Feedback Bandits with Similar Arms## [spotlight]Han Qi, Fei Guo, Li ZhuAbstract
In this paper, we study the stochastic multi-armed bandit problem with graph feedback. Motivated by the clinical trials and recommendation problem, we assume that two arms are connected if and only if they are similar (i.e., their means are close enough). We establish a regret lower bound for this novel feedback structure and introduce two UCB-based algorithms: D-UCB with problem-independent regret upper bounds and C-UCB with problem-dependent upper bounds. Leveraging the similarity structure, we also consider the scenario where the number of arms increases over time. Practical applications related to this scenario include Q\&A platforms (Reddit, Stack Overflow, Quora) and product reviews in Amazon and Flipkart. Answers (product reviews) continually appear on the website, and the goal is to display the best answers (product reviews) at the top. When the means of arms are independently generated from some distribution, we provide regret upper bounds for both algorithms and discuss the sub-linearity of bounds in relation to the distribution of means. Finally, we conduct experiments to validate the theoretical results. |

## ID: 26 |
## Low-rank Matrix Bandits with Heavy-tailed RewardsYue Kang, Cho-Jui Hsieh, Thomas LeeAbstract
In stochastic low-rank matrix bandit, the expected reward of an arm is equal to the inner product between its feature matrix and some unknown $d_1$ by $d_2$ low-rank parameter matrix $\Theta^*$ with rank $r \ll d_1\wedge d_2$. While all prior studies assume the payoffs are mixed with sub-Gaussian noises, in this work we loosen this strict assumption and consider the new problem of low-rank matrix bandit with heavy-tailed rewards (LowHTR), where the rewards only have finite $(1+\delta)$ moment for some $\delta \in (0,1]$. By utilizing the truncation on observed payoffs and the dynamic exploration, we propose a novel algorithm called LOTUS attaining the regret bound of order $\tilde O(d^\frac{3}{2}r^\frac{1}{2}T^\frac{1}{1+\delta}/\tilde{D}_{rr})$ without knowing $T$, which matches the state-of-the-art regret bound under sub-Gaussian noises~\citep{lu2021low,kang2022efficient} with $\delta = 1$. Moreover, we establish a lower bound of the order $\Omega(d^\frac{\delta}{1+\delta} r^\frac{\delta}{1+\delta} T^\frac{1}{1+\delta}) = \Omega(T^\frac{1}{1+\delta})$ for LowHTR, which indicates our LOTUS is nearly optimal in the order of $T$. In addition, we improve LOTUS so that it does not require knowledge of the rank $r$ with $\tilde O(dr^\frac{3}{2}T^\frac{1+\delta}{1+2\delta})$ regret bound, and it is efficient under the high-dimensional scenario. We also conduct simulations to demonstrate the practical superiority of our algorithm. |

## ID: 37 |
## Trusted re-weighting for label distribution learningZhuoran Zheng, Chen Wu, YEYING JIN, Xiuyi JiaAbstract
Label distribution learning (LDL) is a novel machine learning paradigm that aims to shift 0/1 labels into descriptive degrees to characterize the polysemy of instances. Since the description degree takes a value between 0 ∼ 1, it is difficult for the annotator to accurately annotate each label. Therefore, the predictive ability of numerous LDL algorithms may be degraded by the presence of noise in the label space. To address this problem, we propose a novel stability-trust LDL framework that aims to reconstruct the feature space of an arbitrary LDL dataset by using feature decoupling and prototype guidance. Specifically, first, we use prototype learning to select reliable cluster centers (representative vectors of label distributions) to filter out a set of clean samples (with labeled noise) on the original dataset. Then, we decouple the feature space (eliminating correlations among features) by modeling a weight assigner that is learned on this clean sample set, thus assigning weights to each sample of the original dataset. Finally, all existing LDL algorithms can be trained on this new re-weighted dataset for the goal of robust modeling. In addition, we create a new image dataset to support the training and testing of compared models. Experimental results demonstrate that the proposed framework boosts the performance of the LDL algorithm on datasets with label noise. |

## ID: 40 |
## Partial Identification with Proxy of Latent Confoundings via Sum-of-ratios Fractional ProgrammingZhiheng Zhang, Xinyan SuAbstract
Causal effect estimation is a crucial theoretical tool in uncertainty analysis. The challenge of unobservable confoundings has raised concerns regarding quantitative causality computation. To address this issue, proxy control has become popular, employing auxiliary variables W as proxies for the confounding variables U. However, proximal methods rely on strong assumptions, such as reversibility and completeness, that are challenging to interpret empirically and verify. Consequently, their applicability in real-world scenarios is limited, particularly when the proxies lack informativeness. In our paper, we have developed a novel optimization method named Partial Identification with Proxy of Latent Confoundings via Sum-of-Ratios Fractional Programming (PI-SFP). This method does not impose any additional restrictions upon proxies and only assumes the mild partial observability of the transition matrix P(W | U). We have theoretically proven the global convergence of PI-SFP to the valid bound of the causal effect and analyzed the conditions under which the bounds could be tight. Our synthetic and real-world experiments validate our theoretical framework. |

## ID: 41 |
## SMuCo: Reinforcement Learning for Visual Control via Sequential Multi-view Total CorrelationTong Cheng, Hang Dong, Lu Wang, Bo Qiao, Qingwei Lin, Saravan Rajmohan, Thomas MoscibrodaAbstract
The advent of abundant image data has catalyzed the advancement of visual control in reinforcement learning (RL) systems, leveraging multiple view- points to capture the same physical states, which could enhance control performance theoretically. However, integrating multi-view data into representation learning remains challenging. In this paper, we introduce SMuCo, an innovative multi-view reinforcement learning algorithm that constructs robust latent representations by optimizing multi- view sequential total correlation. This technique effectively captures task-relevant information and temporal dynamics while filtering out irrelevant data. Our method supports an unlimited number of views and demonstrates superior performance over leading model-free and model-based RL algorithms. Empirical results from the DeepMind Control Suite and the Sapien Basic Manipulation Task confirm SMuCo’s enhanced efficacy, significantly improving task performance across diverse scenarios and views. |

## ID: 42 |
## Inference in Probabilistic Answer Set Programs with Imprecise Probabilities via Optimization## [oral]Damiano Azzolini, Fabrizio RiguzziAbstract
Probabilistic answer set programming has recently been extended to manage imprecise probabilities by means of credal probabilistic facts and credal annotated disjunctions. This increases the expressivity of the language but, at the same time, the cost of inference. In this paper, we cast inference in probabilistic answer set programs with credal probabilistic facts and credal annotated disjunctions as a constrained nonlinear optimization problem where the function to optimize is obtained via knowledge compilation. Empirical results on different datasets with multiple configurations shows the effectiveness of our approach. |

## ID: 43 |
## Fast Reliability Estimation for Neural Networks with Adversarial Attack-Driven Importance SamplingKarim TIT, Teddy FuronAbstract
This paper introduces a novel approach to evaluate the reliability of Neural Networks (NNs) by integrating adversarial attacks with Importance Sampling (IS), enhancing the assessment's precision and efficiency. Leveraging adversarial attacks to guide IS, our method efficiently identifies vulnerable input regions, offering a more directed alternative to traditional Monte Carlo methods. While comparing our approach with classical reliability techniques like FORM and SORM, and with classical rare event simulation methods such as Cross-Entropy IS, we acknowledge its reliance on the effectiveness of adversarial attacks and its inability to handle very high-dimensional data such as ImageNet. Despite these challenges, our comprehensive empirical validations on the datasets the MNIST and CIFAR10 demonstrate the method's capability to accurately estimate NN reliability for a variety of models. Our research not only presents an innovative strategy for reliability assessment in NNs but also sets the stage for further work exploiting the connection between adversarial robustness and the field of statistical reliability engineering. |

## ID: 44 |
## Learning from Crowds with Dual-View K-Nearest NeighborJiao Li, Liangxiao Jiang, Xue Wu, Wenjun ZhangAbstract
In crowdsourcing scenarios, we can obtain multiple noisy labels from different crowd workers for each instance and then infer its integrated label via label integration. To achieve better performance, some recently published label integration methods have attempted to exploit the multiple noisy labels of inferred instances’ nearest neighbors via the K-nearest neighbor (KNN) algorithm. However, the used KNN algorithm searches inferred instances’ nearest neighbors only relying on the defined distance functions in the original attribute view and totally ignoring the valuable information hidden in the multiple noisy labels, which limits their performance. Motivated by multi-view learning, we define the multiple noisy labels as another label view of instances and propose to search inferred instances’ nearest neighbors using the joint information from both the original attribute view and the multiple noisy label view. To this end, we propose a novel label integration method called dual-view K-nearest neighbor (DVKNN). In DVKNN, we first define a new distance function to search the K-nearest neighbors of an inferred instance. Then, we define a fine-grained weight for each noisy label from each neighbor. Finally, we perform weighted majority voting (WMV) on all these noisy labels to obtain the integrated label of the inferred instance. Extensive experiments validate the effectiveness and rationality of DVKNN. |

## ID: 59 |
## BanditQ:Fair Bandits with Guaranteed RewardsAbhishek SinhaAbstract
Classic no-regret multi-armed bandit algorithms, including the Upper Confidence Bound (UCB), Hedge, and EXP3, are inherently unfair by design. Their unfairness stems from their objective of playing the most rewarding arm as frequently as possible while ignoring the rest. In this paper, we consider a fair prediction problem in the stochastic setting with a guaranteed minimum rate of accrual of rewards for each arm. We study the problem in both full-information and bandit feedback settings. Combining queueing-theoretic techniques with adversarial bandits, we propose a new online policy called BanditQ that achieves the target reward rates while conceding a regret and target rate violation penalty of at most $O(T^{\frac{3}{4}}).$ The regret bound in the full-information setting can be further improved to $O(\sqrt{T})$ under either a monotonicity assumption or when considering time-averaged regret. The proposed policy is efficient and admits a black-box reduction from the fair prediction problem to the standard adversarial MAB problem. The analysis of the BanditQ policy involves a new self-bounding inequality, which might be of independent interest. |

## ID: 65 |
## Efficient Monte Carlo Tree Search via On-the-Fly State-Conditioned Action Abstraction## [oral]Yunhyeok Kwak, Inwoo Hwang, Dooyoung Kim, Sanghack Lee, Byoung-Tak ZhangAbstract
Monte Carlo Tree Search (MCTS) has showcased its efficacy across a broad spectrum of decision-making problems. However, its performance often degrades under vast combinatorial action space, especially where an action is composed of multiple sub-actions. In this work, we propose an action abstraction based on the compositional structure between a state and sub-actions for improving the efficiency of MCTS under a factored action space. Our method learns a latent dynamics model with an auxiliary network that captures sub-actions relevant to the transition on the current state, which we call state-conditioned action abstraction. Notably, it infers such compositional relationships from high-dimensional observations without the known environment model. During the tree traversal, our method constructs the state-conditioned action abstraction for each node on-the-fly, reducing the search space by discarding the exploration of redundant sub-actions. Experimental results demonstrate the superior sample efficiency of our method compared to vanilla MuZero, which suffers from expansive action space. |

## ID: 76 |
## Towards Minimax Optimality of Model-based Robust Reinforcement Learning## [oral]Pierre Clavier, Erwan Le Pennec, Matthieu GeistAbstract
We study the sample complexity of obtaining an $\epsilon$-optimal policy in \emph{Robust} discounted Markov Decision Processes (RMDPs), given only access to a generative model of the nominal kernel. This problem is widely studied in the non-robust case, and it is known that any planning approach applied to an empirical MDP estimated with $\tilde{\mathcal{O}}(\frac{H^3 |S||A|}{\epsilon^2})$ samples provides an $\epsilon$-optimal policy, which is minimax optimal. Results in the robust case are much more scarce. For $sa$- (resp $s$-) rectangular uncertainty sets, until recently the best-known sample complexity was $\tilde{\mathcal{O}}(\frac{H^4 |S|^2|A|}{\epsilon^2})$ (resp. $\tilde{\mathcal{O}}(\frac{H^4 | S |^2| A |^2}{\epsilon^2})$), for specific algorithms and when the uncertainty set is based on the total variation (TV), the KL or the Chi-square divergences. In this paper, we consider uncertainty sets defined with an $L_p$-ball (recovering the TV case), and study the sample complexity of \emph{any} planning algorithm (with high accuracy guarantee on the solution) applied to an empirical RMDP estimated using the generative model. In the general case, we prove a sample complexity of $\tilde{\mathcal{O}}(\frac{H^4 | S || A |}{\epsilon^2})$ for both the $sa$- and $s$-rectangular cases (improvements of $| S |$ and $| S || A |$ respectively). When the size of the uncertainty is small enough, we improve the sample complexity to $\tilde{\mathcal{O}}(\frac{H^3 | S || A | }{\epsilon^2})$, recovering the lower-bound for the non-robust case for the first time and a robust lower-bound. Finally, we also introduce simple and efficient algorithms for solving the studied $L_p$ robust MDPs. |

## ID: 79 |
## Neighbor Similarity and Multimodal Alignment based Product Recommendation StudyZhiqiang Zhang, Yongqiang Jiang, Qian Gao, Zhipeng WangAbstract
Existing multimodal recommendation research still faces some challenges, such as not being able to fully mine the implicit relevance information of neighbor nodes, and the unreasonable weight allocation to imbalanced nodes. To address the aforementioned challenges, this paper introduces a new multimodal recommendation model called NSMAR+. Specifically, the model firstly constructs a neighbor similarity graph convolutional network to capture the implicit relevance information and reasonably assigns the attention weights through the graph attention mechanism. Secondly, the model introduces a modal alignment and fusion mechanism by using a multilayer perceptron (MLP) to map image and text features into a shared space for comparison and fusion. In addition, the model constructs a user co-interaction graph and an item semantic graph based on the original information and performs graph convolution operations to enhance the preference information of users and items, and to better capture the interactions and internal features between users and items. Finally, MLP is employed to aggregate user and item representations and predict personalized recommendation rankings. To validate the experiment’s efficacy, this paper compares with several leading multimodal recommendation models on public datasets with a performance improvement of 1% to 3%. The experimental outcomes indicate that the model in this paper has good superiority and accuracy in multimodal recommendation tasks. |

## ID: 86 |
## Differentiable Pareto-Smoothed Weighting for High-Dimensional Heterogeneous Treatment Effect EstimationYoichi Chikahara, Kansei UshiyamaAbstract
There is a growing interest in estimating heterogeneous treatment effects across individuals using their high-dimensional feature attributes. Achieving high performance in such high-dimensional heterogeneous treatment effect estimation is challenging because in this setup, it is usual that some features induce sample selection bias while others do not but are predictive of potential outcomes. To avoid losing such predictive feature information, existing methods learn separate feature representations using inverse probability weighting (IPW). However, due to their numerically unstable IPW weights, these methods suffer from estimation bias under a finite sample setup. To develop a numerically robust estimator by weighted representation learning, we propose a differentiable Pareto-smoothed weighting framework that replaces extreme weight values in an end-to-end fashion. Our experimental results show that by effectively correcting the weight values, our proposed method outperforms the existing ones, including traditional weighting schemes. Our code is available at [this https URL](https://github.com/ychika/DPSW). |

## ID: 90 |
## Calibrated and Conformal Propensity Scores for Causal Effect EstimationShachi Deshpande, Volodymyr KuleshovAbstract
Propensity scores are commonly used to balance observed covariates while estimating treatment effects. We argue that the probabilistic output of a learned propensity score model should be calibrated, i.e. a predictive treatment probability of 90% should correspond to 90% of individuals being assigned the treatment group. We propose simple recalibration techniques to ensure this property. We prove that calibration is a necessary condition for unbiased treatment effect estimation when using popular inverse propensity weighted and doubly robust estimators. We derive error bounds on causal effect estimates that directly relate to the quality of uncertainties provided by the probabilistic propensity score model and show that calibration strictly improves this error bound while also avoiding extreme propensity weights. We demonstrate improved causal effect estimation with calibrated propensity scores in several tasks including high-dimensional image covariates and genome-wide association studies (GWASs). Calibrated propensity scores improve the speed of GWAS analysis by more than two-fold by enabling the use of simpler models that are faster to train. |

## ID: 94 |
## Hybrid CtrlFormer: Learning Adaptive Search Space Partition for Hybrid Action Control via Transformer-based Monte Carlo Tree SearchJiashun Liu, Xiaotian Hao, Jianye HAO, YAN ZHENG, Yujing Hu, Changjie Fan, Tangjie Lv, Zhipeng HuAbstract
Hybrid action control tasks are common in the real world, which require controlling some discrete and continuous actions simultaneously. To solve these tasks, existing Deep Reinforcement learning (DRL) methods either directly build a separate policy for each type of action or simplify the hybrid action space into a discrete or continuous action control problem. However, these methods neglect the challenge of exploration resulting from the complexity of the hybrid action space. Thus, it is necessary to design more sample efficient algorithms. To this end, we propose a novel Hybrid Control Transformer (Hybrid CtrlFormer), to achieve better exploration and exploitation for the hybrid action control problems. The core idea is: 1) we construct a hybrid action space tree with the discrete actions at the higher level and the continuous parameter space at the lower level. Each parameter space is split into multiple subregions. 2) To simplify the exploration space, a Transformer-based Monte-Carlo tree search method is designed to efficiently evaluate and partition the hybrid action space into good and bad subregions along the tree. Our method achieves state-of-the-art performance and sample efficiency in a variety of environments with discrete-continuous action space. |

## ID: 98 |
## Online Policy Optimization for Robust Markov Decision ProcessJing Dong, Jingwei Li, Baoxiang Wang, Jingzhao ZhangAbstract
Reinforcement learning (RL) has exceeded human performance in many synthetic settings such as video games and Go. However, real-world deployment of end-to-end RL models is less common, as RL models can be very sensitive to perturbations in the environment. The robust Markov decision process (MDP) framework---in which the transition probabilities belong to an uncertainty set around a nominal model---provides one way to develop robust models. While previous analysis for robust MDP shows RL algorithms are effective assuming access to a generative model, it remains unclear whether RL can be efficient under a more realistic online setting, which requires a careful balance between exploration and exploitation. In this work, we consider online robust MDP by interacting with an unknown nominal system. We propose a robust optimistic policy optimization algorithm that is provably efficient. To address the additional uncertainty caused by an adversarial environment, our model features a new optimistic update rule derived via Fenchel conjugates. Our analysis establishes the first regret bound for online robust MDPs. |

## ID: 100 |
## ContextFlow++: Generalist-Specialist Flow-based Generative Models with Mixed-variable Context EncodingDenis A Gudovskiy, Tomoyuki Okuno, Yohei NakataAbstract
Normalizing flow-based generative models have been widely used in applications where the exact density estimation is of major importance. Recent research proposes numerous methods to improve their expressivity. However, conditioning on a context is largely overlooked area in the bijective flow research. Conventional conditioning with the vector concatenation is limited to only a few flow types. More importantly, this approach cannot support a practical setup where a set of context-conditioned (*specialist*) models are trained with the fixed pretrained general-knowledge (*generalist*) model. We propose ContextFlow++ approach to overcome these limitations using an additive conditioning with explicit generalist-specialist knowledge decoupling. Furthermore, we support discrete contexts by the proposed mixed-variable architecture with context encoders. Particularly, our context encoder for discrete variables is a surjective flow from which the context-conditioned continuous variables are sampled. Our experiments on rotated MNIST-R, corrupted CIFAR-10C, real-world ATM predictive maintenance and SMAP unsupervised anomaly detection benchmarks show that the proposed ContextFlow++ offers faster stable training and achieves higher performance metrics. Our code is publicly available at [github.com/gudovskiy/contextflow](https://github.com/gudovskiy/contextflow). |

## ID: 101 |
## Inference for Optimal Linear Treatment Regimes in Personalized Decision-making## [oral]Yuwen Cheng, Shu YangAbstract
Personalized decision-making, tailored to individual characteristics, is gaining significant attention. The optimal treatment regime aims to provide the best-expected outcome in the entire population, known as the value function. One approach to determine this optimal regime is by maximizing the Augmented Inverse Probability Weighting (AIPW) estimator of the value function. However, the derived treatment regime can be intricate and nonlinear, limiting their use. For clarity and interoperability, we emphasize linear regimes and determine the optimal linear regime by optimizing the AIPW estimator within set constraints. While the AIPW estimator offers a viable path to estimating the optimal regime, current methodologies predominantly focus on its asymptotic distribution, leaving a gap in studying the linear regime itself. However, there are many benefits to understanding the regime, as pinpointing significant covariates can enhance treatment effects and provide future clinical guidance. In this paper, we explore the asymptotic distribution of the estimated linear regime. Our results show that the parameter associated with the linear regime follows a cube-root convergence to a non-normal limiting distribution characterized by the maximizer of a centered Gaussian process with a quadratic drift. When making inferences for the estimated linear regimes with cube-root convergence in practical scenarios, the standard nonparametric bootstrap is invalid. As a solution, we facilitate the Cattaneo et al. (2020) bootstrap technique to provide a consistent distributional approximation for the estimated linear regimes, validated further through simulations and real-world data applications from the eICU Collaborative Research Database. |

## ID: 103 |
## Two Facets of SDE Under an Information-Theoretic Lens: Generalization of SGD via Training Trajectories and via Terminal StatesZiqiao Wang, Yongyi MaoAbstract
Stochastic differential equations (SDEs) have been shown recently to characterize well the dynamics of training machine learning models with SGD. When the generalization error of the SDE approximation closely aligns with that of SGD in expectation, it provides two opportunities for understanding better the generalization behaviour of SGD through its SDE approximation. Firstly, viewing SGD as full-batch gradient descent with Gaussian gradient noise allows us to obtain trajectory-based generalization bound using the information-theoretic bound from Xu and Raginsky [2017]. Secondly, assuming mild conditions, we estimate the steady-state weight distribution of SDE and use information-theoretic bounds from Xu and Raginsky [2017] and Negrea et al. [2019] to establish terminal-state-based generalization bounds. Our proposed bounds have some advantages, notably the trajectory-based bound outperforms results in Wang and Mao [2022], and the terminal-state-based bound exhibits a fast decay rate comparable to stability-based bounds. |

## ID: 108 |
## Towards Bounding Causal Effects under Markov Equivalence## [oral]Alexis BellotAbstract
Predicting the effect of unseen interventions is a fundamental research question across the data sciences. It is well established that in general such questions cannot be answered definitively from observational data. This realization has fuelled a growing literature introducing various identifying assumptions, for example in the form of a causal diagram among relevant variables. In practice, this paradigm is still too rigid for many practical applications as it is generally not possible to confidently delineate the true causal diagram. In this paper, we consider the derivation of bounds on causal effects given only observational data. We propose to take as input a less informative structure known as a Partial Ancestral Graph, which represents a Markov equivalence class of causal diagrams and is learnable from data. In this more ``data-driven'' setting, we provide a systematic algorithm to derive bounds on causal effects that exploit the invariant properties of the equivalence class, and that can be computed analytically. We demonstrate our method with synthetic and real data examples. |

## ID: 112 |
## Guaranteeing Robustness Against Real-World Perturbations In Time Series Classification Using Conformalized Randomized SmoothingNicola Franco, Jakob Spiegelberg, Jeanette Miriam Lorenz, Stephan GünnemannAbstract
Certifying the robustness of machine learning models against domain shifts and input space perturbations is crucial for many applications, where high risk decisions are based on the model's predictions. Techniques such as randomized smoothing have partially addressed this issues with a focus on adversarial attacks in the past. In this paper, we generalize randomized smoothing to arbitrary transformations and extend it to conformal prediction. The proposed ansatz is demonstrated on a time series classifier connected to an automotive use case. We meticulously assess the robustness of smooth classifiers in environments subjected to various degrees and types of time series native perturbations and compare it against standard conformal predictors. The proposed method consistently offers superior resistance to perturbations, maintaining high classification accuracy and reliability. Additionally, we are able to bound the performance on new domains via calibrating generalization with configuration shifts in the training data. In combination, conformalized randomized smoothing may offer a model agnostic approach to construct robust classifiers tailored to perturbations in their respective applications - a crucial capability for AI assurance argumentation. |

## ID: 113 |
## Can we Defend Against the Unknown? An Empirical Study About Threshold Selection for Neural Network MonitoringKhoi Tran DANG, Kevin Delmas, Jeremie Guiochet, Joris GuerinAbstract
With the increasing use of neural networks in critical systems, runtime monitoring becomes essential to reject unsafe predictions during inference. Various techniques have emerged to establish rejection scores that maximize the separability between the distributions of safe and unsafe predictions. The efficacy of these approaches is mostly evaluated using threshold-agnostic metrics, such as the area under the receiver operating characteristic curve. However, in real-world applications, an effective monitor also requires identifying a good threshold to transform these scores into meaningful binary decisions. Despite the pivotal importance of threshold optimization, this problem has received little attention. A few studies touch upon this question, but they typically assume that the runtime data distribution mirrors the training distribution, which is a strong assumption as monitors are supposed to safeguard a system against potentially unforeseen threats. In this work, we present rigorous experiments on various image datasets to investigate: 1. The effectiveness of monitors in handling unforeseen threats, which are not available during threshold adjustments. 2. Whether integrating generic threats into the threshold optimization scheme can enhance the robustness of monitors. |

## ID: 116 |
## GeONet: a neural operator for learning the Wasserstein geodesicAndrew Gracyk, Xiaohui ChenAbstract
Optimal transport (OT) offers a versatile framework to compare complex data distributions in a geometrically meaningful way. Traditional methods for computing the Wasserstein distance and geodesic between probability measures require mesh-specific domain discretization and suffer from the curse-of-dimensionality. We present GeONet, a mesh-invariant deep neural operator network that learns the non-linear mapping from the input pair of initial and terminal distributions to the Wasserstein geodesic connecting the two endpoint distributions. In the offline training stage, GeONet learns the saddle point optimality conditions for the dynamic formulation of the OT problem in the primal and dual spaces that are characterized by a coupled PDE system. The subsequent inference stage is instantaneous and can be deployed for real-time predictions in the online learning setting. We demonstrate that GeONet achieves comparable testing accuracy to the standard OT solvers on simulation examples and the MNIST dataset with considerably reduced inference-stage computational cost by orders of magnitude. |

## ID: 117 |
## Metric Learning from Limited Pairwise Preference ComparisonsZhi Wang, Geelon So, Ramya Korlakai VinayakAbstract
We study metric learning from preference comparisons under the ideal point model, in which a user prefers an item over another if it is closer to their latent ideal item. These items are embedded into $\mathbb{R}^d$ equipped with an unknown Mahalanobis distance shared across users. While recent work shows that it is possible to simultaneously recover the metric and ideal items given $\mathcal{O}(d)$ pairwise comparisons per user, in practice we often have a limited budget of $o(d)$ comparisons. We study whether the metric can still be recovered, even though learning individual ideal items is now no longer possible. We show that, on the one hand, $o(d)$ comparisons may not reveal any information about the metric, even with infinitely many users. On the other hand, when comparisons are made over items that exhibit low-dimensional structure, each user can contribute to learning the metric restricted to a low-dimensional subspace so that the metric can be jointly identified. We present a divide-and-conquer approach that achieves this, and provide theoretical recovery guarantees and empirical validation. |

## ID: 120 |
## Bayesian Pseudo-Coresets via Contrastive DivergencePiyush Tiwary, Kumar Shubham, Vivek V Kashyap, Prathosh APAbstract
Bayesian methods provide an elegant framework for estimating parameter posteriors and quantification of uncertainty associated with probabilistic models. However, they often suffer from slow inference times. To address this challenge, Bayesian Pseudo-Coresets (BPC) have emerged as a promising solution. BPC methods aim to create a small synthetic dataset, known as pseudo-coresets, that approximates the posterior inference achieved with the original dataset. This approximation is achieved by optimizing a divergence measure between the true posterior and the pseudo-coreset posterior. Various divergence measures have been proposed for constructing pseudo-coresets, with forward Kullback-Leibler (KL) divergence being the most successful. However, using forward KL divergence necessitates sampling from the pseudo-coreset posterior, often accomplished through approximate Gaussian variational distributions. Alternatively, one could employ Markov Chain Monte Carlo (MCMC) methods for sampling, but this becomes challenging in high-dimensional parameter spaces due to slow mixing. In this study, we introduce a novel approach for constructing pseudo-coresets by utilizing contrastive divergence. Importantly, optimizing contrastive divergence eliminates the need for approximations in the pseudo-coreset construction process. Furthermore, it enables the use of finite-step MCMC methods, alleviating the requirement for extensive mixing to reach a stationary distribution. To validate our method's effectiveness, we conduct extensive experiments on multiple datasets, demonstrating its superiority over existing BPC techniques. Our implementation is available at https://github.com/backpropagator/BPC-CD . |

## ID: 134 |
## How Inverse Conditional Flows Can Serve as a Substitute for Distributional RegressionLucas Kook, Chris Kolb, Philipp Schiele, Daniel Dold, Marcel Arpogaus, Cornelius Fritz, Philipp F. M. Baumann, Philipp Kopper, Tobias Pielok, Emilio Dorigatti, David RügamerAbstract
Neural network representations of simple models, such as linear regression, are being studied increasingly to better understand the underlying principles of deep learning algorithms. However, neural representations of distributional regression models, such as the Cox model, have received little attention so far. We close this gap by proposing a framework for distributional regression using inverse flow transformations (DRIFT), which includes neural representations of the aforementioned models. We empirically demonstrate that the neural representations of models in DRIFT can serve as a substitute for their classical statistical counterparts in several applications involving continuous, ordered, time-series, and survival outcomes. We confirm that models in DRIFT empirically match the performance of several statistical methods in terms of estimation of partial effects, prediction, and aleatoric uncertainty quantification. DRIFT covers both interpretable statistical models and flexible neural networks opening up new avenues in both statistical modeling and deep learning. |

## ID: 140 |
## QuantProb: Generalizing Probabilities along with Predictions for a Pre-trained ClassifierAditya Challa, Soma S Dhavala, Snehanshu SahaAbstract
Quantification of Uncertainty in predictions is a challenging problem. In the classification settings, although deep learning based models generalize well, class probabilities often lack reliability. Calibration errors are used to quantify uncertainty, and several methods exist to minimize calibration error. We argue that between the choice of having a minimum calibration error on original distribution which increases across distortions or having a (possibly slightly higher) calibration error which is constant across distortions, we prefer the latter We hypothesize that the reason for unreliability of deep networks is - The way neural networks are currently trained, the probabilities do not generalize across small distortions. We observe that quantile based approaches can potentially solve this problem. We propose an innovative approach to decouple the construction of quantile representations from the loss function allowing us to compute quantile based probabilities without disturbing the original network. We achieve this by establishing a novel duality property between quantiles and probabilities, and an ability to obtain quantile probabilities from any pre-trained classifier. While post-hoc calibration techniques successfully minimize calibration errors, they do not preserve robustness to distortions. We show that, Quantile probabilities (QuantProb), obtained from Quantile representations, preserve the calibration errors across distortions, since quantile probabilities generalize better than the naive Softmax probabilities. |

## ID: 141 |
## Publishing Number of Walks and Katz Centrality under Local Differential PrivacyLouis BETZER, Vorapong Suppakitpaisarn, Quentin HillebrandAbstract
In our study, we present an algorithm for publishing the count of walks and Katz centrality under local differential privacy (LDP), complemented by a comprehensive theoretical analysis. While previous research in LDP has predominantly focused on counting subgraphs with a maximum of five nodes, our work extends this to larger subgraphs. The primary challenge in such an extension lies in managing the exponentially increasing noise associated with LDP as the size of the subgraph grows. Our solution involves an algorithm for publishing the count of walks originating from each node in the graph, which subsequently enables us to publish the Katz centrality of all nodes. This algorithm incorporates multiple communication rounds and employs a clipping technique. Through both theoretical and empirical evaluation, we demonstrate that our algorithm achieves has a relatively small bias and variance, showing significant improvements over both the randomized response method and non-clipping algorithms. Additionally, our approach to estimating Katz centrality successfully identifies up to 90\% of the nodes with the highest centrality values. |

## ID: 144 |
## On the Convergence of Hierarchical Federated Learning with Partial Worker ParticipationXiaohan Jiang, Hongbin ZhuAbstract
Hierarchical federated learning (HFL) has emerged as the architecture of choice for multi-level communication networks, mainly because of its data privacy protection and low communication cost. However, existing studies on the convergence analysis for HFL are limited to the assumptions of full worker participation and/or i.i.d. datasets across workers, both of which rarely hold in practice. Motivated by this, we in this work propose a unified convergence analysis framework for HFL covering both full and partial worker participation with non-i.i.d. data, non-convex objective function and stochastic gradient. We correspondingly develop a three-sided learning rates algorithm to mitigate data divergences issue, thereby realizing better convergence performance. Our theoretical results provide key insights of why partial participation of HFL is beneficial in significantly reducing the data divergences compared to standard FL. Besides, the convergence analysis allows certain individualization for each cluster in HFL indicating that adjusting the worker sampling ratio and round period can improve the convergence behavior. |

## ID: 145 |
## Optimistic Regret Bounds for Online Learning in Adversarial Markov Decision Processes## [oral]Sang Bin Moon, Abolfazl HashemiAbstract
The Adversarial Markov Decision Process (AMDP) is a learning framework that deals with unknown and varying tasks in decision-making applications like robotics and recommendation systems. A major limitation of the AMDP formalism, however, is pessimistic regret analysis results in the sense that although the cost function can change from one episode to the next, the evolution in many settings is not adversarial. To address this, we introduce and study a new variant of AMDP, which aims to minimize regret while utilizing a set of cost predictors. For this setting, we develop a new policy search method that achieves a sublinear optimistic regret with high probability, that is a regret bound which gracefully degrades with the estimation power of the cost predictors. Establishing such optimistic regret bounds is nontrivial given that (i) as we demonstrate, the existing importance-weighted cost estimators cannot establish optimistic bounds, and (ii) the feedback model of AMDP is different (and more realistic) than the existing optimistic online learning works. Our result, in particular, hinges upon developing a novel optimistically biased cost estimator that leverages cost predictors and enables a high-probability regret analysis without imposing restrictive assumptions. We further discuss practical extensions of the proposed scheme and demonstrate its efficacy numerically. |

## ID: 161 |
## Exploring High-dimensional Search Space via Voronoi Graph TraversingAidong Zhao, Xuyang Zhao, Tianchen Gu, Zhaori Bi, Xinwei Sun, Changhao Yan, Fan Yang, Dian Zhou, Xuan ZengAbstract
Bayesian optimization (BO) is a well-established methodology for optimizing costly black-box functions. However, the sparse observations in the high-dimensional search space pose challenges in constructing reliable Gaussian Process (GP) models, which leads to blind exploration of the search space. We propose a novel Voronoi Graph Traversing (VGT) algorithm to extend BO to ultra high-dimensional problems. VGT employs a Voronoi diagram to mesh the design space and transform it into an undirected Voronoi graph. VGT explores the search space by iteratively performing path selection, promising cell sampling, and graph expansion operations. We introduce a UCB-based global traversal strategy to select the path towards promising Voronoi cells. Then we perform local BO within the promising cell and train local GP with a neighboring subset. The intrinsic geometric boundaries and adjacency of the Voronoi graph assist in fine-tuning the trajectory of local BO sampling. We also present a subspace enhancement approach for the intrinsic low-dimensional problems. Experimental results, including both synthetic benchmarks and real-world applications, demonstrate the proposed approach's state-of-the-art performance for tackling ultra high-dimensional problems ranging from hundreds to one thousand dimensions. |

## ID: 162 |
## Localised Natural Causal Learning Algorithms for Weak Consistency ConditionsKai Z. Teh, Kayvan Sadeghi, Terry SooAbstract
By relaxing conditions for “natural” structure learning algorithms, a family of constraint-based algorithms containing all exact structure learning algorithms under the faithfulness assumption, we define localised natural structure learning algorithms (LoNS). We also provide a set of necessary and sufficient assumptions for consistency of LoNS, which can be thought of as a strict relaxation of the restricted faithfulness assumption. We provide a practical LoNS algorithm that runs in exponential time, which is then compared with related existing structure learning algorithms, namely PC/SGS and the relatively recent Sparsest Permutation algorithm. Simulation studies are also provided. |

## ID: 171 |
## Early-Exit Neural Networks with Nested Prediction SetsMetod Jazbec, Patrick Forré, Stephan Mandt, Dan Zhang, Eric NalisnickAbstract
Early-exit neural networks (EENNs) facilitate adaptive inference by producing predictions at multiple stages of the forward pass. In safety-critical applications, these predictions are only meaningful when complemented with reliable uncertainty estimates. Yet, due to their sequential structure, an EENN's uncertainty estimates should also be *consistent*: labels that are deemed improbable at one exit should not reappear within the confidence interval / set of later exits. We show that standard uncertainty quantification techniques, like Bayesian methods or conformal prediction, can lead to inconsistency across exits. We address this problem by applying anytime-valid confidence sequences (AVCSs) to the exits of EENNs. By design, AVCSs maintain consistency across exits. We examine the theoretical and practical challenges of applying AVCSs to EENNs and empirically validate our approach on both regression and classification tasks. |

## ID: 176 |
## Cost-Sensitive Uncertainty-Based Failure Recognition for Object Detection## [oral]Moussa Kassem Sbeyti, Michelle E. Karg, Christian Wirth, Nadja Klein, Sahin AlbayrakAbstract
Object detectors in real-world applications often fail to detect objects due to varying factors such as weather conditions and noisy input. Therefore, a process that mitigates false detections is crucial for both safety and accuracy. While uncertainty-based thresholding shows promise, previous works demonstrate an imperfect correlation between uncertainty and detection errors. This hinders ideal thresholding, prompting us to further investigate the correlation and associated cost with different types of uncertainty. We therefore propose a cost-sensitive framework for object detection tailored to user-defined budgets on the two types of errors, missing and false detections. We derive minimum thresholding requirements to prevent performance degradation and define metrics to assess the applicability of uncertainty for failure recognition. Furthermore, we automate and optimize the thresholding process to maximize the failure recognition rate w.r.t. the specified budget. Evaluation on three autonomous driving datasets demonstrates that our approach significantly enhances safety, particularly in challenging scenarios. Leveraging localization aleatoric uncertainty and softmax-based entropy only, our method boosts the failure recognition rate by 36-60\% compared to conventional approaches. Code is available at https://mos-ks.github.io/publications. |

## ID: 178 |
## Robust Entropy Search for Safe Efficient Bayesian OptimizationDorina Weichert, Alexander Kister, Sebastian Houben, Patrick Link, Gunar ErnisAbstract
The practical use of Bayesian Optimization (BO) in engineering applications imposes special requirements: high sampling efficiency on the one hand and finding a robust solution on the other hand. We address the case of adversarial robustness, where all parameters are controllable during the optimization process, but a subset of them is uncontrollable or even adversely perturbed at the time of application. To this end, we develop an efficient information-based acquisition function that we call Robust Entropy Search (RES). We empirically demonstrate its benefits in experiments on synthetic and real-life data. The results show that RES reliably finds robust optima, outperforming state-of-the-art algorithms. |

## ID: 179 |
## AutoDrop: Training Deep Learning Models with Automatic Learning Rate DropJing Wang, Yunfei Teng, Anna Ewa ChoromanskaAbstract
Modern deep learning (DL) architectures are trained using variants of the SGD algorithm and typically rely on the user to manually drop the learning rate when the training curve saturates. In this paper, we develop an algorithm, that we call AutoDrop, that realizes the learning rate drop automatically and stems from the properties of the learning dynamics of DL systems. Specifically, it is motivated by the observation that the angular velocity of the model parameters, i.e., the velocity of the changes of the convergence direction, for a fixed learning rate initially increases rapidly and then progresses towards soft saturation. At saturation, the optimizer slows down thus the angular velocity saturation is a good indicator for dropping the learning rate. After the drop, the angular velocity “resets” and follows the pattern described above, increasing again until saturation. AutoDrop is built on this idea and drops the learning rate whenever the angular velocity saturates. The method is simple to implement, computationally cheap, and by design avoids the short-horizon bias problem. We show that AutoDrop achieves favorable performance compared to many different baseline manual and automatic learning rate schedulers, and matches the SOTA performance on all our experiments. On the theoretical front, we claim two contributions: we formulate the learning rate behavior based on the angular velocity and provide general convergence theory for the learning rate schedulers that decrease the learning rate step-wise, rather than continuously as is commonly analyzed. |

## ID: 182 |
## On the Capacitated Facility Location Problem with Scarce Resources## [oral]Gennaro Auricchio, Harry J. Clough, Jie ZhangAbstract
This paper investigates the Mechanism Design aspects of the $m$-Capacitated Facility Location Problem where the total facility capacity is lower than the number of agents. Following \cite{aziz2020capacity}, the Social Welfare of the facility location is determined through a First-Come-First-Served (FCFS) game where agents compete after the facility positions are established. When the number of facilities is $m>1$, the Nash Equilibrium (NE) of the FCFS game is not unique, thus the utility of the agents and the notion of truthfulness are not well-defined. To address these issues, we consider absolutely truthful mechanisms, i.e. mechanisms able to prevent agents from misreporting regardless of the strategies played during the FCFS game. We pair this more stringent truthfulness requirement with the notion of Equilibrium Stable (ES) mechanism, i.e. mechanisms whose Social Welfare does not depend on the NE of the FCFS game. We show that the class of percentile mechanisms is absolutely truthful and characterize under which conditions they are ES. We then show that the approximation ratio of each ES percentile mechanism is bounded and determine its value. Notably, when all the facilities have the same capacity and the number of agents is large enough, it is possible to achieve an approximation ratio smaller than $1+\frac{1}{2m-1}$. We enhance our findings by empirically evaluating the mechanisms' performances when agents' true positions follows a distribution. |

## ID: 184 |
## How to Fix a Broken Confidence Estimator: Evaluating Post-hoc Methods for Selective Classification with Deep Neural NetworksLuís Felipe Prates Cattelan, Danilo SilvaAbstract
This paper addresses the problem of selective classification for deep neural networks, where a model is allowed to abstain from low-confidence predictions to avoid potential errors. We focus on so-called post-hoc methods, which replace the confidence estimator of a given classifier without modifying or retraining it, thus being practically appealing. Considering neural networks with softmax outputs, our goal is to identify the best confidence estimator that can be computed directly from the unnormalized logits. This problem is motivated by the intriguing observation in recent work that many classifiers appear to have a ``broken'' confidence estimator, in the sense that their selective classification performance is much worse than what could be expected by their corresponding accuracies. We perform an extensive experimental study of many existing and proposed confidence estimators applied to 84 pretrained ImageNet classifiers available from popular repositories. Our results show that a simple $p$-norm normalization of the logits, followed by taking the maximum logit as the confidence estimator, can lead to considerable gains in selective classification performance, completely fixing the pathological behavior observed in many classifiers. As a consequence, the selective classification performance of any classifier becomes almost entirely determined by its corresponding accuracy. Moreover, these results are shown to be consistent under distribution shift. |

## ID: 193 |
## Functional Wasserstein Variational Policy OptimizationJunyu Xuan, Mengjing Wu, Zihe Liu, Jie LuAbstract
Variational policy optimization has become increasingly attractive to the reinforcement learning community because of its strong capability in uncertainty modeling and environment generalization. However, almost all existing studies in this area rely on Kullback–Leibler (KL) divergence which is unfortunately ill-defined in several situations. In addition, the policy is parameterized and optimized in weight space, which may not only bring additional unnecessary bias but also make the policy learning harder due to the complicatedly dependent weight posterior. In the paper, we design a novel functional Wasserstein variational policy optimization (FWVPO) based on the Wasserstein distance between function distributions. Specifically, we firstly parameterize policy as a Bayesian neural network but from a function-space view rather than a weight-space view and then propose FWVPO to optimize and explore the functional policy posterior. We prove that our FWVPO is a valid variational Bayesian objective and also guarantees the monotonic expected reward improvement under certain conditions. Experimental results on multiple reinforcement learning tasks demonstrate the efficiency of our new algorithm in terms of both cumulative rewards and uncertainty modeling capability. |

## ID: 194 |
## Graph Contrastive Learning under Heterophily via Graph FiltersWenhan Yang, Baharan MirzasoleimanAbstract
Graph contrastive learning (CL) methods learn node representations in a self-supervised manner by maximizing the similarity between the augmented node representations obtained via a GNN-based encoder. However, CL methods perform poorly on graphs with heterophily, where connected nodes tend to belong to different classes. In this work, we address this problem by proposing an effective graph CL method, namely HLCL, for learning graph representations under heterophily. HLCL first identifies a homophilic and a heterophilic subgraph based on the cosine similarity of node features. It then uses a low-pass and a high-pass graph filter to aggregate representations of nodes connected in the homophilic subgraph and differentiate representations of nodes in the heterophilic subgraph. The final node representations are learned by contrasting both the augmented high-pass filtered views and the augmented low-pass filtered node views. Our extensive experiments show that HLCL outperforms state-of-the-art graph CL methods on benchmark datasets with heterophily, as well as large-scale real-world graphs, by up to 7%, and outperforms graph supervised learning methods on datasets with heterophily by up to 10%. |

## ID: 198 |
## Identification and Estimation of Conditional Average Partial Causal Effects via Instrumental Variable## [oral]Yuta Kawakami, manabu kuroki, Jin TianAbstract
There has been considerable recent interest in estimating heterogeneous causal effects. In this paper, we study conditional average partial causal effects (CAPCE) to reveal the heterogeneity of causal effects with continuous treatment. We provide conditions for identifying CAPCE in an instrumental variable setting. Notably, CAPCE is identifiable under a weaker assumption than required by a commonly used measure for estimating heterogeneous causal effects of continuous treatment. We develop three families of CAPCE estimators: sieve, parametric, and reproducing kernel Hilbert space (RKHS)-based, and analyze their statistical properties. We illustrate the proposed CAPCE estimators on synthetic and real-world data. |

## ID: 209 |
## Mitigating Overconfidence in Out-of-Distribution Detection by Capturing Extreme ActivationsMohammad Azizmalayeri, Ameen Abu-Hanna, Giovanni CinàAbstract
Detecting out-of-distribution (OOD) instances is crucial for the reliable deployment of machine learning models in real-world scenarios. OOD inputs are commonly expected to cause a more uncertain prediction in the primary task; however, there are OOD cases for which the model returns a highly confident prediction. This phenomenon, denoted as "overconfidence", presents a challenge to OOD detection. Specifically, theoretical evidence indicates that overconfidence is an intrinsic property of certain neural network architectures, leading to poor OOD detection. In this work, we address this issue by measuring extreme activation values in the penultimate layer of neural networks and then leverage this proxy of overconfidence to improve on several OOD detection baselines. We test our method on a wide array of experiments spanning synthetic data and real-world data, tabular and image datasets, multiple architectures such as ResNet and Transformer, different training loss functions, and include the scenarios examined in previous theoretical work. Compared to the baselines, our method often grants substantial improvements, with double-digit increases in OOD detection AUC, and it does not damage performance in any scenario. |

## ID: 214 |
## Functional Wasserstein Bridge Inference for Bayesian Deep Learning## [oral]Mengjing Wu, Junyu Xuan, Jie LuAbstract
Bayesian deep learning (BDL) is an emerging field that combines the strong function approximation power of deep learning with the uncertainty modeling capabilities of Bayesian methods. In addition to those virtues, however, there are accompanying issues brought by such a combination to the classical parameter-space variational inference, such as the nonmeaningful priors, intricate posteriors, and possible pathologies. In this paper, we propose a new function-space variational inference solution called Functional Wasserstein Bridge Inference (FWBI), which can assign meaningful functional priors and obtain well-behaved posterior. Specifically, we develop a Wasserstein distance-based bridge to avoid the potential pathological behaviors of Kullback–Leibler (KL) divergence between stochastic processes that arise in most existing functional variational inference approaches. The derived functional variational objective is well-defined and proved to be a lower bound of the model evidence. We demonstrate the improved predictive performance and better uncertainty quantification of our FWBI on several tasks compared with various parameter-space and function-space variational methods. |

## ID: 215 |
## Approximate Bayesian Computation with Path Signatures## [spotlight]Joel Dyer, Patrick Cannon, Sebastian M SchmonAbstract
Simulation models often lack tractable likelihood functions, making likelihood-free inference methods indispensable. Approximate Bayesian computation generates likelihood-free posterior samples by comparing simulated and observed data through some distance measure, but existing approaches are often poorly suited to time series simulators, for example due to an independent and identically distributed data assumption. In this paper, we propose to use path signatures in approximate Bayesian computation to handle the sequential nature of time series. We provide theoretical guarantees on the resultant posteriors and demonstrate competitive Bayesian parameter inference for simulators generating univariate, multivariate, and irregularly spaced sequences of non-iid data. |

## ID: 222 |
## Distributionally Robust Optimization as a Scalable Framework to Characterize Extreme Value DistributionsPatrick Kendal Kuiper, Ali Hasan, Wenhao Yang, Jose Blanchet, Vahid Tarokh, Yuting Ng, Hoda BidkhoriAbstract
The goal of this paper is to develop distributionally robust optimization (DRO) estimators, specifically for multidimensional Extreme Value Theory (EVT) statistics. EVT supports using semi-parametric models called max-stable distributions built from spatial Poisson point processes. While powerful, these models are only asymptotically valid for large samples. However, since extreme data is by definition scarce, the potential for model misspecification error is inherent to these applications, thus DRO estimators are natural. In order to mitigate over-conservative estimates while enhancing out-of-sample performance, we study DRO estimators informed by semi-parametric max-stable constraints in the space of point processes. We study both tractable convex formulations for some problems of interest (e.g. CVaR) and more general neural network based estimators. Both approaches are validated using synthetically generated data, recovering prescribed characteristics, and verifying the efficacy of the proposed techniques. Additionally, the proposed method is applied to a real data set of financial returns for comparison to a previous analysis. We established the proposed model as a novel formulation in the multivariate EVT domain, and innovative with respect to performance when compared to relevant alternate proposals. |

## ID: 223 |
## Patch-Prompt Aligned Bayesian Prompt Tuning for Vision-Language ModelsXinyang Liu, Dongsheng Wang, Bowei Fang, Miaoge Li, Yishi Xu, Zhibin Duan, Bo Chen, Mingyuan ZhouAbstract
For downstream applications of vision-language pre-trained models, there has been significant interest in constructing effective prompts. Existing works on prompt engineering, which either require laborious manual designs or optimize the prompt tuning as a point estimation problem, may fail to describe diverse characteristics of categories and limit their applications. We introduce a Bayesian probabilistic resolution to prompt tuning, where the label-specific stochastic prompts are generated hierarchically by first sampling a latent vector from an underlying distribution and then employing a lightweight generative model. Importantly, we semantically regularize the tuning process by minimizing the statistic distance between the visual patches and linguistic prompts, which pushes the stochastic label representations to faithfully capture diverse visual concepts, instead of overfitting the training categories. We evaluate the effectiveness of our approach on four tasks: few-shot image recognition, base-to-new generalization, dataset transfer learning, and domain shifts. Extensive results on over 15 datasets show promising transferability and generalization performance of our proposed model, both quantitatively and qualitatively. |

## ID: 224 |
## Reflected Schrödinger Bridge for Constrained Generative Modeling## [oral]Wei Deng, Yu Chen, Nicole Tianjiao Yang, Hengrong Du, Qi Feng, Ricky T. Q. ChenAbstract
Diffusion models have become the go-to method for large-scale generative models in real-world applications. These applications often involve data distributions confined within bounded domains, typically requiring ad-hoc thresholding techniques for boundary enforcement. Reflected diffusion models aim to enhance generalizability by generating the data distribution through a backward process governed by reflected Brownian motion. However, reflected diffusion models may not easily adapt to diverse domains without the derivation of proper diffeomorphic mappings and do not guarantee optimal transport properties. To overcome these limitations, we introduce the Reflected Schrödinger Bridge algorithm—an entropy-regularized optimal transport approach tailored for generating data within diverse bounded domains. We derive elegant reflected forward-backward stochastic differential equations with Neumann and Robin boundary conditions, extend divergence-based likelihood training to bounded domains, and explore natural connections to entropic optimal transport for the study of approximate linear convergence—a valuable insight for practical training. Our algorithm yields robust generative modeling in diverse domains, and its scalability is demonstrated in real-world constrained generative modeling through standard image benchmarks. |

## ID: 225 |
## Label Consistency-based Worker Filtering for CrowdsourcingJiao Li, Liangxiao Jiang, Chaoqun Li, Wenjun ZhangAbstract
In crowdsourcing scenarios, we can obtain multiple noisy labels from different crowd workers on the Internet for each instance and then infer its unknown true label via a label integration method. However, noisy labels often have a serious negative impact on label integration. In this case, most existing works always focus on designing more complex label integration methods to infer unknown true labels more accurately from multiple noisy labels, but little attention has been paid to another perspective, i.e., purifying noisy labels before label integration. In this paper, we aim to purify noisy labels for existing label integration methods and propose a label consistency-based worker filtering (LCWF) algorithm. In LCWF, we consider that if all low-quality workers are filtered out and only high-quality workers remain, the label consistency should be high. Therefore, we utilize label consistency to filter out low-quality workers. Firstly, we directly transform the worker filtering problem into a discrete optimization problem and utilize label consistency to define the fitness function for this problem. Then, we search for the optimal solution to this problem by a genetic algorithm. Finally, we filter out all labels from low-quality workers according to the optimal solution we obtained. Experimental results on simulated and real-world datasets demonstrate that LCWF can effectively purify noisy labels and improve the integration accuracy of existing label integration methods. |

## ID: 226 |
## Cooperative Meta-Learning with Gradient AugmentationJongyun Shin, seungjin Han, Jangho KimAbstract
Model agnostic meta-learning (MAML) is one of the most widely used gradient-based meta-learning, consisting of two optimization loops: an inner loop and outer loop. MAML learns the new task from meta-initialization parameters with an inner update and finds the meta-initialization parameters in the outer loop. In general, the injection of noise into the gradient of the model for augmenting the gradient is one of the widely used regularization methods. In this work, we propose a novel cooperative meta-learning framework dubbed CML which leverages gradient-level regularization with gradient augmentation. We inject learnable noise into the gradient of the model for the model generalization. The key idea of CML is introducing the co-learner which has no inner update but the outer loop update to augment gradients for finding better meta-initialization parameters. Since the co-learner does not update in the inner loop, it can be easily deleted after meta-training. Therefore, CML infers with only meta-learner without additional cost and performance degradation. We demonstrate that CML is easily applicable to gradient-based meta-learning methods and CML leads to increased performance in few-shot regression, few-shot image classification and few-shot node classification tasks. Our codes are at https://github.com/JJongyn/CML. |

## ID: 227 |
## Base Models for Parabolic Partial Differential EquationsXingzi Xu, Ali Hasan, Jie Ding, Vahid TarokhAbstract
Parabolic partial differential equations (PDEs) appear in many disciplines to model the evolution of various mathematical objects, such as probability flows, value functions in control theory, and derivative prices in finance. It is often necessary to compute the solutions or a function of the solutions to a parametric PDE in multiple scenarios corresponding to different parameters of this PDE. This process often requires resolving the PDEs from scratch, which is time-consuming. To better employ existing simulations for the PDEs, we propose a framework for finding solutions to parabolic PDEs across different scenarios by meta-learning an underlying base distribution.%tasks. We build upon this base distribution to propose a method for computing solutions to parametric PDEs under different parameter settings. Finally, we illustrate the application of the proposed methods through extensive experiments in generative modeling, stochastic control, and finance. The empirical results suggest that the proposed approach improves generalization to solving new PDEs. |

## ID: 232 |
## Anomaly Detection with Variance Stabilized Density EstimationAmit Rozner, Barak Battash, Henry Li, Lior Wolf, Ofir LindenbaumAbstract
We propose a modified density estimation problem that is highly effective for detecting anomalies in tabular data. Our approach assumes that the density function is relatively stable (with lower variance) around normal samples. We have verified this hypothesis empirically using a wide range of real-world data. Then, we present a variance-stabilized density estimation problem for maximizing the likelihood of the observed samples while minimizing the variance of the density around normal samples. To obtain a reliable anomaly detector, we introduce a spectral ensemble of autoregressive models for learning the variance-stabilized distribution. We have conducted an extensive benchmark with 52 datasets, demonstrating that our method leads to state-of-the-art results while alleviating the need for data-specific hyperparameter tuning. Finally, we have used an ablation study to demonstrate the importance of each of the proposed components, followed by a stability analysis evaluating the robustness of our model. |

## ID: 233 |
## Statistical and Causal Robustness for Causal Null Hypothesis Tests## [oral]Junhui Yang, Rohit Bhattacharya, Youjin Lee, Ted WestlingAbstract
Prior work applying semiparametric theory to causal inference has primarily focused on deriving estimators that exhibit statistical robustness under a prespecified causal model that permits identification of a desired causal parameter. However, a fundamental challenge is correct specification of such a model, which usually involves making untestable assumptions. Evidence factors is an approach to combining hypothesis tests of a common causal null hypothesis under two or more candidate causal models. Under certain conditions, this yields a test that is valid if at least one of the underlying models is correct, which is a form of causal robustness. We propose a method of combining semiparametric theory with evidence factors. We develop a causal null hypothesis test based on joint asymptotic normality of $K$ asymptotically linear semiparametric estimators, where each estimator is based on a distinct identifying functional derived from each of $K$ candidate causal models. We show that this test provides both statistical and causal robustness in the sense that it is valid if at least one of the $K$ proposed causal models is correct, while also allowing for slower than parametric rates of convergence in estimating nuisance functions. We demonstrate the effectiveness of our method via simulations and applications to the Framingham Heart Study and Wisconsin Longitudinal Study. |

## ID: 238 |
## Learning relevant contextual variables within Bayesian optimizationJulien Martinelli, Ayush Bharti, Armi Tiihonen, S. T. John, Louis Filstroff, Sabina J. Sloman, Patrick Rinke, Samuel KaskiAbstract
Contextual Bayesian Optimization (CBO) efficiently optimizes black-box functions with respect to design variables, while simultaneously integrating _contextual_ information regarding the environment, such as experimental conditions. However, the relevance of contextual variables is not necessarily known beforehand. Moreover, contextual variables can sometimes be optimized themselves at additional cost, a setting overlooked by current CBO algorithms. Cost-sensitive CBO would simply include optimizable contextual variables as part of the design variables based on their cost. Instead, we adaptively select a subset of contextual variables to include in the optimization, based on the trade-off between their _relevance_ and the additional cost incurred by optimizing them compared to leaving them to be determined by the environment. We learn the relevance of contextual variables by sensitivity analysis of the posterior surrogate model while minimizing the cost of optimization by leveraging recent developments on early stopping for BO. We empirically evaluate our proposed Sensitivity-Analysis-Driven Contextual BO (_SADCBO_) method against alternatives on both synthetic and real-world experiments, together with extensive ablation studies, and demonstrate a consistent improvement across examples. |

## ID: 240 |
## Posterior Inference on Shallow Infinitely Wide Bayesian Neural Networks under Weights with Unbounded VarianceJorge Loria, Anindya BhadraAbstract
From the classical and influential works of Neal (1996), it is known that the infinite width scaling limit of a Bayesian neural network with one hidden layer is a Gaussian process, when the network weights have bounded prior variance. Neal's result has been extended to networks with multiple hidden layers and to convolutional neural networks, also with Gaussian process scaling limits. The tractable properties of Gaussian processes then allow straightforward posterior inference and uncertainty quantification, considerably simplifying the study of the limit process compared to a network of finite width. Neural network weights with unbounded variance, however, pose unique challenges. In this case, the classical central limit theorem breaks down and it is well known that the scaling limit is an $\alpha$-stable process under suitable conditions. However, current literature is primarily limited to forward simulations under these processes and the problem of posterior inference under such a scaling limit remains largely unaddressed, unlike in the Gaussian process case. To this end, our contribution is an interpretable and computationally efficient procedure for posterior inference, using a conditionally Gaussian representation, that then allows full use of the Gaussian process machinery for tractable posterior inference and uncertainty quantification in the non-Gaussian regime. |

## ID: 242 |
## Adaptive Time-Stepping Schedules for Diffusion ModelsYuzhu Chen, Fengxiang He, Shi Fu, Xinmei Tian, Dacheng TaoAbstract
This paper studies how to tune the stepping schedule in diffusion models, which is mostly fixed in current practice, lacking theoretical foundations and assurance of optimal performance at the chosen discretization points. In this paper, we advocate the use of adaptive time-stepping schedules and design two algorithms with an optimized sampling error bound $EB$: (1) for continuous diffusion, we treat $EB$ as the loss function to discretization points and run gradient descent to adjust them; and (2) for discrete diffusion, we propose a greedy algorithm that adjusts only one discretization point to its best position in each iteration. We conducted extensive experiments that show (1) improved generation ability in well-trained models, and (2) premature though usable generation ability in under-trained models. The code is submitted and will be released publicly. |

## ID: 243 |
## On Hardware-efficient Inference in Probabilistic CircuitsLingyun Yao, Martin Trapp, Jelin Leslin, Gaurav Singh, Peng Zhang, Karthekeyan Periasamy, Martin AndraudAbstract
Probabilistic circuits (PCs) offer a promising avenue to perform embedded reasoning under uncertainty. They support efficient and exact computation of various probabilistic inference tasks by design. Hence, hardware-efficient computation of PCs is highly interesting for edge computing applications. As computations in PCs are based on arithmetic with probability values, they are typically performed in the log domain to avoid underflow. Unfortunately, performing the log operation on hardware is costly. Hence, prior work has focused on computations in the linear domain, resulting in high resolution and energy requirements. This work proposes the first dedicated approximate computing framework for PCs that allows for low-resolution logarithm computations. We leverage Addition As Int, resulting in linear PC computation with simple hardware elements. Further, we provide a theoretical approximation error analysis and present an error compensation mechanism. Empirically, our method obtains up to 357× and 649× energy reduction on custom hardware for evidence and MAP queries respectively with little or no computational error. |

## ID: 262 |
## Transductive and Inductive Outlier Detection with Robust AutoencodersOfir Lindenbaum, Yariv Aizenbud, Yuval KlugerAbstract
Accurate detection of outliers is crucial for the success of numerous data analysis tasks. In this context, we propose the Probabilistic Robust AutoEncoder (PRAE) that can simultaneously remove outliers during training (transductive) and learn a mapping that can be used to detect outliers in new data (inductive). We first present the Robust AutoEncoder (RAE) objective that excludes outliers while including a subset of samples (inliers) that can be effectively reconstructed using an AutoEncoder (AE). RAE minimizes the autoencoder's reconstruction error while incorporating as many samples as possible. This could be formulated via regularization by subtracting an $\ell_0$ norm, counting the number of selected samples from the reconstruction term. As this leads to an intractable combinatorial problem, we propose two probabilistic relaxations of RAE, which are differentiable and alleviate the need for a combinatorial search. We prove that the solution to the PRAE problem is equivalent to the solution of RAE. We then use synthetic data to demonstrate that PRAE can accurately remove outliers in various contamination levels. Finally, we show that using PRAE for outlier detection leads to state-of-the-art results for inductive and transductive outlier detection. |

## ID: 264 |
## Bootstrap Your Conversions: Thompson Sampling for Partially Observable Delayed RewardsMarco Gigli, Fabio StellaAbstract
This paper presents a novel approach to address contextual bandit problems with partially observable, delayed feedback by introducing an approximate Thompson sampling technique. This is a common setting, with applications ranging from online marketing to vaccine trials. Leveraging Bootstrapped Thompson sampling (BTS), we obtain an approximate posterior distribution over delay distributions and conversion probabilities, thereby extending an Expectation-Maximisation (EM) model to the Bayesian domain. Unlike prior methodologies, our approach does not overlook uncertainty on delays. Within the EM framework, we employ the Kaplan-Meier estimator to place no restriction on delay distributions. Through extensive benchmarking against state-of-the-art techniques, our approach demonstrates superior performance across the majority of tested environments, with comparable performance in the remaining cases. Furthermore, our method offers practical implementation using off-the-shelf libraries, facilitating broader adoption. Our technique lays a foundation for extending to other bandit settings, such as non-contextual bandits or action-dependent delay distributions, promising wider applicability and versatility in real-world applications. |

## ID: 265 |
## Probabilistic reconciliation of mixed-type hierarchical time seriesLorenzo Zambon, Dario Azzimonti, Nicolò Rubattu, Giorgio CoraniAbstract
Hierarchical time series are collections of time series that are formed via aggregation, and thus adhere to some linear constraints. The forecasts for hierarchical time series should be coherent, i.e., they should satisfy the same constraints. In a probabilistic setting, forecasts are in the form of predictive distributions. Probabilistic reconciliation adjusts the predictive distributions, yielding a joint reconciled distribution that assigns positive probability only to coherent forecasts. There are methods for the reconciliation of hierarchies containing only Gaussian or only discrete predictive distributions; instead, the reconciliation of mixed hierarchies, i.e. mixtures of discrete and continuous time series, is still an open problem. We propose two different approaches to address this problem: mixed conditioning and top-down conditioning. We discuss their properties and we present experiments with datasets containing up to thousands of time series. |

## ID: 269 |
## Center-Based Relaxed Learning Against Membership Inference AttacksXingli Fang, Jung-Eun KimAbstract
Membership inference attacks (MIAs) are currently considered one of the main privacy attack strategies, and their defense mechanisms have also been extensively explored. However, there is still a gap between the existing defense approaches and ideal models in both performance and deployment costs. In particular, we observed that the privacy vulnerability of the model is closely correlated with the gap between the model's data-memorizing ability and generalization ability. To address it, we propose a new architecture-agnostic training paradigm called Center-based Relaxed Learning (CRL), which is adaptive to any classification model and provides privacy preservation by sacrificing a minimal or no loss of model generalizability. We emphasize that CRL can better maintain the model's consistency between member and non-member data. Through extensive experiments on common classification datasets, we empirically show that this approach exhibits comparable performance without requiring additional model capacity or data costs. |

## ID: 270 |
## Adjustment Identification Distance: A gadjid for Causal Structure LearningLeonard Henckel, Theo Würtzen, Sebastian WeichwaldAbstract
Evaluating graphs learned by causal discovery algorithms is difficult: The number of edges that differ between two graphs does not reflect how the graphs differ with respect to the identifying formulas they suggest for causal effects. We introduce a framework for developing causal distances between graphs which includes the structural intervention distance for directed acyclic graphs as a special case. We use this framework to develop improved adjustment-based distances as well as extensions to completed partially directed acyclic graphs and causal orders. We develop new reachability algorithms to compute the distances efficiently and to prove their low polynomial time complexity. In our package gadjid (open source at https://github.com/CausalDisco/gadjid), we provide implementations of our distances; they are orders of magnitude faster with proven lower time complexity than the structural intervention distance and thereby provide a success metric for causal discovery that scales to graph sizes that were previously prohibitive. |

## ID: 277 |
## Neural Optimal Transport with Lagrangian CostsAram-Alexandre Pooladian, Carles Domingo-Enrich, Ricky T. Q. Chen, Brandon AmosAbstract
We investigate the optimal transport problem between probability measures when the underlying cost function is understood to satisfy a least action principle, also known as a Lagrangian cost. These generalizations are useful when connecting observations from a physical system where the transport dynamics are influenced by the geometry of the system, such as obstacles (e.g., incorporating barrier functions in the Lagrangian), and allows practitioners to incorporate a priori knowledge of the underlying system such as non-Euclidean geometries (e.g., paths must be circular). Our contributions are of computational interest, where we demonstrate the ability to efficiently compute geodesics and amortize spline-based paths, which has not been done before, even in low dimensional problems. Unlike prior work, we also output the resulting Lagrangian optimal transport map without requiring an ODE solver. We demonstrate the effectiveness of our formulation on low-dimensional examples taken from prior work. The source code to reproduce our experiments is available at https://github.com/facebookresearch/lagrangian-ot. |

## ID: 278 |
## Model-Free Robust Reinforcement Learning with Sample Complexity AnalysisYudan Wang, Shaofeng Zou, Yue WangAbstract
Distributionally Robust Reinforcement Learning (DR-RL) aims to derive a policy optimizing the worst-case performance within a predefined uncertainty set. Despite extensive research, previous DR-RL algorithms have predominantly favored model-based approaches, with limited availability of model-free methods offering convergence guarantees or sample complexities. This paper proposes a model-free DR-RL algorithm leveraging the Multi-level Monte Carlo (MLMC) technique to close such a gap. Our innovative approach integrates a threshold mechanism that ensures finite sample requirements for algorithmic implementation, a significant departure from previous model-free algorithms. We adapt our algorithm to accommodate uncertainty sets defined by total variation, Chi-square divergence, and KL divergence, and provide finite sample analyses under all three cases. Remarkably, our algorithms represent the first model-free DR-RL approach featuring finite sample complexity for total variation and Chi-square divergence uncertainty sets, while also offering an improved sample complexity and broader applicability compared to existing model-free DR-RL algorithms for the KL divergence model. The complexities of our method establish the tightest results for all three uncertainty models in model-free DR-RL, underscoring the effectiveness and efficiency of our algorithm, and highlighting its potential for practical applications. |

## ID: 286 |
## Conditional Bayesian QuadratureZonghao Chen, Masha Naslidnyk, Arthur Gretton, Francois-Xavier BriolAbstract
We propose a novel approach for estimating conditional or parametric expectations in the setting where obtaining samples or evaluating integrands is costly. Through the framework of probabilistic numerical methods (such as Bayesian quadrature), our novel approach allows to incorporates prior information about the integrands especially the prior smoothness knowledge about the integrands and the conditional expectation. As a result, our approach provides a way of quantifying uncertainty and leads to a fast convergence rate, which is confirmed both theoretically and empirically on challenging tasks in Bayesian sensitivity analysis, computational finance and decision making under uncertainty. |

## ID: 288 |
## Characterizing Data Point Vulnerability as Average-Case RobustnessTessa Han, Suraj Srinivas, Himabindu LakkarajuAbstract
Studying the robustness of machine learning models is important to ensure consistent model behaviour across real-world settings. To this end, adversarial robustness is a standard framework, which views robustness of predictions through a binary lens: either a worst-case adversarial perturbation exists in the local region around an input, or it does not. However, this binary perspective does not account for the degrees of vulnerability, as data points with a larger number of misclassified examples in their neighborhoods are more vulnerable. In this work, we consider a complementary framework for robustness, called average-case robustness, which measures the fraction of points in a local region that provides consistent predictions. However, computing this quantity is hard, as standard Monte Carlo approaches are inefficient especially for high-dimensional inputs. In this work, we propose the first analytical estimators for average-case robustness for multi-class classifiers. We show empirically that our estimators are accurate and efficient for standard deep learning models and demonstrate their usefulness for identifying vulnerable data points, and well as quantifying robustness bias of models. Overall, our tools provide a complementary view to robustness, improving our ability to characterize model behaviour. |

## ID: 296 |
## Multi-fidelity Bayesian Optimization with Multiple Information Sources of Input-dependent FidelityMingzhou Fan, Byung-Jun Yoon, Edward Dougherty, Francis Alexander, Nathan Urban, Raymundo Arroyave, Xiaoning QianAbstract
By querying approximate surrogate models of different fidelity as available information sources, Multi-Fidelity Bayesian Optimization (MFBO) aims at optimizing unknown functions that are costly if not infeasible to evaluate. Existing MFBO methods often assume that approximate surrogates have consistently high/low fidelity across the input domain. However, approximate evaluations from the same surrogate can have different fidelity at different input regions due to data availability and model constraints, especially when considering machine learning surrogates. In this work, we investigate MFBO when multi-fidelity approximations have input-dependent fidelity. By explicitly capturing input dependency for multi-fidelity queries in Gaussian Process (GP), our new input-dependent MFBO~(iMFBO) with learnable noise models better captures the fidelity of each information source in an intuitive way. We further design a new acquisition function for iMFBO and prove that the queries selected by iMFBO have higher quality than those by naive MFBO methods, with the derived sub-linear regret bound. Experiments on both synthetic and real-world data demonstrate its superior empirical performance. |

## ID: 302 |
## Approximation Algorithms for Observer Aware MDPsShuwa Miura, Olivier Buffet, Shlomo ZilbersteinAbstract
We present approximation algorithms for Observer-Aware Markov Decision Processes (OAMDPs). OAMDPs model sequential decision-making problems in which rewards depend on the beliefs of an observer about the goals, intentions, or capabilities of the observed agent. The first proposed algorithm is a grid-based value iteration (Grid-VI), which discretizes the observer's belief into regular grids. Based on the same discretization, the second proposed algorithm is a variant of Real-Time Dynamic Programming (RTDP) called Grid-RTDP. Unlike Grid-Vi, Grid-RTDP focuses its updates on promising states using heuristic estimates. We provide theoretical guarantees of the proposed algorithms and demonstrate that Grid-RTDP has a good anytime performance comparable to the existing approach without performance guarantees. |

## ID: 303 |
## Discrete Probabilistic Inference as Control in Multi-path EnvironmentsTristan Deleu, Padideh Nouri, Nikolay Malkin, Doina Precup, Yoshua BengioAbstract
We consider the problem of sampling from a discrete and structured distribution as a sequential decision problem, where the objective is to find a stochastic policy such that objects are sampled at the end of this sequential process proportionally to some predefined reward. While we could use maximum entropy Reinforcement Learning (MaxEnt RL) to solve this problem for some distributions, it has been shown that in general, the distribution over states induced by the optimal policy may be biased in cases where there are multiple ways to generate the same object. To address this issue, Generative Flow Networks (GFlowNets) learn a stochastic policy that samples objects proportionally to their reward by approximately enforcing a conservation of flows across the whole Markov Decision Process (MDP). In this paper, we extend recent methods correcting the reward in order to guarantee that the marginal distribution induced by the optimal MaxEnt RL policy is proportional to the original reward, regardless of the structure of the underlying MDP. We also prove that some flow-matching objectives found in the GFlowNet literature are in fact equivalent to well-established MaxEnt RL algorithms with a corrected reward. Finally, we study empirically the performance of multiple MaxEnt RL and GFlowNet algorithms on multiple problems involving sampling from discrete distributions. |

## ID: 306 |
## To smooth a cloud or to pin it down: Expressiveness guarantees and insights on score matching in denoising diffusion modelsTeodora Reu, Francisco Vargas, Anna Kerekes, Michael M. BronsteinAbstract
Denoising diffusion models are a class of generative models that have recently achieved state-of-the-art results across many domains. Gradual noise is added to the data using a diffusion process, which transforms the data distribution into a Gaussian. Samples from the generative model are then obtained by simulating an approximation of the time reversal of this diffusion initialized by Gaussian samples. Recent research has explored the sampling error achieved by diffusion models under the assumption of an absolute error $\epsilon$ achieved via a neural approximation of the score. To the best of our knowledge, no work formally quantifies the error of such neural approximation to the score. In this paper, we close the gap and present quantitative error bounds for approximating the score of denoising diffusion models using neural networks leveraging ideas from stochastic control. Finally, through simulation, we explore some of the insights that arise from our results confirming that diffusion models based on the Ornstein-Uhlenbeck (OU) process require fewer parameters to better approximate the score than those based on the F\"{o}lmer drift / Pinned Brownian Motion." |

## ID: 308 |
## Pure Exploration in Asynchronous Federated BanditsZichen Wang, Chuanhao Li, chenyu song, Lianghui Wang, Quanquan Gu, Huazheng WangAbstract
We study the federated pure exploration problem of multi-armed bandits and linear bandits, where $M$ agents cooperatively identify the best arm via communicating with the central server. To enhance the robustness against latency and unavailability of agents that are common in practice, we propose the first federated asynchronous multi-armed bandit and linear bandit algorithms for pure exploration with fixed confidence. Our theoretical analysis shows the proposed algorithms achieve near-optimal sample complexities and efficient communication costs in a fully asynchronous environment. Moreover, experimental results based on synthetic and real-world data empirically elucidate the effectiveness and communication cost-efficiency of the proposed algorithms. |

## ID: 313 |
## Identifying Homogeneous and Interpretable Groups for Conformal PredictionNatalia Martinez, Dhaval C Patel, Chandra Reddy, Giridhar Ganapavarapu, Roman Vaculin, Jayant KalagnanamAbstract
Conformal prediction methods are a tool for uncertainty quantification of a model's prediction, providing a model-agnostic and distribution-free statistical wrapper that generates prediction intervals/sets for a given model with finite sample generalization guarantees. However, these guarantees hold only on average, or conditioned on the output values of the predictor or on a set of predefined groups, which a-priori may not relate to the prediction task at hand. We propose a method to learn a generalizable partition function of the input space (or representation mapping) into interpretable groups of varying sizes where the non-conformity scores - a measure of discrepancy between prediction and target - are as homogeneous as possible when conditioned to the group. The learned partition can be integrated with any of the group conditional conformal approaches to produce conformal sets with group conditional guarantees on the discovered regions. Since these learned groups are expressed as strictly a function of the input, they can be used for downstream tasks such as data collection or model selection. We show the effectiveness of our method in reducing worst case group coverage outcomes in a variety of datasets. |

## ID: 314 |
## Decentralized Two-Sided Bandit Learning in Matching MarketYiRui Zhang, Zhixuan FangAbstract
Two-sided matching under uncertainty has recently drawn much attention due to its wide applications. Existing works in matching bandits mainly focus on the one-sided learning setting and design algorithms with the objective of converging to stable matching with low regret. In this paper, we consider the more general two-sided learning setting, i.e. participants on both sides have to learn their preferences over the other side through repeated interactions. Inspired by the classical result that the optimal matching for the proposing side can be obtained using the Gale-Shapley algorithm, our inquiry stems from the curiosity about whether this result still holds in a two-sided learning setting. To handle this question, we formally introduce the two-sided learning setting, addressing strategies for both the arm and player sides without restrictive assumptions such as special preference structure and observation of winning players. Our results not only provide a positive answer to our inquiry but also offer a near-optimal upper bound, achieving $O(\log T)$ regret. |

## ID: 315 |
## GCVR: Reconstruction from Cross-View Enable Sufficient and Robust Graph Contrastive LearningQianlong Wen, Zhongyu Ouyang, Chunhui Zhang, Yiyue Qian, Chuxu Zhang, Yanfang YeAbstract
Among the existing self-supervised learning (SSL) methods for graphs, graph contrastive learning (GCL) frameworks usually automatically generate supervision by transforming the same graph into different views through graph augmentation operations. The computation-efficient augmentation techniques enable the prevalent usage of GCL to alleviate the supervision shortage issue. Despite the remarkable performance of those GCL methods, the InfoMax principle used to guide the optimization of GCL has been proven to be insufficient to avoid redundant information without losing important features. In light of this, we introduce the Graph Contrastive Learning with Cross-View Reconstruction (GCVR), aiming to learn robust and sufficient representation from graph data. Specifically, GCVR introduces a cross-view reconstruction mechanism based on conventional graph contrastive learning to elicit those essential features from raw graphs. Besides, we introduce an extra adversarial view perturbed from the original view in the contrastive loss to pursue the intactness of the graph semantics and strengthen the representation robustness. We empirically demonstrate that our proposed model outperforms the state-of-the-art baselines on graph classification tasks over multiple benchmark datasets. |

## ID: 317 |
## Investigating the Impact of Model Width and Density on Generalization in Presence of Label Noise## [spotlight]Yihao Xue, Kyle Whitecross, Baharan MirzasoleimanAbstract
Increasing the size of overparameterized neural networks has been a key in achieving state-of-the-art performance. This is captured by the double descent phenomenon, where the test loss follows a decreasing-increasing-decreasing pattern (or sometimes monotonically decreasing) as model width increases. However, the effect of label noise on the test loss curve has not been fully explored. In this work, we uncover an intriguing phenomenon where label noise leads to a \textit{final ascent} in the originally observed double descent curve. Specifically, under a sufficiently large noise-to-sample-size ratio, optimal generalization is achieved at intermediate widths. Through theoretical analysis, we attribute this phenomenon to the shape transition of test loss variance induced by label noise. Furthermore, we extend the final ascent phenomenon to model density and provide the first theoretical characterization showing that reducing density by randomly dropping trainable parameters improves generalization under label noise. We also thoroughly examine the roles of regularization and sample size. Surprisingly, we find that larger $\ell_2$ regularization and robust learning methods against label noise exacerbate the final ascent. We confirm the validity of our findings through extensive experiments on ReLu networks trained on MNIST, ResNets/ViT trained on CIFAR-10/100, and InceptionResNet-v2 trained on Stanford Cars with real-world noisy labels. |

## ID: 318 |
## Fair Active Learning in Low-Data RegimesRomain Camilleri, Andrew Wagenmaker, Kevin Jamieson, Lalit K Jain, Jamie Heather MorgensternAbstract
In critical machine learning applications, ensuring fairness is essential to avoid perpetuating social inequities. In this work, we address the challenges of reducing bias and improving accuracy in data-scarce environments, where the cost of collecting labeled data prohibits the use of large, labeled datasets. In such settings, active learning promises to maximize marginal accuracy gains of small amounts of labeled data. However, existing applications of active learning for fairness fail to deliver on this, typically requiring large labeled datasets, or failing to ensure the desired fairness tolerance is met on the population distribution. To address such limitations, we introduce an innovative active learning framework that combines an exploration procedure inspired by posterior sampling with a fair classification subroutine. We demonstrate that this framework performs effectively in very data-scarce regimes, maximizing accuracy while satisfying fairness constraints with high probability. We evaluate our proposed approach using well-established real-world benchmark datasets and compare it against state-of-the-art methods, demonstrating its effectiveness in producing fair models, and improvement over existing methods. |

## ID: 322 |
## Bias-aware Boolean Matrix Factorization Using Disentangled Representation LearningXiao Wang, Jia Wang, Tong Zhao, Yijie Wang, Nan Zhang, Yong Zang, Sha Cao, Chi ZhangAbstract
Boolean matrix factorization (BMF) has been widely utilized in fields such as recommendation systems, graph learning, text mining, and -omics data analysis. Traditional BMF methods decompose a binary matrix into the Boolean product of two lower-rank Boolean matrices plus homoscedastic random errors. However, real-world binary data typically involves biases arising from heterogeneous row- and column-wise signal distributions. Such biases can lead to suboptimal fitting and unexplainable predictions if not accounted for. In this study, we reconceptualize the binary data generation as the Boolean sum of three components: a binary pattern matrix, a background bias matrix influenced by heterogeneous row or column distributions, and random flipping errors. We introduce a novel Disentangled Representation Learning for Binary matrices (DRLB) method, which employs a dual auto-encoder network to reveal the true patterns. DRLB can be seamlessly integrated with existing BMF techniques to facilitate bias-aware BMF. Our experiments with both synthetic and real-world datasets show that DRLB significantly enhances the precision of traditional BMF methods while offering high scalability. Moreover, the bias matrix detected by DRLB accurately reflects the inherent biases in synthetic data, and the patterns identified in the bias-corrected real-world data exhibit enhanced interpretability. |

## ID: 327 |
## EntProp: High Entropy Propagation for Improving Accuracy and RobustnessShohei EnomotoAbstract
Deep neural networks (DNNs) struggle to generalize to out-of-distribution domains that are different from those in training despite their impressive performance. In practical applications, it is important for DNNs to have both high standard accuracy and robustness against out-of-distribution domains. One technique that achieves both of these improvements is disentangled learning with mixture distribution via auxiliary batch normalization layers (ABNs). This technique treats clean and transformed samples as different domains, allowing a DNN to learn better features from mixed domains. However, if we distinguish the domains of the samples based on entropy, we find that some transformed samples are drawn from the same domain as clean samples, and these samples are not completely different domains. To generate samples drawn from a completely different domain than clean samples, we hypothesize that transforming clean high-entropy samples to further increase the entropy generates out-of-distribution samples that are much further away from the in-distribution domain. On the basis of the hypothesis, we propose high entropy propagation~(EntProp), which feeds high-entropy samples to the network that uses ABNs. We introduce two techniques, data augmentation and free adversarial training, that increase entropy and bring the sample further away from the in-distribution domain. These techniques do not require additional training costs. Our experimental results show that EntProp achieves higher standard accuracy and robustness with a lower training cost than the baseline methods. In particular, EntProp is highly effective at training on small datasets. |

## ID: 336 |
## Quantum Kernelized BanditsYasunari Hikima, Kazunori.Murao, Sho Takemori, Yuhei UmedaAbstract
We consider the quantum kernelized bandit problem, where the player observes information of rewards through quantum circuits termed the quantum reward oracle, and the mean reward function belongs to a reproducing kernel Hilbert space (RKHS). We propose a UCB-type algorithm that utilizes the quantum Monte Carlo (QMC) method and provide regret bounds in terms of the decay rate of eigenvalues of the Mercer operator of the kernel. Our algorithm achieves $\widetilde{O}\left( T^{\frac{3}{1 + \beta_p}} \log\left(\frac{1}{\delta} \right)\right)$ and $\widetilde{O} \left( \log^{3(1 + \beta_e^{-1})/2} (T) \log\left(\frac{1 }{\delta} \right) \right)$ cumulative regret bounds with probability at least $1-\delta$ if the kernel has a $\beta_p$-polynomial eigendecay and $\beta_e$-exponential eigendecay, respectively. In particular, in the case of the exponential eigendecay, our regret bounds exponentially improve that of classical algorithms. Moreover, our results indicate that our regret bound is better than the lower bound in the classical kernelized bandit problem if the rate of decay is sufficiently fast. |

## ID: 341 |
## Random Linear Projections Loss for Hyperplane-Based Optimization in Neural NetworksShyam Venkatasubramanian, Ahmed Aloui, Vahid TarokhAbstract
Advancing loss function design is pivotal for optimizing neural network training and performance. This work introduces Random Linear Projections (RLP) loss, a novel approach that enhances training efficiency by leveraging geometric relationships within the data. Distinct from traditional loss functions that target minimizing pointwise errors, RLP loss operates by minimizing the distance between sets of hyperplanes connecting fixed-size subsets of feature-prediction pairs and feature-label pairs. Our empirical evaluations, conducted across benchmark datasets and synthetic examples, demonstrate that neural networks trained with RLP loss outperform those trained with traditional loss functions, achieving improved performance with fewer data samples, and exhibiting greater robustness to additive noise. We provide theoretical analysis supporting our empirical findings. |

## ID: 345 |
## Revisiting Convergence of AdaGrad with Relaxed AssumptionsYusu Hong, Junhong LinAbstract
In this study, we revisit the convergence of AdaGrad with momentum (covering AdaGrad as a special case) on non-convex smooth optimization problems. We consider a general noise model where the noise magnitude is controlled by the function value gap together with the gradient magnitude. This model encompasses a broad range of noises including bounded noise, sub-Gaussian noise, affine variance noise and the expected smoothness, and it has been shown to be more realistic in many practical applications. Our analysis yields a probabilistic convergence rate which, under the general noise, could reach at $\tilde{\mathcal{O}}(1/\sqrt{T})$. This rate does not rely on prior knowledge of problem-parameters and could accelerate to $\tilde{\mathcal{O}}(1/T)$ where $T$ denotes the total number iterations, when the noise parameters related to the function value gap and noise level are sufficiently small. The convergence rate thus matches the lower rate for stochastic first-order methods over non-convex smooth landscape up to logarithm terms [Arjevani et al., 2023]. We further derive a convergence bound for AdaGrad with momentum, considering the generalized smoothness where the local smoothness is controlled by a first-order function of the gradient norm. |

## ID: 346 |
## Memorization Capacity for Additive Fine-Tuning with Small ReLu NetworksJy-yong Sohn, Dohyun Kwon, Seoyeon An, Kangwook LeeAbstract
Fine-tuning large pre-trained models is a common practice in machine learning applications, yet its mathematical analysis remains largely unexplored. In this paper, we study fine-tuning through the lens of memorization capacity. Our new measure, the Fine-Tuning Capacity (FTC), is defined as the maximum number of samples a neural network can fine-tune, or equivalently, as the minimum number of neurons ($m$) needed to arbitrarily change $N$ labels among $K$ samples considered in the fine-tuning process. In essence, FTC extends the memorization capacity concept to the fine-tuning scenario. We analyze FTC for the additive fine-tuning scenario where the fine-tuned network is defined as the summation of the frozen pre-trained network $f$ and a neural network $g$ (with $m$ neurons) designed for fine-tuning. When $g$ is a ReLU network with either 2 or 3 layers, we obtain tight upper and lower bounds on FTC; we show that $N$ samples can be fine-tuned with $m=\Theta(N)$ neurons for 2-layer networks, and with $m=\Theta(\sqrt{N})$ neurons for 3-layer networks, no matter how large $K$ is. Our results recover the known memorization capacity results when $N = K$ as a special case. |

## ID: 352 |
## Approximate Kernel Density Estimation under Metric-based Local Differential PrivacyYi Zhou, Yanhao Wang, Long Teng, Qiang Huang, Cen ChenAbstract
Kernel Density Estimation (KDE) is a fundamental problem with broad machine learning applications. In this paper, we investigate the KDE problem under Local Differential Privacy (LDP), a setting in which users privatize data on their own devices before sending them to an untrusted server for analytics. To strike a balance between ensuring local privacy and preserving high-utility KDE results, we adopt a relaxed definition of LDP based on metrics (mLDP), which is suitable when data points are represented in a metric space and can be more distinguishable as their distances increase. To the best of our knowledge, approximate KDE under mLDP has not been explored in the existing literature. We propose the mLDP-KDE framework, which augments a locality-sensitive hashing-based sketch method to provide mLDP and answer any KDE query unbiasedly within an additive error with high probability in sublinear time and space. Extensive experimental results demonstrate that the mLDP-KDE framework outperforms several existing KDE methods under LDP and mLDP by achieving significantly better trade-offs between privacy and utility, with particularly remarkable advantages on large, high-dimensional data. |

## ID: 356 |
## Cold-start Recommendation by Personalized Embedding Region ElicitationHieu Trung Nguyen, Duy Nguyen, Khoa D Doan, Viet Anh NguyenAbstract
Rating elicitation is a success element for recommender systems to perform well at cold-starting, in which the systems need to recommend items to a newly arrived user with no prior knowledge about the user's preference. Existing elicitation methods employ a fixed set of items to learn the user's preference and then infer the users' preferences on the remaining items. Using a fixed seed set can limit the performance of the recommendation system since the seed set is unlikely optimal for all new users with potentially diverse preferences. This paper addresses this challenge using a 2-phase, personalized elicitation scheme. First, the elicitation scheme asks users to rate a small set of popular items in a ``burn-in'' phase. Second, it sequentially asks the user to rate adaptive items to refine the preference and the user's representation. Throughout the process, the system represents the user's embedding value not by a point estimate but by a region estimate. The value of information obtained by asking the user's rating on an item is quantified by the distance from the region center embedding space that contains with high confidence the true embedding value of the user. Finally, the recommendations are successively generated by considering the preference region of the user. We show that each subproblem in the elicitation scheme can be efficiently implemented. Further, we empirically demonstrate the effectiveness of the proposed method against existing rating-elicitation methods on several prominent datasets. |

## ID: 360 |
## Convergence Behavior of an Adversarial Weak Supervision Method## [spotlight]Steven An, Sanjoy DasguptaAbstract
Labeling data via rules-of-thumb and minimal label supervision is central to Weak Supervision, a paradigm subsuming subareas of machine learning such as crowdsourced learning and semi-supervised ensemble learning. By using this labeled data to train modern machine learning methods, the cost of acquiring large amounts of hand labeled data can be ameliorated. Approaches to combining the rules-of-thumb falls into two camps, reflecting different ideologies of statistical estimation. The most common approach, exemplified by the Dawid-Skene model, is based on probabilistic modeling. The other, developed in the work of Balsubramani-Freund and others, is adversarial and game-theoretic. We provide a variety of statistical results for the adversarial approach under log-loss: we characterize the form of the solution, relate it to logistic regression, demonstrate consistency, and give rates of convergence. On the other hand, we find that probabilistic approaches for the same model class can fail to be consistent. Experimental results are provided to corroborate the theoretical results. |

## ID: 361 |
## Analysis of Bootstrap and Subsampling in High-dimensional Regularized Regression## [spotlight]Lucas Clarté, Adrien Vandenbroucque, Guillaume Dalle, Bruno Loureiro, Florent Krzakala, Lenka ZdeborovaAbstract
We investigate popular resampling methods for estimating the uncertainty of statistical models, such as subsampling, bootstrap and the jackknife, and their performance in high-dimensional supervised regression tasks. We provide a tight asymptotic description of the biases and variances estimated by these methods in the context of generalized linear models, such as ridge and logistic regression, taking the limit where the number of samples $n$ and dimension $d$ of the covariates grow at a comparable rate: $\alpha=n/d$ fixed. Our findings are three-fold: i) resampling methods are fraught with problems in high dimensions and exhibit the double-descent-like behavior typical of these situations; ii) only when $\alpha$ is large enough do they provide consistent and reliable error estimations (we give convergence rates); iii) in the over-parametrized regime $\alpha<1$ relevant to modern machine learning practice, their predictions are not consistent, even with optimal regularization. |

## ID: 363 |
## A Global Markov Property for Solutions of Stochastic Difference Equations and the corresponding Full Time GraphsTom Hochsprung, Jakob Runge, Andreas GerhardusAbstract
Structural Causal Models (SCMs) are an important tool in causal inference. They induce a graph and if the graph is acyclic, a unique observational distribution. A standard result states that in this acyclic case, the induced observational distribution satisfies a d-separation global Markov property relative to the induced graph. Time series can also be modelled like SCMs: One just interprets the stochastic difference equations that a time series solves as structural equations. However, technical problems arise when time series "start" at minus infinity. In particular, a d-separation global Markov property for time series and the corresponding infinite graphs, the so-called full time graphs, has thus far only been shown for stable vector autoregressive processes with independent finite-second-moment noise. In this paper, we prove a much more general version of this Markov property. We discuss our assumptions and study violations of them. Doing so hints at several pitfalls at the intersection of time series analysis and causal inference. Moreover, we introduce a new projection procedure for these infinite graphs which might be of independent interest. |

## ID: 364 |
## Towards Scalable Bayesian Transformers: Investigating stochastic subset selection for NLPPeter J.T. Kampen, Gustav R. S. Als, Michael Riis AndersenAbstract
Bayesian deep learning provides a framework for quantifying uncertainty. However, the scale of modern neural networks applied in Natural Language Processing (NLP) limits the usability of Bayesian methods. Subnetwork inference aims to approximate the posterior by selecting a stochastic parameter subset for inference, thereby allowing scalable posterior approximations. Determining the optimal parameter space for subnetwork inference is far from trivial. In this paper, we study partially stochastic Bayesian neural networks in the context of transformer models for NLP tasks for the Laplace approximation (LA) and Stochastic weight averaging - Gaussian (SWAG). We propose heuristics for selecting which layers to include in the stochastic subset. We show that norm-based selection is promising for small subsets, and random selection is superior for larger subsets. Moreover, we propose Sparse-KFAC (S-KFAC), an extension of KFAC LA, which selects dense stochastic substructures of linear layers based on parameter magnitudes. S-KFAC retains performance while requiring substantially fewer stochastic parameters and, therefore, drastically limits memory footprint. |

## ID: 368 |
## Quantifying Representation Reliability in Self-Supervised Learning ModelsYoung-Jin Park, Hao Wang, Shervin Ardeshir, Navid AzizanAbstract
Self-supervised learning models extract general-purpose representations from data. Quantifying the reliability of these representations is crucial, as many downstream models rely on them as input for their own tasks. To this end, we introduce a formal definition of _representation reliability_: the representation for a given test point is considered to be reliable if the downstream models built on top of that representation can consistently generate accurate predictions for that test point. However, accessing downstream data to quantify the representation reliability is often infeasible or restricted due to privacy concerns. We propose an ensemble-based method for estimating the representation reliability without knowing the downstream tasks a priori. Our method is based on the concept of _neighborhood consistency_ across distinct pre-trained representation spaces. The key insight is to find shared neighboring points as anchors to align these representation spaces before comparing them. We demonstrate through comprehensive numerical experiments that our method effectively captures the representation reliability with a high degree of correlation, achieving robust and favorable performance compared with baseline methods. |

## ID: 371 |
## Performative Reinforcement Learning in Gradually Shifting EnvironmentsBen Rank, Stelios Triantafyllou, Debmalya Mandal, Goran RadanovicAbstract
When Reinforcement Learning (RL) agents are deployed in practice, they might impact their environment and change its dynamics. We propose a new framework to model this phenomenon, where the current environment depends on the deployed policy as well as its previous dynamics. This is a generalization of Performative RL (PRL) [Mandal et al., 2023]. Unlike PRL, our framework allows to model scenarios where the environment gradually adjusts to a deployed policy. We adapt two algorithms from the performative prediction literature to our setting and propose a novel algorithm called Mixed Delayed Repeated Retraining (MDRR). We provide conditions under which these algorithms converge and compare them using three metrics: number of retrainings, approximation guarantee, and number of samples per deployment. MDRR is the first algorithm in this setting which combines samples from multiple deployments in its training. This makes MDRR particularly suitable for scenarios where the environment's response strongly depends on its previous dynamics, which are common in practice. We experimentally compare the algorithms using a simulation-based testbed and our results show that MDRR converges significantly faster than previous approaches. |

## ID: 375 |
## Quantifying Local Model Validity using Active LearningSven Lämmle, Can Bogoclu, Robert Vosshall, Anselm Haselhoff, Dirk RoosAbstract
Real-world applications of machine learning models are often subject to legal or policy-based regulations. Some of these regulations require ensuring the validity of the model, i.e., the approximation error being smaller than a threshold. A global metric is generally too insensitive to determine the validity of a specific prediction, whereas evaluating local validity is costly since it requires gathering additional data. We propose learning the model error to acquire a local validity estimate while reducing the amount of required data through active learning. Using model validation benchmarks, we provide empirical evidence that the proposed method can lead to an error model with sufficient discriminative properties using a relatively small amount of data. Furthermore, an increased sensitivity to local changes of the validity bounds compared to alternative approaches is demonstrated. |

## ID: 377 |
## Finite-Time Analysis of Three-Timescale Constrained Actor-Critic and Constrained Natural Actor-Critic Algorithms.Prashansa Panda, Shalabh BhatnagarAbstract
Actor Critic methods have found immense applications on a wide range of Reinforcement Learning tasks especially when the state-action space is large. In this paper, we consider actor critic and natural actor critic algorithms with function approximation for constrained Markov decision processes (C-MDP) involving inequality constraints and carry out a non-asymptotic analysis for both of these algorithms in a non-i.i.d (Markovian) setting. We consider the long-run average cost criterion where both the objective and the constraint functions are suitable policy-dependent long-run averages of certain prescribed cost functions. We handle the inequality constraints using the Lagrange multiplier method. We prove that these algorithms are guaranteed to find a first-order stationary point (i.e., $\Vert \nabla L(\theta,\gamma)\Vert_2^2 \leq \epsilon$) of the performance (Lagrange) function $L(\theta,\gamma)$, with a sample complexity of $\mathcal{\tilde{O}}(\epsilon^{-2.5})$ in the case of both Constrained Actor Critic (C-AC) and Constrained Natural Actor Critic (C-NAC) algorithms. We also show the results of experiments on three different Safety-Gym environments. |

## ID: 380 |
## Last-iterate Convergence Separation between Extra-gradient and Optimism in Constrained Periodic GamesYi Feng, Ping Li, Ioannis Panageas, Xiao WangAbstract
Last-iterate behaviors of learning algorithms in repeated two-player zero-sum games have been extensively studied due to their wide applications in machine learning and related tasks. Typical algorithms that exhibit the last-iterate convergence property include optimistic and extra-gradient methods. However, most existing results establish these properties under the assumption that the game is time-independent. Recently, (Feng et al., 2023) studied the last-iterate behaviors of optimistic and extra-gradient methods in games with a time-varying payoff matrix, and proved that in an unconstrained periodic game, extra-gradient method converges to the equilibrium while optimistic method diverges. This finding challenges the conventional wisdom that these two methods are expected to behave similarly as they do in time-independent games. However, compared to unconstrained games, games with constrains are more common both in practical and theoretical studies. In this paper, we investigate the last-iterate behaviors of optimistic and extra-gradient methods in the constrained periodic games, demonstrating that similar separation results for last-iterate convergence also hold in this setting. |

## ID: 386 |
## Faster Perfect Sampling of Bayesian Network Structures## [spotlight]Juha Harviainen, Mikko KoivistoAbstract
Bayesian inference of a Bayesian network structure amounts to averaging over directed acyclic graphs (DAGs) on a given set of $n$ variables, each DAG weighted by its posterior probability. In practice, save some special inference tasks, one averages over a sample of DAGs generated perfectly or approximately from the posterior. For the hard problem of perfect sampling, we give an algorithm that runs in $O(2.829^n)$ expected time, getting below $O(3^n)$ for the first time. Our algorithm reduces the problem into two smaller sampling problems whose outputs are combined; followed by a simple rejection step, perfect samples are obtained. Subsequent samples can be generated considerably faster. Empirically, we observe speedups of several orders of magnitude over the state of the art. |

## ID: 395 |
## Using Autodiff to Estimate Posterior Moments, Marginals and SamplesSam Bowyer, Thomas Heap, Laurence AitchisonAbstract
Importance sampling is a popular technique in Bayesian inference: by reweighting samples drawn from a proposal distribution we are able to obtain samples and moment estimates from a Bayesian posterior over latent variables. Recent work, however, indicates that importance sampling scales poorly --- in order to accurately approximate the true posterior, the required number of importance samples grows is exponential in the number of latent variables [Chatterjee and Diaconis, 2018]. Massively parallel importance sampling works around this issue by drawing $K$ samples for each of the $n$ latent variables and reasoning about all $K^n$ combinations of latent samples. In principle, we can reason efficiently over $K^n$ combinations of samples by exploiting conditional independencies in the generative model. However, in practice this requires complex algorithms that traverse backwards through the graphical model, and we need separate backward traversals for each computation (posterior expectations, marginals and samples). Our contribution is to exploit the source term trick from physics to entirely avoid the need to hand-write backward traversals. Instead, we demonstrate how to simply and easily compute all the required quantities --- posterior expectations, marginals and samples --- by differentiating through a slightly modified marginal likelihood estimator. |

## ID: 402 |
## Pix2Code: Learning to Compose Neural Visual Concepts as Programs## [oral]Antonia Wüst, Wolfgang Stammer, Quentin Delfosse, Devendra Singh Dhami, Kristian KerstingAbstract
The challenge in learning abstract concepts from images in an unsupervised fashion lies in the required integration of visual perception and generalizable relational reasoning. Moreover, the unsupervised nature of this task makes it necessary for human users to be able to understand a model's learned concepts and potentially revise false behaviors. To tackle both the generalizability and interpretability constraints of visual concept learning, we propose Pix2Code, a framework that extends program synthesis to visual relational reasoning by utilizing the abilities of both explicit, compositional symbolic and implicit neural representations. This is achieved by retrieving object representations from images and synthesizing relational concepts as $\lambda$-calculus programs. We evaluate the diverse properties of Pix2Code on the challenging reasoning domains, Kandinsky Patterns, and CURI, testing its ability to identify compositional visual concepts that generalize to novel data and concept configurations. Particularly, in stark contrast to neural approaches, we show that Pix2Code's representations remain human interpretable and can easily be revised for improved performance. |

## ID: 406 |
## The Real Deal Behind the Artificial Appeal: Inferential Utility of Tabular Synthetic Data## [spotlight]Alexander Decruyenaere, Heidelinde Dehaene, Paloma Rabaey, Christiaan Polet, Johan Decruyenaere, Stijn Vansteelandt, Thomas DemeesterAbstract
Recent advances in generative models facilitate the creation of synthetic data to be made available for research in privacy-sensitive contexts. However, the analysis of synthetic data raises a unique set of methodological challenges. In this work, we highlight the importance of inferential utility and provide empirical evidence against naive inference from synthetic data, whereby synthetic data are treated as if they were actually observed. Before publishing synthetic data, it is essential to develop statistical inference tools for such data. By means of a simulation study, we show that the rate of false-positive findings (type 1 error) will be unacceptably high, even when the estimates are unbiased. Despite the use of a previously proposed correction factor, this problem persists for deep generative models, in part due to slower convergence of estimators and resulting underestimation of the true standard error. We further demonstrate our findings through a case study. |

## ID: 409 |
## Targeted Reduction of Causal Models## [oral]Armin Kekić, Bernhard Schölkopf, Michel BesserveAbstract
Why does a phenomenon occur? Addressing this question is central to most scientific inquiries and often relies on simulations of scientific models. As models become more intricate, deciphering the causes behind phenomena in high-dimensional spaces of interconnected variables becomes increasingly challenging. Causal Representation Learning (CRL) offers a promising avenue to uncover interpretable causal patterns within these simulations through an interventional lens. However, developing general CRL frameworks suitable for practical applications remains an open challenge. We introduce _Targeted Causal Reduction_ (TCR), a method for condensing complex intervenable models into a concise set of causal factors that explain a specific target phenomenon. We propose an information theoretic objective to learn TCR from interventional data of simulations, establish identifiability for continuous variables under shift interventions and present a practical algorithm for learning TCRs. Its ability to generate interpretable high-level explanations from complex models is demonstrated on toy and mechanical systems, illustrating its potential to assist scientists in the study of complex phenomena in a broad range of disciplines. |

## ID: 411 |
## A Generalized Bayesian Approach to Distribution-on-Distribution RegressionTin Lok James NgAbstract
In recent years, there has been growing interest in distribution-on-distribution regression, a regression problem where both covariates and responses are represented as probability distributions. Despite various methodologies proposed to address this challenge, a notable absence has been a Bayesian approach, which offers benefits by allowing for the integration of prior knowledge and providing a formal means of quantifying uncertainty. However, a major challenge in employing a Bayesian approach lies in the complexity of fully specifying the data generating process. To overcome this obstacle, we adopt a generalized Bayesian approach and investigate the contraction rates of the resulting generalized (Gibbs) posterior distributions. We propose an MCMC algorithm to sample from the generalized posterior distribution and conduct simulation studies to validate the theoretical findings. Finally, we apply the model to a data application involving mortality data. |

## ID: 413 |
## Dirichlet Continual Learning: Tackling Catastrophic Forgetting in NLPMin Zeng, Haiqin Yang, Wei Xue, Qifeng Liu, Yike GuoAbstract
Catastrophic forgetting poses a significant challenge in continual learning (CL). In the context of Natural Language Processing, generative-based rehearsal CL methods have made progress in avoiding expensive retraining. However, generating pseudo samples that accurately capture the task-specific distribution remains a daunting task. In this paper, we propose Dirichlet Continual Learning (DCL), a novel generative-based rehearsal strategy designed specifically for CL. Different from the conventional use of Gaussian latent variable in Conditional Variational Autoencoder, DCL employs the flexibility of the Dirichlet distribution to model the latent variable. This allows DCL to effectively capture sentence-level features from previous tasks and guide the generation of pseudo samples. Additionally, we introduce Jensen-Shannon Knowledge Distillation, a robust logit-based knowledge distillation method that enhances knowledge transfer during pseudo-sample generation. Our extensive experiments show that DCL outperforms state-of-the-art methods in two typical tasks of task-oriented dialogue systems, demonstrating its efficacy. |

## ID: 417 |
## Latent Representation Entropy Density for Distribution Shift DetectionFabio Arnez, Daniel Alfonso Montoya Vasquez, Ansgar Radermacher, François TerrierAbstract
Distribution shift detection is paramount in safety-critical tasks that rely on Deep Neural Networks (DNNs). The detection task entails deriving a confidence score to assert whether a new input sample aligns with the training data distribution of the DNN model. While DNN predictive uncertainty offers an intuitive confidence measure, exploring uncertainty-based distribution shift detection with simple sample-based techniques has been relatively overlooked in recent years due to computational overhead and lower performance than plain post-hoc methods. This paper proposes using simple sample-based techniques for estimating uncertainty and employing the entropy density from intermediate representations to detect distribution shifts. We demonstrate the effectiveness of our method using standard benchmark datasets for out-of-distribution detection and across different common perception tasks with convolutional neural network architectures. Our scope extends beyond classification, encompassing image-level distribution shift detection for object detection and semantic segmentation tasks. Our results show that our method's performance is comparable to existing \textit{State-of-the-Art} methods while being computationally faster and lighter than other Bayesian approaches, affirming its practical utility. |

## ID: 420 |
## Fast Interactive Search under a Scale-Free Comparison OracleLars Henning Klein, Daniyar Chumbalov, Lucas Maystre, Matthias GrossglauserAbstract
A comparison-based search algorithm lets a user find a target item $t$ in a database by answering queries of the form, ``Which of items $i$ and $j$ is closer to $t$?'' Instead of formulating an explicit query (such as one or several keywords), the user navigates towards the target via a sequence of such (typically noisy) queries. We propose a scale-free probabilistic oracle model called $\gamma$-CKL for such similarity triplets $(i,j;t)$, which generalizes the CKL triplet model proposed in the literature. The generalization affords independent control over the discriminating power of the oracle and the dimension of the feature space containing the items. We develop a search algorithm with provably exponential rate of convergence under the $\gamma$-CKL oracle, thanks to a backtracking strategy that deals with the unavoidable errors in updating the belief region around the target. We evaluate the performance of the algorithm both over the posited oracle and over several real-world triplet datasets. We also report on a comprehensive user study, where human subjects navigate a database of face portraits. |

## ID: 424 |
## Invariant Causal Prediction with Local ModelsAlexander Mey, Rui M. CastroAbstract
We consider the task of identifying the causal parents of a target variable among a set of candidates from observational data. Our main assumption is that the candidate variables are observed in different environments which may, under certain assumptions, be regarded as interventions on the observed system. We assume a linear relationship between target and candidates, which can be different in each environment with the only restriction that the causal structure is invariant across environments. Within our proposed setting we provide sufficient conditions for identifiability of the causal parents and introduce a practical method called L-ICP ($\textbf{L}$ocalized $\textbf{I}$nvariant $\textbf{Ca}$usal $\textbf{P}$rediction), which is based on a hypothesis test for parent identification using a ratio of minimum and maximum statistics. We then show in a simplified setting that the statistical power of L-ICP converges exponentially fast in the sample size, and finally we analyze the behavior of L-ICP experimentally in more general settings. |

## ID: 430 |
## Zero Inflation as a Missing Data Problem: a Proxy-based ApproachTrung Q. Phung, Jaron J.R. Lee, Opeyemi Oladapo-Shittu, Eili Y. Klein, Ayse Pinar Gurses, Susan M. Hannum, Kimberly Weems, Jill A. Marsteller, Sara E. Cosgrove, Sara C. Keller, Ilya ShpitserAbstract
A common type of zero-inflated data has certain true values incorrectly replaced by zeros due to data recording conventions (rare outcomes assumed to be absent) or details of data recording equipment (e.g. artificial zeros in gene expression data). Existing methods for zero-inflated data either fit the observed data likelihood via parametric mixture models that explicitly represent excess zeros, or aim to replace excess zeros by imputed values. If the goal of the analysis relies on knowing true data realizations, a particular challenge with zero-inflated data is identifiability, since it is difficult to correctly determine which observed zeros are real and which are inflated. This paper views zero-inflated data as a general type of missing data problem, where the observability indicator for a potentially censored variable is itself unobserved whenever a zero is recorded. We show that, without additional assumptions, target parameters involving a zero-inflated variable are not identified. However, if a proxy of the missingness indicator is observed, a modification of the effect restoration approach of Kuroki and Pearl allows identification and estimation, given the proxy-indicator relationship is known. If this relationship is unknown, our approach yields a partial identification strategy for sensitivity analysis. Specifically, we show that only certain proxy-indicator relationships are compatible with the observed data distribution. We give an analytic bound for this relationship in cases with a categorical outcome, which is sharp in certain models. For more complex cases, sharp numerical bounds may be computed using methods in Duarte et al. [2023]. We illustrate our method via simulation studies and a data application on central line-associated bloodstream infections (CLABSIs). |

## ID: 431 |
## Normalizing Flows for Conformal Regression## [oral]Nicolò ColomboAbstract
Conformal Prediction (CP) algorithms estimate the uncertainty of a prediction model by calibrating its outputs on labeled data. The same calibration scheme usually applies to any model and data without modifications. The obtained prediction intervals are valid by construction but could be inefficient, i.e. unnecessarily big, if the prediction errors are not uniformly distributed over the input space. We present a general scheme to localize the intervals by training the calibration process. The standard prediction error is replaced by an optimized distance metric that depends explicitly on the object attributes. Learning the optimal metric is equivalent to training a Normalizing Flow that acts on the joint distribution of the errors and the inputs. Unlike the Error Re-weighting CP algorithm of Papadopoulos et al. (2008), the framework allows estimating the gap between nominal and empirical conditional validity. The approach is compatible with existing locally-adaptive CP strategies based on re-weighting the calibration samples and applies to any point-prediction model without retraining. |

## ID: 432 |
## Detecting critical treatment effect bias in small subgroupsPiersilvio De Bartolomeis, Javier Abad, Konstantin Donhauser, Fanny YangAbstract
Randomized trials are considered the gold standard for making informed decisions in medicine. However, they are often not representative of the patient population in clinical practice. Observational studies, on the other hand, cover a broader patient population but are prone to various biases. Thus, before using observational data for any downstream task, it is crucial to benchmark its treatment effect estimates against a randomized trial. We propose a novel strategy to benchmark observational studies on a subgroup level. First, we design a statistical test for the null hypothesis that the treatment effects -- conditioned on a subset of relevant features -- differ up to some tolerance value. Our test allows us to estimate an asymptotically valid lower bound on the maximum bias strength for any subgroup. We validate our lower bound in a real-world setting and show that it leads to conclusions that align with established medical knowledge. |

## ID: 441 |
## Multi-layer random features and the approximation power of neural networksRustem TakhanovAbstract
A neural architecture with randomly initialized weights, in the infinite width limit, is equivalent to a Gaussian Random Field whose covariance function is the so-called Neural Network Gaussian Process kernel (NNGP). We prove that a reproducing kernel Hilbert space (RKHS) defined by the NNGP contains only functions that can be approximated by the architecture. To achieve a certain approximation error the required number of neurons in each layer is defined by the RKHS norm of the target function. Moreover, the approximation can be constructed from a supervised dataset by a random multi-layer representation of an input vector, together with training of the last layer's weights. For a 2-layer NN and a domain equal to an $n-1$-dimensional sphere in ${\mathbb R}^n$, we compare the number of neurons required by Barron's theorem and by the multi-layer features construction. We show that if eigenvalues of the integral operator of the NNGP decay slower than $k^{-n-\frac{2}{3}}$ where $k$ is an order of an eigenvalue, then our theorem guarantees a more succinct neural network approximation than Barron's theorem. We also make some computational experiments to verify our theoretical findings. Our experiments show that realistic neural networks easily learn target functions even when both theorems do not give any guarantees. |

## ID: 443 |
## Amortized Variational Inference: When and Why?Charles Margossian, David BleiAbstract
In a probabilistic latent variable model, factorized (or mean-field) variational inference (F-VI) fits a separate parametric distribution for each latent variable. Amortized variational inference (A-VI) instead learns a common inference function, which maps each observation to its corresponding latent variable's approximate posterior. Typically, A-VI is used as a cog in the training of variational autoencoders, however it stands to reason that A-VI could also be used as a general alternative to F-VI. In this paper we study when and why A-VI can be used for approximate Bayesian inference. We derive conditions on a latent variable model which are necessary, sufficient, and verifiable under which A-VI can attain F-VI's optimal solution, thereby closing the amortization gap. We prove these conditions are uniquely verified by simple hierarchical models, a broad class that encompasses many models in machine learning. We then show, on a broader class of models, how to expand the domain of AVI’s inference function to improve its solution, and we provide examples, e.g. hidden Markov models, where the amortization gap cannot be closed. |

## ID: 445 |
## Neural Active Learning Meets the Partial Monitoring FrameworkMaxime Heuillet, Ola Ahmad, Audrey DurandAbstract
We focus on the online-based active learning (OAL) setting where an agent operates over a stream of observations and trades-off between the costly acquisition of information (labelled observations) and the cost of prediction errors. We propose a novel foundation for OAL tasks based on partial monitoring, a theoretical framework specialized in online learning from partially informative actions. We show that previously studied binary and multi-class OAL tasks are instances of partial monitoring. We expand the real-world potential of OAL by introducing a new class of cost-sensitive OAL tasks. We propose NeuralCBP, the first PM strategy that accounts for predictive uncertainty with deep neural networks. Our extensive empirical evaluation on open source datasets shows that NeuralCBP has competitive performance against state-of-the-art baselines on multiple binary, multi-class and cost-sensitive OAL tasks. |

## ID: 446 |
## No-Regret Learning of Nash Equilibrium for Black-Box Games via Gaussian ProcessesMinbiao Han, Fengxue Zhang, Yuxin ChenAbstract
This paper investigates the challenge of learning in black-box games, where the underlying utility function is unknown to any of the agents. While there is an extensive body of literature on the theoretical analysis of algorithms for computing the Nash equilibrium with *complete information* about the game, studies on Nash equilibrium in *black-box* games are less common. In this paper, we focus on learning the Nash equilibrium when the only available information about an agent's payoff comes in the form of empirical queries. We provide a no-regret learning algorithm that utilizes Gaussian processes to identify equilibria in such games. Our approach not only ensures a theoretical convergence rate but also demonstrates effectiveness across a variety collection of games through experimental validation. |

## ID: 451 |
## Identifiability of total effects from abstractions of time series causal graphsCharles K. Assaad, Emilie Devijver, Eric Gaussier, Gregor Goessler, Anouar MeynaouiAbstract
We study the problem of identifiability of the total effect of an intervention from observational time series only given an abstraction of the causal graph of the system. Specifically, we consider two types of abstractions: the extended summary causal graph which conflates all lagged causal relations but distinguishes between lagged and instantaneous relations; and the summary causal graph which does not give any indication about the lag between causal relations. We show that the total effect is always identifiable in extended summary causal graphs and we provide necessary and sufficient graphical conditions for identifiability in summary causal graphs. Furthermore, we provide adjustment sets allowing to estimate the total effect whenever it is identifiable. |

## ID: 454 |
## Extremely Greedy Equivalence Search## [spotlight]Achille Nazaret, David BleiAbstract
The goal of causal discovery is to learn a directed acyclic graph from data. One of the most well-known methods for this problem is Greedy Equivalence Search (GES). GES searches for the graph by incrementally and greedily adding or removing edges to maximize a model selection criterion. It has strong theoretical guarantees on infinite data but can fail in practice on finite data. In this paper, we first identify some of the causes of GES's failure, finding that it can get blocked in local optima, especially in denser graphs. We then propose eXtremely Greedy Equivalent Search (XGES), which involves a new heuristic to improve the search strategy of GES while retaining its theoretical guarantees. In particular, XGES favors deleting edges early in the search over inserting edges, which reduces the possibility of the search ending in local optima. A further contribution of this work is an efficient algorithmic formulation of XGES (and GES). We benchmark XGES on simulated datasets with known ground truth. We find that XGES consistently outperforms GES in recovering the correct graphs, and it is 10 times faster. |

## ID: 455 |
## Learning Distributionally Robust Tractable Probabilistic Models in Continuous DomainsHailiang Dong, James Amato, Vibhav Giridhar Gogate, Nicholas RuozziAbstract
Tractable probabilistic models (TPMs) have attracted substantial research interest in recent years, particularly because of their ability to answer various reasoning queries in polynomial time. In this study, we focus on the distributionally robust learning of continuous TPMs and address the challenge of distribution shift at test time by tackling the adversarial risk minimization problem within the framework of distributionally robust learning. Specifically, we demonstrate that the adversarial risk minimization problem can be efficiently addressed when the model permits exact log-likelihood evaluation and efficient learning on weighted data. Our experimental results on several real-world datasets show that our approach achieves significantly higher log-likelihoods on adversarial test sets. Remarkably, we note that the model learned via distributionally robust learning can achieve higher average log-likelihood on the initial uncorrupted test set at times. |

## ID: 457 |
## Characterising Interventions in Causal Games## [oral]Manuj Mishra, James Fox, Michael J. WooldridgeAbstract
Causal games are probabilistic graphical models that enable causal queries to be answered in multi-agent settings. They extend causal Bayesian networks by specifying decision and utility variables to represent the agents' degrees of freedom and objectives. In multi-agent settings, whether each agent decides on their policy before or after knowing the causal intervention is important as this affects whether they can respond to the intervention by adapting their policy. Consequently, previous work in causal games imposed chronological constraints on permissible interventions. We relax this by outlining a sound and complete set of primitive causal interventions so the effect of any arbitrarily complex interventional query can be studied in multi-agent settings. We also demonstrate applications to the design of safe AI systems by considering causal mechanism design and commitment. |

## ID: 458 |
## Partial identification of the maximum mean discrepancy with mismeasured dataRon Nafshi, Maggie MakarAbstract
Nonparametric estimates of the distance between two distributions such as the Maximum Mean Discrepancy (MMD) are often used in machine learning applications. However, the majority of existing literature assumes that error-free samples from the two distributions of interest are available.We relax this assumption and study the estimation of the MMD under $\epsilon$-contamination, where a possibly non-random $\epsilon$ proportion of one distribution is erroneously grouped with the other. We show that under $\epsilon$-contamination, the typical estimate of the MMD is unreliable. Instead, we study partial identification of the MMD, and characterize sharp upper and lower bounds that contain the true, unknown MMD. We propose a method to estimate these bounds, and show that it gives estimates that converge to the sharpest possible bounds on the MMD as sample size increases, with a convergence rate that is faster than alternative approaches. Using three datasets, we empirically validate that our approach is superior to the alternatives: it gives tight bounds with a low false coverage rate. |

## ID: 461 |
## Unified PAC-Bayesian Study of Pessimism for Offline Policy Learning with Regularized Importance SamplingImad Aouali, Victor-Emmanuel Brunel, David Rohde, Anna KorbaAbstract
Off-policy learning (OPL) often involves minimizing a risk estimator based on importance weighting to correct bias from the logging policy used to collect data. However, this method can produce an estimator with a high variance. A common solution is to regularize the importance weights and learn the policy by minimizing an estimator with penalties derived from generalization bounds specific to the estimator. This approach, known as pessimism, has gained recent attention but lacks a unified framework for analysis. To address this gap, we introduce a comprehensive PAC-Bayesian framework to examine pessimism with regularized importance weighting. We derive a tractable PAC-Bayesian generalization bound that universally applies to common importance weight regularizations, enabling their comparison within a single framework. Our empirical results challenge common understanding, demonstrating the effectiveness of standard IW regularization techniques. |

## ID: 464 |
## Generalized Expected Utility as a Universal Decision Rule -- A Step Forward## [oral]Hélène Fargier, Pierre Pomeret-CoquotAbstract
In order to capture a larger range of decision rules, this paper extends the seminal work of [Friedman and Halpern, 1995, Chu and Halpern, 2003, 2004] about Generalized Expected Utility. We introduce the notion of algebraic mass function (and of algebraic Möbius transform) and provide a new algebraic expression for expected utility based on such functions. This utility, that we call "XEU", generalizes Chu and Halpern’s GEU to non-decomposable measures and allows for the representation of several rules that could not be captured up to this point, and noticeably, of the Choquet integral. A representation theorem is provided that shows that only a very weak condition is needed for a rule in order to be representable as a XEU. |

## ID: 465 |
## DataSP: A Differential All-to-All Shortest Path Algorithm for Learning Costs and Predicting Paths with ContextAlan Lahoud, Erik Schaffernicht, Johannes A. StorkAbstract
Learning latent costs of transitions on graphs from trajectories demonstrations under various contextual features is challenging but useful for path planning. Yet, existing methods either oversimplify cost assumptions or scale poorly with the number of observed trajectories. This paper introduces DataSP, a differentiable all-to-all shortest path algorithm to facilitate learning latent costs from trajectories. It allows to learn from a large number of trajectories in each learning step without additional computation. Complex latent cost functions from contextual features can be represented in the algorithm through a neural network approximation. We further propose a method to sample paths from DataSP in order to reconstruct/mimic observed paths' distributions. We prove that the inferred distribution follows the maximum entropy principle. We show that DataSP outperforms state-of-the-art differentiable combinatorial solver and classical machine learning approaches in predicting paths on graphs. |

## ID: 474 |
## Computing Low-Entropy Couplings for Large-Support DistributionsSamuel Sokota, Dylan Sam, Christian Schroeder de Witt, Spencer Compton, Jakob Nicolaus Foerster, J Zico KolterAbstract
Minimum-entropy coupling (MEC)---the process of finding a joint distribution with minimum entropy for given marginals---has applications in areas such as causality and steganography. However, existing algorithms are either computationally intractable for large-support distributions or limited to specific distribution types and sensitive to hyperparameter choices. This work addresses these limitations by unifying a prior family of iterative MEC (IMEC) approaches into a generalized partition-based formalism. From this framework, we derive a novel IMEC algorithm called ARIMEC, capable of handling arbitrary discrete distributions, and introduce a method to make IMEC robust to suboptimal hyperparameter settings. These innovations facilitate the application of IMEC to high-throughput steganography with language models, among other settings. |

## ID: 477 |
## MetaCOG: A Heirarchical Probabilistic Model for Learning Meta-Cognitive Visual RepresentationsMarlene Berke, Zhangir Azerbayev, Mario Belledonne, Zenna Tavares, Julian Jara-EttingerAbstract
Humans have the capacity to question what we see and to recognize when our vision is unreliable (e.g., when we realize that we are experiencing a visual illusion). Inspired by this capacity, we present MetaCOG: a hierarchical probabilistic model that can be attached to a neural object detector to monitor its outputs and determine their reliability. MetaCOG achieves this by learning a probabilistic model of the object detector’s performance via Bayesian inference—i.e., a meta-cognitive representation of the network’s propensity to hallucinate or miss different object categories. Given a set of video frames processed by an object detector, MetaCOG performs joint inference over the underlying 3D scene and the detector’s performance, grounding inference on a basic assumption of object permanence. Paired with three neural object detectors, we show that MetaCOG accurately recovers each detector’s performance parameters and improves the overall system’s accuracy. We additionally show that MetaCOG is robust to varying levels of error in object detector outputs, showing proof-of-concept for a novel approach to the problem of detecting and correcting errors in vision systems when ground-truth is not available. |

## ID: 484 |
## Bandits with Knapsacks and PredictionsDavide Drago, Andrea Celli, Marek EliasAbstract
We study the Bandits with Knapsacks problem with the aim of designing a learning-augmented online learning algorithm upholding better regret guarantees than the state-of-the-art primal-dual algorithms with worst-case guarantees, under both stochastic and adversarial inputs. In the adversarial case, we obtain better competitive ratios when the input predictions are accurate, while also maintaining worst-case guarantees for imprecise predictions. We introduce two algorithms tailored for the full and bandit feedback settings, respectively. Both algorithms integrate a static prediction with a worst-case no-$\alpha$-regret algorithm. This yields an optimized competitive ratio of $(\pi + (1 -\pi)/\alpha)^{-1}$ in scenarios where the prediction is perfect, and a competitive ratio of $\alpha/(1 - \pi)$ in the case of highly imprecise predictions, where $\pi \in (0,1)$ is chosen by the learner and $\alpha$ is Slater's parameter. We complement this analysis by studying the stochastic setting under full feedback. We provide an algorithm which guarantees a pseudo-regret of $\widetilde{O}(\sqrt{T})$ with poor predictions, and 0 pseudo-regret with perfect predictions. We also characterize the smoothness of the algorithm. |

## ID: 486 |
## Uncertainty Estimation with Recursive Feature MachinesDaniel Gedon, Amirhesam Abedsoltan, Thomas B. Schön, Mikhail BelkinAbstract
In conventional regression analysis, predictions are typically represented as point estimates derived from covariates. The Gaussian Process (GP) offer a kernel-based framework that predicts and quantifies associated uncertainties. However, kernel-based methods often underperform ensemble-based decision tree approaches in regression tasks involving tabular and categorical data. Recently, Recursive Feature Machines (RFMs) were proposed as a novel feature-learning kernel which strengthens the capabilities of kernel machines. In this study, we harness the power of these RFMs in a probabilistic GP-based approach to enhance uncertainty estimation through feature extraction within kernel methods. We employ this learned kernel for in-depth uncertainty analysis. On tabular datasets, our RFM-based method surpasses other leading uncertainty estimation techniques, including NGBoost and CatBoost-ensemble. Additionally, when assessing out-of-distribution performance, we found that boosting-based methods are surpassed by our RFM-based approach. |

## ID: 487 |
## Linearly Constrained Gaussian Processes are SkewGPs: application to Monotonic Preference Learning and DesirabilityAlessio Benavoli, Dario AzzimontiAbstract
We show that existing approaches to Linearly Constrained Gaussian Processes (LCGP) for regression, based on imposing constraints on a finite set of operational points, can be seen as Skew Gaussian Processes (SkewGPs). In particular, focusing on inequality constraints and building upon a recent unification of regression, classification, and preference learning through SkewGPs, we extend LCGP to handle monotonic preference learning and desirability, crucial for understanding and predicting human decision making. We demonstrate the efficacy of the proposed model on simulated and real data. |

## ID: 492 |
## Shedding Light on Large Generative Networks: Estimating Epistemic Uncertainty in Diffusion ModelsLucas Berry, Axel Brando, David MegerAbstract
Generative diffusion models, notable for their large parameter count (exceeding 100 million) and operation within high-dimensional image spaces, pose significant challenges for traditional uncertainty estimation methods due to computational demands. In this work, we introduce an innovative framework, Diffusion Ensembles for Capturing Uncertainty (DECU), designed for estimating epistemic uncertainty for diffusion models. The DECU framework introduces a novel method that efficiently trains ensembles of conditional diffusion models by incorporating a static set of pre-trained parameters, drastically reducing the computational burden and the number of parameters that require training. Additionally, DECU employs Pairwise-Distance Estimators (PaiDEs) to accurately measure epistemic uncertainty by evaluating the mutual information between model outputs and weights in high-dimensional spaces. The effectiveness of this framework is demonstrated through experiments on the ImageNet dataset, highlighting its capability to capture epistemic uncertainty, specifically in under-sampled image classes. |

## ID: 495 |
## $\chi$SPN: Characteristic Interventional Sum-Product Networks for Causal Inference in Hybrid DomainsHarsh Poonia, Moritz Willig, Zhongjie Yu, Matej Zečević, Kristian Kersting, Devendra Singh DhamiAbstract
Causal inference in hybrid domains, characterized by a mixture of discrete and continuous variables, presents a formidable challenge. We take a step towards this direction and propose \textbf{Ch}aracteristic \textbf{I}nterventional Sum-Product Network ($\chi$SPN) that is capable of estimating interventional distributions in presence of random variables drawn from mixed distributions. $\chi$SPN uses characteristic functions in the leaves of an interventional SPN (iSPN) thereby providing a unified view for discrete and continuous random variables through the Fourier–Stieltjes transform of the probability measures. A neural network is used to estimate the parameters of the learned iSPN using the intervened data. Our experiments on 3 synthetic heterogeneous datasets suggest that $\chi$SPN can effectively capture the interventional distributions for both discrete and continuous variables while being expressive and causally adequate. We also show that $\chi$SPN generalize to multiple interventions while being trained only on a single intervention data. |

## ID: 497 |
## Learning Accurate and Interpretable Decision Trees## [oral]Maria Florina Balcan, Dravyansh SharmaAbstract
Decision trees are a popular tool in machine learning and yield easy-to-understand models. Several techniques have been proposed in the literature for learning a decision tree classifier, with different techniques working well for data from different domains. In this work, we develop approaches to design decision tree learning algorithms given repeated access to data from the same domain. We propose novel parameterized classes of node splitting criteria in top-down algorithms, which interpolate between popularly used entropy and Gini impurity based criteria, and provide theoretical bounds on the number of samples needed to learn the splitting function appropriate for the data at hand. We also study the sample complexity of tuning prior parameters in Bayesian decision tree learning, and extend our results to decision tree regression. We further consider the problem of tuning hyperparameters in pruning the decision tree for classical pruning algorithms including min-cost complexity pruning. We also study the interpretability of the learned decision trees and introduce a data-driven approach for optimizing the explainability versus accuracy trade-off using decision trees. Finally, we demonstrate the significance of our approach on real world datasets by learning data-specific decision trees which are simultaneously more accurate and interpretable. |

## ID: 499 |
## Group Fairness in Predict-Then-Optimize Settings for Restless Bandits## [oral]Shresth Verma, YUNFAN ZHAO, Sanket Shah, Niclas Boehmer, Aparna Taneja, Milind TambeAbstract
Restless multi-arm bandits (RMABs) are a model for sequentially allocating a limited number of resources to agents modeled as Markov Decision Processes. RMABs have applications in cellular networks, anti-poaching, and in particular, healthcare. For such high-stakes use cases, allocations are often required to treat different groups of agents (e.g., defined by sensitive attributes) fairly. In addition to the fairness challenge, agents' transition probabilities are often unknown and need to be learned in real-world problems. Thus, group fairness in RMABs requires us to simultaneously learn transition probabilities and how much budget we allocate to each group. Overcoming this key challenge ignored by previous work, we develop a decision-focused-learning pipeline to solve equitable RMABs, using a novel budget allocation algorithm to prevent disparity between groups. Our results on both synthetic and real-world large-scale datasets demonstrate that incorporating fair planning into the learning step greatly improves equity with little sacrifice in utility. |

## ID: 500 |
## Causally Abstracted Multi-armed Bandits## [oral]Fabio Massimo Zennaro, Nicholas George Bishop, Joel Dyer, Yorgos Felekis, Ani Calinescu, Michael J. Wooldridge, Theodoros DamoulasAbstract
Multi-armed bandits (MAB) and causal MABs (CMAB) are established frameworks for decision-making problems. The majority of prior work typically studies and solves individual MAB and CMAB in isolation for a given problem and associated data. However, decision-makers are often faced with multiple related problems and multi-scale observations where joint formulations are needed in order to efficiently exploit the problem structures and data dependencies. Transfer learning for CMABs addresses the situation where models are defined on identical variables, although causal connections may differ. In this work, we extend transfer learning to setups involving CMABs defined on potentially different variables, with varying degrees of granularity, and related via an abstraction map. Formally, we introduce the problem of causally abstracted MABs (CAMABs) by relying on the theory of causal abstraction in order to express a rigorous abstraction map. We propose algorithms to learn in a CAMAB, and study their regret. We illustrate the limitations and the strengths of our algorithms on a real-world scenario related to online advertising. |

## ID: 501 |
## Sample Average Approximation for Black-Box Variational InferenceJavier Burroni, Justin Domke, Daniel SheldonAbstract
Black-box variational inference (BBVI) is a general-purpose approximate inference approach that converts inference to a stochastic optimization problem. However, the difficulty of solving the BBVI optimization problem reliably and robustly using stochastic gradient methods has limited its applicability. We present a novel optimization approach for BBVI using the sample average approximation (SAA). SAA converts stochastic problems to deterministic ones by optimizing over a fixed random sample, which enables optimization tools such as quasi-Newton methods and line search that bypass the difficulties faced by stochastic gradient methods. We design an approach called "SAA for VI" that solves a sequence of SAA problems with increasing sample sizes to reliably and robustly solve BBVI problems without problem-specific tuning. We focus on quasi-Newton methods, which are well suited to problems with up to hundreds of latent variables. Our experiments show that SAA for VI simplifies the VI problem and achieves faster performance than existing methods. |

## ID: 503 |
## Equilibrium Computation in Multidimensional Congestion Games: CSP and Learning Dynamics ApproachesMohammad T. Irfan, Hau Chan, Jared SoundyAbstract
We present algorithms of two flavors—one rooted in constraint satisfaction problems (CSPs) and the other in learning dynamics—to compute pure-strategy Nash equilibrium (PSNE) in k-dimensional congestion games (k-DCGs) and their variants. The two algorithmic approaches are driven by whether or not a PSNE is guaranteed to exist. We first show that deciding the existence of a PSNE in a k-DCG is NP-complete even when players have binary and unit demand vectors. For general cost functions (potentially non-monotonic), we devise a new CSP-inspired algorithmic framework for PSNE computation, leading to algorithms that run in polynomial time under certain assumptions while offering exponential savings over standard CSP algorithms. We further refine these algorithms for variants of k-DCGs. Our experiments demonstrate the effectiveness of this new CSP framework for hard, non-monotonic k-DCGs. We then provide learning dynamics-based PSNE computation algorithms for linear and exponential cost functions. These algorithms run in polynomial time under certain assumptions. For general cost, we give a learning dynamics algorithm for an (α, β)-approximate PSNE (for certain α and β). Lastly, we also devise polynomial-time algorithms for structured demands and cost functions. |

## ID: 504 |
## Domain Adaptation with Cauchy-Schwarz DivergenceWenzhe Yin, Shujian Yu, Yicong Lin, Jie Liu, Jan-Jakob Sonke, Efstratios GavvesAbstract
Domain adaptation aims to use training data from one or multiple source domains to learn a hypothesis that can be generalized to a different, but related, target domain. As such, having a reliable measure for evaluating the discrepancy of both marginal and conditional distributions is crucial. We introduce Cauchy-Schwarz (CS) divergence to the problem of unsupervised domain adaptation (UDA). The CS divergence offers a theoretically tighter generalization error bound than the popular Kullback-Leibler divergence. This holds for the general case of supervised learning, including multi-class classification and regression. Furthermore, we illustrate that the CS divergence enables a simple estimator on the discrepancy of both marginal and conditional distributions between source and target domains in the representation space, without requiring any distributional assumptions. We provide multiple examples to illustrate how the CS divergence can be conveniently used in both distance metric- or adversarial training-based UDA frameworks, resulting in compelling performance. The code of our paper is available at \url{https://github.com/ywzcode/CS-adv}. |

## ID: 505 |
## Offline Bayesian Aleatoric and Epistemic Uncertainty Quantification and Posterior Value Optimisation in Finite-State MDPsFilippo Valdettaro, Aldo A. FaisalAbstract
We address the challenge of quantifying Bayesian uncertainty and incorporating it in offline use cases of finite-state Markov Decision Processes (MDPs) with unknown dynamics. Our approach provides a principled method to disentangle epistemic and aleatoric uncertainty, and a novel technique to find policies that optimise Bayesian posterior expected value without relying on strong assumptions about the MDP’s posterior distribution. First, we utilise standard Bayesian reinforcement learning methods to capture the posterior uncertainty in MDP parameters based on available data. We then analytically compute the first two moments of the return distribution across posterior samples and apply the law of total variance to disentangle aleatoric and epistemic uncertainties. To find policies that maximise posterior expected value, we leverage the closed-form expression for value as a function of policy. This allows us to propose a stochastic gradient-based approach for solving the problem. We illustrate the uncertainty quantification and Bayesian posterior value optimisation performance of our agent in simple, interpretable gridworlds and validate it through ground-truth evaluations on synthetic MDPs. Finally, we highlight the real-world impact and computational scalability of our method by applying it to the AI Clinician problem, which recommends treatment for patients in intensive care units and has emerged as a key use case of finite-state MDPs with offline data. We discuss the challenges that arise with Bayesian modelling of larger scale MDPs while demonstrating the potential to apply our methods rooted in Bayesian decision theory into the real world. We make our code available at https://github.com/filippovaldettaro/finite-state-mdps. |

## ID: 511 |
## A Graph Theoretic Approach for Preference Learning with Feature Information## [oral]Aadirupa Saha, Arun RajkumarAbstract
We consider the problem of ranking a set of $n$ items given a sample of their pairwise preferences. It is well known from the classical results of sorting literature that without any further assumption, one requires a sample size of $\Omega(n \log n)$ with active selection of pairs whereas, for a random set pairwise preferences the bound could be as bad as $\Omega(n^2)$. However, what if the learner is exposed to additional knowledge of the items features and their pairwise preferences are known to be modelled in terms of their feature similarities -- can these bounds be improved? In particular, we introduce a new probabilistic preference model, called feature-Bradley-Terry-Luce (f-BTL) for the purpose, and present a new least squares based algorithm, fBTL-LS, which requires a sample complexity much lesser than $O(n\log n)$ random pairs to obtain a `good' ranking. The sample complexity of our proposed algorithms depends on the degree of feature correlation of the items that makes use of tools from classical graph matching theory, shedding light on the true complexity of the problem -- this was not possible before with existing matrix completion based tools. We also prove tightness of our results showing a matching information theoretic lower bound for the problem. Our theoretical results are corroborated with extensive experimental evaluations on varying datasets. |

## ID: 512 |
## Differentially Private No-regret Exploration in Adversarial Markov Decision ProcessesShaojie Bai, Lanting Zeng, Chengcheng Zhao, Xiaoming Duan, Mohammad Sadegh Talebi, Peng Cheng, Jiming ChenAbstract
We study learning adversarial Markov decision process (MDP) in the episodic setting under the constraint of differential privacy (DP). This is motivated by the widespread applications of reinforcement learning (RL) in non-stationary and even adversarial scenarios, where protecting users' sensitive information is vital. We first propose two efficient frameworks for adversarial MDPs, spanning full-information and bandit settings. Within each framework, we consider both Joint DP (JDP), where a central agent is trusted to protect the sensitive data, and Local DP (LDP), where the information is protected directly on the user side. Then, we design novel privacy mechanisms to privatize the stochastic transition and adversarial losses. By instantiating such privacy mechanisms to satisfy JDP and LDP requirements, we obtain near-optimal regret guarantees for both frameworks. To our knowledge, these are the first algorithms to tackle the challenge of private learning in adversarial MDPs. |

## ID: 525 |
## Unsupervised Feature Selection towards Pattern Discrimination PowerWangduk Seo, Jaesung LeeAbstract
The goal of unsupervised feature selection is to identify a feature subset based on the intrinsic characteristics of a given dataset without user-guided information such as class variables. To achieve this, score functions based on information measures can be used to identify essential features. The major research direction of conventional information-theoretic unsupervised feature selection is to minimize the entropy of the final feature subset. Although the opposite way, i.e., maximization of the joint entropy, can also lead to novel insights, studies in this direction are rare. For example, in the field of information retrieval, selected features that maximize the joint entropy of a feature subset can be effective discriminators for reaching the target tuple in the database. Thus, in this work, we first demonstrate how two feature subsets, each obtained by minimizing/maximizing the joint entropy, respectively, are different based on a toy dataset. By comparing these two feature subsets, we show that the maximization of the joint entropy enhances the pattern discrimination power of the feature subset. Then, we derive a score function by remedying joint entropy calculation; high-dimensional joint entropy calculation is circumvented by using the low-order approximation. The experimental results on 30 public datasets indicate that the proposed method yields superior performance in terms of pattern discrimination power-related measures. |

## ID: 526 |
## Multi-Relational Structural EntropyYuwei Cao, Hao Peng, Angsheng Li, Chenyu You, Zhifeng Hao, Philip S. YuAbstract
Structural Entropy (SE) measures the structural information contained in a graph. Minimizing or maximizing SE helps to reveal or obscure the intrinsic structural patterns underlying graphs in an interpretable manner, finding applications in various tasks driven by networked data. However, SE ignores the heterogeneity inherent in the graph relations, which is ubiquitous in modern networks. In this work, we extend SE to consider heterogeneous relations and propose the first metric for multi-relational graph structural information, namely, Multi-relational Structural Entropy (MrSE). To this end, we first cast SE through the novel lens of the stationary distribution from random surfing, which readily extends to multi-relational networks by considering the choices of both nodes and relation types simultaneously at each step. The resulting MrSE is then optimized by a new greedy algorithm to reveal the essential structures within a multi-relational network. Experimental results highlight that the proposed MrSE offers a more insightful interpretation of the structure of multi-relational graphs compared to SE. Additionally, it enhances the performance of two tasks that involve real-world multi-relational graphs, including node clustering and social event detection. |

## ID: 528 |
## Gradient descent in matrix factorization: Understanding large initializationHengchao Chen, Xin Chen, Mohamad Elmasri, Qiang SunAbstract
Gradient Descent (GD) has been proven effective in solving various matrix factorization problems. However, its optimization behavior with large initial values remains less understood. To address this gap, this paper presents a novel theoretical framework for examining the convergence trajectory of GD with a large initialization. The framework is grounded in signal-to-noise ratio concepts and inductive arguments. The results uncover an implicit incremental learning phenomenon in GD and offer a deeper understanding of its performance in large initialization scenarios. |

## ID: 534 |
## Knowledge Intensive Learning of Credal NetworksSaurabh Mathur, Alessandro Antonucci, Sriraam NatarajanAbstract
Bayesian networks are a popular class of directed probabilistic graphical models that allow for closed-form learning of the local parameters if complete data are available. However, learning the parameters is challenging when the data are sparse, incomplete, and uncertain. In this work, we present an approach to this problem based on credal networks, a generalization of Bayesian networks based on set-valued local parameters. We derive an algorithm to learn such set-valued parameters from data using qualitative knowledge in the form of monotonic influence statements. Our empirical evaluation shows that using qualitative knowledge reduces uncertainty about the parameters without significant loss in accuracy. |

## ID: 538 |
## Towards Representation Learning for Weighting Problems in Design-Based Causal InferenceOscar Clivio, Avi Feller, Christopher C. HolmesAbstract
Reweighting a distribution to minimize a distance to a target distribution is a powerful and flexible strategy for estimating a wide range of causal effects, but can be challenging in practice because optimal weights typically depend on knowledge of the underlying data generating process. In this paper, we focus on design-based weights, which do not incorporate outcome information; prominent examples include prospective cohort studies, survey weighting, and the weighting portion of augmented weighting estimators. In such applications, we explore the central role of representation learning in finding desirable weights in practice. Unlike the common approach of assuming a well-specified representation, we highlight the error due to the choice of a representation and outline a general framework for finding suitable representations that minimize this error. Building on recent work that combines balancing weights and neural networks, we propose an end-to-end estimation procedure that learns a flexible representation, while retaining promising theoretical properties. We show that this approach is competitive in a range of common causal inference tasks. |

## ID: 540 |
## Iterated INLA for State and Parameter Estimation in Nonlinear Dynamical SystemsRafael Anderka, Marc Peter Deisenroth, So TakaoAbstract
Data assimilation (DA) methods use priors arising from differential equations to robustly interpolate and extrapolate data. Popular techniques such as ensemble methods that handle high-dimensional, nonlinear PDE priors focus mostly on state estimation, however can have difficulty learning the parameters accurately. On the other hand, machine learning based approaches can naturally learn the state and parameters, but their applicability can be limited, or produce uncertainties that are hard to interpret. Inspired by the Integrated Nested Laplace Approximation (INLA) method in spatial statistics, we propose an alternative approach to DA based on iteratively linearising the dynamical model. This produces a Gaussian Markov random field at each iteration, enabling one to use INLA to infer the state and parameters. Our approach can be used for arbitrary nonlinear systems, while retaining interpretability, and is furthermore demonstrated to outperform existing methods on the DA task. By providing a more nuanced approach to handling nonlinear PDE priors, our methodology offers improved accuracy and robustness in predictions, especially where data sparsity is prevalent. |

## ID: 541 |
## Learning Causal Abstractions of Linear Structural Causal ModelsRiccardo Massidda, Sara Magliacane, Davide BacciuAbstract
The need for modelling causal knowledge at different levels of granularity arises in several settings. Causal Abstraction provides a framework for formalizing this problem by relating two Structural Causal Models at different levels of detail. Despite increasing interest in applying causal abstraction, e.g. in the interpretability of large machine learning models, the graphical and parametrical conditions under which a causal model can abstract another are not known. Furthermore, learning causal abstractions from data is still an open problem. In this work, we tackle both issues for linear causal models with linear abstraction functions. First, we characterize how the low-level coefficients and the abstraction function determine the high-level coefficients and how the high-level model constrains the causal ordering of low-level variables. Then, we apply our theoretical results to learn high-level and low-level causal models and their abstraction function from observational data. In particular, we introduce Abs-LiNGAM, a method that leverages the constraints induced by the learned high-level model and the abstraction function to speedup the recovery of the larger low-level model, under the assumption of non-Gaussian noise terms. In simulated settings, we show the effectiveness of learning causal abstractions from data and the potential of our method in improving scalability of causal discovery. |

## ID: 544 |
## CSS: Contrastive Semantic Similarities for Uncertainty Quantification of LLMsShuang Ao, Stefan Rueger, Advaith SiddharthanAbstract
Despite the impressive capability of large language models (LLMs), knowing when to trust their generations remains an open challenge. The recent literature on uncertainty quantification of natural language generation (NLG) utilizes a conventional natural language inference (NLI) classifier to measure the semantic dispersion of LLMs responses. These studies employ logits of NLI classifier for semantic clustering to estimate uncertainty. However, logits represent the probability of the predicted class and barely contain feature information for potential clustering. Alternatively, CLIP (Contrastive Language–Image Pre-training) performs impressively in extracting image-text pair features and measuring their similarity. To extend its usability, we propose Contrastive Semantic Similarity, the CLIP-based feature extraction module to obtain similarity features for measuring uncertainty for text pairs. We apply this method to selective NLG, which detects and rejects unreliable generations for better trustworthiness of LLMs. We conduct extensive experiments with three LLMs on several benchmark question-answering datasets with comprehensive evaluation metrics. Results show that our proposed method performs better in estimating reliable responses of LLMs than comparable baselines. |

## ID: 545 |
## Bayesian Active Learning in the Presence of Nuisance Parameters## [oral]Sabina J. Sloman, Ayush Bharti, Julien Martinelli, Samuel KaskiAbstract
In many settings, such as scientific inference, optimization, and transfer learning, the learner has a well-defined objective, which can be treated as estimation of a target parameter, and no intrinsic interest in characterizing the entire data-generating process. Usually, the learner must also contend with additional sources of uncertainty or variables --- with nuisance parameters. Bayesian active learning, or sequential optimal experimental design, can straightforwardly accommodate the presence of nuisance parameters, and so is a natural active learning framework for such problems. However, the introduction of nuisance parameters can lead to bias in the Bayesian learner's estimate of the target parameters, a phenomenon we refer to as negative interference. We characterize the threat of negative interference and how it fundamentally changes the nature of the Bayesian active learner's task. We show that the extent of negative interference can be extremely large, and that accurate estimation of the nuisance parameters is critical to reducing it. The Bayesian active learner is confronted with a dilemma: whether to spend a finite acquisition budget in pursuit of estimation of the target or of the nuisance parameters. Our setting encompasses Bayesian transfer learning as a special case, and our results shed light on the phenomenon of negative transfer between learning environments. |

## ID: 549 |
## Common Event Tethering to Improve Prediction of Rare Clinical Events## [spotlight]Quinn Lanners, Qin Weng, Marie-Louise Meng, Matthew M. EngelhardAbstract
Learning to predict rare medical events is difficult due to the inherent lack of signal in highly imbalanced datasets. Yet, oftentimes we also have access to surrogate or related outcomes that we believe share etiology or underlying risk factors with the event of interest. In this work, we propose the use of two variants of a well-known approach, regularized multi-label learning (MLL), that we hypothesize are uniquely suited to leverage this similarity and improve model performance in rare event settings. Whereas most analyses of MLL emphasize improved performance across all event types, our analyses quantify benefits to rare event prediction offered by our approach when a more common, related event is available to enhance learning. We begin by deriving asymptotic properties and providing theoretical insight into the convergence rates of our proposed estimators. We then provide simulation results highlighting how characteristics of the data generating process, including the event similarity and event rate, affect our proposed models' performance. We conclude by showing real-world benefit of our approach in two clinical settings: prediction of rare cardiovascular morbidities in the setting of preeclampsia; and early prediction of autism from the electronic health record. |

## ID: 550 |
## One Shot Inverse Reinforcement Learning for Stochastic Linear BanditsEtash Kumar Guha, Jim Thannikary James, Krishna Acharya, Vidya Muthukumar, Ashwin PananjadyAbstract
The paradigm of inverse reinforcement learning (IRL) is used to specify the reward function of an agent purely from its actions and is critical for value alignment and AI safety. While IRL is successful in practice, theoretical guarantees remain nascent. Motivated by the need for IRL in large action spaces with limited data, we consider as a first step the problem of learning from a single sequence of actions (i.e., a demonstration) of a stochastic linear bandit algorithm. When the demonstrator employs the Phased Elimination algorithm, we develop a simple inverse learning procedure that estimates the linear reward function consistently in the time horizon with just a \textit{single} demonstration. In particular, we show that our inverse learner approximates the true reward parameter within a error of $\mathcal{O}(T^{-\frac{\omega - 1}{2\omega }})$ (where $T$ is the length of the demonstrator's trajectory and $\omega$ is a constant that depends on the geometry of the action set). We complement this result with an information-theoretic lower bound for any inverse learning procedure. We corroborate our theoretical results with simulations on synthetic data and a demonstration constructed from the MovieLens dataset. |

## ID: 561 |
## Local Discovery by Partitioning: Polynomial-Time Causal Discovery Around Exposure-Outcome PairsJacqueline R. M. A. Maasch, Weishen Pan, Shantanu Gupta, Volodymyr Kuleshov, Kyra Gan, Fei WangAbstract
Causal discovery is crucial for causal inference in observational studies, as it can enable the identification of *valid adjustment sets* (VAS) for unbiased effect estimation. However, global causal discovery is notoriously hard in the nonparametric setting, with exponential time and sample complexity in the worst case. To address this, we propose *local discovery by partitioning* (LDP): a local causal discovery method that is tailored for downstream inference tasks without requiring parametric and pretreatment assumptions. LDP is a constraint-based procedure that returns a VAS for an exposure-outcome pair under latent confounding, given sufficient conditions. The total number of independence tests performed is worst-case quadratic with respect to the cardinality of the variable set. Asymptotic theoretical guarantees are numerically validated on synthetic graphs. Adjustment sets from LDP yield less biased and more precise average treatment effect estimates than baseline discovery algorithms, with LDP outperforming on confounder recall, runtime, and test count for VAS discovery. Notably, LDP ran at least $1300\times$ faster than baselines on a benchmark. |

## ID: 562 |
## End-to-End Learning for Fair Multiobjective Optimization Under UncertaintyMy H Dinh, James Kotary, Ferdinando FiorettoAbstract
Many decision processes in artificial intelligence and operations research are modeled by parametric optimization problems whose defining parameters are unknown and must be inferred from observable data. The Predict-Then-Optimize (PtO) paradigm in machine learning aims to maximize downstream decision quality by training the parametric inference model end-to-end with the subsequent constrained optimization. This requires backpropagation through the optimization problem using approximation techniques specific to the problem's form, especially for nondifferentiable linear and mixed-integer programs. This paper extends the PtO methodology to optimization problems with nondifferentiable Ordered Weighted Averaging (OWA) objectives, known for their ability to ensure properties of fairness and robustness in decision models. Through a collection of training techniques and proposed application settings, it shows how optimization of OWA functions can be effectively integrated with parametric prediction for fair and robust optimization under uncertainty. |

## ID: 564 |
## α-Former: Local-Feature-Aware (L-FA) TransformerZhi Xu, Bin Sun, Yue Bai, Yun FuAbstract
Despite the success of current segmentation models powered by the transformer, the camouflaged instance segmentation (CIS) task remains a challenge due to the similarity between the target and the background. To address this issue, we propose a novel approach called the local-feature-aware transformer ($\alpha$-Former), inspired by how humans find the camouflaged instance in a given photograph. We use traditional computer vision descriptors to simulate how humans find the unnatural boundary in a given photograph. Then, the information extracted by traditional descriptors can be employed as prior knowledge to enhance the neural network's performance. Moreover, due to the non-learnable characteristics of traditional descriptors, we designed a learnable binary filter to simulate the traditional descriptors. In order to aggregate the information from the backbone and binary filter, we introduce an adapter to merge local features into the transformer framework. Additionally, we introduce an edge-aware feature fusion module to improve boundary results in the segmentation model. Using the proposed transformer-based encoder-decoder architecture, our $\alpha$-Former surpasses state-of-the-art performance on the COD10K and NC4K datasets. |

## ID: 569 |
## Response Time Improves Gaussian Process Models for Perception and PreferencesMichael Shvartsman, Benjamin Letham, Eytan Bakshy, Stephen L KeeleyAbstract
Models for human choice prediction in preference learning and perception science often use binary response data, requiring many samples to accurately learn latent utilities or perceptual intensities. The response time (RT) to make each choice captures additional information about the decision process, but existing models incorporating RTs for choice prediction do so in a fully parametric way or over discrete inputs. At the same time, state-of-the-art Gaussian process (GP) models of perception and preferences operate on choices only, ignoring RTs. We propose two approaches for incorporating RTs into GP preference and perception models. The first is based on stacking GP models, and the second uses a novel differentiable approximation to the likelihood of the diffusion decision model (DDM), the de-facto standard model for choice RTs. Our RT-choice GPs enable better latent value estimation and held-out choice prediction relative to baselines, which we demonstrate on three real-world multivariate datasets covering both human psychophysics and preference learning. |

## ID: 570 |
## Understanding Pathologies of Deep Heteroskedastic Regression## [oral]Eliot Wong-Toi, Alex James Boyd, Vincent Fortuin, Stephan MandtAbstract
Deep, overparameterized regression models are notorious for their tendency to overfit. This problem is exacerbated in heteroskedastic models, which predict both mean and residual noise for each data point. At one extreme, these models fit all training data perfectly, eliminating residual noise entirely; at the other, they overfit the residual noise while predicting a constant, uninformative mean. We observe a lack of middle ground, suggesting a phase transition dependent on model regularization strength. Empirical verification supports this conjecture by fitting numerous models with varying mean and variance regularization. To explain the transition, we develop a theoretical framework based on a statistical field theory, yielding qualitative agreement with experiments. As a practical consequence, our analysis simplifies hyperparameter tuning from a two-dimensional to a one-dimensional search, substantially reducing the computational burden. Experiments on diverse datasets, including UCI datasets and the large-scale ClimSim climate dataset, demonstrate significantly improved performance in various calibration tasks. |

## ID: 571 |
## Support Recovery in Sparse PCA with General Missing Data## [oral]Hanbyul Lee, Qifan Song, Jean HonorioAbstract
We analyze a sparse PCA algorithm for incomplete and noisy data without any specific model assumption on the data missing scheme. We utilize a graphical approach to characterize general missing patterns, which enables us to analyze the effect of structural properties of missing patterns on the solvability of sparse PCA problem. The sparse PCA method we focus on is a semidefinite relaxation of the $\ell_1$-regularized PCA problem. We provide theoretical justification that the support of the sparse leading eigenvector can be recovered with high probability using the algorithm, under certain conditions. The conditions involve the spectral gap between the largest and second-largest eigenvalues of the true data matrix, the magnitude of the noise, and the structural properties of the missing pattern. The concepts of algebraic connectivity and irregularity are used to describe the properties in a graphical way. We empirically justify our theorem with synthetic data analysis. We show that the SDP algorithm outperforms other sparse PCA approaches especially when the observation pattern has good structural properties. As a by-product of our analysis, we provide two theorems to handle general missing schemes, which can be applied to other problems related to incomplete data matrices. |

## ID: 575 |
## A General Identification Algorithm For Data Fusion Problems Under Systematic Selection## [oral]Jaron J.R. Lee, AmirEmad Ghassami, Ilya ShpitserAbstract
Causal inference is made challenging by confounding, selection bias, and other complications. A common approach to addressing these difficulties is the inclusion of auxiliary data on the superpopulation of interest. Such data may measure a different set of variables, or be obtained under different experimental conditions than the primary dataset. Analysis based on multiple datasets must carefully account for similarities between datasets, while appropriately accounting for differences. In addition, selection of experimental units into different datasets may be systematic; similar difficulties are encountered in missing data problems. Existing methods for combining datasets either do not consider this issue, or assume simple selection mechanisms. In this paper, we provide a general approach, based on graphical causal models, for causal inference from data on the same superpopulation that is obtained under different experimental conditions. Our framework allows both arbitrary unobserved confounding, and arbitrary selection processes into different experimental regimes in our data. We describe how systematic selection processes may be organized into a hierarchy similar to censoring processes in missing data: selected completely at random (SCAR), selected at random (SAR), and selected not at random (SNAR). In addition, we provide a general identification algorithm for interventional distributions in this setting. |

## ID: 581 |
## Privacy-Aware Randomized Quantization via Linear ProgrammingZhongteng Cai, Xueru Zhang, Mohammad Mahdi KhaliliAbstract
Differential privacy mechanisms such as the Gaussian or Laplace mechanism have been widely used in data analytics for preserving individual privacy. However, they are mostly designed for continuous outputs and are unsuitable for scenarios where discrete values are necessary. Although various quantization mechanisms were proposed recently to generate discrete outputs under differential privacy, the outcomes are either biased or have an inferior accuracy-privacy trade-off. In this paper, we propose a family of quantization mechanisms that is unbiased and differentially private. It has a high degree of freedom and we show that some existing mechanisms can be considered as special cases of ours. To find the optimal mechanism, we formulate a linear optimization that can be solved efficiently using linear programming tools. Experiments show that our proposed mechanism can attain a better privacy-accuracy trade-off compared to baselines. |

## ID: 582 |
## Decision-Focused Evaluation of Worst-Case Distribution ShiftKevin Ren, Yewon Byun, Bryan WilderAbstract
Recent studies have shown that performance on downstream optimization tasks often diverges from standard accuracy-based losses, highlighting that the loss function of a predictive model should align with the decision task of the downstream optimizer. Despite this observation, no work— to our knowledge—has yet examined the impact of this divergence for distribution shift. In this paper, we demonstrate that worst-case distribution shifts identified by traditional average accuracy-based metrics fundamentally differ from those for the downstream decision task at hand. We introduce a novel framework that employs a hierarchical model structure to identify worst-case distribution shifts in predictive resource allocation settings. This task is more difficult than in standard distribution shift settings because of combinatorial interactions, where decisions depend on the joint presence of individuals in the allocation task. We show that the problem can be reformulated as a submodular optimization problem, enabling efficient approximations, to capture shifts both within and across instances of the optimization problem. |

## ID: 583 |
## Hidden Population Estimation with Indirect Inference and Auxiliary Information## [spotlight]Justin David Naggar Weltz, Eric Laber, Alexander VolfovskyAbstract
Many populations defined by illegal or stigmatized behavior are difficult to sample using conventional survey methodology. Respondent Driven Sampling (RDS) is a participant referral process frequently employed in this context to collect information. This sampling methodology can be modeled as a stochastic process that explores the graph of a social network, generating a partially observed subgraph between study participants. The methods currently used to impute the missing edges in this subgraph exhibit biased downstream estimation. We leverage auxiliary participant information and concepts from indirect inference to ameliorate these issues and improve estimation of the hidden population size. These advances result in smaller bias and higher precision in the estimation of the study participant arrival rate, the sample subgraph, and the population size. Lastly, we use our method to estimate the number of People Who Inject Drugs (PWID) in the Kohtla-Jarve region of Estonia. |

## ID: 591 |
## Active Learning Framework for Incomplete NetworksTung Khong, Cong Tran, Cuong PhamAbstract
Significant progression has been made in active learning algorithms for graph networks in various tasks. However real-world applications frequently involve incomplete graphs with missing links, which pose the challenge that existing approaches might not adequately address. This paper presents an active learning approach tailored specifically for handling incomplete graphs, termed ALIN. Our algorithm employs graph neural networks (GNN) to generate node embeddings and calculates losses for both node classification and link prediction tasks. The losses are combined with appropriate weights and iteratively updating the GNN, ALIN efficiently queries nodes in batches, thereby achieving a balance between training feedbacks and resource utilization. Our empirical experiments have shown ALIN can surpass state-of-the-art baselines on Cora, Citeseer, Pubmed, and Coauthor-CS datasets. |

## ID: 596 |
## Masking the Unknown: Leveraging Masked Samples for Enhanced Data AugmentationXun Yao, Zijian Huang, Xinrong Hu, JACK Yang, Yi GuoAbstract
Data Augmentation (DA) has become a widely adopted strategy for addressing data scarcity in numerous NLP tasks, especially in scenarios with limited resources or imbalanced classes. However, many existing augmentation techniques rely on randomness or additional resources, presenting challenges in both performance and practical implementation. Furthermore, there is a lack of exploration into what constitutes effective augmentation. In this paper, we systematically evaluate existing DA methods across a comprehensive range of text-classification benchmarks. The empirical analysis highlights that the most significant change resulting from augmentation is observed in the data variance. This observation inspires the proposed approach, termed Mask-for-Data Augmentation (M4DA), which strategically masks tokens from original samples for augmentation. Specifically, M4DA consists of a Variance-Oriented Masker Module (VMM), which ensures an increase in data variances, and a Complexity-Enhanced Selection Module (CSM), designed to select the augmented sample with the highest semantic complexity. The effectiveness of the proposed method is empirically validated across various text-classification benchmarks, including scenarios with limited or full resources and imbalanced classes. Experimental results demonstrate considerable improvements over state-of-the-arts. |

## ID: 597 |
## On Convergence of Federated Averaging Langevin DynamicsWei Deng, Qian Zhang, Yian Ma, Zhao Song, Guang LinAbstract
We propose a federated averaging Langevin algorithm (FA-LD) for uncertainty quantification and mean predictions with distributed clients. In particular, we generalize beyond normal posterior distributions and consider a general class of models. We develop theoretical guarantees for FA-LD for strongly log-concave distributions with non-i.i.d data and study how the injected noise and the stochastic-gradient noise, the heterogeneity of data, and the varying learning rates affect the convergence. Such an analysis sheds light on the optimal choice of local updates to minimize the communication cost. Important to our approach is that the communication efficiency does not deteriorate with the injected noise in the Langevin algorithms. In addition, we examine in our FA-LD algorithm both independent and correlated noise used over different clients. We observe that there is a trade-off between the pairs among communication, accuracy, and data privacy. As local devices may become inactive in federated networks, we also show convergence results based on different averaging schemes where only partial device updates are available. In such a case, we discover an additional bias that does not decay to zero. |

## ID: 599 |
## Optimization Framework for Semi-supervised Attributed Graph CoarseningManoj Kumar, Subhanu Halder, Archit Kane, Ruchir Gupta, Sandeep KumarAbstract
In data-intensive applications, graphs serve as foundational structures across various domains. However, the increasing size of datasets poses significant challenges to performing downstream tasks. To address this problem, techniques such as graph coarsening, condensation, and summarization have been developed to create a coarsened graph while preserving important properties of the original graph by considering both the graph matrix and the feature or attribute matrix of the original graph as inputs. However, existing graph coarsening techniques often neglect the label information during the coarsening process, which can result in a lower-quality coarsened graph and limit its suitability for downstream tasks. To overcome this limitation, we introduce the Label-Aware Graph Coarsening (LAGC) algorithm, a semi-supervised approach that incorporates the graph matrix, feature matrix, and some of the node label information to learn a coarsened graph. Our proposed formulation is a non-convex optimization problem that is efficiently solved using block successive upper bound minimization(BSUM) technique, and it is provably convergent. Our extensive results demonstrate that the LAGC algorithm outperforms the existing state-of-the-art method by a significant margin. |

## ID: 607 |
## Power Mean Estimation in Stochastic Monte-Carlo Tree SearchTuan Quang Dam, Odalric-Ambrym Maillard, Emilie KaufmannAbstract
Monte-Carlo Tree Search (MCTS) is a widely-used online planning strategy that effectively combines Monte-Carlo sampling with forward tree search to make optimal decisions in various real-world scenarios. Its success hinges on the application of the Upper Confidence bound for Trees (UCT) algorithm, an extension of the Upper Confidence bound (UCB) method used in multi-arm bandits. However, theoretical investigations of UCT have been found incomplete due to an error in the estimated "logarithmic" bonus term used for action selection in the tree. This issue was addressed by introducing a "polynomial" exploration bonus, which effectively balances the exploration-exploitation trade-off. However, most theoretical studies have focused on deterministic Markov Decision Processes (MDPs), neglecting stochastic settings. In this paper, we introduce Stochastic-Power-UCT, an MCTS algorithm explicitly designed for stochastic MDPs using power mean as the value estimator. We conduct a comprehensive study of the polynomial convergence assurance for selecting the optimal action and estimating the value at the root node within our methodology. Our findings demonstrate that Stochastic-Power-UCT shares the same convergence rate as UCT, with UCT being a special case of Stochastic-Power-UCT. Furthermore, we validate our theoretical results through empirical assessments across diverse stochastic MDP environments, providing empirical evidence to support our method's theoretical claims |

## ID: 609 |
## Recursively-Constrained Partially Observable Markov Decision Processes## [oral]Qi Heng Ho, Tyler Becker, Benjamin Kraske, Zakariya Laouar, Martin S. Feather, Federico Rossi, Morteza Lahijanian, Zachary N SunbergAbstract
Many sequential decision problems involve optimizing one objective function while imposing constraints on other objectives. Constrained Partially Observable Markov Decision Processes (C-POMDP) model this case with transition uncertainty and partial observability. In this work, we first show that C-POMDPs violate the optimal substructure property over successive decision steps and thus may exhibit behaviors that are undesirable for some (e.g., safety critical) applications. Additionally, online re-planning in C-POMDPs is often ineffective due to the inconsistency resulting from this violation. To address these drawbacks, we introduce the Recursively-Constrained POMDP (RC-POMDP), which imposes additional history-dependent cost constraints on the C-POMDP. We show that, unlike C-POMDPs, RC-POMDPs always have deterministic optimal policies and that optimal policies obey Bellman's principle of optimality. We also present a point-based dynamic programming algorithm for RC-POMDPs. Evaluations on benchmark problems demonstrate the efficacy of our algorithm and show that policies for RC-POMDPs produce more desirable behaviors than policies for C-POMDPs. |

## ID: 613 |
## On Overcoming Miscalibrated Conversational Priors in LLM-based ChatBotsChristine Herlihy, Jennifer Neville, Tobias Schnabel, Adith SwaminathanAbstract
We explore the use of Large Language Model (LLM-based) chatbots to power recommender systems. We observe that the chatbots respond poorly when they encounter under-specified requests (e.g., they make incorrect assumptions, hedge with a long response, or refuse to answer). We conjecture that such miscalibrated response tendencies (i.e., conversational priors) can be attributed to LLM fine-tuning by annotators --- single-turn annotations may not capture multi-turn conversation utility, and the annotators' preferences may not even be representative of users interacting with a recommender system. We first analyze public LLM chat logs to conclude that query under-specification is common. Next, we study synthetic recommendation problems with known but latent item utilities, and frame them as Partially Observed Decision Processes (PODP). We find that pre-trained LLMs can be sub-optimal for PODPs and derive better policies that clarify under-specified queries when appropriate. Then, we re-calibrate LLMs by prompting them with learned control messages to approximate the improved policy. Finally, we show empirically that our lightweight learning approach effectively uses logged conversation data to re-calibrate the response strategies of LLM-based chatbots for recommendation tasks. |

## ID: 619 |
## Efficient Interactive Maximization of BP and Weakly Submodular ObjectivesAdhyyan Narang, Omid Sadeghi, Lillian J. Ratliff, Maryam Fazel, Jeff BilmesAbstract
In the context of online interactive machine learning with combinatorial objectives, we extend purely submodular prior work to more general non-submodular objectives. This includes: (1) those that are additively decomposable into a sum of two terms (a monotone submodular and monotone supermodular term, known as a BP decomposition); and (2) those that are only weakly submodular. In both cases, this allows representing not only competitive (submodular) but also complementary (supermodular) relationships between objects, enhancing this setting to a broader range of applications (e.g., movie recommendations, medical treatments, etc.) where this is beneficial. In the two-term case, moreover, we study not only the more typical monolithic feedback approach but also a novel framework where feedback is available separately for each term. With real-world practicality and scalability in mind, we integrate \Nystrom{} sketching techniques to significantly improve the computational complexity, including for the purely submodular case. In the Gaussian process contextual bandits setting, we show sub-linear theoretical regret bounds in all cases. We also empirically show good applicability to recommendation systems and data subset selection. |

## ID: 621 |
## ILP-FORMER: Solving Integer Linear Programming with Sequence to Multi-Label LearningShufeng Kong, Caihua Liu, Carla P GomesAbstract
Integer Linear Programming (ILP) is an essential class of combinatorial optimization problems (COPs). Its inherent NP-hardness has fostered considerable efforts towards the development of heuristic strategies. An emerging approach involves leveraging data-driven methods to automatically learn these heuristics. For example, using deep (reinforcement) learning to recurrently reoptimize an initial solution with Large Neighborhood Search (LNS) has demonstrated exceptional performance across numerous applications. A pivotal challenge within LNS lies in identifying an optimal subset of variables for reoptimization at each stage. Existing methods typically learn a policy to select a subset, either by maintaining a fixed cardinality or by decomposing the subset into independent binary decisions for each variable. However, such strategies overlook the modeling of LNS’s sequential processes and fail to explore the correlations inherent in variable selection. To overcome these shortcomings, we introduce ILP-FORMER, an innovative model that reimagines policy learning as a sequence-to-multi-label classification (MLC) problem. Our approach uniquely integrates a causal transformer encoder to capture the sequential nature of LNS. Additionally, we employ an MLC decoder with contrastive learning to exploit the correlations in variable selection. Our extensive experiments confirm that ILP-FORMER delivers state-of-the-art anytime performance on several ILP benchmarks. Furthermore, ILP-FORMER exhibits impressive generalization capabilities when dealing with larger problem instances. |

## ID: 623 |
## Adaptive Softmax Trees for Many-Class ClassificationRasul Kairgeldin, Magzhan Gabidolla, Miguel Á. Carreira-PerpiñánAbstract
NLP tasks such as language models or document classification involve classification problems with thousands of classes. In these situations, it is difficult to get high predictive accuracy and the resulting model can be huge in number of parameters and inference time. A recent, successful approach is the softmax tree (ST): a decision tree having sparse hyperplane splits at the decision nodes (which make hard, not soft, decisions) and small softmax classifiers at the leaves. Inference here is very fast because only a small subset of class probabilities need to be computed, yet the model is quite accurate. However, a significant drawback is that it assumes a complete tree, whose size grows exponentially with depth. We propose a new algorithm to train a ST of arbitrary structure. The tree structure itself is learned optimally by interleaving steps that grow the structure with steps that optimize the parameters of the current structure. This makes it possible to learn STs that can grow much deeper but in an irregular way, adapting to the data distribution. The resulting STs improve considerably the predictive accuracy while reducing the model size and inference time even further, as demonstrated in datasets with thousands of classes. In addition, they are interpretable to some extent. |

## ID: 626 |
## Optimizing Language Models for Human Preferences is a Causal Inference ProblemVictoria Lin, Eli Ben-Michael, Louis-Philippe MorencyAbstract
As large language models (LLMs) see greater use in academic and commercial settings, there is increasing interest in methods that allow language models to generate texts aligned with human preferences. In this paper, we present an initial exploration of language model optimization for human preferences from *direct outcome datasets*, where each sample consists of a text and an associated numerical outcome measuring the reader's response. We first propose that language model optimization should be viewed as a *causal problem* to ensure that the model correctly learns the relationship between the text and the outcome. We formalize this causal language optimization problem, and we develop a method—*causal preference optimization* (CPO)—that solves an unbiased surrogate objective for the problem. We further extend CPO with *doubly robust* CPO (DR-CPO), which reduces the variance of the surrogate objective while retaining provably strong guarantees on bias. Finally, we empirically demonstrate the effectiveness of (DR-)CPO in optimizing state-of-the-art LLMs for human preferences on direct outcome data, and we validate the robustness of DR-CPO under difficult confounding conditions. |

## ID: 632 |
## A Homogenization Approach for Gradient-Dominated Stochastic OptimizationJiyuan Tan, Chenyu Xue, Chuwen Zhang, Qi Deng, Dongdong Ge, Yinyu YeAbstract
Gradient dominance property is a condition weaker than strong convexity, yet sufficiently ensures global convergence even in non-convex optimization. This property finds wide applications in machine learning, reinforcement learning (RL), and operations management. In this paper, we propose the stochastic homogeneous second-order descent method (SHSODM) for stochastic functions enjoying gradient dominance property based on a recently proposed homogenization approach. Theoretically, we provide its sample complexity analysis, and further present an enhanced result by incorporating variance reduction techniques. Our findings show that SHSODM matches the best-known sample complexity achieved by other second-order methods for gradient-dominated stochastic optimization but without cubic regularization. Empirically, since the homogenization approach only relies on solving extremal eigenvector problem at each iteration instead of Newton-type system, our methods gain the advantage of cheaper computational cost and robustness in ill-conditioned problems. Numerical experiments on several RL tasks demonstrate the better performance of SHSODM compared to other off-the-shelf methods. |

## ID: 636 |
## FedAST: Federated Asynchronous Simultaneous TrainingBaris Askin, Pranay Sharma, Carlee Joe-Wong, Gauri JoshiAbstract
Federated Learning (FL) enables edge devices or clients to collaboratively train machine learning (ML) models without sharing their private data. Much of the existing work in FL focuses on efficiently learning a model for a single task. In this paper, we study simultaneous training of multiple FL models using a common set of clients. The few existing simultaneous training methods employ synchronous aggregation of client updates, which can cause significant delays because large models and/or slow clients can bottleneck the aggregation. On the other hand, a naive asynchronous aggregation is adversely affected by stale client updates. We propose FedAST, a buffered asynchronous federated simultaneous training algorithm that overcomes bottlenecks from slow models and adaptively allocates client resources across heterogeneous tasks. We provide theoretical convergence guarantees for FedAST for smooth non-convex objective functions. Extensive experiments over multiple real-world datasets demonstrate that our proposed method outperforms existing simultaneous FL approaches, achieving up to 46.0% reduction in time to train multiple tasks to completion. |

## ID: 638 |
## Probabilities of Causation for Continuous and Vector VariablesYuta Kawakami, manabu kuroki, Jin TianAbstract
*Probabilities of causation* (PoC) are valuable concepts for explainable artificial intelligence and practical decision-making. PoC are originally defined for scalar binary variables. In this paper, we extend the concept of PoC to continuous treatment and outcome variables, and further generalize PoC to capture causal effects between multiple treatments and multiple outcomes. In addition, we consider PoC for a sub-population and PoC with multi-hypothetical terms to capture more sophisticated counterfactual information useful for decision-making. We provide a nonparametric identification theorem for each type of PoC we introduce. Finally, we illustrate the application of our results on a real-world dataset about education. |

## ID: 639 |
## General Markov Model for Solving Patrolling GamesAndrzej Nagórko, Michał Tomasz Godziszewski, Marcin Waniek, Barbara Rosiak, Małgorzata Róg, Tomasz Paweł MichalakAbstract
Safeguarding critical infrastructure has recently emerged as a global challenge. To address complex security concerns raised by broadening array of threats, effective mobile security forces are essential. A key aspect involves designing optimal patrolling strategies for mobile units. Two bodies of research dealt with this: stochastic patrolling and partially observable stochastic games. Alas, the first approach makes too-far-reaching simplifying assumption and the second one is more expressive but computationally challenging. The model proposed in this paper is inspired by partially observable stochastic games so that it is general enough to enable comprehensive modeling of attacker-defender interactions but a the same time remains computationally friendly. With our proposed robust SHIELD algorithm, we are able to find a defense strategy where the probability of apprehending the attacker can be nearly doubled compared to the state of the art. |

## ID: 644 |
## Identifying Causal Changes Between Linear Structural Equation ModelsVineet Malik, Kevin Bello, Asish Ghoshal, Jean HonorioAbstract
Learning the structures of structural equation models (SEMs) as directed acyclic graphs (DAGs) from data is crucial for representing causal relationships in various scientific domains. Instead of estimating individual DAG structures, it is often preferable to directly estimate changes in causal relations between conditions, such as changes in genetic expression between healthy and diseased subjects. This work studies the problem of directly estimating the difference between two linear SEMs, i.e. *without estimating the individual DAG structures*, given two sets of samples drawn from the individual SEMs. We consider general classes of linear SEMs where the noise distributions are allowed to be Gaussian or non-Gaussian and have different noise variances across the variables in the individual SEMs. We rigorously characterize novel conditions related to the topological layering of the structural difference that lead to the *identifiability* of the difference DAG (DDAG). Moreover, we propose an *efficient* algorithm to identify the DDAG via sequential re-estimation of the difference of precision matrices. A surprising implication of our results is that causal changes can be identifiable even between *non-identifiable* models such as Gaussian SEMs with unequal noise variances. Synthetic experiments are presented to validate our theoretical results and to show the scalability of our method. |

## ID: 646 |
## Products, Abstractions and Inclusions of Causal SpacesSimon Buchholz, Junhyung Park, Bernhard SchölkopfAbstract
Causal spaces have recently been introduced as a measure-theoretic framework to encode the notion of causality. While it has some advantages over established frameworks, such as structural causal models, the theory is so far only developed for single causal spaces. In many mathematical theories, not least the theory of probability spaces of which causal spaces are a direct extension, combinations of objects and maps between objects form a central part. In this paper, taking inspiration from such objects in probability theory, we propose the definitions of products of causal spaces, as well as (stochastic) transformations between causal spaces. In the context of causality, these quantities can be given direct semantic interpretations as causally independent components, abstractions and extensions. |

## ID: 656 |
## Enhancing Patient Recruitment Response in Clinical Trials: an Adaptive Learning FrameworkXinying Fang, Shouhao ZhouAbstract
Patient recruitment remains a key challenge in contemporary clinical trials, often leading to trial failures due to insufficient recruitment rates. To address this issue, we introduce a novel adaptive learning framework that integrates machine learning methods to facilitate evidence-informed recruitment. Through dynamic testing, predictive learning, and adaptive pruning of recruitment plans, the proposed framework ensures superiority over the conventional random assignment approach. We discuss the practical considerations for implementing this framework and conduct a simulation study to assess the overall response rates and chances of improvement. The findings suggest that the proposed approach can substantially enhance patient recruitment efficiency. By systematically optimizing recruitment plan allocation, this adaptive learning framework shows promise in addressing recruitment challenges across broad clinical research settings, potentially transforming how patient recruitment is managed in clinical trials. |

## ID: 668 |
## Non-stationary Domain Generalization: Theory and AlgorithmThai-Hoang Pham, Xueru Zhang, Ping ZhangAbstract
Although recent advances in machine learning have shown its success to learn from independent and identically distributed (IID) data, it is vulnerable to out-of-distribution (OOD) data in an open world. Domain generalization (DG) deals with such an issue and it aims to learn a model from multiple source domains that can be generalized to unseen target domains. Existing studies on DG have largely focused on stationary settings with homogeneous source domains. However, in many applications, domains may evolve along a specific direction (e.g., time, space). Without accounting for such non-stationary patterns, models trained with existing methods may fail to generalize on OOD data. In this paper, we study domain generalization in non-stationary environment. We first examine the impact of environmental non-stationarity on model performance and establish the theoretical upper bounds for the model error at target domains. Then, we propose a novel algorithm based on adaptive invariant representation learning, which leverages the non-stationary pattern to train a model that attains good performance on target domains. Experiments on both synthetic and real data validate the proposed algorithm. |

## ID: 683 |
## Quantization of Large Language Models with an Overdetermined BasisDaniil Merkulov, Daria Cherniuk, Alexander Rudikov, Ivan Oseledets, Ekaterina Muravleva, Aleksandr Mikhalev, Boris KashinAbstract
In this paper, we introduce an algorithm for data quantization based on the principles of Kashin representation. This approach hinges on decomposing any given vector, matrix, or tensor into two factors. The first factor maintains a small infinity norm, while the second exhibits a similarly constrained norm when multiplied by an orthogonal matrix. Surprisingly, the entries of factors after decomposition are well-concentrated around several peaks, which allows us to efficiently replace them with corresponding centroids for quantization purposes. We study the theoretical properties of the proposed approach and rigorously evaluate our compression algorithm in the context of next-word prediction tasks, employing models like OPT of varying sizes. Our findings demonstrate that Kashin Quantization achieves competitive quality in model performance while ensuring superior data compression, marking a significant advancement in the field of data quantization. |

## ID: 686 |
## Label-wise Aleatoric and Epistemic Uncertainty QuantificationYusuf Sale, Paul Hofman, Timo Löhr, Lisa Wimmer, Thomas Nagler, Eyke HüllermeierAbstract
We present a novel approach to uncertainty quantification in classification tasks based on label-wise decomposition of uncertainty measures. This label-wise perspective allows uncertainty to be quantified at the individual class level, thereby improving cost-sensitive decision-making and helping understand the sources of uncertainty. Furthermore, it allows to define total, aleatoric, and epistemic uncertainty on the basis of non-categorical measures such as variance, going beyond common entropy-based measures. In particular, variance-based measures address some of the limitations associated with established methods that have recently been discussed in the literature. We show that our proposed measures adhere to a number of desirable properties. Through empirical evaluation on a variety of benchmark data sets -- including applications in the medical domain where accurate uncertainty quantification is crucial -- we establish the effectiveness of label-wise uncertainty quantification. |

## ID: 687 |
## Revisiting Kernel Attention with Correlated Gaussian Process RepresentationLong Minh Bui, Tho Tran Huu, Duy Dinh, Tan Minh Nguyen, Trong Nghia HoangAbstract
Transformers have increasingly become the de facto method to model sequential data with state-of-the-art performance. Due to its widespread use, being able to estimate and calibrate its modeling uncertainty is important to understand and design robust transformer models. To achieve this, previous works have used Gaussian processes (GPs) to perform uncertainty calibration for the attention units of transformers and attained notable successes. However, such approaches have to confine the transformers to the space of symmetric attention to ensure the necessary symmetric requirement of their GP's kernel specification, which reduces the representation capacity of the model. To mitigate this restriction, we propose the Correlated Gaussian Process Transformer (CGPT), a new class of transformers whose self-attention units are modeled as cross-covariance between two correlated GPs (CGPs). This allows asymmetries in attention and can enhance the representation capacity of GP-based transformers. We also derive a sparse approximation for CGP to make it scale better. Our empirical studies show that both CGP-based and sparse CGP-based transformers achieve better performance than state-of-the-art GP-based transformers on a variety of benchmark tasks. |

## ID: 689 |
## End-to-end Conditional Robust OptimizationAbhilash Reddy Chenreddy, Erick DelageAbstract
The field of Contextual Optimization (CO) integrates machine learning and optimization to solve decision making problems under uncertainty. Recently, a risk sensitive variant of CO, known as Conditional Robust Optimization (CRO), combines uncertainty quantification with robust optimization in order to promote safety and reliability in high stake applications. Exploiting modern differentiable optimization methods, we propose a novel end-to-end approach to train a CRO model in a way that accounts for both the empirical risk of the prescribed decisions and the quality of conditional coverage of the contextual uncertainty set that supports them. While guarantees of success for the latter objective are impossible to obtain from the point of view of conformal prediction theory, high quality conditional coverage is achieved empirically by ingeniously employing a logistic regression differentiable layer within the calculation of coverage quality in our training loss.We show that the proposed training algorithms produce decisions that outperform the traditional estimate then optimize approaches. |

## ID: 691 |
## BEARS Make Neuro-Symbolic Models Aware of their Reasoning Shortcuts## [spotlight]Emanuele Marconato, Samuele Bortolotti, Emile van Krieken, Antonio Vergari, Andrea Passerini, Stefano TesoAbstract
Neuro-Symbolic (NeSy) predictors that conform to symbolic knowledge – encoding, e.g., safety constraints – can be affected by Reasoning Shortcuts (RSs): They learn concepts consistent with the symbolic knowledge by exploiting unintended semantics. RSs compromise reliability and generalization and, as we show in this paper, they are linked to NeSy models being overconfident about the predicted concepts. Unfortunately, the only trustworthy mitigation strategy requires collecting costly dense supervision over the concepts. Rather than attempting to avoid RSs altogether, we propose to ensure NeSy models are aware of the semantic ambiguity of the concepts they learn, thus enabling their users to identify and distrust low-quality concepts. Starting from three simple desiderata, we derive bears (BE Aware of Reasoning Shortcuts), an ensembling technique that calibrates the model’s concept-level confidence without compromising prediction accuracy, thus encouraging NeSy architectures to be uncertain about concepts affected by RSs. We show empirically that bears improves RS-awareness of several state-of-the-art NeSy models, and also facilitates acquiring informative dense annotations for mitigation purposes. |

## ID: 696 |
## Walking the Values in Bayesian Inverse Reinforcement Learning## [spotlight]Ondrej Bajgar, Alessandro Abate, Konstantinos Gatsis, Michael A OsborneAbstract
The goal of Bayesian inverse reinforcement learning (IRL) is recovering a posterior distribution over reward functions using a set of demonstrations from an expert optimizing for a reward unknown to the learner. The resulting posterior over rewards can then be used to synthesize an apprentice policy that performs well on the same or a similar task. A key challenge in Bayesian IRL is bridging the computational gap between the hypothesis space of possible rewards and the likelihood, often defined in terms of Q values: vanilla Bayesian IRL needs to solve the costly forward planning problem -- going from rewards to the Q values -- at every step of the algorithm, which may need to be done thousands of times. We propose to solve this by a simple change: instead of focusing on primarily sampling in the space of rewards, we can focus on primarily working in the space of Q-values, since the computation required to go from Q-values to reward is radically cheaper. Furthermore, this reversion of the computation makes it easy to compute the gradient allowing efficient sampling using Hamiltonian Monte Carlo. We propose ValueWalk -- a new Markov chain Monte Carlo method based on this insight -- and illustrate its advantages on several tasks. |

## ID: 703 |
## Learning to Rank for Active Learning via Multi-Task Bilevel OptimizationZixin Ding, Si Chen, Ruoxi Jia, Yuxin ChenAbstract
Active learning is a promising paradigm for reducing labeling costs by strategically requesting labels to improve model performance. However, existing active learning methods often rely on expensive acquisition functions, extensive model retraining, and multiple rounds of interaction with annotators. To address these limitations, we propose a novel approach for active learning, which aims to select batches of unlabeled instances through a learned surrogate model for data acquisition. A key challenge in this approach is to develop an acquisition function that generalizes well, as the history of data, which forms part of the utility function's input, grows over time. Our novel algorithmic contribution is a multi-task bilevel optimization framework that predicts the relative utility---measured by the validation accuracy---of different training sets, and ensures the learned acquisition function generalizes effectively. For cases where validation accuracy is expensive to evaluate, we introduce efficient interpolation-based surrogate models to estimate the utility function, reducing the evaluation cost. We demonstrate the performance of our approach through extensive experiments on standard active classification benchmarks. |

## ID: 708 |
## Bounding causal effects with leaky instrumentsDavid Watson, Gecia Bravo-Hermsdorff, Lee M. Gunderson, Jordan Penn, Afsaneh Mastouri, Ricardo SilvaAbstract
Instrumental variables (IVs) are a popular and powerful tool for estimating causal effects in the presence of unobserved confounding. However, classical approaches rely on strong assumptions such as the $\textit{exclusion criterion}$, which states that instrumental effects must be entirely mediated by treatments. This assumption often fails in practice. When IV methods are improperly applied to data that do not meet the exclusion criterion, estimated causal effects may be badly biased. In this work, we propose a novel solution that provides $\textit{partial}$ identification in linear systems given a set of $\textit{leaky instruments}$, which are allowed to violate the exclusion criterion to some limited degree. We derive a convex optimization objective that provides provably sharp bounds on the average treatment effect under some common forms of information leakage, and implement inference procedures to quantify the uncertainty of resulting estimates. We demonstrate our method in a set of experiments with simulated data, where it performs favorably against the state of the art. An accompanying $\texttt{R}$ package, $\texttt{leakyIV}$, is available from $\texttt{CRAN}$. |

## ID: 709 |
## Generalization and Learnability in Multiple Instance RegressionKushal Chauhan, Rishi Saket, Lorne Applebaum, Ashwinkumar Badanidiyuru, Chandan Giri, Aravindan RaghuveerAbstract
Multiple instance regression (MIR) was introduced by Ray and Page (2001) as an analogue of multiple instance learning (MIL) in which we are given bags of feature-vectors (instances) and for each bag there is a bag-label which matches the label of one (unknown) primary instance from that bag. The goal is to compute a hypothesis regressor consistent with the underlying instance-labels. A natural approach is to find the best primary instance assignment and regressor optimizing the mse loss on the bags though no formal generalization guarantees were known. Our work is the first to prove generalization error bounds for MIR when the bags are drawn i.i.d. at random. Essentially, with high probability any MIR regressor with low error on sampled bags also has low error on the underlying instance-label distribution. We next study the complexity of linear regression on MIR bags, shown to be NP-hard in general by Ray and Page (2001), who however left open the possibility of arbitrarily good approximations. Significantly strengthening previous work, we prove a strong inapproximability bound: even if there exists zero bag-loss MIR linear regressor on a collection of $2$-sized bags with labels in $[-1,1]$, it is NP-hard to find an MIR linear regressor with bag-loss $< C$ for some absolute constant $C > 0$. Our work also proposes a model training method for MIR based on a novel weighted assignment loss, geared towards handling overlapping bags which have not received much attention previously. We conduct empirical evaluations on synthetic and real-world datasets showing that our method outperforms the baseline MIR methods. |

## ID: 710 |
## Beyond Dirichlet-based Models: When Bayesian Neural Networks Meet Evidential Deep LearningHanjing Wang, Qiang JiAbstract
Bayesian neural networks (BNNs) excel in uncertainty quantification (UQ) by estimating the posterior distribution of model parameters, yet face challenges due to the high computational demands of Bayesian inference. Evidential deep learning methods address this by treating target distribution parameters as random variables with a learnable conjugate distribution, enabling efficient UQ. However, there's debate over whether these methods can accurately estimate epistemic uncertainty due to their single-network, sampling-free nature. In this paper, we combine the strengths of both approaches by distilling BNN knowledge into a Dirichlet-based model, endowing it with a Bayesian perspective and theoretical guarantees. Additionally, we introduce two enhancements to further improve the integration of Bayesian UQ with Dirichlet-based models. To relax the heavy computational load with BNNs, we introduce a self-regularized training strategy using Laplacian approximation (LA) for self-distillation. To alleviate the conjugate prior assumption, we employ an expressive normalizing flow for refining the model in a post-processing manner, where a few training iterations can enhance model performance. The experimental results have demonstrated the effectiveness of our proposed methods in both UQ accuracy and robustness. |

## ID: 720 |
## Stein Random Feature RegressionHouston Warren, Rafael Oliveira, Fabio RamosAbstract
In large-scale regression problems, random Fourier features (RFFs) have significantly enhanced the computational scalability and flexibility of Gaussian processes (GPs) by defining kernels through their spectral density, from which a finite set of Monte Carlo samples can be used to form an approximate low-rank GP. However, the efficacy of RFFs in kernel approximation and Bayesian kernel learning depends on the ability to tractably sample the kernel spectral measure and the quality of the generated samples. We introduce Stein random features (SRF), leveraging Stein variational gradient descent, which can be used to both generate high-quality RFF samples of known spectral densities as well as flexibly and efficiently approximate traditionally non-analytical spectral measure posteriors. SRFs require only the evaluation of log-probability gradients to perform both kernel approximation and Bayesian kernel learning that results in superior performance over traditional approaches. We empirically validate the effectiveness of SRFs by comparing them to baselines on kernel approximation and well-known GP regression problems. |

## ID: 724 |
## Value-Based Abstraction Functions for Abstraction SamplingBobak Pezeshki, Kalev Kask, Alexander Ihler, Rina DechterAbstract
Monte Carlo methods are powerful tools for solving problems involving complex probability distributions. Despite their versatility, these methods often suffer from inefficiencies, especially when dealing with rare events. As such, importance sampling emerged as a prominent technique for alleviating these challenges. Recently, a new scheme called Abstraction Sampling was developed that incorporated stratification to importance sampling over graphical models. However, existing work only explored a limited set of abstraction functions that guide stratification. This study introduces three new classes of abstraction functions combined with seven distinct partitioning schemes, resulting in twenty-one new abstraction functions, each motivated by theory and intuition from both search and sampling domains. An extensive empirical analysis on over 400 problems compares these new schemes highlighting several well-performing candidates. |

## ID: 730 |
## Vertical Validation: Evaluating Implicit Generative Models for Graphs on Thin Support RegionsMai Elkady, Thu Bui, Bruno Ribeiro, David I. InouyeAbstract
There has been a growing excitement that implicit graph generative models could be used to design or discover new molecules for medicine or material design. Because these molecules have not been discovered, they naturally lie in unexplored or scarcely supported regions of the distribution of known molecules. However, prior evaluation methods for implicit graph generative models have focused on validating statistics computed from the thick support (e.g., mean and variance of a graph property). Therefore, there is a mismatch between the goal of generating novel graphs and the evaluation methods. To address this evaluation gap, we design a novel evaluation method called Vertical Validation (VV) that systematically creates thin support regions during the train-test splitting procedure and then reweights generated samples so that they can be compared to the held-out test data. This procedure can be seen as a generalization of the standard train-test procedure except that the splits are dependent on sample features. We demonstrate that our method can be used to perform model selection if performance on thin support regions is the desired goal. As a side benefit, we also show that our approach can better detect overfitting as exemplified by memorization. |

## ID: 736 |
## Causal Discovery with Deductive Reasoning: One Less ProblemJonghwan Kim, Inwoo Hwang, Sanghack LeeAbstract
Constraint-based causal discovery algorithms aim to extract causal relationships between variables of interest by using conditional independence tests (CITs). However, CITs with large conditioning sets often lead to unreliable results due to their low statistical power, propagating errors throughout the course of causal discovery. As the reliability of CITs is crucial for their practical applicability, recent approaches rely on either tricky heuristics or complicated routines with high computational costs to tackle inconsistent test results. Against this background, we propose a principled, simple, yet effective method, coined \textsc{deduce-dep}, which corrects unreliable conditional independence statements by replacing them with deductively reasoned results from lower-order CITs. An appealing property of \textsc{deduce-dep} is that it can be seamlessly plugged into existing constraint-based methods and serves as a modular subroutine. In particular, we showcase the integration of \textsc{deduce-dep} into representative algorithms such as HITON-PC and PC, illustrating its practicality. Empirical evaluation demonstrates that our method properly corrects unreliable CITs, leading to improved performance in causal structure learning. |

## ID: 738 |
## On the Inductive Biases of Demographic Parity-based Fair Learning AlgorithmsHaoyu LEI, Amin Gohari, Farzan FarniaAbstract
Fair supervised learning algorithms assigning labels with little dependence on a sensitive attribute have attracted great attention in the machine learning community. While the demographic parity (DP) notion has been frequently used to measure a model's fairness in training fair classifiers, several studies in the literature suggest potential impacts of enforcing DP in fair learning algorithms. In this work, we analytically study the effect of standard DP-based regularization methods on the conditional distribution of the predicted label given the sensitive attribute. Our analysis shows that an imbalanced training dataset with a non-uniform distribution of the sensitive attribute could lead to a classification rule biased toward the sensitive attribute outcome holding the majority of training data. To control such inductive biases in DP-based fair learning, we propose a sensitive attribute-based distributionally robust optimization (SA-DRO) method improving robustness against the marginal distribution of the sensitive attribute. Finally, we present several numerical results on the application of DP-based learning methods to standard centralized and distributed learning problems. The empirical findings support our theoretical results on the inductive biases in DP-based fair learning algorithms and the debiasing effects of the proposed SA-DRO method. The project code is available at [github.com/lh218/Fairness-IB.git](https://github.com/lh218/Fairness-IB.git). |

## ID: 739 |
## Polynomial Semantics of Tractable Probabilistic Circuits## [oral]Oliver Broadrick, Honghua Zhang, Guy Van den BroeckAbstract
Probabilistic circuits compute multilinear polynomials that represent probability distributions. They are tractable models that support efficient marginal inference. However, various polynomial semantics have been considered in the literature (e.g., network polynomials, likelihood polynomials, generating functions, Fourier transforms, and characteristic polynomials). The relationships between these polynomial encodings of distributions is largely unknown. In this paper, we prove that for binary distributions, each of these probabilistic circuit models is equivalent in the sense that any circuit for one of them can be transformed into a circuit for any of the others with only a polynomial increase in size. They are therefore all tractable for marginal inference on the same class of distributions. Finally, we explore the natural extension of one such polynomial semantics, called probabilistic generating circuits, to categorical random variables, and establish that marginal inference becomes #P-hard. |

## ID: 746 |
## Sound Heuristic Search Value Iteration for Undiscounted POMDPs with Reachability ObjectivesQi Heng Ho, Martin S. Feather, Federico Rossi, Zachary N Sunberg, Morteza LahijanianAbstract
Partially Observable Markov Decision Processes (POMDPs) are powerful models for sequential decision making under transition and observation uncertainties. This paper studies the challenging yet important problem in POMDPs known as the (indefinite-horizon) Maximal Reachability Probability Problem (MRPP), where the goal is to maximize the probability of reaching some target states. This is also a core problem in model checking with logical specifications and is naturally undiscounted (discount factor is one). Inspired by the success of point-based methods developed for discounted problems, we study their extensions to MRPP. Specifically, we focus on trial-based heuristic search value iteration techniques and present a novel algorithm that leverages the strengths of these techniques for efficient exploration of the belief space (informed search via value bounds) while addressing their drawbacks in handling loops for indefinite-horizon problems. The algorithm produces policies with two-sided bounds on optimal reachability probabilities. We prove convergence to an optimal policy from below under certain conditions. Experimental evaluations on a suite of benchmarks show that our algorithm outperforms existing methods in almost all cases in both probability guarantees and computation time. |

## ID: 750 |
## Evaluating Bayesian deep learning for radio galaxy classificationDevina Mohan, Anna M M ScaifeAbstract
The radio astronomy community is rapidly adopting deep learning techniques to deal with the huge data volumes expected from the next generation of radio observatories. Bayesian neural networks (BNNs) provide a principled way to model uncertainty in the predictions made by such deep learning models and will play an important role in extracting well-calibrated uncertainty estimates on their outputs. In this work, we evaluate the performance of different BNNs against the following criteria: predictive performance, uncertainty calibration and distribution-shift detection for the radio galaxy classification problem. |

## ID: 758 |
## Offline Reward Perturbation Boosts Distributional Shift in Online RLZishun Yu, Siteng Kang, Xinhua ZhangAbstract
Offline-to-online reinforcement learning has recently been shown effective in reducing the online sample complexity by first training from offline collected data. However, this additional data source may also invite new poisoning attacks that target offline training. In this work, we reveal such vulnerabilities in $\textit{critic-regularized}$ offline RL by proposing a novel data poisoning attack method, which is stealthy in the sense that the performance during the offline training remains intact, but the online fine-tuning stage will suffer a significant performance drop. Our method leverages the techniques from bi-level optimization to promote the over-estimation/distribution shift under offline-to-online reinforcement learning. Experiments on four environments confirm the satisfaction of the new stealthiness requirement, and can be effective in attacking with only a small budget and without having white-box access to the victim model. |

## ID: 763 |
## Linear Opinion Pooling for Uncertainty Quantification on GraphsClemens Damke, Eyke HüllermeierAbstract
We address the problem of uncertainty quantification for graph-structured data, or, more specifically, the problem to quantify the predictive uncertainty in (semi-supervised) node classification. Key questions in this regard concern the distinction between two different types of uncertainty, aleatoric and epistemic, and how to support uncertainty quantification by leveraging the structural information provided by the graph topology. Challenging assumptions and postulates of state-of-the-art methods, we propose a novel approach that represents (epistemic) uncertainty in terms of mixtures of Dirichlet distributions and refers to the established principle of linear opinion pooling for propagating information between neighbored nodes in the graph. The effectiveness of this approach is demonstrated in a series of experiments on a variety of graph-structured datasets. |

## ID: 778 |
## Neural Architecture Search Finds Robust Models by Knowledge DistillationUtkarsh Nath, Yancheng Wang, Yingzhen YangAbstract
Despite their superior performance, Deep Neural Networks (DNNs) are often vulnerable to adversarial attacks. Neural Architecture Search (NAS), a method for automatically designing the architectures of DNNs, has shown remarkable performance across various machine learning applications. However, the adversarial robustness of architectures learned by NAS against adversarial threats remains under-explored. By integrating a robust teacher, we examine whether NAS can yield a robust neural architecture by inheriting robustness from the teacher. In this paper, we propose Robust Neural Architecture Search by Cross-Layer Knowledge Distillation (RNAS-CL), a novel NAS algorithm that enhances the robustness of architectures learned by NAS through employing cross-layer knowledge distillation from a robust teacher. Distinct from previous knowledge distillation approaches that only align student-teacher outputs at the final layer, RNAS-CL dynamically searches for the optimal teacher layer to guide each student layer. Our experimental findings validate the effectiveness of RNAS-CL, demonstrating that it can generate both compact and adversarially robust neural architectures. Our results pave the way for developing new strategies for compact and robust neural architecture design applicable across various fields. The code of RNAS-CL is available at \url{https://github.com/Statistical-Deep-Learning/RNAS-CL}. |

## ID: 780 |
## Efficiently Deciding Algebraic Equivalence of Bow-Free Acyclic Path DiagramsThijs van OmmenAbstract
For causal discovery in the presence of latent confounders, constraints beyond conditional independences exist that can enable causal discovery algorithms to distinguish more pairs of graphs. Such constraints are not well-understood yet. In the setting of linear structural equation models without bows, we study algebraic constraints and argue that these provide the most fine-grained resolution achievable. We propose efficient algorithms that decide whether two graphs impose the same algebraic constraints, or whether the constraints imposed by one graph are a subset of those imposed by another graph. |