Testing for Conditional Mean Independence with Covariates through Martingale Difference Divergence
Ze Jin, Xiaohan Yan, David S. Matteson
A crucial problem in statistics is to decide whether additional variables are needed in a regression model. We propose a new multivariate test to investigate the conditional mean independence of Y given X conditioning on some known effect Z, i.e., E(Y|X, Z) = E(Y|Z). Assuming that E(Y|Z) and Z are linearly related, we reformulate an equivalent notion of conditional mean independence through transformation, which is approximated in practice. We apply the martingale difference divergence (Shao and Zhang, 2014) to measure conditional mean dependence, and show that the estimation error from approximation is negligible, as it has no impact on the asymptotic distribution of the test statistic under some regularity assumptions. The implementation of our test is demonstrated by both simulations and a financial data example.
Analysis of Thompson Sampling for Graphical Bandits Without the Graphs
Fang Liu, Zizhan Zheng, Ness Shroff
We study multi-armed bandit problems with graph feedback, in which the decision maker is allowed to observe the neighboring actions of the chosen action, in a setting where the graph may vary over time and is never fully revealed to the decision maker. We show that when the feedback graphs are undirected the original Thompson Sampling achieves the optimal (within logarithmic factors) regret $\tilde{O}\left(\sqrt{\beta_0(G)T}\right)$ over time horizon $T$, where $\beta_0(G)$ is the average independence number of the latent graphs. To the best of our knowledge, this is the first result showing that the original Thompson Sampling is optimal for graphical bandits in the undirected setting. A slightly weaker regret bound of Thompson Sampling in the directed setting is also presented. To fill this gap, we propose a variant of Thompson Sampling, that attains the optimal regret in the directed setting within a logarithmic factor. Both algorithms can be implemented efficiently and do not require the knowledge of the feedback graphs at any time.
We investigate structured sparsity methods for variable selection in regression problems where the target depends nonlinearly on the inputs. We focus on general nonlinear functions not limiting a priori the function space to additive models. We propose two new regularizers based on partial derivatives as nonlinear equivalents of group lasso and elastic net. We formulate the problem within the framework of learning in reproducing kernel Hilbert spaces and show how the variational problem can be reformulated into a more practical finite dimensional equivalent. We develop a new algorithm derived from the ADMM principles that relies solely on closed forms of the proximal operators. We explore the empirical properties of our new algorithm for Nonlinear Variable Selection based on Derivatives (NVSD) on a set of experiments and confirm favourable properties of our structured-sparsity models and the algorithm in terms of both prediction and variable selection accuracy.
Identification of Strong Edges in AMP Chain Graphs
Jose M. Peña
The essential graph is a distinguished member of a Markov equivalence class of AMP chain graphs. However, the directed edges in the essential graph are not necessarily strong or invariant, i.e. they may not be shared by every member of the equivalence class. Likewise for the undirected edges. In this paper, we develop a procedure for identifying which edges in an essential graph are strong. We also show how this makes it possible to bound some causal effects when the true chain graph is unknown.
Area under ROC (AUC) is an important metric for binary classification and bipartite ranking problems. However, it is difficult to directly optimizing AUC as a learning objective, so most existing algorithms are based on optimizing a surrogate loss to AUC. One significant drawback of these surrogate losses is that they require pairwise comparisons among training data, which leads to slow running time and increasing local storage for online learning. In this work, we describe a new surrogate loss based on a reformulation of the AUC risk, which does not require pairwise comparison but rankings of the predictions. We further show that the ranking operation can be avoided, and the learning objective obtained based on this surrogate enjoys linear complexity in time and storage. We perform experiments to demonstrate the effectiveness of the online and batch algorithms for AUC optimization based on the proposed surrogate loss.
Efficient Bayesian Inference for a Gaussian Process Density Model
Christian Donner, Manfred Opper
We reconsider a nonparametric density model based on Gaussian processes. By augmenting the model with latent Pólya–Gamma random variables and a latent marked Poisson process we obtain a new likelihood which is conjugate to the model’s Gaussian process prior. The augmented posterior allows for efficient inference by Gibbs sampling and an approximate variational mean field approach. For the latter we utilise sparse GP approximations to tackle the infinite dimensionality of the problem. The performance of both algorithms and comparisons with other density estimators are demonstrated on artificial and real datasets with up to several thousand data points.
Comparing Direct and Indirect Temporal-Difference Methods for Estimating the Variance of the Return
Craig Sherstan, Dylan R. Ashley, Brendan Bennett, Kenny Young, Adam White, Martha White, Richard S. Sutton
Temporal-difference (TD) learning methods are widely used in reinforcement learning to estimate the expected return for each state, without a model, because of their significant advantages in computational and data efficiency. For many applications involving risk mitigation, it would also be useful to estimate the \emph{variance} of the return by TD methods. In this paper, we describe a way of doing this that is substantially simpler than those proposed by Tamar, Di Castro, and Mannor in 2012, or those proposed by White and White in 2016. We show that two TD learners operating in series can learn expectation and variance estimates. The trick is to use the square of the TD error of the expectation learner as the reward of the variance learner, and the square of the expectation learner’s discount rate as the discount rate of the variance learner. With these two modifications, the variance learning problem becomes a conventional TD learning problem to which standard theoretical results can be applied. Our formal results are limited to the table lookup case, for which our method is still novel, but the extension to function approximation is immediate, and we provide some empirical results for the linear function approximation case. Our experimental results show that our direct method behaves just as well as a comparable indirect method, but is generally more robust.
We present a data-driven benchmark system to evaluate the performance of new MCMC samplers. Taking inspiration from the COCO benchmark in optimization, we view this benchmark as having critical importance to machine learning and statistics given the rate at which new samplers are proposed. The common hand-crafted examples to test new samplers are unsatisfactory; we take a meta-learning-like approach to generate realistic benchmark examples from a large corpus of data sets and models. Surrogates of posteriors found in real problems are created using highly flexible density models including modern neural network models. We provide new insights into the real effective sample size of various samplers per unit time and the estimation efficiency of the samplers per sample. Additionally, we provide a meta-analysis to assess the predictive utility of various MCMC diagnostics and perform a nonparametric regression to combine them.
Learning Deep Hidden Nonlinear Dynamics from Aggregate Data
Yisen Wang, Bo Dai, Lingkai Kong, Sarah Monazam Erfani, James Bailey, Hongyuan Zha
Learning nonlinear dynamics from diffusion data is a challenging problem since the individuals observed may be different at different time points, generally following an aggregate behaviour. Existing work cannot handle the tasks well since they model such dynamics either directly on observations or enforce the availability of complete longitudinal individual-level trajectories. However, in most of the practical applications, these requirements are unrealistic: the evolving dynamics may be too complex to be modeled directly on observations, and individual-level trajectories may not be available due to technical limitations, experimental costs and/or privacy issues. To address these challenges, we formulate a model of diffusion dynamics as the {\em hidden stochastic process} via the introduction of hidden variables for flexibility, and learn the hidden dynamics directly on {\em aggregate observations} without any requirement for individual-level trajectories. We propose a dynamic generative model with Wasserstein distance for LEarninG dEep hidden Nonlinear Dynamics (LEGEND) and prove its theoretical guarantees as well. Experiments on a range of synthetic and real-world datasets illustrate that LEGEND has very strong performance compared to state-of-the-art baselines.
Revisiting differentially private linear regression: optimal and adaptive prediction & estimation in unbounded domain
Yu-Xiang Wang
We revisit the problem of linear regression under a differential privacy constraint. By consolidating existing pieces in the literature, we clarify the correct dependence of the feature, label and coefficient domains in the optimization error and estimation error, hence revealing the delicate price of differential privacy in statistical estimation and statistical learning. Moreover, we propose simple modifications of two existing DP algorithms: (a) posterior sampling, (b) sufficient statistics perturbation, and show that they can be upgraded into **adaptive algorithms** that are able to exploit data-dependent quantities and behave nearly optimally **for every instance**. Extensive experiments are conducted on both simulated data and real data, which conclude that both AdaOPS and AdaSSP outperform the existing techniques on nearly all 36 data sets that we test on.
We introduce a class of adjustment rules for a collection of beliefs, extending Lewis' Imaging to absorb probabilistic evidence in generalized settings. Unlike standard tools for belief revision, our proposals may be used when information is inconsistent with an agent's belief base. We show the functionals we introduce are based on the Imaginary counterparts of Probability Kinematics for standard belief revision, and prove they satisfy all standard postulates for belief revision, under certain conditions.
From Deterministic ODEs to Dynamic Structural Causal Models
Paul K. Rubenstein, Stephan Bongers, Joris M. Mooij, Bernhard Schoelkopf
Structural Causal Models are widely used in causal modelling, but how they relate to other modelling tools is poorly understood. In this paper we provide a novel perspective on the relationship between Ordinary Differential Equations and Structural Causal Models. We show how, under certain conditions, the asymptotic behaviour of an Ordinary Differential Equation under non-constant interventions can be modelled using Dynamic Structural Causal Models. In contrast to earlier work, we study not only the effect of interventions on equilibrium states; rather, we model asymptotic behaviour that is dynamic under interventions that vary in time, and include as a special case the study of static equilibria.
Frank-Wolfe Optimization for Symmetric-NMF under Simplicial Constraint
Han Zhao, Geoff Gordon
Symmetric nonnegative matrix factorization has found abundant applications in various domains by providing a symmetric low-rank decomposition of nonnegative matrices. In this paper we propose a Frank-Wolfe (FW) solver to optimize the symmetric nonnegative matrix factorization problem under a simplicial constraint, which has recently been proposed for probabilistic clustering. Compared with existing solutions, this algorithm is simple to implement, and has no hyperparameters to be tuned. Building on the recent advances of FW algorithms in nonconvex optimization, we prove an $O(1/\eps^2)$ convergence rate to $\eps$-approximate KKT points, via a tight bound $\Theta(n^2)$ on the curvature constant, which matches the best known result in unconstrained nonconvex setting using gradient methods. Numerical results demonstrate the effectiveness of our algorithm. As a side contribution, we construct a simple nonsmooth convex problem where the FW algorithm fails to converge to the optimum. This result raises an interesting question about necessary conditions of the success of the FW algorithm on convex problems.
Learning Time Series Segmentation Models from Temporally Imprecise Labels
Roy Adams, Benjamin M. Marlin
This paper considers the problem of learning time series segmentation models when the labeled data is subject to temporal uncertainty or noise. Our approach augments the semi-Markov conditional random field (semi-CRF) model with a probabilistic model of the label observation process. This augmentation allows us to estimate the parameters of the semi-CRF from timestamps corresponding roughly to the occurrence of segment transitions. We show how exact marginal inference can be performed in polynomial time enabling learning based on marginal likelihood maximization. Our experiments on two activity detection problems show that the proposed approach can learn models from temporally imprecise labels, and can successfully refine imprecise segmentations through posterior inference. Finally, we show how inference complexity can be reduced by a factor of 40 using static and model-based pruning of the inference dynamic program.
Multi-Target Optimisation via Bayesian Optimisation and Linear Programming
Alistair Shilton, Santu Rana, Sunil Gupta, Svetha Venkatesh
In Bayesian Multi-Objective optimisation, expected hypervolume improvement is often used to measure the goodness of candidate solutions. However when there are many objectives the calculation of expected hypervolume improvement can become computationally prohibitive. An alternative approach measures the goodness of a candidate based on the distance of that candidate from the Pareto front in objective space. In this paper we present a novel distance-based Bayesian Many-Objective optimisation algorithm. We demonstrate the efficacy of our algorithm on three problems, namely the DTLZ2 benchmark problem, a hyper-parameter selection problem, and high-temperature creep-resistant alloy design.
Stochastic Learning for Sparse Discrete Markov Random Fields with Controlled Gradient Approximation Error
Sinong Geng, Zhaobin Kuang, Jie Liu, Stephen Wright, David Page
We study the L1-regularized maximum likelihood estimator/estimation (MLE) problem for discrete Markov random fields (MRFs), where efficient and scalable learning requires both sparse regularization and approximate inference. To address these challenges, we consider a stochastic learning framework called stochastic proximal gradient (SPG; Honorio 2012a, Atchade et al. 2014, Miasojedow and Rejchel 2016). SPG is an inexact proximal gradient algorithm [Schmidt et al., 2011], whose inexactness stems from the stochastic oracle (Gibbs sampling) for gradient approximation – exact gradient evaluation is infeasible in general due to the NP-hard inference problem for discrete MRFs [Koller and Friedman, 2009]. Theoretically, we provide novel verifiable bounds to inspect and control the quality of gradient approximation. Empirically, we propose the tighten asymptotically (TAY) learning strategy based on the verifiable bounds to boost the performance of SPG.
Active Information Acquisition for Linear Optimization
Shuran Zheng, Bo Waggoner, Yang Liu, Yiling Chen
We consider partially-specified optimization problems where the goal is to actively, but efficiently, acquire missing information about the problem in order to solve it. An algorithm designer wishes to solve a linear program (LP), $\max \V{c}^T \V{x}$ s.t. $\V{A}\V{x} \leq \V{b}, \V{x} \ge \V{0}$, but does not initially know some of the parameters. The algorithm can iteratively choose an unknown parameter and gather information in the form of a noisy sample centered at the parameter's (unknown) value. The goal is to find an approximately feasible and optimal solution to the underlying LP with high probability while drawing a small number of samples. We focus on two cases.
(1) When the parameters $\V{b}$ of the constraints are initially unknown, we propose an efficient algorithm combining techniques from the ellipsoid method for LP and confidence-bound approaches from bandit algorithms. The algorithm adaptively gathers information about constraints only as needed in order to make progress. We give sample complexity bounds for the algorithm and demonstrate its improvement over a naive approach via simulation.
(2) When the parameters $\V{c}$ of the objective are initially unknown, we take an information-theoretic approach and give roughly matching upper and lower sample complexity bounds, with an (inefficient) successive-elimination algorithm.
Meta learning algorithms are effective at obtaining meta models with the capability of solving new tasks quickly. However, they critically require sufficient tasks for meta model training and the resulted model can only solve new tasks similar to the training ones. These limitations make them suffer performance decline in presence of insufficiency of training tasks in target domains and task heterogeneity—the source (model training) tasks presents different characteristics from target (model application) tasks. To overcome these two significant limitations of existing meta learning algorithms, we introduce the cross-domain meta learning framework and propose a new transferable meta learning (TML) algorithm. TML performs meta task adaptation jointly with meta model learning, which effectively narrows divergence between source and target tasks and enables transferring source meta-knowledge to solve target tasks. Thus, the resulted transferable meta model can solve new learning tasks in new domains quickly. We apply the proposed TML to cross-domain few-shot classification problems and evaluate its performance on multiple benchmarks. It performs significantly better and faster than well-established meta learning algorithms and fine-tuned domain-adapted models.
Learning the Causal Structure of Copula Models with Latent Variables
Ruifei Cui, Perry Groot, Moritz Schauer, Tom Heskes
A common goal in psychometrics, sociology, and econometrics is to uncover causal relations among latent variables representing hypothetical constructs that cannot be measured directly, such as attitude, intelligence, and motivation. Through measurement models, these constructs are typically linked to measurable indicators, e.g., responses to questionnaire items. This paper addresses the problem of causal structure learning among such latent variables and other observed variables. We propose the 'Copula Factor PC' algorithm as a novel two-step approach. It first draws samples of the underlying correlation matrix in a Gaussian copula factor model via a Gibbs sampler on rank-based data. These are then translated into an average correlation matrix and an effective sample size, which are taken as input to the standard PC algorithm for causal discovery in the second step. We prove the consistency of our 'Copula Factor PC' algorithm, and demonstrate that it outperforms the PC-MIMBuild algorithm and a greedy step-wise approach. We illustrate our method on a real-world data set about children with Attention Deficit Hyperactivity Disorder.
Learning sparse features from only positive and unlabeled (PU) data is a fundamental task for problems of several domains, such as natural language processing (NLP), computer vision (CV), information retrieval (IR). Considering the numerous amount of unlabeled data, most prevalent methods rely on negative sampling (NS) to increase computational efficiency. However, sampling a fraction of unlabeled data as negative for training may ignore other important examples, and thus lead to non-optimal prediction performance. To address this, we present a fast and generic batch gradient descent optimizer ($f_{BGD}$) to learn from all training examples without sampling. By leveraging sparsity in PU data, we accelerate $f_{BGD}$ by several magnitudes, making its time complexity the same level as the NS-based stochastic gradient descent method. Meanwhile, we observe that the standard batch gradient method suffers from gradient instability issues due to the sparsity property. Driven by a theoretical analysis for this potential cause, an intuitive solution arises naturally. To verify its efficacy, we perform experiments on multiple tasks with PU data across domains, and show that $f_{BGD}$ consistently outperforms NS-based models on all tasks with comparable efficiency.
Esther Derman, Daniel J Mankowitz, Timothy A Mann, Shie Mannor
Robust Reinforcement Learning aims to derive an optimal behavior that accounts for model uncertainty
in dynamical systems. However, previous studies have shown that by considering the worst case scenario,
robust policies can be overly conservative. Our soft-robust framework is an attempt to overcome
this issue. In this paper, we present a novel Soft-Robust Actor-Critic algorithm (SR-AC).
It learns an optimal policy with respect to a distribution over an uncertainty set and stays robust to model uncertainty but avoids the conservativeness of robust strategies. We show the convergence of SR-AC and test the efficiency of our approach
on two different domains by comparing it against regular learning methods and their robust formulations.
Constant Step Size Stochastic Gradient Descent for Probabilistic Modeling
Dmitry Babichev, Francis Bach
Stochastic gradient methods enable learning probabilistic models from large amounts of data. While large step-sizes (learning rates) have shown to be best for least-squares (e.g., Gaussian noise) once combined with parameter averaging, these are not leading to convergent algorithms in general. In this paper, we consider generalized linear models, that is, conditional models based on exponential families. We propose averaging moment parameters instead of natural parameters for constant-step-size stochastic gradient descent.
For finite-dimensional models, we show that this can sometimes (and surprisingly) lead to better predictions than the best linear model.
For infinite-dimensional models, we show that it always converges to optimal predictions, while averaging natural parameters never does.
We illustrate our findings with simulations on synthetic data and classical benchmarks.
Discrete Sampling using Semigradient-based Product Mixtures
Alkis Gotovos, Hamed Hassani, Andreas Krause, Stefanie Jegelka
We consider the problem of inference in discrete probabilistic models, that is, distributions over subsets of a finite ground set. These encompass a range of well-known models in machine learning, such as determinantal point processes and Ising models. Locally-moving Markov chain Monte Carlo algorithms, such as the Gibbs sampler, are commonly used for inference in such models, but their convergence is, at times, prohibitively slow. This is often caused by state-space bottlenecks that greatly hinder the movement of such samplers. We propose a novel sampling strategy that uses a specific mixture of product distributions to propose global moves and, thus, accelerate convergence. Furthermore, we show how to construct such a mixture using semigradient information. We illustrate the effectiveness of combining our sampler with existing ones, both theoretically on an example model, as well as practically on three models learned from real-world data sets.
The uncertainty associated with Human perception is often reduced by one's extensive prior experience and knowledge. Current datasets and systems do not emphasize the necessity and benefit of using such knowledge. In this work, we propose the task of solving a genre of image-puzzles (``image riddles) that require both capabilities involving visual detection (including object, activity recognition) and, knowledge-based or commonsense reasoning. Each puzzle involves a set of images and the question ``"what word connects these images?". We compile a dataset of over 3k riddles where each riddle consists of 4 images and a ground truth answer. The annotations are validated using crowd-sourced evaluation. We also define an automatic evaluation metric to track future progress. Our task bears similarity with the commonly known IQ tasks such as analogy solving, sequence filling that is often used to test intelligence. We develop a Probabilistic Reasoning-based approach that utilizes commonsense knowledge about words and phrases to answer these riddles with a reasonable accuracy. Our approach achieves some promising results for these riddles and provides a strong baseline for future attempts. We make the entire dataset and related materials publicly available to the community (bit.ly/22f9Ala).
We formalize the notion of nesting probabilistic programming queries and investigate the resulting statistical implications. We demonstrate that while query nesting allows the definition of models which could not otherwise be expressed, such as those involving agents reasoning about other agents, existing systems take approaches which lead to inconsistent estimates. We show how to correct this by delineating possible ways one might want to nest queries and asserting the respective conditions required for convergence. We further introduce a new online nested Monte Carlo estimator that makes it substantially easier to ensure these conditions are met, thereby providing a simple framework for designing statistically correct inference engines. We prove the correctness of this online estimator and show that, when using the recommended setup, its asymptotic variance is always better than that of the equivalent fixed estimator, while its bias is always within a factor of two.
Linear mixed models (LMMs) are used extensively to model observations that are not independent. Parameter estimation for LMMs can be computationally prohibitive on big data. State-of-the-art learning algorithms require computational complexity which depends at least linearly on the dimension $p$ of the covariates, and often use heuristics that do not offer theoretical guarantees. We present scalable algorithms for learning high-dimensional LMMs with sublinear computational complexity dependence on $p$. Key to our approach are novel dual estimators which use only kernel functions of the data, and fast computational techniques based on the subsampled randomized Hadamard transform. We provide theoretical guarantees for our learning algorithms, demonstrating the robustness of parameter estimation. Finally, we complement the theory with experiments on large synthetic and real data.
Constraint-based Causal Discovery for Non-Linear Structural Causal Models with Cycles and Latent Confounders
Patrick Forré, Joris M. Mooij
We address the problem of causal discovery from data, making use of the recently proposed causal modeling framework of modular structural causal models (mSCM) to handle cycles, latent confounders and non-linearities. We introduce σ-connection graphs (σ-CG), a new class of mixed graphs (containing undirected, bidirected and directed edges) with additional structure, and extend the concept of σ-separation, the appropriate generalization of the well-known notion of d-separation in this setting, to apply to σ-CGs. We prove the closedness of σ-separation under marginalisation and conditioning and exploit this to implement a test of σ-separation on a σ-CG. This then leads us to the first causal discovery algorithm that can handle non-linear functional relations, latent confounders, cyclic causal relationships, and data from different (stochastic) perfect interventions. As a proof of concept, we show on synthetic data how well the algorithm recovers features of the causal graph of modular structural causal models.
Marginal Weighted Maximum Log-likelihood for Efficient Learning of Perturb-and-Map models
Tatiana Shpakova, Francis Bach, Anton Osokin
We consider the structured-output prediction problem through probabilistic approaches and generalize the ``''perturb-and-MAP'' framework to more challenging weighted Hamming losses, which are crucial in applications. While in principle our approach is a straightforward marginalization, it requires solving many related MAP inference problems. We show that for log-supermodular pairwise models these operations can be performed efficiently using the machinery of dynamic graph cuts. We also propose to use \emph{double} stochastic gradient descent, both on the data and on the perturbations, for efficient learning. Our framework can naturally take weak supervision (e.g., partial labels) into account. We conduct a set of experiments on large-scale character recognition and image segmentation, showing the benefits of our algorithms.
Variational Inference for Gaussian Processes with Panel Count Data
Hongyi Ding, Young Lee, Issei Sato, Masashi Sugiyama
We present the first framework for Gaussian-process-modulated Poisson processes when the temporal data appear in the form of panel counts. Panel count data frequently arise when experimental subjects are observed only at discrete time points and only the numbers of occurrences of the events between subsequent observation times are available. The exact occurrence timestamps of the events are unknown. The method of conducting the efficient variational inference is presented, based on the assumption of a Gaussian-process-modulated intensity function. We derive a tractable lower bound to alleviate the problems of the intractable evidence lower bound inherent in the variational inference framework. Our algorithm outperforms classical methods on both synthetic and three real panel count sets.
A unified probabilistic model for learning latent factors and their connectivities from high-dimensional data
Ricardo Pio Monti, Aapo Hyvarinen
Connectivity estimation is challenging in the context of high-dimensional data. A useful preprocessing step is to group variables into clusters, however, it is not always clear how to do so from the perspective of connectivity estimation. Another practical challenge is that we may have data from multiple related classes (e.g., multiple subjects or conditions) and wish to incorporate constraints on the similarities across classes. We propose a probabilistic model which simultaneously performs both a grouping of variables (i.e., detecting community structure) and estimation of connectivities between the groups which correspond to latent variables. The model is essentially a factor analysis model where the factors are allowed to have arbitrary correlations, while the factor loading matrix is constrained to express a community structure. The model can be applied on multiple classes so that the connectivities can be different between the classes, while the community structure is the same for all classes. We propose an efficient estimation algorithm based on score matching, and prove the identifiability of the model. Finally, we present an extension to directed (causal) connectivities over latent variables. Simulations and experiments on fMRI data validate the practical utility of the method.
Improved Stochastic Trace Estimation using Mutually Unbiased Bases
JK Fitzsimons, MA Osborne, SJ Roberts, JF Fitzsimons
The paper begins by introducing the definition and construction of mutually unbiased bases, which are a widely used concept in quantum information processing but have received little to no attention in the machine learning and statistics literature. We demonstrate their usefulness by using them to create a new sampling technique which offers an improvement on the previously well established bounds of stochastic trace estimation. This approach offers a new state of the art single shot sampling variance while requiring $\mathcal{O}(\log(n))$ random bits for $\x \in \mathbb{R}^{n}$ which significantly improves on traditional methods such as fixed basis methods, Hutchinson's and Gaussian estimators in terms of the number of random bits required and worst case sample variance.
Jiaming Huang, Zhao Li, Vincent W. Zheng, Wen Wen, Yifan Yang, Yuanmi Chen
In this paper, we study the unsupervised multi-view graph embedding (UMGE) problem, which aims to learn graph embedding from multiple perspectives in an unsupervised manner. However, the vast majority of multi-view learning work focuses on non-graph data, and surprisingly there are limited work on UMGE. By systematically analyzing different existing methods for UMGE, we discover that cross-view and nonlinearity play a vital role in efficiently improving graph embedding quality. Motivated by this concept, we develop an unsupervised Multi-viEw nonlineaR Graph Embedding (MERGE) approach to model relational multi-view consistency. Experimental results on five benchmark datasets demonstrate that MERGE significantly outperforms the state-of-the-art baselines in terms of accuracy in node classification tasks without sacrificing the computational efficiency.
Rafael Pinot, Anne Morvan, Florian Yger, Cedric Gouy-Pailler, Jamal Atif
In this paper, we present the first differentially private clustering method for arbitrary-shaped node clusters in a graph. This algorithm takes as input only an approximate Minimum Spanning Tree (MST) $\mathcal{T}$ released under weight differential privacy constraints from the graph. Then, the underlying nonconvex clustering partition is successfully recovered from cutting optimal cuts on $\mathcal{T}$. As opposed to existing methods, our algorithm is theoretically well-motivated. Experiments support our theoretical findings.
We propose a new network architecture, Gated Attention Networks (GaAN), for learning on graphs. Unlike the traditional multi-head attention mechanism, which equally consumes all attention heads, GaAN uses a convolutional sub-network to control each attention head's importance. We demonstrate the effectiveness of GaAN on the inductive node classification problem on large graphs. Moreover, with GaAN as a building block, we construct the Graph Gated Recurrent Unit (GGRU) to address the traffic speed forecasting problem. Extensive experiments on three real-world datasets show that our GaAN framework achieves state-of-the-art results on both tasks.
Causal Learning for Partially Observed Stochastic Dynamical Systems
Søren Wengel Mogensen, Daniel Malinsky, Niels Richard Hansen
Many models of dynamical systems have causal interpretations that support reasoning about the consequences of interventions, suitably defined. Furthermore, local independence has been suggested as a useful independence concept for stochastic dynamical systems. There is, however, no well-developed theoretical framework for causal learning based on this notion of independence. We study independence models induced by directed graphs (DGs) and provide abstract graphoid properties that guarantee that an independence model has the global Markov property w.r.t. a DG. We apply these results to Itô diffusions and event processes. For a partially observed system, directed mixed graphs (DMGs) represent the marginalized local independence model, and we develop, under a faithfulness assumption, a sound and complete learning algorithm of the directed mixed equivalence graph (DMEG) as a summary of all Markov equivalent DMGs.
Variational zero-inflated Gaussian processes with sparse kernels
Pashupati Hegde, Markus Heinonen, Samuel Kaski
Zero-inflated datasets, which have an excess of zero outputs, are commonly encountered in problems such as climate or rare event modelling. Conventional machine learning approaches tend to overestimate the non-zeros leading to poor performance. We propose a novel model family of zero-inflated Gaussian processes (ZiGP) for such zero-inflated datasets, produced by sparse kernels through learning a latent probit Gaussian process that can zero out kernel rows and columns whenever the signal is absent. The ZiGPs are particularly useful for making the powerful Gaussian process networks more interpretable. We introduce sparse GP networks where variable-order latent modelling is achieved through sparse mixing signals. We derive the non-trivial stochastic variational inference tractably for scalable learning of the sparse kernels in both models. The novel output-sparse approach improves both prediction of zero-inflated data and interpretability of latent mixing models.
KBlrn: End-to-End Learning of Knowledge Base Representations with Latent, Relational, and Numerical Features
Alberto Garcia-Duran, Mathias Niepert
We present KBLRN, a framework for end-to-end learning of knowledge base representations from latent, relational, and numerical features. KBLRN integrates feature types with a novel combination of neural representation learning and probabilistic product of experts models. To the best of our knowledge, KBLRN is the first approach that learns representations of knowledge bases by integrating latent, relational, and numerical features. We show that instances of KBLRN outperform existing methods on a range of knowledge base completion tasks. We contribute a novel data set enriching commonly used knowledge base completion benchmarks with numerical features. We also investigate the impact numerical features have on KB completion performance.
Probabilistic AND-OR Attribute Grouping for Zero-Shot Learning
Yuval Atzmon, Gal Chechik
In zero-shot learning (ZSL), a classifier is trained to recognize visual classes without any image samples. Instead, it is given semantic information about the class, like a textual description or a set of attributes. Learning from attributes could benefit from explicitly modeling structure of the attribute space. Unfortunately, learning of general structure from empirical samples is hard with typical dataset sizes.
Here we describe LAGO, a probabilistic model designed to capture natural soft and-or relations across groups of attributes. We show how this model can be learned end-to-end with a deep attribute-detection model. The soft group structure can be learned from data jointly as part of the model, and can also readily incorporate prior knowledge about groups if available. The soft and-or structure succeeds to capture meaningful and predictive structures, improving the accuracy of zero-shot learning on two of three benchmarks.
Finally, LAGO reveals a unified formulation over two ZSL approaches: DAP (Lampert et al., 2009) and ESZSL (Romera-Paredes & Torr, 2015). Interestingly, taking only one singleton group for each attribute, introduces a new soft-relaxation of DAP, that outperforms DAP by ~40.
Sylvester Normalizing Flows for Variational Inference
Rianne van den Berg, Leonard Hasenclever, Jakub Tomczak, Max Welling
Variational inference relies on flexible approximate posterior distributions. Normalizing flows provide a general recipe to con-
struct flexible variational posteriors. We introduce Sylvester normalizing flows, which can be seen as a generalization of planar flows. Sylvester normalizing flows remove the well-known single-unit bottleneck from planar flows, making a single transformation much more flexible. We compare the performance of Sylvester normalizing flows against planar flows and inverse autoregressive flows and demonstrate that they compare favorably on several datasets.
Holistic Representations for Memorization and Inference
Yunpu Ma, Marcel Hildebrandt, Volker Tresp, Stephan Baier
In this paper we introduce a novel holographic memory model for the distributed storage of complex association patterns and apply it to knowledge graphs. In a knowledge graph, a labelled link connects a subject node with an object node, jointly forming a subject-predicate-objects triple. In the presented work, nodes and links have initial random representations, plus holistic representations derived from the initial representations of nodes and links in their local neighbourhoods. A memory trace is represented in the same vector space as the holistic representations themselves. To reduce the interference between stored information, it is required that the initial random vectors should be pairwise quasi-orthogonal. We show that pairwise quasi-orthogonality can be improved by drawing vectors from heavy-tailed distributions, e.g., a Cauchy distribution, and, thus, memory capacity of holistic representations can significantly be improved. Furthermore, we show that, in combination with a simple neural network, the presented holistic representation approach is superior to other methods for link predictions on knowledge graphs.
Simple and practical algorithms for $\ell_p$-norm low-rank approximation
Anastasios Kyrillidis
We propose practical algorithms for entrywise $\ell_p$-norm low-rank approximation, for $p = 1$ or $p = \infty$.
The proposed framework, which is non-convex and gradient-based, is easy to implement and typically attains
better approximations, faster, than state of the art.
From a theoretical standpoint, we show that the proposed scheme can attain $(1 + \varepsilon)$-OPT approximations.
Our algorithms are not hyperparameter-free: they achieve the desiderata only assuming algorithm's hyperparameters are known {\em apriori}---or are at least approximable.
\emph{I.e.}, our theory indicates what problem quantities need to be known, in order to get a good solution within polynomial time, and does not contradict to recent inapproximabilty results, as in \citep{song2017low}.
Quantile-Regret Minimisation in Infinitely Many-Armed Bandits
Arghya Roy Chaudhuri, Shivaram Kalyanakrishnan
The stochastic multi-armed bandit is a well-studied abstraction of decision making in the face of uncertainty.
We consider the setting in which the number of bandit arms is much larger than the possible number of pulls,
and can even be infinite. With the aim of minimising regret with respect to an optimal arm, existing methods
for this setting either assume some structure over the set of arms (Kleinberg et al., 2008, Ray Chowdhury and Gopalan, 2017),
or some property of the reward distribution (Wang et al., 2008). Invariably, the validity of such assumptions—and therefore
the performance of the corresponding methods depends on instance-specific parameters, which might not be known
beforehand.
We propose a conceptually simple, parameter-free, and practically effective alternative.
Specifically, we introduce a notion of regret with respect to the top quantile of a probability distribution over the
expected reward of randomly drawn arms. Our main contribution is an algorithm that achieves
sublinear “quantile-regret”, both (1) when it is specified a quantile, and (2) when the quantile can
be any (unknown) positive value. The algorithm needs no side information about the arms or about
the structure of their reward distributions: it relies on random sampling to reach arms in the top
quantile. Experiments show that our algorithm outperforms several previous methods (in terms of
conventional regret) when the latter are not tuned well, and often even when they are.
Variational Inference for Gaussian Process Models for Survival Analysis
Minyoung Kim, Vladimir Pavlovic
Gaussian process survival analysis model (GPSAM) was recently proposed to address key deficiencies of the Cox proportional hazard model, namely the need to account for uncertainty in the hazard function modeling while, at the same time, relaxing the time-covariates factorized assumption of the Cox model. However, the existing MCMC inference algorithms for GPSAM have proven to be slow in practice. In this paper we propose novel and scalable variational inference algorithms for GPSAM that reduce the time complexity of the sampling approaches and improve scalability to large datasets. We accomplish this by employing two effective strategies in scalable GP: i) using pseudo inputs and ii) approximation via random feature expansions. In both setups, we derive the full and partial likelihood formulations, typically considered in survival analysis settings. The proposed approaches are evaluated on two clinical and a divorce-marriage benchmark datasets, where we demonstrate improvements in prediction accuracy over the existing survival analysis methods, while reducing the complexity of inference compared to the recent state-of-the-art MCMC-based algorithms.
A Cost-Effective Framework for Preference Elicitation and Aggregation
Zhibing Zhao, Haoming Li, Junming Wang, Jeffrey O. Kephart, Nicholas Mattei, Hui Su, Lirong Xia
We propose a cost-effective framework for preference elicitation and aggregation under Plackett-Luce model with features. Given a budget, our framework iteratively computes the most cost-effective elicitation questions in order to help the agents make a better group decision.
We illustrate the viability of the framework with experiments on Amazon Mechanical Turk, which estimate the cost of answering different types of elicitation questions. We compare the prediction accuracy of our framework when adopting various information criteria that evaluate the expected information gain from a question. Our experiments show carefully designed information criteria are much more efficient than randomly asking questions given the budget constraint.
Incremental Learning-to-Learn with Statistical Guarantees
Giulia Denevi, Carlo Ciliberto, Dimitris Stamos, Massimiliano Pontil
In learning-to-learn the goal is to infer a learning algorithm that works well on a class of tasks sampled from an unknown meta-distribution. In contrast to previous work on batch learning-to-learn, we consider a scenario where tasks are presented sequentially and the algorithm needs to adapt incrementally to improve its performance on future tasks. Key to this setting is for the algorithm to rapidly incorporate new observations into the model as they arrive, without keeping them in memory. We focus on the case where the underlying algorithm is Ridge Regression parametrised by a symmetric positive semidefinite matrix. We propose to learn this matrix by applying a stochastic strategy to minimize the empirical error incurred by Ridge Regression on future tasks sampled from the meta-distribution. We study the statistical properties of the proposed algorithm and prove non-asymptotic bounds on its excess transfer risk, that is, the generalization performance on new tasks from the same meta-distribution. We compare our online learning-to-learn approach with a state-of-the-art batch method, both theoretically and empirically.
Bandits with Side Observations: Bounded vs. Logarithmic Regret
Rémy Degenne, Evrard Garcelon, Vianney Perchet
We consider the classical stochastic multi-armed bandit but where, from time to time and roughly with frequency $\epsilon$, an extra observation is gathered by the agent for free. We prove that, no matter how small $\epsilon$ is the agent can ensure a regret uniformly bounded in time.
More precisely, we construct an algorithm with a regret smaller than $\sum_i \frac{\log(1/\epsilon)}{\Delta_i}$, up to multiplicative constant and $\log\log$ terms. We also prove a matching lower-bound, stating that no reasonable algorithm can outperform this quantity.
Sampling and Inference for Beta Neutral-to-the-Left Models of Sparse Networks
Benjamin Bloem-Reddy, Adam Foster, Emile Mathieu, Yee Whye Teh
Empirical evidence suggests that heavy-tailed degree distributions occurring in many real networks are well-approximated by power laws with exponents $\eta$ that may take values either less than and greater than two. Models based on various forms of exchangeability are able to capture power laws with $\eta < 2$, and admit tractable inference algorithms; we draw on previous results to show that $\eta > 2$ cannot be generated by the forms of exchangeability used in existing random graph models. Preferential attachment models generate power law exponents greater than two, but have been of limited use as statistical models due to the inherent difficulty of performing inference in non-exchangeable models. Motivated by this gap, we design and implement inference algorithms for a recently proposed class of models that generates $\eta$ of all possible values. We show that although they are not exchangeable, these models have probabilistic structure amenable to inference. Our methods make a large class of previously intractable models useful for statistical inference.
Estimating the dynamic connectivity structure among a system of entities has garnered much attention in recent years. While usual methods are designed to take advantage of temporal consistency to overcome noise, they conﬂict with the detectability of anomalies. We propose Clustered Fused Graphical Lasso (CFGL), a method using precomputed clustering information to improve the signal detectability as compared to typical Fused Graphical Lasso methods. We evaluate our method in both simulated and real-world datasets and conclude that, in many cases, CFGL can signiﬁcantly improve the sensitivity to signals without a signiﬁcant negative effect on the temporal consistency.
Unsupervised Learning of Latent Physical Properties Using Perception-Prediction Networks
David Zheng, Vinson Luo, Jiajun Wu, Joshua Tenenbaum
We propose a framework for the completely unsupervised learning of latent object properties from their interactions: the perception-prediction network (PPN). Consisting of a perception module that extracts representations of latent object properties and a prediction module that uses those extracted properties to simulate system dynamics, the PPN can be trained in an end-to-end fashion purely from samples of object dynamics. We find that the representations of latent object properties learned by PPNs not only are sufficient to accurately simulate the dynamics of systems comprised of previously unseen objects, but also can be translated directly into human-interpretable properties (e.g. mass, coefficient of restitution) in an entirely unsupervised manner. Crucially, PPNs also generalize to novel scenarios: their gradient-based training can be applied to many dynamical systems and their graph-based structure functions over systems comprised of different numbers of objects. Our results demonstrate the efficacy of graph-based neural architectures in object-centric inference and prediction tasks, and our model has the potential to discover relevant object properties in systems that are not yet well understood.
Stochastic variance-reduced gradient Langevin dynamics (SVRG-LD) was recently proposed to improve the performance of stochastic gradient Langevin dynamics (SGLD) by reducing the variance of the stochastic gradient.
In this paper, we propose a variant of SVRG-LD, namely SVRG-LD$^+$, which replaces the full gradient in each epoch with a subsampled one. We provide a nonasymptotic analysis of the convergence of SVRG-LD$^+$ in $2$-Wasserstein distance, and show that SVRG-LD$^+$ enjoys a lower gradient complexity\footnote{Gradient complexity is defined as the required number of stochastic gradient evaluations to reach a target accuracy.} than SVRG-LD, when the sample size is large or the target accuracy requirement is moderate. Our analysis directly implies a sharper convergence rate for SVRG-LD, which improves the existing convergence rate by a factor of $\kappa^{1/6}n^{1/6}$, where $\kappa$ is the condition number of the log-density function and $n$ is the sample size. Experiments on both synthetic and real-world datasets validate our theoretical results.
Finite-State Controllers of POMDPs using Parameter Synthesis
Sebastian Junges, Nils Jansen, Ralf Wimmer, Tim Quatmann, Leonore Winterer, Joost-Pieter Katoen, Bernd Becker
We study finite-state controllers (FSCs) for partially observable Markov decision processes (POMDPs) that are provably correct with respect to given specifications.
The key insight is that computing (randomised) FSCs on POMDPs is equivalent to---and computationally as hard as---synthesis for parametric Markov chains (pMCs).
This correspondence allows to use tools for synthesis in pMCs to compute correct-by-construction FSCs on POMDPs for a variety of specifications.
Our experimental evaluation shows comparable performance to well-known POMDP solvers.
Identification of Personalized Effects Associated With Causal Pathways
Ilya Shpitser, Eli Sherman
Unlike classical causal inference, where the goal is to estimate average causal effects within a population, in settings such as personalized medicine, the goal is to map a unit’s characteristics to a treatment tailored to maximize the expected outcome for that unit. Obtaining high-quality mappings of this type is the goal of the dynamic regime literature. In healthcare settings, optimizing policies with respect to a particular causal pathway is often of interest as well. In the context of average treatment effects, direct and indirect effects are considered in the mediation analysis literature.
In this paper, we combine mediation analysis and dynamic treatment regime ideas and consider how unit characteristics may be used to tailor a treatment strategy that maximizes a particular path-specific effect. In particular, we define counterfactuals associated with the path-specific effects of a policy, give a general identification algorithm for these counterfactuals, and prove completeness of the algorithm for unrestricted policies. A corollary of our results is that the identification algorithm for responses to policies given in [13] is complete for arbitrary policies.
Subhadeep Karan, Matthew Eichhorn, Blake Hurlburt, Grant Iraci, Jaroslaw Zola
We propose scalable methods to execute counting queries in machine learning applications. To achieve memory and computational efficiency, we abstract counting queries and their context such that the counts can be aggregated as a stream. We demonstrate performance and scalability of the resulting approach on random queries, and through extensive experimentation using Bayesian networks learning and association rule mining. Our methods significantly outperform commonly used ADtrees and hash tables, and are practical alternatives for processing large-scale data.
A Dual Approach to Scalable Verification of Deep Networks
Krishnamurthy Dvijotham, Robert Stanforth, Sven Gowal, Timothy Mann, Pushmeet Kohli
This paper addresses the problem of formally verifying desirable properties of neural net-works,i.e., obtaining provable guarantees that neural networks satisfy specifications relating their inputs and outputs (robustness to bounded norm adversarial perturbations, for example). Most previous work on this topic was limited in its applicability by the size of the net-work, network architecture and the complexity of properties to be verified. In contrast, our framework applies to a general class of activation functions and specifications on neural net-work inputs and outputs. We formulate verification as an optimization problem (seeking to find the largest violation of the specification)and solve a Lagrangian relaxation of the optimization problem to obtain an upper bound on the worst case violation of the specification being verified. Our approach is anytime i.e. it can be stopped at any time and a valid bound on the maximum violation can be obtained. We develop specialized verification algorithms with provable tightness guarantees under special assumptions and demonstrate the practical significance of our general verification approach on a variety of verification tasks.
Understanding Measures of Uncertainty for Adversarial Example Detection
Lewis Smith, Yarin Gal
Measuring uncertainty is a promising technique for detecting adversarial examples, crafted inputs on which the model predicts an incorrect class with high confidence. There are various measures of uncertainty, including predictive entropy and mutual information, each capturing distinct types of uncertainty. We study these measures, and shed light on why mutual information seems to be effective at the task of adversarial example detection. We highlight failure modes for MC dropout, a widely used approach for estimating uncertainty in deep models. This leads to an improved understanding of the drawbacks of current methods, and a proposal to improve the quality of uncertainty estimates using probabilistic model ensembles. We give illustrative experiments using MNIST to demonstrate the intuition underlying the different measures of uncertainty, as well as experiments on a real-world Kaggle dogs vs cats classification dataset.
Causal Discovery in the Presence of Measurement Error
Tineke Blom, Anna Klimovskaia, Sara Magliacane, Joris M. Mooij
Causal discovery algorithms infer causal relations from data based on several assumptions, including notably the absence of measurement error. However, this assumption is most likely violated in practical applications, which may result in erroneous, irreproducible results. In this work we show how to obtain an upper bound for the variance of random measurement error from the covariance matrix of measured variables and how to use this upper bound as a correction for constraint-based causal discovery. We demonstrate a practical application of our approach on both simulated data and real-world protein signaling data.
IDK Cascades: Fast Deep Learning by Learning not to Overthink
Xin Wang, Yujia Luo, Daniel Crankshaw, Alexey Tumanov, Fisher Yu, Joseph E. Gonzalez
Advances in deep learning have led to substantial increases in prediction accuracy but have been accompanied by increases in the cost of rendering predictions. We conjecture that for a majority of real-world inputs, the recent advances in deep learning have created models that effectively “over-think” on simple inputs. In this paper we revisit the classic question of building model cascades that primarily leverage class asymmetry to reduce cost. We introduce the “I Don’t Know”(IDK) prediction cascades framework, a general framework to systematically compose a set of pre-trained models to accelerate inference without a loss in prediction accuracy. We propose two search based methods for constructing cascades as well as a new cost-aware objective within this framework. The proposed IDK cascade framework can be easily adopted in the existing model serving systems without additional model re-training. We evaluate the proposed techniques on a range of benchmarks to demonstrate the effectiveness of the proposed framework.
Learning Fast Optimizers for Contextual Stochastic Integer Programs
Vinod Nair, Dj Dvijotham, Iain Dunning, Oriol Vinyals
We present a novel RL-based approach to learning a fast and highly scalable solver for a two-stage stochastic integer program in the large-scale data setting. Branch and bound solvers such as SCIP do not scale to large datasets for this problem class. Additionally, they solve each instance independently, without any knowledge transfer across instances. We address these limitations with a learnable local search solver that jointly learns two policies, one to generate an initial solution and another to iteratively improve it with local moves. The policies use contextual features for a problem instance as input, which enables learning across instances and generalization to new ones. We also propose learning a policy to estimate a bound on the objective using dual decomposition. Benchmark results show that on test instances our approach rapidly achieves approximately 30% to 2000% better objective value, which SCIP requires more than an order of magnitude more running time to match. Our approach also achieves better solution quality on seven out of eight benchmark problems than more scalable baselines such as Tabu Search and Progressive Hedging.
We developed a novel statistical method to identify structural differences between two networks characterized by structural equation models (SEMs). We propose to reparameterize the model to separate the differential structures from common structures, and then design an algorithm with calibration and construction stages to identify these differential structures.
The calibration stage serves to obtain consistent prediction by building the l_`2 regularized regression of each endogenous variables against prescreened exogenous variables, correcting for potential endogeneity issue. The construction stage consistently selects and estimates both common and differential effects by undertaking l_`1 regularized regression of each endogenous variable against the predicts of other endogenous variables as well as its anchoring exogenous variables. Our method allows easy parallel computation at each stage. Theoretical results are obtained to establish non-asymptotic error
bounds of predictions and estimates at both stages, as well as the consistency of identified common and differential effects. Our studies on synthetic data demonstrated that our proposed method performed much better than independently
constructing the networks. A real data set is analyzed to illustrate the applicability of our method.
We propose sparse-matrix belief propagation, which executes loopy belief propagation in pairwise Markov random fields by replacing indexing over graph neighborhoods with sparse-matrix operations. This abstraction allows for seamless integration with optimized sparse linear algebra libraries, including those that perform matrix and tensor operations on modern hardware such as graphical processing units (GPUs). The sparse-matrix abstraction allows the implementation of belief propagation in a high-level language (e.g., Python) that is also able to leverage the power of GPU parallelization. We demonstrate sparse-matrix belief propagation by implementing it in a modern deep learning framework (PyTorch), measuring the resulting massive improvement in running time, and facilitating future integration into deep learning models.
Sequential Learning under Probabilistic Constraints
Amirhossein Meisami, Henry Lam, Chen Dong, Abhishek Pani
We provide the first study on online learning problems under stochastic constraints that are "soft", i.e., need to be satisfied with high probability. These constraints are imposed on all or some stages of the time horizon so that the stage decision probabilistically satisfies a safety condition that is realized after the decision is made. The distributions that govern these conditions are learned through the collected observations. Under a Bayesian framework, we study a scheme that provides statistical feasibility guarantees through the time horizon, by using posterior Monte Carlo samples to form sampled constraints that leverage the scenario generation approach in chance-constrained programming. We demonstrate how our scheme can be integrated into Thompson sampling and illustrate it with an application in online advertisement.
Filjor Broka, Rina Dechter, Alexander Ihler, Kalev Kask
We present a new sampling scheme for approximating hard to compute queries over graphical models, such as computing the partition function. The scheme builds upon exact algorithms that traverse a weighted directed state-space graph representing a global function over a graphical model (e.g., probability distribution). With the aid of an abstraction function and randomization, the state space can be compacted (or trimmed) to facilitate tractable computation, yielding a Monte Carlo estimate that is unbiased. We present the general idea and analyze its properties analytically and empirically, investigating two specific ideas for picking abstractions - targeting reduction of variance or search space size.
Meta Reinforcement Learning with Latent Variable Gaussian Processes
Steindor Saemundsson, Katja Hofmann, Marc Peter Deisenroth
Learning from small data sets is critical in many practical applications where data collection is time consuming or expensive, e.g., robotics, animal experiments or drug design. Meta learning is one way to increase the data efficiency of learning algorithms by generalizing learned concepts from a set of training tasks to unseen, but related, tasks. Often, this relationship between tasks is hard coded or relies in some other way on human expertise. In this paper, we frame meta learning as a hierarchical latent variable model and infer the relationship between tasks automatically from data. We apply our framework in a model-based reinforcement learning setting and show that our meta-learning model effectively generalizes to novel tasks by identifying how new tasks relate to prior ones from minimal data. This results in up to a 60% reduction in the average interaction time needed to solve tasks compared to strong baselines.
Non-Parametric Path Analysis in Structural Causal Models
Junzhe Zhang, Elias Bareinboim
One of the fundamental tasks in causal inference is to decompose the observed association between a decision X and an outcome Y into its most basic structural mechanisms. In this paper, we introduce counterfactual measures for effects along with a specific mechanism, represented as a path from X to Y in an arbitrary structural causal model. We derive a novel non-parametric decomposition formula that expresses the covariance of X and Y as a sum over unblocked paths from X to Y contained in an arbitrary causal model. This formula allows a fine-grained path analysis without requiring a commitment to any particular parametric form, and can be seen as a generalization of Wright's decomposition method in linear systems (1923,1932) and Pearl's non-parametric mediation formula (2001).
Stochastic Layer-Wise Precision in Deep Neural Networks
Griffin Lacey, Graham W. Taylor, Shawki Areibi
Low precision weights, activations, and gradients have been proposed as a way to improve the computational efficiency and memory footprint of deep neural networks. Recently, low precision networks have even shown to be more robust to adversarial attacks. However, typical implementations of low precision DNNs use uniform precision across all layers of the network. In this work, we explore whether a heterogeneous allocation of precision across a network leads to improved performance, and introduce a learning scheme where a DNN stochastically explores multiple precision configurations through learning. This permits a network to learn an optimal precision configuration. We show on convolutional neural networks trained on MNIST and ILSVRC12 that even though these nets learn a uniform or near-uniform allocation strategy respectively, stochastic precision leads to a favourable regularization effect improving generalization.
Estimation of Personalized Effects Associated With Causal Pathways
Razieh Nabi, Phyllis Kanki, Ilya Shpitser
The goal of personalized decision making is to map a unit's characteristics to an action tailored to maximize the expected outcome for that unit. Obtaining high-quality mappings of this type is the goal of the dynamic regime literature. In healthcare settings, optimizing policies with respect to a particular causal pathway may be of interest as well. For example, we may wish to maximize the chemical effect of a drug given data from an observational study where the chemical effect of the drug on the outcome is entangled with the indirect effect mediated by differential adherence. In such cases, we may wish to optimize the direct effect of a drug, while keeping the indirect effect to that of some reference treatment. [15] shows how to combine mediation analysis and dynamic treatment regime ideas to defines policies associated with causal pathways and counterfactual responses to these policies. In this paper, we derive a variety of methods for learning high quality policies of this type from data, in a causal model corresponding to a longitudinal setting of practical importance. We illustrate our methods via a dataset of HIV patients undergoing therapy, gathered in the Nigerian PEPFAR program.
High-confidence error estimates for learned value functions
Touqir Sajed, Wesley Chung, Martha White
Estimating the value function for a fixed policy is a fundamental problem in reinforcement learning. Policy evaluation algorithms---to estimate value functions---continue to be developed, to improve convergence rates, improve stability and handle variability, particularly for off-policy learning. To understand the properties of these algorithms, the experimenter needs high-confidence estimates of the accuracy of the learned value functions. For environments with small, finite state-spaces, like chains, the true value function can be easily computed, to compute accuracy. For large, or continuous state-spaces, however, this is no longer feasible. In this paper, we address the largely open problem of how to obtain these high-confidence estimates, for general state-spaces. We provide a high-confidence bound on an empirical estimate of the value error to the true value error. We use this bound to design an offline sampling algorithm, which stores the required quantities to repeatedly compute value error estimates for any learned value function. We provide experiments investigating the number of samples required by this offline algorithm in simple benchmark reinforcement learning domains, and highlight that there are still many open questions to be solved for this important problem.
Design of incentives or recommendations to users is becoming more common as platform providers continually emerge. We propose a multi-armed bandit approach to matching incentives to users, whose preferences are unknown a priori and evolving dynamically in time, in a resource constrained environment. We propose an algorithm that combines ideas from three distinct domains: (i) a greedy matching paradigm, (ii) traditional upper confidence bound algorithm (UCB) for bandits, and (iii) mixing times from the theory of Markov chains. For this algorithm, we provide theoretical bounds on the regret and we demonstrate its performance via both synthetic and realistic (matching supply and demand in a bike-sharing platform) examples.
We present a new class of sparse multi-prototype classifiers, designed to combine the computational advantages of sparse predictors with the non-linear power of prototype-based classification techniques. This combination makes sparse multi-prototype models especially well-suited for resource constrained computational platforms, such as those found in IoT devices. We cast our supervised learning problem as a convex-concave saddle point problem and design a provably-fast algorithm to solve it. We complement our theoretical analysis with an empirical study that demonstrates the power of our methodology.
Fast Stochastic Quadrature for Approximate Maximum-Likelihood Estimation
Nico Piatkowski, Katharina Morik
Recent stochastic quadrature techniques for undirected graphical models rely on near-minimax degree-k polynomial approximations to the model's potential function for inferring the partition function. While providing desirable statistical guarantees, typical constructions of such approximations are themselves not amenable to efficient inference. Here, we develop a class of Monte Carlo sampling algorithms for efficiently approximating the value of the partition function, as well as the associated pseudo-marginals. More precisely, for pairwise models with n vertices and m edges, the complexity can be reduced from O(d^k) to O(k^4+kn+m), where d >= 4m is the parameter dimension. We also consider the uses of stochastic quadrature for the problem of maximum-likelihood (ML) parameter estimation. For completely observed data, our analysis gives rise to a probabilistic bound on the log-likelihood of the model. Maximizing this bound yields an approximate ML estimate which, in analogy to the moment-matching of exact ML estimation, can be interpreted in terms of pseudo-moment-matching. We present experimental results illustrating the behavior of this approximate ML estimator.
Marginal MAP is a key task in Bayesian inference and decision-making, and known to be very challenging in general. In this paper, we present an algorithm that blends heuristic search and importance sampling to provide anytime finite-sample bounds for marginal MAP along with predicted MAP solutions. We convert bounding marginal MAP to a surrogate task of bounding a series of summation problems of an augmented graphical model, and then adapt dynamic importance sampling, a recent advance in bounding the partition function, to providing finite-sample bounds for the surrogate task. Those bounds are guaranteed to be tight given enough time, and the values of the predicted MAP solutions will converge to the optimum. Our algorithm runs in an anytime/anyspace manner, which gives flexible trade-offs between memory, time, and solution quality. We demonstrate the effectiveness of our approach empirically on multiple challenging benchmarks in comparison with some state-of-the-art search algorithms.
Acyclic Linear SEMs Obey the Nested Markov Property
Ilya Shpitser, Robin Evans, Thomas S. Richardson
The conditional independence structure induced on the observed marginal distribution by a hidden variable directed acyclic graph (DAG) may be represented by a graphical model represented by mixed graphs called maximal ancestral graphs (MAGs). This model has a number of desirable properties, in particular the set of Gaussian distributions can be parameterized by viewing the graph as a path diagram. Models represented by MAGs have been used for causal discovery [22], and identification theory for causal effects [28].
In addition to ordinary conditional independence constraints, hidden variable DAGs also induce generalized independence constraints. These constraints form the nested Markov property [20]. We first show that acyclic linear SEMs obey this property. Further we show that a natural parameterization for all Gaussian distributions obeying the nested Markov property arises from a generalization of maximal ancestral graphs that we call maximal arid graphs (MArG). We show that every nested Markov model can be associated with a MArG; viewed as a path diagram this MArG parametrizes the Gaussian nested Markov model. This leads directly to methods for ML fitting and computing BIC scores for Gaussian nested models.
A Unified Particle-Optimization Framework for Scalable Bayesian Sampling
Changyou Chen, Ruiyi Zhang, Wenlin Wang, Bai Li, Liqun Chen
There has been recent interest in developing scalable Bayesian sampling methods such as stochastic gradient MCMC (SG-MCMC) and Stein variational gradient descent (SVGD) for big-data analysis. A standard SG-MCMC algorithm simulates samples from a discrete-time Markov chain to approximate a target distribution, thus samples could be highly correlated, an undesired property for SG-MCMC. In contrary, SVGD directly optimizes a set of particles to approximate a target distribution, and thus is able to obtain good approximations with relatively much fewer samples. In this paper, we propose a principle particle-optimization framework based on Wasserstein gradient flows to unify SG-MCMC and SVGD, and to allow new algorithms to be developed. Our framework interprets SG-MCMC as particle optimization on the space of probability measures, revealing a strong connection between SG-MCMC and SVGD. The key component of our framework is several particle-approximate techniques to efficiently solve the original partial differential equations on the space of probability measures. Extensive experiments on both synthetic data and deep neural networks demonstrate the effectiveness and efficiency of our framework for scalable Bayesian sampling.
An Efficient Quantile Spatial Scan Statistic for Finding Unusual Regions in Continuous Spatial Data with Covariates
Travis Moore, Weng-Keen Wong
Domains such as citizen science biodiversity monitoring and real estate sales are producing spatial data with a continuous response and a vector of covariates associated with each spatial data point. A common data analysis task involves finding unusual regions that differ from the surrounding area. Existing techniques compare regions according to the means of their distributions to measure unusualness. Comparing means is not only vulnerable to outliers, but it is also restrictive as an analyst may want to compare other parts of the probability distributions. For instance, an analyst interested in unusual areas for high-end homes would be more interested in the 90th percentile of home sale prices than in the mean. We introduce the Quantile Spatial Scan Statistic (QSSS), which finds unusual regions in spatial data by comparing quantiles of data distributions while accounting for covariates at each data point. We also develop an exact incremental update of the hypothesis test used by the QSSS, which results in a massive speedup over a naive implementation.
The goal of many machine learning tasks is to learn a model that has small population risk. While mini-batch stochastic gradient descent (SGD) and variants are popular approaches for achieving this goal, it is hard to prescribe a clear stopping criterion and to establish high probability convergence bounds to the population risk. In this paper, we introduce Stable Gradient Descent which validates stochastic gradient computations by splitting data into training and validation sets and reuses samples using a differential private mechanism. StGD comes with a natural upper bound on the number of iterations and has high-probability convergence to the population risk. Experimental results illustrate that StGD is empirically competitive and often better than SGD and GD.
Frederick Callaway, Sayan Gul, Paul M. Krueger, Thomas L. Griffiths, Falk Lieder
The efficient use of limited computational resources is an essential ingredient of intelligence. Selecting computations optimally according to rational metareasoning would achieve this, but this is computationally intractable. Inspired by psychology and neuroscience, we propose the first concrete and domain-general learning algorithm for approximating the optimal selection of computations: Bayesian metalevel policy search (BMPS). We derive this general, sample-efficient search algorithm for a computation-selecting metalevel policy based on the insight that the value of information lies between the myopic value of information and the value of perfect information. We evaluate BMPS on three increasingly difficult metareasoning problems: when to terminate computation, how to allocate computation between competing options, and planning. Across all three domains, BMPS achieved near-optimal performance and compared favorably to previously proposed metareasoning heuristics. Finally, we demonstrate the practical utility of BMPS in an emergency management scenario, even accounting for the overhead of metareasoning.
Per-decision Multi-step Temporal Difference Learning with Control Variates
Kristopher De Asis, Richard S. Sutton
Multi-step temporal difference (TD) learning is an important approach in reinforcement learning, as it unifies one-step TD learning with Monte Carlo methods in a way where intermediate algorithms can outperform either extreme. They address a bias-variance trade off between reliance on current estimates, which could be poor, and incorporating longer sampled reward sequences into the updates. Especially in the off-policy setting, where the agent aims to learn about a policy different from the one generating its behaviour, the variance in the updates can cause learning to diverge as the number of sampled rewards used in the estimates increases. In this paper, we introduce per-decision control variates for multi-step TD algorithms, and compare them to existing methods. Our results show that including the control variates can greatly improve performance on both on and off-policy multi-step temporal difference learning tasks.
The Indian Buffet Hawkes Process to Model Evolving Latent Influences
Xi Tan, Vinayak Rao, Jennifer Neville
Temporal events in the real world often exhibit reinforcing dynamics, where earlier events trigger follow-up activity in the near future. A canonical example of modeling such dynamics is the Hawkes process (HP). However, previous HP models do not capture the rich dynamics of real-world activity—which can be driven by multiple latent triggering factors shared by past and future events, with the latent features themselves exhibiting temporal dependency structures. For instance, rather than view a new document just as a response to other documents in the recent past, it is important to account for the factor-structure underlying all previous documents. This structure itself is not fixed, with the influence of earlier documents decaying with time. To this end, we propose a novel Bayesian nonparametric stochastic point process model, the Indian Buffet Hawkes Processes (IBHP), to learn multiple latent triggering factors underlying streaming document/message data. The IBP facilitates the inclusion of multiple triggering factors in the HP, and the HP allows for modeling latent factor evolution in the IBP. We develop a learn- ing algorithm for the IBHP based on Sequential Monte Carlo and demonstrate the effectiveness of the model. In both synthetic and real data experiments, our model achieves equivalent or higher likelihood and provides interpretable topics and shows their dynamics.
We introduce Battling-Bandits – an online learning framework where given a set of n arms, the learner needs to select a subset of k ≥ 2 arms in each round and subsequently observes a stochastic feedback indicating the winner of the round. This framework generalizes the standard Dueling-Bandit framework which applies to several practical scenarios such as medical treatment preferences, recommender systems, search engine optimization etc., where it is easier and more effective to collect feedback for multiple options simultaneously. We develop a novel class of pairwise subset choice model, for modelling the subset-wise winner feedback and propose three algorithms - Battling-Doubler, Battling-MultiSBM and Battling-Duel: While the first two are designed for a special class of linear-link based choice models, the third one applies to a much general class of pairwise-subset choice models with Condorcet winner. We also analyzed their regret guarantees and show the optimality of Battling-Duel proving a matching regret lower bound of Ω(n log T ), which (perhaps surprisingly) shows that the flexibility of playing size-k subsets does not really help to gather information faster than the corresponding dueling case (k = 2), at least for our current feedback model. The efficacy of our algorithms are demonstrated through extensive experimental evaluations on a variety of synthetic and real world datasets.
Adaptive Stochastic Dual Coordinate Ascent for Conditional Random Fields
Rémi Le Priol, Alexandre Piché, Simon Lacoste-Julien
This work investigates the training of conditional random fields (CRFs) via the stochastic dual coordinate ascent (SDCA) algorithm of Shalev- Shwartz and Zhang (2016). SDCA enjoys a linear convergence rate and a strong empirical perfor- mance for binary classification problems. How- ever, it has never been used to train CRFs. Yet it benefits from an “exact” line search with a single marginalization oracle call, unlike previous ap- proaches. In this paper, we adapt SDCA to train CRFs, and we enhance it with an adaptive non- uniform sampling strategy based on block duality gaps. We perform experiments on four standard sequence prediction tasks. SDCA demonstrates performances on par with the state of the art, and improves over it on three of the four datasets, which have in common the use of sparse features.
Adaptive Stratified Sampling for Precision-Recall Estimation
Ashish Sabharwal, Yexiang Xue
We propose a new algorithm for computing a constant-factor approximation of precision-recall (PR) curves for massive noisy datasets produced by generative models. Assessing validity of items in such datasets requires human annotation, which is costly and must be minimized. Our algorithm, AdaStrat, is the first data-aware method for this task. It chooses the next point to query on the PR curve adaptively, based on previous observations. It then selects specific items to annotate using stratified sampling. Under a mild monotonicity assumption, AdaStrat outputs a guaranteed approximation of the underlying precision function, while using a number of annotations that scales very slowly with N, the dataset size. For example, when the minimum precision is bounded by a constant, it issues only log log N precision queries. In general, it has a regret of no more than log log N w.r.t. an oracle that issues queries at data-dependent (unknown) optimal points. On a scaled-up NLP dataset of 3.5M items, AdaStrat achieves a remarkably close approximation of the true precision function using only 18 precision queries, 13x fewer than best previous approaches.
Fast Kernel Approximations for Latent Force Models and Convolved Multiple-Output Gaussian processes
Cristian Guarnizo, Mauricio Álvarez
A latent force model is a Gaussian process with a covariance function inspired by a differential operator. Such covariance function is obtained by performing convolution integrals between Green's functions associated to the differential operators, and covariance functions associated to latent functions. In the classical formulation of latent force models, the covariance functions are obtained analytically by solving a double integral, leading to expressions that involve numerical solutions of different types of error functions. In consequence, the covariance matrix calculation is considerably expensive, because it requires the evaluation of one or more of these error functions. In this paper, we use random Fourier features to approximate the solution of these double integrals obtaining simpler analytical expressions for such covariance functions. We show experimental results using ordinary differential operators and provide an extension to build general kernel functions for convolved multiple output Gaussian processes.
Imitation learning (IL) consists of a set of tools that leverage expert demonstrations to quickly learn policies. However, if the expert is suboptimal, IL can yield policies with inferior performance compared to reinforcement learning (RL). In this paper, we aim to provide an algorithm that combines the best aspects of RL and IL. We accomplish this by formulating several popular RL and IL algorithms in a common mirror descent framework, showing that these algorithms can be viewed as a variation on a single approach. We then propose LOKI, a strategy for policy learning that first performs a small but random number of IL iterations before switching to a policy gradient RL method. We show that if the switching time is properly randomized, LOKI can learn to outperform a suboptimal expert and converge faster than running policy gradient from scratch. Finally, we evaluate the performance of LOKI experimentally in several simulated environments.
Tim Davidson, Luca Falorsi, Nicola De Cao, Thomas Kipf, Jakub M. Tomczak
The Variational Auto-Encoder (VAE) is one of the most used unsupervised machine learning models. But although the default choice of a Gaussian distribution for both the prior and posterior represents a mathematically convenient distribution often leading to competitive results, we show that this parameterization fails to model data with a latent hyperspherical structure. To address this issue we propose using a von Mises-Fisher (vMF) distribution instead, leading to a hyperspherical latent space. Through a series of experiments, we show how such a hyperspherical VAE, or S-VAE, is more suitable for capturing data with a hyperspherical latent structure, while outperforming a normal, N-VAE, in low dimensions on other data types.
Dissociation-Based Oblivious Bounds for Weighted Model Counting
Li Chou, Wolfgang Gatterbauer, Vibhav Gogate
We consider the weighted model counting task which includes important tasks in graphical models, such as computing the partition function and probability of evidence as special cases. We propose a novel partition-based bounding algorithm that exploits logical structure and gives rise to a set of inequalities from which upper (or lower) bounds can be derived efficiently. The bounds come with optimality guarantees under certain conditions and are oblivious in that they require only limited observations of the structure and parameters of the problem. We experimentally compare our bounds with the mini-bucket scheme (which is also oblivious) and show that our new bounds are often superior and never worse on a wide variety of benchmark networks.
Averaging Weights Leads to Wider Optima and Better Generalization
Pavel Izmailov, Dmitrii Podoprikhin, Timur Garipov, Dmitry Vetrov, Andrew Gordon Wilson
Deep neural networks are typically trained by optimizing a loss function with an SGD variant, in conjunction with a decaying learning rate, until convergence. We show that simple averaging of multiple points along the trajectory of SGD, with a cyclical or constant learning rate, leads to better generalization than conventional training, with essentially no computational overhead. We interpret this result by analyzing the geometry of SGD trajectories over the loss surfaces of deep neural networks. Moreover, we show that this Stochastic Weight Averaging (SWA) procedure finds much broader optima than SGD, and approximates the recent Fast Geometric Ensembling (FGE) approach with a single model. Using SWA we achieve notable improvement in test accuracy over conventional SGD training on a range of state-of-the-art residual networks, PyramidNets, DenseNets, and ShakeShake networks on CIFAR-10, CIFAR-100, and ImageNet. In short, SWA is extremely easy to implement, improves generalization in deep learning, and has almost no computational overhead.
Block-Value Symmetries in Probabilistic Graphical Models
Gagan Madan, Ankit Anand, Mausam, Parag Singla
One popular way for lifted inference in probabilistic graphical models is to first merge symmetric states into a single cluster (orbit) and then use these for downstream inference, via variations of orbital MCMC [Niepert, 2012]. These orbits are represented compactly using permutations over variables, and variable-value (VV) pairs, but they can miss several state symmetries in a domain.
We define the notion of permutations over block-value (BV) pairs, where a block is a set of variables. BV strictly generalizes VV symmetries, and can compute many more symmetries for increasing block sizes. To operationalize use of BV permutations in lifted inference, we describe 1) an algorithm to compute BV permutations given a block partition of the variables, 2) BV-MCMC, an extension of orbital MCMC that can sample from BV orbits, and 3) a heuristic to suggest good block partitions. Our experiments show that BV-MCMC can mix much faster compared to vanilla MCMC and orbital MCMC.
Rahul G. Krishnan, Arjun Khandelwal, Rajesh Ranganath, David Sontag
We propose a new way to answer probabilistic queries that span multiple datapoints. We formalize reasoning about the similarity of different datapoints as the evaluation of the Bayes Factor within a hierarchical deep generative model that enforces a separation between the latent variables used for representation learning and those used for reasoning. Under this model, we derive an intuitive estimator for the Bayes Factor that represents similarity as the amount of overlap in representation space shared by different points. The estimator we derive relies on a query-conditional latent reasoning network, that parameterizes a distribution over the latent space of the deep generative model. The latent reasoning network is trained to amortize the posterior-predictive distribution under a hierarchical model using supervised data and a max-margin learning algorithm. We explore how the model may be used to focus the data variations captured in the latent space of the deep generative model and how this may be used to build new algorithms for few-shot learning.
Densified Winner Take All (WTA) Hashing for Sparse Datasets
Beidi Chen, Anshumali Shrivastava
WTA (Winner Take All) hashing has been successfully applied in many large-scale vision applications. This hashing scheme was tailored to take advantage of the comparative reasoning (or order based information), which showed significant accuracy improvements. In this paper, we identify a subtle issue with WTA, which grows with the sparsity of the datasets. This issue limits the discriminative power of WTA. We then propose a solution to this problem based on the idea of Densification which makes use of 2-universal hash functions in a novel way. Our experiments show that Densified WTA Hashing outperforms Vanilla WTA Hashing both in image retrieval and classification tasks consistently and significantly.
Lifted inference reduces the complexity of inference in relational probabilistic models by identifying groups of constants (or atoms) which behave symmetric to each other. A number of techniques have been proposed in the literature for lifting marginal as well MAP inference. We present the first application of lifting rules for marginal-MAP (MMAP), an important inference problem in models having latent variables. Our main contribution is two fold: (1) we define a new equivalence class of variables, called Single Occurrence for MAX (SOM), and show that solution lies at extreme with respect to the SOM variables, i.e., predicate groundings differing only in the instantiation of the SOM variables take the same truth value (2) we define a sub-class SOM-R (SOM Reduce) and exploit properties of extreme assignments to show that MMAP inference can be performed by reducing the domain of SOM-R variables to a single constant. We refer to our lifting technique as the SOM-R rule for lifted MMAP. Combined with existing rules such as decomposer and binomial, this results in a powerful framework for lifted MMAP. Experiments on three benchmark domains show significant gains in both time and memory compared to ground inference as well as lifted approaches not using SOM-R.
Ondrej Kuzelka, Yuyi Wang, Jesse Davis, Steven Schockaert
We consider the problem of predicting plausible missing facts in relational data, given a set of imperfect logical rules. In particular, our aim is to provide bounds on the (expected) number of incorrect inferences that are made in this way. Since for classical inference it is in general impossible to bound this number in a non-trivial way, we consider two inference relations that weaken, but remain close in spirit to classical inference.
Pure Exploration of Multi-Armed Bandits with Heavy-Tailed Payoffs
Xiaotian Yu, Han Shao, Michael R. Lyu, Irwin King
Inspired by heavy-tailed distributions in practical scenarios, we investigate the problem on pure exploration of Multi-Armed Bandits (MAB) with heavy-tailed payoffs by breaking the assumption of payoffs with sub-Gaussian noises in MAB, and assuming that stochastic payoffs from bandits are with finite $p$-th moments, where $p\in (1,+\infty)$. The main contributions in this paper are three-fold. First, we technically analyze tail probabilities of empirical average and truncated empirical average (TEA) for estimating expected payoffs in sequential decisions with heavy-tailed noises via martingales. Second, we propose two effective bandit algorithms based on different prior information (i.e., fixed confidence or fixed budget) for pure exploration of MAB generating payoffs with finite $p$-th moments. Third, we derive theoretical guarantees for the proposed two bandit algorithms, and demonstrate the effectiveness of two algorithms in pure exploration of MAB with heavy-tailed payoffs in synthetic data and real-world financial data.
Counterfactual Normalization: Proactively Addressing Dataset Shift Using Causal Mechanisms
Adarsh Subbaswamy, Suchi Saria
Predictive models can fail to generalize from training to deployment environments because of dataset shift, posing a threat to model reliability in practice. As opposed to previous methods which use samples from the target distribution to reactively correct dataset shift, we propose using graphical knowledge of the causal mechanisms relating variables in a prediction problem to proactively remove variables that participate in spurious associations with the prediction target, allowing models to generalize across datasets. To accomplish this, we augment the causal graph with latent counterfactual variables that account for the underlying causal mechanisms, and show how we can estimate these variables. In our experiments we demonstrate that models using good estimates of the latent variables instead of the observed variables transfer better from training to target domains with minimal accuracy loss in the training domain.
Decentralized Planning for Non-dedicated Agent Teams with Submodular Rewards in Uncertain Environments
Pritee Agrawal, Pradeep Varakantham, William Yeoh
Planning under uncertainty is seen in many domains such as disaster rescue, sensor networks, security patrolling, etc. where individual agents work in a decentralized, yet co-operative manner and are tied together through a global reward function. Existing research has made significant progress by providing a rich framework to represent co-operative and decentralized planning under uncertainty. This class of problems also deal with non-dedication of team members where agents may leave due to high priority tasks(e.g., emergency,accidents etc.) or due to damange to the agent. However, there is very limited literature to handle problems of non-dedication in agent teams in decentralized settings. In this paper, we provide a general model to represent such problems and solution approaches for handling cooperative and decentralized planning under uncertainty for non-dedicated agent teams. We specifically provide two greedy approaches (an offline one and an offline-online one) that are able to deal with agents leaving the team in an effective and efficient way by exploiting the submodularity property. We provide a detailed evaluation of our approaches on existing benchmark problems and demonstrate that our approaches are able to obtain more than 90% of optimal solution quality on benchmark problems from the literature.
A Forest Mixture Bound for Block-Free Parallel Inference
Neal Lawton, Greg Ver Steeg, Aram Galstyan
Coordinate ascent variational inference is an important algorithm for inference in probabilistic models, but it is slow because it updates only a single variable at a time. Block coordinate methods perform inference faster by updating blocks of variables in parallel. However, the speed and stability of these algorithms depends on how the variables are partitioned into blocks. In this paper, we give a stable parallel algorithm for inference in deep exponential families that doesn't require the variables to be partitioned into blocks. We achieve this by lower bounding the ELBO by a new objective we call the forest mixture bound (FM bound) that separates the inference problem for variables within a hidden layer. We apply this to the simple case when all random variables are Gaussian and show empirically that the algorithm converges faster for models that are inherently more forest-like.
Assessing the magnitude of cause-and-effect relations is one of the central challenges found throughout the empirical sciences. The problem of identification of causal effects is concerned with determining whether a causal effect can be computed from a combination of observational data and substantive knowledge about the domain under investigation, which is formally expressed in the form of a causal graph. In many practical settings, however, the knowledge available for the researcher is not strong enough so as to specify a unique causal graph. Another line of investigation attempts to use observational data to learn a qualitative description of the domain called a Markov equivalence class, which is the collection of causal graphs that share the same set of observed features. In this paper, we marry both approaches and study the problem of causal identification from an equivalence class, represented by a partial ancestral graph (PAG). We start by deriving a set of graphical properties of PAGs that are carried over to its induced subgraphs. We then develop an algorithm to compute the effect of an arbitrary set of variables on an arbitrary outcome set. We show that the algorithm is strictly more powerful than the current state of the art found in the literature.
The Variational Homoencoder: Learning to learn high capacity generative models from few examples
Luke B. Hewitt, Maxwell I. Nye, Andreea Gane, Tommi Jaakkola, Joshua B. Tenenbaum
Hierarchical Bayesian methods can unify many related tasks (e.g. k-shot classification, conditional and unconditional generation) as inference within a single generative model. However, when this generative model is expressed as a powerful neural network such as a PixelCNN, we show that existing learning techniques typically fail to effectively use latent variables. To address this, we develop a modification of the Variational Autoencoder in which encoded observations are decoded to new elements from the same class. This technique, which we call a Variational Homoencoder (VHE), produces a hierarchical latent variable model which better utilises latent variables. We use the VHE framework to learn a hierarchical PixelCNN on the Omniglot dataset, which outperforms all existing models on test set likelihood and achieves strong performance on one-shot generation and classification tasks. We additionally validate the VHE on natural images from the YouTube Faces database. Finally, we develop extensions of the model that apply to richer dataset structures such as factorial and hierarchical categories.
Probabilistic Collaborative Representation Learning for Personalized Item Recommendation
Aghiles Salah, Hady W. Lauw
We present Probabilistic Collaborative Representation Learning (PCRL), a new generative model of user preferences and item contexts. The latter builds on the assumption that relationships among items within contexts (e.g., browsing session, shopping cart, etc.) may underlie various aspects that guide the choices people make. Intuitively, PCRL seeks representations of items reflecting various regularities between them that might be useful at explaining user preferences. Formally, it relies on Bayesian Poisson Factorization to model user-item interactions, and uses a multilayered latent variable architecture to learn representations of items from their contexts. PCRL seamlessly integrates both tasks within a joint framework. However, inference and learning under the proposed model are challenging due to several sources of intractability. Relying on the recent advances in approximate inference/learning, we derive an efficient variational algorithm to estimate our model from observations. We further conduct experiments on several real-world datasets to showcase the benefits of the proposed model.
Reforming Generative Autoencoders via Goodness-of-Fit Hypothesis Testing
Aaron Palmer, Dipak Dey, Jinbo Bi
Generative models, while not new, have taken
the deep learning field by storm. However,
the widely used training methods have not
exploited the substantial statistical literature
concerning parametric distributional testing.
Having sound theoretical foundations, these
goodness-of-fit tests enable parts of the black
box to be stripped away. In this paper we use
the Shapiro-Wilk and propose a new multivari-
ate generalization of Shapiro-Wilk to respec-
tively test for univariate and multivariate nor-
mality of the code layer of a generative autoen-
coder. By replacing the discriminator in tradi-
tional deep models with the hypothesis tests,
we gain several advantages: objectively evalu-
ate whether the encoder is actually embedding
data onto a normal manifold, accurately define
when convergence happens, explicitly balance
between reconstruction and encoding training.
Not only does our method produce competitive
results, but it does so in a fraction of the time.
We highlight the fact that the hypothesis tests
used in our model asymptotically lead to the
same solution of the L 2 -Wasserstein distance
metrics used by several generative models to-
day.
Towards Flatter Loss Surface via Nonmonotonic Learning Rate Scheduling
Sihyeon Seong, Yegang Lee, Youngwook Kee, Dongyoon Han, Junmo Kim
Whereas optimizing deep neural networks using stochastic gradient descent has been showed great performances in practice, the rule for setting step size (i.e. learning rate) of gradient descent is not well studied. Although it appears that some intriguing learning rate rules such as ADAM (Kingma and Ba, 2014) have since been developed, they concentrated on improving convergence, not on improving generalization capabilities. Recently, the improved generalization property of the flat minima was revisited, and this research guides us towards promising solutions to many current optimization problems. In this paper, we analyze the flatness of loss surfaces in the lens of robustness to input perturbations and advocate that gradient descent should be guided to reach flatter region of loss surfaces to achieve generalization. Finally, we suggest a learning rate rule for escaping sharp regions of loss surfaces, and we demonstrate the capacity of our approach by performing numerous experiments.
A Lagrangian Perspective on Latent Variable Generative Models
Shengjia Zhao, Jiaming Song, Stefano Ermon
A large number of objectives have been proposed to train latent variable generative models. We show that many of them are Lagrangian dual functions of the same primal optimization problem. The primal problem optimizes the mutual information between latent and visible variables, subject to the constraints of accurately modeling the data distribution and performing correct amortized inference. By choosing to maximize or minimize mutual information, and choosing different Lagrange multipliers, we obtain different objectives including InfoGAN, ALI/BiGAN, ALICE, CycleGAN, beta-VAE, adversarial autoencoders, AVB, AS-VAE and InfoVAE. Based on this observation, we provide an exhaustive characterization of the statistical and computational trade-offs made by all the training objectives in this class of Lagrangian duals. Next, we propose a dual optimization method where we optimize model parameters as well as the Lagrange multipliers. This method achieves Pareto optimal solutions in terms of optimizing information and satisfying the constraints.
Stephan Eismann, Daniel Levy, Rui Shu, Stefan Bartzsch, Stefano Ermon
Automatic design via Bayesian optimization holds great promise given the constant increase of available data across domains. However, it faces difficulties from high-dimensional, potentially discrete, search spaces. We propose to probabilistically embed inputs into a lower dimensional, continuous latent space, where we perform gradient-based optimization guided by a Gaussian process. Building on variational autoncoders, we use both labeled and unlabeled data to guide the encoding and increase its accuracy. In addition, we propose an adversarial extension to render the latent representation invariant with respect to specific design attributes, which allows us to transfer these attributes across structures. We apply the framework both to a functional-protein dataset and to perform optimization of drag coefficients directly over high-dimensional shapes
without incorporating domain knowledge or handcrafted features.
Join Graph Decomposition Bounds for Influence Diagrams
Junkyu Lee, Alexander Ihler, Rina Dechter
We introduce a new decomposition method for bounding the maximum expected utility of influence diagrams. While most current schemes use reductions to the Marginal Map task over a Bayesian Network, our approach is direct, aiming to avoid the large explosion in the model size that often results by such reductions. In this paper, we extend to influence diagrams the principles of decomposition methods that were applied earlier to probabilistic inference, utilizing an algebraic framework called valuation algebra which effectively captures both multiplicative and additive local structures present in influence diagrams.
Empirical evaluation on four benchmarks demonstrates the effectiveness of our approach compared to reduction-based approaches and illustrates significant improvements in the upper bounds on maximum expected utility.
Causal Discovery with Linear Non-Gaussian Models under Measurement Error: Structural Identifiability Results
Kun Zhang, Mingming Gong, Joseph Ramsey, Kayhan Batmanghelich, Peter Spirtes, Clark Glymour
Causal discovery methods aim to recover the causal process that generated purely observational data. Despite its successes on a number of real problems, the presence of measurement error in the observed data can produce serious mistakes in the output of various causal discovery methods. Given the ubiquity of measurement error caused by instruments or proxies used in the measuring process, this problem is one of the main obstacles to reliable causal discovery. It is still unknown to what extent the causal structure of relevant variables can be identified in principle. This study aims to take a step towards filling that void. We assume that the underlining process or the measurement-error free variables follows a linear, non-Guassian causal model, and show that the so-called ordered group decomposition of the causal model, which contains major causal information, is identifiable. The causal structure identifiability is further improved with different types of sparsity constraints on the causal structure. Finally, we give rather mild conditions under which the whole causal structure is fully identifiable.