Kianté Brantley
Kianté Brantley is a postdoctoral scholar at Cornell, working with Thorsten Joachims. He completed his Ph.D. in computer science at the University of Maryland, College Park (UMD), advised by Professor Hal Daumé III. Brantley designs algorithms that efficiently integrate domain knowledge into sequential decision-making problems. He is most excited about imitation learning and interactive learning, and more broadly about settings that involve a feedback loop between a machine learning agent and the input it sees. Before coming to UMD in 2016, Brantley attended the University of Maryland, Baltimore County (UMBC), where he earned his bachelor's and master's degrees in computer science, with his master's work advised by Tim Oates. He also worked as a data scientist for the U.S. Department of Defense from 2010 to 2017. In his free time, Brantley enjoys playing sports; his favorite sport at the moment is powerlifting. Brantley is a member of the UMD CLIP lab, the UMBC CORAL lab, and the NYU CILVR lab.
I am on the job market looking for an academic or industry position!
Email / CV / Google Scholar / Semantic Scholar / Github / Twitter
Research
I'm interested in designing algorithms that efficiently integrate domain knowledge into sequential decision-making problems (e.g., reinforcement learning, imitation learning, and structured prediction for natural language processing).
Publications
Ranking with Long-Term Constraints
Kianté Brantley,
Zhichong Fang,
Sarah Dean,
Thorsten Joachims
arXiv, 2023
[abstract]
The feedback that users provide through their choices (e.g., clicks, purchases) is one of the most common types of data readily available for training search and recommendation algorithms. However, myopically training systems based on choice data may only improve short-term engagement, but not the long-term sustainability of the platform and the long-term benefits to its users, content providers, and other stakeholders. In this paper, we thus develop a new framework in which decision makers (e.g., platform operators, regulators, users) can express long-term goals for the behavior of the platform (e.g., fairness, revenue distribution, legal requirements). These goals take the form of exposure or impact targets that go well beyond individual sessions, and we provide new control-based algorithms to achieve these goals. In particular, the controllers are designed to achieve the stated long-term goals with minimum impact on short-term engagement. Beyond the principled theoretical derivation of the controllers, we evaluate the algorithms on both synthetic and real-world data. While all controllers perform well, we find that they provide interesting trade-offs in efficiency, robustness, and the ability to plan ahead.
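To make "control-based" concrete, here is a toy proportional controller in the same spirit: items that lag their long-term exposure targets get their ranking scores boosted in proportion to the deficit. This is our own illustration under assumed names (controlled_scores, the gain k), not the controllers derived in the paper.

```python
import numpy as np

def controlled_scores(relevance, target_exposure, achieved_exposure, k=0.5):
    """Boost items in proportion to how far they lag their exposure targets,
    trading a little short-term relevance for long-term goals."""
    deficit = np.maximum(target_exposure - achieved_exposure, 0.0)
    return relevance + k * deficit

relevance = np.array([0.9, 0.8, 0.3])
target = np.array([10.0, 10.0, 10.0])      # equal long-term exposure targets
achieved = np.array([12.0, 9.0, 4.0])      # the third item is far behind
ranking = np.argsort(-controlled_scores(relevance, target, achieved, k=0.2))
print(ranking)   # the lagging item moves up despite lower relevance
```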
Learning to Generate Better Than Your LLM
Jonathan Chang*,
Kianté Brantley*,
Rajkumar Ramamurthy,
Dipendra Misra,
Wen Sun
arXiv, 2023
[abstract]
Reinforcement learning (RL) has emerged as a powerful paradigm for fine-tuning Large Language Models (LLMs) for conditional text generation. In particular, recent LLMs such as ChatGPT and GPT-4 can engage in fluent conversations with users by incorporating RL and feedback from humans. Inspired by learning-to-search algorithms and capitalizing on key properties of text generation, we seek to investigate reinforcement learning algorithms beyond general-purpose algorithms such as Proximal Policy Optimization (PPO). In particular, we extend RL algorithms to allow them to interact with a dynamic black-box guide LLM such as GPT-3 and propose RL with guided feedback (RLGF), a suite of RL algorithms for LLM fine-tuning. We experiment on the IMDB positive review and CommonGen text generation tasks from the GRUE benchmark. We show that our RL algorithms achieve higher performance than supervised learning (SL) and default PPO baselines, demonstrating the benefit of interaction with the guide LLM. On CommonGen, we not only outperform our SL baselines but also improve beyond PPO across a variety of lexical and semantic metrics beyond the one we optimized for. Notably, on the IMDB dataset, we show that our GPT-2 based policy outperforms the zero-shot GPT-3 oracle, indicating that our algorithms can learn from a powerful, black-box GPT-3 oracle with a simpler, cheaper, and publicly available GPT-2 model while gaining performance.
Interactive text generation
Felix Faltings,
Michel Galley,
Baolin Peng,
Kianté Brantley,
Weixin Cai,
Yizhe Zhang,
Jianfeng Gao,
Bill Dolan
arXiv, 2023
[abstract]
Users interact with text, image, code, or other editors on a daily basis. However, machine learning models are rarely trained in the settings that reflect the interactivity between users and their editor. This is understandable as training AI models with real users is not only slow and costly, but what these models learn may be specific to user interface design choices. Unfortunately, this means most of the research on text, code, and image generation has focused on non-interactive settings, whereby the model is expected to get everything right without accounting for any input from a user who may be willing to help. We introduce a new Interactive Text Generation task that allows training generation models interactively without the costs of involving real users, by using user simulators that provide edits that guide the model towards a given target text. We train our interactive models using Imitation Learning, and our experiments against competitive non-interactive generation models show that models trained interactively are superior to their non-interactive counterparts, even when all models are given the same budget of user inputs or edits.
Is reinforcement learning (not) for natural language processing: Benchmarks, baselines, and building blocks for natural language policy optimization
Rajkumar Ramamurthy*,
Prithviraj Ammanabrolu*,
Kianté Brantley,
Jack Hessel,
Rafet Sifa,
Christian Bauckhage,
Hannaneh Hajishirzi,
Yejin Choi
International Conference on Learning Representations, 2023 (Spotlight)
[abstract]
[code]
[blog]
We tackle the problem of aligning pre-trained large language models (LMs) with human preferences. If we view text generation as a sequential decision-making problem, reinforcement learning (RL) appears to be a natural conceptual framework. However, using RL for LM-based generation faces empirical challenges, including training instability due to the combinatorial action space, as well as a lack of open-source libraries and benchmarks customized for LM alignment. Thus, a question arises in the research community: is RL a practical paradigm for NLP?
To help answer this, we first introduce an open-source modular library, RL4LMs, for optimizing language generators with RL. The library consists of on-policy RL algorithms that can be used to train any encoder or encoder-decoder LM in the HuggingFace library (Wolf et al., 2020) with an arbitrary reward function. Next, we present the GRUE (General Reinforced-language Understanding Evaluation) benchmark, a set of 6 language generation tasks which are supervised not by target strings, but by reward functions which capture automated measures of human preference. GRUE is the first leaderboard-style evaluation of RL algorithms for NLP tasks. Finally, we introduce an easy-to-use, performant RL algorithm, NLPO (Natural Language Policy Optimization) that learns to effectively reduce the combinatorial action space in language generation. We show 1) that RL techniques are generally better than supervised methods at aligning LMs to human preferences; and 2) that NLPO exhibits greater stability and performance than previous policy gradient methods (e.g., PPO (Schulman et al., 2017)), based on both automatic and human evaluations.
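For readers unfamiliar with action-space reduction in token-level RL, the sketch below shows top-p (nucleus) masking of a next-token distribution, one common way to shrink a vocabulary-sized action space. It is an illustration in that spirit, not a reimplementation of NLPO; the function name and toy vocabulary are ours.

```python
import numpy as np

def top_p_mask(probs, p=0.9):
    """Zero out all but the smallest set of tokens whose probability mass
    reaches p, then renormalize. `probs` is a 1-D next-token distribution."""
    order = np.argsort(probs)[::-1]                      # tokens, most probable first
    cumulative = np.cumsum(probs[order])
    keep = order[: int(np.searchsorted(cumulative, p)) + 1]  # smallest prefix covering p
    masked = np.zeros_like(probs)
    masked[keep] = probs[keep]
    return masked / masked.sum()

# Toy 6-token vocabulary: after masking, only the head of the distribution remains.
dist = np.array([0.40, 0.25, 0.15, 0.10, 0.06, 0.04])
print(top_p_mask(dist, p=0.8))
```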
lilGym: Natural Language Visual Reasoning with Reinforcement Learning
Anne Wu,
Kianté Brantley,
Noriyuki Kojima,
Yoav Artzi
Association for Computational Linguistics, 2022
[abstract]
[code]
[demo]
We present lilGym, a new benchmark for language-conditioned reinforcement learning in visual environments. lilGym is based on 2,661 highly-compositional human-written natural language statements grounded in an interactive visual environment. We introduce a new approach for exact reward computation in every possible world state by annotating all statements with executable Python programs. Each statement is paired with multiple start states and reward functions to form thousands of distinct Markov Decision Processes of varying difficulty. We experiment with lilGym with different models and learning regimes. Our results and analysis show that while existing methods are able to achieve non-trivial performance, lilGym forms a challenging open problem.
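The exact-reward idea is easy to picture: each statement carries an executable predicate over the world state, so the environment can score any state precisely. The toy structures below are hypothetical stand-ins, not lilGym's actual annotation format or API.

```python
# Hypothetical illustration of exact reward computation: a natural-language
# statement is annotated with an executable predicate over the world state.
statement = "there is at least one black circle"
program = lambda state: any(obj["color"] == "black" and obj["shape"] == "circle"
                            for obj in state["objects"])

def reward(state, stop_action_taken, target=True):
    """+1 if the agent stops in a state where the statement's truth value
    matches the target, otherwise 0 (a real environment may also penalize)."""
    if not stop_action_taken:
        return 0.0
    return 1.0 if program(state) == target else 0.0

state = {"objects": [{"color": "black", "shape": "circle"},
                     {"color": "blue", "shape": "square"}]}
print(reward(state, stop_action_taken=True))  # 1.0
```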
Proceedings of the First Workshop on Interactive Learning for Natural Language Processing
Kianté Brantley,
Soham Dan,
Iryna Gurevych,
Ji-Ung Lee,
Filip Radlinski,
Hinrich Schütze,
Edwin Simpson,
Lili Yu
Association for Computational Linguistics, 2021
[abstract]
Motivation: A key aspect of human learning is the ability to learn continuously from various sources of feedback. In contrast, much of the recent success of deep learning for NLP relies on large datasets and extensive compute resources to train and fine-tune models, which then remain fixed. This leaves a research gap for systems that adapt to the changing needs of individual users or allow users to continually correct errors as they emerge. Learning from user interaction is crucial for tasks that require a high degree of personalization and for rapidly changing or complex, multi-step tasks where collecting and annotating large datasets is not feasible, but an informed user can provide guidance.
What is interactive NLP? Interactive learning for NLP means training, fine-tuning, or otherwise adapting an NLP model to inputs from a human user or teacher. Relevant approaches range from active learning with a human in the loop, to training with implicit user feedback (e.g., clicks), to dialogue systems that adapt to user utterances, to training with new forms of human input. Interactive learning is the converse of learning from datasets collected offline with no human input during the training process.
Successor Feature Sets: Generalizing Successor Representations Across Policies
Kianté Brantley,
Soroush Mehri,
Geoffrey J. Gordon
AAAI Conference on Artificial Intelligence (AAAI), 2021
[abstract]
[poster]
[slides]
Successor-style representations have many advantages for reinforcement learning: for example, they can help an agent generalize from past experience to new goals, and they have been proposed as explanations of behavioral and neural data from human and animal learners. They also form a natural bridge between model-based and model-free RL methods: like the former they make predictions about future experiences, and like the latter they allow efficient prediction of total discounted rewards. However, successor-style representations are not optimized to generalize across policies: typically, we maintain a limited-length list of policies, and share information among them by representation learning or GPI. Successor-style representations also typically make no provision for gathering information or reasoning about latent variables. To address these limitations, we bring together ideas from predictive state representations, belief space value iteration, and convex analysis: we develop a new, general successor-style representation, together with a Bellman equation that connects multiple sources of information within this representation, including different latent states, observations, policies, and reward functions. The new representation is highly expressive: for example, it lets us efficiently read off an optimal policy for a new reward function, or a policy that imitates a demonstration. For this paper, we focus on exact computation of the new representation in small, known environments, since even this restricted setting offers plenty of interesting questions. Our implementation does not scale to large, unknown environments, nor would we expect it to, since it generalizes POMDP value iteration, which is difficult to scale. However, we believe that future work will allow us to extend our ideas to approximate reasoning in large, unknown environments. We conduct experiments to explore which of the potential barriers to scaling are most pressing.
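As background for readers new to successor-style representations, here is the standard tabular successor representation that this line of work generalizes: a minimal numpy sketch with a made-up three-state Markov chain, not the paper's successor feature sets.

```python
import numpy as np

def successor_representation(P_pi, gamma=0.95):
    """M = sum_t gamma^t P_pi^t, the expected discounted state occupancies
    under a fixed policy; it satisfies the Bellman identity M = I + gamma * P_pi @ M."""
    n = P_pi.shape[0]
    return np.linalg.inv(np.eye(n) - gamma * P_pi)

# Toy 3-state Markov chain induced by some fixed policy.
P_pi = np.array([[0.1, 0.9, 0.0],
                 [0.0, 0.1, 0.9],
                 [0.9, 0.0, 0.1]])
M = successor_representation(P_pi)
r = np.array([0.0, 0.0, 1.0])      # reward only in the last state
print(M @ r)                        # state values V^pi recovered from M alone
```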
Constrained episodic reinforcement learning in concave-convex and knapsack settings
Kianté Brantley,
Miroslav Dudik,
Thodoris Lykouris,
Sobhan Miryoosefi,
Max Simchowitz,
Aleksandrs Slivkins,
Wen Sun
Conference on Neural Information Processing Systems (NeurIPS), 2020
[abstract]
[code]
[poster]
We propose an algorithm for tabular episodic reinforcement learning with constraints. We provide a modular analysis with strong theoretical guarantees for settings with concave rewards and convex constraints, and for settings with hard constraints (knapsacks). Most of the previous work in constrained reinforcement learning is limited to linear constraints, and the remaining work focuses on either the feasibility question or settings with a single episode. Our experiments demonstrate that the proposed algorithm significantly outperforms these approaches in existing constrained episodic environments.
Active Imitation Learning with Noisy Guidance
Kianté Brantley,
Amr Sharaf,
Hal Daumé III
Association for Computational Linguistics (ACL), 2020
[abstract]
[code]
[poster]
[slides]
[video]
Imitation learning algorithms provide state-of-the-art results on many structured prediction tasks by learning near-optimal search policies. Such algorithms assume training-time access to an expert that can provide the optimal action at any queried state; unfortunately, the number of such queries is often prohibitive, frequently rendering these approaches impractical. To combat this query complexity, we consider an active learning setting in which the learning algorithm has additional access to a much cheaper noisy heuristic that provides noisy guidance. Our algorithm, LEAQI, learns a difference classifier that predicts when the expert is likely to disagree with the heuristic, and queries the expert only when necessary. We apply LEAQI to three sequence labeling tasks, demonstrating significantly fewer queries to the expert and comparable (or better) accuracies over a passive approach.
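A rough sketch of the query rule described above: follow the cheap heuristic unless a learned difference classifier expects the expert to disagree, and only then pay for an expert query. Everything here (the toy classifier, the threshold, the stub expert and heuristic) is a hypothetical placeholder, and the sketch omits how LEAQI actually trains the difference classifier and its theoretical guarantees.

```python
class DifferenceClassifier:
    """Toy stand-in: predicts the disagreement probability with a running average."""
    def __init__(self):
        self.disagreements, self.total = 1.0, 2.0   # optimistic prior
    def predict_proba(self, state):
        return self.disagreements / self.total
    def update(self, state, disagreed):
        self.disagreements += disagreed
        self.total += 1

def annotate(states, heuristic, expert, clf, threshold=0.3):
    """Query the costly expert only where the classifier expects disagreement."""
    labels, queries = [], 0
    for s in states:
        a_h = heuristic(s)
        if clf.predict_proba(s) >= threshold:
            a = expert(s)                        # predicted disagreement: pay for a query
            queries += 1
            clf.update(s, int(a != a_h))
        else:
            a = a_h                              # accept the free, noisy heuristic label
        labels.append((s, a))
    return labels, queries

# Toy labeling run where the heuristic agrees with the expert most of the time.
states = list(range(20))
labels, queries = annotate(states, heuristic=lambda s: s % 2,
                           expert=lambda s: 0 if s == 7 else s % 2,
                           clf=DifferenceClassifier())
print(queries, "expert queries out of", len(states), "states")
```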
Disagreement-Regularized Imitation Learning
Kianté Brantley,
Wen Sun,
Mikael Henaff
International Conference on Learning Representations (ICLR), 2020 (Spotlight)
[abstract]
[code]
[poster]
[slides]
[video]
We present a simple and effective algorithm designed to address the covariate shift problem in imitation learning. It operates by training an ensemble of policies on the expert demonstration data, and using the variance of their predictions as a cost which is minimized with RL together with a supervised behavioral cloning cost. Unlike adversarial imitation methods, it uses a fixed reward function which is easy to optimize. We prove a regret bound for the algorithm which is linear in the time horizon multiplied by a coefficient which we show to be low for certain problems on which behavioral cloning fails. We evaluate our algorithm empirically across multiple pixel-based Atari environments and continuous control tasks, and show that it matches or significantly outperforms behavioral cloning and generative adversarial imitation learning.
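The disagreement cost itself is simple to write down: train several behavioral-cloning policies on the same demonstrations and penalize state-action pairs where their predictions spread out. Below is a minimal numpy illustration of that cost only; it is not the paper's full training loop, which also post-processes the cost and optimizes it with RL alongside a behavioral-cloning loss.

```python
import numpy as np

def disagreement_cost(ensemble_probs, action):
    """Variance, across ensemble members, of the probability each member assigns
    to `action` in the current state: low on demonstrated state-action pairs,
    high where the ensemble disagrees (i.e., under covariate shift)."""
    p = np.array([probs[action] for probs in ensemble_probs])
    return p.var()

# Three behavioral-cloning policies' action distributions at one state.
ensemble_probs = [np.array([0.70, 0.20, 0.10]),
                  np.array([0.65, 0.25, 0.10]),
                  np.array([0.10, 0.20, 0.70])]   # one member disagrees sharply
print(disagreement_cost(ensemble_probs, action=0))  # high cost: stay near the data
```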
Non-monotonic sequential text generation
Sean Welleck,
Kianté Brantley,
Hal Daumé III,
Kyunghyun Cho
International Conference on Machine Learning (ICML), 2019
[abstract]
[code]
[poster]
[slides]
[video]
Standard sequential generation methods assume a pre-specified generation order, such as text generation methods which generate words from left to right. In this work, we propose a framework for training models of text generation that operate in non-monotonic orders; the model directly learns good orders, without any additional annotation. Our framework operates by generating a word at an arbitrary position, and then recursively generating words to its left and then words to its right, yielding a binary tree. Learning is framed as imitation learning, including a coaching method which moves from imitating an oracle to reinforcing the policy's own preferences. Experimental results demonstrate that using the proposed method, it is possible to learn policies which generate text without pre-specifying a generation order, while achieving competitive performance with conventional left-to-right generation.
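The binary-tree view is concrete: each node holds a word, its left subtree holds everything generated to its left, and an in-order traversal recovers the sentence. A small hypothetical sketch (the tree and Node class are ours, not the paper's code):

```python
class Node:
    def __init__(self, word, left=None, right=None):
        self.word, self.left, self.right = word, left, right

def flatten(node):
    """In-order traversal: left subtree, then the node's word, then right subtree."""
    if node is None:
        return []
    return flatten(node.left) + [node.word] + flatten(node.right)

# The model might first emit "sat", then fill in words to its left and right.
tree = Node("sat",
            left=Node("cat", left=Node("the")),
            right=Node("down"))
print(" ".join(flatten(tree)))   # "the cat sat down"
```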
Reinforcement Learning with Convex Constraints
Sobhan Miryoosefi*,
Kianté Brantley*,
Hal Daumé III,
Miro Dudik,
Robert Schapire
Conference on Neural Information Processing Systems (NeurIPS), 2019
[abstract]
[code]
[poster]
[slides]
In standard reinforcement learning (RL), a learning agent seeks to optimize the overall reward. However, many key aspects of a desired behavior are more naturally expressed as constraints. For instance, the designer may want to limit the use of unsafe actions, increase the diversity of trajectories to enable exploration, or approximate expert trajectories when rewards are sparse. In this paper, we propose an algorithmic scheme that can handle a wide class of constraints in RL tasks, specifically, any constraints that require expected values of some vector measurements (such as the use of an action) to lie in a convex set. This captures previously studied constraints (such as safety and proximity to an expert), but also enables new classes of constraints (such as diversity). Our approach comes with rigorous theoretical guarantees and only relies on the ability to approximately solve standard RL tasks. As a result, it can be easily adapted to work with any model-free or model-based RL algorithm. In our experiments, we show that it matches previous algorithms that enforce safety via constraints, but can also enforce new properties that these algorithms cannot incorporate, such as diversity.
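To make the constraint format concrete: the requirement is that the expected vector of measurements lie in a convex set. The toy check below uses an axis-aligned box as the convex set and reports the Euclidean distance as the violation; it is purely illustrative and does not implement the paper's algorithm, which (per the abstract) only needs an approximate standard-RL solver as a subroutine.

```python
import numpy as np

def violation(expected_measurements, lower, upper):
    """Euclidean distance from the expected measurement vector to an
    axis-aligned box, a simple convex set; zero means the constraint holds."""
    clipped = np.clip(expected_measurements, lower, upper)
    return np.linalg.norm(expected_measurements - clipped)

z = np.array([0.30, 0.05])     # e.g., (unsafe-action rate, trajectory diversity)
print(violation(z, lower=np.array([0.0, 0.1]), upper=np.array([0.2, 1.0])))
```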
The UMD Neural Machine Translation Systems at WMT17 Bandit Learning Task
Amr Sharaf,
Shi Feng,
Khanh Nguyen,
Kianté Brantley,
Hal Daumé III
Second Conference on Machine Translation, 2017
[abstract]
[poster]
We describe the University of Maryland machine translation systems submitted to the WMT17 German-English Bandit Learning Task. The task is to adapt a translation system to a new domain, using only bandit feedback: the system receives a German sentence to translate, produces an English sentence, and only gets a scalar score as feedback. Targeting these two challenges (adaptation and bandit learning), we built a standard neural machine translation system and extended it in two ways: (1) robust reinforcement learning techniques to learn effectively from the bandit feedback, and (2) domain adaptation using data selection from a large corpus of parallel data.
BCAP: An Artificial Neural Network Pruning Technique to Reduce Overfitting
Kianté Brantley
Master's Thesis, University of Maryland, Baltimore County, 2016
[abstract]
[slides]
Determining the optimal size of a neural network is complicated. Neural networks, with many free parameters, can be used to solve very complex problems; however, such networks are susceptible to overfitting. BCAP (Brantley-Clark Artificial Neural Network Pruning Technique) addresses overfitting by combining duplicate neurons in a neural network's hidden layer, thereby forcing the network to learn more distinct features. We compare hidden units using cosine similarity and combine those that are similar to each other within a threshold ϵ. Doing so reduces the co-adaptation of neurons in the network, because hidden units that are highly correlated (i.e., similar) are combined. In this paper, we show evidence that BCAP succeeds in reducing network size while maintaining, or even improving, the accuracy of neural networks during and after training.
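A minimal sketch of that merging criterion: compare hidden units by the cosine similarity of their incoming weight vectors and greedily combine pairs above a threshold ε. The merge rule used here (averaging incoming weights, summing outgoing weights) is one plausible choice for illustration; the thesis defines the exact procedure.

```python
import numpy as np

def merge_similar_units(W_in, W_out, eps=0.95):
    """Greedily merge hidden units whose incoming weight vectors have cosine
    similarity above eps. W_in: (hidden, inputs), W_out: (outputs, hidden)."""
    keep = list(range(W_in.shape[0]))
    normed = W_in / np.linalg.norm(W_in, axis=1, keepdims=True)
    i = 0
    while i < len(keep):
        j = i + 1
        while j < len(keep):
            a, b = keep[i], keep[j]
            if normed[a] @ normed[b] > eps:
                W_in[a] = (W_in[a] + W_in[b]) / 2          # combine duplicate units
                W_out[:, a] = W_out[:, a] + W_out[:, b]    # preserve total contribution
                keep.pop(j)
            else:
                j += 1
        i += 1
    return W_in[keep], W_out[:, keep]

rng = np.random.default_rng(0)
W_in = rng.normal(size=(6, 4)); W_in[3] = 1.01 * W_in[1]   # plant a near-duplicate unit
W_out = rng.normal(size=(3, 6))
W_in2, W_out2 = merge_similar_units(W_in, W_out)
print(W_in2.shape, W_out2.shape)   # one fewer hidden unit after merging
```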
LDAExplore: Visualizing Topic Models Generated Using Latent Dirichlet Allocation
Ashwinkumar Ganesan,
Kianté Brantley,
Shimei Pan,
Jian Chen
TextVis Workshop - Intelligent User Interfaces (IUI), 2015
[abstract]
[code]
[slides]
We present LDAExplore, a tool to visualize topic distributions in a given document corpus that are generated using topic modeling methods. Latent Dirichlet Allocation (LDA) is one of the basic methods predominantly used to generate topics. One of the problems with methods like LDA is that users who apply them may not understand the topics that are generated. Also, users may find it difficult to search for correlated topics and correlated documents. LDAExplore tries to alleviate these problems by visualizing topic and word distributions generated from the document corpus and allowing the user to interact with them. The system is designed for users who have minimal knowledge of LDA or topic modeling methods. To evaluate our design, we ran a pilot study using the abstracts of 322 Information Visualization papers, where every abstract is considered a document. The topics generated were then explored by users. The results show that users are able to find correlated documents and group them based on topics that are similar.
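The input to such a visualization is just a document-topic matrix; below is a minimal sketch of producing one with scikit-learn (an illustrative pipeline with a tiny made-up corpus, not LDAExplore's own code, which the paper does not tie to a particular library):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

abstracts = ["interactive visualization of topic models",
             "latent dirichlet allocation for document corpora",
             "user study of visual analytics interfaces"]

X = CountVectorizer(stop_words="english").fit_transform(abstracts)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(X)      # rows: documents, columns: topic proportions
print(doc_topics.round(2))             # the distributions a tool like LDAExplore visualizes
```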