Papers

Working Papers

Optimal Strategies in Ranked-Choice Voting [abstract | arxiv]
Sanyukta Deshpande, Nikhil Garg, and Sheldon Jacobson

Ranked Choice Voting (RCV) and Single Transferable Voting (STV) are widely valued but complex to understand due to intricate per-round vote transfers. Questions like determining how far a candidate is from winning or identifying effective election strategies are computationally challenging, as minor changes in voter rankings can lead to significant ripple effects - for example, lending support to a losing candidate can prevent their votes from transferring to a more competitive opponent. We study optimal strategies - persuading voters to change their ballots or adding new voters - both algorithmically and theoretically. Algorithmically, we develop efficient methods to reduce election instances while maintaining optimization accuracy, effectively circumventing the computational complexity barrier. Theoretically, we analyze the effectiveness of strategies under both perfect and imperfect polling information. Applying our algorithmic approach to ranked-choice polling data on the US 2024 Republican Primary, we find, for example, that several candidates would have been optimally served by boosting another candidate instead of themselves.
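The per-round transfers mentioned in the abstract can be illustrated with a minimal single-winner instant-runoff tally. This is only a sketch, not the paper's optimization method: the function name `instant_runoff`, the alphabetical tie-break, and the toy ballots are all assumptions made for illustration.

```python
from collections import Counter

def instant_runoff(ballots):
    """Return (winner, rounds) for single-winner ranked-choice voting.

    ballots: non-empty list of rankings, most preferred candidate first.
    Ties are broken alphabetically (an arbitrary assumption of this sketch).
    """
    remaining = {c for b in ballots for c in b}
    rounds = []
    while True:
        # Each ballot counts for its highest-ranked candidate still in the race.
        tally = Counter({c: 0 for c in remaining})
        for b in ballots:
            top = next((c for c in b if c in remaining), None)
            if top is not None:
                tally[top] += 1
        rounds.append(dict(tally))
        active = sum(tally.values())
        leader = max(sorted(remaining), key=lambda c: tally[c])
        # Stop on a strict majority of active ballots, or when one candidate is left.
        if tally[leader] * 2 > active or len(remaining) == 1:
            return leader, rounds
        # Otherwise eliminate the candidate with the fewest votes and transfer.
        loser = min(sorted(remaining), key=lambda c: tally[c])
        remaining.remove(loser)

# Toy election (made-up ballots): C's voters prefer B next, B's voters prefer A next.
base = [["A"]] * 6 + [["B", "A"]] * 5 + [["C", "B"]] * 4
print(instant_runoff(base)[0])                      # B
print(instant_runoff(base + [["A"]] * 2)[0])        # B
print(instant_runoff(base + [["C", "A"]] * 2)[0])   # A
```

In this toy instance, two extra voters who rank A first do not change the outcome, while two extra voters who rank C first cause B to be eliminated before C, so B's votes transfer to A.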
Bayesian Modeling of Zero-Shot Classifications for Urban Flood Detection [abstract | arxiv]
Matt Franchi, Nikhil Garg, Wendy Ju, and Emma Pierson

Street scene datasets, collected from Street View or dashboard cameras, offer a promising means of detecting urban objects and incidents like street flooding. However, a major challenge in using these datasets is their lack of reliable labels: there are myriad types of incidents, many types occur rarely, and ground-truth measures of where incidents occur are lacking. Here, we propose BayFlood, a two-stage approach which circumvents this difficulty. First, we perform zero-shot classification of where incidents occur using a pretrained vision-language model (VLM). Second, we fit a spatial Bayesian model on the VLM classifications. The zero-shot approach avoids the need to annotate large training sets, and the Bayesian model provides frequent desiderata in urban settings - principled measures of uncertainty, smoothing across locations, and incorporation of external data like stormwater accumulation zones. We comprehensively validate this two-stage approach, showing that VLMs provide strong zero-shot signal for floods across multiple cities and time periods, the Bayesian model improves out-of-sample prediction relative to baseline methods, and our inferred flood risk correlates with known external predictors of risk. Having validated our approach, we show it can be used to improve urban flood detection: our analysis reveals 113,738 people who are at high risk of flooding overlooked by current methods, identifies demographic biases in existing methods, and suggests locations for new flood sensors. More broadly, our results showcase how Bayesian modeling of zero-shot LM annotations represents a promising paradigm because it avoids the need to collect large labeled datasets and leverages the power of foundation models while providing the expressiveness and uncertainty quantification of Bayesian models.
Inferring fine-grained migration patterns across the United States [abstract | arxiv | data]
Gabriel Agostini, Rachel Young, Maria Fitzpatrick, Nikhil Garg, and Emma Pierson

Fine-grained migration data illuminate important demographic, environmental, and health phenomena. However, migration datasets within the United States remain lacking: publicly available Census data are neither spatially nor temporally granular, and proprietary data have higher resolution but demographic and other biases. To address these limitations, we develop a scalable iterative-proportional-fitting based method which reconciles high-resolution but biased proprietary data with low-resolution but more reliable Census data. We apply this method to produce MIGRATE, a dataset of annual migration matrices from 2010 - 2019 which captures flows between 47.4 billion pairs of Census Block Groups – about four thousand times more granular than publicly available data. These estimates are highly correlated with external ground-truth datasets, and improve accuracy and reduce bias relative to raw proprietary data. We publicly release MIGRATE estimates and provide a case study illustrating how they reveal granular patterns of migration in response to California wildfires.
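For readers unfamiliar with iterative proportional fitting, the textbook two-dimensional version can be sketched as follows. This is not the MIGRATE method itself, which the abstract describes only as IPF-based; the function name `ipf`, the seed matrix, and the target totals are made-up illustrations in which a biased fine-grained seed is rescaled to match more reliable marginal totals.

```python
def ipf(seed, row_targets, col_targets, iters=100, tol=1e-9):
    """Rescale a non-negative seed matrix so its row and column sums match the targets."""
    m = [row[:] for row in seed]
    for _ in range(iters):
        # Scale each row to hit its target total.
        for i, target in enumerate(row_targets):
            s = sum(m[i])
            if s > 0:
                m[i] = [x * target / s for x in m[i]]
        # Scale each column to hit its target total.
        for j, target in enumerate(col_targets):
            s = sum(row[j] for row in m)
            if s > 0:
                for row in m:
                    row[j] *= target / s
        # Stop once the row sums, disturbed by the column step, are back within tolerance.
        if all(abs(sum(m[i]) - t) <= tol for i, t in enumerate(row_targets)):
            break
    return m

# Made-up example: a 2x2 seed of flows, with marginal totals that both sum to 100.
fitted = ipf([[10.0, 20.0], [30.0, 40.0]], [40.0, 60.0], [35.0, 65.0])
for row in fitted:
    print([round(x, 3) for x in row])
```

The fitted matrix keeps the seed's relative structure (its cross-ratios) while matching the marginals, which is why the seed can be biased in level yet still informative about fine-grained patterns.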
Use Sparse Autoencoders to Discover Unknown Concepts, Not to Act on Known Concepts [abstract | arxiv]
Kenny Peng, Rajiv Movva, Jon Kleinberg, Emma Pierson, and Nikhil Garg

While sparse autoencoders (SAEs) have generated significant excitement, a series of negative results have added to skepticism about their usefulness. Here, we establish a conceptual distinction that reconciles competing narratives surrounding SAEs. We argue that while SAEs may be less effective for acting on known concepts, SAEs are powerful tools for discovering unknown concepts. This distinction cleanly separates existing negative and positive results, and suggests several classes of SAE applications. Specifically, we outline use cases for SAEs in (i) ML interpretability, explainability, fairness, auditing, and safety, and (ii) social and health sciences.
Optimizing Library Usage and Browser Experience: Application to the New York Public Library [abstract | arxiv]
Zhi Liu, Wenchang Zhu, Sarah Rankin, and Nikhil Garg

We tackle the challenge brought to urban library systems by the holds system – which allows users to request books available at other branches to be transferred for local pickup. The holds system increases usage of the entire collection, at the expense of an in-person browser’s experience at the source branch. We study the optimization of usage and browser experience, where the library has two levers: where a book should come from when a hold request is placed, and how many book copies at each branch should be available through the holds system versus reserved for browsers. We first show that the problem of maximizing usage can be viewed through the lens of revenue management, for which near-optimal fulfillment policies exist. We then develop a simulation framework that further optimizes for browser experience, through book reservations. We empirically apply our methods to data from the New York Public Library to design implementable policies. We find that though a substantial trade-off exists between these two desiderata, a balanced policy can improve browser experience over the historical policy without significantly sacrificing usage. Because browser usage is more prevalent among branches in low-income areas, this policy further increases system-wide equity: notably, for branches in the 25% lowest-income neighborhoods, it improves both usage and browser experience by about 15%.

Journal Articles

2018 Word Embeddings Quantify 100 Years of Gender and Ethnic Stereotypes [abstract | official link | code & data | talk]
Nikhil Garg, Londa Schiebinger, Dan Jurafsky, and James Zou
Proceedings of the National Academy of Sciences (PNAS)
Media: Stanford News (and EE department), Science Magazine, Smithsonian Magazine (in print), The World Economic Forum, Futurity, etc.

Word embeddings are a powerful machine-learning framework that represents each English word by a vector. The geometric relationship between these vectors captures meaningful semantic relationships between the corresponding words. In this paper, we develop a framework to demonstrate how the temporal dynamics of the embedding helps to quantify changes in stereotypes and attitudes toward women and ethnic minorities in the 20th and 21st centuries in the United States. We integrate word embeddings trained on 100 y of text data with the US Census to show that changes in the embedding track closely with demographic and occupation shifts over time. The embedding captures societal shifts - e.g., the women’s movement in the 1960s and Asian immigration into the United States - and also illuminates how specific adjectives and occupations became more closely associated with certain populations over time. Our framework for temporal analysis of word embedding opens up a fruitful intersection between machine learning and quantitative social science.
2019 Iterative Local Voting for Collective Decision-making in Continuous Spaces [abstract | demo | official link]
Nikhil Garg, Vijay Kamble, Ashish Goel, David Marn, and Kamesh Munagala
Journal of Artificial Intelligence Research (JAIR)
(Conference version published in WWW‘17.)

Many societal decision problems lie in high-dimensional continuous spaces not amenable to the voting techniques common for their discrete or single-dimensional counterparts. These problems are typically discretized before running an election or decided upon through negotiation by representatives. We propose an algorithm called Iterative Local Voting for collective decision-making in this setting. In this algorithm, voters are sequentially sampled and asked to modify a candidate solution within some local neighborhood of its current value, as defined by a ball in some chosen norm, with the size of the ball shrinking at a specified rate. We first prove the convergence of this algorithm under appropriate choices of neighborhoods to Pareto optimal solutions with desirable fairness properties in certain natural settings: when the voters’ utilities can be expressed in terms of some form of distance from their ideal solution, and when these utilities are additively decomposable across dimensions. In many of these cases, we obtain convergence to the societal welfare maximizing solution. We then describe an experiment in which we test our algorithm for the decision of the U.S. Federal Budget on Mechanical Turk with over 2,000 workers, employing neighborhoods defined by various L-Norm balls. We make several observations that inform future implementations of such a procedure.
We have a demo of our Mechanical Turk experiment available live here. It can be used as follows:
1. If the URL is entered without any parameters, it uses the current radius (based on previous uses of the demo, going down by $1/N$) and uses the $\mathcal{L}^2$ mechanism.
2. To set the mechanism, navigate to http://54.183.140.235/mechanism/[option]/, replacing [option] with one of l1, l2, linf, or full for the respective mechanism.
3. To set the radius, navigate to http://54.183.140.235/mechanism/[number]/, where any integer can be entered instead of [number]. This option resets the starting radius for the specific mechanism, which will go down by $1/N$ in subsequent accesses.
4. To set both the mechanism and the radius, navigate to http://54.183.140.235/radius/[number]/mechanism/[option]/, with the above options.
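The procedure described in the abstract, with an $\mathcal{L}^2$ ball whose radius shrinks as $1/N$, can be sketched roughly as below. This is a simulation under strong assumptions, not the deployed demo: each simulated voter is assumed to have an ideal point and to move the current solution as close to it as the ball allows, and the names `l2_step` and `iterative_local_voting` are hypothetical.

```python
import math
import random

def l2_step(current, ideal, radius):
    """Move from current toward ideal, by at most radius in Euclidean norm."""
    diff = [b - a for a, b in zip(current, ideal)]
    dist = math.sqrt(sum(d * d for d in diff))
    if dist <= radius:
        return list(ideal)
    return [a + radius * d / dist for a, d in zip(current, diff)]

def iterative_local_voting(ideals, start, r0=1.0, steps=2000, seed=0):
    """Sample voters one at a time; voter number t may move the solution within radius r0 / t."""
    rng = random.Random(seed)
    x = list(start)
    for t in range(1, steps + 1):
        voter_ideal = rng.choice(ideals)
        x = l2_step(x, voter_ideal, r0 / t)
    return x

# Made-up ideal points for four voter types in a two-dimensional decision space.
ideals = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(iterative_local_voting(ideals, [0.5, 0.5]))
```

Because every step moves along the segment toward some voter's ideal point, a solution that starts inside the convex hull of the ideal points stays inside it; the shrinking radius limits how much any single late voter can change the outcome.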
2020 Designing Informative Rating Systems: Evidence from an Online Labor Market [abstract | arxiv | talk | official link]
Nikhil Garg and Ramesh Johari
Manufacturing & Service Operations Management
Media: New York Times, Stanford Engineering magazine.
M&SOM student paper award (2nd place), 2020
(Conference version published in EC‘20.)

Platforms critically rely on rating systems to learn the quality of market participants. In practice, however, these ratings are often highly inflated, drastically reducing the signal available to distinguish quality. We consider two questions: First, can rating systems better discriminate quality by altering the meaning and relative importance of the levels in the rating system? And second, if so, how should the platform optimize these choices in the design of the rating system? We first analyze the results of a randomized controlled trial on an online labor market in which an additional question was added to the feedback form. Between treatment conditions, we vary the question phrasing and answer choices. We further run an experiment on Amazon Mechanical Turk with similar structure, to confirm the labor market findings. Our tests reveal that current inflationary norms can in fact be countered by re-anchoring the meaning of the levels of the rating system. In particular, scales that are positive-skewed and provide specific interpretations for what each label means yield rating distributions that are much more informative about quality. Second, we develop a theoretical framework to optimize the design of a rating system by choosing answer labels and their numeric interpretations in a manner that maximizes the rate of convergence to the true underlying quality distribution. Finally, we run simulations with an empirically calibrated model and use these to study the implications for optimal rating system design. Our simulations demonstrate that our modeling and optimization approach can substantially improve the quality of information obtained over baseline designs. Overall, our study illustrates that rating systems that are informative in practice can be designed, and demonstrates how to design them in a principled manner.
2020 Markets for Public Decision-making [abstract | arxiv | official link]
Nikhil Garg, Ashish Goel, and Ben Plaut
Social Choice and Welfare
(Conference version published in WINE‘18.)

A public decision-making problem consists of a set of issues, each with multiple possible alternatives, and a set of competing agents, each with a preferred alternative for each issue. We study adaptations of market economies to this setting, focusing on binary issues. Issues have prices, and each agent is endowed with artificial currency that she can use to purchase probability for her preferred alternatives (we allow randomized outcomes). We first show that when each issue has a single price that is common to all agents, market equilibria can be arbitrarily bad. This negative result motivates a different approach. We present a novel technique called "pairwise issue expansion", which transforms any public decision-making instance into an equivalent Fisher market, the simplest type of private goods market. This is done by expanding each issue into many goods: one for each pair of agents who disagree on that issue. We show that the equilibrium prices in the constructed Fisher market yield a "pairwise pricing equilibrium" in the original public decision-making problem which maximizes Nash welfare. More broadly, pairwise issue expansion uncovers a powerful connection between the public decision-making and private goods settings; this immediately yields several interesting results about public decisions markets, and furthers the hope that we will be able to find a simple iterative voting protocol that leads to near-optimum decisions.
2021 Driver Surge Pricing [abstract | ssrn | code & data | talk | official link]
Nikhil Garg and Hamid Nazerzadeh
Management Science
(Conference version published in EC‘20.)

Ride-hailing marketplaces like Uber and Lyft use dynamic pricing, often called surge, to balance the supply of available drivers with the demand for rides. We study pricing mechanisms for such marketplaces from the perspective of drivers, presenting the theoretical foundation that has informed the design of Uber’s new additive driver surge mechanism. We present a dynamic stochastic model to capture the impact of surge pricing on driver earnings and their strategies to maximize such earnings. In this setting, some time periods (surge) are more valuable than others (non-surge), and so trips of different time lengths vary in the opportunity cost they impose on drivers. First, we show that multiplicative surge, historically the standard on ride-hailing platforms, is not incentive compatible in a dynamic setting. We then propose a structured, incentive-compatible pricing mechanism. This closed-form mechanism has a simple form and is well-approximated by Uber’s new additive surge mechanism. Finally, through both numerical analysis and real data from a ride-hailing marketplace, we show that additive surge is more approximately incentive compatible in practice than multiplicative surge, providing more stable earnings to drivers.
2023 Quantifying Spatial Under-reporting Disparities in Resident Crowdsourcing [abstract | arxiv | official link | talk | code & data]
Zhi Liu, Uma Bhandaram, and Nikhil Garg
Nature Computational Science
Conference version published in ACM Conference on Economics and Computation (EC‘22), titled “Equity in Resident Crowdsourcing: Measuring Under-reporting without Ground Truth Data”
Media: Cornell News.

Modern city governance relies heavily on crowdsourcing to identify problems such as downed trees and power lines. A major concern is that residents do not report problems at the same rates, with heterogeneous reporting delays directly translating to downstream disparities in how quickly incidents can be addressed. Here we develop a method to identify reporting delays without using external ground-truth data. Our insight is that the rates at which duplicate reports are made about the same incident can be leveraged to disambiguate whether an incident has occurred by investigating its reporting rate once it has occurred. We apply our method to over 100,000 resident reports made in New York City and to over 900,000 reports made in Chicago, finding that there are substantial spatial and socioeconomic disparities in how quickly incidents are reported. We further validate our methods using external data and demonstrate how estimating reporting delays leads to practical insights and interventions for a more equitable, efficient government service.
2025 Addressing Discretization-Induced Bias in Demographic Prediction [abstract | arxiv | official link]
Evan Dong, Aaron Schein, Yixin Wang, and Nikhil Garg
PNAS Nexus
Conference version appeared in ACM Conference on Fairness, Accountability, and Transparency (ACM FAccT 2024).

Racial and other demographic imputation is necessary for many applications, especially in auditing disparities and outreach targeting in political campaigns. The canonical approach is to construct continuous predictions – e.g., based on name and geography – and then, often, to discretize the predictions by selecting the most likely class (argmax), potentially with a minimum threshold (thresholding). We study how this practice produces discretization bias. For example, we show that argmax labeling, as used by a prominent commercial voter file vendor to impute race/ethnicity, results in a substantial under-count of Black voters, e.g., by 28.2 percentage points in North Carolina. This bias can have substantial implications in downstream tasks that use such labels. We then introduce a joint optimization approach – and a tractable data-driven threshold heuristic – that can eliminate this bias, with negligible individual-level accuracy loss. Finally, we theoretically analyze discretization bias, show that calibrated continuous models are insufficient to eliminate it, and that an approach such as ours is necessary. Broadly, we warn researchers and practitioners against discretizing continuous demographic predictions without considering downstream consequences.
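A toy calculation shows how argmax labeling can under-count a group even when the continuous predictions are calibrated. The group labels `g1` and `g2`, the probabilities, and the function names are made up for illustration; this is not the paper's data or its joint optimization approach.

```python
def argmax_counts(probs):
    """Count people per group after assigning each person their most likely group."""
    counts = {}
    for p in probs:
        label = max(p, key=p.get)
        counts[label] = counts.get(label, 0) + 1
    return counts

def expected_counts(probs):
    """Sum the predicted probabilities per group (the expected count if calibrated)."""
    totals = {}
    for p in probs:
        for label, q in p.items():
            totals[label] = totals.get(label, 0.0) + q
    return totals

# Ten hypothetical people, each predicted to be in g1 with probability 0.4.
people = [{"g1": 0.4, "g2": 0.6} for _ in range(10)]
print(argmax_counts(people))    # every person is labeled g2, so g1 gets no count
print(expected_counts(people))  # about 4 expected members of g1
```

If the probabilities are calibrated, roughly four of these ten people belong to `g1`, yet argmax assigns none of them to it; the aggregate count is biased even though each individual label is the most likely one.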
2025 Faster Information for Effective Long-Term Discharge: A Field Study in Adult Foster Care [abstract | official link]
Vince Bartle, Ashley Shearer, Alexandra Wroe, Nicola Dell, and Nikhil Garg
Proceedings of the ACM on Human-Computer Interaction.
Journal Track for 28th ACM SIGCHI Conference on Computer-Supported Cooperative Work & Social Computing (CSCW‘25). Also appeared in EAAMO‘23

As the US population ages, a growing challenge is placing hospital patients who require long-term post-acute care into adult foster care facilities: small long-term nursing facilities that care for those unable to age in place because their care requirements exceed what can be delivered at home. A key challenge in patient placement is the dynamic matching process between hospital discharge coordinators looking to place patients and facilities looking for residents. We designed, built, deployed, and maintain a system to support decision making among a team of six discharge coordinators assisting in the discharge of 127 patients across 1,047 facilities in Hawai’i. Our system collects vacancy and capability data from facilities via conversational SMS and processes it to recommend facilities that discharge coordinators might contact. Findings from a 14-month deployment provide evidence for how timely, accurate information positively impacts matching efficacy. We close with lessons learned for information collection systems and provisioning platforms in similar contexts.
2025 Heterogeneous participation and allocation skews: when is choice “worth it”? [abstract | arxiv]
Nikhil Garg
ACM SIGecom Exchanges (Research Letter)
Invited; lightly reviewed by editors

A core ethos of the Economics and Computation (EconCS) community is that people have complex private preferences and information of which the central planner is unaware, but which an appropriately designed mechanism can uncover to improve collective decisionmaking. This ethos underlies the community’s largest deployed success stories, from stable matching systems to participatory budgeting. I ask: is this choice and information aggregation “worth it”? In particular, I discuss how such systems induce heterogeneous participation: those already relatively advantaged are, empirically, more able to pay time costs and navigate administrative burdens imposed by the mechanisms. I draw on three case studies, including my own work – complex democratic mechanisms, resident crowdsourcing, and school matching. I end with lessons for practice and research, challenging the community to help reduce participation heterogeneity and design and deploy mechanisms that meet a “best of both worlds” north star: use preferences and information from those who choose to participate, but provide a “sufficient” quality of service to those who do not.

Peer Reviewed Conference Proceedings (without journal versions)

2015 Impact of Dual Slope Path Loss on User Association in HetNets [abstract | official link]
Nikhil Garg, Sarabjot Singh, and Jeffrey Andrews
IEEE Globecom Workshop

Intelligent load balancing is essential to fully realize the benefits of dense heterogeneous networks. Current techniques have largely been studied with single slope path loss models, though multi-slope models are known to more closely match real deployments. This paper develops insight into the performance of biasing and uplink/downlink decoupling for user association in HetNets with dual slope path loss models. It is shown that dual slope path loss models change the tradeoffs inherent in biasing and reduce gains from both biasing and uplink/downlink decoupling. The results show that with the dual slope path loss models, the bias maximizing the median rate is not optimal for other users, e.g., edge users. Furthermore, optimal downlink biasing is shown to realize most of the gains from downlink-uplink decoupling. Moreover, the user association gains in dense networks are observed to be quite sensitive to the path loss exponent beyond the critical distance in a dual slope model.
2019 Designing Optimal Binary Rating Systems [abstract | official link]
Nikhil Garg and Ramesh Johari
International Conference on Artificial Intelligence and Statistics (AISTATS‘19)

Modern online platforms rely on effective rating systems to learn about items. We consider the optimal design of rating systems that collect binary feedback after transactions. We make three contributions. First, we formalize the performance of a rating system as the speed with which it recovers the true underlying ranking on items (in a large deviations sense), accounting for both items’ underlying match rates and the platform’s preferences. Second, we provide an efficient algorithm to compute the binary feedback system that yields the highest such performance. Finally, we show how this theoretical perspective can be used to empirically design an implementable, approximately optimal rating system, and validate our approach using real-world experimental data collected on Amazon Mechanical Turk.
2019 Analyzing Polarization in Social Media: Method and Application to Tweets on 21 Mass Shootings [abstract | arxiv | code & data | official link]
Dorottya Demszky, Nikhil Garg, Rob Voigt, James Zou, Jesse Shapiro, Matthew Gentzkow, and Dan Jurafsky
Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL‘19)
Media: Washington Post, Stanford News.

We provide an NLP framework to uncover four linguistic dimensions of political polarization in social media: topic choice, framing, affect and illocutionary force. We quantify these aspects with existing lexical methods, and propose clustering of tweet embeddings as a means to identify salient topics for analysis across events; human evaluations show that our approach generates more cohesive topics than traditional LDA-based models. We apply our methods to study 4.4M tweets on 21 mass shootings. We provide evidence that the discussion of these events is highly polarized politically and that this polarization is primarily driven by partisan differences in framing rather than topic choice. We identify framing devices, such as grounding and the contrasting use of the terms "terrorist" and "crazy", that contribute to polarization. Results pertaining to topic choice, affect and illocutionary force suggest that Republicans focus more on the shooter and event-specific facts (news) while Democrats focus more on the victims and call for policy changes. Our work contributes to a deeper understanding of the way group divisions manifest in language and to computational methods for studying them.
2019 Who is in Your Top Three? Optimizing Learning in Elections with Many Candidates [abstract | arxiv | official link]
Nikhil Garg, Lodewijk Gelauff, Sukolsak Sakshuwong, and Ashish Goel
AAAI Conference on Human Computation and Crowdsourcing (HCOMP‘19)

Elections and opinion polls often have many candidates, with the aim to either rank the candidates or identify a small set of winners according to voters’ preferences. In practice, voters do not provide a full ranking; instead, each voter provides their favorite K candidates, potentially in ranked order. The election organizer must choose K and an aggregation rule. We provide a theoretical framework to make these choices. Each K-Approval or K-partial ranking mechanism (with a corresponding positional scoring rule) induces a learning rate for the speed at which the election correctly recovers the asymptotic outcome. Given the voter choice distribution, the election planner can thus identify the rate optimal mechanism. Earlier work in this area provides coarse order-of-magnitude guarantees which are not sufficient to make such choices. Our framework further resolves questions of when randomizing between multiple mechanisms may improve learning, for arbitrary voter noise models. Finally, we use data from 5 large participatory budgeting elections that we organized across several US cities, along with other ranking data, to demonstrate the utility of our methods. In particular, we find that historically such elections have set K too low and that picking the right mechanism can be the difference between identifying the correct winner with only an 80% probability or a 99.9% probability after 500 voters.
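As a small illustration of why the choice of K matters, the sketch below tallies K-Approval scores, where each voter gives one point to each of their top K candidates. The ballots and the function name `k_approval` are made up; the paper's learning-rate analysis is not reproduced here.

```python
def k_approval(ballots, k):
    """Give one point to each of the first k candidates on every ballot."""
    scores = {}
    for ranking in ballots:
        for candidate in ranking[:k]:
            scores[candidate] = scores.get(candidate, 0) + 1
    return scores

# Seven hypothetical voters ranking three candidates.
ballots = [["A", "B", "C"]] * 3 + [["B", "C", "A"]] * 2 + [["C", "B", "A"]] * 2
print(k_approval(ballots, 1))  # {'A': 3, 'B': 2, 'C': 2}
print(k_approval(ballots, 2))  # {'A': 3, 'B': 7, 'C': 4}
```

With K = 1 candidate A leads, while with K = 2 candidate B leads, so the same voter preferences can yield different winners depending on how much of each ranking is collected.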
2020 Fair Allocation through Selective Information Acquisition [abstract | arxiv | official link]
William Cai, Johann Gaebler, Nikhil Garg, and Sharad Goel
AAAI/ACM Conference on Artificial Intelligence, Ethics, and Society (AIES‘20)

Public and private institutions must often allocate scarce resources under uncertainty. Banks, for example, extend credit to loan applicants based in part on their estimated likelihood of repaying a loan. But when the quality of information differs across candidates (e.g., if some applicants lack traditional credit histories), common lending strategies can lead to disparities across groups. Here we consider a setting in which decision makers – before allocating resources – can choose to spend some of their limited budget further screening select individuals. We present a computationally efficient algorithm for deciding whom to screen that maximizes a standard measure of social welfare. Intuitively, decision makers should screen candidates on the margin, for whom the additional information could plausibly alter the allocation. We formalize this idea by showing the problem can be reduced to solving a series of linear programs. Both on synthetic and real-world datasets, this strategy improves utility, illustrating the value of targeted information acquisition in such decisions. Further, when there is social value for distributing resources to groups for whom we have a priori poor information – like those without credit scores – our approach can substantially improve the allocation of limited assets.
2021 Dropping Standardized Testing for Admissions: Differential Variance and Access [abstract | arxiv]
Nikhil Garg, Hannah Li, and Faidra Monachou
ACM Conference on Fairness, Accountability, and Transparency (FAccT‘21)
Also appeared in EAAMO‘21, with Best Student Paper Award
Appeared in the 2021 NBER Decentralization Conference

The University of California suspended through 2024 the requirement that applicants from California submit SAT scores, upending the major role standardized testing has played in college admissions. We study the impact of such decisions and their interplay with other interventions such as affirmative action on admitted class composition. More specifically, this paper develops a theoretical framework to study the effect of requiring test scores on academic merit and diversity in college admissions. The model has a college and a set of potential students. Each student has an unobserved noisy skill level, and multiple observed application components and group membership. The college is Bayesian and maximizes an objective that depends on both diversity and merit. It estimates each applicant’s true skill level using the observed features, and then admits students with or without affirmative action. We characterize the trade-off between the (potentially positive) informational role of standardized testing in college admissions and its (negative) exclusionary nature. Dropping test scores may exacerbate disparities by decreasing the amount of information available for each applicant, especially those from non-traditional backgrounds. However, if there are substantial barriers to testing, removing the test improves both academic merit and diversity by increasing the size of the applicant pool. The overall effect of testing depends on both the variance of the test score noise and the number of people excluded by the test requirement. Finally, using application and transcript data from the University of Texas at Austin, we demonstrate how an admissions committee could measure the trade-off in practice.
2021 Test-optional Policies: Overcoming Strategic Behavior and Informational Gaps [abstract | arxiv | official link]
Zhi Liu and Nikhil Garg
AAAI/ACM Conference on Equity and Access in Algorithms, Mechanisms, and Optimization (EAAMO‘21)

Due to the Covid-19 pandemic, more than 500 US-based colleges and universities went “test-optional” for admissions and promised that they would not penalize applicants for not submitting test scores, part of a longer trend to rethink the role of testing in college admissions. However, it remains unclear how (and whether) a college can simultaneously use test scores for those who submit them, while not penalizing those who do not – and what that promise even means. We formalize these questions, and study how a college can overcome two challenges with optional testing: strategic applicants (when those with low test scores can pretend to not have taken the test), and informational gaps (it has more information on those who submit a test score than those who do not). We find that colleges can indeed do so, if and only if they are able to use information on who has test access and are willing to randomize admissions.
2021 The Stereotyping Problem in Collaboratively Filtered Recommender Systems [abstract | arxiv | official link]
Wenshuo Guo, Karl Krauth, Michael I. Jordan, and Nikhil Garg
AAAI/ACM Conference on Equity and Access in Algorithms, Mechanisms, and Optimization (EAAMO‘21)

Recommender systems – and especially matrix factorization-based collaborative filtering algorithms – play a crucial role in mediating our access to online information. We show that such algorithms induce a particular kind of stereotyping: if preferences for a set of items are anti-correlated in the general user population, then those items may not be recommended together to a user, regardless of that user’s preferences and ratings history. First, we introduce a notion of joint accessibility, which measures the extent to which a set of items can jointly be accessed by users. We then study joint accessibility under the standard factorization-based collaborative filtering framework, and provide theoretical necessary and sufficient conditions when joint accessibility is violated. Moreover, we show that these conditions can easily be violated when the users are represented by a single feature vector. To improve joint accessibility, we further propose an alternative modelling fix, which is designed to capture the diverse multiple interests of each user using a multi-vector representation. We conduct extensive experiments on real and simulated datasets, demonstrating the stereotyping problem with standard single-vector matrix factorization models.
2022 Strategic Ranking [abstract | arxiv | official link]
Lydia Liu, Nikhil Garg, and Christian Borgs
International Conference on Artificial Intelligence and Statistics (AISTATS‘22)

Strategic classification studies the design of a classifier robust to the manipulation of input by strategic individuals. However, the existing literature does not consider the effect of competition among individuals as induced by the algorithm design. Motivated by constrained allocation settings such as college admissions, we introduce strategic ranking, in which the (designed) individual reward depends on an applicant’s post-effort rank in a measurement of interest. Our results illustrate how competition among applicants affects the resulting equilibria and model insights. We analyze how various ranking reward designs trade off applicant, school, and societal utility and in particular how ranking design can counter inequities arising from disparate access to resources to improve one’s measured score: We find that randomization in the ranking reward design can mitigate two measures of disparate impact, welfare gap and access, whereas non-randomization may induce a high level of competition that systematically excludes a disadvantaged group.
2022 Fair ranking: a critical review, challenges, and future directions [abstract | arxiv | official link]
Gourab K Patro, Lorenzo Porcaro, Laura Mitchell, Qiuyue Zhang, Meike Zehlike, and Nikhil Garg
ACM Conference on Fairness, Accountability, and Transparency (FAccT‘22)
This work was written as part of a distributed, student-led working group of Mechanism Design for Social Good

Ranking, recommendation, and retrieval systems are widely used in online platforms and other societal systems, including e-commerce, media-streaming, admissions, gig platforms, and hiring. In the recent past, a large "fair ranking" research literature has been developed around making these systems fair to the individuals, providers, or content that are being ranked. Most of this literature defines fairness for a single instance of retrieval, or as a simple additive notion for multiple instances of retrievals over time. This work provides a critical overview of this literature, detailing the often context-specific concerns that such an approach misses: the gap between high ranking placements and true provider utility, spillovers and compounding effects over time, induced strategic incentives, and the effect of statistical uncertainty. We then provide a path forward for a more holistic and impact-oriented fair ranking research agenda, including methodological lessons from other fields and the role of the broader stakeholder community in overcoming data bottlenecks and designing effective regulatory environments.
2022 Trucks Don’t Mean Trump: Diagnosing Human Error in Image Analysis [abstract | arxiv | official link]
J.D. Zamfirescu-Pereira, Jerry Chen, Emily Wen, Allison Koenecke, Nikhil Garg, and Emma Pierson
ACM Conference on Fairness, Accountability, and Transparency (FAccT‘22)
Media: Cornell News.

Algorithms provide powerful tools for detecting and dissecting human bias and error. Here, we develop machine learning methods to analyze how humans err in a particular high-stakes task: image interpretation. We leverage a unique dataset of 16,135,392 human predictions of whether a neighborhood voted for Donald Trump or Joe Biden in the 2020 US election, based on a Google Street View image. We show that by training a machine learning estimator of the Bayes optimal decision for each image, we can provide an actionable decomposition of human error into bias, variance, and noise terms, and further identify specific features (like pickup trucks) which lead humans astray. Our methods can be applied to ensure that human-in-the-loop decision-making is accurate and fair and are also applicable to black-box algorithmic systems.
2022 Combatting Gerrymandering with Social Choice: the Design of Multi-member Districts [abstract | arxiv]
Nikhil Garg, Wes Gurnee, David Rothschild, and David Shmoys
ACM Conference on Economics and Computation (EC‘22)
Media: Cornell Chronicle.

Every representative democracy must specify a mechanism under which voters choose their representatives. The most common mechanism in the United States – winner-take-all single-member districts – both enables substantial partisan gerrymandering and constrains ‘fair’ redistricting, preventing proportional representation in legislatures. We study the design of multi-member districts (MMDs), in which each district elects multiple representatives, potentially through a non-winner-takes-all voting rule. We carry out large-scale analyses for the U.S. House of Representatives under MMDs with different social choice functions, under algorithmically generated maps optimized for either partisan benefit or proportionality. Doing so requires efficiently incorporating predicted partisan outcomes – under various multi-winner social choice functions – into an algorithm that optimizes over an ensemble of maps. We find that with three-member districts using Single Transferable Vote, fairness-minded independent commissions would be able to achieve proportional outcomes in every state up to rounding, and advantage-seeking partisans would have their power to gerrymander significantly curtailed. Simultaneously, such districts would preserve geographic cohesion, an arguably important aspect of representative democracies. In the process, we open up a rich research agenda at the intersection of social choice and computational redistricting.
2023 Coarse race data conceals disparities in clinical risk score performance [abstract | arxiv | official link]
Rajiv Movva, Divya Shanmugam, Kaihua Hou, Priya Pathak, John Guttag, Nikhil Garg, and Emma Pierson
Machine Learning for Healthcare (ML4HC)

Healthcare data in the United States often records only a patient’s coarse race group: for example, both Indian and Chinese patients are typically coded as “Asian.” It is unknown, however, whether this coarse coding conceals meaningful disparities in the performance of clinical risk scores across granular race groups. Here we show that it does. Using data from 418K emergency department visits, we assess clinical risk score performance disparities across granular race groups for three outcomes, five risk scores, and four performance metrics. Across outcomes and metrics, we show that there are significant granular disparities in performance within coarse race categories. In fact, variation in performance metrics within coarse groups often exceeds the variation between coarse groups. We explore why these disparities arise, finding that outcome rates, feature distributions, and the relationships between features and outcomes all vary significantly across granular race categories. Our results suggest that healthcare providers, hospital systems, and machine learning researchers should strive to collect, release, and use granular race data in place of coarse race data, and that existing analyses may significantly underestimate racial disparities in performance.
2023 Interface Design to Mitigate Inflation in Recommender Systems [abstract | official link]
Rana Shahout, Yehonatan Peisakhovsky, Sasha Stoikov, and Nikhil Garg
ACM Conference on Recommender Systems (RecSys ’23 Short paper)

Recommendation systems rely on user-provided data to learn about item quality and provide personalized recommendations. An implicit assumption when aggregating ratings into item quality is that ratings are strong indicators of item quality. In this work, we test this assumption using data collected from a music discovery application. Our study focuses on two factors that cause rating inflation: heterogeneous user rating behavior and the dynamics of personalized recommendations. We show that rating behavior varies substantially by user, leading to item quality estimates that reflect the users who rated an item more than the item quality itself. Additionally, items that are more likely to be shown via personalized recommendations can experience a substantial increase in their exposure and potential bias toward them. To mitigate these effects, we analyze the results of a randomized controlled trial in which the rating interface was modified. The test resulted in a substantial improvement in user rating behavior and a reduction in item quality inflation. These findings highlight the importance of carefully considering the assumptions underlying recommendation systems and designing interfaces that encourage accurate rating behavior.
2023 Supply-Side Equilibria in Recommender Systems [abstract | arxiv | official link]
Meena Jagadeesan, Nikhil Garg, and Jacob Steinhardt
Neural Information Processing Systems (NeurIPS ‘23)

Digital recommender systems such as Spotify and Netflix affect not only consumer behavior but also producer incentives: producers seek to supply content that will be recommended by the system. But what content will be produced? In this paper, we investigate the supply-side equilibria in content recommender systems. We model users and content as D-dimensional vectors, and recommend the content that has the highest dot product with each user. The main features of our model are that the producer decision space is high-dimensional and the user base is heterogeneous. This gives rise to new qualitative phenomena at equilibrium: First, the formation of genres, where producers specialize to compete for subsets of users. Using a duality argument, we derive necessary and sufficient conditions for this specialization to occur. Second, we show that producers can achieve positive profit at equilibrium, which is typically impossible under perfect competition. We derive sufficient conditions for this to occur, and show it is closely connected to specialization of content. In both results, the interplay between the geometry of the users and the structure of producer costs influences the structure of the supply-side equilibria. At a conceptual level, our work serves as a starting point to investigate how recommender systems shape supply-side competition between producers.
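As a rough illustration of the recommendation rule described in this abstract (each user is shown the content with the highest dot product), here is a minimal sketch. The function name `recommend`, the two-dimensional vectors, and the toy values are hypothetical and are not taken from the paper; the sketch covers only the assignment step, not the producers' equilibrium behavior.

```python
def recommend(users, contents):
    """Return, for each user vector, the index of the content vector
    with the highest dot product (ties go to the lowest index)."""
    def dot(u, p):
        return sum(a * b for a, b in zip(u, p))

    return [
        max(range(len(contents)), key=lambda j: dot(u, contents[j]))
        for u in users
    ]


# Hypothetical toy data: three users and two producers in D = 2 dimensions.
users = [(1.0, 0.0), (0.0, 1.0), (0.6, 0.8)]
contents = [(0.9, 0.1), (0.2, 0.9)]

assignments = recommend(users, contents)
print(assignments)  # index of the recommended content for each user
```

In this toy instance each producer ends up serving a different subset of users, which is the kind of specialization the abstract refers to as genre formation.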
2024 A Bayesian Spatial Model to Correct Under-Reporting in Urban Crowdsourcing [abstract | official link | arxiv | code & data]
Gabriel Agostini, Emma Pierson, and Nikhil Garg
AAAI Conference on Artificial Intelligence (AAAI‘24) (Oral Presentation)

Decision-makers often observe the occurrence of events through a reporting process. City governments, for example, rely on resident reports to find and then resolve urban infrastructural problems such as fallen street trees, flooded basements, or rat infestations. Without additional assumptions, there is no way to distinguish events that occur but are not reported from events that truly did not occur–a fundamental problem in settings with positive-unlabeled data. Because disparities in reporting rates correlate with resident demographics, addressing incidents only on the basis of reports leads to systematic neglect in neighborhoods that are less likely to report events. We show how to overcome this challenge by leveraging the fact that events are spatially correlated. Our framework uses a Bayesian spatial latent variable model to infer event occurrence probabilities and applies it to storm-induced flooding reports in New York City, further pooling results across multiple storms. We show that a model accounting for under-reporting and spatial correlation predicts future reports more accurately than other models, and further induces a more equitable set of inspections: its allocations better reflect the population and provide equitable service to non-white and lower-income areas. This finding reflects heterogeneous reporting behavior learned by the model: reporting rates are higher in Census tracts with higher populations, proportions of white residents, and proportions of owner-occupied households. Our work lays the groundwork for more equitable proactive government services, even with disparate reporting behavior.
2024 Identifying and Addressing Disparities in Public Libraries with Bayesian Latent Variable Modeling [abstract | official link]
Zhi Liu, Sarah Rankin, and Nikhil Garg
AAAI Conference on Artificial Intelligence (AAAI‘24)

Public libraries are an essential public good. We ask: are urban library systems providing equitable service to all residents, in terms of the books they have access to and check out? If not, what causes disparities: heterogeneous book collections, resident behavior and access, and/or operational policies? Existing methods leverage only system-level outcome data (such as overall checkouts per branch), and so cannot distinguish between these factors. As a result, it is difficult to use their results to guide interventions to increase equitable access. We propose a Bayesian framework to characterize book checkout behavior across multiple branches of a library system, learning heterogeneous book popularity, overall branch demand, and usage of the online hold system, while controlling for book availability. In collaboration with the New York Public Library, we apply our framework to granular data consisting of over 400,000 checkouts during 2022. We first show that our model significantly out-performs baseline methods in predicting checkouts at the book-branch level. Next, we study spatial and socioeconomic disparities. We show that disparities are largely driven by disparate use of the online holds system, which allows library patrons to receive books from any other branch through an online portal. This system thus leads to a large outflow of popular books from branches in lower income neighborhoods to those in high income ones. Finally, we illustrate the use of our model and insights to quantify the impact of potential interventions, such as changing how books are internally routed between branches to fulfill hold requests.
2024 Domain constraints improve risk prediction when outcome data is missing [abstract | arxiv | official link]
Sidhika Balachandar, Nikhil Garg, and Emma Pierson
International Conference on Learning Representations (ICLR‘24)

Machine learning models are often trained to predict the outcome resulting from a human decision. For example, if a doctor decides to test a patient for disease, will the patient test positive? A challenge is that the human decision censors the outcome data: we only observe test outcomes for patients doctors historically tested. Untested patients, for whom outcomes are unobserved, may differ from tested patients along observed and unobserved dimensions. We propose a Bayesian model class which captures this setting. The purpose of the model is to accurately estimate risk for both tested and untested patients. Estimating this model is challenging due to the wide range of possibilities for untested patients. To address this, we propose two domain constraints which are plausible in health settings: a prevalence constraint, where the overall disease prevalence is known, and an expertise constraint, where the human decision-maker deviates from purely risk-based decision-making only along a constrained feature set. We show theoretically and on synthetic data that domain constraints improve parameter inference. We apply our model to a case study of cancer risk prediction, showing that the model’s inferred risk predicts cancer diagnoses, its inferred testing policy captures known public health policies, and it can identify suboptimalities in test allocation. Though our case study is in healthcare, our analysis reveals a general class of domain constraints which can improve model estimation in many settings.
2024 Reconciling the accuracy-diversity trade-off in recommendations [abstract | arxiv | official link]
Kenny Peng, Manish Raghavan, Emma Pierson, Jon Kleinberg, and Nikhil Garg
The ACM Web Conference (WWW‘24) (Oral Presentation)

In recommendation settings, there is an apparent trade-off between the goals of accuracy (to recommend items a user is most likely to want) and diversity (to recommend items representing a range of categories). As such, real-world recommender systems often explicitly incorporate diversity separately from accuracy. This approach, however, leaves a basic question unanswered: Why is there a trade-off in the first place? We analyze a stylized model of recommendations reconciling this trade-off. Accounting for a user’s capacity constraints (users do not typically make use of all the items that are recommended to them), optimal recommendations in our model are inherently diverse. Thus, accuracy and diversity appear misaligned because traditional accuracy metrics do not consider capacity constraints. Our model yields precise and interpretable characterizations of diversity in different settings, giving practical insights into the design of diverse recommendations.
2024 Topics, Authors, and Institutions in Large Language Model Research: Trends from 17K arXiv Papers [abstract | arxiv | official link]
Rajiv Movva, Sidhika Balachandar, Kenny Peng, Gabriel Agostini, Nikhil Garg, and Emma Pierson
The North American Chapter of the Association for Computational Linguistics (NAACL‘24)

Large language models (LLMs) are dramatically influencing AI research, spurring discussions on what has changed so far and how to shape the field’s future. To clarify such questions, we analyze a new dataset of 16,979 LLM-related arXiv papers, focusing on recent trends in 2023 vs. 2018-2022. First, we study disciplinary shifts: LLM research increasingly considers societal impacts, evidenced by 20× growth in LLM submissions to the Computers and Society sub-arXiv. An influx of new authors – half of all first authors in 2023 – are entering from non-NLP fields of CS, driving disciplinary expansion. Second, we study industry and academic publishing trends. Surprisingly, industry accounts for a smaller publication share in 2023, largely due to reduced output from Google and other Big Tech companies; universities in Asia are publishing more. Third, we study institutional collaboration: while industry-academic collaborations are common, they tend to focus on the same topics that industry focuses on rather than bridging differences. The most prolific institutions are all US- or China-based, but there is very little cross-country collaboration. We discuss implications around (1) how to support the influx of new authors, (2) how industry trends may affect academics, and (3) possible effects of (the lack of) collaboration.
2024 Wisdom and Foolishness of Noisy Matching Markets [abstract | arxiv]
Kenny Peng and Nikhil Garg
ACM Conference on Economics and Computation (EC‘24)

We consider a many-to-one matching market where colleges share true preferences over students but make decisions using only independent noisy rankings. Each student has a true value v, but each college c ranks the student according to an independently drawn estimated value v + X_c for X_c ∼ D. We ask a basic question about the resulting stable matching: How noisy is the set of matched students? Two striking effects can occur in large markets (i.e., with a continuum of students and a large number of colleges). When D is light-tailed, noise is fully attenuated: only the highest-value students are matched. When D is long-tailed, noise is fully amplified: students are matched uniformly at random. These results hold for any distribution of student preferences over colleges, and extend to when only subsets of colleges agree on true student valuations instead of the entire market. More broadly, our framework provides a tractable approach to analyze implications of imperfect preference formation in large markets.
2024 Redesigning Service Level Agreements: Equity and Efficiency in City Government Operations [abstract | arxiv]
Zhi Liu and Nikhil Garg
ACM Conference on Economics and Computation (EC‘24)

We consider government service allocation – how the government allocates resources (e.g., maintenance of public infrastructure) over time. It is important to make these decisions efficiently and equitably – though these desiderata may conflict. In particular, we consider the design of Service Level Agreements (SLA) in city government operations: promises that incidents such as potholes and fallen trees will be responded to within a certain time. We model the problem of designing a set of SLAs as an optimization problem with equity and efficiency objectives under a queuing network framework; the city has two decision levers: how to allocate response budgets to different neighborhoods, and how to schedule responses to individual incidents. We: (1) Theoretically analyze a stylized model and find that the "price of equity" is small in realistic settings; (2) Develop a simulation-optimization framework to optimize policies in practice; (3) Apply our framework empirically using data from NYC, finding that: (a) status quo inspections are highly inefficient and inequitable compared to optimal ones, and (b) in practice, the equity-efficiency trade-off is not substantial: generally, inefficient policies are inequitable, and vice versa.
2024 Equitable Congestion Pricing under the Markovian Traffic Model: An Application to Bogota [abstract | arxiv]
Alfredo Torrico, Natthawut Boonsiriphatthanajaroen, Nikhil Garg, Andrea Lodi, and Hugo Mainguy
ACM Conference on Economics and Computation (EC‘24)

Congestion pricing is used to raise revenues and reduce traffic and pollution. However, people have heterogeneous spatial demand patterns and willingness (or ability) to pay tolls, and so pricing may have substantial equity implications. We develop a data-driven approach to design congestion pricing given policymakers’ equity and efficiency objectives. First, algorithmically, we extend the Markovian traffic equilibrium setting introduced by Baillon & Cominetti (2008) to model heterogeneous populations and incorporate prices and outside options such as public transit. Second, we empirically evaluate various pricing schemes using data collected by an industry partner in the city of Bogota, one of the most congested cities in the world. We find that pricing personalized to each economic stratum can be substantially more efficient and equitable than uniform pricing; however, non-personalized but area-based pricing can recover much of the gap.
2024 Ending Affirmative Action Harms Diversity Without Improving Academic Merit [abstract | arxiv | official link]
Jinsook Lee, Emma Harvey, Joyce Zhou, Nikhil Garg, Thorsten Joachims, and René Kizilcec
Equity and Access in Algorithms, Mechanisms, and Optimization (EAAMO ’24)
Media: Cornell News.

Each year, selective American colleges sort through tens of thousands of applications to identify a first-year class that displays both academic merit and diversity. In the 2023-2024 admissions cycle, these colleges faced unprecedented challenges to doing so. First, the number of applications has been steadily growing year-over-year. Second, test-optional policies that have remained in place since the COVID-19 pandemic limit access to key information that has historically been predictive of academic success. Most recently, longstanding debates over affirmative action culminated in the Supreme Court banning race-conscious admissions. Colleges have explored machine learning (ML) models to address the issues of scale and missing test scores, often via ranking algorithms intended to allow human reviewers to focus attention on ‘top’ applicants. However, the Court’s ruling will force changes to these models, which were previously able to consider race as a factor in ranking. There is currently a poor understanding of how these mandated changes will shape applicant ranking algorithms, and, by extension, admitted classes. We seek to address this by quantifying the impact of different admission policies on the applications prioritized for review. We show that removing race data from a previously developed applicant ranking algorithm reduces the diversity of the top-ranked pool of applicants without meaningfully increasing the academic merit of that pool. We further measure the impact of policy change on individuals by quantifying arbitrariness in applicant rank. We find that any given policy has a high degree of arbitrariness (i.e. at most 9% of applicants are consistently ranked in the top 20%), and that removing race data from the ranking algorithm increases arbitrariness in outcomes for most applicants.
2024 Monoculture in Matching Markets [abstract | arxiv | official link]
Kenny Peng and Nikhil Garg
Neural Information Processing Systems (NeurIPS ‘24)

Algorithmic monoculture arises when many decision-makers rely on the same algorithm to evaluate applicants. An emerging body of work investigates possible harms of this kind of homogeneity, but has been limited by the challenge of incorporating market effects in which the preferences and behavior of many applicants and decision-makers jointly interact to determine outcomes. Addressing this challenge, we introduce a tractable theoretical model of algorithmic monoculture in a two-sided matching market with many participants. We use the model to analyze outcomes under monoculture (when decision-makers all evaluate applicants using a common algorithm) and under polyculture (when decision-makers evaluate applicants independently). All else equal, monoculture (1) selects less-preferred applicants when noise is well-behaved, (2) matches more applicants to their top choice, though individual applicants may be worse off depending on their value to decision-makers and risk tolerance, and (3) is more robust to disparities in the number of applications submitted.
2024 User-item fairness tradeoffs in recommendations [abstract | arxiv | official link]
Sophie Greenwood, Sudalakshmee Chiniah, and Nikhil Garg
Neural Information Processing Systems (NeurIPS ‘24)

In the basic recommendation paradigm, the most relevant item is recommended to each user. This may result in some items receiving lower exposure than they "should"; to counter this, several algorithmic approaches have been developed to ensure item fairness. These approaches necessarily degrade recommendations for some users to improve outcomes for items, leading to user fairness concerns. In turn, a recent line of work has focused on developing algorithms for multi-sided fairness, to jointly optimize user fairness, item fairness, and overall recommendation quality. This induces the question: what is the tradeoff between these objectives, and what are the characteristics of (multi-objective) optimal solutions? Theoretically, we develop a model of recommendations with user, item, and overall utility objectives and characterize the solutions of fairness-constrained optimization. We identify two phenomena: (a) when user preferences are diverse, there is "free" item and user fairness; and (b) users whose preferences are misestimated can be especially disadvantaged by item fairness constraints. Empirically, we build a recommendation system for preprints on arXiv and implement our framework, measuring the phenomena in practice and showing how these phenomena inform the design of markets with recommendation systems-intermediated matching.
2025 A No Free Lunch Theorem for Human-AI Collaboration [abstract | arxiv | official link]
Kenny Peng, Nikhil Garg, and Jon Kleinberg
AAAI Conference on Artificial Intelligence (AAAI‘25)

The gold standard in human-AI collaboration is complementarity – when combined performance exceeds both the human and algorithm alone. We investigate this challenge in binary classification settings where the goal is to maximize 0-1 accuracy. Given two or more agents who can make calibrated probabilistic predictions, we show a "No Free Lunch"-style result. Any deterministic collaboration strategy (a function mapping calibrated probabilities into binary classifications) that does not essentially always defer to the same agent will sometimes perform worse than the least accurate agent. In other words, complementarity cannot be achieved "for free." The result does suggest one model of collaboration with guarantees, where one agent identifies "obvious" errors of the other agent. We also use the result to understand the necessary conditions enabling the success of other collaboration techniques, providing guidance to human-AI collaboration.
2025 Learning Disease Progression Models That Capture Health Disparities [abstract | arxiv]
Erica Chiang, Divya Shanmugam, Ashley N. Beecy, Gabriel Sayer, Deborah Estrin, Nikhil Garg, and Emma Pierson
Conference on Health, Inference, and Learning (CHIL ‘25)
Best Paper Award at CHIL ‘25

Disease progression models are widely used to inform the diagnosis and treatment of many progressive diseases. However, a significant limitation of existing models is that they do not account for health disparities that can bias the observed data. To address this, we develop an interpretable Bayesian disease progression model that captures three key health disparities: certain patient populations may (1) start receiving care only when their disease is more severe, (2) experience faster disease progression even while receiving care, or (3) receive follow-up care less frequently conditional on disease severity. We show theoretically and empirically that failing to account for any of these disparities can result in biased estimates of severity (e.g., underestimating severity for disadvantaged groups). On a dataset of heart failure patients, we show that our model can identify groups that face each type of health disparity, and that accounting for these disparities while inferring disease severity meaningfully shifts which patients are considered high-risk.
2025 Balancing Producer Fairness and Efficiency via Bayesian Rating System Design [abstract | arxiv | official link]
Thomas Ma, Michael Bernstein, Ramesh Johari, and Nikhil Garg
International AAAI Conference on Web and Social Media (ICWSM ‘25)

Online marketplaces use rating systems to promote discovery of high quality products. However, these systems also lead to high variance in producers’ economic outcomes: a new producer who sells high-quality items may, by luck, receive one low rating early on, negatively impacting their popularity with future customers. We investigate the design of rating systems that balance the goals of identifying high quality products ("efficiency") and minimizing the variance in economic outcomes of producers of similar quality (individual "producer fairness"). We observe that there is a trade-off between these two goals: rating systems that promote efficiency are necessarily less individually fair to producers. We introduce Bayesian rating systems as an approach to managing this trade-off. Informally, the systems we propose set a system-wide prior for the quality of an incoming product, and subsequently update that prior to a Bayesian posterior on quality based on user-generated ratings over time. Through calibrated simulations, we show that the strength of the prior directly determines the operating point on the identified trade-off: the stronger the prior, the more the marketplace discounts early ratings data (so individual producer fairness increases), but the slower the platform is in learning about true item quality (so efficiency suffers). Importantly, the prevailing method of ratings aggregation – displaying the sample mean of ratings – is an extreme point in this design space that maximally prioritizes efficiency at the expense of producer fairness. Instead, by choosing a Bayesian rating system design with an appropriately set prior, a platform can be intentional about the consequential choice of a balance between efficiency and producer fairness.
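To make the role of the prior concrete, here is a minimal sketch of one way a Bayesian rating could be computed. It assumes binary ratings (1 = positive, 0 = negative) and a Beta prior parameterized by a mean and a pseudo-count "strength"; this parameterization, the function name `posterior_mean`, and the numbers are illustrative assumptions, not necessarily the model used in the paper.

```python
def posterior_mean(ratings, prior_mean, prior_strength):
    """Posterior mean quality under a Beta-Bernoulli model (an assumed,
    illustrative model). prior_strength acts as a number of pseudo-ratings
    at prior_mean; prior_strength = 0 recovers the plain sample mean."""
    n = len(ratings)
    if prior_strength + n == 0:
        raise ValueError("need at least one rating or a positive prior strength")
    return (prior_strength * prior_mean + sum(ratings)) / (prior_strength + n)


# Hypothetical scenario: a new producer receives one low rating early on.
early_ratings = [0]
for strength in (0, 4, 20):
    score = posterior_mean(early_ratings, prior_mean=0.8, prior_strength=strength)
    print(f"prior strength {strength:>2}: displayed score {score:.3f}")
```

With no prior the single low rating fully determines the displayed score, while a stronger prior keeps the score closer to the system-wide prior mean. The same discounting slows learning once many ratings arrive, which is the trade-off the abstract describes.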
2025 Correlated Errors in Large Language Models [abstract | arxiv | code]
Elliot Kim, Avi Garg, Kenny Peng, and Nikhil Garg
International Conference on Machine Learning (ICML‘25)

Diversity in training data, architecture, and providers is assumed to mitigate homogeneity in LLMs. However, we lack empirical evidence on whether different LLMs differ meaningfully. We conduct a large-scale empirical evaluation on over 350 LLMs overall, using two popular leaderboards and a resume-screening task. We find substantial correlation in model errors – on one leaderboard dataset, models agree 60% of the time when both models err. We identify factors driving model correlation, including shared architectures and providers. Crucially, however, larger and more accurate models have highly correlated errors, even with distinct architectures and providers. Finally, we show the effects of correlation in two downstream tasks: LLM-as-judge evaluation and hiring – the latter reflecting theoretical predictions regarding algorithmic monoculture.
2025 Sparse Autoencoders for Hypothesis Generation [abstract | arxiv | website | code]
Rajiv Movva, Kenny Peng, Nikhil Garg, Jon Kleinberg, and Emma Pierson
International Conference on Machine Learning (ICML‘25)

We describe HypotheSAEs, a general method to hypothesize interpretable relationships between text data (e.g., headlines) and a target variable (e.g., clicks). HypotheSAEs has three steps: (1) train a sparse autoencoder on text embeddings to produce interpretable features describing the data distribution, (2) select features that predict the target variable, and (3) generate a natural language interpretation of each feature (e.g., "mentions being surprised or shocked") using an LLM. Each interpretation serves as a hypothesis about what predicts the target variable. Compared to baselines, our method better identifies reference hypotheses on synthetic datasets (at least +0.06 in F1) and produces more predictive hypotheses on real datasets (~twice as many significant findings), despite requiring 1-2 orders of magnitude less compute than recent LLM-based methods. HypotheSAEs also produces novel discoveries on two well-studied tasks: explaining partisan differences in Congressional speeches and identifying drivers of engagement with online headlines.

Other (workshops and technical reports)

2013 Multi-Modal, Multi-State, Real-Time Crew State Monitoring System [abstract | pdf]
Kier Fortier, Nikhil Garg, and Elizabeth Pickering
NASA Glenn Research Center Research Report

I helped develop a Real-time, Multi-Modal, Crew State Monitoring System that integrates EEG, GSR, and HRV. I was in charge of overall design, EEG processing, machine learning, and multi-modal integration. I developed robust artifact rejection to detect and remove blinks and other artifacts from the EEG data; EEG feature extraction to represent blocks of data with frequency characteristics, statistical measures, and blink rate; and a machine learning classification system (through Support Vector Machines) that uses the features and characterizes data from a block of time as originating from either a state of rest or a state of concentration. I then integrated EEG and GSR features for joint classification, and we demoed an end-to-end system that collected data from multiple sensors, extracted features, and trained and used the classifier to predict subject state. The system successfully classified 80 percent of subject states.
2015 Use of Electroencephalography and Galvanic Skin Response in the Prediction of an Attentive Cognitive State [abstract | pdf]
Beth Lewandowski, Kier Fortier, Nikhil Garg, Victor Rielly, Jeff Mackey, Tristan Hearn, Angela Harrivel, and Bradford Fenton
Health and Human Performance Research Summit, Dayton, CO

As part of an effort aimed at improving aviation safety, the Crew State Monitoring Element of the NASA Vehicle Systems Safety Technologies Project is developing a monitoring system capable of detecting cognitive states that may be associated with unsafe piloting conditions. The long term goal is a real-time, integrated system that uses multiple physiological sensing modalities to detect multiple cognitive states with high accuracy, which can be used to help optimize human performance. Prior to realizing an integrated system, individual sensing modalities are being investigated, including the use of electroencephalographic (EEG) and galvanic skin response (GSR) signals, in the determination of an attentive or inattentive state. EEG and GSR data are collected during periods of rest and as subjects perform psychological tests including the psychomotor vigilance test, the Mackworth clock test and the attention network test. Subjects also perform tasks designed to simulate piloting tasks within the NASA multi-attribute task battery (MATB-II) program. The signals are filtered, the artifacts are rejected and the power spectral density (PSD) of the signals is found. Comparisons of the PSD between the rest and test blocks are made, along with the change in PSD over the time course of the blocks. Future work includes the collection of heart rate data and the investigation of heart rate variability as an additional measure to use in the prediction of attentive state, as well as the investigation of additional EEG signal processing methods such as source localization, multi-scale entropy and coherence measures. Preliminary results will be presented to highlight the methods used and to discuss our hypotheses. The challenges associated with realizing a real-time, accurate, multi-modal, cognitive state monitoring system are numerous. A discussion of some of the challenges will be provided, including real-time artifact rejection methods, quantification of inter- and intra-subject variability, determination of what information within the signals provides the best measurement of attention and determination of how information from the different modalities can be integrated to improve the overall accuracy of the system.
2015 Fair Use and Innovation in Unlicensed Wireless Spectrum: LTE unlicensed and Wi-Fi in the 5 GHz unlicensed band [pdf]
Nikhil Garg
IEEE-USA Journal of Technology and Public Policy
2016 Transfer Learning: The Impact of Test Set Word Vectors, with Applications to Political Tweets [abstract | pdf]
Nikhil Garg and Arjun Seshadri

A major difficulty in applying deep learning in novel domains is the expense associated with acquiring sufficient training data. In this work, we extend literature in deep transfer learning by studying the role of initializing the embedding matrix with word vectors from GloVe on a target dataset before training models with data from another domain. We study transfer learning on variants of four models (2 RNNs, a CNN, and an LSTM) and three datasets. We conclude that 1) the simple idea of initializing word vectors significantly and robustly improves transfer learning performance, 2) cross-domain learning occurs in fewer iterations than in-domain learning, considerably reducing training time, and 3) blending various out-of-domain datasets before training improves transfer learning. We then apply our models to a dataset of over 400k tweets by politicians, classifying sentiment and subjectivity vs. objectivity. This dataset was provided unlabeled, motivating an unsupervised and transfer learning approach. With transfer learning, we achieve reasonable performance on sentiment classification, but fail in classifying subjectivity vs. objectivity.
2018 Comparing Voting Methods for Budget Decisions on the ASSU Ballot [abstract | pdf]
Lodewijk Gelauff, Sukolsak Sakshuwong, Nikhil Garg, and Ashish Goel

During the 2018 Associated Students of Stanford University (ASSU; Stanford’s student body) election and annual grants process, the Stanford Crowdsourced Democracy Team (SCDT) ran a research ballot and survey to develop insights into voting behavior on the budget component of the ballot (annual grants) where multiple grant requests are considered. We provided voters with additional voting methods for the budget component, collected further insights through a survey and demonstrated the viability of the proposed workflow. Some of our findings are directly relevant to ASSU. Furthermore, the (appropriately anonymized) data gathered in this year’s research ballots is beneficial for research purposes. Overall, our platform and pipeline (PB Stanford) with post-validation of ballots functioned well on a large scale. In particular, the knapsack ballot mechanism shows promise in voter feedback.
2019 Deliberative Democracy with the Online Deliberation Platform [official link]
James Fishkin, Nikhil Garg, Lodewijk Gelauff, Ashish Goel, Kamesh Munagala, Sukolsak Sakshuwong, Alice Siu, and Sravya Yandamuri
AAAI Conference on Human Computation and Crowdsourcing Demo Track
2025 Choosing the Right Weights: Balancing Value, Strategy, and Noise in Recommender Systems [abstract | arxiv]
Smitha Milli, Emma Pierson, and Nikhil Garg
Unpublished manuscript

Many recommender systems are based on optimizing a linear weighting of different user behaviors, such as clicks, likes, shares, etc. Though the choice of weights can have a significant impact, there is little formal study or guidance on how to choose them. We analyze the optimal choice of weights from the perspectives of both users and content producers who strategically respond to the weights. We consider three aspects of user behavior: value-faithfulness (how well a behavior indicates whether the user values the content), strategy-robustness (how hard it is for producers to manipulate the behavior), and noisiness (how much estimation error there is in predicting the behavior). Our theoretical results show that for users, upweighting more value-faithful and less noisy behaviors leads to higher utility, while for producers, upweighting more value-faithful and strategy-robust behaviors leads to higher welfare (and the impact of noise is non-monotonic). Finally, we discuss how our results can help system designers select weights in practice.

Theses

2015 Downlink and Uplink User Association in Dense Next-Generation Wireless Networks [abstract | official link]
Nikhil Garg
Bachelor’s Thesis, University of Texas at Austin

5G, the next-generation cellular network, must serve an aggregate data rate of 1000 times that of current 4G networks while reducing data latency by a factor of ten. To meet these requirements, 5G networks will be far denser than existing networks, and small cells (femtocells and picocells) will augment network capacity. However, dense networks raise questions regarding interference, user association, and handoff between base stations. Where recent papers have demonstrated that interference from small cells will not be prohibitive under multi-slope path loss models, this thesis describes how the use of different path loss models affects the design of such dense, multi-tier networks. This thesis concludes that the gains realized by downlink biasing and uplink/downlink decoupling are strongly dependent on the path loss model assumed and the density differential between base station tiers. Furthermore, this thesis argues that the gains from uplink/downlink decoupling are reduced by a factor of 50% when optimal biasing for the downlink is used.
2020 Designing Marketplaces and Civic Engagement Platforms: Learning, Incentives, and Pricing [abstract | official link | summary | talk | short talk]
Nikhil Garg
PhD Dissertation, Stanford University
INFORMS George Dantzig Dissertation Award, 2020
ACM SIGecom dissertation award (Honorable mention), 2021

Platforms increasingly mediate interactions between people: both helping us find work and transportation, and supporting our civic society through discussion and decision-making. Principled system design requires formalizing the platform’s objective and understanding the incentives, behavioral tendencies, and capabilities of participants; in turn, the design influences participant behavior. In this dissertation, I describe work designing platforms in two domains – two-sided marketplaces and civic engagement platforms – combining both theoretical and empirical analyses of such systems. First, I consider the design of surge pricing that is incentive compatible for drivers in ride-hailing platforms. Second, I tackle rating system inflation and design on online platforms. Finally, I study the design and deployment of systems for participatory budgeting. The work in this dissertation has informed deployments at Uber, a large online labor platform, and in participatory budgeting elections across the U.S.

Working Papers

Optimal Strategies in Ranked-Choice Voting [abstract | arxiv]
Sanyukta Deshpande, Nikhil Garg, and Sheldon Jacobson

Ranked Choice Voting (RCV) and Single Transferable Voting (STV) are widely valued, but are complex to understand due to intricate per-round vote transfers. Questions like determining how far a candidate is from winning or identifying effective election strategies are computationally challenging as minor changes in voter rankings can lead to significant ripple effects - for example, lending support to a losing candidate can prevent their votes from transferring to a more competitive opponent. We study optimal strategies - persuading voters to change their ballots or adding new voters - both algorithmically and theoretically. Algorithmically, we develop efficient methods to reduce election instances while maintaining optimization accuracy, effectively circumventing the computational complexity barrier. Theoretically, we analyze the effectiveness of strategies under both perfect and imperfect polling information. Our algorithmic approach applies to the ranked-choice polling data on the US 2024 Republican Primary, finding, for example, that several candidates would have been optimally served by boosting another candidate instead of themselves.
Bayesian Modeling of Zero-Shot Classifications for Urban Flood Detection [abstract | arxiv]
Matt Franchi, Nikhil Garg, Wendy Ju, and Emma Pierson

Street scene datasets, collected from Street View or dashboard cameras, offer a promising means of detecting urban objects and incidents like street flooding. However, a major challenge in using these datasets is their lack of reliable labels: there are myriad types of incidents, many types occur rarely, and ground-truth measures of where incidents occur are lacking. Here, we propose BayFlood, a two-stage approach which circumvents this difficulty. First, we perform zero-shot classification of where incidents occur using a pretrained vision-language model (VLM). Second, we fit a spatial Bayesian model on the VLM classifications. The zero-shot approach avoids the need to annotate large training sets, and the Bayesian model provides frequent desiderata in urban settings - principled measures of uncertainty, smoothing across locations, and incorporation of external data like stormwater accumulation zones. We comprehensively validate this two-stage approach, showing that VLMs provide strong zero-shot signal for floods across multiple cities and time periods, the Bayesian model improves out-of-sample prediction relative to baseline methods, and our inferred flood risk correlates with known external predictors of risk. Having validated our approach, we show it can be used to improve urban flood detection: our analysis reveals 113,738 people who are at high risk of flooding overlooked by current methods, identifies demographic biases in existing methods, and suggests locations for new flood sensors. More broadly, our results showcase how Bayesian modeling of zero-shot LM annotations represents a promising paradigm because it avoids the need to collect large labeled datasets and leverages the power of foundation models while providing the expressiveness and uncertainty quantification of Bayesian models.
Inferring fine-grained migration patterns across the United States [abstract | arxiv | data]
Gabriel Agostini, Rachel Young, Maria Fitzpatrick, Nikhil Garg, and Emma Pierson

Fine-grained migration data illuminate important demographic, environmental, and health phenomena. However, migration datasets within the United States remain lacking: publicly available Census data are neither spatially nor temporally granular, and proprietary data have higher resolution but demographic and other biases. To address these limitations, we develop a scalable iterative-proportional-fitting based method which reconciles high-resolution but biased proprietary data with low-resolution but more reliable Census data. We apply this method to produce MIGRATE, a dataset of annual migration matrices from 2010 - 2019 which captures flows between 47.4 billion pairs of Census Block Groups – about four thousand times more granular than publicly available data. These estimates are highly correlated with external ground-truth datasets, and improve accuracy and reduce bias relative to raw proprietary data. We publicly release MIGRATE estimates and provide a case study illustrating how they reveal granular patterns of migration in response to California wildfires.
Use Sparse Autoencoders to Discover Unknown Concepts, Not to Act on Known Concepts [abstract | arxiv]
Kenny Peng, Rajiv Movva, Jon Kleinberg, Emma Pierson, and Nikhil Garg

While sparse autoencoders (SAEs) have generated significant excitement, a series of negative results have added to skepticism about their usefulness. Here, we establish a conceptual distinction that reconciles competing narratives surrounding SAEs. We argue that while SAEs may be less effective for acting on known concepts, SAEs are powerful tools for discovering unknown concepts. This distinction cleanly separates existing negative and positive results, and suggests several classes of SAE applications. Specifically, we outline use cases for SAEs in (i) ML interpretability, explainability, fairness, auditing, and safety, and (ii) social and health sciences.
Optimizing Library Usage and Browser Experience: Application to the New York Public Library [abstract | arxiv]
Zhi Liu, Wenchang Zhu, Sarah Rankin, and Nikhil Garg

We tackle the challenge brought to urban library systems by the holds system – which allows users to request books available at other branches to be transferred for local pickup. The holds system increases usage of the entire collection, at the expense of an in-person browser’s experience at the source branch. We study the optimization of usage and browser experience, where the library has two levers: where a book should come from when a hold request is placed, and how many book copies at each branch should be available through the holds system versus reserved for browsers. We first show that the problem of maximizing usage can be viewed through the lens of revenue management, for which near-optimal fulfillment policies exist. We then develop a simulation framework that further optimizes for browser experience, through book reservations. We empirically apply our methods to data from the New York Public Library to design implementable policies. We find that though a substantial trade-off exists between these two desiderata, a balanced policy can improve browser experience over the historical policy without significantly sacrificing usage. Because browser usage is more prevalent among branches in low-income areas, this policy further increases system-wide equity: notably, for branches in the 25% lowest-income neighborhoods, it improves both usage and browser experience by about 15%.

2025

Addressing Discretization-Induced Bias in Demographic Prediction [abstract | arxiv | official link]
Evan Dong, Aaron Schein, Yixin Wang, and Nikhil Garg
PNAS Nexus
Conference version appeared in ACM Conference on Fairness, Accountability, and Transparency (ACM FAccT 2024).

Racial and other demographic imputation is necessary for many applications, especially in auditing disparities and outreach targeting in political campaigns. The canonical approach is to construct continuous predictions – e.g., based on name and geography – and then, often, to discretize the predictions by selecting the most likely class (argmax), potentially with a minimum threshold (thresholding). We study how this practice produces discretization bias. For example, we show that argmax labeling, as used by a prominent commercial voter file vendor to impute race/ethnicity, results in a substantial under-count of Black voters, e.g., by 28.2 percentage points in North Carolina. This bias can have substantial implications in downstream tasks that use such labels. We then introduce a joint optimization approach – and a tractable data-driven threshold heuristic – that can eliminate this bias, with negligible individual-level accuracy loss. Finally, we theoretically analyze discretization bias, show that calibrated continuous models are insufficient to eliminate it, and that an approach such as ours is necessary. Broadly, we warn researchers and practitioners against discretizing continuous demographic predictions without considering downstream consequences.
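A toy example may help show why argmax labeling can under-count a group even when the continuous predictions are calibrated. The sketch below is illustrative only: the probabilities, the two-group setup, and the helper names `expected_counts` and `argmax_counts` are assumptions, not data or code from the paper.

```python
def expected_counts(probs):
    """Group counts implied by the continuous predictions (sum of probabilities)."""
    n_groups = len(probs[0])
    return [sum(p[g] for p in probs) for g in range(n_groups)]


def argmax_counts(probs):
    """Group counts after assigning each person to their most likely group."""
    n_groups = len(probs[0])
    counts = [0] * n_groups
    for p in probs:
        counts[max(range(n_groups), key=lambda g: p[g])] += 1
    return counts


# Hypothetical population: ten people, each predicted to belong to
# group 0 with probability 0.6 and to group 1 with probability 0.4.
probs = [(0.6, 0.4)] * 10

print("expected counts:", [round(c, 2) for c in expected_counts(probs)])
print("argmax counts:  ", argmax_counts(probs))
```

If the predictions are calibrated, roughly four of these ten people belong to group 1, yet argmax labeling assigns all ten to group 0. This is the kind of aggregate distortion the abstract calls discretization bias.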
Choosing the Right Weights: Balancing Value, Strategy, and Noise in Recommender Systems [abstract | arxiv]
Smitha Milli, Emma Pierson, and Nikhil Garg
Unpublished manuscript

Many recommender systems are based on optimizing a linear weighting of different user behaviors, such as clicks, likes, shares, etc. Though the choice of weights can have a significant impact, there is little formal study or guidance on how to choose them. We analyze the optimal choice of weights from the perspectives of both users and content producers who strategically respond to the weights. We consider three aspects of user behavior: value-faithfulness (how well a behavior indicates whether the user values the content), strategy-robustness (how hard it is for producers to manipulate the behavior), and noisiness (how much estimation error there is in predicting the behavior). Our theoretical results show that for users, upweighting more value-faithful and less noisy behaviors leads to higher utility, while for producers, upweighting more value-faithful and strategy-robust behaviors leads to higher welfare (and the impact of noise is non-monotonic). Finally, we discuss how our results can help system designers select weights in practice.
A No Free Lunch Theorem for Human-AI Collaboration [abstract | arxiv | official link]
Kenny Peng, Nikhil Garg, and Jon Kleinberg
AAAI Conference on Artificial Intelligence (AAAI‘25)

The gold standard in human-AI collaboration is complementarity – when combined performance exceeds both the human and algorithm alone. We investigate this challenge in binary classification settings where the goal is to maximize 0-1 accuracy. Given two or more agents who can make calibrated probabilistic predictions, we show a "No Free Lunch"-style result. Any deterministic collaboration strategy (a function mapping calibrated probabilities into binary classifications) that does not essentially always defer to the same agent will sometimes perform worse than the least accurate agent. In other words, complementarity cannot be achieved "for free." The result does suggest one model of collaboration with guarantees, where one agent identifies "obvious" errors of the other agent. We also use the result to understand the necessary conditions enabling the success of other collaboration techniques, providing guidance to human-AI collaboration.
Faster Information for Effective Long-Term Discharge: A Field Study in Adult Foster Care [abstract | official link]
Vince Bartle, Ashley Shearer, Alexandra Wroe, Nicola Dell, and Nikhil Garg
Proceedings of the ACM on Human-Computer Interaction.
Journal Track for 28th ACM SIGCHI Conference on Computer-Supported Cooperative Work & Social Computing (CSCW‘25). Also appeared in EAAMO‘23

As the US population ages, a growing challenge is placing hospital patients who require long-term post-acute care into adult foster care facilities: small long-term nursing facilities that care for those unable to age in place because their care requirements exceed what can be delivered at home. A key challenge in patient placement is the dynamic matching process between hospital discharge coordinators looking to place patients and facilities looking for residents. We designed, built, deployed, and maintain a system to support decision making among a team of six discharge coordinators assisting in the discharge of 127 patients across 1,047 facilities in Hawai’i. Our system collects vacancy and capability data from facilities via conversational SMS and processes it to recommend facilities that discharge coordinators might contact. Findings from a 14-month deployment provide evidence for how timely, accurate information positively impacts matching efficacy. We close with lessons learned for information collection systems and provisioning platforms in similar contexts.
Learning Disease Progression Models That Capture Health Disparities [abstract | arxiv]
Erica Chiang, Divya Shanmugam, Ashley N. Beecy, Gabriel Sayer, Deborah Estrin, Nikhil Garg, and Emma Pierson
Conference on Health, Inference, and Learning (CHIL ‘25)
Best Paper Award at CHIL ‘25

Disease progression models are widely used to inform the diagnosis and treatment of many progressive diseases. However, a significant limitation of existing models is that they do not account for health disparities that can bias the observed data. To address this, we develop an interpretable Bayesian disease progression model that captures three key health disparities: certain patient populations may (1) start receiving care only when their disease is more severe, (2) experience faster disease progression even while receiving care, or (3) receive follow-up care less frequently conditional on disease severity. We show theoretically and empirically that failing to account for any of these disparities can result in biased estimates of severity (e.g., underestimating severity for disadvantaged groups). On a dataset of heart failure patients, we show that our model can identify groups that face each type of health disparity, and that accounting for these disparities while inferring disease severity meaningfully shifts which patients are considered high-risk.
Balancing Producer Fairness and Efficiency via Bayesian Rating System Design [abstract | arxiv | official link]
Thomas Ma, Michael Bernstein, Ramesh Johari, and Nikhil Garg
International AAAI Conference on Web and Social Media (ICWSM ‘25)

Online marketplaces use rating systems to promote discovery of high-quality products. However, these systems also lead to high variance in producers’ economic outcomes: a new producer who sells high-quality items may, by luck, receive one low rating early on, negatively impacting their popularity with future customers. We investigate the design of rating systems that balance the goals of identifying high-quality products ("efficiency") and minimizing the variance in economic outcomes of producers of similar quality (individual "producer fairness"). We observe that there is a trade-off between these two goals: rating systems that promote efficiency are necessarily less individually fair to producers. We introduce Bayesian rating systems as an approach to managing this trade-off. Informally, the systems we propose set a system-wide prior for the quality of an incoming product, and subsequently the system updates that prior to a Bayesian posterior on quality based on user-generated ratings over time. Through calibrated simulations, we show that the strength of the prior directly determines the operating point on the identified trade-off: the stronger the prior, the more the marketplace discounts early ratings data (so individual producer fairness increases), but the slower the platform is in learning about true item quality (so efficiency suffers). Importantly, the prevailing method of ratings aggregation – displaying the sample mean of ratings – is an extreme point in this design space that maximally prioritizes efficiency at the expense of producer fairness. Instead, by choosing a Bayesian rating system design with an appropriately set prior, a platform can be intentional about the consequential choice of a balance between efficiency and producer fairness.
Correlated Errors in Large Language Models [abstract | arxiv | code]
Elliot Kim, Avi Garg, Kenny Peng, and Nikhil Garg
International Conference on Machine Learning (ICML‘25)

Diversity in training data, architecture, and providers is assumed to mitigate homogeneity in LLMs. However, we lack empirical evidence on whether different LLMs differ meaningfully. We conduct a large-scale empirical evaluation on over 350 LLMs overall, using two popular leaderboards and a resume-screening task. We find substantial correlation in model errors – on one leaderboard dataset, models agree 60% of the time when both models err. We identify factors driving model correlation, including shared architectures and providers. Crucially, however, larger and more accurate models have highly correlated errors, even with distinct architectures and providers. Finally, we show the effects of correlation in two downstream tasks: LLM-as-judge evaluation and hiring – the latter reflecting theoretical predictions regarding algorithmic monoculture.
Sparse Autoencoders for Hypothesis Generation [abstract | arxiv | website | code]
Rajiv Movva, Kenny Peng, Nikhil Garg, Jon Kleinberg, and Emma Pierson
International Conference on Machine Learning (ICML‘25)

We describe HypotheSAEs, a general method to hypothesize interpretable relationships between text data (e.g., headlines) and a target variable (e.g., clicks). HypotheSAEs has three steps: (1) train a sparse autoencoder on text embeddings to produce interpretable features describing the data distribution, (2) select features that predict the target variable, and (3) generate a natural language interpretation of each feature (e.g., "mentions being surprised or shocked") using an LLM. Each interpretation serves as a hypothesis about what predicts the target variable. Compared to baselines, our method better identifies reference hypotheses on synthetic datasets (at least +0.06 in F1) and produces more predictive hypotheses on real datasets (~twice as many significant findings), despite requiring 1-2 orders of magnitude less compute than recent LLM-based methods. HypotheSAEs also produces novel discoveries on two well-studied tasks: explaining partisan differences in Congressional speeches and identifying drivers of engagement with online headlines.
Heterogeneous participation and allocation skews: when is choice “worth it”? [abstract | arxiv]
Nikhil Garg
ACM SIGecom Exchanges (Research Letter)
Invited; lightly reviewed by editors

A core ethos of the Economics and Computation (EconCS) community is that people have complex private preferences and information of which the central planner is unaware, but which an appropriately designed mechanism can uncover to improve collective decisionmaking. This ethos underlies the community’s largest deployed success stories, from stable matching systems to participatory budgeting. I ask: is this choice and information aggregation “worth it”? In particular, I discuss how such systems induce heterogeneous participation: those already relatively advantaged are, empirically, more able to pay time costs and navigate administrative burdens imposed by the mechanisms. I draw on three case studies, including my own work – complex democratic mechanisms, resident crowdsourcing, and school matching. I end with lessons for practice and research, challenging the community to help reduce participation heterogeneity and design and deploy mechanisms that meet a “best of both worlds” north star: use preferences and information from those who choose to participate, but provide a “sufficient” quality of service to those who do not.

2024

A Bayesian Spatial Model to Correct Under-Reporting in Urban Crowdsourcing [abstract | official link | arxiv | code & data]
Gabriel Agostini, Emma Pierson, and Nikhil Garg
AAAI Conference on Artificial Intelligence (AAAI‘24) (Oral Presentation)

Decision-makers often observe the occurrence of events through a reporting process. City governments, for example, rely on resident reports to find and then resolve urban infrastructural problems such as fallen street trees, flooded basements, or rat infestations. Without additional assumptions, there is no way to distinguish events that occur but are not reported from events that truly did not occur – a fundamental problem in settings with positive-unlabeled data. Because disparities in reporting rates correlate with resident demographics, addressing incidents only on the basis of reports leads to systematic neglect in neighborhoods that are less likely to report events. We show how to overcome this challenge by leveraging the fact that events are spatially correlated. Our framework uses a Bayesian spatial latent variable model to infer event occurrence probabilities and applies it to storm-induced flooding reports in New York City, further pooling results across multiple storms. We show that a model accounting for under-reporting and spatial correlation predicts future reports more accurately than other models, and further induces a more equitable set of inspections: its allocations better reflect the population and provide equitable service to non-white and lower-income areas. This finding reflects heterogeneous reporting behavior learned by the model: reporting rates are higher in Census tracts with higher populations, proportions of white residents, and proportions of owner-occupied households. Our work lays the groundwork for more equitable proactive government services, even with disparate reporting behavior.
Identifying and Addressing Disparities in Public Libraries with Bayesian Latent Variable Modeling [abstract | official link]
Zhi Liu, Sarah Rankin, and Nikhil Garg
AAAI Conference on Artificial Intelligence (AAAI‘24)

Public libraries are an essential public good. We ask: are urban library systems providing equitable service to all residents, in terms of the books they have access to and check out? If not, what causes disparities: heterogeneous book collections, resident behavior and access, and/or operational policies? Existing methods leverage only system-level outcome data (such as overall checkouts per branch), and so cannot distinguish between these factors. As a result, it is difficult to use their results to guide interventions to increase equitable access. We propose a Bayesian framework to characterize book checkout behavior across multiple branches of a library system, learning heterogeneous book popularity, overall branch demand, and usage of the online hold system, while controlling for book availability. In collaboration with the New York Public Library, we apply our framework to granular data consisting of over 400,000 checkouts during 2022. We first show that our model significantly out-performs baseline methods in predicting checkouts at the book-branch level. Next, we study spatial and socioeconomic disparities. We show that disparities are largely driven by disparate use of the online holds system, which allows library patrons to receive books from any other branch through an online portal. This system thus leads to a large outflow of popular books from branches in lower income neighborhoods to those in high income ones. Finally, we illustrate the use of our model and insights to quantify the impact of potential interventions, such as changing how books are internally routed between branches to fulfill hold requests.
Domain constraints improve risk prediction when outcome data is missing [abstract | arxiv | official link]
Sidhika Balachandar, Nikhil Garg, and Emma Pierson
International Conference on Learning Representations (ICLR‘24)

Machine learning models are often trained to predict the outcome resulting from a human decision. For example, if a doctor decides to test a patient for disease, will the patient test positive? A challenge is that the human decision censors the outcome data: we only observe test outcomes for patients doctors historically tested. Untested patients, for whom outcomes are unobserved, may differ from tested patients along observed and unobserved dimensions. We propose a Bayesian model class which captures this setting. The purpose of the model is to accurately estimate risk for both tested and untested patients. Estimating this model is challenging due to the wide range of possibilities for untested patients. To address this, we propose two domain constraints which are plausible in health settings: a prevalence constraint, where the overall disease prevalence is known, and an expertise constraint, where the human decision-maker deviates from purely risk-based decision-making only along a constrained feature set. We show theoretically and on synthetic data that domain constraints improve parameter inference. We apply our model to a case study of cancer risk prediction, showing that the model’s inferred risk predicts cancer diagnoses, its inferred testing policy captures known public health policies, and it can identify suboptimalities in test allocation. Though our case study is in healthcare, our analysis reveals a general class of domain constraints which can improve model estimation in many settings.
Reconciling the accuracy-diversity trade-off in recommendations [abstract | arxiv | official link]
Kenny Peng, Manish Raghavan, Emma Pierson, Jon Kleinberg, and Nikhil Garg
The ACM Web Conference (WWW‘24) (Oral Presentation)

In recommendation settings, there is an apparent trade-off between the goals of accuracy (to recommend items a user is most likely to want) and diversity (to recommend items representing a range of categories). As such, real-world recommender systems often explicitly incorporate diversity separately from accuracy. This approach, however, leaves a basic question unanswered: Why is there a trade-off in the first place? We analyze a stylized model of recommendations reconciling this trade-off. Accounting for a user’s capacity constraints (users do not typically make use of all the items that are recommended to them), optimal recommendations in our model are inherently diverse. Thus, accuracy and diversity appear misaligned because traditional accuracy metrics do not consider capacity constraints. Our model yields precise and interpretable characterizations of diversity in different settings, giving practical insights into the design of diverse recommendations.
Topics, Authors, and Institutions in Large Language Model Research: Trends from 17K arXiv Papers [abstract | arxiv | official link]
Rajiv Movva, Sidhika Balachandar, Kenny Peng, Gabriel Agostini, Nikhil Garg, and Emma Pierson
The North American Chapter of the Association for Computational Linguistics (NAACL‘24)

Large language models (LLMs) are dramatically influencing AI research, spurring discussions on what has changed so far and how to shape the field’s future. To clarify such questions, we analyze a new dataset of 16,979 LLM-related arXiv papers, focusing on recent trends in 2023 vs. 2018-2022. First, we study disciplinary shifts: LLM research increasingly considers societal impacts, evidenced by 20× growth in LLM submissions to the Computers and Society sub-arXiv. An influx of new authors – half of all first authors in 2023 – are entering from non-NLP fields of CS, driving disciplinary expansion. Second, we study industry and academic publishing trends. Surprisingly, industry accounts for a smaller publication share in 2023, largely due to reduced output from Google and other Big Tech companies; universities in Asia are publishing more. Third, we study institutional collaboration: while industry-academic collaborations are common, they tend to focus on the same topics that industry focuses on rather than bridging differences. The most prolific institutions are all US- or China-based, but there is very little cross-country collaboration. We discuss implications around (1) how to support the influx of new authors, (2) how industry trends may affect academics, and (3) possible effects of (the lack of) collaboration.
Wisdom and Foolishness of Noisy Matching Markets [abstract | arxiv]
Kenny Peng and Nikhil Garg
ACM Conference on Economics and Computation (EC‘24)

We consider a many-to-one matching market where colleges share true preferences over students but make decisions using only independent noisy rankings. Each student has a true value v, but each college c ranks the student according to an independently drawn estimated value v + X_c for X_c ∼ D. We ask a basic question about the resulting stable matching: How noisy is the set of matched students? Two striking effects can occur in large markets (i.e., with a continuum of students and a large number of colleges). When D is light-tailed, noise is fully attenuated: only the highest-value students are matched. When D is long-tailed, noise is fully amplified: students are matched uniformly at random. These results hold for any distribution of student preferences over colleges, and extend to when only subsets of colleges agree on true student valuations instead of the entire market. More broadly, our framework provides a tractable approach to analyze implications of imperfect preference formation in large markets.
Redesigning Service Level Agreements: Equity and Efficiency in City Government Operations [abstract | arxiv]
Zhi Liu and Nikhil Garg
ACM Conference on Economics and Computation (EC‘24)

We consider government service allocation – how the government allocates resources (e.g., maintenance of public infrastructure) over time. It is important to make these decisions efficiently and equitably – though these desiderata may conflict. In particular, we consider the design of Service Level Agreements (SLA) in city government operations: promises that incidents such as potholes and fallen trees will be responded to within a certain time. We model the problem of designing a set of SLAs as an optimization problem with equity and efficiency objectives under a queuing network framework; the city has two decision levers: how to allocate response budgets to different neighborhoods, and how to schedule responses to individual incidents. We: (1) Theoretically analyze a stylized model and find that the "price of equity" is small in realistic settings; (2) Develop a simulation-optimization framework to optimize policies in practice; (3) Apply our framework empirically using data from NYC, finding that: (a) status quo inspections are highly inefficient and inequitable compared to optimal ones, and (b) in practice, the equity-efficiency trade-off is not substantial: generally, inefficient policies are inequitable, and vice versa.
Equitable Congestion Pricing under the Markovian Traffic Model: An Application to Bogota [abstract | arxiv]
Alfredo Torrico, Natthawut Boonsiriphatthanajaroen, Nikhil Garg, Andrea Lodi, and Hugo Mainguy
ACM Conference on Economics and Computation (EC‘24)

Congestion pricing is used to raise revenues and reduce traffic and pollution. However, people have heterogeneous spatial demand patterns and willingness (or ability) to pay tolls, and so pricing may have substantial equity implications. We develop a data-driven approach to design congestion pricing given policymakers’ equity and efficiency objectives. First, algorithmically, we extend the Markovian traffic equilibrium setting introduced by Baillon & Cominetti (2008) to model heterogeneous populations and incorporate prices and outside options such as public transit. Second, we empirically evaluate various pricing schemes using data collected by an industry partner in the city of Bogota, one of the most congested cities in the world. We find that pricing personalized to each economic stratum can be substantially more efficient and equitable than uniform pricing; however, non-personalized but area-based pricing can recover much of the gap.
Ending Affirmative Action Harms Diversity Without Improving Academic Merit [abstract | arxiv | official link]
Jinsook Lee, Emma Harvey, Joyce Zhou, Nikhil Garg, Thorsten Joachims, and René Kizilcec
Equity and Access in Algorithms, Mechanisms, and Optimization (EAAMO ’24)
Media: Cornell News.

Each year, selective American colleges sort through tens of thousands of applications to identify a first-year class that displays both academic merit and diversity. In the 2023-2024 admissions cycle, these colleges faced unprecedented challenges to doing so. First, the number of applications has been steadily growing year-over-year. Second, test-optional policies that have remained in place since the COVID-19 pandemic limit access to key information that has historically been predictive of academic success. Most recently, longstanding debates over affirmative action culminated in the Supreme Court banning race-conscious admissions. Colleges have explored machine learning (ML) models to address the issues of scale and missing test scores, often via ranking algorithms intended to allow human reviewers to focus attention on ‘top’ applicants. However, the Court’s ruling will force changes to these models, which were previously able to consider race as a factor in ranking. There is currently a poor understanding of how these mandated changes will shape applicant ranking algorithms, and, by extension, admitted classes. We seek to address this by quantifying the impact of different admission policies on the applications prioritized for review. We show that removing race data from a previously developed applicant ranking algorithm reduces the diversity of the top-ranked pool of applicants without meaningfully increasing the academic merit of that pool. We further measure the impact of policy change on individuals by quantifying arbitrariness in applicant rank. We find that any given policy has a high degree of arbitrariness (i.e. at most 9% of applicants are consistently ranked in the top 20%), and that removing race data from the ranking algorithm increases arbitrariness in outcomes for most applicants.
Monoculture in Matching Markets [abstract | arxiv | official link]
Kenny Peng and Nikhil Garg
Neural Information Processing Systems (NeurIPS ‘24)

Algorithmic monoculture arises when many decision-makers rely on the same algorithm to evaluate applicants. An emerging body of work investigates possible harms of this kind of homogeneity, but has been limited by the challenge of incorporating market effects in which the preferences and behavior of many applicants and decision-makers jointly interact to determine outcomes. Addressing this challenge, we introduce a tractable theoretical model of algorithmic monoculture in a two-sided matching market with many participants. We use the model to analyze outcomes under monoculture (when decision-makers all evaluate applicants using a common algorithm) and under polyculture (when decision-makers evaluate applicants independently). All else equal, monoculture (1) selects less-preferred applicants when noise is well-behaved, (2) matches more applicants to their top choice, though individual applicants may be worse off depending on their value to decision-makers and risk tolerance, and (3) is more robust to disparities in the number of applications submitted.
User-item fairness tradeoffs in recommendations [abstract | arxiv | official link]
Sophie Greenwood, Sudalakshmee Chiniah, and Nikhil Garg
Neural Information Processing Systems (NeurIPS ‘24)

In the basic recommendation paradigm, the most relevant item is recommended to each user. This may result in some items receiving lower exposure than they "should"; to counter this, several algorithmic approaches have been developed to ensure item fairness. These approaches necessarily degrade recommendations for some users to improve outcomes for items, leading to user fairness concerns. In turn, a recent line of work has focused on developing algorithms for multi-sided fairness, to jointly optimize user fairness, item fairness, and overall recommendation quality. This induces the question: what is the tradeoff between these objectives, and what are the characteristics of (multi-objective) optimal solutions? Theoretically, we develop a model of recommendations with user, item, and overall utility objectives and characterize the solutions of fairness-constrained optimization. We identify two phenomena: (a) when user preferences are diverse, there is "free" item and user fairness; and (b) users whose preferences are misestimated can be especially disadvantaged by item fairness constraints. Empirically, we build a recommendation system for preprints on arXiv and implement our framework, measuring the phenomena in practice and showing how these phenomena inform the design of markets with recommendation systems-intermediated matching.

2023

Coarse race data conceals disparities in clinical risk score performance [abstract | arxiv | official link]
Rajiv Movva, Divya Shanmugam, Kaihua Hou, Priya Pathak, John Guttag, Nikhil Garg, and Emma Pierson
Machine Learning for Healthcare (ML4HC)

Healthcare data in the United States often records only a patient’s coarse race group: for example, both Indian and Chinese patients are typically coded as “Asian.” It is unknown, however, whether this coarse coding conceals meaningful disparities in the performance of clinical risk scores across granular race groups. Here we show that it does. Using data from 418K emergency department visits, we assess clinical risk score performance disparities across granular race groups for three outcomes, five risk scores, and four performance metrics. Across outcomes and metrics, we show that there are significant granular disparities in performance within coarse race categories. In fact, variation in performance metrics within coarse groups often exceeds the variation between coarse groups. We explore why these disparities arise, finding that outcome rates, feature distributions, and the relationships between features and outcomes all vary significantly across granular race categories. Our results suggest that healthcare providers, hospital systems, and machine learning researchers should strive to collect, release, and use granular race data in place of coarse race data, and that existing analyses may significantly underestimate racial disparities in performance.
Interface Design to Mitigate Inflation in Recommender Systems [abstract | official link]
Rana Shahout, Yehonatan Peisakhovsky, Sasha Stoikov, and Nikhil Garg
ACM Conference on Recommender Systems (RecSys ’23 Short paper)

Recommendation systems rely on user-provided data to learn about item quality and provide personalized recommendations. An implicit assumption when aggregating ratings into item quality is that ratings are strong indicators of item quality. In this work, we test this assumption using data collected from a music discovery application. Our study focuses on two factors that cause rating inflation: heterogeneous user rating behavior and the dynamics of personalized recommendations. We show that rating behavior varies substantially by user, leading to item quality estimates that reflect the users who rated an item more than the item quality itself. Additionally, items that are more likely to be shown via personalized recommendations can experience a substantial increase in their exposure and potential bias toward them. To mitigate these effects, we analyze the results of a randomized controlled trial in which the rating interface was modified. The test resulted in a substantial improvement in user rating behavior and a reduction in item quality inflation. These findings highlight the importance of carefully considering the assumptions underlying recommendation systems and designing interfaces that encourage accurate rating behavior.
Supply-Side Equilibria in Recommender Systems [abstract | arxiv | official link]
Meena Jagadeesan, Nikhil Garg, and Jacob Steinhardt
Neural Information Processing Systems (NeurIPS ‘23)

Digital recommender systems such as Spotify and Netflix affect not only consumer behavior but also producer incentives: producers seek to supply content that will be recommended by the system. But what content will be produced? In this paper, we investigate the supply-side equilibria in content recommender systems. We model users and content as D-dimensional vectors, and recommend the content that has the highest dot product with each user. The main features of our model are that the producer decision space is high-dimensional and the user base is heterogeneous. This gives rise to new qualitative phenomena at equilibrium: First, the formation of genres, where producers specialize to compete for subsets of users. Using a duality argument, we derive necessary and sufficient conditions for this specialization to occur. Second, we show that producers can achieve positive profit at equilibrium, which is typically impossible under perfect competition. We derive sufficient conditions for this to occur, and show it is closely connected to specialization of content. In both results, the interplay between the geometry of the users and the structure of producer costs influences the structure of the supply-side equilibria. At a conceptual level, our work serves as a starting point to investigate how recommender systems shape supply-side competition between producers.
Quantifying Spatial Under-reporting Disparities in Resident Crowdsourcing [abstract | arxiv | official link | talk | code & data]
Zhi Liu, Uma Bhandaram, and Nikhil Garg
Nature Computational Science
Conference version published in ACM Conference on Economics and Computation (EC‘22), titled “Equity in Resident Crowdsourcing: Measuring Under-reporting without Ground Truth Data”
Media: Cornell News.

Modern city governance relies heavily on crowdsourcing to identify problems such as downed trees and power lines. A major concern is that residents do not report problems at the same rates, with heterogeneous reporting delays directly translating to downstream disparities in how quickly incidents can be addressed. Here we develop a method to identify reporting delays without using external ground-truth data. Our insight is that the rates at which duplicate reports are made about the same incident can be leveraged to disambiguate whether an incident has occurred by investigating its reporting rate once it has occurred. We apply our method to over 100,000 resident reports made in New York City and to over 900,000 reports made in Chicago, finding that there are substantial spatial and socioeconomic disparities in how quickly incidents are reported. We further validate our methods using external data and demonstrate how estimating reporting delays leads to practical insights and interventions for a more equitable, efficient government service.

2022

Strategic Ranking [abstract | arxiv | official link]
Lydia Liu, Nikhil Garg, and Christian Borgs
International Conference on Artificial Intelligence and Statistics (AISTATS‘22)

Strategic classification studies the design of a classifier robust to the manipulation of input by strategic individuals. However, the existing literature does not consider the effect of competition among individuals as induced by the algorithm design. Motivated by constrained allocation settings such as college admissions, we introduce strategic ranking, in which the (designed) individual reward depends on an applicant’s post-effort rank in a measurement of interest. Our results illustrate how competition among applicants affects the resulting equilibria and model insights. We analyze how various ranking reward designs trade off applicant, school, and societal utility and in particular how ranking design can counter inequities arising from disparate access to resources to improve one’s measured score: We find that randomization in the ranking reward design can mitigate two measures of disparate impact, welfare gap and access, whereas non-randomization may induce a high level of competition that systematically excludes a disadvantaged group.
Fair ranking: a critical review, challenges, and future directions [abstract | arxiv | official link]
Gourab K Patro, Lorenzo Porcaro, Laura Mitchell, Qiuyue Zhang, Meike Zehlike, and Nikhil Garg
ACM Conference on Fairness, Accountability, and Transparency (FAccT‘22)
This work was written as part of a distributed, student-led working group of Mechanism Design for Social Good

Ranking, recommendation, and retrieval systems are widely used in online platforms and other societal systems, including e-commerce, media-streaming, admissions, gig platforms, and hiring. In the recent past, a large "fair ranking" research literature has been developed around making these systems fair to the individuals, providers, or content that are being ranked. Most of this literature defines fairness for a single instance of retrieval, or as a simple additive notion for multiple instances of retrievals over time. This work provides a critical overview of this literature, detailing the often context-specific concerns that such an approach misses: the gap between high ranking placements and true provider utility, spillovers and compounding effects over time, induced strategic incentives, and the effect of statistical uncertainty. We then provide a path forward for a more holistic and impact-oriented fair ranking research agenda, including methodological lessons from other fields and the role of the broader stakeholder community in overcoming data bottlenecks and designing effective regulatory environments.
Trucks Don’t Mean Trump: Diagnosing Human Error in Image Analysis [abstract | arxiv | official link]
J.D. Zamfirescu-Pereira, Jerry Chen, Emily Wen, Allison Koenecke, Nikhil Garg, and Emma Pierson
ACM Conference on Fairness, Accountability, and Transparency (FAccT‘22)
Media: Cornell News.

Algorithms provide powerful tools for detecting and dissecting human bias and error. Here, we develop machine learning methods to analyze how humans err in a particular high-stakes task: image interpretation. We leverage a unique dataset of 16,135,392 human predictions of whether a neighborhood voted for Donald Trump or Joe Biden in the 2020 US election, based on a Google Street View image. We show that by training a machine learning estimator of the Bayes optimal decision for each image, we can provide an actionable decomposition of human error into bias, variance, and noise terms, and further identify specific features (like pickup trucks) which lead humans astray. Our methods can be applied to ensure that human-in-the-loop decision-making is accurate and fair and are also applicable to black-box algorithmic systems.
Combatting Gerrymandering with Social Choice: the Design of Multi-member Districts [abstract | arxiv]
Nikhil Garg, Wes Gurnee, David Rothschild, and David Shmoys
ACM Conference on Economics and Computation (EC‘22)
Media: Cornell Chronicle.

Every representative democracy must specify a mechanism under which voters choose their representatives. The most common mechanism in the United States – winner-take-all single-member districts – both enables substantial partisan gerrymandering and constrains ‘fair’ redistricting, preventing proportional representation in legislatures. We study the design of multi-member districts (MMDs), in which each district elects multiple representatives, potentially through a non-winner-takes-all voting rule. We carry out large-scale analyses for the U.S. House of Representatives under MMDs with different social choice functions, under algorithmically generated maps optimized for either partisan benefit or proportionality. Doing so requires efficiently incorporating predicted partisan outcomes – under various multi-winner social choice functions – into an algorithm that optimizes over an ensemble of maps. We find that with three-member districts using Single Transferable Vote, fairness-minded independent commissions would be able to achieve proportional outcomes in every state up to rounding, and advantage-seeking partisans would have their power to gerrymander significantly curtailed. Simultaneously, such districts would preserve geographic cohesion, an arguably important aspect of representative democracies. In the process, we open up a rich research agenda at the intersection of social choice and computational redistricting.

2021

Dropping Standardized Testing for Admissions: Differential Variance and Access [abstract | arxiv]
Nikhil Garg, Hannah Li, and Faidra Monachou
ACM Conference on Fairness, Accountability, and Transparency (FAccT‘21)
Also appeared in EAAMO‘21, with Best Student Paper Award
Appeared in the 2021 NBER Decentralization Conference

The University of California suspended through 2024 the requirement that applicants from California submit SAT scores, upending the major role standardized testing has played in college admissions. We study the impact of such decisions and their interplay with other interventions, such as affirmative action, on admitted class composition. More specifically, this paper develops a theoretical framework to study the effect of requiring test scores on academic merit and diversity in college admissions. The model has a college and a set of potential students. Each student has an unobserved noisy skill level, multiple observed application components, and an observed group membership. The college is Bayesian and maximizes an objective that depends on both diversity and merit. It estimates each applicant’s true skill level using the observed features, and then admits students with or without affirmative action. We characterize the trade-off between the (potentially positive) informational role of standardized testing in college admissions and its (negative) exclusionary nature. Dropping test scores may exacerbate disparities by decreasing the amount of information available for each applicant, especially those from non-traditional backgrounds. However, if there are substantial barriers to testing, removing the test improves both academic merit and diversity by increasing the size of the applicant pool. The overall effect of testing depends on both the variance of the test score noise and the number of people excluded by the test requirement. Finally, using application and transcript data from the University of Texas at Austin, we demonstrate how an admissions committee could measure the trade-off in practice.
Driver Surge Pricing [abstract | ssrn | code & data | talk | official link]
Nikhil Garg and Hamid Nazerzadeh
Management Science
(Conference version published in EC‘20.)

Ride-hailing marketplaces like Uber and Lyft use dynamic pricing, often called surge, to balance the supply of available drivers with the demand for rides. We study pricing mechanisms for such marketplaces from the perspective of drivers, presenting the theoretical foundation that has informed the design of Uber’s new additive driver surge mechanism. We present a dynamic stochastic model to capture the impact of surge pricing on driver earnings and their strategies to maximize such earnings. In this setting, some time periods (surge) are more valuable than others (non-surge), and so trips of different time lengths vary in the opportunity cost they impose on drivers. First, we show that multiplicative surge, historically the standard on ride-hailing platforms, is not incentive compatible in a dynamic setting. We then propose a structured, incentive-compatible pricing mechanism. This closed-form mechanism has a simple form and is well-approximated by Uber’s new additive surge mechanism. Finally, through both numerical analysis and real data from a ride-hailing marketplace, we show that additive surge is more approximately incentive compatible in practice than multiplicative surge, providing more stable earnings to drivers.
Test-optional Policies: Overcoming Strategic Behavior and Informational Gaps [abstract | arxiv | official link]
Zhi Liu and Nikhil Garg
AAAI/ACM Conference on Equity and Access in Algorithms, Mechanisms, and Optimization (EAAMO‘21)

Due to the Covid-19 pandemic, more than 500 US-based colleges and universities went “test-optional” for admissions and promised that they would not penalize applicants for not submitting test scores, part of a longer trend to rethink the role of testing in college admissions. However, it remains unclear how (and whether) a college can simultaneously use test scores for those who submit them, while not penalizing those who do not–and what that promise even means. We formalize these questions, and study how a college can overcome two challenges with optional testing: strategic applicants (when those with low test scores can pretend to not have taken the test), and informational gaps (it has more information on those who submit a test score than those who do not). We find that colleges can indeed do so, if and only if they are able to use information on who has test access and are willing to randomize admissions.
The Stereotyping Problem in Collaboratively Filtered Recommender Systems [abstract | arxiv | official link]
Wenshuo Guo, Karl Krauth, Michael I. Jordan, and Nikhil Garg
AAAI/ACM Conference on Equity and Access in Algorithms, Mechanisms, and Optimization (EAAMO‘21)

Recommender systems – and especially matrix factorization-based collaborative filtering algorithms – play a crucial role in mediating our access to online information. We show that such algorithms induce a particular kind of stereotyping: if preferences for a set of items are anti-correlated in the general user population, then those items may not be recommended together to a user, regardless of that user’s preferences and ratings history. First, we introduce a notion of joint accessibility, which measures the extent to which a set of items can jointly be accessed by users. We then study joint accessibility under the standard factorization-based collaborative filtering framework, and provide theoretical necessary and sufficient conditions when joint accessibility is violated. Moreover, we show that these conditions can easily be violated when the users are represented by a single feature vector. To improve joint accessibility, we further propose an alternative modelling fix, which is designed to capture the diverse multiple interests of each user using a multi-vector representation. We conduct extensive experiments on real and simulated datasets, demonstrating the stereotyping problem with standard single-vector matrix factorization models.

2020

Fair Allocation through Selective Information Acquisition [abstract | arxiv | official link]
William Cai, Johann Gaebler, Nikhil Garg, and Sharad Goel
AAAI/ACM Conference on Artificial Intelligence, Ethics, and Society (AIES‘20)

Public and private institutions must often allocate scarce resources under uncertainty. Banks, for example, extend credit to loan applicants based in part on their estimated likelihood of repaying a loan. But when the quality of information differs across candidates (e.g., if some applicants lack traditional credit histories), common lending strategies can lead to disparities across groups. Here we consider a setting in which decision makers, before allocating resources, can choose to spend some of their limited budget further screening select individuals. We present a computationally efficient algorithm for deciding whom to screen that maximizes a standard measure of social welfare. Intuitively, decision makers should screen candidates on the margin, for whom the additional information could plausibly alter the allocation. We formalize this idea by showing the problem can be reduced to solving a series of linear programs. Both on synthetic and real-world datasets, this strategy improves utility, illustrating the value of targeted information acquisition in such decisions. Further, when there is social value for distributing resources to groups for whom we have a priori poor information, like those without credit scores, our approach can substantially improve the allocation of limited assets.
Designing Marketplaces and Civic Engagement Platforms: Learning, Incentives, and Pricing [abstract | official link | summary | talk | short talk]
Nikhil Garg
PhD Dissertation, Stanford University
INFORMS George Dantzig Dissertation Award, 2020
ACM SIGecom dissertation award (Honorable mention), 2021

Platforms increasingly mediate interactions between people: both helping us find work and transportation, and supporting our civic society through discussion and decision-making. Principled system design requires formalizing the platform’s objective and understanding the incentives, behavioral tendencies, and capabilities of participants; in turn, the design influences participant behavior. In this dissertation, I describe work designing platforms in two domains – two-sided marketplaces and civic engagement platforms – combining both theoretical and empirical analyses of such systems. First, I consider the design of surge pricing that is incentive compatible for drivers in ride-hailing platforms. Second, I tackle rating system inflation and design on online platforms. Finally, I study the design and deployment of systems for participatory budgeting. The work in this dissertation has informed deployments at Uber, a large online labor platform, and in participatory budgeting elections across the U.S.
Designing Informative Rating Systems: Evidence from an Online Labor Market [abstract | arxiv | talk | official link]
Nikhil Garg and Ramesh Johari
Manufacturing & Service Operations Management
Media: New York Times, Stanford Engineering magazine.
M&SOM student paper award (2nd place), 2020
(Conference version published in EC‘20.)

Platforms critically rely on rating systems to learn the quality of market participants. In practice, however, these ratings are often highly inflated, drastically reducing the signal available to distinguish quality. We consider two questions: First, can rating systems better discriminate quality by altering the meaning and relative importance of the levels in the rating system? And second, if so, how should the platform optimize these choices in the design of the rating system? We first analyze the results of a randomized controlled trial on an online labor market in which an additional question was added to the feedback form. Between treatment conditions, we vary the question phrasing and answer choices. We further run an experiment on Amazon Mechanical Turk with similar structure, to confirm the labor market findings. Our tests reveal that current inflationary norms can in fact be countered by re-anchoring the meaning of the levels of the rating system. In particular, scales that are positive-skewed and provide specific interpretations for what each label means yield rating distributions that are much more informative about quality. Second, we develop a theoretical framework to optimize the design of a rating system by choosing answer labels and their numeric interpretations in a manner that maximizes the rate of convergence to the true underlying quality distribution. Finally, we run simulations with an empirically calibrated model and use these to study the implications for optimal rating system design. Our simulations demonstrate that our modeling and optimization approach can substantially improve the quality of information obtained over baseline designs. Overall, our study illustrates that rating systems that are informative in practice can be designed, and demonstrates how to design them in a principled manner.
Markets for Public Decision-making [abstract | arxiv | official link]
Nikhil Garg, Ashish Goel, and Ben Plaut
Social Choice and Welfare
(Conference version published in WINE‘18.)

A public decision-making problem consists of a set of issues, each with multiple possible alternatives, and a set of competing agents, each with a preferred alternative for each issue. We study adaptations of market economies to this setting, focusing on binary issues. Issues have prices, and each agent is endowed with artificial currency that she can use to purchase probability for her preferred alternatives (we allow randomized outcomes). We first show that when each issue has a single price that is common to all agents, market equilibria can be arbitrarily bad. This negative result motivates a different approach. We present a novel technique called "pairwise issue expansion", which transforms any public decision-making instance into an equivalent Fisher market, the simplest type of private goods market. This is done by expanding each issue into many goods: one for each pair of agents who disagree on that issue. We show that the equilibrium prices in the constructed Fisher market yield a "pairwise pricing equilibrium" in the original public decision-making problem which maximizes Nash welfare. More broadly, pairwise issue expansion uncovers a powerful connection between the public decision-making and private goods settings; this immediately yields several interesting results about public decisions markets, and furthers the hope that we will be able to find a simple iterative voting protocol that leads to near-optimum decisions.

2019

Iterative Local Voting for Collective Decision-making in Continuous Spaces [abstract | demo | official link]
Nikhil Garg, Vijay Kamble, Ashish Goel, David Marn, and Kamesh Munagala
Journal of Artificial Intelligence Research (JAIR)
(Conference version published in WWW‘17.)

Many societal decision problems lie in high-dimensional continuous spaces not amenable to the voting techniques common for their discrete or single-dimensional counterparts. These problems are typically discretized before running an election or decided upon through negotiation by representatives. We propose an algorithm called Iterative Local Voting for collective decision-making in this setting. In this algorithm, voters are sequentially sampled and asked to modify a candidate solution within some local neighborhood of its current value, as defined by a ball in some chosen norm, with the size of the ball shrinking at a specified rate. We first prove the convergence of this algorithm under appropriate choices of neighborhoods to Pareto optimal solutions with desirable fairness properties in certain natural settings: when the voters’ utilities can be expressed in terms of some form of distance from their ideal solution, and when these utilities are additively decomposable across dimensions. In many of these cases, we obtain convergence to the societal welfare maximizing solution. We then describe an experiment in which we test our algorithm for the decision of the U.S. Federal Budget on Mechanical Turk with over 2,000 workers, employing neighborhoods defined by various L-Norm balls. We make several observations that inform future implementations of such a procedure.
We have a demo of our Mechanical Turk experiment available live here. It can be used as follows:
1. If the URL is entered without any parameters, it uses the current radius (based on previous uses of the demo, going down by $1/N$) and uses the $\mathcal{L}^2$ mechanism.
2. To set the mechanism, navigate to http://54.183.140.235/mechanism/[option]/, replacing [option] with one of l1, l2, linf, or full for the respective mechanism.
3. To set the radius, navigate to http://54.183.140.235/radius/[number]/, replacing [number] with any integer. This option resets the starting radius for the specific mechanism, which will go down by $1/N$ in subsequent accesses.
4. To set both the mechanism and the radius, navigate to http://54.183.140.235/radius/[number]/mechanism/[option]/, with the above options.
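The URL patterns above can be assembled programmatically. The following is a minimal sketch, not part of the demo itself: the helper name `demo_url` is hypothetical, and the radius-only path is assumed to mirror the `/radius/[number]/` segment of the combined form in item 4. It only builds the URL string and does not contact the server.

```python
from typing import Optional

# Base address and mechanism options as listed in the demo instructions.
BASE_URL = "http://54.183.140.235"
MECHANISMS = ("l1", "l2", "linf", "full")


def demo_url(mechanism: Optional[str] = None, radius: Optional[int] = None) -> str:
    """Build a demo URL (hypothetical helper, for illustration only).

    With no arguments this returns the bare URL, which per the instructions
    uses the current radius and the L2 mechanism.
    """
    if mechanism is not None and mechanism not in MECHANISMS:
        raise ValueError(f"mechanism must be one of {MECHANISMS}")
    if radius is not None and (isinstance(radius, bool) or not isinstance(radius, int)):
        raise ValueError("radius must be an integer")

    parts = [BASE_URL]
    # Assumption: the radius segment comes first, as in the combined form.
    if radius is not None:
        parts += ["radius", str(radius)]
    if mechanism is not None:
        parts += ["mechanism", mechanism]
    return "/".join(parts) + "/"


if __name__ == "__main__":
    print(demo_url())                               # default settings
    print(demo_url(mechanism="l1"))                 # mechanism only
    print(demo_url(radius=5))                       # radius only (assumed path)
    print(demo_url(mechanism="linf", radius=5))     # both
```

Validating the mechanism name up front keeps a typo from silently producing a URL the demo may not recognize.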
Designing Optimal Binary Rating Systems [abstract | official link]
Nikhil Garg and Ramesh Johari
International Conference on Artificial Intelligence and Statistics (AISTATS‘19)

Modern online platforms rely on effective rating systems to learn about items. We consider the optimal design of rating systems that collect binary feedback after transactions. We make three contributions. First, we formalize the performance of a rating system as the speed with which it recovers the true underlying ranking on items (in a large deviations sense), accounting for both items’ underlying match rates and the platform’s preferences. Second, we provide an efficient algorithm to compute the binary feedback system that yields the highest such performance. Finally, we show how this theoretical perspective can be used to empirically design an implementable, approximately optimal rating system, and validate our approach using real-world experimental data collected on Amazon Mechanical Turk.
Analyzing Polarization in Social Media: Method and Application to Tweets on 21 Mass Shootings [abstract | arxiv | code & data | official link]
Dorottya Demszky, Nikhil Garg, Rob Voigt, James Zou, Jesse Shapiro, Matthew Gentzkow, and Dan Jurafsky
Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL‘19)
Media: Washington Post, Stanford News.

We provide an NLP framework to uncover four linguistic dimensions of political polarization in social media: topic choice, framing, affect and illocutionary force. We quantify these aspects with existing lexical methods, and propose clustering of tweet embeddings as a means to identify salient topics for analysis across events; human evaluations show that our approach generates more cohesive topics than traditional LDA-based models. We apply our methods to study 4.4M tweets on 21 mass shootings. We provide evidence that the discussion of these events is highly polarized politically and that this polarization is primarily driven by partisan differences in framing rather than topic choice. We identify framing devices, such as grounding and the contrasting use of the terms "terrorist" and "crazy", that contribute to polarization. Results pertaining to topic choice, affect and illocutionary force suggest that Republicans focus more on the shooter and event-specific facts (news) while Democrats focus more on the victims and call for policy changes. Our work contributes to a deeper understanding of the way group divisions manifest in language and to computational methods for studying them.
Who is in Your Top Three? Optimizing Learning in Elections with Many Candidates [abstract | arxiv | official link]
Nikhil Garg, Lodewijk Gelauff, Sukolsak Sakshuwong, and Ashish Goel
AAAI Conference on Human Computation and Crowdsourcing (HCOMP‘19)

Elections and opinion polls often have many candidates, with the aim to either rank the candidates or identify a small set of winners according to voters’ preferences. In practice, voters do not provide a full ranking; instead, each voter provides their favorite K candidates, potentially in ranked order. The election organizer must choose K and an aggregation rule. We provide a theoretical framework to make these choices. Each K-Approval or K-partial ranking mechanism (with a corresponding positional scoring rule) induces a learning rate for the speed at which the election correctly recovers the asymptotic outcome. Given the voter choice distribution, the election planner can thus identify the rate optimal mechanism. Earlier work in this area provides coarse order-of-magnitude guarantees which are not sufficient to make such choices. Our framework further resolves questions of when randomizing between multiple mechanisms may improve learning, for arbitrary voter noise models. Finally, we use data from 5 large participatory budgeting elections that we organized across several US cities, along with other ranking data, to demonstrate the utility of our methods. In particular, we find that historically such elections have set K too low and that picking the right mechanism can be the difference between identifying the correct winner with only an 80% probability or a 99.9% probability after 500 voters.
Deliberative Democracy with the Online Deliberation Platform [official link]
James Fishkin, Nikhil Garg, Lodewijk Gelauff, Ashish Goel, Kamesh Munagala, Sukolsak Sakshuwong, Alice Siu, and Sravya Yandamuri
AAAI Conference on Human Computation and Crowdsourcing Demo Track

2018

Word Embeddings Quantify 100 Years of Gender and Ethnic Stereotypes [abstract | official link | code & data | talk]
Nikhil Garg, Londa Schiebinger, Dan Jurafsky, and James Zou
Proceedings of the National Academy of Sciences (PNAS)
Media: Stanford News (and EE department), Science Magazine, Smithsonian Magazine (in print), The World Economic Forum, Futurity, etc.

Word embeddings are a powerful machine-learning framework that represents each English word by a vector. The geometric relationship between these vectors captures meaningful semantic relationships between the corresponding words. In this paper, we develop a framework to demonstrate how the temporal dynamics of the embedding help to quantify changes in stereotypes and attitudes toward women and ethnic minorities in the 20th and 21st centuries in the United States. We integrate word embeddings trained on 100 years of text data with the US Census to show that changes in the embedding track closely with demographic and occupation shifts over time. The embedding captures societal shifts - e.g., the women’s movement in the 1960s and Asian immigration into the United States - and also illuminates how specific adjectives and occupations became more closely associated with certain populations over time. Our framework for temporal analysis of word embedding opens up a fruitful intersection between machine learning and quantitative social science.

2015

Use of Electroencephalography and Galvanic Skin Response in the Prediction of an Attentive Cognitive State [abstract | pdf]
Beth Lewandowski, Kier Fortier, Nikhil Garg, Victor Rielly, Jeff Mackey, Tristan Hearn, Angela Harrivel, and Bradford Fenton
Health and Human Performance Research Summit, Dayton, CO

As part of an effort aimed at improving aviation safety, the Crew State Monitoring Element of the NASA Vehicle Systems Safety Technologies Project is developing a monitoring system capable of detecting cognitive states that may be associated with unsafe piloting conditions. The long term goal is a real-time, integrated system that uses multiple physiological sensing modalities to detect multiple cognitive states with high accuracy, which can be used to help optimize human performance. Prior to realizing an integrated system, individual sensing modalities are being investigated, including the use of electroencephalographic (EEG) and galvanic skin response (GSR) signals, in the determination of an attentive or inattentive state. EEG and GSR data are collected during periods of rest and as subjects perform psychological tests including the psychomotor vigilance test, the Mackworth clock test and the attention network test. Subjects also perform tasks designed to simulate piloting tasks within the NASA multi-attribute task battery (MATB-II) program. The signals are filtered, the artifacts are rejected and the power spectral density (PSD) of the signals is found. Comparisons of the PSD between the rest and test blocks are made, along with the change in PSD over the time course of the blocks. Future work includes the collection of heart rate data and the investigation of heart rate variability as an additional measure to use in the prediction of attentive state, as well as the investigation of additional EEG signal processing methods such as source localization, multi-scale entropy and coherence measures. Preliminary results will be presented to highlight the methods used and to discuss our hypotheses. The challenges associated with realizing a real-time, accurate, multi-modal, cognitive state monitoring system are numerous. A discussion of some of the challenges will be provided, including real-time artifact rejection methods, quantification of inter- and intra-subject variability, determination of what information within the signals provides the best measurement of attention and determination of how information from the different modalities can be integrated to improve the overall accuracy of the system.
Downlink and Uplink User Association in Dense Next-Generation Wireless Networks [abstract | official link]
Nikhil Garg
Bachelor’s Thesis, University of Texas at Austin.

5G, the next-generation cellular network, must serve an aggregate data rate of 1000 times that of current 4G networks while reducing data latency by a factor of ten. To meet these requirements, 5G networks will be far denser than existing networks, and small cells (femtocells and picocells) will augment network capacity. However, dense networks raise questions regarding interference, user association, and handoff between base stations. Where recent papers have demonstrated that interference from small cells will not be prohibitive under multi-slope path loss models, this thesis describes how the use of different path loss models affects the design of such dense, multi-tier networks. This thesis concludes that the gains realized by downlink biasing and uplink/downlink decoupling are strongly dependent on the path loss model assumed and the density differential between base station tiers. Furthermore, this thesis argues that the gains from uplink/downlink decoupling are reduced by a factor of 50% when optimal biasing for the downlink is used.
Fair Use and Innovation in Unlicensed Wireless Spectrum: LTE unlicensed and Wi-Fi in the 5 GHz unlicensed band [pdf]
Nikhil Garg
IEEE-USA Journal of Technology and Public Policy
Impact of Dual Slope Path Loss on User Association in HetNets [abstract | official link]
Nikhil Garg, Sarabjot Singh, and Jeffrey Andrews
IEEE Globecom Workshop

Intelligent load balancing is essential to fully realize the benefits of dense heterogeneous networks. Current techniques have largely been studied with single slope path loss models, though multi-slope models are known to more closely match real deployments. This paper develops insight into the performance of biasing and uplink/downlink decoupling for user association in HetNets with dual slope path loss models. It is shown that dual slope path loss models change the tradeoffs inherent in biasing and reduce gains from both biasing and uplink/downlink decoupling. The results show that with the dual slope path loss models, the bias maximizing the median rate is not optimal for other users, e.g., edge users. Furthermore, optimal downlink biasing is shown to realize most of the gains from downlink-uplink decoupling. Moreover, the user association gains in dense networks are observed to be quite sensitive to the path loss exponent beyond the critical distance in a dual slope model.
