Undergraduate Upends A 40-Year-Old Data Science Conjecture

In a groundbreaking 1985 publication, computer scientist Andrew Yao, who later received the prestigious A.M. Turing Award, posited that among hash tables with specific characteristics, the most effective method for locating a single element or finding an empty slot is through random exploration—an approach referred to as uniform probing. He further claimed that in the worst-case scenario, particularly when searching for the last vacant position, one could never perform better than x. For four decades, the consensus among computer scientists has been that Yao’s hypothesis held true.

Krapivin, however, was undeterred by established beliefs, primarily because he was unaware of them. “I approached this without any knowledge of Yao’s conjecture,” he explained. His experiments using miniature pointers led to the development of a new type of hash table—one that does not depend on uniform probing. For this innovative hash table, the time taken for the worst-case queries and insertions is proportional to (log x)², which is significantly quicker than x. This finding directly refuted Yao’s conjecture. With the assistance of Farach-Colton and Kuszmaul, Krapivin established that (log x)² represents the optimal, unsurpassable threshold for the widely recognized class of hash tables discussed by Yao.

“This discovery is remarkable as it tackles and resolves such a foundational issue,” remarked Guy Blelloch from Carnegie Mellon.

“It’s not merely that they disproved [Yao’s conjecture], but they’ve also pinpointed the best possible answer to his question,” noted Sepehr Assadi of the University of Waterloo. “We could have spent another 40 years without knowing the right answer.”

Krapivin at the King’s College Bridge at the University of Cambridge. His innovative hash table can retrieve and store data more quickly than researchers ever expected.

Photo credit: Phillip Ammon for Quanta Magazine

Beyond challenging Yao’s conjecture, the new research also unveils what many consider an even more impressive breakthrough. This pertains to a related but slightly different scenario: In 1985, Yao not only examined the worst-case times for queries but also the average time taken across all possible queries. He established that hash tables with certain properties—including those categorized as “greedy,” meaning that new elements must be positioned in the first available slot—could never achieve an average time superior to log x.

Together, Farach-Colton, Krapivin, and Kuszmaul sought to determine if that same restriction applied to non-greedy hash tables. They demonstrated that it did not by presenting a counterexample—an innovative non-greedy hash table with an average query time significantly better than log x. Remarkably, this time does not depend on x at all. “You arrive at a number,” Farach-Colton described, “a constant that remains unaffected by how full the hash table is.” The realization that a constant average query time could be attained, irrespective of the hash table’s fullness, was entirely unforeseen—even by the researchers themselves.

The implications of the team’s findings may not offer immediate practical applications, but as Conway indicated, that does not diminish their value. “Understanding these types of data structures is essential. You never know when insights like this will pave the way for improved practical solutions.”

This article is reprinted with permission from Quanta Magazine, an editorially independent publication of the Simons Foundation dedicated to enhancing public understanding of science through coverage of research developments and trends in mathematics and the physical and life sciences.