The Ghost Files

US historians have long complained about gaps in the National Archives. Can big-data analysis show what kinds of information the government is keeping classified?

by David J. Craig Published Winter 2013-14

“There was all sorts of stuff that should have been released,” says Connelly, a slight man of forty-five with a boyish smile. “But the vast majority of it was still stuck in the pipeline somewhere. So on the one hand we have this amazing potential to study the inner workings of our government with a level of detail that is astonishing. Yet we’re still waiting for the floodgates to open.”

In early 2012, Connelly put aside his research on the Cold War and began studying US secrecy policy. He learned everything he could about how federal records are created, maintained, and released to the public. He learned that since the 1970s, the government’s budget for reviewing and declassifying sensitive documents had failed to keep pace with the production of new ones. The backlog of secrets had grown significantly following the September 11, 2001, attacks, when federal employees were instructed to be more cautious in deciding whether to release old documents. After Barack Obama ’83CC became president, the glut shrank a bit, as government censors were told to relax their standards. By the end of Obama’s first term, though, progress plateaued and the size of the backlog stabilized at about 360 million pages.

Then Connelly had an idea: could he use data mining to infer what types of information were being left out of the public record? In theory, this seemed plausible, if he could compile enough materials to work with. He figured he could start by asking Columbia Libraries to give him special access to several commercial databases of federal records that the University licenses from academic publishers. He could then download a wealth of material from government websites. Maybe he could even gather up documents that fellow scholars, journalists, and citizens had acquired directly from the government under the Freedom of Information Act (FOIA). No one had ever tried to analyze the entire corpus of government records as one big database before. The promise of data mining now made it seem like a worthwhile endeavor to Connelly. He thought that if he were to recruit an interdisciplinary team of data analysts and fellow historians, he might create the first system for highlighting gaps in the National Archives. Perhaps this would even shame the government into releasing more classified materials.

“I thought if this were possible, it would be the most important thing I could do,” he says. “I’d go back to writing books later.”

Connelly would soon cast a new light on why the US government was slow in releasing its secrets. In doing so, he would thrust himself into a debate that had previously been taking place behind closed doors — a debate about whether the free flow of information and national security are on a collision course. 

Toeholds and teamwork

In a small apartment in Harlem, a young mathematician named Daniel Krasner ’10GSAS sits at his kitchen table, staring into the soft blue light of his laptop. On the screen is a line graph depicting the number of teleconferences that Henry Kissinger participated in each day while serving as Richard Nixon’s secretary of state. “You see this spike here in late 1973?” says Krasner, pointing to a brief period when Kissinger was holding fifty to sixty teleconferences a day. “That has a pretty obvious explanation — it’s during the Yom Kippur war. But what about these spikes, here in 1975, or these in 1976? They could be worth looking into.”

Krasner, who earned a PhD in mathematics at Columbia, is among a half dozen computer scientists, mathematicians, and statisticians now working with Connelly on a multimedia research project they call the Declassification Engine. For the past year, this team has been gathering up large numbers of federal documents and creating analytic tools to detect anomalies in the collections. Several of the tools are on the project’s website and available for anyone to use. The one Krasner is developing is intended to find evidentiary traces of important historical episodes — a diplomatic crisis, say, or preparations for a military strike — that scholars until now have failed to notice. The Columbia researchers suspect that by spotting something as subtle as an uptick in a diplomat’s telephone activity they may be able to reveal the existence of historical episodes that the US government has largely suppressed from the public record.
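To make the idea of spotting an uptick concrete, here is a minimal sketch of one simple way to flag unusual days in a series of daily counts. It is an illustration only, not the Declassification Engine's actual method: the function name, the seven-day window, the threshold, and the sample numbers are all assumptions invented for this example.

```python
from statistics import mean, stdev

def find_spikes(daily_counts, window=7, threshold=3.0):
    """Return the indices of days whose count sits more than `threshold`
    standard deviations above the mean of the preceding `window` days.

    Hypothetical helper for illustration; not the project's own tool.
    """
    spikes = []
    for i in range(window, len(daily_counts)):
        history = daily_counts[i - window:i]
        mu = mean(history)
        sigma = stdev(history)
        # Guard against a perfectly flat window, where sigma would be zero.
        if sigma == 0:
            sigma = 1.0
        if (daily_counts[i] - mu) / sigma > threshold:
            spikes.append(i)
    return spikes

# Invented counts for illustration only, not real archival data.
counts = [10, 12, 11, 9, 13, 10, 12, 11, 55, 58, 12, 10]
print(find_spikes(counts))
```

A trailing window means that only the first day of a sustained burst is flagged, because the burst itself quickly inflates the baseline. A real analysis would likely need a more robust baseline and a historian's judgment about which flagged days deserve a closer look.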

“If you can make out something happening in the shadows, then we can ask: does it seem curious that little information about this event is available in the public record?” says David Allen, a PhD candidate in history at Columbia who is working on the project.

Some of the material that Krasner is analyzing comes from a collection of 1.1 million telegrams, airgrams, telephone transcripts, and other communication records of American diplomats from the mid-1970s. The database, called the Central Foreign Policy Files, is available today on the National Archives’ website, where people can search its contents in rudimentary ways. Connelly, with the help of Columbia’s Digital Humanities Center, got his hands on the raw text files from the government. Now he and his colleagues are picking apart the documents using their own software.

“We can also analyze all of the language in these documents as what we call a ‘bag of words,’” says Krasner. “By seeing what terms tend to occur together in the same documents at certain times, we could spot interesting episodes.”
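The "bag of words" approach Krasner describes can be sketched in a few lines. The version below is a hedged illustration, assuming each document is a plain string: it reduces every document to its set of words, ignoring order, and counts how often each pair of terms appears in the same document. The function names and the three sample texts are invented for this example and do not come from the Central Foreign Policy Files.

```python
from collections import Counter
from itertools import combinations
import re

def bag_of_words(text):
    """Reduce a document to word counts, discarding word order."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cooccurrence(documents):
    """Count how many documents contain each pair of distinct terms."""
    pairs = Counter()
    for text in documents:
        # Sort the distinct terms so each pair is always stored in one order.
        terms = sorted(bag_of_words(text))
        pairs.update(combinations(terms, 2))
    return pairs

# Invented sample texts for illustration only.
docs = [
    "ceasefire talks in cairo",
    "cairo ceasefire delayed",
    "trade talks resume",
]
pairs = cooccurrence(docs)
print(pairs.most_common(3))
```

To look for terms that cluster "at certain times," one could run the same count separately on the documents from each week or month and compare the results across periods.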
