The Ghost Files

US historians have long complained about gaps in the National Archives. Can big-data analysis show what kinds of information the government is keeping classified?

by David J. Craig Published Winter 2013-14

The Central Foreign Policy Files data set is an unusual collection in that it covers only material from 1973 — which is when the State Department implemented its first electronic records system — to 1976 — which is as far as the department’s employees have progressed in an ongoing effort to translate the files into a format that is Internet-friendly. But the collection has a couple of key advantages. The first is that it is comprehensive for its time period, containing all records of a particular type. Most collections of government documents are, by contrast, curated by archivists and editors to contain only materials thought to be of particular interest to scholars. The inclusiveness of the Central Foreign Policy Files would help the Columbia researchers spot conspicuous gaps.

Something as subtle as an uptick in a diplomat’s telephone activity could be the sign of an international crisis that has been suppressed from the public record.

The other reason Connelly sought out this collection was something he remembered seeing in the US State Department’s physical files at the National Archives. Often, when looking in a box of diplomatic records, he would find a single sheet of paper, slipped in between the others, that described the rough outlines of a document that appeared to be missing. This sheet usually contained only a date, a title or subject, and sometimes the names of the sender and recipient. Connelly learned that this was the metadata of a classified document that had been rejected for release, either upon turning thirty or when someone requested it through the FOIA.

“They’re not very interesting when viewed one at a time,” says Connelly. “You wouldn’t think much of them.”

But what if you had a quarter million of them? That’s how many were in the electronic version of the Central Foreign Policy Files. Every single diplomatic communication that had been transmitted between 1973 and 1976, marked as classified and later rejected for release, was represented by a metadata file.

It was when Connelly acquired this database, in the fall of 2012, that he began to recruit the help of professional number crunchers. First he called up Columbia statistics professor David Madigan, a versatile researcher who had previously developed algorithms that predict the side effects of medications. Then he brought in several members of Columbia’s computer-science department who specialize in finding patterns in large amounts of text. Within a few months, they would receive a $150,000 award from the Brown Institute for Media Innovation, a joint enterprise run by Columbia and Stanford that promotes interdisciplinary projects between journalists and data scientists.

“I’d worked with scientists before, but never like this,” says Connelly. “This would be as far as I’d ever strayed from the old model of history I grew up with, where Leopold von Ranke is standing alone atop a mountain, surveying the landscape of time with nothing but the facts in his head and a healthy dose of intuition.” 

Ethnic profiling, ’70s style

Last spring, Connelly and his colleagues began inspecting those 250,000 metadata records to see what terms appeared on them most frequently.

“Basically, we were fishing around,” says Connelly. “We were modeling our technology.”

Once they did the analysis, one word stuck out: boulder. It appeared on thousands of cards.
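For readers curious what this kind of term count looks like in practice, here is a minimal sketch in Python. It is not the Columbia team's code: the sample cards, the field names, and the stopword list are all hypothetical stand-ins, and the real metadata records are not reproduced here. The sketch counts how many cards mention each term, so a word that recurs across many cards rises to the top.

```python
from collections import Counter
import re

# Hypothetical stand-ins for metadata cards (assumed fields: date, subject).
sample_cards = [
    {"date": "1974-03-02", "subject": "Operation Boulder visa check"},
    {"date": "1974-05-11", "subject": "Boulder: visa applicant review"},
    {"date": "1975-01-20", "subject": "Trade delegation schedule"},
]

# A tiny illustrative stopword list; a real analysis would use a fuller one.
STOPWORDS = {"the", "of", "and", "a", "to", "in", "for", "on"}


def term_document_counts(cards, field="subject"):
    """Count how many cards mention each term (each term once per card)."""
    counts = Counter()
    for card in cards:
        words = set(re.findall(r"[a-z]+", card[field].lower()))
        counts.update(words - STOPWORDS)
    return counts


counts = term_document_counts(sample_cards)
# List the terms that appear on the most cards.
for term, n_cards in counts.most_common(3):
    print(term, n_cards)
```

Counting each term once per card, rather than every occurrence, keeps a single wordy card from dominating the tally.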

Connelly soon concluded that this was a reference to “Operation Boulder,” a Nixon-era program that involved spying on Arab-Americans and scrutinizing visa applicants with Arab-sounding names. Initiated after the killing of eleven Israeli athletes by Palestinians at the 1972 Munich Olympics, Operation Boulder was roundly denounced by national-security experts for being ineffectual at improving the nation’s security. It was disbanded by the State Department in 1975. Few details about the program had emerged since. But what little information had been released provided Connelly and his colleagues the clues they needed to recognize the documents’ subject. The cards that contained the word boulder, when looked at in the aggregate, were also rich with references to visa applications, for example.

“There’s no doubt that these missing files are about the Nixon program,” says Connelly. “We can tell by looking at documents that have been released about the program. They also tend to mention visas.”

Why would the government release some documents about Operation Boulder and keep others secret? The Columbia researchers can shed light on this, too. Their analysis shows that before 2002, documents about Operation Boulder often got released when they came up for review. And then, abruptly, in April of that year, hardly any such files were declassified. Is it possible that the Bush administration blocked these releases to avoid comparisons between the antiterrorism measures that it was pursuing at the time, such as its no-fly list, and Nixon’s failed policy?
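A before-and-after comparison of this sort can be sketched in a few lines of Python. The review log below is invented purely for illustration, and the function name and the exact cutoff day are assumptions; the article says only that the shift came in April 2002, and it does not describe how the researchers computed it.

```python
from datetime import date

# Hypothetical review log: (date the file came up for review, was it released?).
sample_reviews = [
    (date(2000, 6, 1), True),
    (date(2001, 9, 15), True),
    (date(2001, 11, 3), False),
    (date(2002, 5, 20), False),
    (date(2003, 2, 7), False),
]


def release_rate(reviews, start=None, end=None):
    """Share of reviews dated in [start, end) that ended in release.

    Returns None when no reviews fall in the window.
    """
    selected = [
        released
        for reviewed_on, released in reviews
        if (start is None or reviewed_on >= start)
        and (end is None or reviewed_on < end)
    ]
    if not selected:
        return None
    return sum(selected) / len(selected)


cutoff = date(2002, 4, 1)  # assumed cutoff day within April 2002
before = release_rate(sample_reviews, end=cutoff)
after = release_rate(sample_reviews, start=cutoff)
print(before, after)
```

A sharp drop between the two rates would be the kind of pattern described above, though on its own it says nothing about why the change occurred.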

“It’s not a smoking gun,” Connelly says, “but it’s suggestive, isn’t it?”

Could a computer’s guess about the content of a blacked-out passage be considered a leak? Would it matter if it guesses right or not?

David Pozen, a Columbia law professor who is an expert on government secrecy, says that this floating of trial balloons, this dropping of hints, is a valuable contribution to scholarship in itself. He says that the Declassification Engine, by revealing what types of information the US government is keeping secret, is likely to encourage scholars, journalists, and citizens to file more public-record requests. Furthermore, he says that the project’s discoveries could help people win these petitions.

“One of the challenges in getting information through FOIA is that you need to describe what you’re looking for in considerable detail,” he says. “If you can show that an agency is sitting on thousands of documents related to a particular topic, well, the government may find it much less politically feasible to reject you.” 
