COVER STORY

The Ghost Files

US historians have long complained about gaps in the National Archives. Can big-data analysis show what kinds of information the government is keeping classified?

by David J. Craig | Published Winter 2013-14

All of it, not some of it

The Declassification Engine will soon provide its visitors access to more declassified US government documents than have ever been available in one place.

Many of the materials on its site have so far come from commercial vendors. These include a set of 117,000 records produced by various US departments and agencies from the 1940s to the present. The database, known as the Declassified Documents Reference System, is considered by scholars to be the most important of its type, based on the historical significance of its individual items; it is on loan to the Columbia researchers from the publishing company Gale.

In terms of sheer volume, though, the project’s most impressive acquisition is yet to come. The Internet Archive, a nonprofit digital library based in San Francisco that collects all manner of public-domain content, from books to music to court transcripts, has agreed to give the Declassification Engine access to tens of millions of federal documents that its employees have harvested from government websites. These files will be accessible on the project’s website later this year.

To keep the site growing, Connelly is also trying to create a sort of electronic catch basin that collects documents as soon as the government releases them. One way he aims to do this is by collaborating with nonprofit organizations that have sprung up in recent years to help people file public-records requests. An organization called FOIA Machine, for instance, provides easy-to-use electronic submission forms and then tracks people’s requests for them; when the government fulfills a request, the materials are delivered to an e-mail account hosted by FOIA Machine. Connelly is now working with the organization to get access to those files. He says that migrating the documents to the Declassification Engine will, for the first time, allow researchers to study them alongside other declassified records using sophisticated analytic tools. Only a tiny percentage of documents released under FOIA, he points out, ever wind up in databases on library or government websites.

“Often, the person who receives material from the government is the only one in the world who now has a digital version of those records,” he says. “That’s a waste. Why not bring them all together?”

Old bars and stripes

One of the tools now operating on the Declassification Engine is ideally suited to gleaning insights from this influx of fresh material. Powered by software created by Columbia PhD candidate Alexander Rush, it can detect when multiple versions of the same document reside in the Declassification Engine’s databases. Connelly says it is common for slightly different versions of the same record to be floating around, because the government will often release a document with lots of text blacked out and then put out a cleaner version, say, in response to a FOIA request, years later. He says researchers can gain insights into the political sensitivities of past US presidents by seeing what language was blacked out under their watch and subsequently restored by their successors.

“Sometimes it’s the older, more heavily redacted version you’re hunting for,” Connelly says. “I’ve met historians who’ve spent years trying to track down all the versions that may exist of a particular memo.”
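The article does not describe how Rush’s software identifies matching records, but the general idea can be illustrated with a standard technique: break each document into overlapping word sequences (“shingles”) and flag pairs whose shingle sets overlap heavily. The sketch below is a minimal illustration only; the document texts, IDs, and similarity threshold are invented for the example.

```python
# Minimal sketch of one standard way to flag likely versions of the same
# document: compare overlapping word n-grams ("shingles") using Jaccard
# similarity. Illustration only -- the article does not say which
# technique Rush's software actually uses.

def shingles(text, n=3):
    """Return the set of overlapping n-word sequences in a document."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def jaccard(a, b):
    """Similarity of two shingle sets: size of intersection over union."""
    return len(a & b) / len(a | b) if (a or b) else 0.0

def find_likely_versions(documents, threshold=0.4):
    """Yield pairs of document IDs whose text overlaps heavily.

    `documents` maps an ID (e.g. an accession number) to extracted text.
    Heavily redacted releases still share most of their surviving text,
    so genuine version pairs tend to score well above the threshold.
    """
    ids = list(documents)
    sets = {doc_id: shingles(documents[doc_id]) for doc_id in ids}
    for i, id_a in enumerate(ids):
        for id_b in ids[i + 1:]:
            score = jaccard(sets[id_a], sets[id_b])
            if score >= threshold:
                yield id_a, id_b, score

if __name__ == "__main__":
    # Invented sample records for demonstration.
    docs = {
        "memo-1975-redacted": "The ambassador reported that [REDACTED] had "
                              "approved the shipment pending further review.",
        "memo-1975-full": "The ambassador reported that the foreign minister "
                          "had approved the shipment pending further review.",
        "unrelated-cable": "Weather delayed the delegation's arrival by two days.",
    }
    for id_a, id_b, score in find_likely_versions(docs):
        print(f"{id_a} <-> {id_b}: similarity {score:.2f}")
```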

Analyzing thousands of pairs of documents in this manner might also reveal political schisms within a sitting president’s administration, say the Columbia researchers, because sometimes one federal agency, in response to a FOIA request, will release a more complete version of a document than will another agency in response to similar requests.

“A classic example of this occurred in the aftermath of the Abu Ghraib scandal, when the FBI was eager to show that it had had nothing to do with torture and so it released a lot of information showing that other agencies were responsible for it,” says Connelly. “We hope that by analyzing huge numbers of documents, we’ll be able to identify the kinds of information that tend to get withheld by one or another agency, and thereby correct for the inherent bias in the public record.” 
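To make the idea concrete, here is a minimal sketch of how two releases of the same memo might be compared to surface passages that one agency blacked out and another did not. It relies on Python’s standard difflib; the function name and the sample sentences are invented for illustration and say nothing about the project’s actual tooling.

```python
# Minimal sketch: compare two releases of the same memo and recover passages
# that were blacked out in the earlier version but restored in the later one.

import difflib

def restored_passages(earlier_release, later_release):
    """Return passages present in the later release but absent from the
    earlier, more heavily redacted one."""
    earlier_words = earlier_release.split()
    later_words = later_release.split()
    matcher = difflib.SequenceMatcher(a=earlier_words, b=later_words)
    passages = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # 'insert' and 'replace' opcodes mark text that appears only in
        # (or differs in) the later release -- candidate restored material.
        if tag in ("insert", "replace"):
            passages.append(" ".join(later_words[j1:j2]))
    return passages

if __name__ == "__main__":
    # Invented sample text for demonstration.
    earlier = ("The station chief advised that [REDACTED] would travel to "
               "Vienna in March for consultations.")
    later = ("The station chief advised that the defense attache would "
             "travel to Vienna in March for consultations.")
    for passage in restored_passages(earlier, later):
        print("Restored:", passage)
```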

Truth and consequences

Is Matthew Connelly the next Julian Assange?

That’s a question he gets a lot. His answer is an emphatic “No.” He and his colleagues are only gathering documents that have been publicly released. And they are careful not to reveal any information that would endanger US security. They say their goal is merely to highlight broad categories of information that the federal government is keeping classified.

“Everybody involved in this project appreciates that some information needs to remain secret,” Connelly says. “On the other hand, lots of information is kept secret to avoid embarrassments, for political reasons, or simply because the government isn’t investing properly in reviewing and declassifying old documents. We want to help the government to uphold its own secrecy laws.”

That said, the data-mining technology that Connelly and his colleagues are developing could conceivably be adapted to generate statistically based guesses about the terms that lie beneath redactions. And this is where things get tricky. Connelly described this possibility to a few journalists last spring. Their reports, appearing in Wired, the New Yorker, Columbia Journalism Review, and half a dozen other publications, posed riveting questions: Could a computer’s guess about the content of a blacked-out passage be considered a leak? Would it matter whether the guess was right?
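What such a guess might look like is easy to caricature. The toy sketch below ranks candidate words for a one-word redaction by counting which words most often appear between the same neighbors in already-released text. It illustrates the general idea only; the corpus sentences are invented, and the article says nothing about how, or whether, the project would actually build such a model.

```python
# Toy illustration of a statistically based guess at a one-word redaction,
# using simple counts over already-declassified text. Not the project's
# method -- the article says only that its technology "could conceivably
# be adapted" to do something like this.

from collections import Counter, defaultdict

def train_contexts(corpus_sentences):
    """Count, for each (previous word, next word) pair, which words
    appeared between them in the training text."""
    context_counts = defaultdict(Counter)
    for sentence in corpus_sentences:
        words = sentence.lower().split()
        for prev_word, word, next_word in zip(words, words[1:], words[2:]):
            context_counts[(prev_word, next_word)][word] += 1
    return context_counts

def guess_redaction(context_counts, before, after, top_n=3):
    """Rank candidate words for a one-word redaction flanked by
    `before` and `after`."""
    candidates = context_counts.get((before.lower(), after.lower()))
    return candidates.most_common(top_n) if candidates else []

if __name__ == "__main__":
    # Invented "declassified" sentences standing in for a real corpus.
    declassified_corpus = [
        "The embassy in Moscow reported rising tensions",
        "The consulate in Berlin reported no change",
        "Officials in Moscow reported the meeting was postponed",
    ]
    counts = train_contexts(declassified_corpus)
    # Redacted sentence: "Analysts in [REDACTED] reported new activity"
    print(guess_redaction(counts, before="in", after="reported"))
```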
