Posted on Fri, Feb 11, 2011 : 11 a.m.

FOIA Friday: Redaction and reidentification of records

By Edward Vielmetti

Redaction is the process of removing portions of a document that contain sensitive, privileged, or private information. It is regularly used in satisfying Freedom of Information Act requests, where the responding agency keeps some information from the public eye while returning the rest of a document.

I spoke at Ignite Ann Arbor 5 on Wednesday about redaction, and in particular about the risk that an inquisitive individual can reconstruct part or all of the redacted material when a document is incompletely or sloppily redacted. Here's some expansion of those remarks, with a particular eye toward the sloppy redaction practices of a Michigan traffic safety agency that unnecessarily expose personal details of individuals who have been in highway accidents in the state.

Analog methods for reidentification

If you have an original physical document which has been blacked out with a Sharpie, you may be able to use chemical analysis to reconstruct the obliterated text. Sharpie ink dissolves in methanol, but generally printer toner does not. Test with a sample first.

Many documents can be read through optical methods, using a combination of light sources and optical sensors that help distinguish between the original text and the ink that blacks it out. Different inks have different properties in the infrared, ultraviolet, and visible spectrums, and shining a black light on a document or photographing it carefully while it is strongly illuminated from behind can recover traces of the text.

Most redacted documents are photocopies. Some of them may have been made on photocopiers with auto-sharpening settings, designed to aid in the legibility of text. If the redacting agency's Sharpie is running out of ink, it may produce a document that the redacting agency's photocopier can reconstruct, as the example below noted by the Ann Arbor Chronicle ("FOIA Update: Printed vs Electronic Records", June 30, 2009) shows.

a2gov-out-of-public-discussion-redaction.png

The text "out of public discussion" is evident in this imperfectly redacted document provided by the City of Ann Arbor in response to a FOIA request.

City of Ann Arbor via Ann Arbor Chronicle; used with permission.

In some cases, the redaction will not be complete, as when the black marker misses the tops and bottoms of letters. You may be able to make out hidden material from letters left partially visible, or from bits and pieces of the corners of the text that escaped the marker.

In many cases, redaction will be applied to source documents that are printed in a fixed pitch font, so that you can clearly identify how many characters have been removed. This can be one clue to help you rule out alternatives. For example, if the text that has been blotted out has 9 letters and the two letters just before the last look like they might be the letter 't', the odds are good that the hidden word is "Vielmetti".
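The narrowing-down described above can be sketched in a few lines of Python. This is only an illustration: the list of candidate names is hypothetical, and in practice you would draw candidates from names that plausibly appear in the document.

```python
def matching_candidates(candidates, length, known_letters):
    """Return candidates of the given length whose letters agree with the
    partially visible positions (0-based index -> lowercase letter)."""
    matches = []
    for word in candidates:
        if len(word) != length:
            continue  # a fixed pitch font tells us the exact character count
        lowered = word.lower()
        if all(lowered[i] == ch for i, ch in known_letters.items()):
            matches.append(word)
    return matches

# Hypothetical candidate list, for illustration only.
names = ["Vielmetti", "Henderson", "Armstrong", "Smith", "Zimmerman"]

# Nine characters blotted out; the two just before the last look like "t".
survivors = matching_candidates(names, 9, {6: "t", 7: "t"})
print(survivors)
```

Length alone rules out "Smith"; the two partially visible letters rule out the other nine-letter names.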

Exploiting failures in digital redaction

Too often, an agency that uses digital methods to redact a text does so poorly, without understanding that a file that prints out cleanly, with blacked out bits on paper, may still retain all of the redacted text.

Adding white rectangles to a PDF document was not enough to hide the underlying text from an AP reporter, who revealed details of Facebook’s confidential settlement of a lawsuit brought by social networking site ConnectU in 2009. This type of redaction failure has been noted multiple times (see e.g. my column "Redaction and how not to do it", December 11, 2009) and is straightforward to avoid if you follow this Adobe Acrobat 8 tutorial on redaction.
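To see why a rectangle drawn on top of text hides nothing, here is a deliberately simplified sketch. It builds a made-up page content stream that draws a line of text and then paints a white box over it, and then pulls the text back out. The sample text is hypothetical and has nothing to do with the actual settlement document. Real PDFs also use string escapes, font encodings, and other text operators that this regular expression ignores, so a real tool would use a proper PDF library.

```python
import re
import zlib

# Hypothetical content stream: show some text, then fill a white
# rectangle ("1 1 1 rg ... re f") over the same area of the page.
content = (
    b"BT /F1 12 Tf 72 700 Td (The parties agree to SECRET TERMS) Tj ET\n"
    b"1 1 1 rg 70 696 260 16 re f\n"
)

# Page streams are commonly stored Flate-compressed inside the file.
stored = zlib.compress(content)

def text_in_stream(stream_bytes):
    """Return the literal strings shown with the Tj operator, whether or
    not something was later painted over them."""
    raw = zlib.decompress(stream_bytes)
    found = re.findall(rb"\(([^()\\]*)\)\s*Tj", raw)
    return [s.decode("latin-1") for s in found]

print(text_in_stream(stored))
```

The white box changes what the page looks like, not what the file contains. Proper redaction removes the text itself from the stream.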

Similar problems come up routinely when a document that originated in Microsoft Word gets out to the public. Word's "track changes" feature can show how a document has changed over time, with sometimes embarrassing revelations. Shauna Kelly collected a set of examples of how tracked changes have made businesses and government look foolish, noting difficulties at the United Nations, the UK government, and the California Attorney General.
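As a rough illustration of how deleted words linger in a file, the sketch below assumes the newer .docx format, where the main text lives in an XML part and tracked deletions are kept in `w:delText` elements. Older binary .doc files store revisions differently. The sample XML and its wording are hypothetical; with a real file you would first read `word/document.xml` out of the .docx archive using Python's `zipfile` module.

```python
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by .docx files.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# Hypothetical fragment of a document with one tracked deletion
# and one tracked insertion.
sample = f"""<w:document xmlns:w="{W}"><w:body><w:p>
  <w:r><w:t xml:space="preserve">The budget is </w:t></w:r>
  <w:del w:id="1" w:author="Hypothetical Author"><w:r><w:delText>wildly over</w:delText></w:r></w:del>
  <w:ins w:id="2" w:author="Hypothetical Author"><w:r><w:t>on</w:t></w:r></w:ins>
  <w:r><w:t xml:space="preserve"> target.</w:t></w:r>
</w:p></w:body></w:document>"""

def tracked_deletions(document_xml):
    """Return the text of every tracked deletion still stored in the XML."""
    root = ET.fromstring(document_xml)
    return [node.text or "" for node in root.iter(f"{{{W}}}delText")]

print(tracked_deletions(sample))
```

Accepting or rejecting all changes before a document is released removes this history.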

Reconstructing deleted material from other databases

The most difficult part of reidentification is reconstructing data based on clues in the original text that hide some, but not all, of an individual's identity. A clever researcher can use the partial information provided in the text as clues to reconstruct identity by matching up data against an external database.

Arvind Narayanan, a post-doctoral researcher on privacy and anonymity at Stanford University, publishes a weblog, "33 Bits", in which he expands on his thesis that "the level of anonymity that society expects—and companies claim to provide—in published databases is fundamentally unrealizable." He provides a series of compelling examples, backed up with computer code to automate the process, of how an adversary can reconstruct personal information by linking together multiple databases, each of which has been nominally scrubbed of personally identifiable information.

Take, for example, the Michigan Traffic Crash Facts database, managed by the University of Michigan Transportation Research Institute - Transportation Data Center and the state Office of Highway Safety Planning. It contains a complete collection of crash records in the state of Michigan, with tools to allow queries of statistical data and access to a sanitized version of every UD-10 crash report filed with police agencies in the state. The sanitization blocks out individual names, but retains information about the crash victim's zip code and (crucially) their exact date of birth.

sanitized-crash-report-excerpt.png

Michigan Traffic Crash Facts

In a column published by security researcher Bruce Schneier in Wired Magazine, the question of how easy it is to reidentify someone from fragmentary information is put to the test. ("Why Anonymous Data Sometimes Isn't", December 2007)

Using public anonymous data from the 1990 census, Latanya Sweeney found that 87 percent of the population in the United States, 216 million of 248 million, could likely be uniquely identified by their five-digit ZIP code, combined with their gender and date of birth. About half of the U.S. population is likely identifiable by gender, date of birth and the city, town or municipality in which the person resides. Expanding the geographic scope to an entire county reduces that to a still-significant 18 percent. "In general," the researchers wrote, "few characteristics are needed to uniquely identify a person."
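A back-of-the-envelope calculation suggests why so few characteristics are needed. The sketch below compares the bits of information required to single out one person among 248 million with the bits supplied by gender, birth date, and ZIP code. The counts of birth dates and ZIP codes are my own rough assumptions, and the calculation pretends people are spread evenly across every combination, which they are not.

```python
import math

def bits_needed(population):
    """Bits of information required to single out one person."""
    return math.log2(population)

def bits_available(*distinct_values):
    """Bits supplied by independent attributes, assuming each attribute
    is spread evenly over its possible values."""
    return sum(math.log2(n) for n in distinct_values)

US_POPULATION_1990 = 248_000_000  # figure quoted in the column above
GENDERS = 2
BIRTH_DATES = 365 * 80            # assumption: about 80 years of birthdays
ZIP_CODES = 30_000                # assumption: rough order of magnitude only

needed = bits_needed(US_POPULATION_1990)
available = bits_available(GENDERS, BIRTH_DATES, ZIP_CODES)
print(f"needed: {needed:.1f} bits, available: {available:.1f} bits")
```

Under these assumptions the three attributes carry a few more bits than are needed. Because real populations cluster in popular ZIP codes and birth years, not everyone ends up alone in their combination, which is consistent with a figure of 87 percent and not 100.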

As you can see from the traffic crash database, this information is routinely released in bulk. Match that up with the make and model of the car, and it's even easier to re-identify an individual.
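The matching step can be sketched as a simple join on ZIP code and date of birth. Every record and name below is invented for illustration. The "roster" stands in for any outside list that pairs names with a ZIP code and birth date, and the field names are my own, not those of the crash database.

```python
from collections import defaultdict

# Hypothetical sanitized crash records: names removed, ZIP and birth date kept.
crash_records = [
    {"report": "A", "zip": "48103", "dob": "1970-01-15", "vehicle": "Ford Focus"},
    {"report": "B", "zip": "48104", "dob": "1985-06-30", "vehicle": "Honda Civic"},
]

# Hypothetical outside list that carries names.
roster = [
    {"name": "Pat Example", "zip": "48103", "dob": "1970-01-15"},
    {"name": "Sam Sample", "zip": "48104", "dob": "1985-06-30"},
    {"name": "Lee Placeholder", "zip": "48104", "dob": "1985-06-30"},
    {"name": "Kim Other", "zip": "48105", "dob": "1990-03-02"},
]

def link(records, people):
    """Map each report to the names that share its ZIP code and birth date."""
    by_key = defaultdict(list)
    for person in people:
        by_key[(person["zip"], person["dob"])].append(person["name"])
    return {r["report"]: by_key.get((r["zip"], r["dob"]), []) for r in records}

matches = link(crash_records, roster)
print(matches)
```

Report A resolves to a single name. Report B is left with two candidates, which is where an extra detail such as the make and model of the car would settle the question.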

The suggestion for highway safety professionals is fairly easy to describe. We, the public, have an interest in the details that help make highways safe, but we also deserve to have our privacy protected when we get into an accident. There is no compelling public interest in releasing my exact birthday on traffic crash reports to the general public, especially when just the year of my birth would allow all of the relevant statistical analysis to be done.
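The effect of publishing only the year of birth can be shown with a toy example. The records below are invented. The sketch measures what share of records are the only one with their combination of ZIP code and birth date, first with the exact date and then with the year alone.

```python
from collections import Counter

def unique_fraction(records, key):
    """Share of records whose key is shared with no other record."""
    counts = Counter(key(r) for r in records)
    return sum(1 for r in records if counts[key(r)] == 1) / len(records)

# Hypothetical (ZIP code, date of birth) pairs, for illustration only.
records = [
    ("48103", "1970-01-15"),
    ("48103", "1970-08-02"),
    ("48103", "1970-11-23"),
    ("48104", "1985-06-30"),
    ("48104", "1985-02-11"),
    ("48104", "1962-09-09"),
]

exact = unique_fraction(records, key=lambda r: r)
year_only = unique_fraction(records, key=lambda r: (r[0], r[1][:4]))
print(f"unique with exact date: {exact:.0%}, with year only: {year_only:.0%}")
```

In this made-up sample every record is unique when the full date is published, and only one in six is when the date is coarsened to a year. Real data will differ, but the direction of the effect is the point.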

Edward Vielmetti shines your letters up to the light to see what you scratched out. Send him a note at edwardvielmetti@annarbor.com.

Comments

Edward Vielmetti

Fri, Feb 11, 2011 : 7 p.m.

What is this, seldon, mad libs? http://www.collegian.psu.edu/archive/2011/02/02/two_psu_alums_on_wheel_of_fortune_preview.aspx Caitlin Burke, Class of 2006, said she didn't really need to shout out the letter "L" to crack the 27-letter phrase on the "Wheel of Fortune" board. Without a single letter showing, she had already figured out the puzzle in her head: "I've got a good feeling about this."

seldon

Fri, Feb 11, 2011 : 6:33 p.m.

According to ⬛⬛⬛⬛⬛ ⬛⬛⬛⬛⬛⬛⬛, Ed Vielmetti frequently smokes ⬛⬛⬛⬛⬛⬛ ⬛⬛⬛⬛⬛ and uses them in ⬛⬛⬛⬛⬛⬛⬛⬛⬛⬛. This concerns me greatly because ⬛⬛⬛ ⬛⬛⬛⬛⬛⬛⬛ on ⬛⬛⬛⬛⬛⬛⬛⬛ ⬛⬛, 2011.