Source documents obtained through data collection or an access-to-information request may be redacted or anonymized; that is, sensitive data will have been blacked out or replaced with fictitious data to prevent individuals from being identified. Scrutinizing page after page of text for proper names, addresses, or medical, financial and other personal information is a laborious task, unless it is entrusted to algorithms. That is precisely the goal of a research project we are conducting in partnership with Irosoft, a company that specializes in data valorization.
With certain deep-learning techniques, we can “teach” algorithms to locate sensitive information in a text. Training them involves working with documents in which that kind of information has already been tagged manually. Of course, no set of documents can contain every first name, city name, date of birth and so on in existence, but algorithms can learn to identify them from context. For example, they will recognize a person’s name when it is preceded by “Mr.” or “Ms.” Such algorithms already exist, but they have always been trained on the same sets of texts, for strictly academic purposes, to improve performance.
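As a toy illustration (not the project’s actual model), the contextual cue mentioned above, a name following an honorific, can be written as one explicit rule; a trained sequence-labelling model learns many such cues automatically from the manually tagged examples:

```python
import re

# Toy illustration only: a hand-written contextual rule of the kind a trained
# model learns automatically, e.g. that a capitalized word following an
# honorific ("Mr.", "Ms.", "Dr.") is likely a person's name.
HONORIFIC_PATTERN = re.compile(r"\b(Mr\.|Ms\.|Dr\.)\s+([A-Z][a-z]+)")

def tag_names(text):
    """Return (start, end, name) spans for words preceded by an honorific."""
    return [(m.start(2), m.end(2), m.group(2))
            for m in HONORIFIC_PATTERN.finditer(text)]

print(tag_names("The report was filed by Ms. Tremblay on Monday."))
# → [(28, 36, 'Tremblay')]
```

A real system generalizes far beyond such a rule, which is exactly why it needs varied tagged corpora rather than hand-coded patterns.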
“For 15 years, researchers have been training algorithms using the same data: news articles, mostly sports stories,” Philippe Langlais explains.
“This is a great collaborative effort between industry players and researchers; it’s productive for both sides and something we want to continue.”
– Alain Lavoie, President and co-founder, Irosoft
Irosoft, however, must meet anonymization requirements for documents of a medical, legal, financial or other nature that contain specific types of sensitive information: names of drugs or financial institutions, for example. In these contexts, a common noun can become a piece of sensitive data. In other cases, “in legal documents, where it is ubiquitous, the word judgment may be insignificant, but it could become a clue to a person’s identity in another field,” Alain Lavoie points out. Fortunately, corpora of documents from a variety of fields in which sensitive information has already been tagged do exist.
Professor Langlais uses this type of corpus to train and test the algorithms. “Each of these corpora contains sensitive data, and we used them to test algorithms that had been trained on other corpora,” he notes. “It turns out that the algorithm learns differently from one field to another, so adaptations are necessary before it can be applied to a field other than the one it was originally trained on. The solution is to find concordances between tag sets that allow switching from one field to another.”
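A tag concordance can be sketched as a simple mapping between label sets. The label names below are hypothetical, chosen only to illustrate how the domain-specific tags of one corpus (a medical one, say) might be projected onto a generic tag set used by another:

```python
# Hypothetical tag concordance: the label names are illustrative assumptions,
# not the tag sets of any corpus named in the article.
MEDICAL_TO_GENERIC = {
    "PATIENT":        "PERSON",
    "DOCTOR":         "PERSON",
    "HOSPITAL":       "ORGANIZATION",
    "DRUG":           "MISC",
    "ADMISSION_DATE": "DATE",
}

def remap(tagged_tokens, concordance):
    """Rewrite (token, tag) pairs into the target field's tag set.

    Tags with no concordance entry fall back to "O" (outside any entity).
    """
    return [(tok, concordance.get(tag, "O")) for tok, tag in tagged_tokens]

print(remap([("aspirin", "DRUG"), ("Dr.", "O"), ("Smith", "DOCTOR")],
            MEDICAL_TO_GENERIC))
# → [('aspirin', 'MISC'), ('Dr.', 'O'), ('Smith', 'PERSON')]
```

In practice the mapping is rarely this clean; labels may split, merge, or have no counterpart at all, which is part of what makes cross-domain adaptation hard.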
“With Irosoft, we studied algorithms in situations that differ from those of the academic milieu. We asked questions that scientists never ask.”
– Philippe Langlais, professor and project lead in the Université de Montréal Department of Computer Science and Operations Research
However, he emphasizes, “recognizing sensitive information is only a first step toward anonymization.” Disguising the information to prevent individuals from being identified, while ensuring the text remains intelligible, is another matter entirely.
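One possible approach to that second step, offered here only as an assumption and not as the project’s method, is pseudonymization: replacing each tagged entity with a consistent placeholder so the text stays readable and repeated mentions of the same entity remain linked.

```python
# Minimal pseudonymization sketch (an assumed approach, not the project's):
# each distinct tagged entity gets one stable placeholder, so the rewritten
# text stays intelligible and co-references survive.
def pseudonymize(tokens_with_tags):
    mapping, counters, out = {}, {}, []
    for tok, tag in tokens_with_tags:
        if tag == "O":                      # not sensitive: keep as-is
            out.append(tok)
        else:
            if tok not in mapping:          # first mention: mint a placeholder
                counters[tag] = counters.get(tag, 0) + 1
                mapping[tok] = f"{tag}_{counters[tag]}"
            out.append(mapping[tok])        # reuse it on later mentions
    return " ".join(out)

print(pseudonymize([("Smith", "PERSON"), ("met", "O"), ("Jones", "PERSON"),
                    ("then", "O"), ("Smith", "PERSON"), ("left", "O")]))
# → PERSON_1 met PERSON_2 then PERSON_1 left
```

Even this simple scheme shows why the step is delicate: the placeholders must hide identities without scrambling who did what in the text.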