De-identified Data Definition

In education, de-identified data generally refers to data from which all personally identifiable information has been removed: that is, data about individual students, teachers, or administrators that has been rendered anonymous by stripping out any information that would allow people to determine an individual’s identity. Common forms of personally identifiable information include first and last names, home addresses, Social Security numbers, and other types of information that may reveal, intentionally or inadvertently, an individual’s identity in a given set of data. The primary reason for “de-identifying” data is to protect the privacy or identity of the individuals associated with the data.

De-identified data are commonly used for research purposes in education. For example, a state education agency might hire an organization or university to study the results or impact of an educational policy, such as a recent expansion of state-subsidized pre-kindergarten programs. The researchers would request the data they need to conduct the study (e.g., records showing the number of students enrolled in pre-kindergarten programs over a ten-year period), and the education agency would assemble the necessary datasets. Before releasing the data files, however, the agency would use a “de-identification process” to prevent individual identities from being revealed in the information provided to the external researchers. In many cases, the education agency and the research organization will also sign a formal agreement specifying how the data can be used and how the files must be disposed of once the study has been completed.

Data may also be de-identified when an education agency, district, or school shares information with external organizations and individuals not authorized to access or view personal information—for example, consultants and companies under contract to provide specialized services to districts and schools.

It is important to note that some datasets may indirectly reveal the identities of specific students or individuals even when the data seemingly contain no personally identifiable information. For example, some small, rural schools have very small minority student populations (perhaps only one or two students of color in the entire school). If state or school records contain, say, test scores or graduation rates for various racial subgroups, the identity of individual African American, Hispanic, or Asian students could inadvertently be revealed even though the data are otherwise “anonymous.” For this reason, states and schools may not publicly report or share certain data when subgroups are small enough to potentially connect otherwise anonymous data to specific students.
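The small-subgroup problem described above can be illustrated with a minimal Python sketch. The threshold of 10, the field names, and the function name are assumptions made for illustration only; actual minimum reporting sizes vary by state and agency.

```python
from collections import defaultdict

# Hypothetical minimum subgroup size; real reporting rules differ by agency.
MIN_CELL_SIZE = 10


def subgroup_averages(records, min_cell_size=MIN_CELL_SIZE):
    """Average score per subgroup, withholding results for small subgroups."""
    scores = defaultdict(list)
    for rec in records:
        scores[rec["subgroup"]].append(rec["score"])

    report = {}
    for subgroup, values in scores.items():
        if len(values) < min_cell_size:
            # Too few students: a reported average could point to individuals.
            report[subgroup] = None
        else:
            report[subgroup] = sum(values) / len(values)
    return report


# Illustrative data: 12 students in subgroup "A", only 2 in subgroup "B".
records = (
    [{"subgroup": "A", "score": 80} for _ in range(12)]
    + [{"subgroup": "B", "score": 95} for _ in range(2)]
)
report = subgroup_averages(records)
```

Here the average for subgroup "A" is reported, while the result for subgroup "B" is withheld because only two students belong to it.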

The most common strategies for de-identifying data are deleting all personal information in a data file and either “suppressing” or “masking” a selection of data so that the remaining information cannot be used to identify individuals. For more detailed discussions, see data masking and data suppression.
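As a rough illustration of the first strategy, deleting personal information from a data file, the following Python sketch drops a set of direct identifiers from a student record. The field names and the function name are hypothetical, and removing direct identifiers alone may not be sufficient when small subgroups are involved.

```python
# Hypothetical field names treated as direct identifiers in this sketch.
DIRECT_IDENTIFIERS = {"first_name", "last_name", "home_address", "ssn"}


def strip_identifiers(record, identifiers=DIRECT_IDENTIFIERS):
    """Return a copy of the record without any direct-identifier fields."""
    return {key: value for key, value in record.items() if key not in identifiers}


# Illustrative record; every value here is made up.
student = {
    "first_name": "Ann",
    "last_name": "Example",
    "home_address": "1 Sample Road",
    "ssn": "000-00-0000",
    "grade": 4,
    "score": 88,
}
deidentified = strip_identifiers(student)
```

The original record is left unchanged, and only the non-identifying fields (grade and score) remain in the copy.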

In addition, some de-identified datasets may contain what are often called “re-identification codes”: random numbers assigned to individual records that have otherwise been stripped of personally identifiable information. Re-identification codes might, for example, allow researchers to match two anonymous datasets when conducting a study. Say a state education agency provides a set of data files to researchers who are studying whether a specific program resulted in academic gains for students. While conducting the study, the researchers determine that an additional year of data is needed to complete their analysis. The education agency may then use re-identification codes to “identify” the students in the original dataset (while still masking their personal identities) and link those student records to the same students in the new dataset.
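The linking scenario above might look something like the following Python sketch. It assumes the agency keeps a private lookup table from student IDs to random codes and releases only the codes; the `Crosswalk` class, the `release` function, and all field names are hypothetical and not drawn from any particular agency's process.

```python
import secrets


class Crosswalk:
    """Private lookup held by the agency; never shared with researchers."""

    def __init__(self):
        self._codes = {}

    def code_for(self, student_id):
        # Assign a random code on first use, then reuse it for later releases.
        if student_id not in self._codes:
            self._codes[student_id] = secrets.token_hex(8)
        return self._codes[student_id]


def release(records, crosswalk):
    """Strip identifiers and attach each student's re-identification code."""
    released = []
    for rec in records:
        out = {k: v for k, v in rec.items() if k not in ("student_id", "name")}
        out["reid_code"] = crosswalk.code_for(rec["student_id"])
        released.append(out)
    return released


# Agency side: two yearly files released with the same crosswalk.
crosswalk = Crosswalk()
year1 = release(
    [{"student_id": "S1", "name": "Ann", "score": 70},
     {"student_id": "S2", "name": "Bo", "score": 85}],
    crosswalk,
)
year2 = release(
    [{"student_id": "S2", "name": "Bo", "score": 90},
     {"student_id": "S1", "name": "Ann", "score": 78}],
    crosswalk,
)

# Researcher side: link the two files using only the codes.
year2_by_code = {rec["reid_code"]: rec for rec in year2}
gains = {
    rec["reid_code"]: year2_by_code[rec["reid_code"]]["score"] - rec["score"]
    for rec in year1
}
```

Because the codes are random rather than derived from names or ID numbers, the researchers can compute each student's gain across years without learning who the students are; only the agency, which holds the crosswalk, can map a code back to a person.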

The Glossary of Education Reform by Great Schools Partnership is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
