Sensitive datasets, data privacy, and access control
Investigations into clinical materials, especially high-throughput experiments and genetic epidemiology studies involving thousands of individuals, generate data from which study participants can be identified. In order to protect these individuals from potential misuse of the data generated about them (e.g. discrimination by health insurance providers or potential employers), the dissemination of these data must be carefully controlled and involves many stakeholders (see e.g. ref. [fn]Foster et al. Share and share alike: deciding how to distribute the scientific and social benefits of genomic data. Nature Reviews Genetics (2007) vol. 8 (8) doi:10.1038/nrg2360[/fn]). But this will become increasingly costly and difficult to manage on a case-by-case basis, given increases in the number of such studies, the number of groups/consortia generating such datasets, the number of databases wishing to integrate and disseminate the information, and the number of researchers wishing to access these data.
Case study: individual-level data from genome-wide association studies
Currently, to gain access to genotype data from genome-wide association studies (GWAS) conducted by the Wellcome Trust Case-Control Consortium (WTCCC), one must complete a special form, wait up to two months for approval from the relevant Data Access Committee, and sign a Data Access Agreement. The researcher is then allowed to download encrypted files from the European Genotype Archive (EGA) website to their computer, and must decrypt these files with a provided key. NCBI’s database of Genotypes and Phenotypes (dbGaP) has similar procedures in place.
While there are good reasons for these measures, they already impede the rate of research progress, and will increasingly do so as opportunities for broad dataset integration and meta-analysis become ever more curtailed by limitations on access. Simply extending the current system will not change the core fact that access permissions must be applied for per dataset/project, making it very onerous for researchers who need to access many datasets from multiple sources. Also, as the researcher must download the data to their local computer, the system does not scale up to future applications where data integration will take place on-the-fly across many diverse data sources on the Internet. Therefore, even though the primary data in question are in principle available to researchers, the potential for data reuse (e.g. data mining, secondary analyses) is greatly diminished by current dissemination practices.
Case study: aggregate data from genome-wide association studies
Aggregate representations of individual genotype information (genotype/allele frequencies, aka genotype summaries) from GWAS were until recently distributed without restrictions, based on the assumption that this level of detail does not enable re-identification of individuals within the group. This enabled secondary data providers, or portals (such as HGVbaseG2P), to collect these data and present them to end users via special-purpose genome views and search modalities, thus adding value to the original results.
However, in a recent paper[fn]Homer et al. Resolving individuals contributing trace amounts of DNA to highly complex mixtures using high-density SNP genotyping microarrays. PLoS Genet (2008) vol. 4 (8) doi:10.1371/journal.pgen.1000167[/fn] the authors show that given a high-density genetic profile for an individual, it is possible to work out whether the individual participated in a genetic association study, even if only aggregate genotype data are available from the study. As a result of these findings, various data providers and funders have effectively halted unrestricted sharing of aggregate data (see e.g. the response from NIH[fn]Zerhouni et al. Protecting aggregate genomic data. Science (2008) vol. 322 (5898) doi:10.1126/science.1165490[/fn]), and full individual-level access privileges are now required even for the aggregate data. As a consequence, secondary data providers cannot present or re-distribute aggregate GWAS data without greatly restricting the amount of information shown at one time, severely limiting the value such projects could otherwise add.
A registry for users of biomedical data
The whole process would obviously be greatly streamlined if one or more services (probably operated by major regional data centres such as WTSI and NCBI) were to store information on access privileges for each researcher, based on an OpenID that they would provide upon registration. The registry (or registries) could then be used by various primary and secondary data providers (whether or not part of WTSI and NCBI) to check whether or not a person should be allowed access to a given type of sensitive dataset. The same registry could also be used to ‘blacklist’ individuals found guilty of inappropriate use of data (though the complex issue of sanctions needs much further consideration, whatever mechanism for access approval is in operation).
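To make the idea concrete, here is a minimal sketch of how such a registry lookup might behave. Everything here is hypothetical illustration, not an actual WTSI or NCBI interface: the `AccessRegistry` class, the privilege label `"aggregate-gwas"`, and the example OpenID URLs are all invented for this sketch.

```python
# Hypothetical sketch: a shared registry of researchers keyed by OpenID.
# A data provider would query it before serving a sensitive dataset.

class AccessRegistry:
    def __init__(self):
        # OpenID -> set of privilege labels granted to that researcher
        self._privileges = {}
        # OpenIDs barred from access after findings of misuse
        self._blacklist = set()

    def register(self, openid, privileges):
        """Record the privileges approved for a registered researcher."""
        self._privileges[openid] = set(privileges)

    def blacklist(self, openid):
        """Bar a researcher found guilty of inappropriate data use."""
        self._blacklist.add(openid)

    def is_allowed(self, openid, required_privilege):
        """Answer a data provider's query: may this OpenID see this data?"""
        if openid in self._blacklist:
            return False
        return required_privilege in self._privileges.get(openid, set())


registry = AccessRegistry()
registry.register("https://alice.example.org/openid", {"aggregate-gwas"})

# A secondary data provider checks the registry before serving data:
print(registry.is_allowed("https://alice.example.org/openid", "aggregate-gwas"))  # True
print(registry.is_allowed("https://bob.example.org/openid", "aggregate-gwas"))    # False
```

The key design point is that the registry, not each individual data provider, holds the mapping from researcher identity to privileges, so a single registration can unlock data at many participating sites.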
Granularity of data access permissions
The registry and participating data providers could have different levels of granularity for access permissions. For example, in the simplest scenario a person who is listed in the registry (thereby confirming their status as a researcher) could be given 'blanket' access to quasi-sensitive data (such as aggregate genotypes, as outlined in the case study above). System(s) enabling this could be developed relatively quickly, and thus this strategy may serve as an interim solution to the acute aggregate data sharing problem.
In a more complex scenario involving individual-level data, a researcher could be granted access to all datasets from a particular archive (e.g. dbGaP), or all data from a particular consortium which has submitted several datasets to one or more archives (e.g. all WTCCC data). Finally, a researcher could be given access to only a particular dataset (e.g. WTCCC bipolar study).
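The tiers described above could be sketched as a simple coarsest-first permission check. This is purely illustrative: the scope strings (`"blanket:aggregate"`, `"archive:…"`, etc.) and the dataset labels are assumptions made for the example, not any real archive's scheme.

```python
# Hypothetical sketch of tiered data-access permissions, from coarse
# ('blanket' access to quasi-sensitive aggregate data) down to a
# single named dataset. All scope and dataset labels are illustrative.

def grant(permits, scope):
    """Record a permission scope for a researcher, e.g. 'archive:dbGaP'."""
    permits.add(scope)

def may_access(permits, archive, consortium, dataset, aggregate_only=False):
    """Check a request against a researcher's permits, coarsest first."""
    if aggregate_only and "blanket:aggregate" in permits:
        return True                              # any aggregate data
    if f"archive:{archive}" in permits:
        return True                              # whole archive, e.g. dbGaP
    if f"consortium:{consortium}" in permits:
        return True                              # all of a consortium's data
    return f"dataset:{dataset}" in permits       # one specific study

permits = set()
grant(permits, "blanket:aggregate")          # aggregate data from any study
grant(permits, "dataset:WTCCC-bipolar")      # individual-level, one study only

# Aggregate views of any study are allowed, but individual-level data
# only for the one dataset explicitly granted:
assert may_access(permits, "EGA", "WTCCC", "WTCCC-CAD", aggregate_only=True)
assert may_access(permits, "EGA", "WTCCC", "WTCCC-bipolar")
assert not may_access(permits, "EGA", "WTCCC", "WTCCC-CAD")
```

Checking from the coarsest scope downward means a blanket or archive-wide grant short-circuits the lookup, while per-dataset grants remain the fallback for the most sensitive material.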
Conclusions
Current practices for disseminating sensitive biomedical data are onerous and will not scale to support future large-scale data integration tasks. An online registry or registries of researchers and their data access privileges will be key components in streamlining this process, both for acquiring data access permits initially and for accessing the data from primary as well as secondary data providers. Such a framework will require researchers to prove that they are who they say they are in a robust way across multiple websites, which in turn requires the adoption of a universal authentication system (as described in this section of our series).