Looking to the future, one can imagine a world wherein ‘omics’ biomedical sciences are commonplace, even to the point of having one’s genome sequenced in routine medical checkups. In this envisaged world, phenomenally large amounts of G2P data will be produced daily, much of which will flow effortlessly into the internet to be absorbed into a sophisticated and powerful ‘biomedical knowledge environment’. Some of this information will be secured for restricted access, whilst much of the raw data and the derived knowledge should be free for everyone to search and exploit.
The system will enable extensive scientific reporting and discussion, provide a core reference platform for medical practice, and open exciting new operational vistas for journals, industry, and funders. It will provide for and underpin activities in biomedical research, biotechnology, drug development, and personalised healthcare. And it will probably even impact our basic cultural practices (e.g., insurance, the law, employment policies) as society comes to grips with the immense power and relevance of genetics to the human state. But this envisaged future is nothing like the world we presently live in.
No system yet exists that even begins to approximate a ‘biomedical knowledge environment’ properly able to support G2P data gathering and analysis. There are instead a limited number of unconnected G2P databases that are mostly at rather early stages in their development, with no agreed structured way of effectively modelling phenotype data or G2P relationships, and no convenient mode for passing data from discovery laboratories into the database world. A few recent initiatives are building large databases to host individual-specific genotypes and phenotypes to support some high-throughput disease association studies, but these do not have a global remit, have not engaged with the extensive existing knowledge from Mendelian disorders, and are not focused on all the research and clinical communities around G2P. Most progress has arguably been made with locus-specific databases (LSDBs) that target specific diseases or genes, but the vast majority of the several hundred LSDBs that do exist are rudimentary in design and implementation, and operationally isolated from one another. This all contrasts with the situation for databases concerned with purely genetic data (without phenotype association), of which there are many, including several large data warehouses and genome browsers that act as central repositories and search centres for all the human and model organism genome sequences, variants, and feature annotations yet produced.
There are a number of reasons why the G2P database field is so poorly developed. Problems include the complexity and diversity of the pertinent data elements, the contemporary nature of the challenge, and certain practical and cultural issues. However, perhaps the most critical obstacle is the overwhelming scale of the problem. Whereas the genome is a bounded domain of only ~3,000,000,000 nucleotides and ~25,000 genes (in man), there is essentially no limit to the number of G2P relationships that can be examined, each by multiple different procedures. The former is thus relatively straightforward and can be managed and hosted in one or a few large data repositories (as has been accomplished). In contrast, the latter is too large in scale and scope to handle in this way.
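The scale argument above can be made concrete with a back-of-envelope calculation. The phenotype count and the restriction to gene pairs below are purely illustrative assumptions (the true G2P space, spanning variants, environments, and assay types, is far larger still), but even this deliberately modest sketch dwarfs the genome itself:

```python
# Illustrative comparison of the bounded genome with the combinatorial
# space of G2P relationships. The phenotype figure is an assumption
# chosen for illustration, not a measured quantity.

GENOME_NUCLEOTIDES = 3_000_000_000   # ~3 billion bases (human)
GENES = 25_000                       # ~25,000 genes (human)
PHENOTYPES = 10_000                  # assumed phenotype catalogue size

# Even restricting attention to pairs of genes tested against each
# phenotype (ignoring single variants, gene triples, environments,
# and assay methods) the count explodes:
gene_pairs = GENES * (GENES - 1) // 2          # ~312 million pairs
relationships = gene_pairs * PHENOTYPES        # ~3.1 trillion

print(f"gene pairs:          {gene_pairs:,}")
print(f"pair x phenotype:    {relationships:,}")
print(f"x genome size:       {relationships // GENOME_NUCLEOTIDES:,}")
```

Under these assumptions, the pairwise gene-phenotype space alone exceeds the number of nucleotides in the genome by roughly three orders of magnitude, which is why a single central repository model that works for sequence data cannot simply be reused for G2P data.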
There is virtually no limit to the quantity of G2P data that will eventually be created, nor to their diversity or purpose. The database solutions for G2P information must therefore be based upon new ways of thinking about and organising the field’s development: emphasising standards, integration, federation, and broad community participation from the very outset.