The GEN2PHEN Strategy
The GEN2PHEN project has the overall ambition of unifying human and model organism genetic variation databases, and doing this in such a way that the resulting holistic view of G2P data can be blended with all other biomedical database domains via one or more central genome browsers. The project will put in place the main building blocks needed to move substantially from today’s G2P database situation towards the ultimate future of a complete biomedical knowledge environment. The project will then utilise these building blocks to construct a first-generation version of a G2P knowledge environment by the project’s end. This will consist of a European-centred but globally networked hierarchy of bioinformatics GRID-linked databases, tools and standards, all tied into the Ensembl genome browser. To ensure the project builds something that truly works and tangibly benefits the community, rather than merely devising potentially useful technologies, we have focussed the project’s objectives on the three essential components of a functioning G2P database system. These can be viewed as three legs of a ‘stool’, each of which must be robust for the stool to properly function (see Figure 1).
- 1. TO ANALYSE THE G2P FIELD AND INVESTIGATE CURRENT NEEDS AND PRACTICES
- 2. TO DEVELOP KEY STANDARDS FOR THE G2P FIELD
- 3. TO CREATE GENERIC DATABASE COMPONENTS, SERVICES AND INTEGRATION INFRASTRUCTURES FOR THE G2P DOMAIN
- 4. TO CREATE DATA SEARCH AND PRESENTATION SOLUTIONS FOR G2P KNOWLEDGE
- 5. TO FACILITATE THE POPULATING OF RESEARCH AND DIAGNOSTIC G2P DATABASES
- 6. TO BUILD A MAJOR G2P INTERNET PORTAL
- 7. TO DEPLOY GEN2PHEN SOLUTIONS TO THE COMMUNITY
- 8. TO ADDRESS SYSTEM DURABILITY AND LONG-TERM FINANCING
- 9. TO UNDERTAKE A SYSTEM UTILITY AND VALIDATION PILOT STUDY
1. TO ANALYSE THE G2P FIELD AND INVESTIGATE CURRENT NEEDS AND PRACTICES
We recognise that other work is going on in the field, and that different users have related but different needs. Our Consortium comprises representatives of each sector currently building G2P databases, and we have many deep connections into the broader G2P community. We shall utilise these skills and relationships to ensure that our activities match the latest needs and progress of others, and to gain community trust and acceptance of the GEN2PHEN system. This will be achieved by broad opinion gathering and open discussion with the community from the project’s outset. This will lead to state-of-the-art documents that describe the general progress of the field and the specific data models and technologies that are particularly favoured and effective. GEN2PHEN itself will almost certainly have a strong influence on these developments, but we will adapt our work as necessary to maximally interoperate with external developments.
2. TO DEVELOP KEY STANDARDS FOR THE G2P FIELD
From an intimate knowledge of what others are doing, we will develop data models, nomenclature, and technology standards that will be building blocks for us and the community. We will not develop ontologies ourselves, but will connect to and be led by the various expert groups doing this for the G2P domain. Each finalised standard will be formally documented and, wherever possible, registered with independent bodies to make it an official global standard.
3. TO CREATE GENERIC DATABASE COMPONENTS, SERVICES AND INTEGRATION INFRASTRUCTURES FOR THE G2P DOMAIN
Based upon GEN2PHEN-derived and other emerging standards, we will build generic database components and a deeply networked infrastructure (one leg of the stool). This will include solutions for genetic (gene or disease-specific) and genomic (whole genome) databases, with appropriately styled interfaces for the target communities: namely, biomedical researchers, clinical practitioners, and the general public. The genetic database work will concentrate upon providing one or more ‘LSDB-in-a-box’ applications, enabling anyone to easily set up an LSDB for their gene/disease of interest. We will also establish an LSDB hosting service for those who prefer this approach. The genomics database work will concentrate upon providing components for flexible and future-proof database implementations that support summary-level G2P datasets. We will not target support for individual-level G2P datasets, as databases for these are already being constructed to support large-scale genetic association studies and medical re-sequencing projects. Instead, we are already partnered with such groups and we will ensure compatibility between their developments and ours. At least one major genomics G2P database will be brought into operation by our Consortium. The components of this will be passed on to others so that many such databases can be put in place by the end of the project. The genomics databases will be designed to function towards the top of hierarchies wherein resources towards the bottom carry increasingly detailed datasets. As such, GEN2PHEN databases will help compile and channel information from the wide community into the Ensembl browser. A range of integration technologies and data exchange procedures/conventions will underpin, surround, and permeate the databases we wish to build, thus bringing interoperability within the project and with the broader G2P database field.
4. TO CREATE DATA SEARCH AND PRESENTATION SOLUTIONS FOR G2P KNOWLEDGE
A standardised and integrated database layer will make it possible to provide sophisticated and powerful search functionality across an ever greater fraction of all G2P knowledge (another leg of the stool). The databases will be able to reuse common search tools, query interfaces, and data output formats, giving the system the benefits of both branding and familiarity. Search functions and tools will be designed with different users in mind (especially researchers, clinicians, and the general public), with special emphasis on the needs of the medical/diagnostic community. The most unifying aspect of the project, however, will entail providing support for pan-resource searching via the Ensembl platform. This will be achieved by a range of standardised data output/exchange protocols and new browser capabilities, anchored on GRID technologies and further development of the Mart system. User interfaces across the system will be tailored to meet the needs of the relevant communities. This implies providing, at various overlapping levels in the GEN2PHEN system, a gene-to-disease perspective with entity concepts that will be mostly used by researchers, as well as a disease-to-gene view that is built around medical terminologies that will be more relevant to clinicians. A further view that uses lay terms and simpler interrogation systems will be provided for the general public, and this will further connect to other websites that provide medico-genetic data to the public. Atop all of this we will establish chat and discussion fora, through which anyone can debate relevant subject matter, even down to the level of commenting on individual database records. These community inputs will then be made visible alongside core search results when the database network is searched.
5. TO FACILITATE THE POPULATING OF RESEARCH AND DIAGNOSTIC G2P DATABASES
By both tool development and community interactions, we will proactively seek to populate the G2P domain with valuable data (the third leg of the stool), much of which will not otherwise be brought forward (e.g., negative data) or suitably packaged (e.g., the content of journals or raw datasets). We will additionally seek to devise pipelines and protocols that will enable highly informative diagnostic laboratory genetic data to also flow into public G2P databases. Success in these undertakings will be apparent from the growth in data content of the GEN2PHEN databases.
6. TO BUILD A MAJOR G2P INTERNET PORTAL
To provide a global focus for G2P database activities and developments, we will construct a ‘GEN2PHEN Knowledge Centre’. This will be an internet domain that not only summarises our project activities and provides downloads of all our available code/software but also provides access to many other sources of relevant information, hosts calendars and diaries of meetings/activities, enables chat amongst field participants, and offers personalised and holistic search capabilities across the complete G2P internet domain, tailored to the needs of the different communities. This will be seamlessly joined to many of the functions we will set up with Ensembl. Citations and website hits will be used to track the value and usage of this novel ‘main G2P data portal’. A particularly important feature of this Knowledge Centre will be a system that enables users to directly comment upon, and thereby update, contest, or launch a public discussion about, any database record, group of records, or reported observation in the total G2P domain. This original G2P feature will help bring the GEN2PHEN system ‘to life’ and inspire the healthy debate that is the hallmark of productive science.
7. TO DEPLOY GEN2PHEN SOLUTIONS TO THE COMMUNITY
As the technology development work proceeds, we will take steps to interest the community in those developments and enable the community to adopt and use them. Many strategies will be used to achieve this, not least outreach via the GEN2PHEN Knowledge Centre. Much of this deployment work will be based upon the ‘database federation’ concept, the cultural equivalent of the integration technologies that we will be developing. For LSDBs in particular, a community already exists that has started to grow in this direction. Especially in the second half of the project, we expect to devote substantial resources to advertising and explaining our solutions, and to training researchers in their use.
8. TO ADDRESS SYSTEM DURABILITY AND LONG-TERM FINANCING
Questions of durability must be considered. The standards and software we devise will survive for as long as they remain useful. But G2P databases (ours and others) can only survive given a funding stream to resource their maintenance and ongoing development. In the future, many G2P databases may have to be supported by new business models beyond those of academic funding. Academic and industry members of the Consortium will together explore this question, and seek ways by which the innate value of G2P data can be ethically and effectively leveraged to keep such databases growing self-sustainably. The solution may entail devising ways to provide incentives for stakeholders to value and contribute to the G2P database future. A ‘Bio-Resource Impact Factor’ may be relevant here (i.e., an index that quantifies the impact of a given bio-resource), and an ethics panel will work within GEN2PHEN and with other EU projects to actively explore this possibility and report its findings.
9. TO UNDERTAKE A SYSTEM UTILITY AND VALIDATION PILOT STUDY
To objectively track progress and deficiencies in the GEN2PHEN project we will continually cycle versions of a ‘System Utility and Validation’ pilot project. This will focus upon specific genes/diseases of interest in clinical medicine, starting from the perspectives/needs of the diagnostic laboratory. A team will attempt to use GEN2PHEN systems to explore and interpret ‘thematic areas’ of current biomedical importance, for example genetic aspects of cancer. The objective will be to use the GEN2PHEN system to glean a complete picture of what is known or predictable about the thematic area of interest. This will span questions of immediate interest to medical clinicians/diagnosticians, and also move into basic research questions, animal model evidence, and perhaps even other non-DNA domains of biology. Besides assessing the usefulness of the system in a ‘real-life-like scenario’, the pilot will also judge the relevance/utility of GEN2PHEN training activities and the GEN2PHEN Knowledge Centre. This assessment pilot will be run every 12-20 months, delivering reports that will be carefully considered and used to refine and redirect GEN2PHEN activities as necessary.