The GEN2PHEN project aims to unify human and model organism genetic variation databases towards increasingly holistic views into Genotype-To-Phenotype (G2P) data, and to link this system into other biomedical knowledge sources via genome browser functionality.




GEN2PHEN is funded by the Health Thematic Area of the Cooperation Programme of the European Commission within the VII Framework Programme for Research and Technological Development.

Project Summary and Objectives

The GEN2PHEN project aims to unify human and model organism genetic variation databases towards increasingly holistic views into Genotype-To-Phenotype (G2P) data, and to link this system into other biomedical knowledge sources via genome browser functionality. By the project’s end, it will have established the technological building blocks needed for the evolution of today’s diverse G2P databases into a future seamless G2P biomedical knowledge environment. This will consist of a European-centred but globally networked hierarchy of bioinformatics GRID-linked databases, tools and standards, all tied into the Ensembl genome browser. The project has the following specific objectives:

The GEN2PHEN Consortium members have been selected from a talented pool of European research groups and companies that are interested in the G2P database challenge. Additionally, a few non-EU participants have been included to bring extra capabilities to the initiative. The final constellation is characterised by broad and proven competence, a network of established working relationships, and high-level roles/connections within other significant projects in this domain.

Background and Concept

By providing a complete Homo sapiens ‘parts list’ (the gene sequences) and a powerful ‘toolkit’ (technologies), the Human Genome Project has revolutionised mankind’s ability to explore how genes cause disease and other phenotypes. Studies in this domain are proceeding at a rapid and ever-increasing pace, generating unprecedented amounts of raw and processed data. It is now imperative that the scientific community finds ways to effectively manage and exploit this flood of information for knowledge creation and practical benefit to society. This fundamental goal lies at the heart of the “Genotype-To-Phenotype Databases: A Holistic Solution (GEN2PHEN)” project.


Previous genetics studies have shown that inter-individual genome variation plays a major role in differential normal development and disease processes. However, the details of how these relationships work are far from clear, even in the case of most Mendelian disorders where single genetic alterations are fully penetrant (essentially causative, rather than risk modifying). Background genetic effects (modifier genes), epistasis, somatic variation, and environmental factors all complicate the situation. This is particularly the case in complex, multi-factorial disorders (e.g., cancer, heart disease, diabetes, dementia) that will affect most of us at some stage in our lifetime. Strategies do, however, now exist to study the genetics of these disorders, and such investigations are a major focus of research throughout Europe and beyond. A common thread in these studies is the need to create ever-larger datasets and integrate these more effectively.

Success in deciphering the mechanisms and pathways underpinning genotype-to-phenotype (G2P) relationships will bring about radical new opportunities for predicting, preventing, diagnosing, and treating all forms of illness. It will launch an era of truly effective personalised medicine. Extensive research is therefore being conducted worldwide to characterise genetic variation in normal and disease contexts. Sadly though, the resulting flood of primary information is not yet being managed or utilised as effectively as it should be, simply because there is no sufficiently organised and mature database infrastructure by which the discoveries can be gathered, stored, integrated and queried as a composite whole in the electronic (internet) domain. Furthermore, whilst new positive findings are being handled sub-optimally, ‘negative’ observations are in most cases not reported at all, despite the fact that they constitute an essential part of any complete and accurate G2P depiction. This needs to change, and an international ‘Human Variome Project’ (HVP) has emerged to help argue this case.

It is against this backdrop that the GEN2PHEN project aims to become the key European contribution to the challenges listed above, harmonised with similar projects elsewhere, and dovetailed into many related European programmes of work. It will provide an important and timely solution to a current research need that was highlighted by the European Strategy Forum on Research Infrastructures (ESFRI) - Priority area: ‘Upgrade of European Bio-Informatics Infrastructure (Shared platform for data resources in the Life Sciences)’. It will provide European G2P research and biotech industries with the proper support they need in terms of database technologies and data integration systems. Only then can our societies maximally benefit from the current exponentially increasing rate of genetic data generation in disease research and clinical settings.

Future Vision - Current Reality

Looking to the future, one can imagine a world wherein ‘omics’ biomedical sciences are commonplace, even to the point of having one’s genome sequenced in routine medical checkups. In this envisaged world, phenomenally large amounts of G2P data will be produced daily, much of which will flow effortlessly into the internet to be fully absorbed into a sophisticated and powerful ‘biomedical knowledge environment’. Some of this information will be secured for restricted access, whilst much of the raw data and the derived knowledge should be free for everyone to search and exploit.

The system will enable extensive scientific reporting and discussions, provide a core reference platform for medical practice, and open exciting new operational vistas for journals, industry, and funders. It will provide for and underpin activities in biomedical research, biotechnology, drug development, and personalised healthcare. It will probably even impact our basic cultural practices (e.g., insurance, the law, employment policies) as society comes to grips with the immense power and relevance of genetics to the human state. But this envisaged future is nothing like the world we presently live in.

No system yet exists that even begins to approximate to a ‘biomedical knowledge environment’ properly able to support G2P data gathering and analysis. There are instead a limited number of unconnected G2P databases that are mostly at rather early stages in their development, with no agreed structured way of effectively modelling phenotype data or G2P relationships, and no convenient mode for passing data from discovery laboratories into the database world. A few recent initiatives are building large databases to host individual-specific genotypes and phenotypes to support some high-throughput disease association studies, but these do not have a global remit, have not engaged with the extensive existing knowledge from Mendelian disorders, and are not focused on all the research and clinical communities around G2P. Most progress has arguably been made with locus-specific databases (LSDBs) that target specific diseases or genes, but the vast majority of the several hundred LSDBs that do exist are rudimentary in design and implementation, and operationally isolated from one another. This all contrasts with the situation for databases concerned with purely genetic data (without phenotype association), of which there are many, including several large data warehouses and genome browsers that act as central repositories and search centres for all the human and model organism genome sequences, variants, and feature annotations yet produced.

There are a number of reasons why the G2P database field is so poorly developed. Problems include the complexity/diversity of the pertinent data elements, the contemporary nature of the challenge, and certain practical/cultural issues. However, perhaps the most critical obstacle is the overwhelming scale of the problem. Whereas the genome is a bounded domain of only ~3,000,000,000 nucleotides and ~25,000 genes (in man), there is essentially no limit to the number of G2P relationships that can be examined, each by multiple different procedures. The former is thus relatively straightforward and can be managed and hosted in one or a few large data depositories (as has been accomplished). In contrast, the latter is too large in scale and scope to handle in this way.
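The contrast in scale can be made concrete with a little arithmetic. The sketch below uses only the approximate gene count quoted above; the phenotype and method counts are hypothetical placeholders chosen purely for illustration, not figures from the project. It suggests how quickly the number of candidate relationships grows once gene combinations (for example, epistatic pairs) are considered.

```python
from math import comb

GENES = 25_000  # approximate human gene count quoted in the text

# Hypothetical placeholders, for illustration only (not project figures).
PHENOTYPES = 10_000
METHODS_PER_TEST = 3


def candidate_relationships(genes: int, phenotypes: int, methods: int, order: int = 1) -> int:
    """Count (gene combination, phenotype, method) tests for a given interaction order."""
    return comb(genes, order) * phenotypes * methods


single = candidate_relationships(GENES, PHENOTYPES, METHODS_PER_TEST, order=1)
pairwise = candidate_relationships(GENES, PHENOTYPES, METHODS_PER_TEST, order=2)

print(f"gene pairs alone:           {comb(GENES, 2):,}")
print(f"single-gene tests:          {single:,}")
print(f"pairwise (epistatic) tests: {pairwise:,}")
```

Even with these modest placeholder values, the pairwise case already exceeds the number of nucleotides in the genome by several orders of magnitude, and nothing bounds the number of phenotypes or methods that could be added.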

There is virtually no limit to how many G2P data will eventually be created, or to their diversity or purpose. The database solutions for G2P information must therefore be based upon new ways of thinking and organising the field’s development - emphasising standards, integration, federation, and broad community participation from the very outset.

The GEN2PHEN Strategy

The GEN2PHEN project has the overall ambition of unifying human and model organism genetic variation databases, and doing this in such a way that the resulting holistic view of G2P data can be blended with all other biomedical database domains via one or more central genome browsers. The project will put in place the main building blocks needed to move substantially from today’s G2P database situation towards the ultimate future of a complete biomedical knowledge environment. The project will then utilise these building blocks to construct a first-generation version of a G2P knowledge environment by the project’s end. This will consist of a European-centred but globally networked hierarchy of bioinformatics GRID-linked databases, tools and standards, all tied into the Ensembl genome browser. To ensure the project builds something that truly works and tangibly benefits the community, rather than merely devising potentially useful technologies, we have focussed the project’s objectives on the three essential components of a functioning G2P database system. These can be viewed as three legs of a ‘stool’, each of which must be robust for the stool to properly function (see Figure 1).




We recognise that other work is going on in the field, and that different users have related but different needs. Our Consortium comprises representatives of each sector currently building G2P databases, and we have many deep connections into the broader G2P community. We shall utilise these skills and relationships to ensure that our activities match the latest needs and progress of others, and to gain community trust and acceptance of the GEN2PHEN system. This will be achieved by broad opinion gathering and open discussion with the community from the project’s outset. This will lead to state-of-the-art documents that describe the general progress of the field and the specific data models and technologies that are particularly favoured and effective. GEN2PHEN itself will almost certainly have a big influence on these things, but we will adapt our work as necessary to maximally interoperate with external developments.


From an intimate knowledge of what others are doing, we will develop data models, nomenclature, and technology standards that will be building blocks for us and the community. We will not develop ontologies ourselves, but connect to and be led by the various expert groups doing this for the G2P domain. Each finalised standard will be formally documented and, wherever possible, registered with an independent body to make it an official global standard.


Based upon GEN2PHEN-derived and other emerging standards, we will build generic database components and a deeply networked infrastructure (one leg of the stool). This will include solutions for genetic (gene or disease-specific) and genomic (whole genome) databases, with appropriately styled interfaces for the target communities: namely, for biomedical researchers, clinical practitioners, and the general public. The genetic database work will concentrate upon providing one or more ‘LSDB-in-a-box’ applications, so enabling anyone to easily set up an LSDB for their gene/disease of interest. We will also establish an LSDB hosting service for those who prefer this way of proceeding. The genomics database work will concentrate upon providing components for flexible and future-proof database implementations that support summary-level G2P datasets. We will not target support for individual-level G2P datasets, as databases for these are already being constructed to support large-scale genetic association studies and medical re-sequencing projects. Instead, we are already partnered with such groups and we will ensure compatibility between their developments and ours. At least one major genomics G2P database will be brought into operation by our Consortium. The components of this will be passed on to others so that many such databases can be put in place by the end of the project. The genomics databases will be designed to function towards the top of hierarchies wherein resources towards the bottom carry increasingly detailed datasets. As such, GEN2PHEN databases will help compile and channel information from the wide community into the Ensembl browser. A range of integration technologies and data exchange procedures/conventions will underpin, surround, and permeate the databases we wish to build, thus bringing interoperability within the project and with the broader G2P database field.
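To illustrate the distinction drawn above between individual-level and summary-level G2P data, the following minimal sketch collapses per-individual observations into per-variant allele counts. The record fields, variant labels, and sample values are hypothetical and do not reflect any GEN2PHEN data model.

```python
from collections import Counter
from typing import Iterable, NamedTuple


class Observation(NamedTuple):
    """One individual's genotype at one variant (hypothetical structure)."""
    individual_id: str
    variant: str       # illustrative variant label
    alt_alleles: int   # 0, 1 or 2 copies of the alternative allele
    affected: bool     # phenotype status


def summarise(observations: Iterable[Observation]) -> dict:
    """Reduce individual-level rows to allele counts per variant and status."""
    alt = Counter()
    total = Counter()
    for obs in observations:
        key = (obs.variant, "cases" if obs.affected else "controls")
        alt[key] += obs.alt_alleles
        total[key] += 2  # assumes diploid loci: two alleles per individual
    return {key: {"alt_alleles": alt[key], "total_alleles": total[key]} for key in total}


# Made-up sample rows, for illustration only.
rows = [
    Observation("i1", "var_A", 1, True),
    Observation("i2", "var_A", 2, True),
    Observation("i3", "var_A", 0, False),
    Observation("i4", "var_A", 1, False),
]
summary = summarise(rows)
print(summary)
```

The summary retains no individual identifiers, which is one reason a summary-level resource can sit higher in a hierarchy than the more detailed databases that feed it.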


A standardised and integrated database layer will make it possible to provide sophisticated and powerful search functionality across an ever greater fraction of all G2P knowledge (another leg of the stool). The databases will be able to reuse common search tools, query interfaces, and data output formats, giving the system the benefits of both branding and familiarity. Search functions and tools will be designed with various different users in mind (especially researchers, clinicians, and the general public), with special emphasis on the needs of the medical/diagnostic community. The most unifying aspect of the project, however, will entail providing support for pan-resource searching via the Ensembl platform. This will be achieved by a range of standardised data output/exchange protocols and new browser capabilities, anchored on GRID technologies and further development of the Mart system. User interfaces across the system will be tailored to meet the needs of the relevant communities. This implies providing, at various overlapping levels in the GEN2PHEN system, a gene-to-disease perspective with entity concepts that will be mostly used by researchers, as well as a disease-to-gene view that is built around medical terminologies that will be more relevant to clinicians. A further view that uses lay terms and simpler interrogation systems will be provided for the general public, and this will further connect to other websites that provide medico-genetic data to the public. Atop all of this we will establish chat and discussion fora, by which anyone can debate relevant subject matters, even down to the level of commenting on individual database records. These community inputs will then be made visible alongside core search results when the database network is searched.
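The gene-to-disease and disease-to-gene perspectives described above can be thought of as two indexes built over the same set of records. This is a sketch only; the record layout and the placeholder gene and disease names are assumptions made for illustration, not part of any GEN2PHEN specification.

```python
from collections import defaultdict

# Hypothetical association records with placeholder names.
RECORDS = [
    ("GENE_A", "disease_x"),
    ("GENE_A", "disease_y"),
    ("GENE_B", "disease_x"),
]


def build_views(records):
    """Return (gene_to_disease, disease_to_gene) indexes over the same records."""
    gene_to_disease = defaultdict(set)
    disease_to_gene = defaultdict(set)
    for gene, disease in records:
        gene_to_disease[gene].add(disease)
        disease_to_gene[disease].add(gene)
    return dict(gene_to_disease), dict(disease_to_gene)


gene_view, disease_view = build_views(RECORDS)
print(sorted(gene_view["GENE_A"]))        # researcher-style lookup
print(sorted(disease_view["disease_x"]))  # clinician-style lookup
```

Because both views are derived from one underlying record set, they stay consistent with each other; only the entry point and the vocabulary presented to the user differ.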


Through both tool development and community interactions, we will proactively seek to populate the G2P domain with valuable data (the third leg of the stool), much of which would not otherwise be brought forward (e.g., negative data) or suitably packaged (e.g., the content of journals or raw datasets). We will additionally seek to devise pipelines and protocols that will enable highly informative diagnostic laboratory genetic data to also flow into public G2P databases. Success in these undertakings will be apparent from the growth in data content of the GEN2PHEN databases.


To provide a global focus for G2P database activities and developments, we will construct a ‘GEN2PHEN Knowledge Centre’. This will be an internet domain that not only summarises our project activities and provides downloads of all our available code/software, but also provides access to many other sources of relevant information, hosts calendars and diaries of meetings/activities, enables chat amongst field participants, and offers personalised and holistic search capabilities to the complete G2P internet domain, tailored to the needs of the different communities. This will be seamlessly joined to many of the functions we will set up with Ensembl. Citations and website hits will be used to track the value and usage of this novel ‘main G2P data portal’. A particularly important feature of this Knowledge Centre will be a system that enables users to directly comment upon, and thereby update, contest, or launch a public discussion about, any database record, group of records, or reported observation in the total G2P domain. This totally original G2P feature will help bring the GEN2PHEN system ‘to life’ and inspire healthy debate, which is the hallmark of productive science.


As the technology development work proceeds, we will take steps to interest the community in those developments and enable the community to adopt and use them. Many strategies will be used to achieve this, not least outreach via the GEN2PHEN Knowledge Centre. Much of this deployment work will be based upon the ‘database federation’ concept - the cultural equivalent of the integration technologies that we will be developing. For LSDBs in particular, a community already exists that has started to grow in this direction. Especially in the second half of the project, we expect to devote substantial resources to advertising and explaining our solutions, and to training researchers in their use.


Questions of durability must be considered. The standards and software we devise will survive for as long as they remain useful. But G2P databases (ours and others) can only survive given a funding stream to resource their maintenance and ongoing development. In the future, many G2P databases may have to be supported by new business models beyond those of academic funding. Academic and industry members of the Consortium will together explore this question, including ways by which the innate value of G2P data can be ethically and effectively leveraged to keep such databases growing self-sustainably. The solution may entail devising ways to provide incentives for stakeholders to value and contribute to the G2P database future. A ‘Bio-Resource Impact Factor’ may be relevant here (i.e., an index that quantifies the impact of a given bio-resource), and an ethics panel will work within GEN2PHEN and with other EU projects to actively explore this possibility and report its findings.


To objectively track progress and deficiencies in the GEN2PHEN project we will continually cycle versions of a ‘System Utility and Validation’ pilot project. This will focus upon specific genes/diseases of interest in clinical medicine, starting from the perspectives/needs of the diagnostic laboratory. A team will attempt to use GEN2PHEN systems to explore and interpret ‘thematic areas’ of current biomedical importance, for example, genetic aspects of cancer. The objective will be to use the GEN2PHEN system to glean a complete picture of what is known or predictable about the thematic area of interest. This will span questions of immediate interest to medical clinicians/diagnosticians, and also move into basic research questions, animal model evidence, and perhaps even other non-DNA domains of biology. Besides providing assessment of the usefulness of the system in a ‘real-life-like scenario’, the pilot will also judge the relevance/utility of GEN2PHEN training activities and of the GEN2PHEN Knowledge Centre. This assessment pilot will be run every 12-20 months, delivering reports that will be carefully considered and used to refine and redirect GEN2PHEN activities as necessary.