LSDB minimal requirements from Johan. Discussion with the GEN2PHEN Community 20 Jan 2009, Helsinki
Record of discussion and comments.
Common themes. A common way of representing unknown, no data, not assayed etc with some definitions etc would be useful throughout this document and also for data exchange purposes.
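One way such a shared convention could look is sketched below. The category names follow the wording above (unknown, no data, not assayed); their definitions, and the names `MissingReason` and `FieldValue`, are assumptions for illustration and not an agreed standard.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class MissingReason(Enum):
    """Hypothetical reasons why a field carries no value."""
    UNKNOWN = "unknown"          # assumed meaning: looked for, could not be determined
    NO_DATA = "no_data"          # assumed meaning: nothing was supplied by the source
    NOT_ASSAYED = "not_assayed"  # assumed meaning: the relevant test was not performed


@dataclass(frozen=True)
class FieldValue:
    """A field holds either a value or an explicit reason for its absence."""
    value: Optional[str] = None
    missing: Optional[MissingReason] = None

    def __post_init__(self) -> None:
        # Exactly one of the two must be given, so an empty field is never ambiguous.
        if (self.value is None) == (self.missing is None):
            raise ValueError("provide either a value or a missing reason")
```

With this shape, a receiving database can tell "no value was sent" apart from "the value is explicitly not assayed", which is the distinction raised repeatedly in the discussion below.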
=== Variant/Exon (number) - recommended. ===
What to do when there are multiple transcripts? Must be in the context of the LRG, or other reference sequence.
Exon/introns are both relevant - there is no good word for this.
Is a region of DNA and the role - exon or intron. Or 'splice transcript component' has been used.
This is redundant info that people use to sort the data (e.g. useful in LOVD); sorting on exon name is convenient. Re: LRGs - lab staff want this as this is how people think.
Mauno: This is redundant and can get this programmatically RNA/protein (Christophe thinks that this is usually predictive for proteins which are functionally assayed rather than sequenced)
Ivo:depends whether this is predicted or experimental info
Mauno:suggest that we add some info in the quality and source of the data - assayed/prediction.
Ivo:HGVS includes this, use parenthesis.
Christophe:exon not needed for data exchange, is useful for diagnostic use, for data exchange we can regenerate it.
Ray:0-1 or 0-1e, second is redundant. Default is that if no 'i' included we assume this is exonic.
Andy:The 'i' is a problem as it changes the field from an integer to a string - in LOVD is text anyway. If this becomes std then this is fine. We have an integer, we don't add intron/exon as it is clear from the cDNA nomenclature.
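The convention Ray and Andy describe can be sketched as a small parser. This assumes a label is a whole number optionally followed by 'i' for intron, with a bare number meaning exon; the label syntax and the names `RegionLabel` and `parse_region_label` are assumptions, not an agreed format.

```python
import re
from typing import NamedTuple


class RegionLabel(NamedTuple):
    number: int  # kept as an integer so records still sort numerically
    role: str    # "exon" or "intron"


_LABEL = re.compile(r"(\d+)(i?)")


def parse_region_label(label: str) -> RegionLabel:
    """Split a text label such as '5' or '5i' into a number and a role."""
    match = _LABEL.fullmatch(label.strip())
    if match is None:
        raise ValueError(f"unrecognised exon/intron label: {label!r}")
    number, suffix = match.groups()
    # No 'i' suffix means the region is assumed to be exonic.
    return RegionLabel(int(number), "intron" if suffix == "i" else "exon")
```

Storing the label as text while parsing out the integer on demand keeps the sort order that lab staff rely on (2 before 10) without needing a separate exon/intron field.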
Christophe: if you report the mutation at the cDNA, if genomic there is no transcript info - not useful. Sentence could be added to reflect this.
Jan:why not store genomic position?
Andy:genomic seqs change too much and are hard to work with as the numbers are too large.
Ray:if you have cDNA coords for splice site mutations, HGVS nomenclature is clear that this is donor/acceptor site not clear in genomic coordinates.
Christophe: Still use locally in UMD this info - but here we are to discuss exchange of data. Think about the minimum.
Mauno:different levels are important cDNA, genomic and protein, we use all of them.
=== Variant/DNA genomic ===
Mummi: would be nice to do this from a genomic view rather than a reference sequence
Ray:as genome assemblies are revised these change. LRG will provide stability, and maps to an assembly is handled outside this.
Fiona:The other way is to do as in dbSNP with the flanking region, always would need to be remapped.
Andy: should we have obligatory DNA_genomic OR DNA_coding, not obligatory genomic and recommended DNA coding.
Helen:depends on the reference sequence.
Ray:I use genomic reference seq and present in terms of cDNA coordinates.
Fiona:is that the norm?
Ivo:in LOVD we host, there are no genes specifying genomic.
Christophe:same for UMD. But we can translate these using LRG. Again it's about exchange so redundant.
Fiona:we can translate after exchange if that will make the data flow easier.
Christophe:LRG was designed for genomic reference sequences, we can translate.
Ray:intended to move to LRG ref sequences. But still also have cDNA coordinates in the context of viewing.
Fiona:will contain the cDNA and genomic seq so is the link between the two.
Ray:avoid some of both, be consistent.
Helen:what will the overhead be on moving to LRG.
Christophe:we will just add another field, we will keep the cDNA.
Mauno:issue where evidence should be provided. Predicted/assayed.
Tomasz:how often filled?
Heikki:redundancy can help catch typos. Before validate the protein need the RNA variant, based on the reference sequence.
Christophe:can report multiple splice variants, needs these.
Ray:B globin, is an AA substitution and a splice site. Get some globin with AA subs and other additional frame shift transcripts. One to many needed here.
Ray:is null allowed - not analysed. Needs to be clarified what null means. Ask Johan to be clearer all the way through the document.
Andy:how tell difference between observed and predicted proteins.
Ivo:most people don't test on protein, in LOVD usually without proof and then people don't use brackets. Can check detection technique and see how assayed.
Andy:some genes we have this and is predicted.
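The bracket convention mentioned by Ivo can be checked programmatically. The sketch below assumes simple, single protein descriptions in HGVS style, where parentheses mark a predicted consequence; the function name and the three return labels are assumptions, the example change Arg97Pro is purely illustrative, and compound descriptions (several alleles or changes in one string) are not handled.

```python
def protein_evidence(description: str) -> str:
    """Classify a simple HGVS-style protein description by its brackets."""
    # Drop an optional reference sequence prefix ending in ':'.
    body = description.strip().rpartition(":")[2]
    if not body.startswith("p."):
        raise ValueError(f"not a protein-level description: {description!r}")
    change = body[2:]
    if change == "?":
        return "unknown"
    if change.startswith("(") and change.endswith(")"):
        return "predicted"   # consequence inferred, not experimentally confirmed
    return "observed"        # no brackets: presented as experimentally determined
```

As noted in the discussion, submitters often omit the brackets even when there is no experimental proof, so a result of "observed" would still need checking against the detection technique.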
Ray:3 letter AA code is useful. Ask Johan if the three letter code is mandatory, thinks is more useful, or if single letter code is also allowed. Are both equally valid?
Andy: we need to be clear what 'not analysed' is across all fields - i.e. difference between no value and a ? .
Needs a validation code as does RNA predicted/assayed.
We think database source plus the dbid is needed and also version if specified at source database.
Mike:does that mean that each instance of the same variant has the same id.
Ivo:In LOVD it does, will get multiple patients.
Christophe:when we talk about this was to get a link back to the LSDB, can put anything you want. If you add this field will get a link to the patients with this mutation. Not designed as a unique id to a record. used as a link to get to patients with this record.
Pubmed id or DOI or dbSNP, OMIM?
Need to be clear that there may be several references and the more the better
Heikki:this is merging two things, a database reference and a citation. dbSNP etc is a cross ref to a variant.
Suggest that this is called bibliographic id and we split this from a database reference. May be many of both.
Rasko:Also seen that pubmed etc are treated as citations - these are just xrefs.
Mummi:you still need to resolve this so need a service.
suggest that we have:
and that use both recommended
Christophe:do we need dbSNP when we can get this manually?
Helen:What to do when a dbSNP doesn't map to ensembl?
Fiona:we have failed maps as well so can map to that.
Fiona:do we need both?
Ray:dbSNP maps to a specific substitution, a pubmed id maps to many. They are different things.
Helen:if you were modelling this would split according to cardinalities citations are different from the dbxrefs, get multiple citations as well per variant.
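The split suggested here, with bibliographic ids kept apart from database cross references and many of each allowed per variant, could be modelled roughly as follows. All class and field names are assumptions, and the identifiers used in the usage note are placeholders, not real records.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(frozen=True)
class Citation:
    """A bibliographic id: points at a publication, which may cover many variants."""
    scheme: str      # e.g. "pubmed" or "doi"
    identifier: str


@dataclass(frozen=True)
class DbXref:
    """A database cross reference: points at a record for this specific variant."""
    database: str                  # e.g. "dbSNP" or the source LSDB
    accession: str
    version: Optional[str] = None  # only when the source database provides one


@dataclass
class VariantRecord:
    dna_change: str
    citations: List[Citation] = field(default_factory=list)  # zero or more
    xrefs: List[DbXref] = field(default_factory=list)        # zero or more
```

A record could then carry, for example, `citations=[Citation("pubmed", "PLACEHOLDER-1"), Citation("doi", "PLACEHOLDER-2")]` alongside `xrefs=[DbXref("dbSNP", "PLACEHOLDER-RS")]`, so the first and subsequent papers and the cross references are never mixed in one field.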
Juha:do we need to acknowledge the first paper?
Mauno:need first and subsequent.
Ray:sometimes cite two papers for a single mutation. and first publication is simply a list - need to consider when modelling.
Helen:could reference OMIM for that.
Ray:if they are in OMIM.
Mauno:does OMIM ever publish anything that's not in a paper? Is a secondary reference.
Heikki:doesn't contain anything not published.
Ray:if this is the only place where we can add external references this is variant and patient reference. Could be split elsewhere.
recommended --> optional
Better name for this 'legacy description'
Legacy numbering systems are in the LRG specification, therefore not essential for exchange, suggest optional.
Mauno:not useful. No indication what was the reference for the paper.
Ray:counter argument: when people started numbering aa in collagens decided to number these at 1, at first glycine of triple helical region. Ignored upstream stuff. We now number from initiation M of translated product. Literature are described old way, added a custom field for the legacy numbering system. Without that confusion on a reader who looks at a paper.
Christophe:same experience with CFTR - should we reproduce this, or should we go forward to the correct system. I moved people to the new nomenclature don't want to go back is a concern.
Ray:feel same way. Want to use the new system, at the OI conference, people were not happy.
Andy:CFTR is a good example, clinicians use old labels, we need this. BIC doesn't use HGVS nomenclature, would want to label maps to old system
Ray:suggest change this to optional.
Fiona:people are using old systems still?
YES people still do this.
Fiona:In LRG we can add other naming schemes.
Morris:This can be a multiple reference.
Andy:this is not a clear name, not clear what it's there for - other or legacy is better.
Fiona:other naming is used in LRG.
Christophe:some people are using the wrong reference sequence and made an error in the case where you know it's wrong in the legacy should this be propagated?
Ray:template and method can be missing common in human mutation papers.
WE NEED A COMMON WAY OF DEFINING UNKNOWN across this domain.
Ray:we have not defined in OI database. Is always a value in that case.
Ivo:this is important, if RNA was analysed, provides pathogenicity importance, opinion on this has additional evidence
Christophe:then you will have this in the naming system for RNA nomenclature. People want to know if the whole gene or a subset of the gene was tested. This is different info. RNA is not often sequenced. Useful info may be how much of the sequence has been read. p53 - common exon, then bias in the info. This is redundant with nomenclature.
Ray:If describe at the cDNA level doesn't say what was analyzed.
Christophe:if you seq RNA and cDNA report both nomenclature so is redundant.
NO RESOLUTION REACHED
Ray:also need to say that the technique was not recorded.
Andy:with an LSDB it is implied in the data that only the relevant gene or even a part of gene is sequenced. When we transfer to large datasets we need to show what part of the genome was sequenced.
Andy: although multiple techniques may be used for a sample we do not give the history, we just describe the technology used to actually characterise the variant as it is reported.
Andy:have a list of techniques which can be used for this (can compare these to Johan's list).
Cross project connectivity - techniques could be added to the Protocol database from Ulf Landegren.
===Variant/DNA_remark ->remark or comment===
Mike:is there an example of this, what is this.
Heikki:any comment so DNA can be removed from the name. Typically is always about variant, not patients. e.g. family of three analysed and all were identical.
Ray:mutation incorrectly described in x citation.
Add what population is relevant - local DB content or external source e.g. HapMap
Mummi:rather than a min info in free text, report a relative frequency.
Helen:do you need to say what the population is? Needs to be clear if this is local or global.
Andy:is there a standard way to represent this data in dbSNP? Should align with dbSNP.
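A structured alternative to free text, along the lines Mummi and Helen suggest, might record the counts together with the population and whether the source is local or external. This is only a sketch: the field names are assumptions and it is not aligned with any dbSNP format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VariantFrequency:
    allele_count: int    # times the variant allele was seen
    alleles_tested: int  # total alleles examined in this population
    population: str      # which population the counts refer to
    source: str          # "local" (this database) or "external" (e.g. HapMap)

    def __post_init__(self) -> None:
        if self.alleles_tested <= 0:
            raise ValueError("alleles_tested must be positive")
        if not 0 <= self.allele_count <= self.alleles_tested:
            raise ValueError("allele_count must lie between 0 and alleles_tested")
        if self.source not in ("local", "external"):
            raise ValueError("source must be 'local' or 'external'")

    @property
    def relative_frequency(self) -> float:
        return self.allele_count / self.alleles_tested
```

Keeping the raw counts rather than only the ratio lets a receiving database judge how much weight a reported frequency deserves.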
Remove sporadic as a separate term. Purpose needs clarification - is this inheritance or the origin of the mutation. How did it arise vs how inherited. Also many more options in the list in the appendix, maternal and paternal. This is patient variant info, not just variant. Needs to be clearer. List is redundant with allele, these two are not cleanly separated. Is the disease and the variant assumed in this case, needs to be clearer.
Ray:Are sporadic/de novo different?
Helen:is this including the parent?
Ivo:is in the allele field.
Ray:is inherited/not inherited, get germ line mosaicism as well. also needs to be covered. Also in cancer causing genes these are somatic.
Mike:clear from survey that people want segregation info, not sure how useful this is if doesn't also say that they have the disease.
NO RESOLUTION, see comments
Typically a list of sites, clarify definition.
Some redundancy with Variant/origin needs to be clarified. Suggestion that this can be optional. Need a way to link two variants with two parents. Labelling confusing. Suggests is a property of variant and is a property of the variable and allele.
Heikki:most common disorders are not imprinted so this may be optional.
Andy:if have two variants need to link two variants to two parents. This text doesn't do that. need a way to link two records together.
Andy: is it likely that we will all agree on the same pathogenicity values?
Ray:this is not the phenotype is about the variant, not linked to a patient.
Mauno:add evidence - e.g. list of GO codes that may work in this space.
Purpose needs clarifying what about cases where there are multiple samples taken, does this need another field. Is this an internal id? external id? Make a recommendation that the language is changed to include anonymised or non identifiable ids NOT a lab id. Use these consistently when reporting.
Ray:where there is an id in a citation I record that id. Where there is no id I don't add that. I add an anon id to people who are submitting data to the database, map to anon no. They keep the mapping.
Christophe:same in UMD is anonymised id. Not a lab ID.
Mauno:maybe that is not available, change to recommended.
Ray:there are cases where there is no patient id.
Heikki:people ignore the samples per patient issue.
Problem that the initial phenotype and e.g. updated phenotypes - e.g. when a patient matures and gets onset of disease. Need a way to update this in light of diagnosis.
Helen: presented as singular, is probably multiple.
Andy:reason the patient is referred may be a diagnosis which may change - Phenotype is a list of observable facts.
Helen:description using OMIM is probably the worst thing that can be used. Make a general statement about a recognised vocab.
Christophe:we do not store the original info - people can misdescribe the patient. If I put DMD this is wrong. We collect only validated information.
Ray:Need a way then to describe the primary phenotype and a concluded one.
unclear - change text to anything about the patient, phenotype suggests redundancy with the above statement. Needs work.
No problem with the idea of a comment.
We need to know what origin means - does it means the place they were born, town, country. Need clarity.
Christophe:if you add too much info you may be able to trace the info if rare enough.
Needs a vocab clearly.
MH:looking at the NCI meta data repository, there is a difference between race and ethnicity. E.g. hispanic, and race can get white, asian. Hispanic has many origins. caDSR.
Andy:clinical geneticists came up with 'additional relevant ancestry'. Not record unless relevant to the disease and can be very specific. Ashkenazi jewish for e.g. is relevant, in other cases is not. Change to include this.
Heikki:use it when you have to, use an existing vocab.
Christophe:this is sensitive data, use it when you are allowed to, when you have consent for this. May reflect on recommended.
ISO std for this that includes the genetic disorders. Need to add unspecified.
Andy adds - standard is as follows: 'The most common internationally recognized standard for gender is ISO 5218. It is limited to values corresponding to Male (1), Female (2), Not Known (0) and Not Specified (9).' Not Known means no information has been given, not specified means that the gender is non-specific (i.e. cannot be determined as male or female for whatever reason).
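The four codes quoted above map directly onto an enumeration. In this sketch the labels follow the wording of Andy's note; the names `Iso5218Code` and `parse_iso5218` are assumptions.

```python
from enum import IntEnum


class Iso5218Code(IntEnum):
    NOT_KNOWN = 0      # no information has been given
    MALE = 1
    FEMALE = 2
    NOT_SPECIFIED = 9  # cannot be determined as male or female


def parse_iso5218(raw) -> Iso5218Code:
    """Convert an exchanged value (text or integer) into one of the four codes."""
    try:
        return Iso5218Code(int(str(raw).strip()))
    except ValueError:
        raise ValueError(f"not an ISO 5218 code: {raw!r}") from None
```

Rejecting any other value at import time means an empty or free-text gender field cannot silently pass through an exchange.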
Why is this obligatory? Does this add anything if exchanged, surely this can be tracked other ways. Many people curate from the literature.
Issues with credit, and also sharing contact details - maybe change to recommended as people may choose not to have this shared. If only an id is needed and not contact info then this adds little for data exchange.
Mummi:contributor concept issues of contributions. Microattribution get dozens of names per variant. Is this the minimum?
Additional Item; which regions were assayed when detecting a mutation, sequences, exons?
Andy and Christophe want to know what was sequenced/assayed e.g. genome, part gene, a single exon, as an extra field in the proposed list.