Relating PRO to UniProt #165

Open
nataled opened this issue Oct 2, 2019 · 8 comments
Labels
Policy Discussions about PRO policies

Comments

@nataled
Collaborator

nataled commented Oct 2, 2019

This issue is a continuation of the discussion here:
geneontology/neo#34

This thread will focus on:

  1. What do the UniProt PURLs denote: database entry, protein class, or sequence?
  2. How does PRO relate to UniProt?

Interested parties (so far):
@JervenBolleman
@cmungall
@goodb
@alanruttenberg

@cmungall

cmungall commented Oct 2, 2019

I would very much like there to be a single URI for a concept like "human Shh protein" (or at least two equivalent interchangeable URIs).

@nataled
Collaborator Author

nataled commented Oct 2, 2019

This will be possible once we find out just what the UniProt PURLs are intended to mean. I recall @JervenBolleman saying he considers them to mean the same as PRO when he gives talks, but I'm not sure there's agreement on that (several people on the previous thread, myself included, indicated that they consider them as referring to database entries). In PRO we consider them exactly that: database entries that are about some protein class (for example, http://purl.uniprot.org/uniprot/P05067 is_about http://purl.obolibrary.org/obo/PR_P05067).
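As a minimal sketch of that is_about statement, the snippet below writes it as one N-Triples line. It assumes the standard OBO PURL base `http://purl.obolibrary.org/obo/` for PRO and that is_about corresponds to the IAO relation `IAO_0000136`; the function name is hypothetical.

```python
# Sketch only: one N-Triples line saying a UniProt entry is_about a PRO class.
UNIPROT_BASE = "http://purl.uniprot.org/uniprot/"
PRO_BASE = "http://purl.obolibrary.org/obo/PR_"  # assumed standard OBO PURL base
IS_ABOUT = "http://purl.obolibrary.org/obo/IAO_0000136"  # assumed: IAO "is about"


def is_about_triple(accession: str) -> str:
    """Return an N-Triples line linking the UniProt PURL to the PRO class."""
    subject = f"<{UNIPROT_BASE}{accession}>"
    predicate = f"<{IS_ABOUT}>"
    obj = f"<{PRO_BASE}{accession}>"
    return f"{subject} {predicate} {obj} ."


print(is_about_triple("P05067"))
```

This treats the UniProt PURL as an information entity (the database entry) and the PRO PURL as the protein class it describes, which is only one of the readings under discussion.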

My main concern is that the UniProt PURLs might be overloaded in meaning. That is, some people consider them to refer to classes of proteins, some say they refer to database entries, and others might consider them as referring to sequences. If they are database entries, fine, but for PRO purposes we'll need a way to refer to the sequence. If they are protein classes, fine, we'll provide the appropriate equivalency statements, but we'll still need a way to refer to the sequence. If they are sequences, fine, we'll make the appropriate connection. I recall @cmungall suggesting that for the sequences we use a URL such as https://www.uniprot.org/uniprot/P05067.fasta?version=1. That would be fine, but there are also these things: http://purl.uniprot.org/isoforms/P05067-1. I asked if that PURL is intended to represent the (current) sequence, or intended to represent the class of proteins derived from that isoform. I did not get an answer.
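To keep the three candidate identifiers apart, here is a small sketch that builds each URL form mentioned above from an accession. The URL patterns are the ones quoted in this comment; the function name, dictionary keys, and defaults are hypothetical, and nothing here settles what each URL denotes.

```python
def candidate_identifiers(accession: str, isoform: int = 1, version: int = 1) -> dict:
    """Build the three URL forms discussed in this thread for one accession."""
    return {
        # Entry PURL: database entry, protein class, or sequence (the open question)
        "entry_purl": f"http://purl.uniprot.org/uniprot/{accession}",
        # Isoform PURL: current sequence or class of proteins (also unanswered)
        "isoform_purl": f"http://purl.uniprot.org/isoforms/{accession}-{isoform}",
        # Versioned FASTA URL: suggested as a way to pin one specific sequence
        "versioned_fasta": f"https://www.uniprot.org/uniprot/{accession}.fasta?version={version}",
    }


for name, url in candidate_identifiers("P05067").items():
    print(f"{name}: {url}")
```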

@cmungall

cmungall commented Oct 2, 2019

[broken record]
I think the whole question of referring to database entries is a red herring. http://purl.obolibrary.org/obo/GO_0097194 refers to a database entry, for a term in GO. It has databasey properties like identifiers, and xrefs, and information about which curator created it. But it's also a representation of a repeatable thing in nature. Ultimately we're all in the business of representing things in nature here, and at the same time doing database/ontology curation.

Our IDs can do dual duties as representing database entities and things in nature. There is no need to get meta and introduce an extra layer of indirection. Or at least I am not aware of such a use case, where someone really needs to track both these things and keep them distinct.
[/broken record]

I think the sequence vs. protein molecule aspect is a bit more nuanced.

@nataled
Collaborator Author

nataled commented Oct 2, 2019

I believe you missed my point. It isn't that I am introducing a layer. The question is "What kind of entity does UniProt consider its entries to be?" And one possible answer is..."Database entries."

@nataled
Collaborator Author

nataled commented Oct 3, 2019

@cmungall asked "What are the semantics of a non-GCRP trembl ID according to PRO?"

TrEMBL entries fall into the following types:

A) If there already exists a Swiss-Prot entry describing the products of some gene G (SP_of_G), then the TrEMBL entry describing a product of the same gene (Tr_of_G) can be:

  1. A sequence variant (allele) of G. These would be Tr_of_G is_a SP_of_G
  2. An isoform of G. These would be Tr_of_G is_a SP_of_G

B) If no Swiss-Prot entry describes the products of the TrEMBL gene, then the TrEMBL entry describing a product of that gene (Tr_of_G) can be:

  1. The 'proto-canonical' sequence (either because there is no other entry describing a product of that gene, or because it has the longest sequence among all TrEMBL entries with that gene). We'll call these TrC_of_G. In this case TrC_of_G is_a protein (or whatever level is appropriate). I describe this only for completeness; these are (or should be) part of the GCRP set.
  2. A sequence variant (allele) of that gene (TrV_of_G). Then, TrV_of_G is_a TrC_of_G.
  3. An isoform of G (TrI_of_G). Then, TrI_of_G is_a TrC_of_G.

C) If no gene is indicated in the TrEMBL entry (call it TrX), then...

  1. TrX is_a protein (if no species non-specific parent can be found).
  2. TrX is_a =species non-specific parent=

Technically speaking, TrEMBL entries (like some Swiss-Prot entries) can also describe fragments.
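The cases A to C above can be sketched as a small decision function. This is an illustration under stated assumptions, not PRO's actual pipeline: the `TremblEntry` fields and the `parent_of` function are hypothetical, fragments are ignored, and the B1 case simply returns "protein" even though a more specific level may be appropriate.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TremblEntry:
    gene: Optional[str]                   # gene named in the entry, if any
    swissprot_exists: bool = False        # a Swiss-Prot entry covers this gene
    is_proto_canonical: bool = False      # only or longest TrEMBL entry for the gene
    generic_parent: Optional[str] = None  # species non-specific parent, if known


def parent_of(entry: TremblEntry) -> str:
    """Return the is_a parent of a TrEMBL entry following cases A, B, and C."""
    if entry.gene is None:              # case C: no gene indicated
        return entry.generic_parent or "protein"
    if entry.swissprot_exists:          # case A: variant or isoform of SP_of_G
        return f"SP_of_{entry.gene}"
    if entry.is_proto_canonical:        # case B1: the proto-canonical entry itself
        return "protein"
    return f"TrC_of_{entry.gene}"       # cases B2 and B3: variant or isoform


print(parent_of(TremblEntry(gene="G", swissprot_exists=True)))
print(parent_of(TremblEntry(gene="G")))
print(parent_of(TremblEntry(gene=None)))
```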

@cmungall

cmungall commented Oct 4, 2019

I'm going to post a strawman proposal:

PRO gene-level protein classes and UniProt canonical/GCRP entries are to be considered equivalent in the strict OWL sense. (Ergo, the URIs could be collapsed with no loss of logical entailment and no introduction of inconsistency. This would be a win, as the community would not have to make an arbitrary selection between two distinct PURLs/CURIEs.)

Ontologically these are protein classes, which are material entity classes (as is currently the case in PRO)

(The UniProt docs talk about these as sequences, which is perfectly valid as the main use case for these involves treating them as sequences, but in the ontological treatment, the sequence would be a property of the material entity.)

They are the superclasses of isoform classes (as they are now, in PRO)

The isoform-level classes in PRO would be equivalent to the UniProt isoform entries (e.g. P12345-1)
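A minimal sketch of the axioms this strawman would imply, written as plain (subject, predicate, object) triples. It assumes PRO IRIs of the form `http://purl.obolibrary.org/obo/PR_<accession>` and `PR_<accession>-<n>` for isoforms; the function name is hypothetical and no RDF library is used.

```python
OWL_EQUIVALENT_CLASS = "http://www.w3.org/2002/07/owl#equivalentClass"
RDFS_SUBCLASS_OF = "http://www.w3.org/2000/01/rdf-schema#subClassOf"


def strawman_axioms(accession: str, isoforms: list) -> list:
    """List the equivalence and subclass triples proposed in the strawman."""
    pro_gene = f"http://purl.obolibrary.org/obo/PR_{accession}"  # assumed IRI form
    up_entry = f"http://purl.uniprot.org/uniprot/{accession}"
    triples = [(pro_gene, OWL_EQUIVALENT_CLASS, up_entry)]
    for n in isoforms:
        pro_iso = f"{pro_gene}-{n}"  # assumed PRO isoform IRI form
        up_iso = f"http://purl.uniprot.org/isoforms/{accession}-{n}"
        triples.append((pro_iso, OWL_EQUIVALENT_CLASS, up_iso))
        triples.append((pro_iso, RDFS_SUBCLASS_OF, pro_gene))
    return triples


for s, p, o in strawman_axioms("P12345", [1, 2]):
    print(f"<{s}> <{p}> <{o}> .")
```

If the equivalences hold, either IRI in each pair could stand in for the other without changing what a reasoner infers, which is the point of collapsing them.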

There could be some kind of has-canonical-form relationship between the main class and isoform-1 (see http://purl.obolibrary.org/obo/RO_0002214)

Note that at the database level, the canonical entry will have annotations for things such as protein domains, functions, etc. At the ontological level this will not be taken to mean that all instances of that protein have those properties; otherwise we end up with logical inconsistencies. Instead it will be a some-some relationship: some instances of the class have some such property.
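A toy illustration of why the all-instances reading breaks while the some-some reading holds. The instance names and domain labels below are invented for the example and do not describe any real protein.

```python
# Hypothetical instances of one protein class, each with the domains it carries.
instances = {
    "full_length_molecule": {"signal_peptide", "kinase_domain"},
    "cleaved_molecule": {"kinase_domain"},
    "truncated_molecule": set(),
}


def all_have(domain: str) -> bool:
    """All-some reading: every instance of the class has the domain."""
    return all(domain in domains for domains in instances.values())


def some_have(domain: str) -> bool:
    """Some-some reading: at least one instance of the class has the domain."""
    return any(domain in domains for domains in instances.values())


print("all instances have kinase_domain:", all_have("kinase_domain"))
print("some instance has kinase_domain:", some_have("kinase_domain"))
```

An annotation on the canonical entry that is asserted for every instance would be contradicted by the truncated molecule; the weaker some-some statement stays true.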

Note that neither resource needs to make any changes to implement this. It would be a semantic MOU about ontological commitment of PURLs. And both would agree not to publish logical axioms that introduce logical inconsistencies.

However, if both parties agree, then there is a strong case for PRO switching from PRO PURLs for gene-level classes to UniProt PURLs.

@cmungall

I don't know if this will be discussed at the PRO meeting this week. I may not have time after today for any questions, but @goodb, @balhoff, @ukemi, @deustp01 may be able to help.

@nataled
Collaborator Author

nataled commented Oct 21, 2019

Unfortunately, the PRO meeting is heavily focused on preparing for work proposed as part of an upcoming grant, and will be rather high level. It is possible (and likely) that this can be discussed with a few people outside the meeting, but there just isn't time to do so during the meeting itself (plus, we won't have the required stakeholders present). Given my schedule, I myself will not be able to address your proposal for another few weeks.


2 participants