Please use http://purl.uniprot.org/uniprot or http://purl.uniprot.org/isoform/ IRIs for UniProt concepts #34

Open
JervenBolleman opened this issue Oct 22, 2018 · 28 comments

Comments

@JervenBolleman

This will make it easier to link the UniProt data with the GO (A) data on RDF and OWL level.

Mostly, it will make it easier for us to introduce Noctua-compatible modelling for UniProt->GO term relations, with the benefit that users loading both datasets do not get duplicate triples just because we don't use the same IRIs.

@cmungall
Member

What does the uniprot PURL denote? If other graphs assert it's an IAO ICE we end up with incoherency. We need to treat it as a material entity (not that identifiers.org is clear on this)

@nataled are you using uniprot PURLs as ICEs in PRO?

@nataled

nataled commented Oct 26, 2018

At the moment we don't use them for anything other than what they are: database entries. In PRO they are only used for cross-references and evidences. However, going beyond that, they would be considered ICEs. Think of the distinction between SO and MSO: UniProtKB would be akin to SO, while PRO would be akin to MSO.

@cmungall
Member

Thanks! I'm most interested in what they are asserted or entailed to be in OWL.

If you have axioms that cause a uniprot PURL (as in an actual purl.uniprot.org PURL) to be entailed as an ICE (for example, through use of an object property with domain/range constraints) then the combined knowledge graph with GO-CAM will have inconsistencies.

I believe this is the case. I believe also that @JervenBolleman who is the authority on what the purls mean would say that these denote database records, not proteins. Both of these facts indicate that we should not use these purls for the neo classes (funnily enough, the PRO class has the intended semantics, but GO annotators want identifiers with UniProtKB prefixes, and we need all of at least swiss-prot materialized, which means we can't use PRO).

Note that SO classes are not subclasses of ICEs; many SO classes have instances that exist independently of database records.

@nataled

nataled commented Oct 26, 2018

Any Swiss-Prot entry can be trivially materialized in PRO. Some will need special treatment, of course, but we know how to deal with those. The only thing that stops us from doing it is that we've not had a request to do so. But, go ahead and try it. Take any Swiss-Prot accession that doesn't have a corresponding PRO, and prefix it with a PRO PURL (purl.obolibrary.org/obo/PR_). Works for TrEMBL too.
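
The prefixing described in this comment can be sketched in a few lines of Python. This is only an illustration: the helper name `pro_purl_for` is hypothetical, the `http://` scheme is an assumption (the comment gives only `purl.obolibrary.org/obo/PR_`), and the accession pattern is the one UniProt documents for its accession numbers. Whether a given PURL actually resolves is up to PRO and is not checked here.

```python
import re

# UniProtKB accession pattern as documented by UniProt (6 or 10 characters).
ACCESSION = re.compile(
    r"[OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2}"
)

# Assumed scheme; the comment above only gives the host and path.
PRO_PREFIX = "http://purl.obolibrary.org/obo/PR_"


def pro_purl_for(accession: str) -> str:
    """Build a PRO PURL by prefixing a UniProtKB accession (hypothetical helper)."""
    if not ACCESSION.fullmatch(accession):
        raise ValueError(f"not a UniProtKB accession: {accession!r}")
    return PRO_PREFIX + accession


if __name__ == "__main__":
    print(pro_purl_for("P05067"))
```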

@JervenBolleman
Author

OWL-DL speaking -> identifiers.org states

<http://identifiers.org/uniprot/P05067> owl:sameAs <http://purl.uniprot.org/uniprot/P05067>

so this is not a modelling change.

However, for ease of use in federated SPARQL queries, it helps a lot for the practical adoption of Noctua models if we can cross-query them. IRI conversion in SPARQL queries is possible, but it is a pain that we would rather not have.
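
To make the IRI conversion concrete, here is a minimal sketch, assuming the two IRI patterns named in this thread. The function name `to_uniprot_purl` and the variable names in the SPARQL fragment are made up for illustration; the fragment uses only the standard SPARQL 1.1 functions `STR`, `REPLACE`, and `IRI`, and shows the kind of rewriting every federated query would otherwise have to carry.

```python
IDENTIFIERS_PREFIX = "http://identifiers.org/uniprot/"
UNIPROT_PREFIX = "http://purl.uniprot.org/uniprot/"


def to_uniprot_purl(iri: str) -> str:
    """Rewrite an identifiers.org UniProt IRI to its purl.uniprot.org form.

    IRIs that do not start with the identifiers.org prefix are returned unchanged.
    """
    if iri.startswith(IDENTIFIERS_PREFIX):
        return UNIPROT_PREFIX + iri[len(IDENTIFIERS_PREFIX):]
    return iri


# The same rewrite done inside a query (illustrative variable names).
SPARQL_REWRITE = (
    'BIND(IRI(REPLACE(STR(?protein), "^http://identifiers.org/uniprot/", '
    '"http://purl.uniprot.org/uniprot/")) AS ?uniprotEntry)'
)

if __name__ == "__main__":
    print(to_uniprot_purl("http://identifiers.org/uniprot/P05067"))
```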

DataRecord = owl:Class when I talk. It means that a single UniProt record/class represents between 0 and a practically infinite number of molecules, similar to a PRO class. Changing from rdf:type uniprot:Protein to rdfs:subClassOf uniprot:Protein is still on our todo list. However, with the current state of reasoners our users would have serious problems with the billions of axioms.

@cmungall @nataled using PRO or UniProt is a separate discussion from this bug report, and I suggest you open a separate issue to discuss it. IMHO the more flexible semantics of UniProt is actually key for Noctua success, as PRO semantic limits make it invalid to express some desired annotations (especially regarding the function of secreted proteins).

@cmungall
Member

OWL-DL speaking -> identifiers.org states
http://identifiers.org/uniprot/P05067 owl:sameAs http://purl.uniprot.org/uniprot/P05067

Where does this axiom come from?

curl  -H "Accept: text/turtle"  http://identifiers.org/uniprot/P05067 

doesn't return anything

If there is a sameAs axiom, formally it doesn't affect us, since we're in OWL-DL and protected by punning (sameAs only applies to individuals, and we're using classes).

DataRecord = owl:Class when I talk.

OK, I will try and mentally translate but this extra layer confuses me. Would you apply this to GO too? To me, every GO class represents a process or cellular entity type. Yes, the class is also an information entity but this is implicit. It's most parsimonious to leave out talk of information entities when modeling unless one explicitly wants to talk about information entities.

IMHO The more flexible semantics of UniProt is actually key for Noctua success as PRO semantic limits make it invalid to express some desired annotations (especially regarding the function of secreted proteins).

Actually, flexible semantics is not good for us. Much as I want easy federated querying, if we don't have logically consistent models, reasoning doesn't work, and we rely on reasoning for everything. We need precise semantics.

Can you explain what you mean about secreted proteins? I don't see any challenges representing this as a GO-CAM (in fact we have axiomatized classes like renin secretion in the ontology using PRO semantics).

To summarize, we need pro-like semantics (proteins like 'human shh' as classes), but uniprotkb prefixes, as the community wants to annotate to uniprot.

It sounds like you might be open to providing the semantics we need, but are blocked by this:

Changing from rdf:type uniprot:Protein to rdfs:subClassOf uniprot:Protein is still on our todo list. However, with the current state of reasoners our users would have serious problems with the billions of axioms

What reasoners are you using? This seems like a fairly tractable technical challenge. And there may be options like using a tbox shadowed in the abox for internal reasoning but publishing as a tbox.

@alanruttenberg

It seems clear that the UniProt concept and the PRO class are different sorts. Why can't the interoperability be handled in NEO's interface? E.g. Accept either ID as input. Use PRO ids internally, display ids according to preference for one or the other. Generate RDF/OWL suitable for integration into UniProt's SPARQL endpoint that matches UniProt's policy for how a PRO ID maps to a UniProt record.

There will be issues to address since PRO isn't strictly one to one with UniProt even at the organism-gene level, but those issues won't be addressed by simply equating the two. Exposing those assumptions clearly, and having the tool users understand what they are accepting by choosing one or the other identifiers would be quite a good thing insofar as making clearer the relation of UniProt to PRO.

Using the UniProt ids for protein classes also has the consequence that we no longer have an identifier for the information content entity that is the UniProt record, which we could otherwise use in different ways. For example, the canonical sequence (as information artifact) is part of the UniProt record, but it isn't the sequence of all the proteins in the class.

As an example of an issue consider the relation of the isoform to the organism-gene level. We use a subclass relation, but as far as I can tell, UniProt does not. I think it would be hard, and require substantial commitment, to coordinate the RDF/OWL in the sense of being able to simply add a piece of OBO RDF/XML to UniProt RDF/XML and expect the result to make sense. If we're not going to be able to do that it isn't clear what benefit there is to using the same identifier.
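
The interface-level translation suggested at the start of this comment might look roughly like the sketch below. Everything here is an assumption for illustration: the function names, the CURIE prefixes `UniProtKB` and `PR`, and above all the naive rule that a UniProtKB accession maps to the PRO class with the same local id. As the comment itself stresses, PRO is not strictly one to one with UniProt, so a real implementation would need a curated mapping in place of this string manipulation.

```python
PRO_IRI = "http://purl.obolibrary.org/obo/PR_"
KNOWN_PREFIXES = ("UniProtKB", "PR")


def to_internal(curie: str) -> str:
    """Accept either identifier style as input and store a PRO IRI internally."""
    prefix, _, local = curie.partition(":")
    if prefix not in KNOWN_PREFIXES or not local:
        raise ValueError(f"unsupported identifier: {curie!r}")
    # Naive assumption: the local id carries over unchanged.
    return PRO_IRI + local


def display(internal_iri: str, prefer: str = "UniProtKB") -> str:
    """Render the internal PRO IRI using the prefix the user prefers."""
    if prefer not in KNOWN_PREFIXES:
        raise ValueError(f"unknown display preference: {prefer!r}")
    if not internal_iri.startswith(PRO_IRI):
        raise ValueError(f"not a PRO IRI: {internal_iri!r}")
    local = internal_iri[len(PRO_IRI):]
    if prefer == "UniProtKB" and local.isdigit():
        # Numeric PRO ids have no accession embedded; a curated mapping is needed.
        raise LookupError(f"no UniProtKB accession derivable from PR:{local}")
    return f"{prefer}:{local}"


if __name__ == "__main__":
    iri = to_internal("UniProtKB:P05067")
    print(iri, display(iri), display(iri, prefer="PR"))
```

The `LookupError` branch is where the differing assumptions of the two resources would surface to the user rather than being hidden.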

@alanruttenberg

BTW, I'm happy to chat and discuss the issues, if you are interested.

@cmungall
Member

cmungall commented Nov 6, 2018

It seems clear that the UniProt concept and the PRO class are different sorts

I would like to explore this further, as it's not totally clear to me that they are. Sorry I missed the call.

Why can't the interoperability be handled in NEO's interface? E.g. Accept either ID as input. Use PRO ids internally, display ids according to preference for one or the other.

I would like to do this, but this would involve multiple exceptions into the code at different points, increasing overall fragility. On top of that, there are member groups of the GOC who have expressed that they want to annotate to UniProtKB IDs (including prefix) and I have to respect that.

There will be issues to address since PRO isn't strictly one to one with UniProt even at the organism-gene level, but those issues won't be addressed by simply equating the two. Exposing those assumptions clearly, and having the tool users understand what they are accepting by choosing one or the other identifiers would be quite a good thing insofar as making clearer the relation of UniProt to PRO.

Let's take the case of a GCRP swissprot entry and the corresponding entry. There are definitely issues to address here (e.g. sometimes GCRP will include trembl, but at least for human we should be 99% in agreement), but I think these are separable (and they are already being discussed elsewhere).

What would it mean to expose the different assumptions between

To a biologist and the users of Noctua these seem to indicate the same thing. And to me as well: I believe they are intended to denote the same thing, the PR purl is just clearer and more explicit about OWL commitments and relationships to other OBO entities.

Using the UniProt ids for protein classes also has the consequence that we no longer have an identifier for the information content entity that is the UniProt record, which we could otherwise use in different ways. For example, the canonical sequence (as information artifact) is part of the UniProt record, but it isn't the sequence of all the proteins in the class.

I'm not convinced you need this level of meta-representation, but in any case I believe you want to use a PURL with the sequence version embedded for this use case, the sequence in the db may change over time.

E.g., these differ by one residue:

https://www.uniprot.org/uniprot/Q9FXT6.fasta?version=1
https://www.uniprot.org/uniprot/Q9FXT6.fasta?version=74

So if you want to explicitly and logically represent an alignment relative to a sequence you'd need to use the version IRIs, or just encode the string directly.
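
As a rough sketch of working with versioned sequences, the Python below builds version URLs following the pattern of the two links above and reports where two sequences differ. The function names are hypothetical, the code performs no network access, and whether that URL pattern still resolves is not verified here. The toy sequences in the demo are invented and are not the Q9FXT6 sequences.

```python
def versioned_fasta_url(accession: str, version: int) -> str:
    """Build a versioned FASTA URL following the pattern quoted in the comment."""
    return f"https://www.uniprot.org/uniprot/{accession}.fasta?version={version}"


def parse_fasta_sequence(text: str) -> str:
    """Return the residues of a single-record FASTA text, dropping the header line."""
    lines = [line.strip() for line in text.splitlines()]
    return "".join(line for line in lines if line and not line.startswith(">"))


def differing_positions(seq_a: str, seq_b: str) -> list[int]:
    """List 1-based positions where two equal-length sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences differ in length; an alignment is needed")
    return [i + 1 for i, (a, b) in enumerate(zip(seq_a, seq_b)) if a != b]


if __name__ == "__main__":
    print(versioned_fasta_url("Q9FXT6", 1))
    old = parse_fasta_sequence(">toy v1\nMKTAAG\n")  # invented sequence
    new = parse_fasta_sequence(">toy v2\nMKTVAG\n")  # invented sequence
    print(differing_positions(old, new))
```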

As an example of an issue consider the relation of the isoform to the organism-gene level. We use a subclass relation, but as far as I can tell, UniProt does not.

Yes, this could cause big problems, if asserting a subclass introduces inconsistency.

I think it would be hard, and require substantial commitment, to coordinate the RDF/OWL in the sense of being able to simply add a piece of OBO RDF/XML to UniProt RDF/XML and expect the result to make sense. If we're not going to be able to do that it isn't clear what benefit there is to using the same identifier.

I think this is the crux of the issue. I agree that if the result of doing the combination is incoherency then it won't work (see the first comment from me in this ticket). At the moment there is a certain amount of shielding due to the punning, but that's not quite satisfactory (although that is a potential long term strategy here).

We need to know more about plans for OWL commitments on the uniprot PURLs from their maintainers. Comments above from Jerven like "Changing from rdf:type uniprot:Protein to rdfs:subClassOf uniprot:Protein is still on our todo list." suggest things are moving in the direction of compatibility, so I am hopeful.

@alanruttenberg

I come to my conclusion about them being distinct sorts from two directions. First, as you say, PRO is very clear about what their entities denote. UniProt is not. Not because they can't or don't want to, but because they view their resource as a database, not an ontology. Without understanding exactly what their entities denote (and verifying that their logical assertions regarding them concord), we can't adequately compare them to PRO.

Second, where I have looked for implicit commitments as evidenced in assertions in their RDF, I find incompatibilities. We agree that combining our and their RDF will be incoherent.

in any case I believe you want to use a PURL with the sequence version embedded for this use case, the sequence in the db may change over time.

My presumption was that UniProt's RDF gave distinct sequences distinct PURLs. If so, then those would be adequate. If not, we would do whatever we have to in order to properly record sequence, but that would also expose another way in which the commitments of the two resources differ.

On the matter of respecting your users, I understand that need, but that seems to be something that you need to address within the tool, not necessarily in the ontologies. I haven't really looked at Noctua/NEO other than what I've seen in a couple of presentations, so at the moment I don't understand its model and logical commitments. Because of that I can't speak to the use of UniProt IRIs there. What I do know is that, insofar as OBO ontologies go, these IRIs represent different things.

I would like to do this, but this would involve multiple exceptions into the code at different points, increasing overall fragility. On top of that, there are member groups of the GOC who have expressed that they want to annotate to UniProtKB IDs (including prefix) and I have to respect that.

SMOP. I have trouble sympathizing with the idea that, in order to alleviate some bit of programming, we should introduce substantial confusion about ontology. From my point of view, there is a perfectly coherent view of UniProt as a database consisting of ICEs, and PRO as an ontology, a view which is in concordance with what the developers of each resource say.

Regarding the multiple exceptions, if you are interested we could look at the code together and brainstorm to find a way to handle the interconversion in a clean and minimally disruptive manner.

--

If, at some point, UniProt were to decide that they want the resource to be understood as an OBO ontology, something I would love them to do (I've said so in the past), then that would reopen the question for me. A good collaboration between UniProt and PRO might be to undertake that effort assuming all parties were interested and committed, and that the effort could be funded.

@cmungall
Member

cmungall commented Nov 7, 2018

On the matter of respecting your users, I understand that need, but that seems to be something that you need to address within the tool, not necessarily in the ontologies

No, the requirement is that uniprotkb is used, regardless of tooling.

@JervenBolleman
Author

JervenBolleman commented Nov 7, 2018 via email

@cmungall
Member

cmungall commented Nov 8, 2018 via email

@cmungall
Member

cmungall commented Nov 8, 2018

synthetic peptides: we don't annotate to these, only gene products of genes, so these would not be in neo

pointer to good intro material?
@balhoff's presentation from RO mtg:
https://buffalo.app.box.com/s/spp9iam2zjoe0hmxjur5fvlyssvp56vl

@alanruttenberg

This issue was very specific regarding IRIs for UniProt resources, where I have a strong preference for using the resource's IRIs directly if they have an RDF form. If for logical reasons a different concept is required, my preference is to have a new IRI that relates to our IRI with as clear a semantics as possible, e.g. something like this:
http://example.org/noctea-(re)interperation-of-uniprot/P05067
skos:closeMatch http://purl.uniprot.org/uniprot/P05067

I think that using the PRO URIs in combination with skos:closeMatch is the best of both worlds. PRO terms have clear semantics, and PRO already maps, where appropriate, to UniProt. Using skos:closeMatch is a good bridge between OBO ontology terms and a more RDF-oriented view.
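
A bridging statement of this kind could be generated as in the sketch below, which emits one N-Triples line per accession. The helper name `close_match_triple` is hypothetical, and pairing a PRO PURL with the UniProt PURL of the same accession is an assumption for illustration; the actual PRO-to-UniProt mapping would have to come from PRO. The `skos:closeMatch` IRI is the standard one from the SKOS core vocabulary.

```python
SKOS_CLOSE_MATCH = "http://www.w3.org/2004/02/skos/core#closeMatch"
PRO_PREFIX = "http://purl.obolibrary.org/obo/PR_"
UNIPROT_PREFIX = "http://purl.uniprot.org/uniprot/"


def close_match_triple(accession: str) -> str:
    """Return an N-Triples line linking a PRO class to a UniProt entry IRI.

    Assumes (for illustration only) that both IRIs share the same accession.
    """
    subject = f"<{PRO_PREFIX}{accession}>"
    predicate = f"<{SKOS_CLOSE_MATCH}>"
    obj = f"<{UNIPROT_PREFIX}{accession}>"
    return f"{subject} {predicate} {obj} ."


if __name__ == "__main__":
    print(close_match_triple("P05067"))
```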

What do you think, @nataled?

@nataled

nataled commented Nov 21, 2018

After further rumination and discussion, I come to the conclusion that the main problem (for PRO) is that the scientific community uses UniProtKB identifiers to mean two different things. One, exemplified by GOA, is that they are basically the same as PRO, that is, that they represent actual proteins that can be annotated with functions, etc. The other, exemplified by Pfam and other protein classification projects, is that they represent the sequences of those proteins. My concern about usage of UniProt vs PRO centers on the need (by PRO) for that latter interpretation, and that imposing the former interpretation on the uniprot purls would leave us without a way to talk about the sequences themselves. So, a question to @JervenBolleman: assuming that http://purl.uniprot.org/uniprot/P05067 refers to a class of proteins, how would you refer to, say, the canonical sequence of that class? If there is a way to separate the two interpretations that solves the immediate problem. Bear in mind the following:

  1. Personally, I think the right way to refer to the specific sequence is via UniParc. However, we are constrained by the fact that there are almost zero resources that refer to UniParc identifiers, and since we wish to import information (for example, classification into protein families), we have to use the same identifiers as those resources; that is, UniProtKB.

  2. Using the version IRIs for sequences suffers from a related but different concern; namely, that we don't know which version is used by these resources. We understand that sometimes sequences are refined without changing the UniProtKB accession, and we are prepared to deal with that.

@cmungall
Member

In the uniprot triplestore there is a up:sequence property that connects an entry to isoform entries. But I think what is required is a PURL for the sequence specifically, e.g. having a PURL for https://www.uniprot.org/uniprot/Q9FXT6.fasta?version=1

@alanruttenberg

alanruttenberg commented Nov 22, 2018 via email

@nataled

nataled commented Nov 28, 2018

uniprot purls
Looking at an actual example rdf (https://www.uniprot.org/uniprot/P10403.rdf), it seems that the isoforms PURL is the sequence. I take as evidence of this two things (see attached screenshot):

  1. "A" shows that the value of the isoform PURL is the sequence.
  2. "B" shows that the isoform PURL is the resource used for positions

My interpretation of this is that http://purl.uniprot.org/uniprot/P10403 can (does?) refer to the protein entity (in the PRO sense) while http://purl.uniprot.org/isoforms/P10403-1 refers to (what happens to be) the canonical sequence of that protein. @JervenBolleman can you confirm?

I also note the following:

  1. The word 'isoform' in the isoform PURL given in this ticket is singular, but in practice it is plural.
  2. Neither version actually resolves to anything useful

An open question involves whether or not http://purl.uniprot.org/uniprot/P10403-1 is a valid PURL for the protein (material) entity that refers to that specific isoform.

@goodb

goodb commented Sep 30, 2019

Mainly repinging folks working on this thread. Wondering if we could try again for a consensus decision, as it's impacting GO work in multiple projects.

For what it's worth, after reading through the above it seems that there is a consensus that PRO OWL semantics are a better match for the Noctua use case than what we get from UniProt (RDF) now.

I see two things stopping us from switching over. 1) PRO would need to add all of the proteins needed by GOC annotators. According to @nataled above (regarding trembl) it sounds like this would be possible. 2) Either GOC folks are convinced to use the PRO ids (sounds unlikely) or through a SMOP they see what they want to see in the Noctua UI (for selecting genes) and in the Noctua output (especially the flatfile GPAD output). The SMOP would be greatly enabled if PRO maintained a clear semantic structure mapping from PRO classes to UniProt records. (xref is not sufficiently clear in meaning).

?

@cmungall
Member

cmungall commented Sep 30, 2019 via email

@goodb

goodb commented Sep 30, 2019

Sorry, I saw @alanruttenberg's use of SMOP (small matter of programming) above and liked its connotations. If we have the mappings from PRO to UniProt up front, I don't think it's terrible to handle the translations in the Noctua code. I have a cut at doing this for the Reactome entities -> UniProt for GPAD working in noctua-dev now.

I don't see how you avoid loading the whole protein universe without a Noctua stack architecture change? Whether it's a PRO expansion or UniProt being ingested into neo, we still end up with a gigantic OWL file.

For others' information, as it stands now, Noctua is driven from a 1.45 GB merged OWL file (go-lego), of which 1.12 GB is neo. This contains all of the classes that can be used to type the instances in the go-cam models, with neo containing the gene product classes. Although it introduces some technical hassle (e.g. that the entire file is loaded by default when attempting to load a GO-CAM OWL model into Protégé or other tools), it actually works just fine for the Noctua application right now. It's probably drifting off topic here, but if there was a way to grow neo based on curator demand (e.g. one protein at a time as they needed it), we might be able to solve the giant OWL file problem.

@JervenBolleman
Author

@goodb and @cmungall could you please open separate issues for separate concerns? This issue was quite focused in its request and now asks a zillion different things in your discussions.

Basically, my request is -> if you annotate UniProt entries use UniProt purls. If you are annotating something else, use something else.

Don't have users annotate UniProt but use PRO, nor have users annotate PRO and use UniProt. Not every UniProt case can be represented in PRO (or the other way around), nor are these the only two databases that users of Noctua might wish to use; e.g. neXtProt and Ensembl proteins are valid IRI targets for GO-CAM annotation as well.

@cmungall
Member

cmungall commented Oct 2, 2019 via email

@nataled

nataled commented Oct 2, 2019

I'm fine with using the PRO tracker, even though pretty much all the unanswered questions are about UniProt. Here are the topics discussed (probably missed a few):

  1. What do the UniProt PURLs denote: database entry, protein class, or sequence?
  2. How does PRO relate to UniProt?
  3. User needs: a SMOP, or address ontologically?

Topics 1 and 2 will be further addressed here: PROconsortium/PRoteinOntology#165

Finally, one point of clarification:

Don't have users annotate UniProt but use PRO, nor have users annotate PRO and use UniProt. Not every UniProt case can be represented in PRO (or the other way around), nor are these the only two databases that users of Noctua might wish to use; e.g. neXtProt and Ensembl proteins are valid IRI targets for GO-CAM annotation as well.

Actually, every UniProt case CAN be represented in PRO. It's just that a small subset has to be done manually.

@cmungall
Member

cmungall commented Oct 2, 2019 via email

@pgaudet

pgaudet commented Feb 10, 2022

What's the status of this?

@JervenBolleman
Author

JervenBolleman commented Feb 10, 2022

No progress. Still open. I prefer that Neo uses http://purl.uniprot.org/uniprot/A0A024BTL2 instead of http://identifiers.org/uniprot/A0A024BTL2 when talking about UniProt entries/classes. Especially now that identifiers.org does not recommend their own IRI pattern.
