
Commit

Update README.md
dosumis authored Oct 2, 2024
1 parent abf0b63 commit f6e392e
Showing 1 changed file with 0 additions and 27 deletions.
27 changes: 0 additions & 27 deletions README.md
Expand Up @@ -2,31 +2,4 @@

Documentation for the CL_KG including access guide, query guide and links to documentation of schema and use cases can be found at https://cellular-semantics.github.io/CL_KG/

### This repo

This repo contains pipeline code for building the Cell Ontology knowledge graph.

Components:
* Hierarchies of nested cell sets defined by author category cell type annotations, combined with CL annotation.
* The Cell Ontology and interlinked OBO ontologies - initial emphasis on GO & Pro.
* Gene/Protein annotations to GO terms that:
  * Are supported by strong experimental evidence
  * Cover mouse and human genes (we may consider adding other mammals)
  * Are directly linked to CL terms or are closely linked via some defined pattern (e.g. if a cell type has a cellular component, then we should also pull annotations to terms for assembly and maintenance of that cell type).
* Sources of assertions about cell type markers: LLMs, CL, GO, CAS.
* Curated information about cell types and the processes they are involved in, derived from LLM-based pipelines.
* In all cases, we will capture which publications or other sources support each assertion.
* Standard model for linking Gene/Protein/Transcript IDs. TBD. Initially at least I suggest aggregating to single Neo4J nodes and using APs.
* For all markers found via any route, validate against annotated data using a CxG Census query and store summary statistics: mean, median, variance, entropy? This information can be stored in edge annotations.
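
The summary statistics named in the last component could be computed along the lines below. This is only a sketch: the function name `summarise_expression`, the use of population variance, and the equal-width binning used to estimate entropy are all assumptions made for illustration, not choices described in this repo. The input is taken to be a plain list of expression values for one marker in one cell set, however those are obtained from the CxG Census.

```python
import math
import statistics
from collections import Counter


def summarise_expression(values, bins=10):
    """Summarise one marker's expression values (hypothetical helper).

    Returns mean, median, population variance and Shannon entropy in bits.
    Entropy is estimated from an equal-width histogram with `bins` bins,
    which is an assumption of this sketch.
    """
    if not values:
        raise ValueError("values must be non-empty")

    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    if width == 0:
        # All values identical: a single occupied bin, so zero entropy.
        entropy = 0.0
    else:
        # Assign each value to a bin; the maximum falls into the last bin.
        counts = Counter(min(int((v - lo) / width), bins - 1) for v in values)
        n = len(values)
        entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())

    return {
        "mean": statistics.fmean(values),
        "median": statistics.median(values),
        "variance": statistics.pvariance(values),
        "entropy": entropy,
    }
```

The returned dictionary is flat, so its keys could be written directly as properties on a marker edge.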

Pipelines:
* Pandasaurus extracts cell sets linked to CL terms following standard schemas
* Python script that queries the QuickGO API to extract relevant GO annotations
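
A QuickGO extraction step might look roughly like the sketch below. It is not the script from this repo. The endpoint and the query parameter names (`goId`, `taxonId`, `evidenceCode`, `evidenceCodeUsage`, `limit`, `page`) reflect my understanding of the public QuickGO annotation search service and should be checked against its documentation. The helper names, the default taxa (9606 human, 10090 mouse) and the use of `ECO:0000269` to stand in for "strong experimental evidence" are assumptions for illustration.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Assumed endpoint of the QuickGO annotation search service.
QUICKGO_SEARCH = "https://www.ebi.ac.uk/QuickGO/services/annotation/search"


def build_quickgo_url(go_ids, taxon_ids=(9606, 10090),
                      evidence_codes=("ECO:0000269",), limit=100, page=1):
    """Build a search URL for annotations to the given GO terms (hypothetical helper)."""
    params = {
        "goId": ",".join(go_ids),
        "taxonId": ",".join(str(t) for t in taxon_ids),
        "evidenceCode": ",".join(evidence_codes),
        # Also match more specific evidence codes beneath the ones given.
        "evidenceCodeUsage": "descendants",
        "limit": limit,
        "page": page,
    }
    return f"{QUICKGO_SEARCH}?{urlencode(params)}"


def fetch_annotations(url):
    """Fetch one page of results as a list of dicts (requires network access)."""
    request = Request(url, headers={"Accept": "application/json"})
    with urlopen(request, timeout=30) as response:
        return json.load(response).get("results", [])


if __name__ == "__main__":
    # GO:0005929 (cilium) is used purely as an example term.
    print(build_quickgo_url(["GO:0005929"]))
```

Keeping URL construction separate from the network call lets the query logic be checked offline; paging through all results would mean repeating the call with increasing `page` values.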

Use cases:
* Mining CxG for missing CL terms and CL annotations (Cypher queries to be defined)
* Cell type marker query service
  * Define Cypher queries
  * Build API
  * Build LLM query layer
* Input to ML algorithms assigning cell type. This is experimental. We need partners early enough in development to guide us and help avoid poor choices. It is probably worth being aware of existing options for generating embeddings (e.g. node2vec).
