Create an encrypted DNA ancestry using Concrete ML #95

Closed
zaccherinij opened this issue Feb 9, 2024 · 4 comments
Assignees
Labels
🎯 Bounty This bounty is currently open 📁 Concrete ML library targeted: Concrete ML

Comments

@zaccherinij
Collaborator

zaccherinij commented Feb 9, 2024

Concrete ML simplifies the use of FHE for data scientists by automatically turning machine learning models into their homomorphic equivalents. FHE can be particularly useful for protecting users' health care data, and is well suited to addressing the privacy risks of using genealogy analysis websites.

Over 30 million people have taken DNA tests to determine their ancestry through computer-based genetic genealogy. By processing the digitized sequences of DNA bases, sophisticated algorithms can identify whether a person's ancestors came from particular ethnic groups. DNA is sensitive personal data: it can uniquely identify an individual, and leaks of DNA data have already happened.

DNA ancestry identification is a complex process that involves multiple steps. First, DNA phasing assigns alleles (the As, Cs, Ts and Gs in DNA strands) to the paternal and maternal chromosomes. Second, ancestry can be determined by comparing specific segments of the DNA against large databases of DNA of known ancestry. An alternative is to use machine learning to classify each such segment and then aggregate the ancestry of the individual segments into a final classification.
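As a minimal sketch of the segmentation step, the snippet below cuts an already phased haplotype into fixed-size windows that a per-segment classifier could consume. The 0/1 allele encoding (0 = reference, 1 = alternate), the window size, and the name `split_into_windows` are illustrative assumptions, not something this bounty prescribes.

```python
from typing import List


def split_into_windows(haplotype: List[int], window_size: int) -> List[List[int]]:
    """Cut one phased haplotype into consecutive, non-overlapping windows.

    The last window may be shorter than window_size.
    """
    if window_size <= 0:
        raise ValueError("window_size must be positive")
    return [
        haplotype[start:start + window_size]
        for start in range(0, len(haplotype), window_size)
    ]


# Hypothetical phased sample: one 0/1 vector per parental haplotype
# (assumed encoding: 0 = reference allele, 1 = alternate allele).
maternal = [0, 1, 1, 0, 0, 1, 0, 1, 1, 1]
paternal = [1, 1, 0, 0, 1, 0, 0, 0, 1, 0]

maternal_windows = split_into_windows(maternal, window_size=4)
paternal_windows = split_into_windows(paternal, window_size=4)
```

Each haplotype is windowed separately, so a segment classifier sees one parental strand at a time.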

We think that, using Fully Homomorphic Encryption, ancestry can be determined on encrypted DNA sequences, preserving the security of users' DNA. Most published machine-learning methods for ancestry identification perform local ancestry inference (LAI). Global ancestry inference computes the genome-wide average of the population contributions, while LAI identifies the regional ancestry of a genomic segment, which is more amenable to machine learning. To build the global ancestry from local decisions, LAI algorithms apply machine learning again in a second step, fusing the ancestry classifications of the different segments into a single classification for a person.
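To make the relation between local and global ancestry concrete, here is a small sketch that turns per-segment calls into genome-wide proportions. It uses a plain unweighted average over segments, which is a simplifying assumption standing in for the learned fusion step described above; the population labels and the name `global_ancestry` are hypothetical.

```python
from collections import Counter
from typing import Dict, List


def global_ancestry(local_calls: List[str]) -> Dict[str, float]:
    """Return the fraction of segments assigned to each population.

    Assumes all segments carry equal weight (e.g. equal length).
    """
    if not local_calls:
        raise ValueError("need at least one local ancestry call")
    counts = Counter(local_calls)
    total = len(local_calls)
    return {population: n / total for population, n in counts.items()}


# Hypothetical local ancestry calls, one per genomic segment.
calls = ["EUR", "EUR", "AFR", "EUR", "EAS", "AFR", "EUR", "EUR"]
proportions = global_ancestry(calls)
```

With real data the segments would typically be weighted, for instance by genetic length, instead of being counted equally.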

Many types of machine learning models have been proposed for local ancestry inference: neural networks [1], hidden Markov models [2], and decision trees or logistic regression [3] (the G-nomix project). A great hands-on resource on machine learning for ancestry is the AI Sandbox GitHub.

Submission

1️⃣ Want to solve this bounty? Register here.
2️⃣ Ready to submit your solution? Submit here.
🗓️ Submission deadline: May 12th, 2024.

Overview

The goal of this bounty is to train ancestry classifiers using Concrete ML so they can execute on encrypted data. You can assume the input DNA is phased and in the proper format. As mentioned above, most approaches are two-stage. First, classifiers are trained for individual genomic windows. Second, a smoother is trained that combines the predictions of the individual classifiers.
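The two-stage structure can be sketched in plain Python as follows. This is only a stand-in to show the data flow: stage 1 uses a nearest-neighbour rule on Hamming distance instead of a trained Concrete ML model, and stage 2 uses an untrained sliding majority vote instead of a trained smoother. The reference panel, the labels, and all names are hypothetical.

```python
from collections import Counter
from typing import List, Sequence


def hamming(a: Sequence[int], b: Sequence[int]) -> int:
    """Number of positions where two equally long allele vectors differ."""
    return sum(x != y for x, y in zip(a, b))


class WindowClassifier:
    """Stage 1 stand-in: nearest reference haplotype for one genomic window."""

    def fit(self, windows: List[Sequence[int]], labels: List[str]) -> "WindowClassifier":
        self.references = list(zip(windows, labels))
        return self

    def predict(self, window: Sequence[int]) -> str:
        return min(self.references, key=lambda ref: hamming(ref[0], window))[1]


def smooth(calls: List[str], radius: int = 1) -> List[str]:
    """Stage 2 stand-in: replace each call by the majority among its neighbours."""
    smoothed = []
    for i in range(len(calls)):
        lo, hi = max(0, i - radius), min(len(calls), i + radius + 1)
        smoothed.append(Counter(calls[lo:hi]).most_common(1)[0][0])
    return smoothed


# Hypothetical toy reference panel: 5 windows of 3 SNPs per population.
reference = {
    "AFR": [[1, 1, 1], [1, 1, 0], [1, 0, 1], [0, 1, 1], [1, 1, 1]],
    "EUR": [[0, 0, 0], [0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 0, 0]],
}
labels = list(reference.keys())

# One classifier per window position, trained on that window only.
classifiers = [
    WindowClassifier().fit([reference[pop][w] for pop in labels], labels)
    for w in range(5)
]

# Query haplotype whose third window looks like the other population.
query = [[1, 1, 1], [1, 1, 1], [0, 1, 0], [0, 1, 1], [1, 0, 1]]
raw_calls = [clf.predict(win) for clf, win in zip(classifiers, query)]
smoothed_calls = smooth(raw_calls, radius=1)
```

The smoother overrides the isolated call in the third window because both neighbours disagree with it, which is the kind of correction the second stage is meant to provide.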

You can use any datasets that you want as long as you abide by their license agreements. Some examples are the 1000 Genomes Project, the Simons Genome Diversity Project and the Human Genome Diversity Project.

What we expect

  • A report explaining how you built the project and the accuracy and timings that you obtain
  • Models that can identify ancestry using the DNA of chromosome 22

Important

To qualify for the maximum prize, the FHE application should perform both stages of the classification in FHE.
Partial prizes will be awarded if only one stage of the pipeline is in FHE, but you can assume preprocessing such as phasing is done in the clear in a separate step (you can use phased DNA directly).

  • An evaluation of the FHE model’s performance when run in FHE (only partial prizes are awarded if the algorithm works in FHE simulation but can only partially be run in actual FHE)
  • An evaluation of the floating-point equivalent (non-FHE) model’s performance for comparison
  • Clean, documented code, as well as a straightforward README.md file showing how to install the project and how to run and evaluate the models
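One reason to report both evaluations is that FHE models compute on quantized integers, so their accuracy can differ from the floating-point model. The sketch below illustrates that gap with a toy linear scorer and symmetric uniform quantization of the weights. It does not use Concrete ML; the weights, samples, labels, and the 3-bit setting are made-up values chosen for illustration.

```python
from typing import List, Sequence, Tuple


def quantize(weights: Sequence[float], n_bits: int) -> Tuple[List[int], float]:
    """Symmetric uniform quantization to signed n_bits integers.

    Returns the integer weights and the scale such that w ~= q * scale.
    """
    q_max = 2 ** (n_bits - 1) - 1
    max_abs = max(abs(w) for w in weights) or 1.0
    scale = max_abs / q_max
    return [round(w / scale) for w in weights], scale


def predict(weights: Sequence[float], x: Sequence[int]) -> int:
    """Binary decision of a linear scorer without bias: 1 if the score is positive."""
    return 1 if sum(w * v for w, v in zip(weights, x)) > 0 else 0


def accuracy(weights: Sequence[float], X: List[Sequence[int]], y: List[int]) -> float:
    """Fraction of samples whose predicted label matches the given label."""
    return sum(predict(weights, x) == label for x, label in zip(X, y)) / len(y)


# Hypothetical trained float weights and a tiny labelled evaluation set.
float_weights = [0.9, -0.4, 0.05, -0.7]
X = [
    [1, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 1],
    [1, 1, 0, 1],
    [1, 0, 1, 1],
]
y = [1, 0, 1, 0, 0, 1]

int_weights, scale = quantize(float_weights, n_bits=3)

# The scale is positive, so the sign of the integer score decides the class.
float_accuracy = accuracy(float_weights, X, y)
quantized_accuracy = accuracy(int_weights, X, y)
```

Here the small weight 0.05 rounds to zero at 3 bits, so the sample that depends on it is misclassified by the quantized scorer; reporting both numbers side by side makes such losses visible.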

Implementation guide

Reward

🥇Best submission: up to €5,000.

To be considered the best submission, a solution must be efficient, effective and demonstrate a deep understanding of the core problem. Alongside technical correctness, it should come with clean code, clear explanations and complete documentation.

🥈Second-best submission: up to €3,000.

To be considered the second-best submission, a solution should be both efficient and effective. The code should be neat and readable. Its documentation might not be as exhaustive as the best submission's, but it should cover the key aspects of the solution.

🥉Third-best submission: up to €2,000.

The third-best submission presents a solution that effectively tackles the challenge at hand, even if it has certain areas for improvement in terms of efficiency or depth of understanding. Documentation should be present, covering the essential components of the solution.

Reward amounts are decided based on code quality, model accuracy scores and speed performance on an m6i.metal AWS server. When multiple solutions of comparable scope are submitted, they are compared based on the accuracy metrics and computation times.

Related links and references

[1] Benet Oriol Sabat, Daniel Mas Montserrat, Xavier Giro-i-Nieto, Alexander G Ioannidis, SALAI-Net: species-agnostic local ancestry inference network, Bioinformatics, Volume 38, Issue Supplement_2, September 2022, Pages ii27–ii33.

[2] Wei Y, Zhi D, Zhang S. Fast and accurate local ancestry inference with Recomb-Mix. bioRxiv [Preprint]. 2023 Nov 19:2023.11.17.567650. doi: 10.1101/2023.11.17.567650. PMID: 38014185; PMCID: PMC10680832.

[3] Helgi Hilmarsson, Arvind S. Kumar, Richa Rastogi, Carlos D. Bustamante, Daniel Mas Montserrat, Alexander G. Ioannidis, High Resolution Ancestry Deconvolution for Next Generation Genomic Data, bioRxiv 2021.09.19.460980

Questions?

Do you have a specific question about this bounty? Join the live conversation on the FHE.org Discord server here. You can also send us an email at: [email protected]

@zaccherinij zaccherinij added 🎯 Bounty This bounty is currently open 📁 Concrete ML library targeted: Concrete ML labels Feb 9, 2024
@zaccherinij zaccherinij reopened this Feb 9, 2024
@zaccherinij zaccherinij changed the title Create an encrypted DNA classifier using Concrete ML Create an encrypted DNA ancestry using Concrete ML Feb 14, 2024
@zaccherinij
Collaborator Author

A friendly reminder that the Submission deadline is May 12th, 2024 at 23:59 AoE (Anywhere on Earth). Good luck!

@alephzerox

I'm not sure how to submit my solution (the link above leads to a general page) but here it is:

https:/alephzerox/ancestry-fhe

@zaccherinij
Collaborator Author

zaccherinij commented May 12, 2024 via email

@zaccherinij
Collaborator Author

Thank you to everyone who submitted to the Zama Bounty Program Season 5. Our team will review all submissions and give some initial feedback in the coming days!
Cheers.

Projects
Status: Awarded Contributions

3 participants