README.Rmd

---
title: "README"
author: "Aurélie Gabriel"
date: "2023-10-11"
output: github_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

This readme describes how the code (available in the scripts/ folder) was used to perform the analyses described in the following paper:
A AG Gabriel, J Racle, M Falquet, C Jandus, D Gfeller. 2024. Robust estimation of cancer and immune cell-type proportions from bulk tumor ATAC-Seq data. DOI: https://doi.org/10.7554/eLife.94833.3

In the following sections we will refer to a *data/* folder containing files that were too large 
to be hosted on the github repository. The complete data/ folder can be retrieved on zenodo (https://zenodo.org/record/13132868, additional_data.zip). 

## Preprocessing of the bulk ATAC-Seq data

The following command line was used to process the ATAC-Seq samples listed in *data/markers_identification_input_files/samples/Ref/SRA_samples_metadata.txt*. 
Note that the `git_repo_path` variable corresponds to the path to the location where this GitHub repository has been cloned. 
The `genome_folder_path` variable corresponds to the folder containing the 
refgenie genome assets which are PEPATAC prerequisites for ATAC-Seq data processing, 
please refer to the PEPATAC documentation to download these data (http://pepatac.databio.org/en/latest/assets/).
`sra_output_folder` corresponds to the folder where SRA data will be downloaded.

```{bash preprocess_bulk_ATAC, eval = FALSE, echo = TRUE}
git_repo_path=user_defined_path # path to the folder where this github repository was cloned
path_to_data_folder=user_defined_path # path to the data folder donwloaded from zenodo
genome_folder_path=user_defined_path 
sra_output_folder=user_defined_path 
pepatac_output_folder=user_defined_path # folder where the processed data are saved 

nextflow run ${git_repo_path}scripts/bulk_preprocessing_scripts/ATAC_processing.nf \
-c ${git_repo_path}scripts/bulk_preprocessing_scripts/nextflow.config -profile singularity \
--samples_metadata ${path_to_data_folder}/markers_identification_input_files/samples/Ref/SRA_samples_metadata.txt \
--sra_folder ${sra_output_folder} \   
--output_folder ${pepatac_output_folder} \
--genome_version hg38 \
--genome_folder ${genome_folder_path} \
--multiqc_config ${git_repo_path}scripts/bulk_preprocessing_scripts/multiqc_config.yaml \
--blacklist_file ${git_repo_path}scripts/bulk_preprocessing_scripts/hg38-blacklist.v2.bed \
--original_config ${git_repo_path}scripts/bulk_preprocessing_scripts/pepatac_original.yaml \
--samples_origin SRA --adapters ${git_repo_path}scripts/bulk_preprocessing_scripts/all_adapters_PE.fa \
--preprocess true --getcounts false -bg -resume

```

The same pipeline was used to process the ENCODE samples listed in *data/markers_identification_input_files/samples/ENCODE/SRA_samples_metadata.txt*

## Definition of the set of consensus peaks and counts extraction

The same pipeline can then be used to identify a set of consensus peaks across studies and cell-types and extract the raw counts for each peak and 
sample with the following options `--preprocess false` and `--getcounts true`.

```{bash extract_counts, eval = FALSE, echo = TRUE}
raw_counts_output_folder=user_defined_path # folder where the counts are saved

nextflow run ${git_repo_path}scripts/bulk_preprocessing_scripts/ATAC_processing.nf \
-c ${git_repo_path}scripts/bulk_preprocessing_scripts/nextflow.config -profile singularity \
--samples_metadata ${path_to_data_folder}markers_identification_input_files/samples/Ref/SRA_samples_metadata.txt \
--output_folder ${pepatac_output_folder} \
--genome_version hg38 --peak_score_thr 2 \
--genome_folder ${genome_folder_path} \
--output_folder_counts ${raw_counts_output_folder} \
--pepatac_config ${git_repo_path}scripts/bulk_preprocessing_scripts/pepatac_config.yaml \
--preprocess false --getcounts true -bg -resume

```
Raw counts for each peak and sample and the metadata associated to each sample are provided as output in *raw_counts_output_folder* (raw_counts.txt and metadata.txt).
These two files are used later in "Computing reference profiles and identifying marker peaks" and are also provided in the *data/markers_identification_input_files/* folder.

We extracted the raw counts from the ENCODE data for the peaks (listed in the .saf file) identified in the reference samples using the following command line.
The resulting counts and the encode metadata are located in *data/markers_identification_input_files/encode_counts.txt* and *data/markers_identification_input_files/encode_metadata.txt*. 

```{bash extract_counts_ENCODE, eval = FALSE, echo = TRUE}
nextflow run ${git_repo_path}scripts/bulk_preprocessing_scripts/extract_peaks_counts.nf \
-c ${git_repo_path}scripts/bulk_preprocessing_scripts/nextflow.config \
-profile singularity \
--bam_folder bam_files_ENCODE/ \
--output_folder_counts ${raw_counts_output_folder}ENCODE/ \
--saf_file ${raw_counts_output_folder}all_consensusPeaks.saf -bg -resume

# bam_files_ENCODE/ contains all bam files obtained from the preprocessing of the ENCODE ATAC-Seq data
```


## Computing reference profiles and identifying marker peaks
Markers are identified in 10 subsets of the reference samples (list of samples in each subset located in *data/markers_identification_input_files/samples/Ref/*)
and a consensus of these markers are retrieved. 
The following command line will generate reference profiles and identify cell-type specific marker peaks for EPIC-ATAC.
It will also use CIBERSORTx and DeconPeaker to build reference profiles. For CIBERSORTx a singularity image is required and can be retrieved on CIBERSORTx website by requesting a token access
on the CIBERSORTx website. The singularity image, the token and the associated username must be provided in the workflow parameters (`CIBERSORTx_singularity`, `cibersortx_token`, `CIBERSORTx_username`). 

```{bash build_profiles, eval = FALSE, echo = TRUE}

git_repo_path=user_defined_path # path to the folder where this github repository was cloned 
path_to_data_folder=user_defined_path # path to the data folder donwloaded from zenodo
my_cibersortx_token=user_defined_path # path to CIBERSORTx token (obtained from CIBERSORTx website)
cibersortx_container=user_defined_path # path to CIBERSORTx container image (obtained from CIBERSORTx website)
cibersortx_username=user_defined # CIBERSORTx username associated to the token file

# Without considering Tcell subtypes  
ref_output_path=user_defined_path # path where the output files are saved

nextflow run ${git_repo_path}scripts/reference_profiles_scripts/build_references.nf \
-c ${git_repo_path}scripts/reference_profiles_scripts/nextflow.config -profile singularity \
--counts_path ${path_to_data_folder}markers_identification_input_files/raw_counts.txt \
--metadata_path ${path_to_data_folder}markers_identification_input_files/metadata.txt \
--output_folder ${ref_output_path} \
--cibersortx_token ${my_cibersortx_token} \
--CIBERSORTx_singularity ${cibersortx_container} \
--CIBERSORTx_username ${cibersortx_username} \
--cross_validation_files ${path_to_data_folder}markers_identification_input_files/samples/Ref/subsample \
--encode_count_file ${path_to_data_folder}markers_identification_input_files/encode_counts.txt \
--encode_metadata_file ${path_to_data_folder}markers_identification_input_files/encode_metadata.txt \
--TCGA_path ${path_to_data_folder}markers_identification_input_files/TCGA_data.rds \
--HA_path ${path_to_data_folder}markers_identification_input_files/Human_Atlas_peaks.txt \
-bg -resume


# Considering Tcell subtypes
ref_output_path=user_defined_path # path where the output files are saved

nextflow run ${git_repo_path}scripts/reference_profiles_scripts/build_references.nf \
-c ${git_repo_path}scripts/reference_profiles_scripts/nextflow.config -profile singularity \
--counts_path ${path_to_data_folder}markers_identification_input_files/raw_counts.txt \
--metadata_path ${path_to_data_folder}markers_identification_input_files/metadata.txt \
--output_folder ${ref_output_path} \
--cibersortx_token ${my_cibersortx_token} \
--CIBERSORTx_singularity ${cibersortx_container} \
--CIBERSORTx_username ${cibersortx_username} \
--cross_validation_files ${path_to_data_folder}markers_identification_input_files/samples/Ref/subsample \
--encode_count_file ${path_to_data_folder}markers_identification_input_files/encode_counts.txt \
--encode_metadata_file ${path_to_data_folder}markers_identification_input_files/encode_metadata.txt \
--TCGA_path ${path_to_data_folder}markers_identification_input_files/TCGA_data.rds \
--HA_path ${path_to_data_folder}markers_identification_input_files/Human_Atlas_peaks.txt \
--major_groups FALSE \
-bg -resume

```

## Evaluating EPIC-ATAC and reproducing the manuscript figures

```{bash main_analyses, eval = FALSE, echo = TRUE}

path_to_data_folder=user_defined_path
my_cibersortx_token=user_defined_path 
cibersortx_container=user_defined_path 
cibersortx_username=user_defined_path 
output_path=user_defined_path 

# Without considering the T cell subtypes
nextflow run ${git_repo_path}scripts/main_analyses/main_analyses.nf \
-c ${git_repo_path}scripts/main_analyses/nextflow.config \
-profile singularity \
--cibersortx_token ${my_cibersortx_token} \
--CIBERSORTx_singularity ${cibersortx_container} \
--CIBERSORTx_username ${cibersortx_username} \
--data_path ${path_to_data_folder} --withSubtypes FALSE \
--output_path ${output_path} -bg -resume

# Considering the T cell subtypes
nextflow run ${git_repo_path}scripts/main_analyses/main_analyses.nf \
-c ${git_repo_path}scripts/main_analyses/nextflow.config \
-profile singularity \
--cibersortx_token ${my_cibersortx_token} \
--CIBERSORTx_singularity ${cibersortx_container} \
--CIBERSORTx_username ${cibersortx_username} \
--data_path ${path_to_data_folder} --withSubtypes TRUE \
--output_path ${output_path} -bg -resume

```