-
Notifications
You must be signed in to change notification settings - Fork 0
/
README.Rmd
173 lines (140 loc) · 9.67 KB
/
README.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
---
title: "README"
author: "Aurélie Gabriel"
date: "2023-10-11"
output: github_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
This readme describes how the code (available in the scripts/ folder) was used to perform the analyses described in the following paper:
A AG Gabriel, J Racle, M Falquet, C Jandus, D Gfeller. 2024. Robust estimation of cancer and immune cell-type proportions from bulk tumor ATAC-Seq data. DOI: https://doi.org/10.7554/eLife.94833.3
In the following sections we will refer to a *data/* folder containing files that were too large
to be hosted on the github repository. The complete data/ folder can be retrieved on zenodo (https://zenodo.org/record/13132868, additional_data.zip).
## Preprocessing of the bulk ATAC-Seq data
The following command line was used to process the ATAC-Seq samples listed in *data/markers_identification_input_files/samples/Ref/SRA_samples_metadata.txt*.
Note that the `git_repo_path` variable corresponds to the path to the location where this GitHub repository has been cloned.
The `genome_folder_path` variable corresponds to the folder containing the
refgenie genome assets which are PEPATAC prerequisites for ATAC-Seq data processing,
please refer to the PEPATAC documentation to download these data (http://pepatac.databio.org/en/latest/assets/).
`sra_output_folder` corresponds to the folder where SRA data will be downloaded.
```{bash preprocess_bulk_ATAC, eval = FALSE, echo = TRUE}
git_repo_path=user_defined_path # path to the folder where this github repository was cloned
path_to_data_folder=user_defined_path # path to the data folder donwloaded from zenodo
genome_folder_path=user_defined_path
sra_output_folder=user_defined_path
pepatac_output_folder=user_defined_path # folder where the processed data are saved
nextflow run ${git_repo_path}scripts/bulk_preprocessing_scripts/ATAC_processing.nf \
-c ${git_repo_path}scripts/bulk_preprocessing_scripts/nextflow.config -profile singularity \
--samples_metadata ${path_to_data_folder}/markers_identification_input_files/samples/Ref/SRA_samples_metadata.txt \
--sra_folder ${sra_output_folder} \
--output_folder ${pepatac_output_folder} \
--genome_version hg38 \
--genome_folder ${genome_folder_path} \
--multiqc_config ${git_repo_path}scripts/bulk_preprocessing_scripts/multiqc_config.yaml \
--blacklist_file ${git_repo_path}scripts/bulk_preprocessing_scripts/hg38-blacklist.v2.bed \
--original_config ${git_repo_path}scripts/bulk_preprocessing_scripts/pepatac_original.yaml \
--samples_origin SRA --adapters ${git_repo_path}scripts/bulk_preprocessing_scripts/all_adapters_PE.fa \
--preprocess true --getcounts false -bg -resume
```
The same pipeline was used to process the ENCODE samples listed in *data/markers_identification_input_files/samples/ENCODE/SRA_samples_metadata.txt*
## Definition of the set of consensus peaks and counts extraction
The same pipeline can then be used to identify a set of consensus peaks across studies and cell-types and extract the raw counts for each peak and
sample with the following options `--preprocess false` and `--getcounts true`.
```{bash extract_counts, eval = FALSE, echo = TRUE}
raw_counts_output_folder=user_defined_path # folder where the counts are saved
nextflow run ${git_repo_path}scripts/bulk_preprocessing_scripts/ATAC_processing.nf \
-c ${git_repo_path}scripts/bulk_preprocessing_scripts/nextflow.config -profile singularity \
--samples_metadata ${path_to_data_folder}markers_identification_input_files/samples/Ref/SRA_samples_metadata.txt \
--output_folder ${pepatac_output_folder} \
--genome_version hg38 --peak_score_thr 2 \
--genome_folder ${genome_folder_path} \
--output_folder_counts ${raw_counts_output_folder} \
--pepatac_config ${git_repo_path}scripts/bulk_preprocessing_scripts/pepatac_config.yaml \
--preprocess false --getcounts true -bg -resume
```
Raw counts for each peak and sample and the metadata associated to each sample are provided as output in *raw_counts_output_folder* (raw_counts.txt and metadata.txt).
These two files are used later in "Computing reference profiles and identifying marker peaks" and are also provided in the *data/markers_identification_input_files/* folder.
We extracted the raw counts from the ENCODE data for the peaks (listed in the .saf file) identified in the reference samples using the following command line.
The resulting counts and the encode metadata are located in *data/markers_identification_input_files/encode_counts.txt* and *data/markers_identification_input_files/encode_metadata.txt*.
```{bash extract_counts_ENCODE, eval = FALSE, echo = TRUE}
nextflow run ${git_repo_path}scripts/bulk_preprocessing_scripts/extract_peaks_counts.nf \
-c ${git_repo_path}scripts/bulk_preprocessing_scripts/nextflow.config \
-profile singularity \
--bam_folder bam_files_ENCODE/ \
--output_folder_counts ${raw_counts_output_folder}ENCODE/ \
--saf_file ${raw_counts_output_folder}all_consensusPeaks.saf -bg -resume
# bam_files_ENCODE/ contains all bam files obtained from the preprocessing of the ENCODE ATAC-Seq data
```
## Computing reference profiles and identifying marker peaks
Markers are identified in 10 subsets of the reference samples (list of samples in each subset located in *data/markers_identification_input_files/samples/Ref/*)
and a consensus of these markers are retrieved.
The following command line will generate reference profiles and identify cell-type specific marker peaks for EPIC-ATAC.
It will also use CIBERSORTx and DeconPeaker to build reference profiles. For CIBERSORTx a singularity image is required and can be retrieved on CIBERSORTx website by requesting a token access
on the CIBERSORTx website. The singularity image, the token and the associated username must be provided in the workflow parameters (`CIBERSORTx_singularity`, `cibersortx_token`, `CIBERSORTx_username`).
```{bash build_profiles, eval = FALSE, echo = TRUE}
git_repo_path=user_defined_path # path to the folder where this github repository was cloned
path_to_data_folder=user_defined_path # path to the data folder donwloaded from zenodo
my_cibersortx_token=user_defined_path # path to CIBERSORTx token (obtained from CIBERSORTx website)
cibersortx_container=user_defined_path # path to CIBERSORTx container image (obtained from CIBERSORTx website)
cibersortx_username=user_defined # CIBERSORTx username associated to the token file
# Without considering Tcell subtypes
ref_output_path=user_defined_path # path where the output files are saved
nextflow run ${git_repo_path}scripts/reference_profiles_scripts/build_references.nf \
-c ${git_repo_path}scripts/reference_profiles_scripts/nextflow.config -profile singularity \
--counts_path ${path_to_data_folder}markers_identification_input_files/raw_counts.txt \
--metadata_path ${path_to_data_folder}markers_identification_input_files/metadata.txt \
--output_folder ${ref_output_path} \
--cibersortx_token ${my_cibersortx_token} \
--CIBERSORTx_singularity ${cibersortx_container} \
--CIBERSORTx_username ${cibersortx_username} \
--cross_validation_files ${path_to_data_folder}markers_identification_input_files/samples/Ref/subsample \
--encode_count_file ${path_to_data_folder}markers_identification_input_files/encode_counts.txt \
--encode_metadata_file ${path_to_data_folder}markers_identification_input_files/encode_metadata.txt \
--TCGA_path ${path_to_data_folder}markers_identification_input_files/TCGA_data.rds \
--HA_path ${path_to_data_folder}markers_identification_input_files/Human_Atlas_peaks.txt \
-bg -resume
# Considering Tcell subtypes
ref_output_path=user_defined_path # path where the output files are saved
nextflow run ${git_repo_path}scripts/reference_profiles_scripts/build_references.nf \
-c ${git_repo_path}scripts/reference_profiles_scripts/nextflow.config -profile singularity \
--counts_path ${path_to_data_folder}markers_identification_input_files/raw_counts.txt \
--metadata_path ${path_to_data_folder}markers_identification_input_files/metadata.txt \
--output_folder ${ref_output_path} \
--cibersortx_token ${my_cibersortx_token} \
--CIBERSORTx_singularity ${cibersortx_container} \
--CIBERSORTx_username ${cibersortx_username} \
--cross_validation_files ${path_to_data_folder}markers_identification_input_files/samples/Ref/subsample \
--encode_count_file ${path_to_data_folder}markers_identification_input_files/encode_counts.txt \
--encode_metadata_file ${path_to_data_folder}markers_identification_input_files/encode_metadata.txt \
--TCGA_path ${path_to_data_folder}markers_identification_input_files/TCGA_data.rds \
--HA_path ${path_to_data_folder}markers_identification_input_files/Human_Atlas_peaks.txt \
--major_groups FALSE \
-bg -resume
```
## Evaluating EPIC-ATAC and reproducing the manuscript figures
```{bash main_analyses, eval = FALSE, echo = TRUE}
path_to_data_folder=user_defined_path
my_cibersortx_token=user_defined_path
cibersortx_container=user_defined_path
cibersortx_username=user_defined_path
output_path=user_defined_path
# Without considering the T cell subtypes
nextflow run ${git_repo_path}scripts/main_analyses/main_analyses.nf \
-c ${git_repo_path}scripts/main_analyses/nextflow.config \
-profile singularity \
--cibersortx_token ${my_cibersortx_token} \
--CIBERSORTx_singularity ${cibersortx_container} \
--CIBERSORTx_username ${cibersortx_username} \
--data_path ${path_to_data_folder} --withSubtypes FALSE \
--output_path ${output_path} -bg -resume
# Considering the T cell subtypes
nextflow run ${git_repo_path}scripts/main_analyses/main_analyses.nf \
-c ${git_repo_path}scripts/main_analyses/nextflow.config \
-profile singularity \
--cibersortx_token ${my_cibersortx_token} \
--CIBERSORTx_singularity ${cibersortx_container} \
--CIBERSORTx_username ${cibersortx_username} \
--data_path ${path_to_data_folder} --withSubtypes TRUE \
--output_path ${output_path} -bg -resume
```