Question about generating IDRs from EvoDiff-Seq #48

Open
zhang-bo-lilly opened this issue Sep 21, 2024 · 3 comments

@zhang-bo-lilly

Hello, I am having trouble running the example in the "Generating intrinsically disordered regions" section of the README.

Per #41, I downloaded the required dataset from https://zenodo.org/records/5146063, extracted human_idr_homologues.zip, and saved the contents as the human_protein_alignments directory, so the directory layout looks like this:

data/
├── blosum62-special-MSA.mat
├── human_idr_alignments
│   ├── human_idr_boundaries_gap.tsv
│   ├── human_idr_boundaries.tsv
│   └── human_protein_alignments
│       ├── HUMAN00009_1to68.fasta
│       ├── HUMAN00009_633to749.fasta
│       ├── HUMAN00009_92to145.fasta
...

From the root directory of the repository, I ran the following command and got this output:

export AMLT_OUTPUT_DIR=./test_output
python evodiff/conditional_generation_msa.py --model-type msa_oa_dm_maxsub --cond-task idr --num-seqs 1 --amlt
INDEX FILE LEN 10634
Traceback (most recent call last):
  File "/home/xxxxx/evodiff/evodiff/conditional_generation_msa.py", line 1065, in <module>
    main()
  File "/home/xxxxx/evodiff/evodiff/conditional_generation_msa.py", line 150, in main
    src, start_idx, end_idx, original_msa, num_sequences, b_src, b_start_idx, b_end_idx, oma_id = get_IDR_MSAs(index_file, data_top_dir,
  File "/home/xxxxx/evodiff/evodiff/conditional_generation_msa.py", line 826, in get_IDR_MSAs
    msa_data, new_start_idx, new_end_idx, num_sequences, b_start_idx, b_end_idx, oma_id = subsample_IDR_MSA(index_file, tokenizer, max_seq_len=max_seq_len, n_sequences=n_sequences,
  File "/home/xxxxx/evodiff/evodiff/conditional_generation_msa.py", line 893, in subsample_IDR_MSA
    query_idx = [i for i, name in enumerate(msa_names) if name == row['OMA_ID']][0]  # get query index
IndexError: list index out of range

I stepped through the code with pdb and found the following:

(Pdb) p index_file.loc[index]
OMA_ID                                                HUMAN04185
UNIPROT_ID                                                Q96K76
START                                                        424
END                                                          479
IDR_SEQ        EDEKSPQTESCTDSGAENEGSCHSDQMSNDFSNDDGVDEGICLETN...
LENGTHS                                                       55
GAP START                                                    997
GAP END                                                     1141
GAP LENGTHS                                                  144
(Pdb) p row['OMA_ID']
'HUMAN04185'
(Pdb) p [file for i, file in enumerate(all_files) if 'HUMAN04185' in file]
['HUMAN04185_1to38.fasta', 'HUMAN04185_424to479.fasta', 'HUMAN04185_839to1026.fasta']
(Pdb) aa, bb=parse_fasta(data_dir + 'human_protein_alignments/HUMAN04185_1to38.fasta', return_names=True)
(Pdb) bb
['BRAFL21358 0 to 5', 'EPTBU02539 0 to 0', 'LEPOC10560 3 to 40', 'ANATE13683 3 to 20', 'SERDU25819 0 to 11', 'SCOMX25917 1 to 18', 'GASAC17394 1 to 37', 'TAKRU19760 1 to 40', 'TETNG11216 1 to 37', 'ORYLA12382 1 to 37', 'ORYME02443 0 to 14', 'NOTFU11912 3 to 20', 'CYPVA13923 3 to 20', 'POEFO06820 1 to 37', 'XIPMA06130 3 to 20', 'ORENI17527 1 to 38', 'AMPOC21119 3 to 20', 'HIPCM02252 3 to 20', 'GADMO19517 1 to 38', 'ASTMX08999 5 to 38', 'PYGNA16253 0 to 12', 'ICTPU01019 9 to 31', 'DANRE39301 3 to 20', 'LATCH10026 1 to 38', 'ORNAN18050 0 to 26', 'PROCA13584 0 to 25', 'LOXAF12537 1 to 39', 'ECHTE14028 0 to 25', 'RABIT01068 1 to 38', 'OCHPR15109 0 to 25', 'DIPOR05931 0 to 0', 'FUKDA04471 0 to 5', 'HETGA12775 0 to 25', 'CAVAP13955 0 to 17', 'CAVPO05047 0 to 25', 'CHILA04061 1 to 38', 'OCTDE12798 1 to 38', 'JACJA01745 0 to 25', 'CRIGR16916 1 to 38', 'MOUSE45885 1 to 18', 'RATNO01797 1 to 38', 'NANGA02552 1 to 38', 'CERAT32976 1 to 13', 'CHLSB00649 1 to 18', 'MACFA09490 1 to 13', 'MACMU07436 1 to 38', 'MACNE29351 1 to 13', 'MANLE36987 1 to 13', 'PAPAN05860 0 to 25', 'COLAP32362 1 to 13', 'RHIBE07503 1 to 13', 'RHIRO33601 0 to 0', 'GORGO03243 0 to 6', 'HUMAN04185 1 to 38', 'PANPA06196 0 to 0', 'PANTR02333 0 to 0', 'PONAB01347 1 to 38', 'NOMLE01511 1 to 18', 'AOTNA04675 1 to 13', 'SAIBB00262 1 to 13', 'TARSY11018 0 to 25', 'PROCO03960 1 to 13', 'OTOGA19308 0 to 25', 'TUPBE14316 0 to 0', 'CANLF08543 0 to 12', 'VULVU21503 0 to 0', 'MUSPF13712 0 to 24', 'AILME06514 0 to 39', 'URSAM01994 0 to 0', 'URSMA27578 0 to 12', 'FELCA11798 1 to 39', 'TURTR04946 0 to 21', 'BOVIN04360 0 to 38', 'SHEEP06239 0 to 39', 'PIGXX17664 1 to 38', 'VICPA03255 0 to 25', 'PTEVA15708 0 to 25', 'MYOLU05549 1 to 39', 'ERIEU12752 0 to 21', 'HORSE18107 0 to 25', 'DASNO16007 0 to 38', 'CHOHO10481 0 to 5', 'SARHA06263 1 to 18', 'MONDO10274 1 to 38', 'MACEU07613 0 to 25', 'PHACI02145 1 to 38', 'ANAPL07288 0 to 25', 'MELGA10549 0 to 25', 'CHICK11008 0 to 43', 'FICAL13955 0 to 0', 'TAEGU16862 0 to 25', 
'CHRPI18449 1 to 38', 'SPHPU04621 0 to 6', 'ANOCA16740 1 to 38', 'XENTR16027 0 to 24', 'CIOSA04555 0 to 0', 'STRPU17710 1 to 56', 'STRMM09003 1 to 20', 'DAPPU07360 0 to 0', 'ORCCI04184 11 to 56', 'DROPE01541 1 to 2', 'DROPS09123 1 to 2', 'LUCCU03187 10 to 33', 'CULSO18336 1 to 4', 'ANOGA02647 1 to 4', 'AEDAE08107 1 to 4', 'CULQU04626 1 to 13', 'APIME11570 0 to 5', 'BOMIM10786 0 to 5', 'LINHU12916 0 to 5', 'OOCBI04348 0 to 5', 'CAMFO12507 0 to 5', 'ATTCE04431 0 to 5', 'SOLIN10701 0 to 0', 'HARSA07974 0 to 5', 'RHOPR10225 0 to 0', 'PEDHC04140 31 to 113', 'ZOONE05774 1 to 20', 'LINUN26257 1 to 18', 'CRAGI03987 1 to 61', 'OCTBM24223 1 to 18', 'NEMVE01956 1 to 18', 'HYDVU05760 1 to 12', 'AMPQE22746 9 to 63']
(Pdb) [i for i, name in enumerate(bb) if name == 'HUMAN04185']
[]
(Pdb) [i for i, name in enumerate(bb) if 'HUMAN04185' in name]
[53]
(Pdb) aa, bb=parse_fasta(data_dir + 'human_protein_alignments/HUMAN04185_424to479.fasta', return_names=True)
(Pdb) [i for i, name in enumerate(bb) if name == 'HUMAN04185']
[]
(Pdb) aa, bb=parse_fasta(data_dir + 'human_protein_alignments/HUMAN04185_839to1026.fasta', return_names=True)
(Pdb) [i for i, name in enumerate(bb) if name == 'HUMAN04185']
[]

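It seems to me that the comparison in

query_idx = [i for i, name in enumerate(msa_names) if name == row['OMA_ID']][0] # get query index

should be changed from name == row['OMA_ID'] to row['OMA_ID'] in name, because the FASTA headers also contain the residue range (for example, 'HUMAN04185 1 to 38'). Is this correct?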
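To illustrate the proposed change, here is a minimal sketch of a lookup that tolerates the range suffix in the headers. The helper name find_query_index is hypothetical and is not part of evodiff; it compares only the first whitespace-separated token of each header, which is slightly stricter than a plain substring test.

```python
def find_query_index(msa_names, oma_id):
    """Return the index of the first header whose leading token equals oma_id.

    Hypothetical helper, not part of evodiff. Headers are assumed to look
    like 'HUMAN04185 1 to 38', as in the pdb output above.
    """
    for i, name in enumerate(msa_names):
        # split(maxsplit=1)[:1] is [] for an empty header, so this never raises
        if name.split(maxsplit=1)[:1] == [oma_id]:
            return i
    raise ValueError(f"{oma_id} not found in MSA headers")


# A few headers copied from the pdb session above
names = ['GORGO03243 0 to 6', 'HUMAN04185 1 to 38', 'PANPA06196 0 to 0']

exact_matches = [i for i, name in enumerate(names) if name == 'HUMAN04185']
query_idx = find_query_index(names, 'HUMAN04185')
print(exact_matches, query_idx)
```

With these headers the exact comparison finds nothing, while the token-based lookup returns the position of the human sequence.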

@zhang-bo-lilly
Author

Additionally, this line

row = index_file.loc[index]

should use iloc instead of loc, because the row labels in the human_idr_boundaries_gap.tsv file are not consecutive. This also matches the following commented-out code.

# index = random.randint(0, len(index_file) - 1)
#
# data_dir = data_top_dir + 'human_idr_alignments/'
# all_files = os.listdir(data_dir + 'human_protein_alignments')
# if not os.path.exists(data_dir + 'human_idr_boundaries_gap.tsv'):
# preprocess_IDR_data(data_top_dir)
# print("USING INDEX", index)
# row = index_file.iloc[index]
# # Get MSA
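The difference between loc and iloc can be shown with a small stand-in table. The DataFrame below is a made-up two-row example with non-consecutive labels, not the real human_idr_boundaries_gap.tsv; only the OMA IDs are taken from this issue.

```python
import pandas as pd

# Hypothetical stand-in for the index file: two rows whose labels (3 and 7)
# are not consecutive, as happens after rows have been filtered out.
index_file = pd.DataFrame(
    {"OMA_ID": ["HUMAN00009", "HUMAN04185"]},
    index=[3, 7],
)

position = 1  # e.g. a value drawn from range(len(index_file))

# iloc selects by position, so any value in range(len(index_file)) is valid
row = index_file.iloc[position]
print(row["OMA_ID"])

# loc selects by label; label 1 does not exist in this table
try:
    index_file.loc[position]
    loc_failed = False
except KeyError:
    loc_failed = True
print("loc raised KeyError:", loc_failed)
```

An alternative with the same effect would be to call reset_index(drop=True) once after loading, so that labels and positions coincide.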

@zhang-bo-lilly
Author

Next, assuming the following line is changed to use iloc

row = index_file.loc[index]

execution then fails with another error at

p = preds[:, random_x, random_y, :]

(Pdb) n
IndexError: index 662 is out of bounds for dimension 2 with size 201
> /home/c271831/evodiff/evodiff/conditional_generation_msa.py(686)generate_idr_msa()
-> p = preds[:, random_x, random_y, :]
(Pdb) p random_x
33
(Pdb) p random_y
662
(Pdb) p preds.shape
torch.Size([1, 64, 201, 31])
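The values above show random_y (662) exceeding the size of dimension 2 of preds (201). As a debugging aid only, a guard like the following sketch would report which index is out of range before the tensor is indexed. The helper check_index is hypothetical, works on a plain shape tuple, and does not address why the index and the tensor disagree.

```python
def check_index(shape, dim, index):
    """Raise an informative IndexError if index is outside dimension dim.

    Hypothetical helper, not part of evodiff; shape is a plain tuple such
    as tuple(preds.shape).
    """
    size = shape[dim]
    if not 0 <= index < size:
        raise IndexError(
            f"index {index} is out of bounds for dimension {dim} with size {size}"
        )


preds_shape = (1, 64, 201, 31)  # shape reported by pdb above

check_index(preds_shape, 1, 33)  # random_x is within the 64 rows

try:
    check_index(preds_shape, 2, 662)  # random_y exceeds the 201 columns
    message = None
except IndexError as err:
    message = str(err)
print(message)
```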

I would appreciate any help getting the code running.

@sarahalamdari
Collaborator

I don't think this is the correct dataset: the folder should contain alignments, not single FASTA files. I have messaged the authors to get the IDR alignments uploaded to their Zenodo. In the meantime, please send me an email ([email protected]) so I can share the correct dataset with you.
