
Error when passing encoder_outputs as tuple to EncoderDecoder models #15536

Closed
jsnfly opened this issue Feb 6, 2022 · 7 comments · Fixed by #16814

Comments

@jsnfly
Contributor

jsnfly commented Feb 6, 2022

Environment info

  • transformers version: 4.17.0.dev0
  • Platform: Linux-5.13.0-27-generic-x86_64-with-glibc2.34
  • Python version: 3.9.7
  • PyTorch version (GPU?): 1.10.1+cu102 (True)
  • Tensorflow version (GPU?): 2.7.0 (False)
  • Flax version (CPU?/GPU?/TPU?): 0.3.6 (cpu)
  • Jax version: 0.2.26
  • JaxLib version: 0.1.75
  • Using GPU in script?: yes
  • Using distributed or parallel set-up in script?: no

Who can help

@patrickvonplaten

Information

In EncoderDecoder models one can pass encoder_outputs as a tuple of Tensors. However, if you do that, this line will fail with

AttributeError: 'tuple' object has no attribute 'last_hidden_state'

since the tuple isn't converted into a ModelOutput in the forward method.
So if encoder_outputs is a tuple, it could maybe be wrapped in a ModelOutput class or something similar, or the tuple could be handled explicitly.
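For illustration, here is a minimal, self-contained sketch of the user-side workaround: wrap the tuple in a ModelOutput yourself before calling the model. The checkpoints and tensors below are just placeholders, not from the original report.

```python
# Hypothetical sketch: reproduce the issue and work around it by wrapping the
# encoder tuple in a ModelOutput before passing it to the EncoderDecoderModel.
from transformers import BertTokenizer, EncoderDecoderModel
from transformers.modeling_outputs import BaseModelOutput

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = EncoderDecoderModel.from_encoder_decoder_pretrained(
    "bert-base-uncased", "bert-base-uncased"
)

inputs = tokenizer("some input text", return_tensors="pt")
encoder_tuple = model.encoder(**inputs, return_dict=False)  # a plain tuple

# Passing `encoder_tuple` directly raises
# AttributeError: 'tuple' object has no attribute 'last_hidden_state'.
# Wrapping it avoids the error:
encoder_outputs = BaseModelOutput(last_hidden_state=encoder_tuple[0])
outputs = model(
    encoder_outputs=encoder_outputs,
    decoder_input_ids=inputs.input_ids,
)
print(outputs.logits.shape)
```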

On a slight tangent

I made a SpeechEncoderDecoderModel for the robust speech challenge: https://huggingface.co/jsnfly/wav2vec2-large-xlsr-53-german-gpt2. I found that adding the position embeddings of the decoder model to the outputs of the encoder model improved performance significantly (basically didn't work without it).
This needs small modifications to the __init__ and forward methods of the SpeechEncoderDecoderModel.

At the moment this seems to me like too much of a "hack" to add to the SpeechEncoderDecoderModel class in general (for example via a flag), because it may differ for different decoder models and probably also needs more verification. @patrickvonplaten showed some interest in including this in Transformers nonetheless. What do you think?
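To make the idea above concrete, here is a rough, hypothetical sketch (not the actual modification used in the linked model). It assumes a GPT-2 style decoder whose position-embedding table is `decoder.transformer.wpe`, and that the encoder hidden states have already been projected to the decoder's hidden size:

```python
# Hypothetical sketch: add the decoder's learned position embeddings to the
# (already projected) encoder hidden states before they are used for cross-attention.
import torch

def add_decoder_position_embeddings(
    encoder_hidden_states: torch.Tensor,          # (batch, seq_len, hidden)
    position_embedding_table: torch.nn.Embedding,
) -> torch.Tensor:
    seq_len = encoder_hidden_states.size(1)
    # Note: truncation or interpolation would be needed if seq_len exceeded the table size.
    position_ids = torch.arange(seq_len, device=encoder_hidden_states.device)
    position_embeds = position_embedding_table(position_ids)      # (seq_len, hidden)
    return encoder_hidden_states + position_embeds.unsqueeze(0)   # broadcast over batch

# Toy usage with random tensors (hidden size 768, as in GPT-2):
hidden = torch.randn(2, 50, 768)
wpe = torch.nn.Embedding(1024, 768)
print(add_decoder_position_embeddings(hidden, wpe).shape)  # torch.Size([2, 50, 768])
```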

@patrickvonplaten
Contributor

patrickvonplaten commented Apr 6, 2022

Hey @jsnfly,

Regarding the first point - agree, it'd be good to check if the input is a tuple and if it is we can wrap it into a ModelOutput object. Would you be interested in opening a PR for this? :-)

Regarding the 2nd point - that's very interesting (cc @sanchit-gandhi). Also makes a lot of sense since ASR by itself is monotonic so knowing the order of words to transcribe together with the encoder speech frames seems like a sensible design architecture. Thanks a lot for sharing this here!

@sanchit-gandhi
Contributor

sanchit-gandhi commented Apr 6, 2022

The embedding hack is a really neat find - nice one @jsnfly! It's something we're going to look into in our ASR experiments! It seems like it could help with alignment in a much cleaner and more compact way than the encoder-decoder cross-attention mechanism.

@jsnfly
Contributor Author

jsnfly commented Apr 18, 2022

Regarding the first point - agree, it'd be good to check if the input is a tuple and if it is we can wrap it into a ModelOutput object. Would you be interested in opening a PR for this? :-)

I have opened one - feel free to take a look.

Regarding the 2nd point - that's very interesting (cc @sanchit-gandhi). Also makes a lot of sense since ASR by itself is monotonic so knowing the order of words to transcribe together with the encoder speech frames seems like a sensible design architecture. Thanks a lot for sharing this here!

The embedding hack is a really neat find - nice one @jsnfly! It's something we're going to look into in our ASR experiments! It seems like it could help with alignment in a much cleaner and more compact way than the encoder-decoder cross-attention mechanism.

Thanks for your feedback :) I will also try to experiment with this a bit more and let you know if I get some more results.

@espoirMur

espoirMur commented Jun 25, 2022

@jsnfly, thank you for this PR. Is it possible to do this fix for a T5 model as well? It is also a sequence-to-sequence model, and sometimes we may want to pass a tuple to the decoder.

If you guys don't see any issue with it, I can do that.

For context, I am playing with Fusion-in-Decoder (FiD), which is a variant of the T5 model.

The encoder output is a tuple containing the hidden states of all encoder blocks concatenated into one tensor, but the code fails because it is handed a tuple.

I am going to apply this fix to the T5 model locally and see how it behaves.

@patrickvonplaten, let me know what you think.
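For reference, a minimal sketch of the same wrapping idea applied to a plain T5 model; the checkpoint and tensors below are illustrative, not FiD's actual code:

```python
# Hypothetical sketch: hand pre-computed encoder hidden states to T5 wrapped in a
# ModelOutput instead of a raw tuple.
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer
from transformers.modeling_outputs import BaseModelOutput

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

inputs = tokenizer("translate English to German: Hello world", return_tensors="pt")
encoder_states = model.encoder(**inputs).last_hidden_state  # (1, seq_len, d_model)

encoder_outputs = BaseModelOutput(last_hidden_state=encoder_states)
decoder_input_ids = torch.tensor([[model.config.decoder_start_token_id]])
outputs = model(encoder_outputs=encoder_outputs, decoder_input_ids=decoder_input_ids)
print(outputs.logits.shape)
```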

@ZIZUN

ZIZUN commented Jul 31, 2022

@espoirMur
FiD requires transformers 3.0.2, and that version's model outputs are plain tuples.
You can fix this issue by adding the argument return_dict=False to the model call (on the latest transformers version), as follows:
train_loss = model(input_ids=context_ids.cuda(), attention_mask=context_mask.cuda(), labels=labels.cuda(), return_dict=False)[0]

@Baxkiller

Thanks for your response; it helps a lot!

@Tan-Hexiang

Thanks! This helps me a lot!
