Given the image and text, if we mask out the word "dog", the model should be able to use the unmasked visual information to correctly predict the masked word to be "dog".

All these models use the bidirectional transformer that is the backbone of BERT. The differences lie in the pre-training tasks each model is trained on and in slight additions to the transformer. In the case of ViLBERT, the authors also introduce a co-attention transformer layer (shown below) to define the attention mechanism between the modalities explicitly. The co-attention block injects attention-weighted vectors of the other modality (linguistic, for example) into the hidden representations of the current modality (visual).

Finally, there is also LXMERT (Tan and Bansal, 2019), another pre-trained transformer model that, as of Transformers version 3.1.0, is implemented as part of the library.
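To make the co-attention pattern just described more concrete, here is a minimal PyTorch sketch of a block in which each stream's queries attend over the other stream's keys and values. The hidden size, head count, and layer structure are illustrative assumptions, not ViLBERT's exact implementation.

```python
import torch
import torch.nn as nn

class CoAttentionBlock(nn.Module):
    """Minimal co-attention sketch: queries come from one modality,
    keys/values from the other, so attention-weighted vectors of the
    other modality are injected into the current modality's stream."""

    def __init__(self, hidden_size: int = 768, num_heads: int = 12):
        super().__init__()
        self.visual_attends_text = nn.MultiheadAttention(hidden_size, num_heads, batch_first=True)
        self.text_attends_visual = nn.MultiheadAttention(hidden_size, num_heads, batch_first=True)
        self.visual_norm = nn.LayerNorm(hidden_size)
        self.text_norm = nn.LayerNorm(hidden_size)

    def forward(self, visual_states, text_states):
        # Visual stream: queries from image regions, keys/values from text tokens.
        vis_out, _ = self.visual_attends_text(visual_states, text_states, text_states)
        # Linguistic stream: queries from text tokens, keys/values from image regions.
        txt_out, _ = self.text_attends_visual(text_states, visual_states, visual_states)
        # Residual connection plus layer norm, as in a standard transformer block.
        return (self.visual_norm(visual_states + vis_out),
                self.text_norm(text_states + txt_out))


# Toy usage: a batch of 2 examples with 36 image regions and 16 text tokens.
visual = torch.randn(2, 36, 768)
text = torch.randn(2, 16, 768)
new_visual, new_text = CoAttentionBlock()(visual, text)
```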
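Because LXMERT ships with the Transformers library, loading and running it looks roughly like the sketch below. The checkpoint name and the randomly generated region features are placeholders; in practice the visual features and bounding boxes come from an object detector such as Faster R-CNN.

```python
import torch
from transformers import LxmertTokenizer, LxmertModel

tokenizer = LxmertTokenizer.from_pretrained("unc-nlp/lxmert-base-uncased")
model = LxmertModel.from_pretrained("unc-nlp/lxmert-base-uncased")

# LXMERT expects pre-extracted region features and their bounding boxes;
# random tensors stand in here just to show the expected shapes.
visual_feats = torch.randn(1, 36, 2048)  # (batch, num_regions, feature_dim)
visual_pos = torch.rand(1, 36, 4)        # (batch, num_regions, box coordinates)

inputs = tokenizer("A dog playing in the park", return_tensors="pt")
outputs = model(**inputs, visual_feats=visual_feats, visual_pos=visual_pos)

print(outputs.language_output.shape)  # hidden states of the text stream
print(outputs.vision_output.shape)    # hidden states of the visual stream
```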