VitNet

class eyefeatures.deep.models.VitNet(CNN, RNN, fusion_mode='concat', activation=None, embed_dim=32)

Bases: Module

Parent class for a vision-and-text network that fuses CNN- and RNN-based representations by either concatenation or addition.

Parameters:
  • CNN – (nn.Module) CNN backbone for processing image data.

  • RNN – (nn.Module) RNN backbone for processing sequence data.

  • fusion_mode – (str, optional) Fusion mode (‘concat’ or ‘add’). Default is ‘concat’.

  • activation – (nn.Module, optional) Activation function applied after fusion. Default is None.

  • embed_dim – (int, optional) Embedding dimension for the projected features. Default is 32.
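
To illustrate how the two fusion modes differ, the following is a minimal sketch in plain PyTorch. It is not the eyefeatures implementation: FusionSketch, LastHidden, the toy backbones, and the use of nn.LazyLinear to project each backbone output to embed_dim are all assumptions made for illustration. Only the parameter names and their roles are taken from the list above.

```python
import torch
from torch import nn


class LastHidden(nn.Module):
    """Hypothetical helper: returns the last hidden state of a recurrent layer."""

    def __init__(self, rnn):
        super().__init__()
        self.rnn = rnn

    def forward(self, x):
        _, h_n = self.rnn(x)  # h_n: (num_layers, batch, hidden_size)
        return h_n[-1]        # (batch, hidden_size)


class FusionSketch(nn.Module):
    """Hypothetical stand-in for a CNN + RNN fusion module (not VitNet itself)."""

    def __init__(self, cnn, rnn, fusion_mode="concat", activation=None, embed_dim=32):
        super().__init__()
        if fusion_mode not in ("concat", "add"):
            raise ValueError("fusion_mode must be 'concat' or 'add'")
        self.cnn, self.rnn = cnn, rnn
        self.fusion_mode = fusion_mode
        self.activation = activation
        # Assumption: both branches are projected to embed_dim before fusion.
        # LazyLinear infers its input size on the first forward pass.
        self.proj_img = nn.LazyLinear(embed_dim)
        self.proj_seq = nn.LazyLinear(embed_dim)

    def forward(self, image, sequence):
        img = self.proj_img(self.cnn(image))
        seq = self.proj_seq(self.rnn(sequence))
        if self.fusion_mode == "concat":
            fused = torch.cat([img, seq], dim=1)  # width: 2 * embed_dim
        else:
            fused = img + seq                     # width: embed_dim
        return fused if self.activation is None else self.activation(fused)


# Toy backbones, chosen only so that the example runs end to end.
cnn = nn.Sequential(nn.Conv2d(1, 4, kernel_size=3), nn.AdaptiveAvgPool2d(1), nn.Flatten())
rnn = LastHidden(nn.GRU(input_size=2, hidden_size=8, batch_first=True))

images = torch.randn(5, 1, 16, 16)  # (batch, channels, height, width)
sequences = torch.randn(5, 10, 2)   # (batch, time steps, features)

concat_out = FusionSketch(cnn, rnn, fusion_mode="concat")(images, sequences)
add_out = FusionSketch(cnn, rnn, fusion_mode="add", activation=nn.ReLU())(images, sequences)
print(concat_out.shape)  # torch.Size([5, 64])
print(add_out.shape)     # torch.Size([5, 32])
```

In this sketch, 'concat' doubles the feature width while 'add' keeps it at embed_dim, so any layer placed after the fusion has to be sized for the chosen mode.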