transformer_implementation.blocks.layers package

Submodules

transformer_implementation.blocks.layers.FeedForward module

class transformer_implementation.blocks.layers.FeedForward.FeedForward(config)

Bases: Module

A position-wise Feed Forward Neural Network (FFNN) class for transformer models.

This class implements a position-wise FFNN consisting of two linear transformations with a GELU activation in between, followed by dropout for regularization.

Attributes

c_fc : torch.nn.Linear

The first fully connected layer of the feed-forward network. It takes as input a tensor with n_embd features and returns a tensor with 4 * n_embd features.

gelu : torch.nn.GELU

The Gaussian Error Linear Unit activation function.

c_proj : torch.nn.Linear

The second fully connected layer of the feed-forward network. It takes as input a tensor with 4 * n_embd features and returns a tensor with n_embd features.

dropout : torch.nn.Dropout

The dropout layer for regularization. The dropout rate is specified in the configuration.

Methods

forward(x: torch.Tensor) -> torch.Tensor:

Computes the forward pass of the network.

Parameters

config : object

A configuration object with the following attributes (a minimal example is sketched below):

n_embd (int): The size of the input and output feature vectors.
bias (bool): If True, the linear layers will include a bias term.
dropout (float): The dropout rate to use for regularization.
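As a minimal illustration of such a configuration object, assuming it is simply an object exposing these attributes (types.SimpleNamespace is used here purely for the sketch; the concrete values are illustrative only):

import torch
from types import SimpleNamespace

# Hypothetical configuration exposing the attributes documented above.
config = SimpleNamespace(n_embd=768, bias=True, dropout=0.1)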

forward(x) → Tensor

Implements the forward pass of the feed-forward network.

Parameters

x : torch.Tensor

The input tensor, with n_embd features in its last dimension.

Returns

torch.Tensor

The output tensor, post-processed by the feed-forward network.
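For reference, a minimal sketch of the forward computation described above (expand to 4 * n_embd features, apply GELU, project back to n_embd, then dropout). This is an illustrative reimplementation under the documented attribute names, not necessarily the package's exact source:

import torch
import torch.nn as nn

class FeedForward(nn.Module):
    # Illustrative sketch of the FeedForward layer described above.
    def __init__(self, config):
        super().__init__()
        self.c_fc = nn.Linear(config.n_embd, 4 * config.n_embd, bias=config.bias)
        self.gelu = nn.GELU()
        self.c_proj = nn.Linear(4 * config.n_embd, config.n_embd, bias=config.bias)
        self.dropout = nn.Dropout(config.dropout)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # expand -> non-linearity -> project back -> regularize
        return self.dropout(self.c_proj(self.gelu(self.c_fc(x))))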

transformer_implementation.blocks.layers.LayerNorm module

class transformer_implementation.blocks.layers.LayerNorm.LayerNorm(ndim: int, bias: bool)

Bases: Module

A Layer Normalization module with optional bias.

This implementation of Layer Normalization allows turning off the bias term, which is not directly supported by PyTorch’s layer normalization function.

Attributes

weight : torch.nn.Parameter

A learnable scale factor initialized to one. This has the same shape as the input feature dimension.

bias : torch.nn.Parameter

A learnable bias term initialized to zero if bias is True, else None. This has the same shape as the input feature dimension.

Methods

forward(input: torch.Tensor) -> torch.Tensor:

Applies layer normalization to the input tensor.

Parameters

ndim : int

The feature dimension size of the input tensor.

bias : bool

If True, adds a learnable bias to the output.

forward(input)

Implements the forward pass of the LayerNorm module.

Parameters

input : torch.Tensor

The input tensor that will be normalized.

Returns

torch.Tensor

The normalized output tensor.
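A minimal sketch of how an optional-bias layer normalization can be realized with torch.nn.functional.layer_norm, which accepts bias=None. It mirrors the description above and is not necessarily the package's exact code:

import torch
import torch.nn as nn
import torch.nn.functional as F

class LayerNorm(nn.Module):
    # Illustrative sketch: layer normalization with an optional bias term.
    def __init__(self, ndim: int, bias: bool):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(ndim))                   # learnable scale, initialized to one
        self.bias = nn.Parameter(torch.zeros(ndim)) if bias else None  # optional learnable shift

    def forward(self, input: torch.Tensor) -> torch.Tensor:
        # F.layer_norm tolerates bias=None, which is how the optional bias is handled
        return F.layer_norm(input, self.weight.shape, self.weight, self.bias, 1e-5)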

transformer_implementation.blocks.layers.MultiHeadAttention module

class transformer_implementation.blocks.layers.MultiHeadAttention.MultiHeadAttention(config)

Bases: Module

Implements a multi-head attention module in PyTorch.

This class is a child of the PyTorch nn.Module class. It uses scaled dot product attention mechanism and includes dropout for regularization.

Attributes

n_head : int

The number of attention heads.

n_embd : int

The size of the input and output feature vectors.

dropout : float

The dropout rate to use for regularization.

block_size : int

The size of the block to use for the attention mask.

q_attn : torch.nn.Linear

The query projection layer.

k_attn : torch.nn.Linear

The key projection layer.

v_attn : torch.nn.Linear

The value projection layer.

c_proj : torch.nn.Linear

The output projection layer.

attn_dropout : torch.nn.Dropout

The dropout layer for the attention mechanism.

resid_dropout : torch.nn.Dropout

The dropout layer for the output.

bias : torch.Tensor

The attention mask to ensure causal attention.

Methods

scaled_dot_product_attention(q, k, v, mask: bool = None):

Computes the scaled dot product attention.

forward(q_x, k_x, v_x, mask = None, is_masked = False):

Computes the forward pass of the multi-head attention.

Parameters

config : object

A configuration object with the following attributes:

n_head (int): The number of attention heads.
n_embd (int): The size of the input and output feature vectors.
bias (bool): If True, the linear layers will include a bias term.
dropout (float): The dropout rate to use for regularization.
block_size (int): The size of the block to use for the attention mask.

forward(q_x, k_x, v_x, mask=None, is_masked=False)

Implements the forward pass of the multi-head attention.

Parameters

q_x : torch.Tensor

The input query tensor.

k_x : torch.Tensor

The input key tensor.

v_x : torch.Tensor

The input value tensor.

mask : bool, optional

The attention mask. If None, no mask is applied. Default is None.

is_masked : bool, optional

Whether this multi-head attention is masked (causal), i.e. whether a triangular mask should be applied. Default is False.

Returns

tuple

The output tensor and the attention weights.
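A hedged usage example: the import path, call signature, and returned tuple follow the documentation above, while the (batch, sequence, n_embd) input layout and the concrete configuration values are assumptions for illustration.

import torch
from types import SimpleNamespace
from transformer_implementation.blocks.layers.MultiHeadAttention import MultiHeadAttention

# Hypothetical configuration; attribute names follow the list documented above.
config = SimpleNamespace(n_head=8, n_embd=512, bias=True, dropout=0.1, block_size=128)
mha = MultiHeadAttention(config)

x = torch.randn(2, 128, 512)              # assumed (batch, sequence, n_embd) layout
out, attn = mha(x, x, x, is_masked=True)  # self-attention with a causal (triangular) mask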

scaled_dot_product_attention(q, k, v, mask: bool = None)

Computes the scaled dot product attention.

Parameters

q : torch.Tensor

The query tensor.

k : torch.Tensor

The key tensor.

v : torch.Tensor

The value tensor.

mask : bool, optional

The attention mask. If None, no mask is applied. Default is None.

Returns

tuple

The output tensor and the attention weights.
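For reference, scaled dot product attention computes softmax(q k^T / sqrt(d_k)) v. A minimal standalone sketch is shown below; the convention that positions where mask == 0 are blocked is an assumption, and the package's own masking convention may differ.

import math
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v, mask=None):
    # Illustrative sketch of softmax(q @ k^T / sqrt(d_k)) @ v with an optional mask.
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)          # (..., T_q, T_k) similarity scores
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))  # assumed convention: 0 means "blocked"
    weights = F.softmax(scores, dim=-1)                        # attention weights
    return weights @ v, weights                                # (output, attention weights), as documented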

Module contents