transformer_implementation package

Submodules

transformer_implementation.DataLoaderFactory module

class transformer_implementation.DataLoaderFactory.DataLoaderFactory(block_size, batch_size, tokenizer, device, train_dataset_size=5000000)

Bases: object

Factory class to create dataloaders for training, validation, and testing datasets.

It initializes the datasets and dataloaders for the given block size, batch size, tokenizer, and device. The dataloaders can be accessed directly through the corresponding attributes.

Attributes

train_data : TranslationDataset

The training dataset.

val_data : TranslationDataset

The validation dataset.

test_data : TranslationDataset

The testing dataset.

dataloader_train : torch.utils.data.DataLoader

Dataloader for the training dataset.

dataloader_val : torch.utils.data.DataLoader

Dataloader for the validation dataset.

dataloader_test : torch.utils.data.DataLoader

Dataloader for the testing dataset.

Methods

__len__() -> int:

Prints and returns the number of items in each dataset and the total.

get_batch(split: str) -> dict:

Returns a generator that iterates over the batches in the specified split.

get_batch(split)

Returns a generator that iterates over the batches in the specified split.

Parameters

split : str

The split to use. Must be one of ‘train’, ‘val’, or ‘test’.

Yields

dict

The next batch in the specified split. Each batch is a dictionary that contains tensors moved to the specified device and the ‘translation’ field.
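
Example: a minimal usage sketch. It assumes the import paths shown in this page's module headings and that the factory initializes its datasets internally, as described above; batch keys other than ‘translation’ depend on TranslationDataset.__getitem__.

    import torch
    from transformer_implementation.Tokenizer import Tokenizer
    from transformer_implementation.DataLoaderFactory import DataLoaderFactory

    device = "cuda" if torch.cuda.is_available() else "cpu"
    tokenizer = Tokenizer()

    # Build dataloaders for the train, validation, and test splits.
    factory = DataLoaderFactory(
        block_size=256,
        batch_size=12,
        tokenizer=tokenizer,
        device=device,
    )

    # Iterate over training batches; each batch is a dict of tensors already
    # moved to `device`, plus the raw 'translation' field.
    for batch in factory.get_batch("train"):
        print(batch["translation"])  # raw source/target pairs for this batch
        break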

class transformer_implementation.DataLoaderFactory.TranslationDataset(dataset, tokenizer, block_size)

Bases: Dataset

Dataset for English to French translation tasks.

The dataset includes a ‘translation’ field, which is a dict containing the source text in English (‘en’) and the target text in French (‘fr’).

Attributes

dataset : object

The dataset object containing translations.

tokenizer : object

The tokenizer object used for encoding the translations.

block_size : int

The maximum length of the tokenized sequences.

Methods

__getitem__(index: int) -> dict:

Returns the tokenized input and target sequences, their corresponding masks, and the original translation for a given index.

__len__() -> int:

Returns the number of items in the dataset.
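
Example: a hedged sketch of direct use. It assumes `dataset` can be any sequence of records shaped like {'translation': {'en': ..., 'fr': ...}} (for instance a Hugging Face split); the exact keys returned by __getitem__ are only summarized above.

    from transformer_implementation.DataLoaderFactory import TranslationDataset
    from transformer_implementation.Tokenizer import Tokenizer

    tokenizer = Tokenizer()

    # Toy stand-in for a real translation split.
    records = [{"translation": {"en": "Hello world.", "fr": "Bonjour le monde."}}]

    ds = TranslationDataset(records, tokenizer, block_size=256)
    sample = ds[0]  # tokenized input/target, their masks, and the original pair
    print(len(ds), type(sample))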

transformer_implementation.Decoder module

class transformer_implementation.Decoder.Decoder(config)

Bases: Module

Implements a decoder module in PyTorch.

This class is a child of the PyTorch nn.Module class. The decoder uses embeddings for both the vocabulary and the positions. The core of the decoder is a sequence of DecoderBlock modules. The output of the decoder is then processed by a linear layer to produce the final output.

Attributes

config : object

A configuration object with the necessary attributes for the decoder.

decoder : torch.nn.ModuleDict

A dictionary containing the decoder components, including the embeddings, dropout, DecoderBlock sequence, and LayerNorm.

lm_head : torch.nn.Linear

A linear layer for producing the final output of the decoder.

Methods

get_num_params(non_embedding: bool = True) -> int:

Returns the number of parameters in the decoder.

_init_weights(module):

Initializes the weights of the specified module.

forward(idx, enc_output=None, src_mask=None, tgt_mask=None):

Computes the forward pass of the decoder.

Parameters

config : object

A configuration object with necessary attributes, including vocab_size, n_embd, block_size, dropout, n_layer, and bias.

forward(idx, enc_output=None, src_mask=None, tgt_mask=None)

Computes the forward pass of the decoder.

Parameters

idx : torch.Tensor

The input tensor with token indices.

enc_output : torch.Tensor, optional

The output of the encoder. Default is None.

src_mask : torch.Tensor, optional

The mask for the source sequence. Default is None.

tgt_mask : torch.Tensor, optional

The mask for the target sequence. Default is None.

Returns

Tuple[torch.Tensor, List[torch.Tensor], List[torch.Tensor]]

A tuple containing the output tensor, a list of attention scores from the decoder blocks, and a list of cross-attention scores.

get_num_params(non_embedding: bool = True) -> int

Returns the number of parameters in the model. For the non-embedding count (the default), the position embedding parameters are subtracted. The token embeddings would be subtracted as well, but because of parameter sharing they are also used as weights in the final layer, so they are included.

Parameters

non_embedding : bool, optional

If True, does not count the parameters of the position embedding layer. Default is True.

Returns

int

The number of parameters in the decoder.
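
Example: a standalone sketch of the decoder. The configuration fields used here mirror TransformerConfig, the encoder output is just random numbers of width n_embd, and the shapes in the comments are expectations rather than guarantees.

    import torch
    from transformer_implementation.TransformerConfig import TransformerConfig
    from transformer_implementation.Decoder import Decoder

    config = TransformerConfig(vocab_size=50304, block_size=256, n_layer=2,
                               n_head=4, n_embd=256, dropout=0.1, bias=False)
    decoder = Decoder(config)

    idx = torch.randint(0, config.vocab_size, (2, 64))   # (batch, seq) token indices
    enc_output = torch.randn(2, 64, config.n_embd)       # stand-in encoder states

    logits, self_attn, cross_attn = decoder(idx, enc_output=enc_output)
    print(logits.shape)              # expected (2, 64, vocab_size)
    print(decoder.get_num_params())  # parameter count without position embeddings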

transformer_implementation.Encoder module

class transformer_implementation.Encoder.Encoder(config)

Bases: Module

The Encoder class implements a multi-layer transformer encoder.

This class inherits from the PyTorch nn.Module class and includes token and position embeddings, dropout, multiple encoder blocks, and layer normalization.

Attributes

config : object

A configuration object with the following attributes:

  • vocab_size (int): The size of the vocabulary.

  • block_size (int): The maximum sequence length.

  • n_embd (int): The dimension of the embeddings.

  • dropout (float): The dropout rate.

  • n_layer (int): The number of transformer layers.

  • bias (bool): If True, the linear layers will include a bias term.

encoder : torch.nn.ModuleDict

A dictionary-like module of several layers:

  • wte (torch.nn.Embedding): The token embeddings layer.

  • wpe (torch.nn.Embedding): The position embeddings layer.

  • drop (torch.nn.Dropout): The dropout layer.

  • h (torch.nn.ModuleList): The list of transformer layers.

  • ln_f (LayerNorm): The final layer normalization.

Methods

get_num_params(non_embedding: bool = True) -> int:

Returns the total number of parameters.

_init_weights(module):

Initializes the weights of the specified module.

forward(idx, mask=None):

Computes the forward pass of the encoder.

Parameters

config : object

A configuration object with the following attributes:

  • vocab_size (int): The size of the vocabulary.

  • block_size (int): The maximum sequence length.

  • n_embd (int): The dimension of the embeddings.

  • dropout (float): The dropout rate.

  • n_layer (int): The number of transformer layers.

  • bias (bool): If True, the linear layers will include a bias term.

forward(idx, mask=None)

Implements the forward pass of the encoder.

Parameters

idx : torch.Tensor

The input tensor with indices of tokens in the sequence.

mask : torch.Tensor, optional

The mask tensor. If provided, it should have the same size as idx.

Returns

Tuple[torch.Tensor, List[torch.Tensor]]

The output tensor after layer normalization and the list of attention matrices from each transformer layer.

get_num_params(non_embedding: bool = True)

Returns the number of parameters in the model. For the non-embedding count (the default), the position embedding parameters are subtracted. The token embeddings would be subtracted as well, but because of parameter sharing they are also used as weights in the final layer, so they are included.

Parameters

non_embedding : bool, optional

If True, subtracts the number of embedding parameters from the total.

Returns

int

The total number of parameters.
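
Example: a standalone sketch of the encoder built from a TransformerConfig; the shapes in the comments are expectations rather than guarantees.

    import torch
    from transformer_implementation.TransformerConfig import TransformerConfig
    from transformer_implementation.Encoder import Encoder

    config = TransformerConfig(vocab_size=50304, block_size=256, n_layer=2,
                               n_head=4, n_embd=256, dropout=0.1, bias=False)
    encoder = Encoder(config)

    idx = torch.randint(0, config.vocab_size, (2, 64))  # (batch, seq) token indices
    hidden, attn_per_layer = encoder(idx)               # mask is optional

    print(hidden.shape)         # expected (2, 64, n_embd)
    print(len(attn_per_layer))  # one attention tensor per transformer layer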

transformer_implementation.Tokenizer module

class transformer_implementation.Tokenizer.Tokenizer

Bases: object

Implements a Tokenizer based on the tiktoken library for encoding and decoding sequences.

The tokenizer has special tokens for the beginning of sentence (BOS), end of sentence (EOS), and padding (PAD). The sequence can be padded to a fixed length for processing in batch.

Attributes

BOS_IDX : int

The index of the Beginning of Sentence (BOS) token.

EOS_IDX : int

The index of the End of Sentence (EOS) token.

PAD_IDX : int

The index of the Padding (PAD) token.

encoder : tiktoken.Encoding

The encoding object used for converting sequences to and from tokens.

Methods

vocab_size() -> int:

Returns the size of the vocabulary.

sequence_padding(sequence, max_size: int, device: str) -> torch.Tensor:

Returns the padded sequence as a tensor.

sequence_cleaner(sequence) -> list:

Returns the cleaned sequence without any special tokens.

generate_padding_mask(seq, triu: bool, device: str) -> torch.Tensor:

Returns a mask for the padding tokens in the sequence.

tokenize(sequence, device: str) -> list:

Returns the tokenized sequence.

tokenize_from_str(sequence, device: str) -> list:

Returns the tokenized sequence for a given string.

generate_padding_mask(seq, triu=False, device='cpu')

Generates a mask for the padding tokens in the sequence.

Parameters

seq : torch.Tensor

The sequence for which the mask will be generated.

triu : bool, optional

If True, the mask will be an upper triangular matrix. Defaults to False.

device : str, optional

The device where the tensor will be allocated. Defaults to “cpu”.

Returns

torch.Tensor

The mask for the sequence.
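
Example: a sketch that builds a padding-only mask for a source sequence and a padding-plus-causal (triu) mask for a target sequence. Only the call pattern is shown; the mask shapes are whatever the method produces.

    import torch
    from transformer_implementation.Tokenizer import Tokenizer

    tok = Tokenizer()

    # Toy batch of one sequence, padded with PAD tokens at the end.
    seq = torch.tensor([[tok.BOS_IDX, 11, 12, tok.EOS_IDX, tok.PAD_IDX, tok.PAD_IDX]])

    src_mask = tok.generate_padding_mask(seq)             # hides padding positions
    tgt_mask = tok.generate_padding_mask(seq, triu=True)  # also hides future positions

    print(src_mask.shape, tgt_mask.shape)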

sequence_cleaner(sequence)

Removes the special tokens from the sequence.

Parameters

sequence : Union[torch.Tensor, list]

The sequence to be cleaned.

Returns

list

The cleaned sequence.

sequence_padding(sequence, max_size: int = 512, device: str = 'cpu') -> Tensor

Pads the sequence to the max_size with the PAD token.

Parameters

sequence : Union[torch.Tensor, list]

The sequence to be padded.

max_size : int, optional

The maximum size of the sequence after padding. Defaults to 512.

device : str, optional

The device where the tensor will be allocated. Defaults to “cpu”.

Returns

torch.Tensor

The padded sequence.

tokenize(sequence, device='cpu') -> list

Tokenizes the sequence using the encoder.

Parameters

sequence : Union[torch.Tensor, list]

The sequence to be tokenized.

device : str, optional

The device where the tensor will be allocated. Defaults to “cpu”.

Returns

list

The tokenized sequence.

tokenize_from_str(sequence, device='cpu') -> list

Tokenizes the string sequence using the encoder.

Parameters

sequence : str

The string sequence to be tokenized.

device : str, optional

The device where the tensor will be allocated. Defaults to “cpu”.

Returns

list

The tokenized sequence.

vocab_size() -> int

Returns the size of the vocabulary.

Returns

int

The size of the vocabulary.
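
Example: a hedged round trip through the tokenizer: encode a string, pad it to a fixed length, strip the special tokens, and decode with the underlying tiktoken encoder (the encoder attribute is documented as a tiktoken.Encoding, so its decode() method is used here).

    from transformer_implementation.Tokenizer import Tokenizer

    tok = Tokenizer()
    print(tok.vocab_size())

    # Tokenize a raw string, then pad it to a fixed block size for batching.
    ids = tok.tokenize_from_str("The cat sits on the mat.")
    padded = tok.sequence_padding(ids, max_size=32)

    # Remove BOS/EOS/PAD before decoding back to text.
    clean = tok.sequence_cleaner(padded)
    print(tok.encoder.decode(clean))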

transformer_implementation.Transformer module

class transformer_implementation.Transformer.Transformer(config)

Bases: Module

A PyTorch implementation of a Transformer model.

The Transformer model consists of an Encoder and a Decoder. It supports functionalities like forward pass, token generation, optimizer configuration, and save/load model state.

Attributes

config : object

A configuration object with the necessary attributes for the Transformer model.

encoder : Encoder

The encoder part of the Transformer model.

decoder : Decoder

The decoder part of the Transformer model.

Methods

forward(src, tgt, src_mask=None, tgt_mask=None):

Implements the forward pass of the Transformer model and returns the output and loss.

generate(src, idx, src_mask=None, max_new_tokens=128, temperature=1.0, top_k=None):

Generates new tokens given a source tensor.

configure_optimizers(weight_decay, learning_rate, betas, device_type, eps):

Configures the AdamW optimizer for the Transformer model.

save_model(path: str):

Saves the model state to the given file path.

load_model(path: str):

Loads the model state from the given file path.

Parameters

config : object

A configuration object with the necessary parameters for the Transformer model. It includes:

  • vocab_size (int): The size of the vocabulary.

  • block_size (int): The block size for the Transformer.

  • PAD_IDX (int): The index representing padding in a token sequence.

configure_optimizers(weight_decay, learning_rate, betas, device_type, eps)

Configures the AdamW optimizer for the Transformer model.

Parameters

weight_decay : float

The L2 penalty (regularization) coefficient.

learning_rate : float

The learning rate for the AdamW optimizer.

betas : tuple(float, float)

Coefficients used for computing running averages of the gradient and its square.

device_type : str

The device type for the optimizer, either “cpu” or “cuda”.

eps : float

A term added to the denominator to improve numerical stability.

Returns

torch.optim.AdamW

The AdamW optimizer configured for the Transformer model.
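
Example: a sketch that builds a small model and asks it for an AdamW optimizer, reusing the hyperparameters stored on the config; the construction of the model from a Tokenizer-derived config is an assumption about typical usage.

    from transformer_implementation.Tokenizer import Tokenizer
    from transformer_implementation.TransformerConfig import TransformerConfig
    from transformer_implementation.Transformer import Transformer

    tok = Tokenizer()
    config = TransformerConfig(vocab_size=tok.vocab_size(), BOS_IDX=tok.BOS_IDX,
                               EOS_IDX=tok.EOS_IDX, PAD_IDX=tok.PAD_IDX)
    model = Transformer(config)

    optimizer = model.configure_optimizers(
        weight_decay=config.weight_decay,
        learning_rate=config.learning_rate,
        betas=(config.beta1, config.beta2),
        device_type=config.device_type,
        eps=config.eps,
    )
    print(type(optimizer))  # torch.optim.AdamW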

forward(src, tgt, src_mask=None, tgt_mask=None)

Implements the forward pass of the Transformer model.

Parameters

src : torch.Tensor

The input tensor for the source sequence.

tgt : torch.Tensor

The input tensor for the target sequence.

src_mask : torch.Tensor, optional

The input tensor for source sequence masking.

tgt_mask : torch.Tensor, optional

The input tensor for target sequence masking.

Returns

torch.Tensor, torch.Tensor

The output tensor post-processed by the Transformer model and the calculated loss.
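
Example: a single training-style forward pass on random token ids. Real inputs would come from DataLoaderFactory; the masks here are produced by the package Tokenizer, which is assumed to be what the training code uses.

    import torch
    from transformer_implementation.Tokenizer import Tokenizer
    from transformer_implementation.TransformerConfig import TransformerConfig
    from transformer_implementation.Transformer import Transformer

    tok = Tokenizer()
    config = TransformerConfig(vocab_size=tok.vocab_size(), BOS_IDX=tok.BOS_IDX,
                               EOS_IDX=tok.EOS_IDX, PAD_IDX=tok.PAD_IDX, block_size=256)
    model = Transformer(config)

    # Random ids standing in for a tokenized source/target batch.
    src = torch.randint(0, config.vocab_size, (2, 32))
    tgt = torch.randint(0, config.vocab_size, (2, 32))
    src_mask = tok.generate_padding_mask(src)
    tgt_mask = tok.generate_padding_mask(tgt, triu=True)

    logits, loss = model(src, tgt, src_mask=src_mask, tgt_mask=tgt_mask)
    loss.backward()  # gradients for one (toy) optimization step
    print(logits.shape, loss.item())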

generate(src, idx, src_mask=None, max_new_tokens=128, temperature=1.0, top_k=None)

Generates new tokens given a source tensor.

Parameters

src : torch.Tensor

The input tensor for the source sequence.

idx : torch.Tensor

The input tensor with indices in the current context.

src_mask : torch.Tensor, optional

The input tensor for source sequence masking.

max_new_tokens : int, optional

The maximum number of new tokens to be generated.

temperature : float, optional

The softmax temperature for controlling the randomness of predictions.

top_k : int, optional

The number of highest probability vocabulary tokens to keep for next step prediction.

Returns

torch.Tensor, dict

The tensor with new generated token indices and a dictionary with attentions.
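
Example: sampling from an untrained model, so the decoded text is arbitrary; it only illustrates the call pattern. The BOS-only starting context, the batch-dimension handling, and the final decode through the tiktoken encoder are assumptions.

    import torch
    from transformer_implementation.Tokenizer import Tokenizer
    from transformer_implementation.TransformerConfig import TransformerConfig
    from transformer_implementation.Transformer import Transformer

    tok = Tokenizer()
    config = TransformerConfig(vocab_size=tok.vocab_size(), BOS_IDX=tok.BOS_IDX,
                               EOS_IDX=tok.EOS_IDX, PAD_IDX=tok.PAD_IDX, block_size=256)
    model = Transformer(config)
    model.eval()

    # Tokenize and pad the source sentence, then add a batch dimension.
    src = tok.sequence_padding(tok.tokenize_from_str("The cat sits on the mat."),
                               max_size=config.block_size).unsqueeze(0)
    src_mask = tok.generate_padding_mask(src)
    idx = torch.tensor([[tok.BOS_IDX]])  # start decoding from a BOS-only context

    with torch.no_grad():
        out, attentions = model.generate(src, idx, src_mask=src_mask,
                                         max_new_tokens=32, temperature=1.0, top_k=5)
    print(tok.encoder.decode(tok.sequence_cleaner(out[0])))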

load_model(path: str)

Loads the model state from the given file path.

Parameters

path : str

The file path from where the model state is to be loaded.

save_model(path: str)

Saves the model state to the given file path.

Parameters

path : str

The file path where the model state is to be saved.
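
Example: persisting and restoring the model state; the file name is only a placeholder, and the restored instance is assumed to require the same config.

    from transformer_implementation.Tokenizer import Tokenizer
    from transformer_implementation.TransformerConfig import TransformerConfig
    from transformer_implementation.Transformer import Transformer

    tok = Tokenizer()
    config = TransformerConfig(vocab_size=tok.vocab_size(), BOS_IDX=tok.BOS_IDX,
                               EOS_IDX=tok.EOS_IDX, PAD_IDX=tok.PAD_IDX)

    model = Transformer(config)
    model.save_model("transformer_checkpoint.pt")  # placeholder path

    restored = Transformer(config)                 # same config as the saved model
    restored.load_model("transformer_checkpoint.pt")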

transformer_implementation.TransformerConfig module

class transformer_implementation.TransformerConfig.TransformerConfig(vocab_size: int = 0, BOS_IDX: int = -1, EOS_IDX: int = -1, PAD_IDX: int = -1, block_size: int = 256, batch_size: int = 12, train_data_size: int = 5000000, grad_accumulation_steps: int = 40, n_layer: int = 2, n_head: int = 4, n_embd: int = 256, dropout: float = 0.1, bias: bool = False, max_epochs: int = 100, max_iters: int = 2000, eval_iters: int = 20, learning_rate: float = 0.0006, beta1: float = 0.9, beta2: float = 0.95, weight_decay: float = 0.1, eps: float = 1e-09, device_type: str = 'cpu', device: str = 'cpu', dtype: str = 'float16', compile: bool = True, backend: str = 'nccl')

Bases: object

Data class that stores the configuration for a Transformer model.

Parameters:
  • vocab_size (int, optional) – Total size of the tokenizer vocabulary.

  • BOS_IDX (int, optional) – Index of the BOS token, defaults to -1

  • EOS_IDX (int, optional) – Index of the EOS token, defaults to -1

  • PAD_IDX (int, optional) – Index of the PAD token, defaults to -1

  • block_size (int, optional) – Number of tokens in each sequence, defaults to 256

  • batch_size (int, optional) – Number of sequences in each batch, defaults to 12

  • train_data_size (int, optional) – Size of train data, defaults to 5000000

  • grad_accumulation_steps (int, optional) – Number of batches accumulated per optimization step during training, defaults to 40

  • n_layer (int, optional) – Number of transformer encoder and decoder blocks (N), defaults to 2

  • n_head (int, optional) – Number of heads in each attention block, defaults to 4

  • n_embd (int, optional) – Token embedding size, defaults to 256

  • dropout (float, optional) – Dropout rate to use in the Transformer model, defaults to 0.1

  • bias (bool, optional) – Indicates whether to use bias in Linears and LayerNorms, defaults to False

  • max_epochs (int, optional) – Number of training epochs, defaults to 100

  • max_iters (int, optional) – Number of training steps, defaults to 2000

  • eval_iters (int, optional) – Number of evaluation iterations, defaults to 20

  • learning_rate (float, optional) – Learning rate for the model optimization, defaults to 6e-4

  • beta1 (float, optional) – Beta1 for the AdamW optimizer, defaults to 0.9

  • beta2 (float, optional) – Beta2 for the AdamW optimizer, defaults to 0.95

  • weight_decay (float, optional) – Weight decay for the AdamW optimizer, defaults to 1e-1

  • eps (float, optional) – Epsilon for the AdamW optimizer, defaults to 1e-9

  • device (str, optional) – The device to run the model on, defaults to ‘cpu’. ‘cuda’ is used if a GPU is available.

  • dtype (str, optional) – The data type for the model, defaults to ‘bfloat16’ if GPU is available and supports ‘bfloat16’, otherwise ‘float16’

  • compile (bool, optional) – If set to True, use PyTorch 2.0 to compile the model to be faster, defaults to True

  • backend (str, optional) – Backend for DDP settings, defaults to ‘nccl’

  • ddp (bool, optional) – If set to True, this is a DDP (distributed data parallel) run; defaults to whether the environment variable ‘RANK’ is set to a value other than -1

BOS_IDX: int = -1
EOS_IDX: int = -1
PAD_IDX: int = -1
backend: str = 'nccl'
batch_size: int = 12
beta1: float = 0.9
beta2: float = 0.95
bias: bool = False
block_size: int = 256
compile: bool = True
ddp: bool = False
device: str = 'cpu'
device_type: str = 'cpu'
dropout: float = 0.1
dtype: str = 'float16'
eps: float = 1e-09
eval_iters: int = 20
grad_accumulation_steps: int = 40
learning_rate: float = 0.0006
max_epochs: int = 100
max_iters: int = 2000
n_embd: int = 256
n_head: int = 4
n_layer: int = 2
train_data_size: int = 5000000
vocab_size: int = 0
weight_decay: float = 0.1
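
Example: building a config from the tokenizer's special-token indices and overriding only what differs from the dataclass defaults.

    import torch
    from transformer_implementation.Tokenizer import Tokenizer
    from transformer_implementation.TransformerConfig import TransformerConfig

    tok = Tokenizer()
    device = "cuda" if torch.cuda.is_available() else "cpu"

    config = TransformerConfig(
        vocab_size=tok.vocab_size(),
        BOS_IDX=tok.BOS_IDX,
        EOS_IDX=tok.EOS_IDX,
        PAD_IDX=tok.PAD_IDX,
        block_size=128,
        n_layer=4,
        n_head=8,
        n_embd=512,
        device=device,
        device_type="cuda" if device == "cuda" else "cpu",
    )
    print(config.learning_rate, config.block_size)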

Module contents