transformer_implementation package

Submodules

transformer_implementation.DataLoaderFactory module

class transformer_implementation.DataLoaderFactory.DataLoaderFactory(block_size, batch_size, tokenizer, device, train_dataset_size=5000000)

Bases: object

Factory class to create dataloaders for training, validation, and testing datasets.

It initializes the datasets and dataloaders for the given block size, batch size, tokenizer, and device. The dataloaders can be accessed directly through the corresponding attributes.

Attributes

train_data : TranslationDataset

The training dataset.

val_data : TranslationDataset

The validation dataset.

test_data : TranslationDataset

The testing dataset.

dataloader_train : torch.utils.data.DataLoader

Dataloader for the training dataset.

dataloader_val : torch.utils.data.DataLoader

Dataloader for the validation dataset.

dataloader_test : torch.utils.data.DataLoader

Dataloader for the testing dataset.

Methods

__len__() -> int:

Prints and returns the number of items in each dataset and the total.

get_batch(split: str) -> dict:

Returns a generator that iterates over the batches in the specified split.

get_batch(split)

Returns a generator that iterates over the batches in the specified split.

Parameters

split : str

The split to use. Must be one of ‘train’, ‘val’, or ‘test’.

Yields

dict

The next batch in the specified split. Each batch is a dictionary that contains tensors moved to the specified device and the ‘translation’ field.
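
Example: a minimal usage sketch. It assumes the import paths shown in this page's module headings and that the factory initializes its datasets internally, as described above; batch keys other than ‘translation’ depend on TranslationDataset.__getitem__.

    import torch
    from transformer_implementation.Tokenizer import Tokenizer
    from transformer_implementation.DataLoaderFactory import DataLoaderFactory

    device = "cuda" if torch.cuda.is_available() else "cpu"
    tokenizer = Tokenizer()

    # Build dataloaders for the train, validation, and test splits.
    factory = DataLoaderFactory(
        block_size=256,
        batch_size=12,
        tokenizer=tokenizer,
        device=device,
    )

    # Iterate over training batches; each batch is a dict of tensors already
    # moved to `device`, plus the raw 'translation' field.
    for batch in factory.get_batch("train"):
        print(batch["translation"])  # raw source/target pairs for this batch
        break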

class transformer_implementation.DataLoaderFactory.TranslationDataset(dataset, tokenizer, block_size)

Bases: Dataset

Dataset for English to French translation tasks.

The dataset includes a ‘translation’ field, which is a dict containing the source text in English (‘en’) and the target text in French (‘fr’).

Attributes

dataset : object

The dataset object containing translations.

tokenizer : object

The tokenizer object used for encoding the translations.

block_size : int

The maximum length of the tokenized sequences.

Methods

__getitem__(index: int) -> dict:

Returns the tokenized input and target sequences, their corresponding masks, and the original translation for a given index.

__len__() -> int:

Returns the number of items in the dataset.
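
Example: a hedged sketch of direct use. It assumes `dataset` can be any sequence of records shaped like {'translation': {'en': ..., 'fr': ...}} (for instance a Hugging Face split); the exact keys returned by __getitem__ are only summarized above.

    from transformer_implementation.DataLoaderFactory import TranslationDataset
    from transformer_implementation.Tokenizer import Tokenizer

    tokenizer = Tokenizer()

    # Toy stand-in for a real translation split.
    records = [{"translation": {"en": "Hello world.", "fr": "Bonjour le monde."}}]

    ds = TranslationDataset(records, tokenizer, block_size=256)
    sample = ds[0]  # tokenized input/target, their masks, and the original pair
    print(len(ds), type(sample))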

transformer_implementation.Decoder module

class transformer_implementation.Decoder.Decoder(config)

Bases: Module

Implements a decoder module in PyTorch.

This class is a child of the PyTorch nn.Module class. The decoder uses embeddings for both the vocabulary and the positions. The core of the decoder is a sequence of DecoderBlock modules. The output of the decoder is then processed by a linear layer to produce the final output.

Attributes

config : object

A configuration object with the necessary attributes for the decoder.

decoder : torch.nn.ModuleDict

A dictionary containing the decoder components, including the embeddings, dropout, DecoderBlock sequence, and LayerNorm.

lm_head : torch.nn.Linear

A linear layer for producing the final output of the decoder.

Methods

get_num_params(non_embedding: bool = True) -> int:

Returns the number of parameters in the decoder.

_init_weights(module):

Initializes the weights of the specified module.

forward(idx, enc_output=None, src_mask=None, tgt_mask=None):

Computes the forward pass of the decoder.

Parameters

config : object

A configuration object with necessary attributes, including vocab_size, n_embd, block_size, dropout, n_layer, and bias.

forward(idx, enc_output=None, src_mask=None, tgt_mask=None)

Computes the forward pass of the decoder.

Parameters

idx : torch.Tensor

The input tensor with token indices.

enc_output : torch.Tensor, optional

The output of the encoder. Default is None.

src_mask : torch.Tensor, optional

The mask for the source sequence. Default is None.

tgt_mask : torch.Tensor, optional

The mask for the target sequence. Default is None.

Returns

Tuple[torch.Tensor, List[torch.Tensor], List[torch.Tensor]]

A tuple containing the output tensor, a list of attention scores from the decoder blocks, and a list of cross-attention scores.

get_num_params(non_embedding: bool = True) -> int

Returns the number of parameters in the model. For the non-embedding count (the default), the position embedding parameters are subtracted. The token embeddings would be subtracted as well, but because of parameter sharing they are also used as weights in the final layer, so they are included.

Parameters

non_embedding : bool, optional

If True, does not count the parameters of the position embedding layer. Default is True.

Returns

int

The number of parameters in the decoder.
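
Example: a standalone sketch of the decoder. The configuration fields used here mirror TransformerConfig, the encoder output is just random numbers of width n_embd, and the shapes in the comments are expectations rather than guarantees.

    import torch
    from transformer_implementation.TransformerConfig import TransformerConfig
    from transformer_implementation.Decoder import Decoder

    config = TransformerConfig(vocab_size=50304, block_size=256, n_layer=2,
                               n_head=4, n_embd=256, dropout=0.1, bias=False)
    decoder = Decoder(config)

    idx = torch.randint(0, config.vocab_size, (2, 64))   # (batch, seq) token indices
    enc_output = torch.randn(2, 64, config.n_embd)       # stand-in encoder states

    logits, self_attn, cross_attn = decoder(idx, enc_output=enc_output)
    print(logits.shape)              # expected (2, 64, vocab_size)
    print(decoder.get_num_params())  # parameter count without position embeddings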

transformer_implementation.Encoder module

class transformer_implementation.Encoder.Encoder(config)

Bases: Module

The Encoder class implements a multi-layer transformer encoder.

This class inherits from the PyTorch nn.Module class and includes token and position embeddings, dropout, multiple encoder blocks, and layer normalization.

Attributes

config : object

A configuration object with the following attributes:

  • vocab_size (int): The size of the vocabulary.

  • block_size (int): The maximum sequence length.

  • n_embd (int): The dimension of the embeddings.

  • dropout (float): The dropout rate.

  • n_layer (int): The number of transformer layers.

  • bias (bool): If True, the linear layers will include a bias term.

encoder : torch.nn.ModuleDict

A dictionary-like module of several layers:

  • wte (torch.nn.Embedding): The token embeddings layer.

  • wpe (torch.nn.Embedding): The position embeddings layer.

  • drop (torch.nn.Dropout): The dropout layer.

  • h (torch.nn.ModuleList): The list of transformer layers.

  • ln_f (LayerNorm): The final layer normalization.

Methods

get_num_params(non_embedding: bool = True) -> int:

Returns the total number of parameters.

_init_weights(module):

Initializes the weights of the specified module.

forward(idx, mask=None):

Computes the forward pass of the encoder.

Parameters

config : object

A configuration object with the following attributes:

  • vocab_size (int): The size of the vocabulary.

  • block_size (int): The maximum sequence length.

  • n_embd (int): The dimension of the embeddings.

  • dropout (float): The dropout rate.

  • n_layer (int): The number of transformer layers.

  • bias (bool): If True, the linear layers will include a bias term.

forward(idx, mask=None)

Implements the forward pass of the encoder.

Parameters

idx : torch.Tensor

The input tensor with indices of tokens in the sequence.

mask : torch.Tensor, optional

The mask tensor. If provided, it should have the same size as idx.

Returns

Tuple[torch.Tensor, List[torch.Tensor]]

The output tensor after layer normalization and the list of attention matrices from each transformer layer.

get_num_params(non_embedding: bool = True)

Returns the number of parameters in the model. For the non-embedding count (the default), the position embedding parameters are subtracted. The token embeddings would be subtracted as well, but because of parameter sharing they are also used as weights in the final layer, so they are included.

Parameters

non_embedding : bool, optional

If True, subtracts the number of embedding parameters from the total.

Returns

int

The total number of parameters.
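
Example: a standalone sketch of the encoder built from a TransformerConfig; the shapes in the comments are expectations rather than guarantees.

    import torch
    from transformer_implementation.TransformerConfig import TransformerConfig
    from transformer_implementation.Encoder import Encoder

    config = TransformerConfig(vocab_size=50304, block_size=256, n_layer=2,
                               n_head=4, n_embd=256, dropout=0.1, bias=False)
    encoder = Encoder(config)

    idx = torch.randint(0, config.vocab_size, (2, 64))  # (batch, seq) token indices
    hidden, attn_per_layer = encoder(idx)               # mask is optional

    print(hidden.shape)         # expected (2, 64, n_embd)
    print(len(attn_per_layer))  # one attention tensor per transformer layer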

transformer_implementation.Tokenizer module

class transformer_implementation.Tokenizer.Tokenizer

Bases: object

Implements a Tokenizer based on the tiktoken library for encoding and decoding sequences.

The tokenizer has special tokens for the beginning of sentence (BOS), end of sentence (EOS), and padding (PAD). The sequence can be padded to a fixed length for processing in batch.

Attributes

BOS_IDX : int

The index of the Beginning of Sentence (BOS) token.

EOS_IDX : int

The index of the End of Sentence (EOS) token.

PAD_IDX : int

The index of the Padding (PAD) token.

encoder : tiktoken.Encoding

The encoding object used for converting sequences to and from tokens.

Methods

vocab_size() -> int:

Returns the size of the vocabulary.

sequence_padding(sequence, max_size: int, device: str) -> torch.Tensor:

Returns the padded sequence as a tensor.

sequence_cleaner(sequence) -> list:

Returns the cleaned sequence without any special tokens.

generate_padding_mask(seq, triu: bool, device: str) -> torch.Tensor:

Returns a mask for the padding tokens in the sequence.

tokenize(sequence, device: str) -> list:

Returns the tokenized sequence.

tokenize_from_str(sequence, device: str) -> list:

Returns the tokenized sequence for a given string.

generate_padding_mask(seq, triu=False, device='cpu')

Generates a mask for the padding tokens in the sequence.

Parameters

seq : torch.Tensor

The sequence for which the mask will be generated.

triu : bool, optional

If True, the mask will be an upper triangular matrix. Defaults to False.

device : str, optional

The device where the tensor will be allocated. Defaults to “cpu”.

Returns

torch.Tensor

The mask for the sequence.
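
Example: a sketch that builds a padding-only mask for a source sequence and a padding-plus-causal (triu) mask for a target sequence. Only the call pattern is shown; the mask shapes are whatever the method produces.

    import torch
    from transformer_implementation.Tokenizer import Tokenizer

    tok = Tokenizer()

    # Toy batch of one sequence, padded with PAD tokens at the end.
    seq = torch.tensor([[tok.BOS_IDX, 11, 12, tok.EOS_IDX, tok.PAD_IDX, tok.PAD_IDX]])

    src_mask = tok.generate_padding_mask(seq)             # hides padding positions
    tgt_mask = tok.generate_padding_mask(seq, triu=True)  # also hides future positions

    print(src_mask.shape, tgt_mask.shape)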

sequence_cleaner(sequence)

Removes the special tokens from the sequence.

Parameters

sequence : Union[torch.Tensor, list]

The sequence to be cleaned.

Returns

list

The cleaned sequence.

sequence_padding(sequence, max_size: int = 512, device: str = 'cpu') -> Tensor

Pads the sequence to the max_size with the PAD token.

Parameters

sequence : Union[torch.Tensor, list]

The sequence to be padded.

max_size : int, optional

The maximum size of the sequence after padding. Defaults to 512.

device : str, optional

The device where the tensor will be allocated. Defaults to “cpu”.

Returns

torch.Tensor

The padded sequence.

tokenize(sequence, device='cpu') -> list

Tokenizes the sequence using the encoder.

Parameters

sequence : Union[torch.Tensor, list]

The sequence to be tokenized.

device : str, optional

The device where the tensor will be allocated. Defaults to “cpu”.

Returns

list

The tokenized sequence.

tokenize_from_str(sequence, device='cpu') -> list

Tokenizes the string sequence using the encoder.

Parameters

sequence : str

The string sequence to be tokenized.

device : str, optional

The device where the tensor will be allocated. Defaults to “cpu”.

Returns

list

The tokenized sequence.

vocab_size() -> int

Returns the size of the vocabulary.

Returns

int

The size of the vocabulary.
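
Example: a hedged round trip through the tokenizer: encode a string, pad it to a fixed length, strip the special tokens, and decode with the underlying tiktoken encoder (the encoder attribute is documented as a tiktoken.Encoding, so its decode() method is used here).

    from transformer_implementation.Tokenizer import Tokenizer

    tok = Tokenizer()
    print(tok.vocab_size())

    # Tokenize a raw string, then pad it to a fixed block size for batching.
    ids = tok.tokenize_from_str("The cat sits on the mat.")
    padded = tok.sequence_padding(ids, max_size=32)

    # Remove BOS/EOS/PAD before decoding back to text.
    clean = tok.sequence_cleaner(padded)
    print(tok.encoder.decode(clean))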

transformer_implementation.Transformer module

class transformer_implementation.Transformer.Transformer(config)

Bases: Module

A PyTorch implementation of a Transformer model.

The Transformer model consists of an Encoder and a Decoder. It supports functionalities like forward pass, token generation, optimizer configuration, and save/load model state.

Attributes

config : object

A configuration object with the necessary attributes for the Transformer model.

encoder : Encoder

The encoder part of the Transformer model.

decoder : Decoder

The decoder part of the Transformer model.

Methods

forward(src, tgt, src_mask=None, tgt_mask=None):

Implements the forward pass of the Transformer model and returns the output and loss.

generate(src, idx, src_mask=None, max_new_tokens=128, temperature=1.0, top_k=None):

Generates new tokens given a source tensor.

configure_optimizers(weight_decay, learning_rate, betas, device_type, eps):

Configures the AdamW optimizer for the Transformer model.

save_model(path: str):

Saves the model state to the given file path.

load_model(path: str):

Loads the model state from the given file path.

Parameters

config : object

A configuration object with the necessary parameters for the Transformer model. It includes:

  • vocab_size (int): The size of the vocabulary.

  • block_size (int): The block size for the Transformer.

  • PAD_IDX (int): The index representing padding in a token sequence.

configure_optimizers(weight_decay, learning_rate, betas, device_type, eps)

Configures the AdamW optimizer for the Transformer model.

Parameters

weight_decay : float

The L2 penalty (regularization) coefficient.

learning_rate : float

The learning rate for the AdamW optimizer.

betas : tuple(float, float)

Coefficients used for computing running averages of the gradient and its square.

device_type : str

The device type for the optimizer, either “cpu” or “cuda”.

eps : float

A term added to the denominator to improve numerical stability.

Returns

torch.optim.AdamW

The AdamW optimizer configured for the Transformer model.
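
Example: a sketch that builds a small model and asks it for an AdamW optimizer, reusing the hyperparameters stored on the config; the construction of the model from a Tokenizer-derived config is an assumption about typical usage.

    from transformer_implementation.Tokenizer import Tokenizer
    from transformer_implementation.TransformerConfig import TransformerConfig
    from transformer_implementation.Transformer import Transformer

    tok = Tokenizer()
    config = TransformerConfig(vocab_size=tok.vocab_size(), BOS_IDX=tok.BOS_IDX,
                               EOS_IDX=tok.EOS_IDX, PAD_IDX=tok.PAD_IDX)
    model = Transformer(config)

    optimizer = model.configure_optimizers(
        weight_decay=config.weight_decay,
        learning_rate=config.learning_rate,
        betas=(config.beta1, config.beta2),
        device_type=config.device_type,
        eps=config.eps,
    )
    print(type(optimizer))  # torch.optim.AdamW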

forward(src, tgt, src_mask=None, tgt_mask=None)

Implements the forward pass of the Transformer model.

Parameters

src : torch.Tensor

The input tensor for the source sequence.

tgt : torch.Tensor

The input tensor for the target sequence.

src_mask : torch.Tensor, optional

The input tensor for source sequence masking.

tgt_mask : torch.Tensor, optional

The input tensor for target sequence masking.

Returns

torch.Tensor, torch.Tensor

The output tensor post-processed by the Transformer model and the calculated loss.
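
Example: a single training-style forward pass on random token ids. Real inputs would come from DataLoaderFactory; the masks here are produced by the package Tokenizer, which is assumed to be what the training code uses.

    import torch
    from transformer_implementation.Tokenizer import Tokenizer
    from transformer_implementation.TransformerConfig import TransformerConfig
    from transformer_implementation.Transformer import Transformer

    tok = Tokenizer()
    config = TransformerConfig(vocab_size=tok.vocab_size(), BOS_IDX=tok.BOS_IDX,
                               EOS_IDX=tok.EOS_IDX, PAD_IDX=tok.PAD_IDX, block_size=256)
    model = Transformer(config)

    # Random ids standing in for a tokenized source/target batch.
    src = torch.randint(0, config.vocab_size, (2, 32))
    tgt = torch.randint(0, config.vocab_size, (2, 32))
    src_mask = tok.generate_padding_mask(src)
    tgt_mask = tok.generate_padding_mask(tgt, triu=True)

    logits, loss = model(src, tgt, src_mask=src_mask, tgt_mask=tgt_mask)
    loss.backward()  # gradients for one (toy) optimization step
    print(logits.shape, loss.item())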

generate(src, idx, src_mask=None, max_new_tokens=128, temperature=1.0, top_k=None)

Generates new tokens given a source tensor.

Parameters

src : torch.Tensor

The input tensor for the source sequence.

idx : torch.Tensor

The input tensor with indices in the current context.

src_mask : torch.Tensor, optional

The input tensor for source sequence masking.

max_new_tokens : int, optional

The maximum number of new tokens to be generated.

temperature : float, optional

The softmax temperature for controlling the randomness of predictions.

top_k : int, optional

The number of highest probability vocabulary tokens to keep for next step prediction.

Returns

torch.Tensor, dict

The tensor with new generated token indices and a dictionary with attentions.
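
Example: sampling from an untrained model, so the decoded text is arbitrary; it only illustrates the call pattern. The BOS-only starting context, the batch-dimension handling, and the final decode through the tiktoken encoder are assumptions.

    import torch
    from transformer_implementation.Tokenizer import Tokenizer
    from transformer_implementation.TransformerConfig import TransformerConfig
    from transformer_implementation.Transformer import Transformer

    tok = Tokenizer()
    config = TransformerConfig(vocab_size=tok.vocab_size(), BOS_IDX=tok.BOS_IDX,
                               EOS_IDX=tok.EOS_IDX, PAD_IDX=tok.PAD_IDX, block_size=256)
    model = Transformer(config)
    model.eval()

    # Tokenize and pad the source sentence, then add a batch dimension.
    src = tok.sequence_padding(tok.tokenize_from_str("The cat sits on the mat."),
                               max_size=config.block_size).unsqueeze(0)
    src_mask = tok.generate_padding_mask(src)
    idx = torch.tensor([[tok.BOS_IDX]])  # start decoding from a BOS-only context

    with torch.no_grad():
        out, attentions = model.generate(src, idx, src_mask=src_mask,
                                         max_new_tokens=32, temperature=1.0, top_k=5)
    print(tok.encoder.decode(tok.sequence_cleaner(out[0])))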

load_model(path: str)

Loads the model state from the given file path.

Parameters

path : str

The file path from where the model state is to be loaded.

save_model(path: str)

Saves the model state to the given file path.

Parameters

path : str

The file path where the model state is to be saved.
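
Example: persisting and restoring the model state; the file name is only a placeholder, and the restored instance is assumed to require the same config.

    from transformer_implementation.Tokenizer import Tokenizer
    from transformer_implementation.TransformerConfig import TransformerConfig
    from transformer_implementation.Transformer import Transformer

    tok = Tokenizer()
    config = TransformerConfig(vocab_size=tok.vocab_size(), BOS_IDX=tok.BOS_IDX,
                               EOS_IDX=tok.EOS_IDX, PAD_IDX=tok.PAD_IDX)

    model = Transformer(config)
    model.save_model("transformer_checkpoint.pt")  # placeholder path

    restored = Transformer(config)                 # same config as the saved model
    restored.load_model("transformer_checkpoint.pt")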

transformer_implementation.TransformerConfig module

class transformer_implementation.TransformerConfig.TransformerConfig(vocab_size: int = 0, BOS_IDX: int = -1, EOS_IDX: int = -1, PAD_IDX: int = -1, block_size: int = 256, batch_size: int = 12, train_data_size: int = 5000000, grad_accumulation_steps: int = 40, n_layer: int = 2, n_head: int = 4, n_embd: int = 256, dropout: float = 0.1, bias: bool = False, max_epochs: int = 100, max_iters: int = 2000, eval_iters: int = 20, learning_rate: float = 0.0006, beta1: float = 0.9, beta2: float = 0.95, weight_decay: float = 0.1, eps: float = 1e-09, device_type: str = 'cpu', device: str = 'cpu', dtype: str = 'float16', compile: bool = True, backend: str = 'nccl')

Bases: object

Data class that stores the configuration for a Transformer model.

Parameters:
  • vocab_size (int, optional) – Total size of the tokenizer vocabulary.

  • BOS_IDX (int, optional) – Index of the BOS token, defaults to -1

  • EOS_IDX (int, optional) – Index of the EOS token, defaults to -1

  • PAD_IDX (int, optional) – Index of the PAD token, defaults to -1

  • block_size (int, optional) – Number of tokens in each sequence, defaults to 256

  • batch_size (int, optional) – Number of sequences in each batch, defaults to 12

  • train_data_size (int, optional) – Size of train data, defaults to 5000000

  • grad_accumulation_steps (int, optional) – Number of batches accumulated per optimization step during training, defaults to 40

  • n_layer (int, optional) – Number of transformer encoder and decoder blocks (N), defaults to 2

  • n_head (int, optional) – Number of heads in each attention block, defaults to 4

  • n_embd (int, optional) – Token embedding size, defaults to 256

  • dropout (float, optional) – Dropout rate to use in the Transformer model, defaults to 0.1

  • bias (bool, optional) – Indicates whether to use bias in Linears and LayerNorms, defaults to False

  • max_epochs (int, optional) – Number of training epochs, defaults to 100

  • max_iters (int, optional) – Number of training steps, defaults to 2000

  • eval_iters (int, optional) – Number of evaluation iterations, defaults to 20

  • learning_rate (float, optional) – Learning rate for the model optimization, defaults to 6e-4

  • beta1 (float, optional) – Beta1 for the AdamW optimizer, defaults to 0.9

  • beta2 (float, optional) – Beta2 for the AdamW optimizer, defaults to 0.95

  • weight_decay (float, optional) – Weight decay for the AdamW optimizer, defaults to 1e-1

  • eps (float, optional) – Epsilon for the AdamW optimizer, defaults to 1e-9

  • device (str, optional) – The device to run the model on, defaults to ‘cpu’. ‘cuda’ is used if a GPU is available.

  • dtype (str, optional) – The data type for the model, defaults to ‘bfloat16’ if GPU is available and supports ‘bfloat16’, otherwise ‘float16’

  • compile (bool, optional) – If set to True, use PyTorch 2.0 to compile the model to be faster, defaults to True

  • backend (str, optional) – Backend for DDP settings, defaults to ‘nccl’

  • ddp (bool, optional) – If set to True, this is a DDP (distributed data parallel) run; defaults to whether the environment variable ‘RANK’ is set to a value other than -1

BOS_IDX: int = -1
EOS_IDX: int = -1
PAD_IDX: int = -1
backend: str = 'nccl'
batch_size: int = 12
beta1: float = 0.9
beta2: float = 0.95
bias: bool = False
block_size: int = 256
compile: bool = True
ddp: bool = False
device: str = 'cpu'
device_type: str = 'cpu'
dropout: float = 0.1
dtype: str = 'float16'
eps: float = 1e-09
eval_iters: int = 20
grad_accumulation_steps: int = 40
learning_rate: float = 0.0006
max_epochs: int = 100
max_iters: int = 2000
n_embd: int = 256
n_head: int = 4
n_layer: int = 2
train_data_size: int = 5000000
vocab_size: int = 0
weight_decay: float = 0.1
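
Example: building a config from the tokenizer's special-token indices and overriding only what differs from the dataclass defaults.

    import torch
    from transformer_implementation.Tokenizer import Tokenizer
    from transformer_implementation.TransformerConfig import TransformerConfig

    tok = Tokenizer()
    device = "cuda" if torch.cuda.is_available() else "cpu"

    config = TransformerConfig(
        vocab_size=tok.vocab_size(),
        BOS_IDX=tok.BOS_IDX,
        EOS_IDX=tok.EOS_IDX,
        PAD_IDX=tok.PAD_IDX,
        block_size=128,
        n_layer=4,
        n_head=8,
        n_embd=512,
        device=device,
        device_type="cuda" if device == "cuda" else "cpu",
    )
    print(config.learning_rate, config.block_size)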

Module contents