transformer_implementation package¶
Submodules¶
transformer_implementation.DataLoaderFactory module¶
- class transformer_implementation.DataLoaderFactory.DataLoaderFactory(block_size, batch_size, tokenizer, device, train_dataset_size=5000000)¶
Bases:
object
Factory class to create dataloaders for training, validation, and testing datasets.
It initializes the datasets and dataloaders for the given block size, batch size, tokenizer, and device. The dataloaders can be accessed directly through the corresponding attributes.
Attributes¶
- train_data : TranslationDataset
The training dataset.
- val_data : TranslationDataset
The validation dataset.
- test_data : TranslationDataset
The testing dataset.
- dataloader_train : torch.utils.data.DataLoader
Dataloader for the training dataset.
- dataloader_val : torch.utils.data.DataLoader
Dataloader for the validation dataset.
- dataloader_test : torch.utils.data.DataLoader
Dataloader for the testing dataset.
Methods¶
- __len__() -> int:
Prints and returns the number of items in each dataset and total.
- get_batch(split: str) -> dict:
Returns a generator that iterates over the batches in the specified split.
- get_batch(split)¶
Returns a generator that iterates over the batches in the specified split.
Parameters¶
- split : str
The split to use. Must be one of ‘train’, ‘val’, or ‘test’.
Yields¶
- dict
The next batch in the specified split. Each batch is a dictionary that contains tensors moved to the specified device and the ‘translation’ field.
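A minimal usage sketch (assumptions: the Tokenizer class from this package supplies the tokenizer, and the factory loads or downloads its datasets internally as described above; only the documented signature is relied upon):
```python
import torch

from transformer_implementation.Tokenizer import Tokenizer
from transformer_implementation.DataLoaderFactory import DataLoaderFactory

device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = Tokenizer()

# Build train/val/test datasets and dataloaders in one shot.
factory = DataLoaderFactory(
    block_size=256,
    batch_size=12,
    tokenizer=tokenizer,
    device=device,
)

len(factory)  # prints and returns per-split and total item counts

# get_batch yields dicts whose tensors are already on `device`,
# alongside the original 'translation' field.
for batch in factory.get_batch("train"):
    print(batch.keys())
    break
```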
- class transformer_implementation.DataLoaderFactory.TranslationDataset(dataset, tokenizer, block_size)¶
Bases:
Dataset
Dataset for English to French translation tasks.
The dataset includes a ‘translation’ field, which is a dict containing the source text in English (‘en’) and the target text in French (‘fr’).
Attributes¶
- dataset : object
The dataset object containing translations.
- tokenizer : object
The tokenizer object used for encoding the translations.
- block_size : int
The maximum length of the tokenized sequences.
Methods¶
- __getitem__(index: int) -> dict:
Returns the tokenized input and target sequences, their corresponding masks, and the original translation for a given index.
- __len__() -> int:
Returns the number of items in the dataset.
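A sketch of standalone use; the two-record list below is a hypothetical stand-in for the real translation dataset and only assumes that items can be indexed and expose the documented ‘translation’ dict:
```python
from transformer_implementation.Tokenizer import Tokenizer
from transformer_implementation.DataLoaderFactory import TranslationDataset

tokenizer = Tokenizer()

# Hypothetical in-memory stand-in with the documented 'translation' layout.
records = [
    {"translation": {"en": "Hello, world!", "fr": "Bonjour, le monde !"}},
    {"translation": {"en": "How are you?", "fr": "Comment allez-vous ?"}},
]

dataset = TranslationDataset(records, tokenizer, block_size=256)

print(len(dataset))   # number of items
item = dataset[0]     # tokenized input/target, masks, and the raw translation
print(item.keys())    # exact key names are defined by the implementation
```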
transformer_implementation.Decoder module¶
- class transformer_implementation.Decoder.Decoder(config)¶
Bases:
Module
Implements a decoder module in PyTorch.
This class is a child of the PyTorch nn.Module class. The decoder uses embeddings for both the vocabulary and the positions. The core of the decoder is a sequence of DecoderBlock modules. The output of the decoder is then processed by a linear layer to produce the final output.
Attributes¶
- config : object
A configuration object with the necessary attributes for the decoder.
- decoder : torch.nn.ModuleDict
A dictionary containing the decoder components, including the embeddings, dropout, DecoderBlock sequence, and LayerNorm.
- lm_head : torch.nn.Linear
A linear layer for producing the final output of the decoder.
Methods¶
- get_num_params(non_embedding: bool = True) -> int:
Returns the number of parameters in the decoder.
- _init_weights(module):
Initializes the weights of the specified module.
- forward(idx, enc_output=None, src_mask=None, tgt_mask=None):
Computes the forward pass of the decoder.
Parameters¶
- config : object
A configuration object with necessary attributes, including vocab_size, n_embd, block_size, dropout, n_layer, and bias.
- forward(idx, enc_output=None, src_mask=None, tgt_mask=None)¶
Computes the forward pass of the decoder.
Parameters¶
- idx : torch.Tensor
The input tensor with token indices.
- enc_output : torch.Tensor, optional
The output of the encoder. Default is None.
- src_mask : torch.Tensor, optional
The mask for the source sequence. Default is None.
- tgt_mask : torch.Tensor, optional
The mask for the target sequence. Default is None.
Returns¶
- Tuple[torch.Tensor, List[torch.Tensor], List[torch.Tensor]]
A tuple containing the output tensor, a list of attention scores from the decoder blocks, and a list of cross-attention scores.
- get_num_params(non_embedding: bool = True) → int¶
Returns the number of parameters in the model. For non-embedding count (default), the position embeddings get subtracted. The token embeddings would too, except due to the parameter sharing these params are actually used as weights in the final layer, so we include them.
Parameters¶
- non_embedding : bool, optional
If True, does not count the parameters of the position embedding layer. Default is True.
Returns¶
- int
The number of parameters in the decoder.
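A shape-level sketch of running the decoder on its own; TransformerConfig from this package is used as the config object, and the encoder output is replaced by a random tensor of matching width purely for illustration:
```python
import torch

from transformer_implementation.TransformerConfig import TransformerConfig
from transformer_implementation.Decoder import Decoder

config = TransformerConfig(vocab_size=1000, block_size=64,
                           n_layer=2, n_head=4, n_embd=256)
decoder = Decoder(config)

idx = torch.randint(0, config.vocab_size, (2, 64))   # (batch, seq_len) token indices
enc_output = torch.randn(2, 64, config.n_embd)       # stand-in for real encoder output

logits, self_attn, cross_attn = decoder(idx, enc_output=enc_output)
print(logits.shape)              # logits over the vocabulary for each position
print(decoder.get_num_params())  # parameter count, position embeddings excluded
```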
transformer_implementation.Encoder module¶
- class transformer_implementation.Encoder.Encoder(config)¶
Bases:
Module
The Encoder class implements a multi-layer transformer encoder.
This class inherits from the PyTorch nn.Module class and includes token and position embeddings, dropout, multiple encoder blocks, and layer normalization.
Attributes¶
- config : object
A configuration object with the following attributes:
- vocab_size (int): The size of the vocabulary.
- block_size (int): The maximum sequence length.
- n_embd (int): The dimension of the embeddings.
- dropout (float): The dropout rate.
- n_layer (int): The number of transformer layers.
- bias (bool): If True, the linear layers will include a bias term.
- encoder : torch.nn.ModuleDict
A dictionary-like module of several layers:
- wte (torch.nn.Embedding): The token embeddings layer.
- wpe (torch.nn.Embedding): The position embeddings layer.
- drop (torch.nn.Dropout): The dropout layer.
- h (torch.nn.ModuleList): The list of transformer layers.
- ln_f (LayerNorm): The final layer normalization.
Methods¶
- get_num_params(non_embedding: bool = True) -> int:
Returns the total number of parameters.
- _init_weights(module):
Initializes the weights of the specified module.
- forward(idx, mask=None):
Computes the forward pass of the encoder.
Parameters¶
- config : object
A configuration object with the following attributes:
- vocab_size (int): The size of the vocabulary.
- block_size (int): The maximum sequence length.
- n_embd (int): The dimension of the embeddings.
- dropout (float): The dropout rate.
- n_layer (int): The number of transformer layers.
- bias (bool): If True, the linear layers will include a bias term.
- forward(idx, mask=None)¶
Implements the forward pass of the encoder.
Parameters¶
- idx : torch.Tensor
The input tensor with indices of tokens in the sequence.
- mask : torch.Tensor, optional
The mask tensor. If provided, it should have the same size as idx.
Returns¶
- Tuple[torch.Tensor, List[torch.Tensor]]
The output tensor after layer normalization and the list of attention matrices from each transformer layer.
- get_num_params(non_embedding: bool = True)¶
Returns the number of parameters in the model. For non-embedding count (default), the position embeddings get subtracted. The token embeddings would too, except due to the parameter sharing these params are actually used as weights in the final layer, so we include them.
Parameters¶
- non_embedding : bool, optional
If True, subtracts the number of embedding parameters from the total.
Returns¶
- int
The total number of parameters.
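A matching sketch for the encoder, again using TransformerConfig as the config object and random token indices as placeholder input:
```python
import torch

from transformer_implementation.TransformerConfig import TransformerConfig
from transformer_implementation.Encoder import Encoder

config = TransformerConfig(vocab_size=1000, block_size=64,
                           n_layer=2, n_head=4, n_embd=256)
encoder = Encoder(config)

idx = torch.randint(0, config.vocab_size, (2, 64))   # (batch, seq_len) token indices
hidden, attentions = encoder(idx)                    # mask is optional

print(hidden.shape)      # final hidden states after layer normalization
print(len(attentions))   # one attention matrix per transformer layer
```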
transformer_implementation.Tokenizer module¶
- class transformer_implementation.Tokenizer.Tokenizer¶
Bases:
object
Implements a Tokenizer based on the tiktoken library for encoding and decoding sequences.
The tokenizer has special tokens for the beginning of sentence (BOS), end of sentence (EOS), and padding (PAD). Sequences can be padded to a fixed length for batch processing.
Attributes¶
- BOS_IDX : int
The index of the Beginning of Sentence (BOS) token.
- EOS_IDX : int
The index of the End of Sentence (EOS) token.
- PAD_IDX : int
The index of the Padding (PAD) token.
- encoder : tiktoken.Encoding
The encoding object used for converting sequences to and from tokens.
Methods¶
- vocab_size() -> int:
Returns the size of the vocabulary.
- sequence_padding(sequence, max_size: int, device: str) -> torch.Tensor:
Returns the padded sequence as a tensor.
- sequence_cleaner(sequence) -> list:
Returns the cleaned sequence without any special tokens.
- generate_padding_mask(seq, triu: bool, device: str) -> torch.Tensor:
Returns a mask for the padding tokens in the sequence.
- tokenize(sequence, device: str) -> list:
Returns the tokenized sequence.
- tokenize_from_str(sequence, device: str) -> list:
Returns the tokenized sequence for a given string.
- generate_padding_mask(seq, triu=False, device='cpu')¶
Generates a mask for the padding tokens in the sequence.
Parameters¶
- seq : torch.Tensor
The sequence for which the mask will be generated.
- triu : bool, optional
If True, the mask will be an upper triangular matrix. Defaults to False.
- device : str, optional
The device where the tensor will be allocated. Defaults to “cpu”.
Returns¶
- torch.Tensor
The mask for the sequence.
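A short sketch of the two mask variants; the toy token ids are arbitrary, and the exact shape of the returned mask is left to the implementation:
```python
from transformer_implementation.Tokenizer import Tokenizer

tokenizer = Tokenizer()

# Pad a toy token sequence so it contains PAD positions to mask out.
seq = tokenizer.sequence_padding([11, 22, 33], max_size=8)

# Plain padding mask, e.g. for encoder self-attention.
src_mask = tokenizer.generate_padding_mask(seq, triu=False, device="cpu")

# Upper-triangular (causal) variant, e.g. for decoder self-attention.
tgt_mask = tokenizer.generate_padding_mask(seq, triu=True, device="cpu")
```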
- sequence_cleaner(sequence)¶
Removes the special tokens from the sequence.
Parameters¶
- sequence : Union[torch.Tensor, list]
The sequence to be cleaned.
Returns¶
- list
The cleaned sequence.
- sequence_padding(sequence, max_size: int = 512, device: str = 'cpu') → Tensor¶
Pads the sequence to the max_size with the PAD token.
Parameters¶
- sequence : Union[torch.Tensor, list]
The sequence to be padded.
- max_size : int, optional
The maximum size of the sequence after padding. Defaults to 512.
- device : str, optional
The device where the tensor will be allocated. Defaults to “cpu”.
Returns¶
- torch.Tensor
The padded sequence.
- tokenize(sequence, device='cpu') → list¶
Tokenizes the sequence using the encoder.
Parameters¶
- sequence : Union[torch.Tensor, list]
The sequence to be tokenized.
- device : str, optional
The device where the tensor will be allocated. Defaults to “cpu”.
Returns¶
- list
The tokenized sequence.
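A round-trip sketch combining the methods above; decoding back to text goes through the tiktoken encoder attribute and assumes the cleaned sequence holds plain token ids:
```python
from transformer_implementation.Tokenizer import Tokenizer

tokenizer = Tokenizer()
print(tokenizer.vocab_size())                                  # size of the tokenizer vocabulary

tokens = tokenizer.tokenize_from_str("Hello, world!", device="cpu")   # string -> token ids
padded = tokenizer.sequence_padding(tokens, max_size=16)              # pad to a fixed length

cleaned = tokenizer.sequence_cleaner(padded)                   # strip BOS/EOS/PAD again
print(tokenizer.encoder.decode([int(t) for t in cleaned]))     # back to text
```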
transformer_implementation.Transformer module¶
- class transformer_implementation.Transformer.Transformer(config)¶
Bases:
Module
A PyTorch implementation of a Transformer model.
The Transformer model consists of an Encoder and a Decoder. It supports the forward pass, token generation, optimizer configuration, and saving/loading the model state.
Attributes¶
- config : object
A configuration object with the necessary attributes for the Transformer model.
- encoder : Encoder
The encoder part of the Transformer model.
- decoder : Decoder
The decoder part of the Transformer model.
Methods¶
- forward(src, tgt, src_mask=None, tgt_mask=None):
Implements the forward pass of the Transformer model and returns the output and loss.
- generate(src, idx, src_mask=None, max_new_tokens=128, temperature=1.0, top_k=None):
Generates new tokens given a source tensor.
- configure_optimizers(weight_decay, learning_rate, betas, device_type, eps):
Configures the AdamW optimizer for the Transformer model.
- save_model(path: str):
Saves the model state to the given file path.
- load_model(path: str):
Loads the model state from the given file path.
Parameters¶
- config : object
A configuration object with the necessary parameters for the Transformer model. It includes:
- vocab_size (int): The size of the vocabulary.
- block_size (int): The block size of the Transformer.
- PAD_IDX (int): The index representing padding in a token sequence.
- configure_optimizers(weight_decay, learning_rate, betas, device_type, eps)¶
Configures the AdamW optimizer for the Transformer model.
Parameters¶
- weight_decay : float
The L2 penalty (regularization) coefficient.
- learning_rate : float
The learning rate for the AdamW optimizer.
- betas : tuple(float, float)
Coefficients used for computing running averages of the gradient and its square.
- device_type : str
The device type for the optimizer, either “cpu” or “cuda”.
- eps : float
A term added to the denominator to improve numerical stability.
Returns¶
- torch.optim.AdamW
The AdamW optimizer configured for the Transformer model.
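A sketch of wiring the optimizer to the config defaults; TransformerConfig supplies the hyperparameters, and the vocabulary size and PAD index here are illustrative only:
```python
from transformer_implementation.TransformerConfig import TransformerConfig
from transformer_implementation.Transformer import Transformer

config = TransformerConfig(vocab_size=1000, PAD_IDX=0, block_size=64,
                           n_layer=2, n_head=4, n_embd=256)
model = Transformer(config)

optimizer = model.configure_optimizers(
    weight_decay=config.weight_decay,
    learning_rate=config.learning_rate,
    betas=(config.beta1, config.beta2),
    device_type=config.device_type,
    eps=config.eps,
)
print(type(optimizer))   # expected: torch.optim.AdamW
```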
- forward(src, tgt, src_mask=None, tgt_mask=None)¶
Implements the forward pass of the Transformer model.
Parameters¶
- src : torch.Tensor
The input tensor for the source sequence.
- tgt : torch.Tensor
The input tensor for the target sequence.
- src_mask : torch.Tensor, optional
The input tensor for source sequence masking.
- tgt_mask : torch.Tensor, optional
The input tensor for target sequence masking.
Returns¶
- torch.Tensor, torch.Tensor
The output tensor post-processed by the Transformer model and the calculated loss.
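A shape-level, training-style sketch; the random token batches stand in for real data, and building the masks from the Tokenizer assumes it accepts batched sequences:
```python
import torch

from transformer_implementation.Tokenizer import Tokenizer
from transformer_implementation.TransformerConfig import TransformerConfig
from transformer_implementation.Transformer import Transformer

tokenizer = Tokenizer()
config = TransformerConfig(
    vocab_size=tokenizer.vocab_size(),
    BOS_IDX=tokenizer.BOS_IDX,
    EOS_IDX=tokenizer.EOS_IDX,
    PAD_IDX=tokenizer.PAD_IDX,
    block_size=64, n_layer=2, n_head=4, n_embd=256,
)
model = Transformer(config)

src = torch.randint(0, config.vocab_size, (2, config.block_size))
tgt = torch.randint(0, config.vocab_size, (2, config.block_size))
src_mask = tokenizer.generate_padding_mask(src)
tgt_mask = tokenizer.generate_padding_mask(tgt, triu=True)

output, loss = model(src, tgt, src_mask=src_mask, tgt_mask=tgt_mask)
print(output.shape, loss)   # model output and the documented loss term
```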
- generate(src, idx, src_mask=None, max_new_tokens=128, temperature=1.0, top_k=None)¶
Generates new tokens given a source tensor.
Parameters¶
- src : torch.Tensor
The input tensor for the source sequence.
- idx : torch.Tensor
The input tensor with indices in the current context.
- src_mask : torch.Tensor, optional
The input tensor for source sequence masking.
- max_new_tokens : int, optional
The maximum number of new tokens to be generated.
- temperature : float, optional
The softmax temperature for controlling the randomness of predictions.
- top_k : int, optional
The number of highest probability vocabulary tokens to keep for next step prediction.
Returns¶
- torch.Tensor, dict
The tensor with new generated token indices and a dictionary with attentions.
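A decoding sketch; seeding the target context with the BOS token and decoding the result through the tokenizer are assumptions about intended usage, not requirements of the method itself:
```python
import torch

from transformer_implementation.Tokenizer import Tokenizer
from transformer_implementation.TransformerConfig import TransformerConfig
from transformer_implementation.Transformer import Transformer

tokenizer = Tokenizer()
config = TransformerConfig(
    vocab_size=tokenizer.vocab_size(),
    BOS_IDX=tokenizer.BOS_IDX,
    EOS_IDX=tokenizer.EOS_IDX,
    PAD_IDX=tokenizer.PAD_IDX,
    block_size=64, n_layer=2, n_head=4, n_embd=256,
)
model = Transformer(config)
model.eval()

tokens = tokenizer.tokenize_from_str("Hello, world!", device="cpu")
src = tokenizer.sequence_padding(tokens, max_size=config.block_size).unsqueeze(0)
idx = torch.tensor([[tokenizer.BOS_IDX]])        # assumed starting context

with torch.no_grad():
    out_idx, attentions = model.generate(src, idx, max_new_tokens=32,
                                         temperature=1.0, top_k=5)

cleaned = tokenizer.sequence_cleaner(out_idx[0])
print(tokenizer.encoder.decode([int(t) for t in cleaned]))
```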
transformer_implementation.TransformerConfig module¶
- class transformer_implementation.TransformerConfig.TransformerConfig(vocab_size: int = 0, BOS_IDX: int = -1, EOS_IDX: int = -1, PAD_IDX: int = -1, block_size: int = 256, batch_size: int = 12, train_data_size: int = 5000000, grad_accumulation_steps: int = 40, n_layer: int = 2, n_head: int = 4, n_embd: int = 256, dropout: float = 0.1, bias: bool = False, max_epochs: int = 100, max_iters: int = 2000, eval_iters: int = 20, learning_rate: float = 0.0006, beta1: float = 0.9, beta2: float = 0.95, weight_decay: float = 0.1, eps: float = 1e-09, device_type: str = 'cpu', device: str = 'cpu', dtype: str = 'float16', compile: bool = True, backend: str = 'nccl')¶
Bases:
object
Data class that stores the configuration for a Transformer model.
- Parameters:
vocab_size (int, optional) – Total size of the tokenizer vocabulary.
BOS_IDX (int, optional) – Index of the BOS token, defaults to -1
EOS_IDX (int, optional) – Index of the EOS token, defaults to -1
PAD_IDX (int, optional) – Index of the PAD token, defaults to -1
block_size (int, optional) – Number of tokens in each sequence, defaults to 256
batch_size (int, optional) – Number of sequences in each batch, defaults to 12
train_data_size (int, optional) – Size of train data, defaults to 5000000
grad_accumulation_steps (int, optional) – Number of batches over which gradients are accumulated during training, defaults to 40
n_layer (int, optional) – Number of transformer encoder and decoder blocks (N), defaults to 2
n_head (int, optional) – Number of heads in each attention block, defaults to 4
n_embd (int, optional) – Token embedding size, defaults to 256
dropout (float, optional) – Dropout rate to use in the Transformer model, defaults to 0.1
bias (bool, optional) – Indicates whether to use bias in Linears and LayerNorms, defaults to False
max_epochs (int, optional) – Number of training epochs, defaults to 100
max_iters (int, optional) – Number of training steps, defaults to 2000
eval_iters (int, optional) – Number of evaluation iterations, defaults to 20
learning_rate (float, optional) – Learning rate for the model optimization, defaults to 6e-4
beta1 (float, optional) – Beta1 for the AdamW optimizer, defaults to 0.9
beta2 (float, optional) – Beta2 for the AdamW optimizer, defaults to 0.95
weight_decay (float, optional) – Weight decay for the AdamW optimizer, defaults to 1e-1
eps (float, optional) – Epsilon for the AdamW optimizer, defaults to 1e-9
device (str, optional) – The device to run the model on, defaults to ‘cpu’. ‘cuda’ is used if a GPU is available.
dtype (str, optional) – The data type for the model, defaults to ‘bfloat16’ if GPU is available and supports ‘bfloat16’, otherwise ‘float16’
compile (bool, optional) – If set to True, use PyTorch 2.0 to compile the model to be faster, defaults to True
backend (str, optional) – Backend for DDP settings, defaults to ‘nccl’
ddp (bool, optional) – If set to True, this is a DDP run, defaults to the evaluation of the environment variable ‘RANK’ != -1
- BOS_IDX: int = -1¶
- EOS_IDX: int = -1¶
- PAD_IDX: int = -1¶
- backend: str = 'nccl'¶
- batch_size: int = 12¶
- beta1: float = 0.9¶
- beta2: float = 0.95¶
- bias: bool = False¶
- block_size: int = 256¶
- compile: bool = True¶
- ddp = False¶
- device: str = 'cpu'¶
- device_type: str = 'cpu'¶
- dropout: float = 0.1¶
- dtype: str = 'float16'¶
- eps: float = 1e-09¶
- eval_iters: int = 20¶
- grad_accumulation_steps: int = 40¶
- learning_rate: float = 0.0006¶
- max_epochs: int = 100¶
- max_iters: int = 2000¶
- n_embd: int = 256¶
- n_head: int = 4¶
- n_layer: int = 2¶
- train_data_size: int = 5000000¶
- vocab_size: int = 0¶
- weight_decay: float = 0.1¶
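A configuration sketch using the defaults above, with the vocabulary size and special-token indices pulled from the Tokenizer; which fields a given training script actually needs is up to that script:
```python
import torch

from transformer_implementation.Tokenizer import Tokenizer
from transformer_implementation.TransformerConfig import TransformerConfig

tokenizer = Tokenizer()
use_cuda = torch.cuda.is_available()

config = TransformerConfig(
    vocab_size=tokenizer.vocab_size(),
    BOS_IDX=tokenizer.BOS_IDX,
    EOS_IDX=tokenizer.EOS_IDX,
    PAD_IDX=tokenizer.PAD_IDX,
    block_size=256,
    batch_size=12,
    n_layer=2,
    n_head=4,
    n_embd=256,
    dropout=0.1,
    device="cuda" if use_cuda else "cpu",
    device_type="cuda" if use_cuda else "cpu",
    compile=False,   # skip torch.compile for quick experiments
)
print(config)
```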