Adding a New Model in Galvatron

This guide will teach you how to add a new model in Galvatron.

Directory Structure

The directory structure of a model in Galvatron is as follows:

MyModel/
├── meta_configs/                              # Directory for model configuration files
│   ├── __init__.py                            
│   ├── config_utils.py                        # Configuration utility functions
│   ├── MyModel-{MODEL_SIZE}b.json        # Model configuration
│   └── ...                                    # Other model size configurations
│
├── scripts/                                   # Directory for running scripts
│   ├── profile.sh                             # Profiling script
│   ├── train.sh                               # Training script
│   └── search.sh                              # Parallel strategy search script
│
├── __init__.py                                
├── arguments.py                               # Argument definitions
├── dataloader.py                              # Data loading implementation
├── profiler.py                                # Profiling entry point
├── search_dist.py                             # Parallel strategy search entry point
├── train.py                                   # Single-machine training entry point
├── train_dist.py                              # Distributed training entry point
├── train_dist_random.py                       # Random data training entry point
│
├── MyModelModel_checkpoint.py            # Checkpoint save/load
├── MyModelModel_hybrid_parallel.py       # Hybrid parallel implementation
├── MyModelModel_sequential.py            # Sequential model implementation
└── MyModelModel_tensor_parallel.py       # Tensor parallel implementation

Galvatron’s Hybrid Parallel Model Construction Process

Before adding a new model, let’s understand the general process Galvatron uses for constructing hybrid parallel models.

Galvatron builds models without manually defining the entire model structure. Instead, it uses corresponding model structures from transformers or flash attention. You can add the suffix hf or fa to MyModel to distinguish the backend you choose for the model structure. If you’re unsure which backend to choose, we recommend hf as Galvatron provides more comprehensive support for it (the fa model does not support the Ulysses-SP parallel method). The process of constructing a hybrid parallel model is detailed in construct_hybrid_parallel_model_api. The specific process is as follows:

Preprocessing Configuration: Obtain information such as hybrid parallel strategy and model configuration.
Communication Group Generation (Step 0): Generate communication groups required for various parallel strategies.
Build Tensor Parallel Model (Step 1): Use model-specific TP functions (defined in MyModelModel_tensor_parallel.py) to build a tensor parallel model.
Build Sequential Model (Step 2): Reconstruct the model using model-specific sequential functions (defined in MyModelModel_sequential.py).
Wrap Redistribution Modules (Step 3): Add data redistribution functionality to the model to ensure data distribution corresponds to the parallel strategy.
Build Pipeline Parallelism (Step 4): Construct a pipeline parallel model, placing different stages on corresponding devices.
Wrap Data Parallel Modules (Step 5): Wrap data parallel modules based on the FSDP library.
Add Checkpoint Wrapping (Step 6): Add checkpoint functionality to modules based on checkpoint configuration.

Only the API call and the implementations of Step 1 and Step 2 need to be completed using model-specific functions. The other steps are generally implemented by Galvatron.

Core File Descriptions

The core of adding a new model is the model implementation files. These are the main parts that developers need to implement, defining the structure and implementation of the model.

1. Tensor Parallel Implementation

The tensor parallel implementation is realized through the MyModelModel_tensor_parallel.py file, which defines the tensor parallel implementation of the model. Modules in the Sequential model need to be replaced with modules that support tensor parallelism. Galvatron provides different tensor parallel implementations based on different model backends. Specifically, hf uses Megatron-TP, and fa uses the TP provided by flash-attn.

For hf, you need to implement the MyModelLayer_tp class and the MyModelAttention_tp and MyModelMLP_tp classes. For fa, you can directly call the create_mixer_cls and create_mlp_cls methods from flash_attn. You also need to define the construct_tensor_parallel_model function to replace the TP model in the full model. Detailed examples can be found in gpt_hf and gpt_fa.

1.1 Transformer Layer (`hf` Model Format)

The Transformer layer is implemented through the MyModelLayer_tp class:

class MyModelLayer_tp(nn.Module):
    def __init__(self, config, layer_number, tp_group=None, sp_group=None):
        """
        Parameters:
            config: Model configuration object, TransformerConfig
            layer_number: Index number of the current layer
            tp_group: Tensor parallel communication group, CommGroup
            sp_group: Sequence parallel communication group, CommGroup
        """
        super().__init__()
        self.attention = MyModelAttention_tp(config, layer_number, tp_group, sp_group)
        self.mlp = MyModelMLP_tp(config, tp_group)
        self.idx = layer_number
        
    def forward(self, hidden_states, attention_mask=None):
        # ...
        pass

This class is mainly responsible for defining the implementation of a Transformer layer, including the attention mechanism and feedforward neural network. Note that defining self.idx is necessary for distinguishing layers later, and config directly uses the TransformerConfig class used when creating the model in the Transformer library.

1.2 Attention Layer (`hf` Model Format)

The attention layer is implemented through the MyModelAttention_tp class:

class MyModelAttention_tp(nn.Module):
    def __init__(self, config, layer_number, tp_group=None, sp_group=None):
        """
        Parameters:
            config: Model configuration object, TransformerConfig
            layer_number: Index number of the current layer
            tp_group: Tensor parallel communication group, CommGroup
            sp_group: Sequence parallel communication group, CommGroup
        """
        super().__init__()
        # ...
        megatron_config = core_transformer_config_from_args(args)
        self.attention = ParallelAttention(megatron_config, ...)
        # ...
    def forward(self, hidden_states, attention_mask):
        # ...
        pass

ParallelAttention is the attention layer implementation in Megatron-TP modified by Galvatron. In the original Megatron-TP attention layer implementation, three parameters are added: tp_group, sp_group, and use_ulysses, representing the tensor parallel communication group, sequence parallel communication group, and whether to use Ulysses sequence parallelism, respectively. Generally, you can directly refer to the example of gpt_hf for implementation.

1.3 Feedforward Neural Network Layer (`hf` Model Format)

The feedforward neural network layer is implemented through the MyModelMLP_tp class:

class MyModelMLP_tp(nn.Module):
    def __init__(self, config, tp_group=None):
        """
        Parameters:
            config: Model configuration object, TransformerConfig
            tp_group: Tensor parallel communication group, CommGroup
        """
        super().__init__()
        # ...
        megatron_config = core_transformer_config_from_args(get_args())
        self.mlp = ParallelMLP(megatron_config, tp_group = self.tp_group)
        # ...
    def forward(self, hidden_states):
        # ...
        pass

ParallelMLP is the feedforward neural network layer implementation in Megatron-TP modified by Galvatron. In the original Megatron-TP attention layer implementation, the tp_group parameter is added to represent the tensor parallel communication group. Generally, you can directly refer to the example of gpt_hf for implementation.

1.4 Constructing Tensor Parallel Model (`hf` Model Format)

The tensor parallel model is constructed through the construct_tensor_parallel_model function:

def construct_tensor_parallel_model(model, config, tp_groups_enc, sp_groups_enc):
    """
    Convert the model to a tensor parallel version
    
    Parameters:
        model: Original model instance
        config: Model configuration object, TransformerConfig
        tp_groups_enc: List of tensor parallel communication groups for each layer, List[CommGroup]
        sp_groups_enc: List of sequence parallel communication groups for each layer, List[CommGroup]
        
    Returns:
        Converted tensor parallel model
    """
    # ...
    pass

This function mainly performs three tasks: replacing the Transformer Layer in the model with MyModelLayer_tp, replacing the embedding layer in the model with VocabParallelEmbedding, and replacing the lm_head in the model with ColumnParallelLinear. VocabParallelEmbedding and ColumnParallelLinear are the embedding layer and linear layer implementations in Megatron-TP modified by Galvatron, with the tp_group and sp_group parameters added to represent the tensor parallel communication group and sequence parallel communication group. You can also directly refer to the example of gpt_hf for implementation.

Note: The communication groups used in these classes and functions are the CommGroup class customized by Galvatron. If you want to access communication groups generated by torch, please use tp_group.group and sp_group.group.

1.5 Constructing Tensor Parallel Model (`fa` Model Format)

For fa, you only need to implement the construct_tensor_parallel_model function. In this function, you need to replace the attention and mlp modules in the Transformer Layer with the create_mixer_cls and create_mlp_cls methods from flash_attn, replace the embedding layer with the ParallelGPT2Embeddings method from flash_attn, and replace the lm_head with the ColumnParallelLinear method from flash_attn. A detailed example can be found in gpt_fa.

2 Sequential Model Implementation

MyModelModel_sequential.py defines the sequential implementation of the model, including the implementation of the forward and backward propagation of the model.

For traditional Transformer models, you need to implement classes such as MyModelEmbeddings_, MyModelLayers_, MyModelPreNorm_, and MyModelCls_.

In addition, you need to implement the construct_sequential_model function to convert the model to a sequential model and the MyModelModelInfo class to define model-related information.

Specifically, the definition and format of each class are as follows:

2.1 Embedding Layer

The embedding layer is implemented through the MyModelEmbeddings_ class:

class MyModelEmbeddings_(nn.Module):
    def __init__(self, model):
            """
            Parameters:
                model: Model instance
            """
            super().__init__()
            # ...
        def forward(self, tokens, **kwargs):
            # ...
            pass

This class is mainly used to define the embedding layer in the model, including word embedding, position embedding, etc.

Here, the model passed into the __init__ function is the model obtained directly by calling transformers or flash-attn (the model in all APIs needs to be the model obtained by calling transformers or flash-attn).

To enhance the robustness of the code, this function also needs to support some additional features: Megatron sequence parallelism and Ulysses sequence parallelism (not supported by fa). Detailed examples can be found in gpt_hf and gpt_fa.

Note: When using the hf backend, for files with multiple types of Embeddings (e.g., GPT has both Vocab and Position Embeddings), you need to define different Embedding classes to distinguish between these different Embedding parameters. An example of this is shown in gpt_hf.

2.2 Transformer Layer

The Transformer layer is implemented through the MyModelLayers_ class:

class MyModelLayers_(nn.Module):
    def __init__(self, model, layer_idx):
        """
        Parameters:
            model: Model instance
            layer_idx: Index number of the current layer
        """
        super().__init__()
        # ...
    def forward(self, hidden_states, **kwargs):
        # ...
        pass

This class is mainly used to define the Transformer layer in the model, including the self-attention layer, feedforward neural network layer, etc.

For the fa backend, you need to decide whether to add residuals and dropout based on the actual model structure in the code.

2.3 Normalization Layer

The normalization layer is implemented through the MyModelPreNorm_ class:

class MyModelPreNorm_(nn.Module):
    def __init__(self, model):
        """
        Parameters:
            model: Model instance
        """
        super().__init__()
        # ...
    def forward(self, hidden_states, **kwargs):
        # ...
        pass

This class is mainly used to define the normalization layer before the output layer of the model.

2.4 Output Layer

The output layer is implemented through the MyModelCls_ class:

class MyModelCls_(nn.Module):
    def __init__(self, model):
        """
        Parameters:
            model: Model instance
        """
        super().__init__()
        # ...
    def forward(self, hidden_states, **kwargs):
        # ...
        pass

This class is mainly used to define the output layer of the model.

To enhance the robustness of the code, this function also needs to support some additional features: Megatron sequence parallelism, Ulysses sequence parallelism (not supported by fa), and parallel loss computation (not supported by fa). Detailed examples can be found in gpt_hf and gpt_fa.

Note: When using the hf backend, to obtain logits_parallel, you need to directly reference the .weight variable of the original model. This is not allowed in FSDP, so you can place the code for obtaining logits_parallel in a separate function, represented by MyModelLoss_. An example of this is shown in gpt_hf.

When implementing these layers, special attention should be paid to ensuring that the input and output tensors (excluding kwargs) of the forward function of the same type of layer in the Transformer layer have the same format and size. This is to facilitate updating model information to ensure the correctness of pipeline parallelism. For example, in gpt_hf, the input and output tensors of the forward function of the Transformer layer have the same format and size, both being hidden_states.

2.5 Constructing Sequential Model

The sequential model is constructed through the construct_sequential_model function:

def construct_sequential_model(model, config):
    """
    Convert the model to a sequential version
    
    Parameters:
        model: Original model instance
        config: Model configuration object, TransformerConfig
        
    Returns:
        Converted sequential model
    """
    model_ = PipeSequential()
    # ...

This function converts the model into a PipeSequential format, a special sequential container specifically for pipeline parallelism. Developers only need to add the model sequentially to PipeSequential using the add_module method.

Note: If MyModelLoss_ is used, you also need to add a reset_parameters method to ensure the model can be initialized correctly.

2.6 Model Information

Model information is implemented through the MyModelModelInfo class:

class MyModelModelInfo(ModelInfo):
    def __init__(self, config, args):
        super(MyModelModelInfo, self).__init__()
        # ...
        self.set_layernums(layernum_list)
        self.set_shapes(layer_shapes_list)
        self.set_dtypes(layer_dtypes_list)
        self.set_module_types(module_types)

In this class, you need to assign four variables: layernums, shapes, dtypes, and module_types, representing the number of each type of Transformer layer, the shape of input and output tensors for each type of layer, the data type of input and output tensors for each type of layer, and the name of each layer in the model, respectively.

For layernums, you need to assign a list, where each element represents the number of each type of Transformer layer. For example, for GPT, the length of the list is 1 because GPT only has one type of Decoder layer. But for T5, the length of the list is 2 because T5 contains both Encoder and Decoder layers, and these two types of layers have different structures.

For shapes, you need to assign a list, where each element represents the shape of input and output tensors for each type of Transformer layer. Typically, this is a list of size [x, y], where x represents the number of Transformer layer types, and y represents the number of input and output tensors per layer. Each value in the list stores the shape of the input and output tensors.

For dtypes, you need to assign a list, where each element represents the data type of input and output tensors for each type of Transformer layer. Typically, this is a list of size [x, y], where x represents the number of Transformer layer types, and y represents the number of input and output tensors per layer. Each value in the list stores the data type of the input and output tensors.

For module_types, you need to assign a list where each element sequentially represents the name of each layer in the model.

3 Hybrid Parallel Implementation

The hybrid parallel implementation is realized through the MyModelModel_hybrid_parallel.py file. This file acts as a bridge connecting the model with the Galvatron parallel system, mainly responsible for constructing model instances that support hybrid parallelism.

This file primarily implements four functions: get_hybrid_parallel_configs, construct_hybrid_parallel_model, get_mymodel_config, and mymodel_model_hp.

3.1 Getting Hybrid Parallel Configurations

The get_hybrid_parallel_configs function is used to obtain hybrid parallel strategies, with the implementation format as follows:

def get_hybrid_parallel_configs(model_config, training_args):
    hybrid_parallel_configs = get_hybrid_parallel_configs_api(model_config, training_args, MyModelModelInfo)
    return hybrid_parallel_configs

This function requires no modifications. It obtains hybrid parallel strategies by calling Galvatron’s get_hybrid_parallel_configs_api function and returns a dictionary containing hybrid parallel strategy information.

3.2 Constructing Hybrid Parallel Model

The construct_hybrid_parallel_model function is used to construct a hybrid parallel model, with the implementation format as follows:

def construct_hybrid_parallel_model(model, model_config, training_args, hybrid_parallel_configs):
    # ...
    hp_model = construct_hybrid_parallel_model_api(...)
    return hp_model

This function constructs a hybrid parallel model by calling Galvatron’s construct_hybrid_parallel_model_api function and returns a model instance that supports hybrid parallelism. Specifically, the parameters and format required by this API function are as follows:

def construct_hybrid_parallel_model_api(
    model, # Original model instance   
    model_config, # Model configuration object
    training_args, # Training parameters
    hybrid_parallel_configs, # Hybrid parallel configuration
    model_info, # Model information class
    construct_sequential_model, # Function to construct sequential model
    construct_tensor_parallel_model, # Function to construct tensor parallel model
    wrap_block_name=None, # List of module names to wrap with FSDP
    wrap_checkpoint_block_name=None, # List of module names to add checkpoints
    wrap_other_block_name=None, # List of other module names to wrap with FSDP
    tied_wte_attr_names=None, # List of attribute names for weight tying
    layernorm_name = [], # List of layer normalization names
    all_block_name = None, # List of all module names
    load_module_func = None, # Function to load module
):
    # ...
    pass

Parameters can be directly referenced from the implementation of gpt_hf and gpt_fa.

Here, we provide additional explanations for some optional parameters that may cause confusion:

wrap_block_name: A list of Transformer layer module classes that need to be wrapped with FSDP.
wrap_checkpoint_block_name: A list of module names that require checkpoints, usually Transformer layers.
wrap_other_block_name: A list of other module names that need to be wrapped with FSDP, usually layers other than Transformer layers. Note that if multiple Embedding classes are defined, all fine-grained Embedding classes need to be added to the list.
tied_wte_attr_names: A list of attribute names for weight tying. For some models, the parameters of the Vocab Embedding layer and the output layer are the same. For models requiring this feature, developers need to inform Galvatron how to access the Vocab Embedding layer in both the first and last layers of the model. For example, in gpt_hf, the Embedding layer accesses the GPTVocabEmbedding_ class via self.wte, while the output layer accesses it directly via self in the Cls layer. Therefore, tied_wte_attr_names is ['wte', ''].
layernorm_name: A list of names used to identify how Galvatron should access Layernorm in different layers (only the suffix is needed, not the full name). For example, in gpt_hf, Layernorm is accessed via self.LayerNorm in the GPTAttention_tp and GPTMLP_tp classes, and via self.ln in GPTPreNorm_. Therefore, layernorm_name is ['LayerNorm', 'ln'].
all_block_name: A list of all module names, usually the union of wrap_block_name and wrap_other_block_name.
load_module_func: A function to load the module, usually defined as the load_MyModel_module function in the MyModelModel_checkpoint.py file.

Note: Although wrap_block_name, wrap_checkpoint_block_name, wrap_other_block_name, and all_block_name are optional parameters in construct_hybrid_parallel_model_api, to ensure that the model can be initialized correctly, these parameters must be provided.

3.3 Getting Model Configuration

The get_mymodel_config function is used to get the model configuration, with the implementation format as follows:

def get_mymodel_config(args, overwrite_args=True):
    config = config_from_meta(args.model_size)
    config = set_model_config(config, args, overwrite_args)
    if hasattr(args, 'local_rank') and args.local_rank == 0:
        print(config)
    return config

3.4 Building Hybrid Parallel Model

The mymodel_model_hp function is used to build a hybrid parallel model, with the implementation format as follows:

def mymodel_model_hp(config, args):
    hybrid_parallel_configs = get_hybrid_parallel_configs(model_config=config, training_args=args)
    if args.local_rank == 0:
        print("Creating Model...")
    mymodel_model = MyModelModel_huggingface(config)
    model = construct_hybrid_parallel_model(
        model=mymodel_model, 
        model_config=config, 
        training_args=args, 
        hybrid_parallel_configs=hybrid_parallel_configs
    )
    return model

Note that MyModelModel_huggingface is the model obtained directly through transformers, not the Galvatron model. When selecting a model in huggingface, choose a model that includes the output layer.

4 Model Checkpoint Save and Load Implementation (Experimental, support hf)

The model checkpoint save and load implementation is realized through the MyModelModel_checkpoint.py file, which defines the implementation of model checkpoint saving and loading, including checkpoint save and load functions.

This file needs to implement the save_MyModel_module and load_MyModel_module functions to implement the saving and loading of model checkpoints.

Galvatron stores and loads model checkpoints layer by layer, so pay attention to loading and storing them layer by layer during implementation.

llama_hf demonstrates how to implement model checkpoint saving and loading.

Auxiliary File Descriptions

1 Model Configuration Files

Model configuration files define the model’s configuration, including the model’s structure, parameter size, etc.

1.1 Model Configuration Storage File

meta_configs/MyModel-{MODEL_SIZE}b.json: Model configuration file used to store model configuration information.

1.2 Model Configuration Processing File

meta_configs/config_utils.py: This file mainly handles functions related to model configuration, which mainly include three parts:
- Obtaining model configuration information: Obtain model configuration information by calling the config_from_meta function and write it into TransformerConfig.
- Modifying model configuration information: Modify model configuration information based on the passed arguments by calling the set_model_config function, and modify the model configuration information in the arguments through the overwrite_megatron_args and overwrite_model_args functions.
- Obtaining model-related information: Obtain the model name through the model_name function and obtain the configuration information of each layer of the model through the model_layer_configs function.

2 Training Files

Training files mainly define functions related to training, including data loading, model training, etc.

2.1 Main Training File

train_dist.py: This file mainly handles functions related to distributed training.

A complete example is as follows:

def train(args):
    # Initialize the distributed training environment
    local_rank = args.local_rank
    rank = torch.distributed.get_rank()
    torch.cuda.set_device(local_rank)
    device = torch.device("cuda", local_rank)
    world_size = torch.distributed.get_world_size()

    config = get_mymodel_config(args)
    model = mymodel_model_hp(config, args)

    # Create dataset
    if local_rank == 0:
        print("Creating Dataset...")
    
    # Set dataset-related parameters    
    set_megatron_args_for_dataset(args, model, 
                                 model.sp_groups_whole[0] if args.vocab_sp else model.tp_groups_whole[0], 
                                 model.dp_groups_whole[0])
    if local_rank == 0:
        _print_args("arguments", args)

    # Get data iterators
    train_data_iterator, valid_data_iterator, test_data_iterator = get_train_valid_test_data_iterators()
    
    # Create optimizer and learning rate scheduler
    optimizer, opt_param_scheduler = get_optimizer_and_param_scheduler(model, args)

    # Set profiler
    path = os.path.dirname(os.path.abspath(__file__))
    profiler = GalvatronProfiler(args)
    profiler.set_profiler_dist(path, model_layer_configs(config), model_name(config), start_iter=0)
    
    # Record memory usage after model creation
    profiler.profile_memory(0, "After creating model")
    if local_rank == 0:
        print("Start training...")

    # Training loop
    for iter in range(args.iteration, args.train_iters):
        # Get a batch of data
        tokens, kwargs, loss_func = get_batch(train_data_iterator)
        
        # Record start time and memory usage
        profiler.profile_time_start(iter)
        profiler.profile_memory(iter, "Before Forward")

        # Prepare input data
        input_ids = tokens
        batch = [input_ids]
        
        # Forward and backward propagation
        loss = model.forward_backward(batch, iter, profiler, 
                                      loss_func=loss_func,
                                      **kwargs)
        
        # Record memory usage after backward propagation
        profiler.profile_memory(iter, "After Backward")
        
        # Gradient clipping
        total_norm = clip_grad_norm(model, args.clip_grad)
        
        # Optimizer step
        optimizer.step()
        # Learning rate scheduler step
        opt_param_scheduler.step(increment=args.global_batch_size)
        
        # Record memory usage after optimizer step
        profiler.profile_memory(iter, "After optimizer_step")
        
        # Zero gradients
        optimizer.zero_grad()

        # Update profiler statistics
        profiler.post_profile_memory(iter)
        # Get current learning rate
        for param_group in optimizer.param_groups:
            learning_rate = param_group['lr']
        # Record performance metrics for this iteration
        profiler.profile_time_end(iter, loss, learning_rate, total_norm)
        
        # Synchronize all processes
        torch.distributed.barrier()

        # Periodically save model checkpoints
        if args.save != None and (iter + 1) % args.save_interval == 0:
            save_llama_module(args.save, model, optimizer, opt_param_scheduler, iter + 1, args)

if __name__ == '__main__':
    # Initialize Galvatron training environment
    args = initialize_galvatron(model_args, mode='train_dist')
    # Set random seed for reproducibility
    set_seed()
    # Start training
    train(args)

train_dist_random.py: This file mainly handles functions related to distributed training, similar to train_dist.py, but uses random data for training.

2.2 Data Loading Files

dataloader.py: This file mainly handles functions related to data loading, which mainly include two parts:

Random Data Loading: Create a dataset that generates random tokens and create a collate_fn function to convert random tokens into model inputs. Below is an example of random data loading:

def random_get_ltor_masks_and_position_ids(data):
"""Build masks and position id for left to right model."""
    micro_batch_size, seq_length = data.size()
    att_mask_batch = 1
    attention_mask = torch.tril(torch.ones(
        (att_mask_batch, seq_length, seq_length), device=data.device)).view(
            att_mask_batch, 1, seq_length, seq_length)
    attention_mask = (attention_mask < 0.5)

    return attention_mask

def random_collate_fn(batch):
    # Stack data in the batch and return data in the corresponding format
    tokens_ = torch.stack(batch, dim=0)
    labels = tokens_[:, 1:].contiguous()
    tokens = tokens_[:, :-1].contiguous()
    args = get_args()
    if not args.use_flash_attn:
        attention_mask = random_get_ltor_masks_and_position_ids(tokens)
    else:
        attention_mask = None
    return tokens, {"attention_mask":attention_mask, "labels" : labels}, None

class DataLoaderForMyModel(Dataset):
    def __init__(self, args, device, dataset_size = 2560 * 16):
        self.vocab_size = args.vocab_size
        self.sentence_length = args.seq_length
        self.dataset_size = dataset_size
        # Randomly generate the actual length of each sample (between 1 and the maximum length)
        self.data_length = np.random.randint(1,self.sentence_length+1,(self.dataset_size,))
        self.device = device

        # Generate random input data
        self.input_ids = []
        for i in range(self.dataset_size):
            sentence = np.random.randint(0,self.vocab_size,(self.sentence_length,))
            sentence[self.data_length[i]:] = 0
            mask = np.ones((self.sentence_length,))
            mask[self.data_length[i]:] = 0

            padding_sentence = np.zeros(self.sentence_length + 1, dtype=sentence.dtype)
            padding_sentence[:self.sentence_length] = sentence
            self.input_ids.append(padding_sentence)

        self.input_ids = np.array(self.input_ids)

    def __len__(self):
        return self.dataset_size

    def __getitem__(self, idx):
        if idx >= self.dataset_size:
            raise IndexError
        input_ids = torch.LongTensor(self.input_ids[idx]).to(self.device)
        return input_ids

The specific trainloader is created by the following code:

trainloader = distributed_dataloader(
    dataset=DataLoaderForGPT(args, device),
    global_bsz=args.global_train_batch_size,
    shuffle=True,
    args=args,
    group = model.dp_groups_whole[0].group,
    collate_fn = random_collate_fn
)

The distributed_dataloader function is a distributed data loader provided by Galvatron, used to create distributed data loaders.

Real Data Loading: Create a real data loader and design a loss calculation function.

The implementation of real data loading is based on the Megatron dataset and mainly includes functions such as train_valid_test_datasets_provider, get_train_valid_test_data_iterators, get_batch, and loss_func. A concrete implementation example can be found in gpt_hf.

The main point to note is that the get_batch function returns a tuple with three elements:

Input Data: Usually a sequence of tokens, of type torch.Tensor.
Other Input Data: Usually a dictionary type, containing position_ids, attention_mask, labels, etc.
Loss Calculation Function: The loss can be calculated directly by calling the loss_func(output_tensor) function.

Note: The input data here should be consistent with the input data format of the Embedding layer in the MyModelModel_sequential.py file. Other data is passed between model layers as **kwargs.

2.3 Profiling File

profiler.py: This file mainly handles functions related to profiling, with content as follows:

if __name__ == '__main__':
    # Initialize Galvatron profiling environment
    args = initialize_galvatron(model_args, mode='profile')
    
    # Load model configuration
    config = get_mymodel_config(args, overwrite_args=False)
    
    # Create profiler instance
    profiler = GalvatronProfiler(args)
    
    # Get the directory path of the current file
    path = os.path.dirname(os.path.abspath(__file__))
    
    # Set profiler launcher
    profiler.set_profiler_launcher(path, layernum_arg_names(), model_name(config))
    
    # Launch profiling scripts
    profiler.launch_profiling_scripts()
    
    # Process collected profiling data
    profiler.process_profiled_data()

2.4 Strategy Search File

search_dist.py: This file is primarily responsible for functions related to strategy search. Its contents are as follows:

if __name__ == '__main__':
    args = initialize_galvatron(model_args, mode='search')
    config = get_mymodel_config(args, overwrite_args=True)
    path = os.path.dirname(os.path.abspath(__file__))
    print(args)
    print(config)
    # Create an instance of the strategy search engine
    search_engine = GalvatronSearchEngine(args)
    
    # Set basic information for the search engine
    search_engine.set_search_engine_info(path, model_layer_configs(config), model_name(config))
    
    # Initialize the search engine
    search_engine.initialize_search_engine()

    # Perform strategy search
    search_engine.parallelism_optimization()

3 Script Files

The scripts folder mainly contains script files used to implement model training, performance analysis, strategy search, and other functions.

It mainly includes five different scripts:

profile_computation.sh: Used for performance analysis, calculating the computational performance of the model under different configurations.
profile_memory.sh: Used for performance analysis, calculating the memory usage of the model under different configurations.
search_dist.sh: Used for strategy search, finding the optimal strategy for the model under different configurations.
train_dist.sh: Used for model training.
train_dist_random.sh: Used for model training with random data.