LoRA for Neuron

A LoRA (Low-Rank Adaptation) implementation optimized for distributed training on AWS Trainium devices. This module provides parameter-efficient fine-tuning with support for tensor parallelism and sequence parallelism.
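
A minimal usage sketch, assuming the standard PEFT configuration API. The checkpoint name and target module names below are placeholders, and on Trainium the base model would typically be one of the custom modeling architectures supported by Optimum Neuron:

from peft import LoraConfig
from transformers import AutoModelForCausalLM

from optimum.neuron.peft import get_peft_model

# Placeholder base model; swap in an Optimum Neuron-supported architecture for real training.
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-1B")

lora_config = LoraConfig(
    r=16,                                 # LoRA rank
    lora_alpha=32,                        # scaling numerator (scaling = lora_alpha / r)
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # placeholder target module names
    task_type="CAUSAL_LM",
)

peft_model = get_peft_model(model, lora_config)
peft_model.print_trainable_parameters()   # report trainable vs. total parameters (standard PEFT helper)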

PEFT Model Classes

NeuronPeftModel

class optimum.neuron.peft.NeuronPeftModel

( model: PreTrainedModel peft_config: PeftConfig adapter_name: str = 'default' autocast_adapter_dtype: bool = True **kwargs: Any )

NeuronPeftModelForCausalLM

class optimum.neuron.peft.NeuronPeftModelForCausalLM

( model: PreTrainedModel peft_config: PeftConfig adapter_name: str = 'default' autocast_adapter_dtype: bool = True **kwargs: Any )

LoRA Layer Implementations

Base LoRA Layer

class optimum.neuron.peft.tuners.lora.layer.NeuronLoraLayer

( base_layer: Module ephemeral_gpu_offload: bool = False **kwargs )

Parallel Linear LoRA

class optimum.neuron.peft.tuners.lora.layer.ParallelLinear

( base_layer adapter_name: str r: int = 0 lora_alpha: int = 1 lora_dropout: float = 0.0 fan_in_fan_out: bool = False is_target_conv_1d_layer: bool = False init_lora_weights: bool | str = True use_rslora: bool = False use_dora: bool = False lora_bias: bool = False **kwargs )
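
For reference, these arguments parameterize the generic LoRA update applied on top of the parallel base layer: the adapter adds scaling * lora_B(lora_A(dropout(x))) to the base output, where lora_A has shape (r, in_features), lora_B has shape (out_features, r), and scaling = lora_alpha / r (or lora_alpha / sqrt(r) when use_rslora=True). In the parallel variants, the adapter weights are laid out to match the base layer's partitioning, which is what makes the shard-by-shard merge described below possible.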

merge

( safe_merge: bool = False adapter_names: list[str] | None = None )

Parameters

  • safe_merge — If True, perform merge in a copy and check for NaNs before merging.
  • adapter_names — List of adapter names to merge. If None, all active adapters will be merged.

Merge the active adapter weights into the base weights.

This works with distributed parallel linear layers (RowParallelLinear, ColumnParallelLinear). The merge happens on the sharded weights: each rank merges its own shard.
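
A conceptual sketch in plain PyTorch (not the library's internal code) of why per-shard merging gives the same result as merging the full weight for a column-parallel layer, where the output dimension of both the base weight and the LoRA B matrix is sharded while A is replicated:

import torch

out_features, in_features, r, tp_size = 8, 4, 2, 2
lora_alpha = 4
scaling = lora_alpha / r

W = torch.randn(out_features, in_features)  # full base weight
A = torch.randn(r, in_features)             # LoRA A, replicated on every rank
B = torch.randn(out_features, r)            # LoRA B, sharded along the output dim like W

# Merge on the full weight vs. merge shard-by-shard.
merged_full = W + scaling * (B @ A)

merged_shards = []
rows_per_rank = out_features // tp_size
for rank in range(tp_size):
    rows = slice(rank * rows_per_rank, (rank + 1) * rows_per_rank)
    merged_shards.append(W[rows] + scaling * (B[rows] @ A))

assert torch.allclose(merged_full, torch.cat(merged_shards, dim=0))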

unmerge

( )

Unmerge all merged adapter layers from the base weights.

This works with distributed parallel linear layers (RowParallelLinear, ColumnParallelLinear). The unmerge happens on the sharded weights: each rank unmerges its own shard.
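
A hypothetical merge/unmerge round trip, where layer stands for any LoRA layer from this module:

layer.merge(safe_merge=True)   # merge in a copy and check for NaNs before committing
# ... run forward passes with the adapter folded into the base weights ...
layer.unmerge()                # restore the original base weights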

GQA QKV Column Parallel LoRA

class optimum.neuron.peft.tuners.lora.layer.GQAQKVColumnParallelLinear

( base_layer adapter_name: str r: int = 0 lora_alpha: int = 1 lora_dropout: float = 0.0 fan_in_fan_out: bool = False is_target_conv_1d_layer: bool = False init_lora_weights: bool | str = True use_rslora: bool = False use_dora: bool = False lora_bias: bool = False **kwargs )

get_delta_weight

( adapter: str )

Parameters

  • adapter — The name of the adapter for which the delta weight should be computed.

Compute the delta weights for Q, K, V for the given adapter.

Returns a dict with keys “q”, “k”, “v” (or “qkv” if fused) containing the delta tensors.
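
A hypothetical inspection of the returned dictionary, where layer stands for a GQAQKVColumnParallelLinear LoRA layer and "default" for an existing adapter name:

deltas = layer.get_delta_weight("default")
for proj, delta in deltas.items():  # keys are "q", "k", "v" (or "qkv" if fused)
    print(proj, tuple(delta.shape))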

merge

( safe_merge: bool = False adapter_names: list[str] | None = None )

Parameters

  • safe_merge — If True, perform merge in a copy and check for NaNs before merging.
  • adapter_names — List of adapter names to merge. If None, all active adapters will be merged.

Merge the active adapter weights into the base Q, K, V weights.

This works with GQAQKVColumnParallelLinear layers. The merge happens on the sharded weights: each rank merges its own shard.

unmerge

( )

Unmerge all merged adapter layers from the base Q, K, V weights.

This works with GQAQKVColumnParallelLinear layers. The unmerge happens on the sharded weights: each rank unmerges its own shard.

Parallel Embedding LoRA

class optimum.neuron.peft.tuners.lora.layer.ParallelEmbedding

( base_layer: Module adapter_name: str r: int = 0 lora_alpha: int = 1 lora_dropout: float = 0.0 fan_in_fan_out: bool = False init_lora_weights: bool | str = True use_rslora: bool = False use_dora: bool = False lora_bias: bool = False **kwargs )

merge

( safe_merge: bool = False adapter_names: list[str] | None = None )

Parameters

  • safe_merge — If True, perform merge in a copy and check for NaNs before merging.
  • adapter_names — List of adapter names to merge. If None, all active adapters will be merged.

Merge the active adapter weights into the base embedding weights.

This works with ParallelEmbedding layers. The merge happens on the sharded weights: each rank merges its own shard.

unmerge

( )

Unmerge all merged adapter layers from the base embedding weights.

This works with ParallelEmbedding layers. The unmerge happens on the sharded weights: each rank unmerges its own shard.

LoRA Model

NeuronLoraModel

class optimum.neuron.peft.tuners.NeuronLoraModel

( model config adapter_name low_cpu_mem_usage: bool = False )

Utility Functions

get_peft_model

optimum.neuron.peft.get_peft_model

( model: PreTrainedModel peft_config: PeftConfig adapter_name: str = 'default' mixed: bool = False autocast_adapter_dtype: bool = True revision: str | None = None low_cpu_mem_usage: bool = False )
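
The adapter name defaults to "default"; a hypothetical call overriding it, with model and lora_config as in the earlier sketch:

from optimum.neuron.peft import get_peft_model

peft_model = get_peft_model(model, lora_config, adapter_name="trainium-lora")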

Architecture Support

The Neuron LoRA implementation supports the following parallel layer types:

  • ColumnParallelLinear: For layers that split weights along the output dimension
  • RowParallelLinear: For layers that split weights along the input dimension
  • ParallelEmbedding: For embedding layers distributed across ranks
  • GQAQKVColumnParallelLinear: For fused grouped-query attention Q/K/V projections that need special handling under tensor parallelism, such as when the key/value heads cannot be evenly sharded across ranks

Each layer type has a corresponding LoRA implementation that maintains the parallelization strategy while adding low-rank adaptation capabilities.
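
A hypothetical way to check which of these implementations was applied to each targeted module after wrapping a model, with peft_model as in the earlier sketch:

from optimum.neuron.peft.tuners.lora.layer import NeuronLoraLayer

for name, module in peft_model.named_modules():
    if isinstance(module, NeuronLoraLayer):
        print(name, type(module).__name__)  # e.g. ParallelLinear, ParallelEmbedding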

Key Features

  • Distributed Training: Full support for tensor parallelism and sequence parallelism
  • Checkpoint Consolidation: Automatic conversion between sharded and consolidated checkpoints
  • Weight Transformation: Seamless integration with model weight transformation specs
  • Compatibility: Works with all supported custom modeling architectures in Optimum Neuron