
Kronos Fine-tuning Training with Custom Dataset

Supports fine-tuning on custom CSV data, driven by a YAML configuration file

Quick Start

1. Configuration Setup

First edit the config.yaml file to set the correct paths and parameters:

# Data configuration
data:
  data_path: "/path/to/your/data.csv"
  lookback_window: 512
  predict_window: 48
  # ... other parameters

# Model path configuration
model_paths:
  pretrained_tokenizer: "/path/to/pretrained/tokenizer"
  pretrained_predictor: "/path/to/pretrained/predictor"
  base_save_path: "/path/to/save/models"
  # ... other paths
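The training scripts read this YAML at startup. A minimal sketch of loading it with PyYAML; the key names follow the example above, but the full schema is whatever the scripts expect:

```python
import yaml

# Inline config mirroring the structure shown above (paths are placeholders).
config_text = """
data:
  data_path: "/path/to/your/data.csv"
  lookback_window: 512
  predict_window: 48
model_paths:
  pretrained_tokenizer: "/path/to/pretrained/tokenizer"
  pretrained_predictor: "/path/to/pretrained/predictor"
  base_save_path: "/path/to/save/models"
"""

cfg = yaml.safe_load(config_text)
print(cfg["data"]["lookback_window"])        # 512
print(cfg["model_paths"]["base_save_path"])  # /path/to/save/models
```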

2. Run Training

Using train_sequential.py

# Complete training
python train_sequential.py --config configs/config_ali09988_candle-5min.yaml

# Skip existing models
python train_sequential.py --config configs/config_ali09988_candle-5min.yaml --skip-existing

# Only train tokenizer
python train_sequential.py --config configs/config_ali09988_candle-5min.yaml --skip-basemodel

# Only train basemodel
python train_sequential.py --config configs/config_ali09988_candle-5min.yaml --skip-tokenizer

Run each stage separately

# Only train tokenizer
python finetune_tokenizer.py --config configs/config_ali09988_candle-5min.yaml 

# Only train basemodel (requires the fine-tuned tokenizer from the previous stage)
python finetune_base_model.py --config configs/config_ali09988_candle-5min.yaml 

DDP Training

# Choose the communication backend yourself; nccl can be replaced with gloo
DIST_BACKEND=nccl \
torchrun --standalone --nproc_per_node=8 train_sequential.py --config configs/config_ali09988_candle-5min.yaml

Configuration Description

Main Configuration Items

  • data: Data-related configuration

    • data_path: CSV data file path
    • lookback_window: Lookback window size
    • predict_window: Prediction window size
    • train_ratio/val_ratio/test_ratio: Dataset split ratios
  • training: Training-related configuration

    • epochs: Number of training epochs
    • batch_size: Batch size
    • tokenizer_learning_rate: Tokenizer learning rate
    • predictor_learning_rate: Predictor learning rate
  • model_paths: Model path configuration

    • pretrained_tokenizer: Pre-trained tokenizer path
    • pretrained_predictor: Pre-trained predictor path
    • base_save_path: Model save root directory
    • finetuned_tokenizer: Fine-tuned tokenizer path (for basemodel training)
  • experiment: Experiment control

    • train_tokenizer: Whether to train tokenizer
    • train_basemodel: Whether to train basemodel
    • skip_existing: Whether to skip existing models
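A few sanity checks worth running before launching training. The field names follow the list above; the checks themselves are suggestions, not validation the scripts are known to perform:

```python
# Hypothetical values mirroring the data section of config.yaml.
data_cfg = {
    "lookback_window": 512,
    "predict_window": 48,
    "train_ratio": 0.8,
    "val_ratio": 0.1,
    "test_ratio": 0.1,
}

# The three split ratios should cover the whole dataset.
ratio_sum = data_cfg["train_ratio"] + data_cfg["val_ratio"] + data_cfg["test_ratio"]
assert abs(ratio_sum - 1.0) < 1e-9, "split ratios should sum to 1"

# Each training sample needs lookback + predict consecutive rows,
# so the CSV must have at least this many rows.
min_rows = data_cfg["lookback_window"] + data_cfg["predict_window"]
print(min_rows)  # 560
```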

Training Process

  1. Tokenizer Fine-tuning Stage

    • Load pre-trained tokenizer
    • Fine-tune on custom data
    • Save fine-tuned tokenizer to {base_save_path}/tokenizer/best_model/
  2. Basemodel Fine-tuning Stage

    • Load fine-tuned tokenizer and pre-trained predictor
    • Fine-tune on custom data
    • Save fine-tuned basemodel to {base_save_path}/basemodel/best_model/
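The output layout of the two stages above can be sketched as follows (base_save_path is a placeholder for the value in config.yaml):

```python
base_save_path = "/path/to/save/models"  # placeholder, see model_paths above

# Stage 1 output: fine-tuned tokenizer.
tokenizer_dir = f"{base_save_path}/tokenizer/best_model"
# Stage 2 output: fine-tuned basemodel.
basemodel_dir = f"{base_save_path}/basemodel/best_model"

print(tokenizer_dir)  # /path/to/save/models/tokenizer/best_model
print(basemodel_dir)  # /path/to/save/models/basemodel/best_model
```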

Data Format: Ensure the CSV file contains the following columns: timestamps, open, high, low, close, volume, amount
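A quick stdlib check that a CSV header has the required columns before starting training (the helper name is hypothetical, not part of the repo):

```python
import csv
import io

REQUIRED_COLUMNS = ["timestamps", "open", "high", "low", "close", "volume", "amount"]

def missing_columns(csv_file):
    """Return the required columns absent from the CSV header row."""
    header = next(csv.reader(csv_file))
    return [col for col in REQUIRED_COLUMNS if col not in header]

# Hypothetical sample whose header lacks 'amount':
sample = io.StringIO("timestamps,open,high,low,close,volume\n1,2,3,4,5,6\n")
print(missing_columns(sample))  # ['amount']
```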