# Kronos Fine-tuning on Custom CSV Datasets
This module provides a comprehensive pipeline for fine-tuning Kronos models on your own CSV-formatted financial data. It supports both sequential training (tokenizer followed by predictor) and individual component training, with full distributed training capabilities.
## 1. Data Preparation
### Required Data Format
Your CSV file must contain the following columns:
- `timestamps`: DateTime stamps for each data point
- `open`: Opening price
- `high`: Highest price
- `low`: Lowest price
- `close`: Closing price
- `volume`: Trading volume
- `amount`: Trading amount
> **Note**: `volume` and `amount` may be set to `0` if they are not available.
### Sample Data Format
| timestamps | open | close | high | low | volume | amount |
|------------|------|-------|------|-----|--------|--------|
| 2019/11/26 9:35 | 182.45215 | 184.45215 | 184.95215 | 182.45215 | 15136000 | 0 |
| 2019/11/26 9:40 | 184.35215 | 183.85215 | 184.55215 | 183.45215 | 4433300 | 0 |
| 2019/11/26 9:45 | 183.85215 | 183.35215 | 183.95215 | 182.95215 | 3070900 | 0 |
> **Reference**: Check `data/HK_ali_09988_kline_5min_all.csv` for a complete example of the proper data format.
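
To catch schema problems before training, you can sanity-check your CSV with a short script. The following is a minimal sketch using pandas; `validate_kline_csv` and `REQUIRED_COLUMNS` are illustrative names, not part of this repo:

```python
import pandas as pd

REQUIRED_COLUMNS = ["timestamps", "open", "high", "low", "close", "volume", "amount"]

def validate_kline_csv(path: str) -> pd.DataFrame:
    """Load a K-line CSV and check it matches the expected schema."""
    df = pd.read_csv(path)
    missing = [c for c in REQUIRED_COLUMNS if c not in df.columns]
    if missing:
        raise ValueError(f"CSV is missing required columns: {missing}")
    # Parse timestamps and make sure the series is in chronological order.
    df["timestamps"] = pd.to_datetime(df["timestamps"])
    df = df.sort_values("timestamps").reset_index(drop=True)
    # volume/amount may legitimately be 0, but prices should not be NaN.
    if df[["open", "high", "low", "close"]].isna().any().any():
        raise ValueError("Price columns contain NaN values")
    return df

df = validate_kline_csv("data/HK_ali_09988_kline_5min_all.csv")
print(df.head())
```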
## 2. Config Preparation
Edit the config file to point `data_path` at your CSV file and set your training parameters.
```yaml
# Data configuration
data:
  data_path: "/path/to/your/data.csv"
  lookback_window: 512   # Historical data points to use
  predict_window: 48     # Future points to predict
  max_context: 512       # Maximum context length
  ...
```
Additional settings are documented in `configs/config_ali09988_candle-5min.yaml`; see the inline comments there for details.
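
If you want to inspect the config programmatically before launching a run, here is a minimal sketch. It assumes the key names shown in the excerpt above; the actual file may contain more sections:

```python
import yaml  # pip install pyyaml

# Load the training config the same way you might inspect it before a run.
with open("configs/config_ali09988_candle-5min.yaml") as f:
    cfg = yaml.safe_load(f)

data_cfg = cfg["data"]
# The lookback window must fit within the model's maximum context.
assert data_cfg["lookback_window"] <= data_cfg["max_context"]
print(f"Training on {data_cfg['data_path']} "
      f"({data_cfg['lookback_window']} -> {data_cfg['predict_window']})")
```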
## 3. Training
### Method 1: Sequential Training (Recommended)
The `train_sequential.py` script handles the complete training pipeline automatically:
```bash
# Complete training (tokenizer + predictor)
python train_sequential.py --config configs/config_ali09988_candle-5min.yaml
# Skip existing models
python train_sequential.py --config configs/config_ali09988_candle-5min.yaml --skip-existing
# Only train tokenizer
python train_sequential.py --config configs/config_ali09988_candle-5min.yaml --skip-basemodel
# Only train predictor
python train_sequential.py --config configs/config_ali09988_candle-5min.yaml --skip-tokenizer
```
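
Conceptually, the sequential script chains the two stages and stops if either fails. A rough sketch of the equivalent orchestration (illustrative only, not the actual implementation of `train_sequential.py`):

```python
import subprocess
import sys

CONFIG = "configs/config_ali09988_candle-5min.yaml"

def run_stage(script: str) -> None:
    """Run one training stage and abort the pipeline on failure."""
    result = subprocess.run([sys.executable, script, "--config", CONFIG])
    if result.returncode != 0:
        sys.exit(f"{script} failed with exit code {result.returncode}")

# The tokenizer must be trained first; the predictor consumes its outputs.
run_stage("finetune_tokenizer.py")
run_stage("finetune_base_model.py")
```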
### Method 2: Individual Component Training
Train each component separately for more control:
```bash
# Step 1: Train tokenizer
python finetune_tokenizer.py --config configs/config_ali09988_candle-5min.yaml
# Step 2: Train predictor (requires fine-tuned tokenizer)
python finetune_base_model.py --config configs/config_ali09988_candle-5min.yaml
```
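
Because Step 2 depends on the fine-tuned tokenizer, it can be worth verifying that the tokenizer checkpoint exists before launching the predictor run. The paths below follow the checkpoint layout described in Section 4; `base_save_path` and `exp_name` are placeholders for your config values:

```python
from pathlib import Path

# Placeholder values; substitute base_save_path and exp_name from your config.
base_save_path = Path("/path/to/outputs")
exp_name = "my_experiment"
tokenizer_ckpt = base_save_path / exp_name / "tokenizer" / "best_model"

if not tokenizer_ckpt.exists():
    raise FileNotFoundError(
        f"No fine-tuned tokenizer at {tokenizer_ckpt}; run finetune_tokenizer.py first"
    )
```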
### DDP Training
For faster training on multiple GPUs:
```bash
# Set communication backend (nccl for NVIDIA GPUs, gloo for CPU/mixed)
DIST_BACKEND=nccl \
torchrun --standalone --nproc_per_node=8 train_sequential.py --config configs/config_ali09988_candle-5min.yaml
```
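
Under `torchrun`, each process typically reads its rank and world size from the environment and picks up the backend from the `DIST_BACKEND` variable. A generic initialization sketch (not necessarily the exact code in `train_sequential.py`):

```python
import os
import torch
import torch.distributed as dist

# torchrun sets RANK/WORLD_SIZE/LOCAL_RANK; the backend can be overridden
# via the DIST_BACKEND environment variable (nccl for NVIDIA GPUs).
backend = os.environ.get("DIST_BACKEND", "nccl")
dist.init_process_group(backend=backend)

local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)
print(f"rank {dist.get_rank()}/{dist.get_world_size()} using {backend}")
```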
## 4. Training Results
The training process generates several outputs:
### Model Checkpoints
- **Tokenizer**: Saved to `{base_save_path}/{exp_name}/tokenizer/best_model/`
- **Predictor**: Saved to `{base_save_path}/{exp_name}/basemodel/best_model/`
### Training Logs
- **Console output**: Real-time training progress and metrics
- **Log files**: Detailed logs saved to `{base_save_path}/logs/`
- **Validation tracking**: Best models are saved based on validation loss
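
Once training finishes, the best checkpoints can be loaded for inference. A minimal sketch, assuming the upstream Kronos loaders (`KronosTokenizer.from_pretrained`, `Kronos.from_pretrained`, `KronosPredictor`) accept a local checkpoint directory; substitute your actual `base_save_path` and `exp_name`:

```python
from pathlib import Path
from model import Kronos, KronosTokenizer, KronosPredictor  # upstream Kronos classes

# Checkpoint layout described above; placeholder paths.
exp_dir = Path("/path/to/outputs") / "my_experiment"

# Assumption: from_pretrained works with a local directory as well as a hub id.
tokenizer = KronosTokenizer.from_pretrained(str(exp_dir / "tokenizer" / "best_model"))
model = Kronos.from_pretrained(str(exp_dir / "basemodel" / "best_model"))

predictor = KronosPredictor(model, tokenizer, device="cuda:0", max_context=512)
```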
## 5. Prediction Visualization
The following images show example prediction results on Alibaba (HK stock, 09988) 5-minute K-line data:
![Training Result 1](examples/HK_ali_09988_kline_5min_all_historical_20250919_073929.png)
![Training Result 2](examples/HK_ali_09988_kline_5min_all_historical_20250919_073944.png)
![Training Result 3](examples/HK_ali_09988_kline_5min_all_historical_20250919_074012.png)
![Training Result 4](examples/HK_ali_09988_kline_5min_all_historical_20250919_074042.png)
![Training Result 5](examples/HK_ali_09988_kline_5min_all_historical_20250919_074251.png)