From 38b5176cb763f7c957e8c6c42b0c2146ecd3ce72 Mon Sep 17 00:00:00 2001 From: zhangboyu1 Date: Thu, 9 Oct 2025 16:24:35 +0800 Subject: [PATCH] update readme & figs --- finetune_csv/README.md | 106 ++++++++++++++++++++----------- finetune_csv/README_CN.md | 129 +++++++++++++++++++++----------------- 2 files changed, 140 insertions(+), 95 deletions(-) diff --git a/finetune_csv/README.md b/finetune_csv/README.md index 1cf576b..0b0216b 100644 --- a/finetune_csv/README.md +++ b/finetune_csv/README.md @@ -1,50 +1,60 @@ -# Kronos Finetuning on Your Custom csv Dataset +# Kronos Fine-tuning on Custom CSV Datasets -Supports fine-tuning training with custom CSV data using configuration files +This module provides a comprehensive pipeline for fine-tuning Kronos models on your own CSV-formatted financial data. It supports both sequential training (tokenizer followed by predictor) and individual component training, with full distributed training capabilities. -## 1. Prepare Your Data -**Data Format**: Ensure CSV file contains the following columns: `timestamps`, `open`, `high`, `low`, `close`, `volume`, `amount` +## 1. Data Preparation -A good csv data should be like: +### Required Data Format + +Your CSV file must contain the following columns in this exact order: +- `timestamps`: DateTime stamps for each data point +- `open`: Opening price +- `high`: Highest price +- `low`: Lowest price +- `close`: Closing price +- `volume`: Trading volume +- `amount`: Trading amount + +(volume and amount can be 0 if not available) + +### Sample Data Format | timestamps | open | close | high | low | volume | amount | |------------|------|-------|------|-----|--------|--------| | 2019/11/26 9:35 | 182.45215 | 184.45215 | 184.95215 | 182.45215 | 15136000 | 0 | | 2019/11/26 9:40 | 184.35215 | 183.85215 | 184.55215 | 183.45215 | 4433300 | 0 | -| ... | ... | ... | ... | ... | ... | ... | -| ... | ... | ... | ... | ... | ... | ... | +| 2019/11/26 9:45 | 183.85215 | 183.35215 | 183.95215 | 182.95215 | 3070900 | 0 | -You can check "data/HK_ali_09988_kline_5min_all.csv" to find out the proper format. +> **Reference**: Check `data/HK_ali_09988_kline_5min_all.csv` for a complete example of the proper data format. -## 2. Training -### Configuration Setup +## 2. Config Preparation -First edit the `config.yaml` file to set the correct paths and parameters: + +Please edit the correct data path and set your training parameters. ```yaml # Data configuration data: data_path: "/path/to/your/data.csv" - lookback_window: 512 - predict_window: 48 - # ... other parameters + lookback_window: 512 # Historical data points to use + predict_window: 48 # Future points to predict + max_context: 512 # Maximum context length + +... -# Model path configuration -model_paths: - pretrained_tokenizer: "/path/to/pretrained/tokenizer" - pretrained_predictor: "/path/to/pretrained/predictor" - base_save_path: "/path/to/save/models" - # ... other paths ``` +There are some other settings here, please see `configs/config_ali09988_candle-5min.yaml` for more comments. -### Run Training +## 3. Training -Using train_sequential +### Method 1: Sequential Training (Recommended) + +The `train_sequential.py` script handles the complete training pipeline automatically: ```bash -# Complete training +# Complete training (tokenizer + predictor) python train_sequential.py --config configs/config_ali09988_candle-5min.yaml # Skip existing models @@ -53,36 +63,58 @@ python train_sequential.py --config configs/config_ali09988_candle-5min.yaml --s # Only train tokenizer python train_sequential.py --config configs/config_ali09988_candle-5min.yaml --skip-basemodel -# Only train basemodel +# Only train predictor python train_sequential.py --config configs/config_ali09988_candle-5min.yaml --skip-tokenizer ``` -Run each stage separately +### Method 2: Individual Component Training + +Train each component separately for more control: ```bash -# Only train tokenizer -python finetune_tokenizer.py --config configs/config_ali09988_candle-5min.yaml +# Step 1: Train tokenizer +python finetune_tokenizer.py --config configs/config_ali09988_candle-5min.yaml -# Only train basemodel (requires fine-tuned tokenizer first) -python finetune_base_model.py --config configs/config_ali09988_candle-5min.yaml +# Step 2: Train predictor (requires fine-tuned tokenizer) +python finetune_base_model.py --config configs/config_ali09988_candle-5min.yaml ``` -DDP Training +### DDP Training + +For faster training on multiple GPUs: + ```bash -# Choose communication protocol yourself, nccl can be replaced with gloo +# Set communication backend (nccl for NVIDIA GPUs, gloo for CPU/mixed) DIST_BACKEND=nccl \ torchrun --standalone --nproc_per_node=8 train_sequential.py --config configs/config_ali09988_candle-5min.yaml ``` -## 2. Training Results -![HK_ali_09988_kline_5min_all_historical_20250919_073929](examples/HK_ali_09988_kline_5min_all_historical_20250919_073929.png) +## 4. Training Results -![HK_ali_09988_kline_5min_all_historical_20250919_073944](examples/HK_ali_09988_kline_5min_all_historical_20250919_073944.png) +The training process generates several outputs: -![HK_ali_09988_kline_5min_all_historical_20250919_074012](examples/HK_ali_09988_kline_5min_all_historical_20250919_074012.png) +### Model Checkpoints +- **Tokenizer**: Saved to `{base_save_path}/{exp_name}/tokenizer/best_model/` +- **Predictor**: Saved to `{base_save_path}/{exp_name}/basemodel/best_model/` -![HK_ali_09988_kline_5min_all_historical_20250919_074042](examples/HK_ali_09988_kline_5min_all_historical_20250919_074042.png) +### Training Logs +- **Console output**: Real-time training progress and metrics +- **Log files**: Detailed logs saved to `{base_save_path}/logs/` +- **Validation tracking**: Best models are saved based on validation loss + +## 5. Prediction Vis + +The following images show example training results on alibaba (HK stock) data: + +![Training Result 1](examples/HK_ali_09988_kline_5min_all_historical_20250919_073929.png) + +![Training Result 2](examples/HK_ali_09988_kline_5min_all_historical_20250919_073944.png) + +![Training Result 3](examples/HK_ali_09988_kline_5min_all_historical_20250919_074012.png) + +![Training Result 4](examples/HK_ali_09988_kline_5min_all_historical_20250919_074042.png) + +![Training Result 5](examples/HK_ali_09988_kline_5min_all_historical_20250919_074251.png) -![HK_ali_09988_kline_5min_all_historical_20250919_074251](examples/HK_ali_09988_kline_5min_all_historical_20250919_074251.png) diff --git a/finetune_csv/README_CN.md b/finetune_csv/README_CN.md index 05269ee..3271625 100644 --- a/finetune_csv/README_CN.md +++ b/finetune_csv/README_CN.md @@ -1,36 +1,58 @@ -# 自定义数据集的Kronos微调训练 +# Kronos微调-支持自定义CSV数据集 -支持使用配置文件进行自定义csv数据的微调训练 +这是一个在自定义的CSV格式数据上微调Kronos模型的完整流程。包含顺序训练(先训练tokenizer再训练predictor)和单独模块训练,同时支持分布式训练。 -## 快速开始 -### 1. 配置设置 +## 1. 准备数据 -首先编辑 `config.yaml` 文件,设置正确的路径和参数: +### 数据格式 + +您的CSV文件必须按以下确切顺序包含以下列: +- `timestamps`: 每个数据点的时间戳 +- `open`: 开盘价 +- `high`: 最高价 +- `low`: 最低价 +- `close`: 收盘价 +- `volume`: 交易量 +- `amount`: 交易金额 + +(volume和amount可以全0如果没有这部分的数据) + +### 示例数据格式 + +| timestamps | open | close | high | low | volume | amount | +|------------|------|-------|------|-----|--------|--------| +| 2019/11/26 9:35 | 182.45215 | 184.45215 | 184.95215 | 182.45215 | 15136000 | 0 | +| 2019/11/26 9:40 | 184.35215 | 183.85215 | 184.55215 | 183.45215 | 4433300 | 0 | +| 2019/11/26 9:45 | 183.85215 | 183.35215 | 183.95215 | 182.95215 | 3070900 | 0 | + +> **标准数据样例**: `data/HK_ali_09988_kline_5min_all.csv` + +## 2. 准备config文件 + +data_path需要改成正确的数据路径,训练参数可以自己调节 ```yaml # 数据配置 data: - data_path: "/path/to/your/data.csv" - lookback_window: 512 - predict_window: 48 - # ... 其他参数 + data_path: "/path/to/your/data.csv" + lookback_window: 512 # 要使用的历史数据点 + predict_window: 48 # 要预测的未来点数 + max_context: 512 # 最大上下文长度 + +... -# 模型路径配置 -model_paths: - pretrained_tokenizer: "/path/to/pretrained/tokenizer" - pretrained_predictor: "/path/to/pretrained/predictor" - base_save_path: "/path/to/save/models" - # ... 其他路径 ``` +这里还有其他一些设置, `configs/config_ali09988_candle-5min.yaml` 有更详细的注释。 -### 2. 运行训练 +## 3. 训练 +### 方法1: 直接顺序训练 -使用train_sequential +`train_sequential.py` 脚本自动处理完整的训练流程: ```bash -# 完整训练 +# 完整训练(tokenizer + predictor) python train_sequential.py --config configs/config_ali09988_candle-5min.yaml # 跳过已存在的模型 @@ -39,67 +61,58 @@ python train_sequential.py --config configs/config_ali09988_candle-5min.yaml --s # 只训练tokenizer python train_sequential.py --config configs/config_ali09988_candle-5min.yaml --skip-basemodel -# 只训练basemodel +# 只训练predictor python train_sequential.py --config configs/config_ali09988_candle-5min.yaml --skip-tokenizer ``` -单独运行各个阶段 +### 方法2: 单独组件训练 + +可以单独训练每个组件: ```bash -# 只训练tokenizer -python finetune_tokenizer.py --config configs/config_ali09988_candle-5min.yaml +# 步骤1: 训练tokenizer +python finetune_tokenizer.py --config configs/config_ali09988_candle-5min.yaml -# 只训练basemodel(需要先有微调后的tokenizer) -python finetune_base_model.py --config configs/config_ali09988_candle-5min.yaml +# 步骤2: 训练predictor(需要微调后的tokenizer) +python finetune_base_model.py --config configs/config_ali09988_candle-5min.yaml ``` -DDP训练 +### DDP训练 + +如果有多卡,可以开启ddp加速训练: + ```bash -# 通信协议自行选择,nccl可替换gloo +# 设置通信后端(NVIDIA GPU用nccl,CPU/混合用gloo) DIST_BACKEND=nccl \ torchrun --standalone --nproc_per_node=8 train_sequential.py --config configs/config_ali09988_candle-5min.yaml ``` -## 配置说明 +## 4. 训练结果 -### 主要配置项 +训练过程生成以下输出: -- **data**: 数据相关配置 - - `data_path`: CSV数据文件路径 - - `lookback_window`: 回望窗口大小 - - `predict_window`: 预测窗口大小 - - `train_ratio/val_ratio/test_ratio`: 数据集分割比例 +### 模型检查点 +- **Tokenizer**: 保存到 `{base_save_path}/{exp_name}/tokenizer/best_model/` +- **Predictor**: 保存到 `{base_save_path}/{exp_name}/basemodel/best_model/` -- **training**: 训练相关配置 - - `epochs`: 训练轮数 - - `batch_size`: 批次大小 - - `tokenizer_learning_rate`: Tokenizer学习率 - - `predictor_learning_rate`: Predictor学习率 +### 训练日志 +- **控制台输出**: 实时训练进度和指标 +- **日志文件**: 详细日志保存到 `{base_save_path}/logs/` +- **验证跟踪**: 基于验证损失保存最佳模型 -- **model_paths**: 模型路径配置 - - `pretrained_tokenizer`: 预训练tokenizer路径 - - `pretrained_predictor`: 预训练predictor路径 - - `base_save_path`: 模型保存根目录 - - `finetuned_tokenizer`: 微调后tokenizer路径(用于basemodel训练) +## 5. 预测可视化 -- **experiment**: 实验控制 - - `train_tokenizer`: 是否训练tokenizer - - `train_basemodel`: 是否训练basemodel - - `skip_existing`: 是否跳过已存在的模型 +以下图像显示了kronos在阿里巴巴股票数据上微调后的示例训练结果: -## 训练流程 +![训练结果 1](examples/HK_ali_09988_kline_5min_all_historical_20250919_073929.png) -1. **Tokenizer微调阶段** - - 加载预训练tokenizer - - 在自定义数据上微调 - - 保存微调后的tokenizer到 `{base_save_path}/tokenizer/best_model/` +![训练结果 2](examples/HK_ali_09988_kline_5min_all_historical_20250919_073944.png) -2. **Basemodel微调阶段** - - 加载微调后的tokenizer和预训练predictor - - 在自定义数据上微调 - - 保存微调后的basemodel到 `{base_save_path}/basemodel/best_model/` +![训练结果 3](examples/HK_ali_09988_kline_5min_all_historical_20250919_074012.png) + +![训练结果 4](examples/HK_ali_09988_kline_5min_all_historical_20250919_074042.png) + +![训练结果 5](examples/HK_ali_09988_kline_5min_all_historical_20250919_074251.png) - **数据格式**: 确保CSV文件包含以下列:`timestamps`, `open`, `high`, `low`, `close`, `volume`, `amount` -