自定义
自定义数据集直接命令行传参
支持直接传入行自定义的dataset_id(兼容MS和HF)和dataset_path, 以及同时传入多个自定义数据集以及对应采样数, 脚本会进行自动的预处理和拼接. 如果传入的是dataset_id, 默认会使用dataset_id中的'default'子数据集, 并设置split为'train'. 如果该dataset_id已经注册, 则会使用注册时传入的subsets、split以及预处理函数. 如果传入的是dataset_path, 则可以指定为相对路径和绝对路径, 其中相对路径为相对于当前运行目录.
--dataset {dataset_id} {dataset_path}
# 数据集混合: 以下取dataset_id中subset1和subset2子数据集并采样20000条
--dataset {dataset_name}#20000 {dataset_id}:{subset1}/{subset2}#20000 {dataset_path}#10000
脚本支持的文件格式包含csv, json, jsonl格式. 你需要将传入的文件符合以下数据集格式(只列出了一部分). 以下格式都支持system (需要注意的是, csv如果指定了system字段, 则无法设置为None, 只能指定为空字符串. json和jsonl没有这个限制). json, jsonl格式的文件支持多轮对话 (csv不支持).
格式1:
预训练:
response
11111
aaaaa
AAAAA
{"response": "11111"}
{"response": "aaaaa"}
{"response": "AAAAA"}
单轮对话:
system,query,response
00000,11111,22222
00001,aaaaa,bbbbb
00002,AAAAA,BBBBB
{"system": "00000", "query": "11111", "response": "22222"}
{"query": "aaaaa", "response": "bbbbb"}
{"query": "AAAAA", "response": "BBBBB"}
多轮对话:
{"system": "00000", "query": "55555", "response": "66666"}
{"query": "eeeee", "response": "fffff", "history": []}
{"query": "EEEEE", "response": "FFFFF", "history": [["AAAAA", "BBBBB"], ["CCCCC", "DDDDD"]]}
[{"system": "00000", "query": "55555", "response": "66666"},
{"query": "eeeee", "response": "fffff", "history": []},
{"query": "EEEEE", "response": "FFFFF", "history": [["AAAAA", "BBBBB"], ["CCCCC", "DDDDD"]]}]
格式2:
{"conversations": [{"from": "system", "value": "00000"}, {"from": "user", "value": "11111"}, {"from": "assistant", "value": "22222"}]}
{"conversations": [{"from": "user", "value": "aaaaa"}, {"from": "assistant", "value": "bbbbb"}, {"from": "user", "value": "ccccc"}, {"from": "assistant", "value": "ddddd"}]}
{"conversations": [{"from": "user", "value": "AAAAA"}, {"from": "assistant", "value": "BBBBB"}, {"from": "user", "value": "CCCCC"}, {"from": "assistant", "value": "DDDDD"}]}
格式3:
{"messages": [{"role": "system", "content": "00000"}, {"role": "user", "content": "11111"}, {"role": "assistant", "content": "22222"}]}
{"messages": [{"role": "user", "content": "aaaaa"}, {"role": "assistant", "content": "bbbbb"}, {"role": "user", "content": "ccccc"}, {"role": "assistant", "content": "ddddd"}]}
{"messages": [{"role": "user", "content": "AAAAA"}, {"role": "assistant", "content": "BBBBB"}, {"role": "user", "content": "CCCCC"}, {"role": "assistant", "content": "DDDDD"}]}
格式4:
{"system": "00000", "conversation": [{"human": "11111", "assistant": "22222"}]}
{"conversation": [{"human": "aaaaa", "assistant": "bbbbb"}]}
{"conversation": [{"human": "AAAAA", "assistant": "BBBBB"}, {"human": "CCCCC", "assistant": "DDDDD"}, {"human": "EEEEE", "assistant": "FFFFF"}]}
格式5:
system,instruction,input,output
00000,11111,22222,33333
00001,aaaaa,bbbbb,ccccc
00002,AAAAA,BBBBB,CCCCC
人类对齐(DPO/ORPO/SimPO/CPO)
{"system": "123", "query": "11111", "response": "22222", "rejected_response": "33333", "history": [["AAAAA", "BBBBB"], ["CCCCC", "DDDDD"]]}
{"system": "123", "query": "aaaaa", "response": "bbbbb", "rejected_response": "ccccc", "history": [["AAAAA", "BBBBB"], ["CCCCC", "DDDDD"]]}
{"system": "123", "query": "AAAAA", "response": "BBBBB", "rejected_response": "CCCCC", "history": [["AAAAA", "BBBBB"], ["CCCCC", "DDDDD"]]}
其中system和history为可选项
Tool-Calling Agent
格式1
{"tools":"{}","conversations": [{"from": "system", "value": "00000"}, {"from": "user", "value": "11111"}, {"from": "assistant", "value": "22222"}]}
{"tools":"{}","conversations": [{"from": "user", "value": "aaaaa"}, {"from": "assistant", "value": "bbbbb"}, {"from": "tool", "value": "ccccc"}, {"from": "assistant", "value": "ddddd"}]}
{"tools":"{}","conversations": [{"from": "user", "value": "AAAAA"}, {"from": "assistant", "value": "BBBBB"}, {"from": "tool", "value": "CCCCC"}, {"from": "assistant", "value": "DDDDD"}]}
格式2
{"tools":"{}","messages": [{"role": "system", "content": "00000"}, {"role": "user", "content": "11111"}, {"role": "assistant", "content": "22222"}]}
{"tools":"{}","messages": [{"role": "user", "content": "aaaaa"}, {"role": "assistant", "content": "bbbbb"}, {"role": "tool", "content": "ccccc"}, {"role": "assistant", "content": "ddddd"}]}
{"tools":"{}","messages": [{"role": "user", "content": "AAAAA"}, {"role": "assistant", "content": "BBBBB"}, {"role": "tool", "content": "CCCCC"}, {"role": "assistant", "content": "DDDDD"}]}
自定义模板
register_template会在TEMPLATE_MAPPING中注册对话模板, 该函数的参数含义如下:
template_type: 必填项, 表示对话模板的名字, 也是template的唯一id.
template: 必填项, 需要传入一个Template. 初始化Template需要传入以下参数: prefix, prompt, chat_sep, suffix, default_system.
模板初始化函数会根据这四个内容, 获取完整的chat template. 其中这四个配置内容的含义如下.
prefix: 表示对话模板中的前缀部分, 一般为system部分, 前缀token, bos token等内容. 我们使用{{SYSTEM}}作为system的占位符. 如果{{SYSTEM}}没有在prefix中存在, 则该Template不支持system, e.g. damo-agent-mini-zh数据集.
prompt: 表示对话模板中的一轮对话. 我们使用{{QUERY}}作为每轮对话中, human询问部分的占位符, {{ROUND0}}则表示本次对话是第几轮的占位符, 从0开始计数, {{ROUND1}}从1开始计数. AI助手的回复部分会拼接在prompt的后面, 因此我们没有设计其占位符. 我们只会对AI助手的回复部分计算损失.
chat_sep: 如果需要进行多轮对话, chat_sep会作为每轮对话之间的分隔符, 例如: 换行等. 如果设置为None, 则该Template不支持多轮对话.
suffix: 作为对话模板的后缀部分, 一般为eos token. 会拼接在最后一轮的对话后面.
default_system: 默认的system.
from swift.llm import (Template, ModelType, dataset_map,
get_model_tokenizer, get_template, get_dataset,
print_example, register_template, DatasetName)
class CustomTemplateType:
tigerbot = 'tigerbot'
# Ref: https://github.com/TigerResearch/TigerBot/blob/main/infer.py
register_template(
CustomTemplateType.tigerbot,
Template(['{{SYSTEM}}'], ['\n\n### Instruction:\n{{QUERY}}\n\n### Response:\n'], [],
[['eos_token_id']]))
if __name__ == '__main__':
# test template
train_dataset, _ = get_dataset(DatasetName.blossom_math_zh)
_, tokenizer = get_model_tokenizer(ModelType.qwen1half_0_5b_chat, load_model=False)
template = get_template(CustomTemplateType.tigerbot, tokenizer)
train_dataset = dataset_map(train_dataset, template.encode)
print_example(train_dataset[0], tokenizer)