Unverified 提交 a5ca4bf2 作者: Zhi-guo Huang 提交者: GitHub

1.增加对llama-cpp模型的支持;2.增加对bloom/chatyuan/baichuan模型的支持;3. 修复多GPU部署的bug;4.…

1.增加对llama-cpp模型的支持;2.增加对bloom/chatyuan/baichuan模型的支持;3. 修复多GPU部署的bug;4. 修复了moss_llm.py的bug;5. 增加对openai支持(没有api,未测试);6. 支持在多卡情况自定义设备GPU (#664)

* 修复 bing_search.py的typo;更新model_config.py中Bing Subscription Key申请方式及注意事项

* 更新FAQ,增加了[Errno 110] Connection timed out的原因与解决方案

* 修改loader.py中load_in_8bit失败的原因和详细解决方案

* update loader.py

* stream_chat_bing

* 修改stream_chat的接口,在请求体中选择knowledge_base_id;增加stream_chat_bing接口

* 优化cli_demo.py的逻辑:支持 输入提示;多输入;重新输入

* update cli_demo.py

* add bloom-3b,bloom-7b1,ggml-vicuna-13b-1.1

* 1.增加对llama-cpp模型的支持;2.增加对bloom模型的支持;3. 修复多GPU部署的bug;4. 增加对openai支持(没有api,未测试);5.增加了llama-cpp模型部署的说明

* llama模型兼容性说明

* modified:   ../configs/model_config.py
	modified:   ../docs/INSTALL.md
在install.md里增加对llama-cpp模型调用的说明

* 修改llama_llm.py以适应llama-cpp模型

* 完成llama-cpp模型的支持;

* make fastchat and openapi compatiable

* 1. 修复/增加对chatyuan,bloom,baichuan-7等模型的支持;2. 修复了moss_llm.py的bug;

* set default model be chatglm-6b

* 在多卡情况下也支持自定义GPU设备

---------

Co-authored-by: imClumsyPanda <littlepanda0716@gmail.com>
上级 10abb8d7
......@@ -200,7 +200,6 @@ class LocalDocQA:
return vs_path, loaded_files
else:
logger.info("文件均未成功加载,请检查依赖包或替换为其他文件再次上传。")
return None, loaded_files
def one_knowledge_add(self, vs_path, one_title, one_conent, one_content_segmentation, sentence_size):
......
......@@ -69,7 +69,7 @@ llm_model_dict = {
"name": "chatyuan",
"pretrained_model_name": "ClueAI/ChatYuan-large-v2",
"local_model_path": None,
"provides": None
"provides": "MOSSLLM"
},
"moss": {
"name": "moss",
......@@ -82,6 +82,46 @@ llm_model_dict = {
"pretrained_model_name": "vicuna-13b-hf",
"local_model_path": None,
"provides": "LLamaLLM"
},
# 直接调用返回requests.exceptions.ConnectionError错误,需要通过huggingface_hub包里的snapshot_download函数
# 下载模型,如果snapshot_download还是返回网络错误,多试几次,一般是可以的,
# 如果仍然不行,则应该是网络加了防火墙(在服务器上这种情况比较常见),基本只能从别的设备上下载,
# 然后转移到目标设备了.
"bloomz-7b1":{
"name" : "bloomz-7b1",
"pretrained_model_name": "bigscience/bloomz-7b1",
"local_model_path": None,
"provides": "MOSSLLM"
},
# 实测加载bigscience/bloom-3b需要170秒左右,暂不清楚为什么这么慢
# 应与它要加载专有token有关
"bloom-3b":{
"name" : "bloom-3b",
"pretrained_model_name": "bigscience/bloom-3b",
"local_model_path": None,
"provides": "MOSSLLM"
},
"baichuan-7b":{
"name":"baichuan-7b",
"pretrained_model_name":"baichuan-inc/baichuan-7B",
"local_model_path":None,
"provides":"MOSSLLM"
},
# llama-cpp模型的兼容性问题参考https://github.com/abetlen/llama-cpp-python/issues/204
"ggml-vicuna-13b-1.1-q5":{
"name": "ggml-vicuna-13b-1.1-q5",
"pretrained_model_name": "lmsys/vicuna-13b-delta-v1.1",
# 这里需要下载好模型的路径,如果下载模型是默认路径则它会下载到用户工作区的
# /.cache/huggingface/hub/models--vicuna--ggml-vicuna-13b-1.1/
# 还有就是由于本项目加载模型的方式设置的比较严格,下载完成后仍需手动修改模型的文件名
# 将其设置为与Huggface Hub一致的文件名
# 此外不同时期的ggml格式并不兼容,因此不同时期的ggml需要安装不同的llama-cpp-python库,且实测pip install 不好使
# 需要手动从https://github.com/abetlen/llama-cpp-python/releases/tag/下载对应的wheel安装
# 实测v0.1.63与本模型的vicuna/ggml-vicuna-13b-1.1/ggml-vic13b-q5_1.bin可以兼容
"local_model_path":f'''{"/".join(os.path.abspath(__file__).split("/")[:3])}/.cache/huggingface/hub/models--vicuna--ggml-vicuna-13b-1.1/blobs/''',
"provides": "LLamaLLM"
},
# 通过 fastchat 调用的模型请参考如下格式
......@@ -90,7 +130,8 @@ llm_model_dict = {
"pretrained_model_name": "chatglm-6b",
"local_model_path": None,
"provides": "FastChatOpenAILLM", # 使用fastchat api时,需保证"provides"为"FastChatOpenAILLM"
"api_base_url": "http://localhost:8000/v1" # "name"修改为fastchat服务中的"api_base_url"
"api_base_url": "http://localhost:8000/v1", # "name"修改为fastchat服务中的"api_base_url"
"api_key": "EMPTY"
},
"fastchat-chatglm2-6b": {
"name": "chatglm2-6b", # "name"修改为fastchat服务中的"model_name"
......@@ -106,8 +147,18 @@ llm_model_dict = {
"pretrained_model_name": "vicuna-13b-hf",
"local_model_path": None,
"provides": "FastChatOpenAILLM", # 使用fastchat api时,需保证"provides"为"FastChatOpenAILLM"
"api_base_url": "http://localhost:8000/v1" # "name"修改为fastchat服务中的"api_base_url"
"api_base_url": "http://localhost:8000/v1", # "name"修改为fastchat服务中的"api_base_url"
"api_key": "EMPTY"
},
"openai-chatgpt-3.5":{
"name": "gpt-3.5-turbo",
"pretrained_model_name": "gpt-3.5-turbo",
"provides":"FastChatOpenAILLM",
"local_model_path": None,
"api_base_url": "https://api.openapi.com/v1",
"api_key": ""
},
}
# LLM 名称
......
......@@ -44,4 +44,12 @@ $ pip install -r requirements.txt
$ python loader/image_loader.py
```
注:使用 `langchain.document_loaders.UnstructuredFileLoader` 进行非结构化文件接入时,可能需要依据文档进行其他依赖包的安装,请参考 [langchain 文档](https://python.langchain.com/en/latest/modules/indexes/document_loaders/examples/unstructured_file.html)
## llama-cpp模型调用的说明
1. 首先从huggingface hub中下载对应的模型,如https://huggingface.co/vicuna/ggml-vicuna-13b-1.1/的[ggml-vic13b-q5_1.bin](https://huggingface.co/vicuna/ggml-vicuna-13b-1.1/blob/main/ggml-vic13b-q5_1.bin),建议使用huggingface_hub库的snapshot_download下载。
2. 将下载的模型重命名。通过huggingface_hub下载的模型会被重命名为随机序列,因此需要重命名为原始文件名,如[ggml-vic13b-q5_1.bin](https://huggingface.co/vicuna/ggml-vicuna-13b-1.1/blob/main/ggml-vic13b-q5_1.bin)
3. 基于下载模型的ggml的加载时间,推测对应的llama-cpp版本,下载对应的llama-cpp-python库的wheel文件,实测[ggml-vic13b-q5_1.bin](https://huggingface.co/vicuna/ggml-vicuna-13b-1.1/blob/main/ggml-vic13b-q5_1.bin)与llama-cpp-python库兼容,然后手动安装wheel文件。
4. 将下载的模型信息写入configs/model_config.py文件里 `llm_model_dict`中,注意保证参数的兼容性,一些参数组合可能会报错.
......@@ -23,6 +23,7 @@ def _build_message_template() -> Dict[str, str]:
class FastChatOpenAILLM(RemoteRpcModel, LLM, ABC):
api_base_url: str = "http://localhost:8000/v1"
model_name: str = "chatglm-6b"
max_token: int = 10000
......@@ -31,8 +32,14 @@ class FastChatOpenAILLM(RemoteRpcModel, LLM, ABC):
checkPoint: LoaderCheckPoint = None
history = []
history_len: int = 10
def __init__(self, checkPoint: LoaderCheckPoint = None):
api_key: str = ""
def __init__(self,
checkPoint: LoaderCheckPoint = None,
# api_base_url:str="http://localhost:8000/v1",
# model_name:str="chatglm-6b",
# api_key:str=""
):
super().__init__()
self.checkPoint = checkPoint
......@@ -60,7 +67,7 @@ class FastChatOpenAILLM(RemoteRpcModel, LLM, ABC):
return self.api_base_url
def set_api_key(self, api_key: str):
pass
self.api_key = api_key
def set_api_base_url(self, api_base_url: str):
self.api_base_url = api_base_url
......@@ -73,7 +80,8 @@ class FastChatOpenAILLM(RemoteRpcModel, LLM, ABC):
try:
import openai
# Not support yet
openai.api_key = "EMPTY"
# openai.api_key = "EMPTY"
openai.key = self.api_key
openai.api_base = self.api_base_url
except ImportError:
raise ValueError(
......@@ -116,7 +124,8 @@ class FastChatOpenAILLM(RemoteRpcModel, LLM, ABC):
try:
import openai
# Not support yet
openai.api_key = "EMPTY"
# openai.api_key = "EMPTY"
openai.api_key = self.api_key
openai.api_base = self.api_base_url
except ImportError:
raise ValueError(
......
......@@ -6,14 +6,17 @@ import torch
import transformers
from transformers.generation.logits_process import LogitsProcessor
from transformers.generation.utils import LogitsProcessorList, StoppingCriteriaList
from typing import Optional, List, Dict, Any
from typing import Optional, List, Dict, Any,Union
from models.loader import LoaderCheckPoint
from models.base import (BaseAnswer,
AnswerResult)
class InvalidScoreLogitsProcessor(LogitsProcessor):
def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor) -> torch.FloatTensor:
def __call__(self, input_ids: Union[torch.LongTensor,list], scores: Union[torch.FloatTensor,list]) -> torch.FloatTensor:
# llama-cpp模型返回的是list,为兼容性考虑,需要判断input_ids和scores的类型,将list转换为torch.Tensor
input_ids = torch.tensor(input_ids) if isinstance(input_ids,list) else input_ids
scores = torch.tensor(scores) if isinstance(scores,list) else scores
if torch.isnan(scores).any() or torch.isinf(scores).any():
scores.zero_()
scores[..., 5] = 5e4
......@@ -163,8 +166,21 @@ class LLamaLLM(BaseAnswer, LLM, ABC):
self.stopping_criteria = transformers.StoppingCriteriaList()
# 观测输出
gen_kwargs.update({'stopping_criteria': self.stopping_criteria})
output_ids = self.checkPoint.model.generate(**gen_kwargs)
# llama-cpp模型的参数与transformers的参数字段有较大差异,直接调用会返回不支持的字段错误
# 因此需要先判断模型是否是llama-cpp模型,然后取gen_kwargs与模型generate方法字段的交集
# 仅将交集字段传给模型以保证兼容性
# todo llama-cpp模型在本框架下兼容性较差,后续可以考虑重写一个llama_cpp_llm.py模块
if "llama_cpp" in self.checkPoint.model.__str__():
import inspect
common_kwargs_keys = set(inspect.getfullargspec(self.checkPoint.model.generate).args)&set(gen_kwargs.keys())
common_kwargs = {key:gen_kwargs[key] for key in common_kwargs_keys}
#? llama-cpp模型的generate方法似乎只接受.cpu类型的输入,响应很慢,慢到哭泣
#?为什么会不支持GPU呢,不应该啊?
output_ids = torch.tensor([list(self.checkPoint.model.generate(input_id_i.cpu(),**common_kwargs)) for input_id_i in input_ids])
else:
output_ids = self.checkPoint.model.generate(**gen_kwargs)
new_tokens = len(output_ids[0]) - len(input_ids[0])
reply = self.decode(output_ids[0][-new_tokens:])
print(f"response:{reply}")
......
......@@ -67,9 +67,11 @@ class LoaderCheckPoint:
self.load_in_8bit = params.get('load_in_8bit', False)
self.bf16 = params.get('bf16', False)
def _load_model_config(self, model_name):
if self.model_path:
self.model_path = re.sub("\s","",self.model_path)
checkpoint = Path(f'{self.model_path}')
else:
if not self.no_remote_model:
......@@ -78,10 +80,12 @@ class LoaderCheckPoint:
raise ValueError(
"本地模型local_model_path未配置路径"
)
model_config = AutoConfig.from_pretrained(checkpoint, trust_remote_code=True)
return model_config
try:
model_config = AutoConfig.from_pretrained(checkpoint, trust_remote_code=True)
return model_config
except Exception as e:
print(e)
return checkpoint
def _load_model(self, model_name):
"""
......@@ -93,6 +97,7 @@ class LoaderCheckPoint:
t0 = time.time()
if self.model_path:
self.model_path = re.sub("\s","",self.model_path)
checkpoint = Path(f'{self.model_path}')
else:
if not self.no_remote_model:
......@@ -103,7 +108,7 @@ class LoaderCheckPoint:
)
self.is_llamacpp = len(list(Path(f'{checkpoint}').glob('ggml*.bin'))) > 0
if 'chatglm' in model_name.lower():
if 'chatglm' in model_name.lower() or "chatyuan" in model_name.lower():
LoaderClass = AutoModel
else:
LoaderClass = AutoModelForCausalLM
......@@ -126,8 +131,14 @@ class LoaderCheckPoint:
.half()
.cuda()
)
# 支持自定义cuda设备
elif ":" in self.llm_device:
model = LoaderClass.from_pretrained(checkpoint,
config=self.model_config,
torch_dtype=torch.bfloat16 if self.bf16 else torch.float16,
trust_remote_code=True).half().to(self.llm_device)
else:
from accelerate import dispatch_model
from accelerate import dispatch_model,infer_auto_device_map
model = LoaderClass.from_pretrained(checkpoint,
config=self.model_config,
......@@ -140,7 +151,13 @@ class LoaderCheckPoint:
elif 'moss' in model_name.lower():
self.device_map = self.moss_auto_configure_device_map(num_gpus, model_name)
else:
self.device_map = self.chatglm_auto_configure_device_map(num_gpus)
# 对于chaglm和moss意外的模型应使用自动指定,而非调用chatglm的配置方式
# 其他模型定义的层类几乎不可能与chatglm和moss一致,使用chatglm_auto_configure_device_map
# 百分百会报错,使用infer_auto_device_map虽然可能导致负载不均衡,但至少不会报错
# 实测在bloom模型上如此
self.device_map = infer_auto_device_map(model,
dtype=torch.int8,
no_split_module_classes=model._no_split_modules)
model = dispatch_model(model, device_map=self.device_map)
else:
......@@ -156,7 +173,7 @@ class LoaderCheckPoint:
elif self.is_llamacpp:
try:
from models.extensions.llamacpp_model_alternative import LlamaCppModel
from llama_cpp import Llama
except ImportError as exc:
raise ValueError(
......@@ -167,7 +184,16 @@ class LoaderCheckPoint:
model_file = list(checkpoint.glob('ggml*.bin'))[0]
print(f"llama.cpp weights detected: {model_file}\n")
model, tokenizer = LlamaCppModel.from_pretrained(model_file)
model = Llama(model_path=model_file._str)
# 实测llama-cpp-vicuna13b-q5_1的AutoTokenizer加载tokenizer的速度极慢,应存在优化空间
# 但需要对huggingface的AutoTokenizer进行优化
# tokenizer = model.tokenizer
# todo 此处调用AutoTokenizer的tokenizer,但后续可以测试自带tokenizer是不是兼容
#* -> 自带的tokenizer不与transoformers的tokenizer兼容,无法使用
tokenizer = AutoTokenizer.from_pretrained(self.model_name)
return model, tokenizer
elif self.load_in_8bit:
......@@ -396,7 +422,7 @@ class LoaderCheckPoint:
print(
"如果您使用的是 macOS 建议将 pytorch 版本升级至 2.0.0 或更高版本,以支持及时清理 torch 产生的内存占用。")
elif torch.has_cuda:
device_id = "0" if torch.cuda.is_available() else None
device_id = "0" if torch.cuda.is_available() and (":" not in self.llm_device) else None
CUDA_DEVICE = f"{self.llm_device}:{device_id}" if device_id else self.llm_device
with torch.cuda.device(CUDA_DEVICE):
torch.cuda.empty_cache()
......@@ -443,5 +469,6 @@ class LoaderCheckPoint:
self.model.transformer.prefix_encoder.float()
except Exception as e:
print("加载PrefixEncoder模型参数失败")
self.model = self.model.eval()
# llama-cpp模型(至少vicuna-13b)的eval方法就是自身,其没有eval方法
if not self.is_llamacpp:
self.model = self.model.eval()
......@@ -6,7 +6,7 @@ from models.base import (BaseAnswer,
AnswerResult)
import torch
# todo 建议重写instruction,在该instruction下,各模型的表现比较差
META_INSTRUCTION = \
"""You are an AI assistant whose name is MOSS.
- MOSS is a conversational language model that is developed by Fudan University. It is designed to be helpful, honest, and harmless.
......@@ -20,7 +20,7 @@ META_INSTRUCTION = \
Capabilities and tools that MOSS can possess.
"""
# todo 在MOSSLLM类下,各模型的响应速度很慢,后续要检查一下原因
class MOSSLLM(BaseAnswer, LLM, ABC):
max_token: int = 2048
temperature: float = 0.7
......@@ -42,10 +42,11 @@ class MOSSLLM(BaseAnswer, LLM, ABC):
return self.checkPoint
@property
def set_history_len(self) -> int:
def _history_len(self) -> int:
return self.history_len
def _set_history_len(self, history_len: int) -> None:
def set_history_len(self, history_len: int) -> None:
self.history_len = history_len
def _call(self, prompt: str, stop: Optional[List[str]] = None) -> str:
......@@ -59,11 +60,13 @@ class MOSSLLM(BaseAnswer, LLM, ABC):
prompt_w_history = str(history)
prompt_w_history += '<|Human|>: ' + prompt + '<eoh>'
else:
prompt_w_history = META_INSTRUCTION
prompt_w_history = META_INSTRUCTION.replace("MOSS", self.checkPoint.model_name.split("/")[-1])
prompt_w_history += '<|Human|>: ' + prompt + '<eoh>'
inputs = self.checkPoint.tokenizer(prompt_w_history, return_tensors="pt")
with torch.no_grad():
# max_length似乎可以设的小一些,而repetion_penalty应大一些,否则chatyuan,bloom等模型为满足max会重复输出
#
outputs = self.checkPoint.model.generate(
inputs.input_ids.cuda(),
attention_mask=inputs.attention_mask.cuda(),
......
......@@ -44,4 +44,5 @@ def loaderLLM(llm_model: str = None, no_remote_model: bool = False, use_ptuning_
if 'FastChatOpenAILLM' in llm_model_info["provides"]:
modelInsLLM.set_api_base_url(llm_model_info['api_base_url'])
modelInsLLM.call_model_name(llm_model_info['name'])
modelInsLLM.set_api_key(llm_model_info['api_key'])
return modelInsLLM
......@@ -23,9 +23,13 @@ openai
#accelerate~=0.18.0
#peft~=0.3.0
#bitsandbytes; platform_system != "Windows"
#llama-cpp-python==0.1.34; platform_system != "Windows"
#https://github.com/abetlen/llama-cpp-python/releases/download/v0.1.34/llama_cpp_python-0.1.34-cp310-cp310-win_amd64.whl; platform_system == "Windows"
# 要调用llama-cpp模型,如vicuma-13b量化模型需要安装llama-cpp-python库
# but!!! 实测pip install 不好使,需要手动从ttps://github.com/abetlen/llama-cpp-python/releases/下载
# 而且注意不同时期的ggml格式并不!兼!容!!!因此需要安装的llama-cpp-python版本也不一致,需要手动测试才能确定
# 实测ggml-vicuna-13b-1.1在llama-cpp-python 0.1.63上可正常兼容
# 不过!!!本项目模型加载的方式控制的比较严格,与llama-cpp-python的兼容性较差,很多参数设定不能使用,
# 建议如非必要还是不要使用llama-cpp
torch~=2.0.0
pydantic~=1.10.7
starlette~=0.26.1
......
Markdown 格式
0%
您添加了 0 到此讨论。请谨慎行事。
请先完成此评论的编辑!
注册 或者 后发表评论