知识图谱
实现基于知识图谱的RAG(Retrieval-Augmented Generation)流程。它结合了大模型(LLM)、图数据库(如NetworkX或OpenGauss)、向量检索、重排序等多种技术,实现了文档的结构化知识抽取、图谱构建、概念聚类、向量化检索和多跳推理等能力。该模块适用于复杂知识问答、企业知识管理、智能检索等场景。
知识图谱利用大型语言模型(LLM)通过结构化提示实现关系抽取、节点和关系的概念化。为了方便您在默认提示的基础上进行自定义修改,以下提供中文和英文版本的默认提示供您参考:
中文默认提示:
-
三元组抽取提示(TRIPLE_INSTRUCTION_CN):
TRIPLE_INSTRUCTIONS_CN = { "entity_relation": """ ## 目标 请从以下文本中提取所有重要实体及其关系,并严格遵守以下规则: ## 要求 1. 实体必须为名词,尽量简洁; 2. 关系必须为一个动词,准确描述“头实体”与“尾实体”之间的具体联系,且不得重复头、尾实体的字面信息; 3. 头实体与尾实体均不得为“是”,不得使用代词; 4. 实体和关系不能为空字符串,不能为仅包含标点符号的字符串; 5. 输出必须采用下列 JSON 格式,禁止添加、删除或修改任何字段: [ { "头实体": "{名词}", "关系": "{动词}", "尾实体": "{名词}" } ] ## 示例 [ { "头实体": "中国", "关系": "首都", "尾实体": "北京", }, { "头实体": "小狗", "关系": "喜欢", "尾实体": "骨头", }, { "头实体": "毛泽东", "关系": "父亲", "尾实体": "毛岸英", }, { "头实体": "中国船舶工业物资云贵有限公司", "关系": "成立", "尾实体": "1990月05月31日", }, { "头实体": "公司", "关系": "地址", "尾实体": "云南省昆明市", }, { "头实体": "公司", "关系": "经营", "尾实体": "电子器件", }, { "头实体": "1999年", "关系": "早于", "尾实体": "2000年", }, { "头实体": "2001年", "关系": "晚于", "尾实体": "2000年", } ] ## 待分析文本 """, "event_entity": """ ## 目标 请对以下段落逐句进行事件抽取,并识别每个事件所涉及的全部实体。 ## 要求 1. 一句视为一个独立事件,保留原句,不做任何省略; 2. 列出每个事件直接参与的所有实体,不重复、不遗漏; 3. 输出严格使用下方 JSON 格式,不允许添加或删减字段。 ## JSON 格式 [ { "事件": "{原句}", "实体": ["实体1", "实体2", "..."] } ] ## 待分析文本 """, "event_relation": """ ## 目标 请对以下段落逐句抽取事件,并识别它们之间的时间或因果关系。 ## 要求 1. 一句视为一个独立事件,保留原句,不做任何省略。 2. 仅使用指定关系类型:在之前、在之后、同时、因为、结果。 3. 每个三元组中的“头事件”与“尾事件”均须为段落中完整原句,且语义对应具体、可独立理解。 4. “头事件”和“尾事件”不能为空字符串,且不能重叠; 5. 关系不能为空字符串; 6. 输出严格使用下方 JSON 格式,不允许添加、删减或省略任何字段。 ## JSON 格式 [ { "头事件": "{事件1完整原句}", "关系": "{在之前|在之后|同时|因为|结果}", "尾事件": "{事件2完整原句}" } ] ## 待分析文本 """, } -
实体概念化提示 (ENTITY_PROMPT_CN):
ENTITY_PROMPT_CN = ''' ## 目标 给定一个实体及其背景,提供多个短语来表示该实体的抽象概念。 ## 输出格式 短语1, 短语2, 短语3,... ## 要求 * 短语应准确代表实体,可以是其类型或相关概念。 * 短语包含1-2个词。 * 短语不能包含空格和换行符,不能包含标点符号,但可以包含连字符。 * 严格遵循输出格式,不添加任何额外字符。 * 尽可能提供3到10个不同抽象层次的短语。 * 不重复使用与实体或已有短语相同的词语。 * 如果无法生成更多短语,请立即停止。 ## 示例 实体:灵魂 背景:在BFI伦敦电影节首映,成为皮克斯票房最高的影片 概念:电影, 影片 实体:ThinkPad X60 背景:理查德·斯托曼宣布他在ThinkPad X60上使用Trisquel 概念:ThinkPad, 笔记本, 机器, 设备, 硬件, 电脑, 品牌 实体:哈里·卡拉汉 背景:欺骗另一个抢劫犯,折磨天蝎座 概念:人, 美国人, 角色, 警察, 侦探 实体:黑山学院 背景:由约翰·安德鲁·赖斯创立,吸引了教师 概念:学院, 大学, 学校, 文理学院 ## 待分析文本 实体:[ENTITY] 背景:[CONTEXT] 概念: ''' -
事件概念化提示 (EVENT_PROMPT_CN):
EVENT_PROMPT_CN = ''' ## 目标 给定一个事件,提供多个短语来表示该事件的抽象概念。 ## 输出格式 短语1, 短语2, 短语3,... ## 要求 * 短语应准确代表事件,可以是其类型或相关概念。 * 短语包含1-2个词。 * 短语不能包含空格和换行符,不能包含标点符号,但可以包含连字符。 * 严格遵循输出格式,不添加任何额外字符。 * 尽可能提供3到10个不同抽象层次的短语。 * 不重复使用与事件或已有短语相同的词语。 * 如果无法生成更多短语,请立即停止。 ## 示例 事件:一名男子隐居于山林 概念:隐居, 放松, 逃避, 自然, 孤独 事件:一只猫追逐猎物进入它的藏身之处 概念:狩猎, 逃避, 捕食, 躲藏, 潜行 事件:山姆和他的狗玩耍 概念:放松的活动, 爱抚, 玩耍, 联结, 友谊 事件:中国船舶工业物资云贵有限公司,成立日期:1990年3月31日成立,住所:云南省昆明市 概念:公司成立, 公司地点, 公司住所, 省, 市, 成立时间, 年月日 ## 待分析文本 事件:[EVENT] 概念: ''' -
关系概念化提示 (RELATION_PROMPT_CN):
RELATION_PROMPT_CN = ''' ## 目标 给定一个关系,提供多个短语来表示该关系的抽象概念。 ## 输出格式 短语1, 短语2, 短语3,... ## 要求 * 短语应准确代表关系,可以是其类型或相关概念。 * 短语包含1-2个词。 * 短语不能包含空格和换行符,不能包含标点符号,但可以包含连字符。 * 严格遵循输出格式,不添加任何额外字符。 * 尽可能提供3到10个不同抽象层次的短语。 * 不重复使用与关系或已有短语相同的词语。 * 如果无法生成更多短语,请立即停止。 ## 示例 关系:参与 概念:成为一部分, 参加, 投入, 涉及 关系:被包括在内 概念:加入, 成为一部分, 成为成员, 成为组成部分 ## 待分析文本 关系:[RELATION] 概念: '''
英文默认提示:
-
三元组抽取提示(TRIPLE_INSTRUCTIONS_EN):
TRIPLE_INSTRUCTIONS_EN = { "entity_relation": """Given a passage, summarize all the important entities and the relations between them in a concise manner. Relations should briefly capture the connections between entities, without repeating information from the head and tail entities. The entities should be as specific as possible. Exclude pronouns from being considered as entities. The output should strictly adhere to the following JSON format: [ { "Head": "{a noun}", "Relation": "{a verb}", "Tail": "{a noun}" }, { "Head": "China", "Relation": "Capital", "Tail": "Beijing" }, { "Head": "Dog", "Relation": "like", "Tail": "bone" }, { "Head": "Mao Zedong", "Relation": "Father", "Tail": "Mao Anying" }, { "Head": "China Shipbuilding Materials Yungui Co., Ltd.", "Relation": "Established", "Tail": "May 31, 1990" }, { "Head": "Company", "Relation": "Address", "Tail": "Kunming City, Yunnan Province" }, { "Head": "Company", "Relation": "Operation", "Tail": "Electronics" }, { "Head": "Year 1999", "Relation": "Before", "Tail": "Year 2000" }, { "Head": "Year 2001", "Relation": "After", "Tail": "Year 2000" }, ]""", "event_entity": """Please analyze and summarize the participation relations between the events and entities in the given paragraph. Each event is a single independent sentence. Additionally, identify all the entities that participated in the events. Do not use ellipses. Please strictly output in the following JSON format: [ { "Event": "{a simple sentence describing an event}", "Entity": ["entity 1", "entity 2", "..."] }... ] """, "event_relation": """Please analyze and summarize the relationships between the events in the paragraph. Each event is a single independent sentence. Identify temporal and causal relationships between the events using the following types: before, after, at the same time, because, and as a result. Each extracted triple should be specific, meaningful, and able to stand alone. Do not use ellipses. The output should strictly adhere to the following JSON format: [ { "Head": "{a simple sentence describing the event 1}", "Relation": "{temporal or causality relation between the events}", "Tail": "{a simple sentence describing the event 2}" }... ]""" } -
实体概念化提示 (ENTITY_PROMPT_EN):
ENTITY_PROMPT_EN = '''I will give you an ENTITY. You need to give several phrases containing 1-2 words for the ABSTRACT ENTITY of this ENTITY. You must return your answer in the following format: phrases1, phrases2, phrases3,... You can't return anything other than answers. These abstract intention words should fulfill the following requirements. 1. The ABSTRACT ENTITY phrases can well represent the ENTITY, and it could be the type of the ENTITY or the related concepts of the ENTITY. 2. Strictly follow the provided format, do not add extra characters or words. 3. Write at least 3 or more phrases at different abstract level if possible. 4. Do not repeat the same word and the input in the answer. 5. Stop immediately if you can't think of any more phrases, and no explanation is needed. ENTITY: Soul CONTEXT: premiered BFI London Film Festival, became highest-grossing Pixar release Your answer: movie, film ENTITY: ThinkPad X60 CONTEXT: Richard Stallman announced he is using Trisquel on a ThinkPad X60 Your answer: ThinkPad, laptop, machine, device, hardware, computer, brand ENTITY: Harry Callahan CONTEXT: bluffs another robber, tortures Scorpio Your answer: person, American, character, police officer, detective ENTITY: Black Mountain College CONTEXT: was started by John Andrew Rice, attracted faculty Your answer: college, university, school, liberal arts college EVENT: 1st April CONTEXT: Utkal Dibas celebrates Your answer: date, day, time, festival ENTITY: [ENTITY] CONTEXT: [CONTEXT] Your answer: ''' -
事件概念化提示 (EVENT_PROMPT_EN):
EVENT_PROMPT_EN = '''I will give you an EVENT. You need to give several phrases containing 1-2 words for the ABSTRACT EVENT of this EVENT. You must return your answer in the following format: phrases1, phrases2, phrases3,... You can't return anything other than answers. These abstract event words should fulfill the following requirements. 1. The ABSTRACT EVENT phrases can well represent the EVENT, and it could be the type of the EVENT or the related concepts of the EVENT. 2. Strictly follow the provided format, do not add extra characters or words. 3. Write at least 3 or more phrases at different abstract level if possible. 4. Do not repeat the same word and the input in the answer. 5. Stop immediately if you can't think of any more phrases, and no explanation is needed. EVENT: A man retreats to mountains and forests Your answer: retreat, relaxation, escape, nature, solitude EVENT: A cat chased a prey into its shelter Your answer: hunting, escape, predation, hiding, stalking EVENT: Sam playing with his dog Your answer: relaxing event, petting, playing, bonding, friendship EVENT: [EVENT] Your answer: ''' -
关系概念化提示 (RELATION_PROMPT_EN):
RELATION_PROMPT_EN = '''I will give you an RELATION. You need to give several phrases containing 1-2 words for the ABSTRACT RELATION of this RELATION. You must return your answer in the following format: phrases1, phrases2, phrases3,... You can't return anything other than answers. These abstract intention words should fulfill the following requirements. 1. The ABSTRACT RELATION phrases can well represent the RELATION, and it could be the type of the RELATION or the simplest concepts of the RELATION. 2. Strictly follow the provided format, do not add extra characters or words. 3. Write at least 3 or more phrases at different abstract level if possible. 4. Do not repeat the same word and the input in the answer. 5. Stop immediately if you can't think of any more phrases, and no explanation is needed. RELATION: participated in Your answer: become part of, attend, take part in, engage in, involve in RELATION: be included in Your answer: join, be a part of, be a member of, be a component of RELATION: [RELATION] Your answer: '''
GraphRAGPipeline
类功能
提供知识图谱创建、检索的统一入口。
from mx_rag.graphrag import GraphRAGPipeline
GraphRAGPipeline(work_dir, llm, embedding_model, dim, rerank_model, graph_type,graph_name, encrypt_fn,decrypt_fn,kwargs)
| 参数名 | 数据类型 | 可选/必选 | 说明 |
|---|---|---|---|
| work_dir | str | 必选 | 知识图谱工作目录,其剩余空间至少为5GB,保存生成的图json中间文件,如果使用的MindFAISS,对应的向量数据也在该路径下。 不能为相对路径,路径长度不能超过1024,不能为软链接且不允许存在".."。 路径不能在路径列表中:["/etc", "/usr/bin", "/usr/lib", "/usr/lib64", "/sys/", "/dev/", "/sbin", "/tmp"]。 |
| llm | Text2TextLLM | 必选 | 大模型接口实例对象。 |
| embedding_model | Embeddings | 必选 | langchain_core.embeddings.Embeddings的子类,取值包含: |
| dim | int | 必选 | 嵌入模型生成的向量维度,其取值范围为[1, 1024 * 1024]。 |
| rerank_model | Reranker | 可选 | mx_rag_reranker.Reranker的子类,默认为None,取值包含: |
| graph_type | str | 可选 | 图数据库类型,默认为“networkx”,其取值仅支持["networkx", "opengauss"]。 |
| graph_name | str | 可选 | 知识图谱名称,默认为“graph”,其取值范围为[1, 255],只能由标识符组成。 |
| encrypt_fn | Callable | 可选 | 回调方法,对调用build_graph产生的json文件内容加密。请注意提供正确加密方法并保证安全性,返回值是加密后的字符串。 如果上传的文档涉及银行卡号、身份证号、护照号、口令等个人数据,请配置该参数保证个人数据安全。 |
| decrypt_fn | Callable | 可选 | 回调方法,在graph_type为"networkx"时,在检索时会对"{graph_name}.json"解密读取。请注意提供正确解密方法并保证安全性,返回值是解密后的字符串。 |
| kwargs | Dict | 可选 | 扩展参数列表: age_graph由用户控制传入,请使用安全的连接方式。 |
GraphRAGPipeline对象。
import getpass
from paddle.base import libpaddle # fix std::bad_alloc
from langchain_opengauss import OpenGaussSettings, openGaussAGEGraph
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import TextLoader
from mx_rag.chain.single_text_to_text import GraphRagText2TextChain
from mx_rag.document import LoaderMng
from mx_rag.embedding.local import TextEmbedding
from mx_rag.graphrag import GraphRAGPipeline
from mx_rag.llm import LLMParameterConfig, Text2TextLLM
from mx_rag.reranker.local import LocalReranker
from mx_rag.utils import ClientParam
work_dir = "test_pipeline"
llm = Text2TextLLM(
base_url="https://x.x.x.x:port/v1/chat/completions",
model_name="model_name",
llm_config=LLMParameterConfig(max_tokens=64 * 1024, temperature=0.6, top_p=0.9),
client_param=ClientParam(timeout=180, ca_file="/path/to/ca.crt"),
)
rerank_model = LocalReranker("/data/models/bge-reranker-v2-m3/", 0, 20, False)
embedding_model = TextEmbedding.create(model_path="/data/models/bge-large-en-v1.5")
data_load_mng = LoaderMng()
data_load_mng.register_loader(TextLoader, [".txt"])
data_load_mng.register_splitter(
RecursiveCharacterTextSplitter,
[".txt"],
dict(chunk_size=512, chunk_overlap=20)
)
graph_name = "hotpotqa"
graph_type = "opengauss"
conf = OpenGaussSettings(user="gaussdb",
password=getpass.getpass(),
host="x.x.x.x",
port="x",
database="postgres")
age_graph = openGaussAGEGraph(graph_name, conf,
sslmode="verify-ca",
sslcert="client.crt",
sslkey="client.key",
sslrootcert="cacert.pem")
pipeline = GraphRAGPipeline(work_dir, llm, embedding_model, 1024, rerank_model, graph_name=graph_name,
age_graph=age_graph)
pipeline.upload_files(["./test_graph/hotpotqa.500.txt"], data_load_mng)
pipeline.build_graph()
question = "Which case was brought to court first Miller v. California or Gates v. Collier ?"
contexts = pipeline.retrieve_graph(question)
text2text_chain = GraphRagText2TextChain(
llm=llm,
retriever=pipeline.as_retriever(),
reranker=rerank_model)
result = text2text_chain.query(question)
print(f"#contexts: {len(contexts)}")
print(contexts)
print(result)
upload_files
调用此函数上传创建知识图谱所需要的文档列表。
def upload_files(file_list, loader_mng)
| 参数名 | 数据类型 | 是否必选 | 说明 |
|---|---|---|---|
| file_list | list | 是 | 文件列表,注意同一批次文档列表只支持同一种语言类型,一次上传过多文档构建知识图谱时会比较慢,文档个数限制[1, 100]。 单个文档路径长度取值范围为[1, 1024],文档路径不能为软链接且不允许存在"..",且每个文档大小不超过10G。 |
| loader_mng | LoaderMng | 是 | 提供文档解析函数的管理类对象,数据类型参见LoaderMng。 |
无
build_graph
调用此函数创建文本节点索引以及生成对应文本的知识图谱。
def build_graph(lang, **kwargs)
| 参数名 | 数据类型 | 是否必选 | 说明 |
|---|---|---|---|
| lang | Lang | 否 | 语料所用的语言,默认取值为Lang.EN,即英文语料。 |
| kwargs | dict | 否 | 扩展参数列表:
|
无
方法执行后会在work_dir下生成过程文件:
表 1
| 文件名 | 说明 |
|---|---|
| "{graph_name}.json" | 用于保存图,graph_type为"networkx"时,检索会通过该文件加载图。 |
| "{graph_name}_relations.json" | 保存实体关系信息。 |
| "{graph_name}_concepts.json" | 保存概念信息。 |
| "{graph_name}_synset.json" | 保存概念聚类之后的类别信息。 |
| "{graph_name}_node_vectors.index" | 实体的向量索引文件。 |
| "{graph_name}_concept_vectors.index" | 概念的向量索引文件。 |
retrieve_graph
调用该接口检索返回相关文档片段。
def retrieve_graph(question, **kwargs)
| 参数名 | 数据类型 | 是否必选 | 说明 |
|---|---|---|---|
| question | str | 是 | 用户问题,字符串长度范围[1, 1000*1000] |
| kwargs | dict | 否 | 扩展参数列表: |
| 数据类型 | 说明 |
|---|---|
| List[str] | 检索到的上下文片段。 |
as_retriever
返回检索器。
def as_retriever(**kwargs)
| 参数名 | 数据类型 | 是否必选 | 说明 |
|---|---|---|---|
| kwargs | dict | 否 | 扩展参数列表: |
| 数据类型 | 说明 |
|---|---|
| GraphRetriever | 该检索器继承自langchain_core.retrievers.BaseRetriever。 |
GraphEvaluator
类功能
用于对知识图谱的质量进行评估。
from mx_rag.graphrag import GraphEvaluator
GraphEvaluator(llm, llm_config)
| 参数名 | 数据类型 | 可选/必选 | 说明 |
|---|---|---|---|
| llm | Text2TextLLM | 必选 | 大模型接口实例对象。 |
| llm_config | LLMParameterConfig | 必选 | 描述参见LLMParameterConfig。 |
GraphEvaluator对象。
import json
from paddle.base import libpaddle
from mx_rag.graphrag.graph_evaluator import GraphEvaluator
from mx_rag.llm import Text2TextLLM, LLMParameterConfig
from mx_rag.utils import ClientParam
llm_config = LLMParameterConfig(temperature=0.5, top_p=0.8, max_tokens=8192)
llm = Text2TextLLM(
base_url="https://ip:port/v1/chat/completions",
model_name="Llama3-8B-Chinese-Chat",
llm_config=llm_config,
client_param=ClientParam(ca_file="/path/to/ca.crt", timeout=120),
)
graph_evaluator = GraphEvaluator(llm, llm_config)
relations_path = "/path/to/graph_relations.json"
with open(relations_path, "r", encoding="utf-8") as f:
relations = json.load(f)
graph_evaluator.evaluate(relations)
evaluate
评估大语言模型抽取的三元组关系。
def evaluate(relations)
| 参数名 | 数据类型 | 是否必选 | 说明 |
|---|---|---|---|
| relations | list[dict] | 是 | 知识图谱关系列表,每个元素为一个字典,必须包括如下键:raw_text,键值entity_relations、event_entity_relations、event_relations可选。列表长度范围[1, 50000],嵌套深度最高为5,文本长度限制4096。 |
无。函数会打印输出三组精度、召回以及F1得分,分别对应实体,事件中的实体以及事件的抽取情况。输出的得分越高,抽取的质量越好。