当前位置:首页 > 物联网 > 智能应用
[导读]通常,我们开发基于 LLM 的检索应用程序的知识库包含大量各种格式的数据。为了向LLM提供最相关的上下文来回答知识库中特定部分的问题,我们依赖于对知识库中的文本进行分块并将其放在方便的位置。

通常,我们开发基于 LLM 的检索应用程序的知识库包含大量各种格式的数据。为了向LLM提供最相关的上下文来回答知识库中特定部分的问题,我们依赖于对知识库中的文本进行分块并将其放在方便的位置。

分块

分块是将文本分割成有意义的单元以改进信息检索的过程。通过确保每个块代表一个集中的想法或观点,分块有助于保持内容的上下文完整性。

在本文中,我们将讨论分块的三个方面:

· 糟糕的分块如何导致结果相关性降低

· 良好的分块如何带来更好的结果

· 如何通过元数据进行良好的分块,从而获得具有良好语境的结果

为了有效地展示分块的重要性,我们将采用同一段文本,对其应用 3 种不同的分块方法,并检查如何根据查询检索信息。

分块并存储至 Qdrant

让我们看看下面的代码,它展示了对同一文本进行分块的三种不同方法。

Python

import qdrant_client

from qdrant_client.models import PointStruct, Distance, VectorParams

import openai

import yaml

# Load configuration

with open('config.yaml', 'r') as file:

config = yaml.safe_load(file)

# Initialize Qdrant client

client = qdrant_client.QdrantClient(config['qdrant']['url'], api_key=config['qdrant']['api_key'])

# Initialize OpenAI with the API key

openai.api_key = config['openai']['api_key']

def embed_text(text):

print(f"Generating embedding for: '{text[:50]}'...") # Show a snippet of the text being embedded

response = openai.embeddings.create(

input=[text], # Input needs to be a list

model=config['openai']['model_name']

)

embedding = response.data[0].embedding # Access using the attribute, not as a dictionary

print(f"Generated embedding of length {len(embedding)}.") # Confirm embedding generation

return embedding

# Function to create a collection if it doesn't exist

def create_collection_if_not_exists(collection_name, vector_size):

collections = client.get_collections().collections

if collection_name not in [collection.name for collection in collections]:

client.create_collection(

collection_name=collection_name,

vectors_config=VectorParams(size=vector_size, distance=Distance.COSINE)

)

print(f"Created collection: {collection_name} with vector size: {vector_size}") # Collection creation

else:

print(f"Collection {collection_name} already exists.") # Collection existence check

# Text to be chunked which is flagged for AI and Plagiarism but is just used for illustration and example.

text = """

Artificial intelligence is transforming industries across the globe. One of the key areas where AI is making a significant impact is healthcare. AI is being used to develop new drugs, personalize treatment plans, and even predict patient outcomes. Despite these advancements, there are challenges that must be addressed. The ethical implications of AI in healthcare, data privacy concerns, and the need for proper regulation are all critical issues. As AI continues to evolve, it is crucial that these challenges are not overlooked. By addressing these issues head-on, we can ensure that AI is used in a way that benefits everyone.

"""

# Poor Chunking Strategy

def poor_chunking(text, chunk_size=40):

chunks = [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

print(f"Poor Chunking produced {len(chunks)} chunks: {chunks}") # Show chunks produced

return chunks

# Good Chunking Strategy

def good_chunking(text):

import re

sentences = re.split(r'(?<=[.!?]) +', text)

print(f"Good Chunking produced {len(sentences)} chunks: {sentences}") # Show chunks produced

return sentences

# Good Chunking with Metadata

def good_chunking_with_metadata(text):

chunks = good_chunking(text)

metadata_chunks = []

for chunk in chunks:

if "healthcare" in chunk:

metadata_chunks.append({"text": chunk, "source": "Healthcare Section", "topic": "AI in Healthcare"})

elif "ethical implications" in chunk or "data privacy" in chunk:

metadata_chunks.append({"text": chunk, "source": "Challenges Section", "topic": "AI Challenges"})

else:

metadata_chunks.append({"text": chunk, "source": "General", "topic": "AI Overview"})

print(f"Good Chunking with Metadata produced {len(metadata_chunks)} chunks: {metadata_chunks}") # Show chunks produced

return metadata_chunks

# Store chunks in Qdrant

def store_chunks(chunks, collection_name):

if len(chunks) == 0:

print(f"No chunks were generated for the collection '{collection_name}'.")

return

# Generate embedding for the first chunk to determine vector size

sample_text = chunks[0] if isinstance(chunks[0], str) else chunks[0]["text"]

sample_embedding = embed_text(sample_text)

vector_size = len(sample_embedding)

create_collection_if_not_exists(collection_name, vector_size)

for idx, chunk in enumerate(chunks):

text = chunk if isinstance(chunk, str) else chunk["text"]

embedding = embed_text(text)

payload = chunk if isinstance(chunk, dict) else {"text": text} # Always ensure there's text in the payload

client.upsert(collection_name=collection_name, points=[

PointStruct(id=idx, vector=embedding, payload=payload)

])

print(f"Chunks successfully stored in the collection '{collection_name}'.")

# Execute chunking and storing separately for each strategy

print("Starting poor_chunking...")

store_chunks(poor_chunking(text), "poor_chunking")

print("Starting good_chunking...")

store_chunks(good_chunking(text), "good_chunking")

print("Starting good_chunking_with_metadata...")

store_chunks(good_chunking_with_metadata(text), "good_chunking_with_metadata")

上面的代码执行以下操作:

· embed_text方法接收文本,使用 OpenAI 嵌入模型生成嵌入,并返回生成的嵌入。

· 初始化用于分块和后续内容检索的文本字符串

· 糟糕的分块策略: 将文本分成每 40 个字符的块

· 良好的分块策略:根据句子拆分文本以获得更有意义的上下文

· 具有元数据的良好分块策略:向句子级块添加适当的元数据

· 一旦为块生成了嵌入,它们就会存储在 Qdrant Cloud 中相应的集合中。

请记住,创建不良分块只是为了展示不良分块如何影响检索。

下面是来自 Qdrant Cloud 的块的屏幕截图,您可以看到元数据被添加到句子级块中以指示来源和主题。

基于分块策略的检索结果

现在让我们编写一些代码来根据查询从 Qdrant Vector DB 中检索内容。

Python

import qdrant_client

from qdrant_client.models import PointStruct, Distance, VectorParams

import openai

import yaml

# Load configuration

with open('config.yaml', 'r') as file:

config = yaml.safe_load(file)

# Initialize Qdrant client

client = qdrant_client.QdrantClient(config['qdrant']['url'], api_key=config['qdrant']['api_key'])

# Initialize OpenAI with the API key

openai.api_key = config['openai']['api_key']

def embed_text(text):

print(f"Generating embedding for: '{text[:50]}'...") # Show a snippet of the text being embedded

response = openai.embeddings.create(

input=[text], # Input needs to be a list

model=config['openai']['model_name']

)

embedding = response.data[0].embedding # Access using the attribute, not as a dictionary

print(f"Generated embedding of length {len(embedding)}.") # Confirm embedding generation

return embedding

# Function to create a collection if it doesn't exist

def create_collection_if_not_exists(collection_name, vector_size):

collections = client.get_collections().collections

if collection_name not in [collection.name for collection in collections]:

client.create_collection(

collection_name=collection_name,

vectors_config=VectorParams(size=vector_size, distance=Distance.COSINE)

)

print(f"Created collection: {collection_name} with vector size: {vector_size}") # Collection creation

else:

print(f"Collection {collection_name} already exists.") # Collection existence check

# Text to be chunked which is flagged for AI and Plagiarism but is just used for illustration and example.

text = """

Artificial intelligence is transforming industries across the globe. One of the key areas where AI is making a significant impact is healthcare. AI is being used to develop new drugs, personalize treatment plans, and even predict patient outcomes. Despite these advancements, there are challenges that must be addressed. The ethical implications of AI in healthcare, data privacy concerns, and the need for proper regulation are all critical issues. As AI continues to evolve, it is crucial that these challenges are not overlooked. By addressing these issues head-on, we can ensure that AI is used in a way that benefits everyone.

"""

# Poor Chunking Strategy

def poor_chunking(text, chunk_size=40):

chunks = [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

print(f"Poor Chunking produced {len(chunks)} chunks: {chunks}") # Show chunks produced

return chunks

# Good Chunking Strategy

def good_chunking(text):

import re

sentences = re.split(r'(?<=[.!?]) +', text)

print(f"Good Chunking produced {len(sentences)} chunks: {sentences}") # Show chunks produced

return sentences

# Good Chunking with Metadata

def good_chunking_with_metadata(text):

chunks = good_chunking(text)

metadata_chunks = []

for chunk in chunks:

if "healthcare" in chunk:

metadata_chunks.append({"text": chunk, "source": "Healthcare Section", "topic": "AI in Healthcare"})

elif "ethical implications" in chunk or "data privacy" in chunk:

metadata_chunks.append({"text": chunk, "source": "Challenges Section", "topic": "AI Challenges"})

else:

metadata_chunks.append({"text": chunk, "source": "General", "topic": "AI Overview"})

print(f"Good Chunking with Metadata produced {len(metadata_chunks)} chunks: {metadata_chunks}") # Show chunks produced

return metadata_chunks

# Store chunks in Qdrant

def store_chunks(chunks, collection_name):

if len(chunks) == 0:

print(f"No chunks were generated for the collection '{collection_name}'.")

return

# Generate embedding for the first chunk to determine vector size

sample_text = chunks[0] if isinstance(chunks[0], str) else chunks[0]["text"]

sample_embedding = embed_text(sample_text)

vector_size = len(sample_embedding)

create_collection_if_not_exists(collection_name, vector_size)

for idx, chunk in enumerate(chunks):

text = chunk if isinstance(chunk, str) else chunk["text"]

embedding = embed_text(text)

payload = chunk if isinstance(chunk, dict) else {"text": text} # Always ensure there's text in the payload

client.upsert(collection_name=collection_name, points=[

PointStruct(id=idx, vector=embedding, payload=payload)

])

print(f"Chunks successfully stored in the collection '{collection_name}'.")

# Execute chunking and storing separately for each strategy

print("Starting poor_chunking...")

store_chunks(poor_chunking(text), "poor_chunking")

print("Starting good_chunking...")

store_chunks(good_chunking(text), "good_chunking")

print("Starting good_chunking_with_metadata...")

store_chunks(good_chunking_with_metadata(text), "good_chunking_with_metadata")

上面的代码执行以下操作:

· 定义查询并生成查询的嵌入

· 搜索查询设置为"ethical implications of AI in healthcare"。

· 该retrieve_and_print函数搜索特定的 Qdrant 集合并检索最接近查询嵌入的前 3 个向量。

现在让我们看看输出:

python retrieval_test.py

Results from 'poor_chunking' collection for the query: 'ethical implications of AI in healthcare':

Result 1:

Text: . The ethical implications of AI in heal

Source: N/A

Topic: N/A

Result 2:

Text: ant impact is healthcare. AI is being us

Source: N/A

Topic: N/A

Result 3:

Text:

Artificial intelligence is transforming

Source: N/A

Topic: N/A

Results from 'good_chunking' collection for the query: 'ethical implications of AI in healthcare':

Result 1:

Text: The ethical implications of AI in healthcare, data privacy concerns, and the need for proper regulation are all critical issues.

Source: N/A

Topic: N/A

Result 2:

Text: One of the key areas where AI is making a significant impact is healthcare.

Source: N/A

Topic: N/A

Result 3:

Text: By addressing these issues head-on, we can ensure that AI is used in a way that benefits everyone.

Source: N/A

Topic: N/A

Results from 'good_chunking_with_metadata' collection for the query: 'ethical implications of AI in healthcare':

Result 1:

Text: The ethical implications of AI in healthcare, data privacy concerns, and the need for proper regulation are all critical issues.

Source: Healthcare Section

Topic: AI in Healthcare

Result 2:

Text: One of the key areas where AI is making a significant impact is healthcare.

Source: Healthcare Section

Topic: AI in Healthcare

Result 3:

Text: By addressing these issues head-on, we can ensure that AI is used in a way that benefits everyone.

Source: General

Topic: AI Overview

同一搜索查询的输出根据实施的分块策略而有所不同。

· 分块策略不佳:您可以注意到,这里的结果不太相关,这是因为文本被分成了任意的小块。

· 良好的分块策略:这里的结果更相关,因为文本被分成句子,保留了语义含义。

· 使用元数据进行良好的分块策略:这里的结果最准确,因为文本经过深思熟虑地分块并使用元数据进行增强。

从实验中得出的推论

· 分块需要精心制定策略,并且块大小不宜太小或太大。

· 分块不当的一个例子是,块太小,在非自然的地方切断句子,或者块太大,同一个块中包含多个主题,这使得检索非常混乱。

· 分块的整个想法都围绕着为 LLM 提供更好的背景的概念。

· 元数据通过提供额外的上下文层极大地增强了结构正确的分块。例如,我们已将来源和主题作为元数据元素添加到我们的分块中。

· 检索系统受益于这些附加信息。例如,如果元数据表明某个区块属于“医疗保健部分”,则系统可以在进行与医疗保健相关的查询时优先考虑这些区块。

· 通过改进分块,结果可以结构化和分类。如果查询与同一文本中的多个上下文匹配,我们可以通过查看块的元数据来确定信息属于哪个上下文或部分。

牢记这些策略,并在基于 LLM 的搜索应用程序中分块取得成功。

本站声明: 本文章由作者或相关机构授权发布,目的在于传递更多信息,并不代表本站赞同其观点,本站亦不保证或承诺内容真实性等。需要转载请联系该专栏作者,如若文章内容侵犯您的权益,请及时联系本站删除。
换一批
延伸阅读

9月2日消息,不造车的华为或将催生出更大的独角兽公司,随着阿维塔和赛力斯的入局,华为引望愈发显得引人瞩目。

关键字: 阿维塔 塞力斯 华为

加利福尼亚州圣克拉拉县2024年8月30日 /美通社/ -- 数字化转型技术解决方案公司Trianz今天宣布,该公司与Amazon Web Services (AWS)签订了...

关键字: AWS AN BSP 数字化

伦敦2024年8月29日 /美通社/ -- 英国汽车技术公司SODA.Auto推出其旗舰产品SODA V,这是全球首款涵盖汽车工程师从创意到认证的所有需求的工具,可用于创建软件定义汽车。 SODA V工具的开发耗时1.5...

关键字: 汽车 人工智能 智能驱动 BSP

北京2024年8月28日 /美通社/ -- 越来越多用户希望企业业务能7×24不间断运行,同时企业却面临越来越多业务中断的风险,如企业系统复杂性的增加,频繁的功能更新和发布等。如何确保业务连续性,提升韧性,成...

关键字: 亚马逊 解密 控制平面 BSP

8月30日消息,据媒体报道,腾讯和网易近期正在缩减他们对日本游戏市场的投资。

关键字: 腾讯 编码器 CPU

8月28日消息,今天上午,2024中国国际大数据产业博览会开幕式在贵阳举行,华为董事、质量流程IT总裁陶景文发表了演讲。

关键字: 华为 12nm EDA 半导体

8月28日消息,在2024中国国际大数据产业博览会上,华为常务董事、华为云CEO张平安发表演讲称,数字世界的话语权最终是由生态的繁荣决定的。

关键字: 华为 12nm 手机 卫星通信

要点: 有效应对环境变化,经营业绩稳中有升 落实提质增效举措,毛利润率延续升势 战略布局成效显著,战新业务引领增长 以科技创新为引领,提升企业核心竞争力 坚持高质量发展策略,塑强核心竞争优势...

关键字: 通信 BSP 电信运营商 数字经济

北京2024年8月27日 /美通社/ -- 8月21日,由中央广播电视总台与中国电影电视技术学会联合牵头组建的NVI技术创新联盟在BIRTV2024超高清全产业链发展研讨会上宣布正式成立。 活动现场 NVI技术创新联...

关键字: VI 传输协议 音频 BSP

北京2024年8月27日 /美通社/ -- 在8月23日举办的2024年长三角生态绿色一体化发展示范区联合招商会上,软通动力信息技术(集团)股份有限公司(以下简称"软通动力")与长三角投资(上海)有限...

关键字: BSP 信息技术
关闭