Better Search Results Through Intelligent Chunking and Metadata Integration
Typically, the knowledge bases we build for LLM-based retrieval applications contain large amounts of data in a variety of formats. To provide the LLM with the most relevant context for answering questions about a specific part of the knowledge base, we rely on chunking the text in the knowledge base and storing it in a readily accessible form.
Chunking
Chunking is the process of splitting text into meaningful units to improve information retrieval. By ensuring that each chunk represents a single focused idea or point, chunking helps preserve the contextual integrity of the content.
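As a quick illustration of the idea (the helper below is hypothetical and not part of this article's pipeline), sentence-aware chunking can also carry a one-sentence overlap between neighboring chunks, so that context is not lost at chunk boundaries:

```python
import re

def chunk_with_overlap(text, max_sentences=2, overlap=1):
    """Group sentences into chunks, repeating `overlap` sentences
    at each boundary so neighboring chunks share context."""
    sentences = re.split(r'(?<=[.!?]) +', text.strip())
    step = max(1, max_sentences - overlap)  # guard against a zero step
    chunks = []
    for i in range(0, len(sentences), step):
        chunk = " ".join(sentences[i:i + max_sentences])
        if chunk:
            chunks.append(chunk)
        if i + max_sentences >= len(sentences):
            break
    return chunks

sample = "AI helps healthcare. It raises privacy concerns. Regulation is needed."
print(chunk_with_overlap(sample))
# → ['AI helps healthcare. It raises privacy concerns.',
#    'It raises privacy concerns. Regulation is needed.']
```

Overlap is a common middle ground between the fixed-size and pure sentence-level splits used later in this article: it keeps sentence boundaries intact while softening the hard cut between chunks.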
In this article, we will look at three aspects of chunking:
· How poor chunking leads to less relevant results
· How good chunking leads to better results
· How good chunking combined with metadata leads to well-contextualized results
To demonstrate the importance of chunking effectively, we will take the same piece of text, apply three different chunking strategies to it, and examine how information is retrieved for a given query.
Chunking and Storing in Qdrant
Let's look at the code below, which shows three different ways to chunk the same text.
Python
import qdrant_client
from qdrant_client.models import PointStruct, Distance, VectorParams
import openai
import yaml

# Load configuration
with open('config.yaml', 'r') as file:
    config = yaml.safe_load(file)

# Initialize Qdrant client
client = qdrant_client.QdrantClient(config['qdrant']['url'], api_key=config['qdrant']['api_key'])

# Initialize OpenAI with the API key
openai.api_key = config['openai']['api_key']

def embed_text(text):
    print(f"Generating embedding for: '{text[:50]}'...")  # Show a snippet of the text being embedded
    response = openai.embeddings.create(
        input=[text],  # Input needs to be a list
        model=config['openai']['model_name']
    )
    embedding = response.data[0].embedding  # Access using the attribute, not as a dictionary
    print(f"Generated embedding of length {len(embedding)}.")  # Confirm embedding generation
    return embedding

# Create a collection if it doesn't exist
def create_collection_if_not_exists(collection_name, vector_size):
    collections = client.get_collections().collections
    if collection_name not in [collection.name for collection in collections]:
        client.create_collection(
            collection_name=collection_name,
            vectors_config=VectorParams(size=vector_size, distance=Distance.COSINE)
        )
        print(f"Created collection: {collection_name} with vector size: {vector_size}")
    else:
        print(f"Collection {collection_name} already exists.")

# Text to be chunked; it may get flagged by AI and plagiarism detectors but is used purely for illustration and example.
text = """
Artificial intelligence is transforming industries across the globe. One of the key areas where AI is making a significant impact is healthcare. AI is being used to develop new drugs, personalize treatment plans, and even predict patient outcomes. Despite these advancements, there are challenges that must be addressed. The ethical implications of AI in healthcare, data privacy concerns, and the need for proper regulation are all critical issues. As AI continues to evolve, it is crucial that these challenges are not overlooked. By addressing these issues head-on, we can ensure that AI is used in a way that benefits everyone.
"""

# Poor chunking strategy: fixed 40-character slices
def poor_chunking(text, chunk_size=40):
    chunks = [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    print(f"Poor Chunking produced {len(chunks)} chunks: {chunks}")  # Show chunks produced
    return chunks

# Good chunking strategy: sentence-level splits
def good_chunking(text):
    import re
    sentences = re.split(r'(?<=[.!?]) +', text)
    print(f"Good Chunking produced {len(sentences)} chunks: {sentences}")  # Show chunks produced
    return sentences

# Good chunking with metadata
def good_chunking_with_metadata(text):
    chunks = good_chunking(text)
    metadata_chunks = []
    for chunk in chunks:
        if "healthcare" in chunk:
            metadata_chunks.append({"text": chunk, "source": "Healthcare Section", "topic": "AI in Healthcare"})
        elif "ethical implications" in chunk or "data privacy" in chunk:
            metadata_chunks.append({"text": chunk, "source": "Challenges Section", "topic": "AI Challenges"})
        else:
            metadata_chunks.append({"text": chunk, "source": "General", "topic": "AI Overview"})
    print(f"Good Chunking with Metadata produced {len(metadata_chunks)} chunks: {metadata_chunks}")  # Show chunks produced
    return metadata_chunks

# Store chunks in Qdrant
def store_chunks(chunks, collection_name):
    if len(chunks) == 0:
        print(f"No chunks were generated for the collection '{collection_name}'.")
        return
    # Generate embedding for the first chunk to determine vector size
    sample_text = chunks[0] if isinstance(chunks[0], str) else chunks[0]["text"]
    sample_embedding = embed_text(sample_text)
    vector_size = len(sample_embedding)
    create_collection_if_not_exists(collection_name, vector_size)
    for idx, chunk in enumerate(chunks):
        chunk_text = chunk if isinstance(chunk, str) else chunk["text"]
        embedding = embed_text(chunk_text)
        payload = chunk if isinstance(chunk, dict) else {"text": chunk_text}  # Always ensure there's text in the payload
        client.upsert(collection_name=collection_name, points=[
            PointStruct(id=idx, vector=embedding, payload=payload)
        ])
    print(f"Chunks successfully stored in the collection '{collection_name}'.")

# Execute chunking and storing separately for each strategy
print("Starting poor_chunking...")
store_chunks(poor_chunking(text), "poor_chunking")
print("Starting good_chunking...")
store_chunks(good_chunking(text), "good_chunking")
print("Starting good_chunking_with_metadata...")
store_chunks(good_chunking_with_metadata(text), "good_chunking_with_metadata")
The code above does the following:
· The embed_text method takes text, generates an embedding using the OpenAI embedding model, and returns the generated embedding.
· Initializes the text string used for chunking and later content retrieval
· Poor chunking strategy: splits the text into chunks of 40 characters each
· Good chunking strategy: splits the text by sentence for more meaningful context
· Good chunking strategy with metadata: adds appropriate metadata to the sentence-level chunks
· Once embeddings have been generated for the chunks, they are stored in the corresponding collections in Qdrant Cloud.
Remember that the poor chunks are created only to demonstrate how bad chunking can affect retrieval.
Below is a screenshot of the chunks in Qdrant Cloud, where you can see the metadata added to the sentence-level chunks to indicate the source and topic.
Retrieval Results Based on Chunking Strategy
Now let's write some code to retrieve content from the Qdrant vector DB based on a query.
Python
import qdrant_client
import openai
import yaml

# Load configuration
with open('config.yaml', 'r') as file:
    config = yaml.safe_load(file)

# Initialize Qdrant client
client = qdrant_client.QdrantClient(config['qdrant']['url'], api_key=config['qdrant']['api_key'])

# Initialize OpenAI with the API key
openai.api_key = config['openai']['api_key']

def embed_text(text):
    response = openai.embeddings.create(
        input=[text],  # Input needs to be a list
        model=config['openai']['model_name']
    )
    return response.data[0].embedding  # Access using the attribute, not as a dictionary

# Search a collection and print the top matches for the query
def retrieve_and_print(collection_name, query_embedding, query):
    print(f"Results from '{collection_name}' collection for the query: '{query}':")
    results = client.search(
        collection_name=collection_name,
        query_vector=query_embedding,
        limit=3  # Retrieve the 3 vectors closest to the query embedding
    )
    for i, result in enumerate(results, start=1):
        payload = result.payload or {}
        print(f"Result {i}:")
        print(f"Text: {payload.get('text', 'N/A')}")
        print(f"Source: {payload.get('source', 'N/A')}")
        print(f"Topic: {payload.get('topic', 'N/A')}")

# Define the query and generate its embedding
query = "ethical implications of AI in healthcare"
query_embedding = embed_text(query)

# Run the same query against each chunking strategy's collection
for collection_name in ["poor_chunking", "good_chunking", "good_chunking_with_metadata"]:
    retrieve_and_print(collection_name, query_embedding, query)
The code above does the following:
· Defines the query and generates an embedding for it
· The search query is set to "ethical implications of AI in healthcare".
· The retrieve_and_print function searches a specific Qdrant collection and retrieves the top 3 vectors closest to the query embedding.
Now let's look at the output:
python retrieval_test.py
Results from 'poor_chunking' collection for the query: 'ethical implications of AI in healthcare':
Result 1:
Text: . The ethical implications of AI in heal
Source: N/A
Topic: N/A
Result 2:
Text: ant impact is healthcare. AI is being us
Source: N/A
Topic: N/A
Result 3:
Text:
Artificial intelligence is transforming
Source: N/A
Topic: N/A
Results from 'good_chunking' collection for the query: 'ethical implications of AI in healthcare':
Result 1:
Text: The ethical implications of AI in healthcare, data privacy concerns, and the need for proper regulation are all critical issues.
Source: N/A
Topic: N/A
Result 2:
Text: One of the key areas where AI is making a significant impact is healthcare.
Source: N/A
Topic: N/A
Result 3:
Text: By addressing these issues head-on, we can ensure that AI is used in a way that benefits everyone.
Source: N/A
Topic: N/A
Results from 'good_chunking_with_metadata' collection for the query: 'ethical implications of AI in healthcare':
Result 1:
Text: The ethical implications of AI in healthcare, data privacy concerns, and the need for proper regulation are all critical issues.
Source: Healthcare Section
Topic: AI in Healthcare
Result 2:
Text: One of the key areas where AI is making a significant impact is healthcare.
Source: Healthcare Section
Topic: AI in Healthcare
Result 3:
Text: By addressing these issues head-on, we can ensure that AI is used in a way that benefits everyone.
Source: General
Topic: AI Overview
The output for the same search query differs depending on the chunking strategy applied.
· Poor chunking strategy: You can see that the results here are less relevant because the text was split into small, arbitrary chunks.
· Good chunking strategy: The results here are more relevant because the text was split into sentences, preserving semantic meaning.
· Good chunking strategy with metadata: The results here are the most accurate because the text was chunked thoughtfully and enriched with metadata.
Inferences From the Experiment
· Chunking requires a carefully designed strategy, and chunks should be neither too small nor too large.
· Examples of bad chunking are chunks so small that they cut sentences off at unnatural points, and chunks so large that multiple topics land in the same chunk, which makes retrieval very noisy.
· The whole idea of chunking revolves around providing better context to the LLM.
· Metadata greatly enhances well-structured chunks by adding an extra layer of context. For example, we added the source and topic as metadata elements to our chunks.
· Retrieval systems benefit from this additional information. For example, if the metadata indicates that a chunk belongs to the "Healthcare Section", the system can prioritize those chunks for healthcare-related queries.
· With improved chunking, results can be structured and categorized. If a query matches multiple contexts within the same text, we can determine which context or section a piece of information belongs to by looking at the chunk's metadata.
Keep these strategies in mind, and happy chunking in your LLM-based search applications.