Document Parsing

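The essence of document parsing - converting document data that varies in format, version, and element type into string content with the correct reading order
Quality in, quality out is characteristic of LLMs
- High-quality document parsing extracts high-precision information from unstructured data in all kinds of complex formats
- It is decisive for the final quality of a RAG system
RAG systems are used mainly in specialized domains and enterprise settings
- Beyond databases, much of the data is stored as PDF, Word, and other file formats
- PDF has a consistent layout and diverse structure, and is the most common format for storing and exchanging documents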


LangChain

Document Loaders

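LangChain provides a powerful set of document loaders (Document Loaders)
LangChain defines the BaseLoader class and the Document class
- BaseLoader - defines how documents are loaded from different data sources
- Document - holds a document's content and metadata in one uniform shape, whatever the source type
Developers can subclass BaseLoader to create a custom loader for a specific data source and load its content as Document objects
The Document Loader module is an SDK that wraps various document parsing libraries; the parsing library behind each loader must be installed separately
In real development work, custom document post-processing logic has to be written to fit the specific business requirements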
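
What the post-processing logic looks like depends on the business case. As one illustration, here is a minimal sketch using only the standard library. The function name `clean_page_text` and its three rules (rejoining words hyphenated across a line break, dropping lines that hold only a page number, collapsing whitespace) are assumptions made for this example and are not part of LangChain.

```python
import re


def clean_page_text(text: str) -> str:
    """Hypothetical post-processing step applied to a loader's page_content."""
    # Rejoin words that were hyphenated across a line break: "docu-\nment" -> "document".
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    kept = []
    for line in text.splitlines():
        stripped = line.strip()
        # Drop lines that contain only a page number such as "12" or "- 12 -".
        if re.fullmatch(r"-?\s*\d+\s*-?", stripped):
            continue
        # Collapse runs of spaces and tabs inside the line.
        kept.append(re.sub(r"[ \t]+", " ", stripped))
    # Collapse three or more consecutive newlines into a single blank line.
    return re.sub(r"\n{3,}", "\n\n", "\n".join(kept)).strip()
```

These rules are deliberately crude: a line holding only a year would be dropped as a page number, and a genuine hyphenated compound split across lines would lose its hyphen. Real pipelines would tune them per document collection.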

Type	Document Loader	Library	App
.pdf	PDFPlumberLoader	pdfplumber	-
.txt	TextLoader	-	-
.doc	UnstructuredWordDocumentLoader	unstructured, python-docx	libreoffice
.docx	UnstructuredWordDocumentLoader	unstructured, python-docx	-
.ppt	UnstructuredPowerPointLoader	unstructured, python-pptx	-
.pptx	UnstructuredPowerPointLoader	unstructured, python-pptx	-
.xlsx	UnstructuredExcelLoader	unstructured, openpyxl	-
.csv	CSVLoader	pandas	-
.md	UnstructuredMarkdownLoader	unstructured, markdown	-
.xml	UnstructuredXMLLoader	unstructured, lxml	-
.html	UnstructuredHTMLLoader	unstructured, lxml	-

Library

$ pip install unstructured pdfplumber python-docx python-pptx markdown openpyxl pandas

App

$ sudo apt install libreoffice   # Debian / Ubuntu
$ brew install libreoffice       # macOS (Homebrew)

LangChain Community

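LangChain Community is the extension package that integrates LangChain with commonly used third-party libraries
langchain_community.document_loaders
- Open-source and commercial libraries extend BaseLoader with loaders for different document types
- Covers local files, cloud files, databases, internet platforms, web services, and other data sources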

Loading Documents

import os

from langchain_community.document_loaders import (
    PDFPlumberLoader,
    TextLoader,
    UnstructuredWordDocumentLoader,
    UnstructuredPowerPointLoader,
    UnstructuredExcelLoader,
    CSVLoader,
    UnstructuredMarkdownLoader,
    UnstructuredXMLLoader,
    UnstructuredHTMLLoader,
)  # Import the document loader classes from langchain_community.document_loaders

# Map each file extension to its document loader class and the keyword arguments passed to it
DOCUMENT_LOADER_MAPPING = {
    ".pdf": (PDFPlumberLoader, {}),
    ".txt": (TextLoader, {"encoding": "utf8"}),
    ".doc": (UnstructuredWordDocumentLoader, {}),
    ".docx": (UnstructuredWordDocumentLoader, {}),
    ".ppt": (UnstructuredPowerPointLoader, {}),
    ".pptx": (UnstructuredPowerPointLoader, {}),
    ".xlsx": (UnstructuredExcelLoader, {}),
    ".csv": (CSVLoader, {}),
    ".md": (UnstructuredMarkdownLoader, {}),
    ".xml": (UnstructuredXMLLoader, {}),
    ".html": (UnstructuredHTMLLoader, {}),
}


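def load_document(file_path):
    """
    Parse a file in one of the supported document formats and return its content as a string.
    :param file_path: path to the document file
    :return: the document content as a string ("" for unsupported types)
    """
    # The file extension determines the document type; lower-cased so ".PDF" matches ".pdf"
    ext = os.path.splitext(file_path)[1].lower()
    loader_tuple = DOCUMENT_LOADER_MAPPING.get(ext)  # (loader class, kwargs) for this type, or None

    if loader_tuple:  # The format is covered by one of the loaders
        loader_class, loader_args = loader_tuple  # Unpack the loader class and its kwargs
        loader = loader_class(file_path, **loader_args)  # Instantiate the loader with the file path
        documents = loader.load()  # Load the document
        content = "\n".join([doc.page_content for doc in documents])  # Join multi-page content into one string
        print(f"Content preview of {file_path}: {content[:100]}...")  # Show only the first 100 characters
        return content  # Return the document content as a string

    print(f"{file_path}: unsupported document type '{ext}'")
    return ""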

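Parses multiple document formats and returns the document content as a string
Checks the file extension ext and dynamically selects the matching Document Loader
- Instantiates that Document Loader, which calls the underlying parsing library to read the document content
Loads the document content as a string - merging multi-page documents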
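
The same extension-based dispatch can be applied to a whole directory before any loader is instantiated. The sketch below is a standard-library-only illustration. `split_by_support` and `SUPPORTED_EXTENSIONS` are hypothetical names, and the extension set is assumed to mirror the keys of DOCUMENT_LOADER_MAPPING.

```python
from pathlib import Path

# Assumed to mirror the keys of DOCUMENT_LOADER_MAPPING.
SUPPORTED_EXTENSIONS = {
    ".pdf", ".txt", ".doc", ".docx", ".ppt", ".pptx",
    ".xlsx", ".csv", ".md", ".xml", ".html",
}


def split_by_support(root, supported=SUPPORTED_EXTENSIONS):
    """Walk root recursively; return (supported_files, unsupported_files)."""
    ok, skipped = [], []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file():
            continue
        # Path.suffix keeps the original case, so normalise before the lookup.
        (ok if path.suffix.lower() in supported else skipped).append(path)
    return ok, skipped
```

Each path in the first list could then be passed to load_document as `str(path)`, while the second list gives a quick report of the files that need a new loader.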

PDF

PDF vs Markdown

PDF
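- Renders identically regardless of device, software, or operating system
- A collection of display and print instructions, not a structured data format, so the stored information cannot be interpreted directly by a computer
- LLM training data generally does not include raw PDF files, so an LLM cannot read the format directly
Markdown
- Focuses on content rather than print layout, and can represent many kinds of document elements
- Converting PDF to Markdown is the best fit, since Markdown can be understood by LLMs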
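
As a small illustration of how a document element maps to Markdown, the sketch below renders an extracted table as a Markdown table. The input shape (a list of rows whose first row is the header, with `None` for empty cells) is an assumption about what a table extractor returns, and `table_to_markdown` is a hypothetical helper name.

```python
def table_to_markdown(rows):
    """Render a list of rows (first row = header) as a Markdown table."""
    if not rows:
        return ""
    width = max(len(row) for row in rows)

    def fmt(row):
        cells = []
        # Pad short rows so every line has the same number of columns.
        for cell in list(row) + [None] * (width - len(row)):
            text = "" if cell is None else str(cell)
            # Escape pipes and flatten line breaks so each row stays on one line.
            cells.append(text.replace("|", "\\|").replace("\n", " ").strip())
        return "| " + " | ".join(cells) + " |"

    header, *body = rows
    lines = [fmt(header), "| " + " | ".join(["---"] * width) + " |"]
    lines.extend(fmt(row) for row in body)
    return "\n".join(lines)
```

Merged cells and multi-row headers are not handled here. Those cases usually need HTML tables or a flattening rule chosen per document type.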

Born-digital vs Scanned

Born-digital

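Born-digital PDFs can be parsed with rules - extracting text, tables, and other document elements
Open-source libraries - PyPDF2 / PyMuPDF / pdfminer / pdfplumber / papermage / …
- pdfplumber - good Chinese support, but weaker table parsing
- PyPDF2 - good English support, but poor Chinese support
- papermage - integrates pdfminer and other tools, suited to academic papers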


Scanned

Text recognition (OCR) and table recognition are required before the document's elements can be extracted
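
Before choosing between rule-based parsing and OCR, a pipeline has to guess which kind of PDF it is looking at. One common heuristic, sketched below, is to run a rule-based extractor first and check how much text comes back. The function name `looks_scanned` and both thresholds are assumptions for illustration; `page_texts` stands for whatever per-page strings the extractor returned.

```python
def looks_scanned(page_texts, min_chars_per_page=20, min_text_ratio=0.5):
    """Guess whether a PDF is scanned from the text a rule-based parser recovered.

    page_texts: one string per page (None or "" when nothing was extracted).
    Returns True when fewer than min_text_ratio of the pages carry at least
    min_chars_per_page non-whitespace characters.
    """
    if not page_texts:
        return True  # Nothing extractable at all: treat as scanned.
    pages_with_text = sum(
        1
        for text in page_texts
        # Count only non-whitespace characters on each page.
        if len("".join((text or "").split())) >= min_chars_per_page
    )
    return pages_with_text / len(page_texts) < min_text_ratio
```

Documents that mix scanned and born-digital pages would need this decision made per page rather than per file.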


Deep Learning

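True document parsing also requires layout analysis and restoring the reading order
The content is parsed into a Markdown file that contains every document element in the correct reading order
Rule-based parsing alone cannot achieve this - use open-source libraries based on deep learning
- Layout-parser / PP-StructureV2 / PDF-Extract-Kit / pix2text / MinerU / marker / Gptpdf /…
- Because deep learning models are complex to deploy and demand significant compute, these are not yet integrated into LangChain Community - deploy them separately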
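
To make the reading-order problem concrete, here is a toy sketch that orders text blocks on a multi-column page. It is a fixed-column heuristic, not how the deep learning tools work. It assumes every block comes with a bounding box measured from the top-left corner and that no block spans several columns. `Block` and `reading_order` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Block:
    text: str
    x0: float   # left edge
    top: float  # distance from the top of the page
    x1: float   # right edge


def reading_order(blocks, page_width, columns=2):
    """Order blocks column by column, top to bottom within each column."""
    col_width = page_width / columns

    def key(block):
        center = (block.x0 + block.x1) / 2
        # Clamp so a block touching the right margin stays in the last column.
        col = min(int(center // col_width), columns - 1)
        return (col, block.top, block.x0)

    return sorted(blocks, key=key)
```

Sorting by vertical position alone would interleave the left and right columns. Headings, figures, and tables that span columns break this fixed-column assumption, which is where learned layout models become necessary.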

Model	Desc	Stars	Link
LayoutParser	Layout analysis toolkit: layout detection, OCR, layout analysis	4.7K	https://github.com/Layout-Parser/layout-parser
PP-StructureV2	Open-sourced by Baidu: text recognition, table recognition, layout restoration	42.2K	https://github.com/PaddlePaddle/PaddleOCR
PDF-Extract-Kit	LayoutLMv3 - layout detection; YOLOv8 - formula detection; UniMERNet - formula recognition; PaddleOCR - text recognition	4.5K	https://github.com/opendatalab/PDF-Extract-Kit
pix2text	Strong math formula detection; a Mathpix alternative	1.7K	https://github.com/breezedeus/Pix2Text
MinerU	From Shanghai AI Laboratory: multi-format support, high-precision parsing, multilingual	10.3K	https://github.com/opendatalab/MinerU
marker	Optimized for books and scientific papers	16K	https://github.com/VikParuchuri/marker
Gptpdf	Based on GPT-4o; parsing costs $0.013 per page	2.7K	https://github.com/CosmosShadow/gptpdf

Paid

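Risk - information leakage

Because the PDF parsing pipeline chains several deep learning models, it runs into efficiency problems in production
Commercial closed-source offerings are deployed in the cloud, where parallel processing and engineering optimization let them reach production-grade accuracy and efficiency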

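Multimodal

Understanding image content in PDFs needs further exploration
Parsing is not limited to the text modality: non-text content in images may also carry important information
- Convert this content to text and embed it in the Markdown file
- Usually relies on an end-to-end multimodal LLM - GPT-4o / Gemini - at a trade-off in cost and efficiency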
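
The embedding step can be sketched independently of any particular model. In the example below, `describe_image` is a hypothetical callable standing in for a multimodal LLM call; no real API is invoked. The output format (a quoted "Image description" line after each image) is likewise an assumption.

```python
import re

# Matches Markdown image references of the form ![alt](path).
IMAGE_PATTERN = re.compile(r"!\[(?P<alt>[^\]]*)\]\((?P<src>[^)\s]+)\)")


def embed_image_descriptions(markdown, describe_image):
    """Append a text description after every Markdown image reference.

    describe_image: hypothetical callable (image path -> description string),
    e.g. a wrapper around a multimodal LLM. It is injected so that cost and
    latency stay under the caller's control.
    """
    cache = {}

    def replace(match):
        src = match.group("src")
        if src not in cache:  # Describe each distinct image only once.
            cache[src] = describe_image(src)
        return f"{match.group(0)}\n\n> Image description: {cache[src]}"

    return IMAGE_PATTERN.sub(replace, markdown)
```

Caching by image path keeps the number of model calls equal to the number of distinct images, which matters when each call is billed.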