Unverified 提交 d2716add 作者: zhenkaivip 提交者: GitHub

使用paddleocr实现实现UnstructuredPaddlePDFLoader和UnstructuredPaddleImageLoader (#344)

* jpg and png ocr

* fix

* write docs to tmp file

* fix

* image loader

* fix

* fix

* add pdf_loader

* fix

* update INSTALL.md

---------

Co-authored-by: imClumsyPanda <littlepanda0716@gmail.com>
上级 ff8182f4
......@@ -29,7 +29,14 @@ $ git clone https://github.com/imClumsyPanda/langchain-ChatGLM.git
# 进入目录
$ cd langchain-ChatGLM
# 使用paddleocr需要卸载detectron2避免tools冲突
$ pip uninstall detectron2
# 安装依赖
$ pip install -r requirements.txt
# 验证paddleocr是否成功,首次运行会下载约18M模型到~/.paddleocr
$ python test_image.py
```
注:使用 `langchain.document_loaders.UnstructuredFileLoader` 进行非结构化文件接入时,可能需要依据文档进行其他依赖包的安装,请参考 [langchain 文档](https://python.langchain.com/en/latest/modules/indexes/document_loaders/examples/unstructured_file.html)
......@@ -27,8 +27,8 @@ class UnstructuredPaddleImageLoader(UnstructuredFileLoader):
txt_file_path = image_ocr_txt(self.file_path)
from unstructured.partition.text import partition_text
return partition_text(filename=txt_file_path, **self.unstructured_kwargs)
if __name__ == "__main__":
filepath = "../content/samples/test.jpg"
loader = UnstructuredPaddleImageLoader(filepath, mode="elements")
......
......@@ -44,9 +44,10 @@ class UnstructuredPaddlePDFLoader(UnstructuredFileLoader):
from unstructured.partition.text import partition_text
return partition_text(filename=txt_file_path, **self.unstructured_kwargs)
if __name__ == "__main__":
filepath = "../content/samples/test.pdf"
loader = UnstructuredPaddlePDFLoader(filepath, mode="elements")
docs = loader.load()
for doc in docs:
print(doc)
\ No newline at end of file
print(doc)
from configs.model_config import *
import nltk
nltk.data.path = [NLTK_DATA_PATH] + nltk.data.path
filepath = "./img/test.jpg"
from loader import UnstructuredPaddleImageLoader
loader = UnstructuredPaddleImageLoader(filepath, mode="elements")
docs = loader.load()
for doc in docs:
print(doc)
from configs.model_config import *
import nltk
nltk.data.path = [NLTK_DATA_PATH] + nltk.data.path
filepath = "docs/test.pdf"
from loader import UnstructuredPaddlePDFLoader
loader = UnstructuredPaddlePDFLoader(filepath, mode="elements")
docs = loader.load()
for doc in docs:
print(doc)
Markdown 格式
0%
您添加了 0 到此讨论。请谨慎行事。
请先完成此评论的编辑!
注册 或者 后发表评论