Commit 17d751d2 authored by KevinHuSh, committed by GitHub

refine README (#72)

* refine README

* Update README.md
parent 7fd1eca5
English | [简体中文](./README_zh.md)
# *Deep*Doc
---
- [1. Introduction](#1)
- [2. Vision](#2)
...@@ -11,7 +9,6 @@ English | [简体中文](./README_zh.md)
<a name="1"></a>
## 1. Introduction
---
Documents come from many domains, in many formats, and with diverse retrieval requirements,
so analyzing them accurately is a very challenging task. *Deep*Doc was built for that purpose.
There are 2 parts in *Deep*Doc so far: vision and parser.
...@@ -19,8 +16,6 @@ There 2 parts in *Deep*Doc so far: vision and parser.
<a name="2"></a>
## 2. Vision
---
We use visual information to solve problems the way a human being does.
- OCR. Since many documents are presented as images, or can at least be converted to images,
  OCR is an essential, fundamental, and nearly universal solution for text extraction.
...@@ -64,19 +59,16 @@ We use vision information to resolve problems as human being.
<a name="3"></a>
## 3. Parser
---
Four document formats, PDF, DOCX, EXCEL and PPT, each have a corresponding parser.
The most complex one is the PDF parser, because of PDF's flexibility. The output of the PDF parser includes:
 - Text chunks with their positions in the PDF (page number and rectangular positions).
 - Tables with a cropped image from the PDF, and contents that have already been translated into natural language sentences.
 - Figures with their captions and the text in the figures.
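
The three kinds of PDF parser output listed above could be modeled roughly as follows. This is only an illustrative sketch: every class, field, and function name here (`Position`, `TextChunk`, `Table`, `Figure`, `PdfParseResult`, `chunks_on_page`) is hypothetical and is not taken from *Deep*Doc's code, and the bounding-box layout is an assumption.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Hypothetical types for illustration only; not DeepDoc's actual classes.
# Assumed rectangle layout: (left, top, right, bottom) in page coordinates.
BBox = Tuple[float, float, float, float]


@dataclass
class Position:
    page_number: int  # page the rectangle lies on
    bbox: BBox        # rectangular position on that page


@dataclass
class TextChunk:
    text: str
    positions: List[Position]  # assumption: a chunk may cover several rectangles


@dataclass
class Table:
    image_path: str       # cropped image of the table (storage form is assumed)
    sentences: List[str]  # table contents rendered as natural language sentences


@dataclass
class Figure:
    caption: str
    text_in_figure: str   # text recognized inside the figure


@dataclass
class PdfParseResult:
    chunks: List[TextChunk] = field(default_factory=list)
    tables: List[Table] = field(default_factory=list)
    figures: List[Figure] = field(default_factory=list)


def chunks_on_page(result: PdfParseResult, page_number: int) -> List[TextChunk]:
    """Return the text chunks that have at least one rectangle on the given page."""
    return [
        chunk
        for chunk in result.chunks
        if any(pos.page_number == page_number for pos in chunk.positions)
    ]
```

Keeping the page number and rectangle together on each chunk is what would let a retrieval result be traced back to, and highlighted in, the original PDF.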
### Résumé
---
The résumé is a very complicated kind of document. A résumé, which is composed of unstructured text
with various layouts, can be resolved into structured data composed of nearly a hundred fields.
We haven't opened the parser yet; what we have opened is the processing method applied after the parsing procedure.
\ No newline at end of file