TextHawk is a Multimodal Large Language Model (MLLM) designed specifically for document-oriented tasks while preserving general capabilities. It explores efficient fine-grained perception through four dedicated components:
- ReSampling and ReArrangement (ReSA)
- Scalable Positional Embeddings (SPEs)
- Query Proposal Network (QPN)
- Multi-Level Cross-Attention (MLCA)
We create a new instruction-tuning dataset, DocGemini, for document-oriented tasks by enriching multimodal document data with Gemini Pro. Each data sample contains:
- A brief summary of the document topics.
- Short QA pairs, up to 10.
- Insights behind each answer.
- [Optional] An imaginary conversation between two researchers.
DocGemini consists of 30K images and 195K QA pairs with insights.
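The sample structure described above can be sketched as a small Python data model. This is only an illustration: the class names, field names (`image`, `summary`, `qa_pairs`, `conversation`), and the `validate` helper are hypothetical and do not reflect the actual file format of the released data.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# "Up to 10" short QA pairs per sample, as stated in the list above.
MAX_QA_PAIRS = 10


@dataclass
class QAPair:
    """One short question/answer pair plus the insight behind the answer."""
    question: str
    answer: str
    insight: str


@dataclass
class DocGeminiSample:
    """Hypothetical in-memory view of one DocGemini sample."""
    image: str                                 # assumed: path or ID of the document image
    summary: str                               # brief summary of the document topics
    qa_pairs: List[QAPair] = field(default_factory=list)
    conversation: Optional[List[str]] = None   # optional imaginary conversation turns

    def validate(self) -> None:
        """Check the constraints stated in the dataset description."""
        if not self.summary.strip():
            raise ValueError("summary must be non-empty")
        if len(self.qa_pairs) > MAX_QA_PAIRS:
            raise ValueError(f"at most {MAX_QA_PAIRS} QA pairs per sample")
        for qa in self.qa_pairs:
            if not qa.insight.strip():
                raise ValueError("every answer needs an insight")


if __name__ == "__main__":
    sample = DocGeminiSample(
        image="doc_0001.png",  # hypothetical file name
        summary="An invoice listing purchased items and a total.",
        qa_pairs=[QAPair("What is the total?", "$42",
                         "The total appears in the last table row.")],
    )
    sample.validate()
    print(len(sample.qa_pairs), "QA pair(s), conversation:", sample.conversation)
```

Keeping the conversation optional mirrors the "[Optional]" item in the list; a loader for the real files would need to map their actual keys onto these fields.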
Dataset | QA | Conversation |
---|---|---|
DocVQA | link | link |
ChartQA | link | link |
InfoVQA | link | link |
Note: Alternatively, you can produce data on your own using the scripts we provide.
| Model | ViT (Params.) | MME perception | MMB dev | SEED image | GQA | DocVQA | ChartQA | InfoVQA | TabFact | WTQ | RefCOCO val | RefCOCO test-A | RefCOCO test-B |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | (0.1B) | - | - | - | - | 67.5 | 41.8 | 11.6 | 54.6 | 18.8 | - | - | - |
| | - | - | - | - | - | 76.6 | 58.6 | 40.0 | - | - | - | - | - |
| | (1B) | 1528.4 | 74.8 | 66.1 | - | - | - | - | - | - | - | - | - |
| | (0.3B) | 1510.7 | 65.2 | - | 62.0 | - | - | - | - | - | - | - | - |
| | (0.3B) | - | 58.8 | - | - | - | - | - | - | - | 87.0 | 91.1 | 81.8 |
| | (2B) | 1487.6 | 60.6 | 65.4 | 57.5 | 62.6 | 66.3 | - | - | - | 88.6 | 92.3 | 84.5 |
| | (2B) | - | 59.3 | - | 60.7 | 66.5 | 65.1 | 36.1 | - | 25.3 | - | - | - |
| | (0.3B) | - | - | - | - | 65.4 | 59.3 | 42.2 | 67.6 | 29.4 | - | - | - |
| | (2B) | - | - | - | - | 73.0 | 66.9 | - | - | 31.9 | - | - | - |
| | (0.4B) | 1520.9 | 73.0 | 69.2 | 64.7 | 73.6 | 64.0 | 47.3 | 70.7 | 33.5 | 87.3 | 90.9 | 83.3 |
| | (0.4B) | 1500.0 | 74.6 | 69.2 | 64.6 | 76.4 | 66.6 | 50.6 | 71.1 | 34.7 | 87.2 | 90.8 | 82.5 |
Note: $\textbf{TextHawk}^*$ is fine-tuned without DocGemini.