How to extract metadata from create_retriever_tool? #1055

Louva1 · 2024-07-18T13:42:46Z

Hi,

I am building a setup that is similar to this example: LangGraph Agentic RAG

The retriever_tool now always returns the page content of the retrieved documents as one large string where the pages are separated by '\n\n'.

I want to know the source of these retrieved documents, so I want to look at their metadata. How can I extract the retrieved Document objects and their metadata in this setup?

Please let me know how I should modify the code as referenced above in order to implement this.

Thanks!

Answered by hwchase17

hwchase17 · 2024-07-18T14:01:37Z

hwchase17
Jul 18, 2024
Maintainer

The retriever tool is a pretty simple wrapper around a retriever that:

1. Calls the retriever with the input
2. Gets back documents
3. Formats those documents into a string
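
As a rough illustration of the three steps above, here is a minimal sketch in plain Python. It is not the library's actual implementation: `StubRetriever` and `default_tool_output` are hypothetical names, and the stub documents only mimic the `page_content` and `metadata` attributes of a real Document.

```python
from types import SimpleNamespace


class StubRetriever:
    """Hypothetical stand-in for a real retriever; returns canned documents."""

    def invoke(self, query):
        return [
            SimpleNamespace(page_content="# Food 1", metadata={"source": "data/food/food_1.md"}),
            SimpleNamespace(page_content="# Food 2", metadata={"source": "data/food/food_2.md"}),
        ]


def default_tool_output(retriever, query, separator="\n\n"):
    """Mimic the wrapper: call the retriever, then join only the page contents."""
    docs = retriever.invoke(query)  # steps 1 and 2: call the retriever, get documents
    # Step 3: only page_content is kept, so the metadata is lost at this point.
    return separator.join(doc.page_content for doc in docs)


output = default_tool_output(StubRetriever(), "what is food?")
```

The returned string carries no trace of `metadata["source"]`, which is why the source cannot be recovered from the tool output alone.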

The retriever is turned into a tool here:

retriever_tool = create_retriever_tool(
    retriever,
    "retrieve_blog_posts",
    "Search and return information about Lilian Weng blog posts on LLM agents, prompt engineering, and adversarial attacks on LLMs.",
)

The formatting happens in step 3. The simplest solution is probably to NOT use the off-the-shelf create_retriever_tool and instead write your own tool (call the raw retriever inside it, get back the raw documents with their metadata, and do whatever you want with them).
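
A custom tool along these lines could look roughly like the sketch below. It is an illustration under assumptions, not the exact code for the example: `Doc` and `FakeRetriever` are hypothetical stand-ins so the snippet runs without a vector store, and `format_docs_with_sources` is an illustrative helper. In a real setup you would use your actual retriever and register the function as a tool, for example with the `@tool` decorator from `langchain_core.tools`.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Stand-in for a LangChain Document (same two attributes)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


class FakeRetriever:
    """Hypothetical stand-in for the real retriever; returns canned documents."""

    def invoke(self, query: str) -> list:
        return [
            Doc("Agents plan and act.", {"source": "posts/agents.md"}),
            Doc("Prompts steer the model.", {"source": "posts/prompts.md"}),
        ]


def format_docs_with_sources(docs) -> str:
    """Join documents into one string, keeping each document's source."""
    blocks = []
    for doc in docs:
        source = doc.metadata.get("source", "unknown")
        blocks.append(f"source: {source}\n{doc.page_content}")
    return "\n\n".join(blocks)


retriever = FakeRetriever()  # assumption: replace with your real retriever


def retrieve_blog_posts(query: str) -> str:
    """Search blog posts and return matching passages with their sources."""
    docs = retriever.invoke(query)  # raw documents, metadata still attached
    return format_docs_with_sources(docs)


result = retrieve_blog_posts("what is an agent?")
```

Because the function receives the raw documents, it can also return the metadata in any other shape, such as a list of dictionaries, instead of a formatted string.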

4 replies

Louva1 Jul 18, 2024
Author

Thanks Harrison, I will try this. Can you provide a short code snippet showing how I can create a custom retriever tool?

a143416 Jul 22, 2024

In previous versions of LangChain, create_retriever_tool returned the retrieved documents' metadata. I was using that metadata to provide links to the retrieved chunks. After I upgraded to the most recent version of LangChain, this stopped working: the retriever tool no longer includes metadata in the returned chunks, so the LLM generates fake links.

a143416 Jul 23, 2024

@Louva1 This is a workaround to include the metadata in the retrieved contexts.

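from langchain_core.prompts import PromptTemplate
from langchain.tools.retriever import create_retriever_tool

# Each retrieved document is rendered with this template. Every document
# must have `source` and `page` in its metadata, or formatting fails.
doc_prompt = PromptTemplate.from_template(
    "<context>\n{page_content}\n\n<meta>\nsource: {source}\npage: {page}\n</meta>\n</context>"
)
tool = create_retriever_tool(
    retriever,
    name="search_knowledge_base",
    description="description",
    document_prompt=doc_prompt,
)
tool.invoke("what is ...?")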
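
To see what a single chunk looks like under that template, the sketch below fills the same template string with plain `str.format`. This is only a stand-in for LangChain's own formatting, and the `render` helper and the page number are hypothetical values chosen for illustration.

```python
TEMPLATE = (
    "<context>\n{page_content}\n\n<meta>\nsource: {source}\npage: {page}\n</meta>\n</context>"
)


def render(page_content: str, metadata: dict) -> str:
    """Fill the template from the page content plus the metadata keys."""
    # In this sketch a missing metadata key raises KeyError.
    return TEMPLATE.format(page_content=page_content, **metadata)


# Hypothetical document: the page number is made up for the example.
chunk = render("# Food 1", {"source": "data/food/food_1.md", "page": 3})
print(chunk)
```

Each chunk returned by the tool then carries its own `<meta>` section, so the LLM can cite the real source instead of inventing one.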

gopidon Sep 26, 2024

@Louva1, see this: langchain-ai/langchain#17398 (comment)

a143416 · 2024-07-22T18:52:39Z

This is how to reproduce the issue. I am wondering why LangChain tools do not maintain backward compatibility. It is disappointing to have to build a custom retriever tool for the new LangChain version to do exactly what the previous version did out of the box.

langchain 0.0.324

from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import FAISS
from langchain.agents.agent_toolkits import create_retriever_tool

embeddings = OpenAIEmbeddings(model="text-embedding-ada-002", chunk_size=1)
vectorstore = FAISS.load_local("data/faiss_index", embeddings)
retriever = vectorstore.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 2},
)

tool = create_retriever_tool(
    retriever,
    name="search_knowledge_base",
    description="Submit a question about the food review process, ...",
)
tool.invoke("what is food?")

[Document(page_content='# Food 1\n\n## Subsection\n\n ...', metadata={'source': 'data/food/food_1.md'}),
 Document(page_content='# Food 2\n\n## Best Practices:\n\n- ...', metadata={'source': 'data/food/food_2.md'})]

langchain 0.2.10

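from langchain.embeddings import OpenAIEmbeddings  # legacy import path, kept to match the snippet above
from langchain.vectorstores import FAISS
from langchain.agents.agent_toolkits import create_retriever_tool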

embeddings = OpenAIEmbeddings(model="text-embedding-ada-002", chunk_size=1)
vectorstore = FAISS.load_local("data/faiss_index", embeddings, allow_dangerous_deserialization=True)
retriever = vectorstore.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 2},
)

tool = create_retriever_tool(
    retriever,
    name="search_knowledge_base",
    description="Submit a question about the food review process, ...",
)
tool.invoke("what is food?")

'# Food 1\n\n## Subsection\n\n ... \n\n# Food 2\n\n## Best Practices:\n\n- ...'

0 replies
