Using LangChain and ChatGPT for Q&A and interpretation of professional documents

In most cases, we ask simple questions on OpenAI's official ChatGPT website and hope to get good answers. In that setting, ChatGPT answers only from its training data, which is equivalent to a closed-book exam, and its knowledge cuts off at September 2021.

So how do we get ChatGPT to answer questions as an "open-book exam"? In other words, if I have a document (a paper, a contract, etc.), how can I ask ChatGPT to refer to it when answering?

The basic idea

The basic idea is very simple, that is:

Find relevant text blocks from the original text
-> Put them in the prompt for ChatGPT to refer to
-> Use ChatGPT to answer, as in an open-book exam

Of course, it sounds simple, but in actual execution, every step is as troublesome as "putting an elephant in the refrigerator".

For example, how do we find the corresponding paragraphs in the original text? How do we write the prompt so that ChatGPT understands it more easily? Each of these questions is worth tons of SCI articles.

However, the good thing is that a lot of research has already been done and has produced results, and open-source frameworks have been developed to put them into practice. One of them is LangChain.

Implementation using LangChain

LangChain is a relatively new and rapidly evolving framework for building applications on top of language models.

Of course, moving fast usually means the documentation lags behind, and LangChain has this problem too. The good news is that LangChain's documentation is relatively concise and easy to understand, and it is easy to trace back to the source code, which itself has very clear comments and architecture. If some documents are out of date, a look at the source code is usually enough.

Let's take an example to explain how to implement this. Suppose we have an insurance contract and we want to know "what age range can be insured for this insurance". There is a paragraph in the contract that says:

The insurance age refers to the age of the insured at the time of insurance application, 
and the insurance age is calculated in years. 
The insurance age range accepted by this contract is from 
0 years old (must be born over 28 days) to 65 years old, 
and must comply with our regulations at the time of insurance purchase.

The obvious answer is 0 years old (must be born over 28 days) to 65 years old. So how do we get this answer using LangChain and ChatGPT? This is explained in detail below.

Before proceeding with the following process, it is recommended to read the LangChain tutorial Store and reference chat history to get some basic concepts about LangChain's usage and logic.

Extract text chunks from PDF

This step is easy to understand: extract the text from the PDF, then process it with the LLM.

I use pymupdf to read the PDF, since it performs well. Other libraries, such as (the new) pypdf, should also work.

import fitz  # imports the pymupdf library

doc = fitz.open(pdf_path)  # open a document
blocks = []
page_num = 0  # becomes 1-based after the first increment below
for page in doc:  # iterate over the document pages
    page_num += 1
    if page_num < page_st:  # page_st / page_en: page range to extract; page_en == -1 means "until the end"
        continue
    if (page_en != -1) and (page_num > page_en):
        break
    blocks.extend(page.get_text('blocks', sort=False))  # get text blocks as plain UTF-8 text

The blocks list we get here holds the text blocks extracted from the PDF. We then merge related blocks together to obtain the text chunks. One of the chunks is:

The insurance age refers to the age of the insured at the time of insurance application, 
and the insurance age is calculated in years. 
The insurance age range accepted by this contract is from 
0 years old (must be born over 28 days) to 65 years old, 
and must comply with our regulations at the time of insurance purchase.

The specific splitting and merging strategies vary from scenario to scenario. One trick: try to keep text together according to the document's paragraphs.
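Purely as an illustration (the exact logic depends on the document layout and is more involved than this), here is a minimal sketch of one merging strategy; merge_blocks_into_chunks and max_chars are names made up for this example:

def merge_blocks_into_chunks(blocks, max_chars=500):
    # each pymupdf block is a tuple (x0, y0, x1, y1, text, block_no, block_type)
    chunks = []
    current = ""
    for b in blocks:
        if b[6] != 0:  # skip non-text (image) blocks
            continue
        text = b[4].strip()
        if not text:
            continue
        if current and len(current) + len(text) > max_chars:
            # current chunk is large enough: close it and start a new one
            chunks.append(current)
            current = text
        else:
            current = (current + " " + text).strip()
    if current:
        chunks.append(current)
    return chunks

chunks = merge_blocks_into_chunks(blocks)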

LangChain of course also provides corresponding utilities for this; see its documentation on document loaders and text splitters.

Find relevant text chunks from the original text

Our insurance contract runs to tens of thousands of words over dozens of pages, but ChatGPT's GPT-3.5 only accepts on the order of 4k-16k tokens in the prompt. Obviously it cannot hold all of the text.
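To see the size problem concretely, you can count tokens with OpenAI's tiktoken library (a rough illustration; cl100k_base is the encoding used by gpt-3.5-turbo, and concatenating the chunks here just stands in for the whole contract text):

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # the tokenizer used by gpt-3.5-turbo
full_text = " ".join(chunks)  # the whole contract as one string
print(len(enc.encode(full_text)))  # usually far more than the 4k-16k token budget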

Therefore, we need a way to prune: take out the text chunks that may be related to the user's question (generally called the query), and drop the obviously irrelevant ones.
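The usual way to do this pruning is to compare embedding vectors: chunks whose embeddings are close to the query's embedding are probably relevant. A minimal sketch of the idea (LangChain and faiss do this for us below; this is only to show what "relevant" means here):

import math

def cosine_similarity(a, b):
    # similarity between two embedding vectors; closer to 1 means more related
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def top_k_chunks(query_ebd, chunk_ebds, chunks, k=4):
    # keep only the k chunks most similar to the query
    scored = sorted(zip(chunks, chunk_ebds), key=lambda p: cosine_similarity(query_ebd, p[1]), reverse=True)
    return [c for c, _ in scored[:k]]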

Generate embeddings from the text chunks

We call OpenAI's text-embedding-ada-002 model to turn each text chunk into an embedding vector, and cache the results on disk with pickle so the same chunk never has to be embedded (and billed) twice.

import os
import pickle
import time
import logging

import openai  # this uses the pre-1.0 openai SDK interface (openai.Embedding / openai.api_base)

logger = logging.getLogger(__name__)

# r_embedding / w_embedding are the cache file paths to read from and write to;
# chunks is the list of text chunks from the previous step; sleep throttles API calls

# read saved embedding
ebd_raw = {}
if r_embedding:
    with open(r_embedding, 'rb') as r:
        ebd_raw = pickle.load(r)

# generate embedding for each chunk
for i, chunk in enumerate(chunks):
    logtail = "i: %d" % (i)
    if chunk in ebd_raw:  # simplest possible cache: skip chunks already embedded
        logger.info("chunk already exists, skip to next. chunk: <%s> %s", chunk, logtail)
        continue
    time.sleep(sleep)  # stay under the API rate limit

    # openai python library using custom api_key and api_base
    openai.api_key = os.environ['EBD_OPENAI_API_KEY']
    openai.api_base = os.environ['EBD_OPENAI_API_BASE']

    # generate embedding
    ebd_res = openai.Embedding.create(input=[chunk], engine="text-embedding-ada-002")
    logger.info("chunk: <%s> ebd_res: <%s> %s", chunk, str(ebd_res).replace("\n", " "), logtail)

    ebd_raw[chunk] = ebd_res

    # write the cache to disk after every chunk so a crash doesn't lose progress
    with open(w_embedding, 'wb') as w:
        pickle.dump(ebd_raw, w)
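Each cached entry follows OpenAI's embeddings response format, so the actual vector can be pulled out like this (text-embedding-ada-002 returns 1536-dimensional vectors):

vector = ebd_raw[chunk]["data"][0]["embedding"]
print(len(vector))  # 1536 for text-embedding-ada-002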

Store the embeddings in the vector store

Here I use faiss to store the vectors; the official tutorial uses Chroma. Since LangChain wraps both behind the same interface, they are similar in use.

from langchain.vectorstores import FAISS
from langchain.embeddings import OpenAIEmbeddings

tes = []
metas = []
for i, chunk in enumerate(chunks):
    ebd_item = ebd_raw[chunk]
    ebd_data = ebd_item["data"][0]["embedding"]  # the embedding vector from the cached response
    # chunk_pages is assumed here to map each chunk index to the page it came from
    source = "[Page num %d]" % chunk_pages[i]
    meta = {'source': source}
    tes.append((chunk, ebd_data))
    metas.append(meta)

ebd_obj = OpenAIEmbeddings(openai_api_base=os.environ['EBD_OPENAI_API_BASE'], openai_api_key=os.environ['EBD_OPENAI_API_KEY'])
faiss_obj = FAISS.from_embeddings(tes, ebd_obj, metadatas=metas)
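As an optional sanity check (not part of the original flow), you can query the store directly before wiring it into a chain:

docs = faiss_obj.similarity_search("what age range can be insured for this insurance", k=2)
for d in docs:
    print(d.metadata['source'], d.page_content[:80])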

Put them in the prompt for ChatGPT to refer to

import langchain
from langchain.chains.qa_with_sources import load_qa_with_sources_chain

doc_chain = load_qa_with_sources_chain(model, chain_type="stuff")  # `model` is the ChatOpenAI instance created in the next section

During execution, the chain embeds the user's query, searches the vector store for the most relevant text chunks, and this doc_chain then combines them into the prompt. Detailed documentation for load_qa_with_sources_chain can be referenced here.

As for chain_type, LangChain currently provides stuff, refine, map_reduce and map_rerank, which are explained in the LangChain doc: Documents.
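Switching strategies only requires changing the chain_type argument, for example (just a sketch; stuff is what we actually use in this article):

doc_chain_map_reduce = load_qa_with_sources_chain(model, chain_type="map_reduce")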

The prompt templates and workflow are baked into LangChain's source code. For example, since we are using the stuff chain_type here, the corresponding prompt template is written in stuff_prompt.py, while question_generator uses CONDENSE_QUESTION_PROMPT to rewrite the question.
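If you just want to see the template without reading the source, printing it from the constructed chain also works (the attribute path below is an assumption based on the 0.0.x LangChain versions this article targets):

print(doc_chain.llm_chain.prompt.template)  # the stuff prompt with its few-shot examples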

ChatGPT open-book answering

from langchain.chat_models import ChatOpenAI
from langchain.chains import LLMChain, ConversationalRetrievalChain
from langchain.chains.conversational_retrieval.prompts import CONDENSE_QUESTION_PROMPT

model = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)  # use OPENAI_API_BASE and OPENAI_API_KEY env vars to power OpenAI
question_generator = LLMChain(llm=model, prompt=CONDENSE_QUESTION_PROMPT)

qa = ConversationalRetrievalChain(retriever=faiss_obj.as_retriever(), question_generator=question_generator, combine_docs_chain=doc_chain)

Detailed documentation for ConversationalRetrievalChain can be found here.

# see the differences in https://python.langchain.com/docs/guides/debugging
langchain.debug = True
# langchain.verbose = True

query = 'what age range can be insured for this insurance'
chat_history = []
logger.info("now start qa. query: <%s>", query)
result = qa({"question": query, "chat_history": chat_history})
logger.info("qa result: <%s>", str(result))

After execution, the program accurately provided the answer and told us that the source is page 11, which makes it easy to go back and confirm:

qa result: <{'question': 'what age range can be insured for this insurance', 'chat_history': [], 'answer': '0 years old (must be born over 28 days) to 65 years old\nSOURCES: [Page num 11]'}>
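Since this is a ConversationalRetrievalChain, you can also keep the conversation going by feeding the previous turn back in as chat history (the follow-up question below is made up purely for illustration):

chat_history = [(query, result["answer"])]
followup = "does a 70-year-old qualify?"
result2 = qa({"question": followup, "chat_history": chat_history})
logger.info("follow-up result: <%s>", str(result2))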

What exactly did we say to ChatGPT?

Amazing, huh?

Indeed, what spell did we cast on ChatGPT to make it so smart?

Since we turned on debugging, it is very easy to see what happened under the hood. Let's take a look at the final prompt sent to ChatGPT:

Human: Given the following extracted parts of a long document and a question, create a final answer with references (\"SOURCES\"). 
If you don't know the answer, just say that you don't know. Don't try to make up an answer.
ALWAYS return a \"SOURCES\" part in your answer.

(LangChain official example 1)
(LangChain official example 2)

QUESTION: what age range can be insured for this insurance
=========
Content: The insurance age range accepted by this contract is from 0 years old (must be born over 28 days) to 65 years old, and must comply with our regulations at the time of insurance purchase.
Source: [Page num 11]

Content: For life insurance for minors, before the insured reaches adulthood, the total amount of insurance benefits due to the death of the insured shall not exceed the limit prescribed by the Insurance Regulatory Authority of the State Council, and the total agreed amount of insurance benefits for death benefits shall not exceed the aforementioned limit.
Source: [Page num 5]

(more text chunks)

=========
FINAL ANSWER:

This is actually few-shot prompting, a form of instruction prompting, and retrieving text chunks from faiss before generation is what is called Retrieval-Augmented Generation. To put it bluntly, it means giving the LLM some examples and the relevant text chunks so it can reason over them on its own.

To go deeper, it is recommended to read Prompt Engineering (by lilianweng), which gives a second-hand interpretation of dozens of the academic papers that LangChain is built on, summarizing each of them in just a handful of sentences. It is a classic.
