Chat with Docs - Project - Python Warriors

In this tutorial, we are going to learn about ChatwithDocs, a project that lets you interact with your documents in just a few steps.

Large Language Models (LLMs) have started a whole new revolution in the Artificial Intelligence industry. New applications built on LLMs appear on a daily basis. One limitation of these LLMs is the amount of text they can handle at once, which means we cannot simply pass in long documents and ask questions about them directly.

This limitation has been smartly handled by Langchain and Llama-Index. Not only can we feed in long-format documents, we can also interact with them through an LLM.

ChatwithDocs is a Python-based project built with Llama-Index and Streamlit. You can interact with CSV, PDF, TXT and Docs format documents in 2-3 steps.

Prerequisites

You only need the following things to get started with this project.

OpenAI API Key: You will have to create an account on OpenAI and then get a key from here.
Working knowledge of Python programming, which you can start building from here.

Coding Section

To use the project we need to do the following things:

Clone or download the project from GitHub: Chat with Docs
Install the required Python libraries.
Run the project

Once you are done with cloning and installing the required packages, make a folder named documents in the same directory as the .py file. For the UI, we are using Streamlit, which helps to build UIs in Python.
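If you would rather create that folder from Python than by hand, a minimal sketch could look like the following. The helper name ensure_documents_folder is hypothetical and not part of the project; the only assumption is that the folder is called documents and sits next to the script.

```python
from pathlib import Path


def ensure_documents_folder(base_dir):
    """Create the `documents` folder inside base_dir if it is missing.

    Hypothetical helper for illustration; not part of the project.
    """
    folder = Path(base_dir) / "documents"
    # exist_ok=True makes the call safe to repeat on every app start
    folder.mkdir(parents=True, exist_ok=True)
    return folder
```

Calling ensure_documents_folder(Path(__file__).parent) at the top of the script would keep the folder next to the .py file.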

Assuming all the above steps are done at your end, let’s look at the code. It contains the following sections:

Two tabs, one to handle CSV files and the other to handle other document formats.
Storing documents inside a folder until vectors are created, and then removing them.
Interacting with a CSV file using the PandasAI loader, which is a wrapper over the PandasAI library.
Caching the vectors so that we do not end up creating embeddings again and again.
Querying the docs and generating a response.
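The caching step in the list above can be pictured with a standard-library analogy. This sketch uses functools.lru_cache in place of Streamlit's st.cache_resource, and build_index is a hypothetical stand-in for the expensive embedding step, so it illustrates the idea rather than the project's actual code.

```python
from functools import lru_cache

build_calls = []  # one entry is appended each time the expensive step really runs


@lru_cache(maxsize=1)
def build_index():
    """Stand-in for the expensive embedding step (hypothetical)."""
    build_calls.append("built")
    return {"build_number": len(build_calls)}


def on_new_file(current_name, new_name):
    """Clear the cached index only when a different file is uploaded."""
    if current_name != new_name:
        build_index.cache_clear()
    return new_name
```

Repeated calls to build_index() reuse the cached result until on_new_file sees a different filename, which mirrors the intent of clearing the cache when a new document is uploaded.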

Interact with CSV

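from llama_index import download_loader

# Fetch the PandasAIReader loader class
PandasAIReader = download_loader("PandasAIReader")

def get_csv_result(df, query):
    # csv_llm is the LLM object configured elsewhere in the project
    reader = PandasAIReader(llm=csv_llm)
    # Ask the question against the DataFrame; return the raw, non-conversational answer
    response = reader.run_pandas_ai(df, query, is_conversational_answer=False)
    return response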

We initialize a CSV reader using PandasAIReader, which is a wrapper over the PandasAI project. It lets us interact with our CSV file using the power of an LLM, which makes it easier to find insights in the data. Rather than going to and fro through the data, you can simply ask!

Ask questions based on your dataset and you will get your answers

Renders graphs as well

Interact with Docs

This section lets you interact with other forms of documents such as PDF, TXT and Docs files.

import os
import streamlit as st
from llama_index import GPTVectorStoreIndex, SimpleDirectoryReader

# Assumed value: the folder created next to the .py file
documents_folder = "documents"

def save_file(doc):
    fn = os.path.basename(doc.name)
    # Write the uploaded file into the documents folder on the server
    with open(os.path.join(documents_folder, fn), 'wb') as f:
        f.write(doc.read())
    # If a different file was uploaded, clear the previously cached
    # vectors so the index is rebuilt for the new file
    if st.session_state.get('file_name') and st.session_state.file_name != fn:
        st.cache_resource.clear()
    # Remember the current filename
    st.session_state['file_name'] = fn
    return fn

def remove_file(file_path):
    # Remove the file from the documents folder once
    # vectors are created
    full_path = os.path.join(documents_folder, file_path)
    if os.path.isfile(full_path):
        os.remove(full_path)

    

@st.cache_resource
def create_index():
    # Create vectors for the file stored under Document folder. 
    # NOTE: You can create vectors for multiple files at once.
    documents = SimpleDirectoryReader(documents_folder).load_data()
    index = GPTVectorStoreIndex.from_documents(documents)
    return index



def query_doc(vector_index, query):
    # Runs a similarity search to find the nearest matching chunks, then
    # sends those chunks and the user query to OpenAI for a rich response
    query_engine = vector_index.as_query_engine()
    response = query_engine.query(query)
    return response


These are the four main functions that take us from uploading a doc to querying it:

save_file saves the file in the required folder, which is used to generate embeddings.
remove_file removes the file from the folder once the embeddings are generated.
create_index interacts with the embedding model (OpenAI by default) and generates the embeddings used in the query step.
query_doc takes the user query and the document embeddings and finds the nearest sections using cosine similarity or another nearest-neighbour algorithm. Once found, it sends only the required parts to the LLM to generate a rich response.
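The nearest-match step behind query_doc can be illustrated with a toy example. This is a sketch under assumptions, not Llama-Index's internal code: the three-dimensional vectors are made up, whereas real embeddings come from the embedding model and have many more dimensions, and the function names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a zero vector has no direction to compare
    return dot / (norm_a * norm_b)


def top_k_chunks(query_vec, chunks, k=2):
    """Return the k chunk texts most similar to the query vector.

    chunks is a list of (text, vector) pairs.
    """
    ranked = sorted(
        chunks,
        key=lambda chunk: cosine_similarity(query_vec, chunk[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]


# Made-up vectors, for illustration only
chunks = [
    ("Refund policy", [0.9, 0.1, 0.0]),
    ("Shipping times", [0.1, 0.9, 0.0]),
    ("Company history", [0.0, 0.1, 0.9]),
]
query_vec = [0.8, 0.2, 0.0]
best = top_k_chunks(query_vec, chunks, k=1)  # ["Refund policy"]
```

Only the selected chunks, together with the user's question, would then be sent to the LLM, which is how a long document fits within the model's text limit.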

Once the doc is uploaded, you can start asking questions about it.
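The save, index and remove sequence performed by save_file, create_index and remove_file could be wired together roughly as follows. This is a standard-library sketch rather than the project's code: index_upload is a hypothetical name, and build_index stands for any callable that builds an index from a folder path.

```python
import os


def index_upload(file_name, data, folder, build_index):
    """Save an upload, build an index from the folder, then delete the file.

    build_index is any callable that takes the folder path; in the project
    this role is played by create_index (hypothetical wiring).
    """
    os.makedirs(folder, exist_ok=True)
    path = os.path.join(folder, os.path.basename(file_name))
    with open(path, "wb") as f:
        f.write(data)
    try:
        return build_index(folder)
    finally:
        # Remove the raw file whether or not indexing succeeded
        if os.path.isfile(path):
            os.remove(path)
```

Using try/finally is a design choice of this sketch: it keeps the folder clean even if indexing raises an error, so a failed upload does not leak into the next index build.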

To begin with, sample documents have been added in the Sample Docs folder in the GitHub repo. You can use those documents and start asking questions.

You can find the code on 🖥️ GitHub | Feel free to give ⭐