Large Language Models (LLMs) such as OpenAI's ChatGPT and Google's Bard have made life easier for everyone, whether you are a developer, content writer, social media manager, or an everyday user. You can build almost anything with LLMs nowadays. Still, there are areas where LLMs fall short: they have no knowledge of specific domains, for example company documentation or the contents of your PDFs. Another limitation is the context window: we cannot input more than roughly 2K tokens into GPT-3 or 8K tokens into GPT-4, which becomes a hurdle when we want help from these LLMs with our custom data.
In this tutorial, we are going to overcome these limitations.
Introduction
LlamaIndex
LlamaIndex is a project that provides a central interface between LLMs and your custom knowledge bases.
LlamaIndex came into the picture because you cannot feed large datasets directly to GPT-3/ChatGPT/GPT-4. If you have ever tried asking ChatGPT about a large document, you have probably run into its input limit.
LlamaIndex creates embeddings of your large dataset and then applies semantic search (approximate nearest neighbour, ANN) over those embeddings for each query. Only the most relevant embeddings are sent to OpenAI along with the user's query. This reduces the amount of data we send to OpenAI, and hence makes it feasible to use OpenAI on custom data.
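The retrieval step described above can be sketched in a few lines (a minimal illustration using numpy and cosine similarity; the toy 2-D vectors stand in for real embeddings, which would come from an embedding model):

```python
import numpy as np

def cosine_similarity(a, b):
    # Cosine of the angle between two vectors: 1.0 means identical direction
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def top_k_chunks(query_vec, chunk_vecs, k=3):
    # Rank stored chunk embeddings by similarity to the query embedding
    # and return the indices of the k closest chunks
    scores = [cosine_similarity(query_vec, v) for v in chunk_vecs]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]

# Toy 2-D "embeddings" for three document chunks and one query
chunks = [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([0.9, 0.1])]
query = np.array([1.0, 0.05])

print(top_k_chunks(query, chunks, k=2))  # → [0, 2]
```

Only the text behind the winning chunks is then packed into the prompt, which is how the token limit is avoided.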
OpenAI
OpenAI is an AI company that has created LLMs such as GPT-3, ChatGPT, and GPT-4.
Streamlit
Streamlit is a web framework written in Python. It is used to build data-driven applications with a UI.
Approach to the Problem
Coding Section
Let's implement the above approach in code. To get started, we need the following:
- OpenAI API key: can be obtained by creating an account on the OpenAI platform and requesting a key. Check here
- LlamaIndex: a Python library that can be installed via pip:
pip install llama-index
- Streamlit: a Python library that can be installed via pip:
pip install streamlit
We are going to use The Adventure of Sherlock Holmes, which you can download from here. To verify that ChatGPT cannot read it, copy the entire book and paste it into ChatGPT; you'll see that it cannot process the whole document at once.
Create a Python file, name it chatbot.py, and add the following code to it:
import streamlit as st
from streamlit_pills import pills
from llama_index import SimpleDirectoryReader, GPTSimpleVectorIndex, ServiceContext
import os

os.environ['OPENAI_API_KEY'] = "ADD_YOUR_OPENAI_API_KEY_HERE"

# Create a llama-index service context (which LLM to use, chunk size, etc.)
service_context = ServiceContext.from_defaults(chunk_size_limit=256)

# Load your data to prepare it for creating embeddings
documents = SimpleDirectoryReader(input_files=['sher.txt']).load_data()

# Embeddings are created here for your data
global_index = GPTSimpleVectorIndex.from_documents(documents)

# You can save your index to disk, to be re-used later
global_index.save_to_disk('sherlockholmes.json')

# Load the saved vectors back from disk (could also be fetched from a server)
global_index = GPTSimpleVectorIndex.load_from_disk('sherlockholmes.json', service_context=service_context)

st.subheader("AI Assistant based on Custom Knowledge Base: `The Adventure of Sherlock Holmes`")

# You can also use radio buttons instead
selected = pills("", ["OpenAI", "Huggingface"], ["🤖", "🤗"])

user_input = st.text_input("You: ", placeholder="Ask me anything ...", key="input")

if st.button("Submit", type="primary"):
    st.markdown("----")
    res_box = st.empty()
    if selected == "OpenAI":
        response = global_index.query(user_input, similarity_top_k=3)
        res_box.write(str(response))
    else:
        res_box.write("Work in progress!!")
    st.markdown("----")
Let’s break down our code to understand it better.
- service_context defines which LLM we are going to use and the chunk size (in tokens) used when splitting the document. By default it uses OpenAI.
- SimpleDirectoryReader takes files in txt/pdf format and prepares the data used to create embeddings.
- GPTSimpleVectorIndex creates embeddings for our data and stores them in memory. We can save the embeddings to disk and load the saved embeddings back as well.
- We have two sections:
  - OpenAI: predicts the response for the user query using OpenAI's GPT-3.
  - HuggingFace: predicts the response using a Hugging Face model. You can choose any text-generation model from Hugging Face.
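The Hugging Face branch is left as a placeholder in the code above. One possible way to fill it in (a sketch, not part of the original tutorial; the model name `gpt2` is just a small example, and any text-generation model from the Hub could be substituted) is a local transformers text-generation pipeline:

```python
from transformers import pipeline

# Load a text-generation model from the Hugging Face Hub;
# "gpt2" is chosen here only because it is small.
generator = pipeline("text-generation", model="gpt2")

def hf_answer(prompt, max_new_tokens=64):
    # Note: unlike the OpenAI branch, this does NOT use the LlamaIndex
    # embeddings -- it simply generates a continuation of the prompt.
    out = generator(prompt, max_new_tokens=max_new_tokens, num_return_sequences=1)
    return out[0]["generated_text"]
```

Inside the Streamlit app, the `else` branch could then call `res_box.write(hf_answer(user_input))` instead of printing the placeholder.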
Now that we understand the code, let's run it. Type streamlit run chatbot.py in your terminal.
Voila! We have our very own chatbot. It lets us query our custom knowledge base through OpenAI's models, which was not possible before.
You can find the code on 🖥️ GitHub. Feel free to give it a ⭐ | Stay tuned to PythonWarriors