LlamaParse: Simplifying File Parsing for Enhanced Efficiency

Parsers are the backbone of retrieval augmented generation (RAG) apps. They extract information from user-supplied documents so that it can be fed into RAG apps. A variety of parsers are available for many file formats, and many are being developed and enhanced regularly.

In this article, we’ll take a closer look at LlamaParse, an interesting parser that I discovered accidentally while exploring the LlamaIndex repositories 😄.

LlamaParse, as described in the official repo, is an API created by LlamaIndex to efficiently parse and represent files for effective retrieval and context augmentation using LlamaIndex frameworks.

For any RAG app, answer quality depends heavily on how well the parser extracts the information present in the files. The better the extraction, the better the augmentation. LlamaParse takes care of your PDF parsing needs so that you only have to worry about your RAG app.

LlamaParse provides the following features:

Async batching, which lets you pass multiple PDF files in one go!
Text and Markdown-based extraction. Markdown-based extraction helps when parsing documents that contain tables.
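To see why Markdown output is handy for tables, here is a minimal sketch that turns a pipe-delimited Markdown table back into rows. The helper name parse_markdown_table and the sample table are made up for illustration; they are not part of LlamaParse, and real output may be formatted differently.

```python
def parse_markdown_table(markdown: str) -> list[dict[str, str]]:
    """Turn a simple pipe-delimited Markdown table into a list of row dicts."""
    lines = [line.strip() for line in markdown.strip().splitlines() if line.strip()]
    # Split each line on "|" after dropping the outer pipes, then trim each cell.
    rows = [[cell.strip() for cell in line.strip("|").split("|")] for line in lines]
    header, body = rows[0], rows[2:]  # rows[1] is the |---|---| separator line
    return [dict(zip(header, row)) for row in body]


# Made-up sample standing in for a table found in Markdown-mode output.
sample = """
| Item   | Qty |
|--------|-----|
| apples | 3   |
| pears  | 5   |
"""

table = parse_markdown_table(sample)
print(table)
```

Because the column structure survives as text, each cell stays attached to its header, which plain-text extraction usually cannot guarantee.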

Setup

To use LlamaParse, you’ll need the following items in your bag:

An API key, obtained by registering at LlamaCloud
The llama-parse Python library, installed with pip install llama-parse

Once you fulfill both requirements, you are good to go!
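If you keep the key in your environment, a small check up front can save a confusing failure later. This is only a sketch: the helper name get_llama_cloud_api_key is hypothetical, and the variable name LLAMA_CLOUD_API_KEY is the one mentioned in the example in this article.

```python
import os


def get_llama_cloud_api_key() -> str:
    """Return the key from LLAMA_CLOUD_API_KEY or fail with a clear message."""
    api_key = os.environ.get("LLAMA_CLOUD_API_KEY", "").strip()
    if not api_key:
        raise RuntimeError(
            "LLAMA_CLOUD_API_KEY is not set; register at LlamaCloud and export your key."
        )
    return api_key
```

Call it once at the top of your script and pass the result as api_key when you create the parser.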

LlamaIndex is very popular for its “5 lines of code” starter template, and they have done the same with LlamaParse. See the example below.

from llama_parse import LlamaParse
from llama_index.core import SimpleDirectoryReader

parser = LlamaParse(
    api_key="llx-...",  # can also be set in your env as LLAMA_CLOUD_API_KEY
    result_type="markdown",  # "markdown" and "text" are available
    num_workers=4,  # if multiple files passed, split in `num_workers` API calls
    verbose=True,
)

# sync
documents = parser.load_data("./my_file.pdf")
# sync batch
documents = parser.load_data(["./my_file1.pdf", "./my_file2.pdf"])
# async (await only works inside an async function or a notebook cell)
documents = await parser.aload_data("./my_file.pdf")
# async batch
documents = await parser.aload_data(["./my_file1.pdf", "./my_file2.pdf"])
# Integrate this parser with SimpleDirectoryReader
file_extractor = {".pdf": parser}
documents = SimpleDirectoryReader("./data", file_extractor=file_extractor).load_data()


The parser returns the data as a list of Document objects, which you can use to create an index.
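One thing to watch in the example above: the async calls use await, which a plain .py script does not allow at the top level. A minimal sketch of the usual workaround is to wrap the calls in a coroutine and start it with asyncio.run. StandInParser below is a hypothetical stand-in, not LlamaParse; it only mimics the aload_data method name so the sketch runs without an API key.

```python
import asyncio


class StandInParser:
    """Hypothetical stand-in that mimics the aload_data call shown above.

    It is NOT LlamaParse; it only lets this sketch run without an API key.
    """

    async def aload_data(self, file_path):
        # Accept a single path or a list of paths, like the calls above.
        paths = [file_path] if isinstance(file_path, str) else list(file_path)
        await asyncio.sleep(0)  # pretend to wait on a network call
        return [f"parsed:{path}" for path in paths]


async def main():
    parser = StandInParser()  # swap in LlamaParse(...) in a real script
    single = await parser.aload_data("./my_file.pdf")
    batch = await parser.aload_data(["./my_file1.pdf", "./my_file2.pdf"])
    return single + batch


# asyncio.run starts an event loop, runs main() to completion, then closes the loop.
results = asyncio.run(main())
print(results)
```

In a Jupyter notebook an event loop is already running, so there you would await the calls directly in a cell instead of using asyncio.run.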

Let’s test LlamaParse with a live example!

Live Experiment with LlamaParse

We’ll use the CRAG paper to demo the parsing capabilities of LlamaParse.

Create a .py file and add the following code.

from llama_parse import LlamaParse
import os

os.environ["LLAMA_CLOUD_API_KEY"] = "llx-..."   # add your LlamaCloud API key here

documents = LlamaParse(result_type="markdown").load_data("DOWNLOADED_CRAG_FILE_PATH")
print(documents[0].text[0:2000])


Run the file from any terminal with python parser.py, and you’ll get output like this in the terminal:

See how the headings and the rest of the structure are preserved in the printed data. This helps produce better augmented responses!
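Preserved headings matter because they let you cut the text along the document’s own sections. Here is a rough sketch of that idea, assuming the output uses standard # headings. The helper split_by_headings and the sample text are made up for illustration and are not something LlamaParse or LlamaIndex provides under that name.

```python
import re


def split_by_headings(markdown: str) -> list[tuple[str, str]]:
    """Split Markdown text into (heading, body) pairs, one per '#' heading."""
    sections = []
    heading, body = "", []
    for line in markdown.splitlines():
        match = re.match(r"^(#{1,6})\s+(.*)$", line)
        if match:
            # A new heading closes the previous section, if it has any content.
            if heading or any(part.strip() for part in body):
                sections.append((heading, "\n".join(body).strip()))
            heading, body = match.group(2).strip(), []
        else:
            body.append(line)
    # Flush the last section once the input is exhausted.
    if heading or any(part.strip() for part in body):
        sections.append((heading, "\n".join(body).strip()))
    return sections


# Made-up sample text standing in for documents[0].text.
sample = "# Title\nIntro text.\n\n## Method\nDetails here.\n"
sections = split_by_headings(sample)
for heading, body in sections:
    print(f"{heading!r}: {body!r}")
```

Each pair keeps a chunk of text together with the heading it belongs to, which is the kind of context a retriever can use.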

Let’s combine LlamaParse and LlamaIndex together 🦙

Add the following lines to your current code to integrate LlamaParse with LlamaIndex and see for yourself how easy it is to combine the two. The updated code will look like this:

from llama_parse import LlamaParse
from llama_index.core import VectorStoreIndex
import os
os.environ["LLAMA_CLOUD_API_KEY"] = "llx-..."   # add your LlamaCloud API key here
os.environ["OPENAI_API_KEY"] = "OPEN_AI_KEY"

documents = LlamaParse(result_type="markdown").load_data("C:\\Users\\avima\\Pictures\\crag.pdf")
index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine()
response = query_engine.query("What is CRAG")
print("\n\nResponse:\n")
print(response)  # To check the response
print("\n\nSource Texts:\n")
for source in response.source_nodes:  # To verify if correct nodes have been picked or not
    print(source)


Run the code again and check the terminal for output. It should look something like this:

Conclusion

Parsing is the first step in improving the quality of your RAG app. It helps the LLM understand your data better and more efficiently.
Currently, LlamaParse supports only the PDF format, but more formats will be supported soon!

If you want to learn more, check out the repo: LlamaParse.
Blog reviewed by Logan Markewich from LlamaIndex team.
You can connect with me on Twitter and GitHub, and do check out PythonWarriors.
