RAG for Equipment Manuals
GenAI Prototypes • Worldwide Nanofab Community
Challenge
The challenge was to create a chatbot that could answer questions about PDF laboratory manuals. PDF manuals, especially those for complex equipment, are difficult to search: much of the data sits in tables or in images containing text, and the information needed to answer a single question is often spread across several places in the document.
Our Approach
We built the chatbot as a Retrieval-Augmented Generation (RAG) pipeline. The pdfplumber library ingested the PDF manuals in two passes (one for the text, one for the tables), an OpenAI embedding model vectorized the resulting chunks (1536-dimensional vectors), Weaviate stored the vectors, and the Streamlit library provided the chat interface. Retrieval ranked the vectors in the database by their cosine similarity to the query vector.
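The retrieval step can be illustrated with a minimal sketch. It is not the project's code: the function names `cosine_similarity` and `top_k` are hypothetical, and toy 3-dimensional vectors stand in for the 1536-dimensional embeddings. In the pipeline described above, this ranking would be performed by the vector database rather than in application code.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen for this sketch: zero vectors match nothing
    return dot / (norm_a * norm_b)


def top_k(query_vec, chunks, k=3):
    """Return the k best (score, text) pairs; chunks is a list of (text, vector)."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Toy data (assumed for illustration): chunk text paired with a fake embedding.
chunks = [
    ("Pump-down procedure", [0.9, 0.1, 0.0]),
    ("Gas flow table", [0.1, 0.9, 0.1]),
    ("Emergency shutdown", [0.0, 0.2, 0.9]),
]
query = [0.8, 0.2, 0.1]  # stands in for the embedded user question

results = top_k(query, chunks, k=2)
for score, text in results:
    print(f"{score:.3f}  {text}")
```

The highest-scoring chunks would then be passed to the language model as context for answering the question.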
Results
The project was a great introduction to building a RAG pipeline, and the chatbot was able to answer questions about the PDF manuals. The code for this project is open source on GitHub.
Future Plans
We have switched to using OpenAI multimodal models to understand the PDF files, which are first flattened into images. Additionally, we have found that storing the vectors with pgvector in a Postgres database is much more efficient than using Weaviate, allowing much faster queries and more efficient storage. This new version will be available soon.
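As a rough sketch of what a pgvector-backed store might look like, the snippet below defines a table and a cosine-distance query. The table name `manual_chunks`, its columns, and the helper functions are hypothetical, and the `%s` placeholders assume a psycopg-style Postgres driver; none of this is taken from the new version of the project. The `<=>` operator is pgvector's cosine distance, so `1 - distance` gives the cosine similarity.

```python
def to_pgvector_literal(vec):
    """Format floats in pgvector's text input form, e.g. '[0.1,0.2]'."""
    return "[" + ",".join(repr(float(x)) for x in vec) + "]"


# Hypothetical schema: one row per chunk, with a 1536-dimensional embedding.
CREATE_SQL = """
CREATE EXTENSION IF NOT EXISTS vector;
CREATE TABLE IF NOT EXISTS manual_chunks (
    id bigserial PRIMARY KEY,
    content text NOT NULL,
    embedding vector(1536)
);
"""

# Nearest chunks by cosine distance; smaller distance means more similar.
SEARCH_SQL = """
SELECT content, 1 - (embedding <=> %s::vector) AS cosine_similarity
FROM manual_chunks
ORDER BY embedding <=> %s::vector
LIMIT %s;
"""


def search_params(query_vec, k=5):
    """Build the parameter tuple matching the placeholders in SEARCH_SQL."""
    literal = to_pgvector_literal(query_vec)
    return (literal, literal, k)


# With a psycopg-style cursor (assumed), usage would look like:
#     cur.execute(SEARCH_SQL, search_params(query_embedding, k=5))
params = search_params([0.0, 1.0], k=2)
print(params)
```

Keeping the vectors in Postgres means the chunk text, metadata, and embeddings can live in the same database and be queried with ordinary SQL.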