Nanobot for Labnetwork Data
GenAI Prototypes • Worldwide Nanofab Community
Link to nanobot webapp
Click here to try it out: nanobot.chat.
Challenge
The MIT Labnetwork is an email forum for the micro- and nanofabrication facility community. It has existed for nearly 30 years and is moderated by dedicated faculty at MIT. Although the information shared in this forum is vital knowledge for facility staff and administrators, historical posts are difficult to find or search. The concept was to build a generative AI chatbot based on Retrieval-Augmented Generation (RAG) that would let users enter queries and retrieve relevant posts whenever the topic had previously been covered in the forum.
The Solution
The Labnetwork forum has a web-based archive, indexed by month, dating back to 2007. A Beautiful Soup script was written to scrape the data from each of the archive links. The raw data included header information, copied reply messages, and signature files, none of which would help an accurate retrieval process. The scraped data was programmatically cleaned, with the following items retained in a database:
- Message data
- Sender name
- Sender email
- Message ID
- Thread ID
- Email subject line
- Email body
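The programmatic cleaning step could look roughly like the sketch below. It is only an illustration: the function name clean_body and the three patterns are assumptions, since the actual cleaning rules used for the Labnetwork archive are not described here.

```python
import re

# Hypothetical patterns; the real cleaning rules are not given in this write-up.
QUOTE_LINE = re.compile(r"^\s*>")                 # lines quoted from an earlier message
REPLY_HEADER = re.compile(r"^On .+ wrote:\s*$")   # "On <date>, <name> wrote:"
SIGNATURE_DELIM = re.compile(r"^--\s*$")          # conventional "-- " signature marker


def clean_body(raw_body: str) -> str:
    """Drop quoted replies, reply headers, and everything after a signature marker."""
    kept = []
    for line in raw_body.splitlines():
        if SIGNATURE_DELIM.match(line):
            break  # everything below the marker is treated as the signature
        if QUOTE_LINE.match(line) or REPLY_HEADER.match(line):
            continue
        kept.append(line.rstrip())
    return "\n".join(kept).strip()
```

Rule-based cleaning like this only catches conventional patterns, which is one reason a second cleaning pass can still be useful.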
An LLM was then used to clean the email body, further stripping away any text or characters that were not part of the main message. For each message, the cleaned body with its subject line was then vectorized using an OpenAI embedding model, and the metadata plus the vector were added to a Postgres database with pgvector for vector search.
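As a rough sketch of how a cleaned message might be prepared for storage, the example below combines subject and body into one text, embeds it, and formats the vector in pgvector's bracketed text form. The table name, column names, and the helper functions are hypothetical, and the embedding model is represented by a plain callable so the sketch does not depend on any particular client library.

```python
from typing import Callable, Sequence

# Hypothetical table and column names; the actual schema is not given in this write-up.
INSERT_SQL = (
    "INSERT INTO messages (message_id, thread_id, subject, body, embedding) "
    "VALUES (%s, %s, %s, %s, %s::vector)"
)


def build_embedding_text(subject: str, body: str) -> str:
    """Combine subject and body into one text (an assumption about how they were joined)."""
    return f"Subject: {subject.strip()}\n\n{body.strip()}"


def to_pgvector_literal(vector: Sequence[float]) -> str:
    """Format a vector in pgvector's text input form, e.g. '[0.5,1.0]'."""
    return "[" + ",".join(repr(float(x)) for x in vector) + "]"


def build_insert_params(message: dict, embed: Callable[[str], Sequence[float]]) -> tuple:
    """Return the parameter tuple for INSERT_SQL; `embed` stands in for the embedding model."""
    text = build_embedding_text(message["subject"], message["body"])
    return (
        message["message_id"],
        message["thread_id"],
        message["subject"],
        message["body"],
        to_pgvector_literal(embed(text)),
    )
```

The resulting tuple could then be passed, together with INSERT_SQL, to a Postgres driver's execute call.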
The chatbot was built with a Streamlit front end to provide a user-friendly interface, and was given two different prompts and search methods, plus a thread lookup:
Vector Retrieval:
The user would enter a prompt asking about a particular topic. The prompt was vectorized, and cosine similarity was used to find threads related to the question. These threads were then sent to the OpenAI LLM, which returned a response along with the retrieved data used to generate the answer.
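The ranking step can be illustrated with a small pure-Python sketch. It assumes the query and the stored threads have already been embedded; the function names and the in-memory list of rows are illustrative only. In a Postgres setup with pgvector, the same ordering is typically expressed in SQL with the `<=>` cosine-distance operator instead.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(
    query_vec: Sequence[float],
    rows: List[Tuple[str, Sequence[float]]],
    k: int = 3,
) -> List[Tuple[str, float]]:
    """Return the k (thread_id, score) pairs most similar to the query vector."""
    scored = [(thread_id, cosine_similarity(query_vec, vec)) for thread_id, vec in rows]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]
```

The top-ranked threads would then be placed in the LLM prompt as context for the answer.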
SQL Database Retrieval:
In a slightly more complicated setup, users were told that questions about the data itself (such as the senders, the number of posts, or the institutions) could be asked through the SQL query feature. Because users could not be expected to write SQL themselves, the implementation passed the user's question to the LLM, which constructed the corresponding SQL query. The query was run against the database, and the results were sent back to the LLM to generate the response.
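A minimal sketch of this question-to-SQL flow follows. Several things here are assumptions made for illustration: sqlite3 stands in for Postgres so the example runs on its own, generate_sql is a placeholder for the LLM call, the schema hint is hypothetical, and the read-only check is an added safeguard that the original write-up does not mention.

```python
import sqlite3
from typing import Callable, List, Tuple

# Hypothetical schema description given to the model alongside the question.
SCHEMA_HINT = "Table messages(message_id, thread_id, sender_name, sender_email, subject, body)"


def is_read_only(sql: str) -> bool:
    """Accept only a single SELECT statement (an illustrative safeguard, not a full parser)."""
    return sql.lower().startswith("select") and ";" not in sql


def answer_metadata_question(
    question: str,
    generate_sql: Callable[[str], str],  # stand-in for the LLM call
    conn: sqlite3.Connection,
) -> List[Tuple]:
    """Ask the model for SQL, validate it, run it, and return the rows."""
    prompt = f"{SCHEMA_HINT}\nQuestion: {question}\nSQL:"
    sql = generate_sql(prompt).strip().rstrip(";").strip()
    if not is_read_only(sql):
        raise ValueError("Refusing to run SQL that is not a single SELECT statement")
    return conn.execute(sql).fetchall()
```

The returned rows would then be passed back to the LLM, together with the original question, to produce the final response.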
Thread Retrieval:
Lastly, once the user found an interesting post, they could enter its thread ID in a separate area and retrieve the date-ordered sequence of posts made on that thread.
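The thread lookup amounts to a filtered, date-sorted query. The sketch below uses sqlite3 as a stand-in for Postgres, and the table and column names (messages, sent_at, and so on) are hypothetical. Dates are assumed to be stored as ISO 8601 strings, so sorting them as text also sorts them chronologically.

```python
import sqlite3
from typing import List, Tuple


def fetch_thread(conn: sqlite3.Connection, thread_id: str) -> List[Tuple[str, str, str]]:
    """Return (sent_at, sender_name, subject) for every post in a thread, oldest first."""
    cursor = conn.execute(
        "SELECT sent_at, sender_name, subject "
        "FROM messages WHERE thread_id = ? ORDER BY sent_at ASC",
        (thread_id,),  # parameterized so the user-supplied ID is never spliced into the SQL
    )
    return cursor.fetchall()
```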
Results
This webapp (nanobot.chat) was released at the 2024 UGIM Conference (link to presentation), held at MIT, to a great response. It is open to the entire community and has been used by nanofabs worldwide. This was the first presentation of the positive effect of generative AI on important data that would otherwise be largely inaccessible by other methods.
Future Plans
The app's data is scraped and updated monthly. The future plan is for nanobot.chat to host many more prototype chatbots that showcase the many uses of this technology to the nanofab community.