Implementation Lessons

The only tricky part was that this setup uses new services, so you have to be comfortable exploring and learning the order of operations for creating them, and working out which services are actually necessary.

Azure AI Studio service not necessary

There was a little confusion when creating this because you interact with it through AI Studio. I initially thought I needed to create an AI Studio service. However, you can access the studio dashboard once you create the OpenAI service; there is no need to create an additional AI Studio service.
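For reference, provisioning only that one resource looks something like the sketch below, using the azure-mgmt-cognitiveservices Python SDK. All of the names and IDs are placeholders, and the exact model fields may differ between SDK versions; the point is just that the OpenAI resource is an ordinary Cognitive Services account with kind="OpenAI", and the studio dashboard comes with it.

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.cognitiveservices import CognitiveServicesManagementClient
from azure.mgmt.cognitiveservices.models import Account, Sku

# Create only the OpenAI resource; AI Studio is reached from its dashboard,
# not provisioned as a separate service. Names and IDs are placeholders.
client = CognitiveServicesManagementClient(
    credential=DefaultAzureCredential(),
    subscription_id="<subscription-id>",
)

poller = client.accounts.begin_create(
    resource_group_name="my-rg",
    account_name="my-openai",
    account=Account(kind="OpenAI", sku=Sku(name="S0"), location="eastus"),
)
account = poller.result()
print(account.properties.endpoint)  # the studio is opened from this resource
```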

How you preprocess the text matters a lot

I converted the PDFs to plain text for ease of ingestion, though there is an option to use OCR to handle formats like PDF natively. The indexing service can't read all of your text at once, so you have to decide how to split it, and that decision can drastically affect the way the bot responds. Creating breaks at points that aren't natural (pages are natural; arbitrary character offsets are not) means the references in the responses become meaningless. So there is a balance between single pages of text, which may not give the model enough context, and N-character chunks, which can't easily be traced back to a place in the document.
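To make the trade-off concrete, here is a sketch of the two splitting strategies in Python. This is not what the indexer does internally; the form-feed page delimiter, chunk size, and overlap are all assumptions I chose for illustration.

```python
# Two ways to split the extracted text. Assumes the PDF-to-text step left a
# form feed ("\f") between pages; chunk size and overlap are arbitrary examples.

def split_by_page(text: str) -> list[dict]:
    """Natural breaks: each chunk maps back to a citable page,
    but a short page may not give the model enough context."""
    return [
        {"content": page.strip(), "ref": f"page {i + 1}"}
        for i, page in enumerate(text.split("\f"))
        if page.strip()
    ]

def split_by_chars(text: str, size: int = 2000, overlap: int = 200) -> list[dict]:
    """Fixed windows: more context per chunk, but a reference like
    'chars 3600-5600' is meaningless to a reader chasing a citation."""
    chunks, start = [], 0
    while start < len(text):
        end = min(start + size, len(text))
        chunks.append({"content": text[start:end], "ref": f"chars {start}-{end}"})
        start += size - overlap
    return chunks
```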

Indexing service input limits will drop excess text with only a quiet warning

Depending on the service tier you select, the indexer will have a limit of N characters for its input. If an input file exceeds this limit, the indexer drops the excess characters without any real warning. You will only know this has happened if you go look at the indexer log, which will show a warning that the input was truncated. So if your questions happen to draw on the first part of a document, you may not notice the missing data at first.
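Because the only native signal is that buried log entry, it may be worth running your own pre-flight check before indexing. A minimal sketch follows; the limit constant is a placeholder, since the real figure depends on the tier you selected.

```python
from pathlib import Path

# Placeholder: look up the actual input limit for your service tier.
TIER_CHAR_LIMIT = 32_000

def warn_on_truncation(folder: str) -> None:
    """Flag files the indexer would silently truncate."""
    for path in sorted(Path(folder).glob("*.txt")):
        n_chars = len(path.read_text(encoding="utf-8"))
        if n_chars > TIER_CHAR_LIMIT:
            print(
                f"WARNING: {path.name} is {n_chars:,} chars; "
                f"everything past {TIER_CHAR_LIMIT:,} would be dropped."
            )

warn_on_truncation("./docs")
```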

Increasing the number of documents to retrieve can't outweigh poor context

When you add your own data there is a slider that sets the maximum number of documents sent to ChatGPT as context along with your question. I had hoped that, when splitting by page, increasing this number might make up for the lost context, but it didn't help as much as I had hoped. Having longer input files seemed to return better results.
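The same knob is available outside the portal if you call the service directly. Here is a sketch of what that request looks like with the openai Python package; the data_sources field names have shifted across Azure API versions, and every endpoint, key, and deployment name below is a placeholder.

```python
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint="https://<your-openai>.openai.azure.com",
    api_key="<openai-api-key>",
    api_version="2024-02-15-preview",
)

response = client.chat.completions.create(
    model="<chat-deployment-name>",
    messages=[{"role": "user", "content": "What does the report say about X?"}],
    extra_body={
        "data_sources": [
            {
                "type": "azure_search",
                "parameters": {
                    "endpoint": "https://<your-search>.search.windows.net",
                    "index_name": "<index-name>",
                    "authentication": {"type": "api_key", "key": "<search-api-key>"},
                    # The same setting the portal exposes as a slider:
                    "top_n_documents": 5,
                },
            }
        ]
    },
)
print(response.choices[0].message.content)
```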


Retrospective

This was my first use of ChatGPT to create a chatbot with custom data. In the end it took fewer services than I expected to set up and run, but determining the order of creation and how each piece interacts was not simple at first. There are also a lot of parameters to tune, and that will create a lot of work as you try to set this up for production. The answers can be good for some questions and not so good for others. The wording a user chooses when asking matters, but you can't control it. The model parameters may suit some questions and need adjusting for others. So finding that perfect mix will be difficult.

I think there will need to be a lot of human feedback when setting these up, and again whenever updates are made. For now I would think of them as tools that help expert users navigate large bodies of context rather than as public-facing tools for conveying that context.