I had a project over the summer that required me to do a deep dive into various concepts around AI. Approximately 200 academic papers later, I came away with a laundry list of things I wanted to try. I am primarily interested in the capabilities of LLMs, but I am not particularly interested in the heavy market hype they are enjoying right now.
As with most technology, I think there are useful applications for it, but much of the market-facing content being produced right now is over the top and obscures the things the technology does well. Given the amount of unstructured data produced daily, LLMs tuned for specific purposes can play a role in processing and curating that data.
LLMs are noted for “hallucinations,” which are really just statistical outliers in content generation. In essence, the “math” of language generation is sometimes wrong. Additionally, pre-trained LLMs are generally only reliable up to the cutoff date of their training corpus. Techniques such as retrieval-augmented generation (RAG) have emerged to mitigate some of these issues.
RAG is conceptually simple – you first retrieve supporting information for your topic from a source that is known or trusted, then submit the retrieved information along with your prompt to augment the generation of a response. Essentially, RAG takes advantage of an LLM’s propensity for similarity analysis via vector embeddings. An LLM embeds an incoming prompt using the same model that was used to embed its stored corpus of information, then uses vector search to find the most relevant content for a response. RAG adds information to the incoming prompt to sharpen that similarity search.
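To make the similarity idea concrete, here is a minimal sketch of the retrieval step. The names are mine for illustration (this is not part of the script at the end of the post), and it assumes you already have embedding vectors for your prompt and passages:

```python
# Minimal sketch of embedding-based similarity search, the mechanism RAG
# leans on. Illustrative only; assumes embeddings are already computed.
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def top_k(prompt_vec: np.ndarray,
          passages: list[tuple[str, np.ndarray]],
          k: int = 5) -> list[str]:
    """Rank stored (text, vector) passages by similarity to the prompt."""
    ranked = sorted(passages,
                    key=lambda p: cosine_similarity(prompt_vec, p[1]),
                    reverse=True)
    return [text for text, _ in ranked[:k]]
```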
This diagram illustrates the flow:
RAG is something I wanted to play with, so I wrote a simple Python script to use it in conjunction with the ChatGPT API. I did this so that I could avoid setting up my own LLM for now. I’ll get to that in the near future, but I wanted to spend more time on the workflow than on fiddling with infrastructure.
I took another liberty in that I am using DuckDuckGo (DDG) for the retrieval. I really should use some corpus of text I have vetted and set up in PostgreSQL or Elastic or something, but 1) I don’t have that much data lying around and 2) again, I didn’t want to spend a lot of time fiddling with infrastructure. So grant me some literary license on the retrieval aspect of the workflow. The Python source code for this experiment is at the end of this post.
Following the workflow through the code, I get the user’s prompt in ‘main’, which then gets passed to the ‘search_duckduckgo’ function to run the DDG query. The function takes the first five responses and returns the body (excerpt) of each in an array. In the ‘generate_response_with_rag’ function, a prompt is constructed consisting of some instructions for the LLM, the original prompt provided by the user, and the excerpts from the DDG search. This prompt is submitted through the OpenAI API and the response is printed.
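Condensed, the core of the workflow looks roughly like this. It’s a sketch rather than the full script (that’s at the end of the post), assuming the duckduckgo_search and openai packages; the model name is a placeholder:

```python
from duckduckgo_search import DDGS
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def search_duckduckgo(query: str, max_results: int = 5) -> list[str]:
    """Run a DDG text search and return the body (excerpt) of each result."""
    with DDGS() as ddgs:
        return [r["body"] for r in ddgs.text(query, max_results=max_results)]

def generate_response_with_rag(prompt: str, excerpts: list[str]) -> str:
    """Build an augmented prompt from the excerpts and submit it to the LLM."""
    context = "\n\n".join(excerpts)
    augmented = (
        "Use the following excerpts as additional context when answering.\n\n"
        f"Excerpts:\n{context}\n\nQuestion: {prompt}"
    )
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder; substitute whichever model you use
        messages=[{"role": "user", "content": augmented}],
    )
    return response.choices[0].message.content

def generate_response(prompt: str) -> str:
    """The 'without RAG' baseline: the same call with no retrieved context."""
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

if __name__ == "__main__":
    question = "What is the latest version of GDAL/OGR?"
    excerpts = search_duckduckgo(question)
    print("With RAG:\n", generate_response_with_rag(question, excerpts))
    print("Without RAG:\n", generate_response(question))
```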
I then do the same thing again without the excerpts to produce “with RAG” and “without RAG” versions (the sketch above includes this baseline). So, does RAG make a difference? The results of my limited experiments here suggest it does, but it’s not a panacea. The two results are below. My prompt was “What is the latest version of GDAL/OGR?” The response without RAG clearly states that its information is limited to October 2023, at which point the version was 3.6.4 (I took this at face value).
In the version with RAG, there was good news and bad news. The good news is that the additional content made the model aware that version 3.8.0 exists. The bad news is that 3.8.0 isn’t the latest version. I am chalking this up to two things: 1) I am using a DDG search rather than an internal corpus, so my retrieval is subject to the order in which results come back from DDG – perhaps version 3.8.0 has better SEO than later versions; 2) I am only augmenting with five excerpts, so I am not really expanding ChatGPT’s worldview by very much.
So, there’s promise, but I really can’t put off fiddling with infrastructure if I want to do this right. I’ll revisit this and update as I make progress.
Generated Response (Without RAG):
As of my last update in October 2023, the latest stable version of GDAL/OGR is 3.6.4, released on September 22, 2023. GDAL, which stands for Geospatial Data Abstraction Library, is widely used for reading, writing, and transforming raster and vector geospatial data formats. OGR is the vector component of the library.
The GDAL/OGR library is continuously updated, so it’s advisable to check the [official GDAL website](https://gdal.org) or the project’s [GitHub repository](https://github.com/OSGeo/gdal) for the most current version and release notes. These resources provide detailed information on new features, bug fixes, and enhancements in each release. Additionally, users can also find installation instructions and documentation to assist with their geospatial data needs.
Generated Response (With RAG):
The latest version of GDAL/OGR is 3.8.0, which includes enhancements and updates to both the raster and vector data access library. GDAL/OGR, developed by the Open Source Geospatial Foundation, supports multiple geospatial data formats and provides a unified model for working with them through various programming language bindings and command-line tools. For detailed information on the new features and fixes included in the 3.8.0 release, you can visit the GDAL GitHub repository. Additionally, prior to this release, GDAL/OGR 3.7.0 was announced, highlighting ongoing improvements in the library’s functionality. For the latest updates, it’s always a good practice to check the official GDAL website or its GitHub page.