Simple Geocoding with ChatGPT

Early in my career, I dealt with a requirement to post-process a corpus of existing documents to “geocode” them. This meant identifying the locations referenced in each document, calculating a minimum bounding rectangle (MBR), and writing the MBR, the file path, and some associated ID into a database, so that documents could be surfaced via spatial queries.

At the time, we had a commercial gazetteer that our customer had purchased, along with some logic that bounced a full-text search against it to see if there were any locations. Later on, we got really advanced and built a Lucene index. The main drawback was that, if a location wasn't in the gazetteer, it would fall through and we would never know. It was also easy to misidentify places with similar names, because the approach had no contextual or semantic awareness.

The process described above is called Named Entity Recognition (NER) and it is a fairly standard piece of Natural Language Processing (NLP) at this point. Locations (places) are simply one kind of entity that can be extracted with NER.
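
For reference, this is roughly what location extraction looks like with a conventional NER library. This sketch uses spaCy, purely as an illustration; it is not what we used back then:

import spacy

# Requires the small English model: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

doc = nlp("John traveled from New York to San Francisco and then flew to Tokyo.")
# GPE is spaCy's label for geopolitical entities (countries, cities, states)
print([ent.text for ent in doc.ents if ent.label_ == "GPE"])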

As technology has progressed over the course of my career, I have occasionally revisited old use cases, mainly because working with a known problem set allows me to focus on the technology. I recently revisited that old NER use case with ChatGPT and, afterward, the OpenAI API. I've done some previous work with the API, and NLP/AI/ML is figuring more prominently in my consulting work overall.

The basic premise was to get ChatGPT to extract all of the locations referred to in a document and return a GeoJSON FeatureCollection. That round-trip isn’t exceptionally exotic – I was more interested in how well ChatGPT would do the NER. I started off interactively.

I first printed a BBC article about the war in Ukraine to PDF (https://www.bbc.com/news/articles/c6pyv8q94g1o) and then uploaded it to ChatGPT. I could have pointed it at the URL, but I wanted to test different models and not all ChatGPT versions support browsing. This post focuses on ChatGPT 4.0 for the interactive piece.

I had already indicated that I was interested in NER, so ChatGPT got right to work after I uploaded the PDF.

Since it had already identified locations, I asked it to determine the coordinates for each one, which it did fairly quickly.

That's not exceptionally useful on its own, so I asked it to give me GeoJSON. Here, I needed to be fairly specific about how I wanted the output formatted. On my first pass, it only gave me the geometry. The next pass had another anomaly, so it was best to tell it exactly what I wanted.
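
To be concrete, what I wanted was a full FeatureCollection rather than bare geometries; something along these lines (a minimal example, with an illustrative coordinate pair):

{
    "type": "FeatureCollection",
    "features": [
        {
            "type": "Feature",
            "geometry": {"type": "Point", "coordinates": [30.5234, 50.4501]},
            "properties": {"name": "Kyiv"}
        }
    ]
}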

Finally, I displayed the output in QGIS to ensure the locations were correct. It worked well, even picking up London from the byline.

All of that was a nice proof of concept, but doing this work interactively isn't very scalable, so I looked into automating it with the OpenAI API. The first subtle difference is that you can't pass in a PDF the way you can through the UI. With the API, you have to do the work of stripping out the text in your own code and then pass it in as part of your prompt. For this run, I didn't do that; I pasted the text into a file and read it from there.
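
If you do want to go straight from the PDF, the text-stripping step can be fairly small. Here is a sketch using pypdf, my choice for illustration; any PDF text-extraction library would do, and the file name is hypothetical:

from pypdf import PdfReader

def pdf_to_text(path):
    # Concatenate the extracted text from every page of the PDF
    reader = PdfReader(path)
    return "\n".join(page.extract_text() or "" for page in reader.pages)

# Hypothetical file name for the printed article
# text = pdf_to_text("ukraine_article.pdf")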

Remember that you’ll need an OpenAI API key and you’ll need to set up billing in order to use it. You can manage billing thresholds easily and it also doesn’t cost a lot. My numerous test runs for this post cost me 20 cents.
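
One note on key handling: rather than hardcoding the key as in the listing below, you can pull it from an environment variable. The OpenAI client reads OPENAI_API_KEY by default, so you can even construct it with no arguments:

import os
from openai import OpenAI

# Read the key explicitly from the environment...
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
# ...or rely on the client's built-in OPENAI_API_KEY lookup
client = OpenAI()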

import json
from openai import OpenAI

# Replace 'your-api-key' with your actual OpenAI API key
client = OpenAI(api_key="your-api-key")

def extract_locations_from_text(text):
    # Wrap some light prompt engineering around the input text
    prompt = f"Extract all locations, along with their longitudes and latitudes, from the following text:\n\n{text}\n\nReturn the output as a GeoJSON FeatureCollection with each location as a feature and the location name as a feature property. Include only the GeoJSON in the response. Do not use Markdown."
    print(prompt)
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "user", "content": prompt}
        ] 
    )
    locations = response.choices[0].message.content.strip()
    # Do a little cleanup, just in case ChatGPT sends Markdown anyway
    locations = locations.replace("```json", "").replace("```", "")

    # Parse the response into a JSON document; this primarily validates the string
    feats = json.loads(locations)
    # Send it back, pretty-printed
    return json.dumps(feats, indent=4)

# Example text for initial testing
text = """
John traveled from New York to San Francisco. He visited the Golden Gate Bridge and then drove to Los Angeles. 
Later, he took a flight to Tokyo and spent a week exploring the city.
"""
# Replace the sample text with the file contents; easy to comment out if needed
with open('ukr.txt', 'r') as in_file:
    text = in_file.read()
# Extract locations and return as GeoJSON
locations_json = extract_locations_from_text(text)
# Write the GeoJSON to disk
with open("ukr.geojson", "w") as out_file:
    out_file.write(locations_json)
# Visual indication that something happened
print("Extracted Locations JSON:", locations_json)

As you can see above, I read the text out of a text file and embed that into the prompt I am sending to the API. Again, the prompt should be fairly specific. You’ll notice that I told it to only send me the GeoJSON and to not use Markdown. More on that in a bit.

Note that the model I am using is "gpt-4o," which is the latest model as of this writing; it was released the day before this post. I also did runs with "gpt-4" and "gpt-3.5-turbo." The choice of model affects your prompt engineering. For example, the 3.5 model recognized fewer locations, and I had to rework the input text to improve its recall. (Example: I had to change "Kyiv" to "the city of Kyiv" to get it recognized.) The 4.0 model initially refused to identify coordinates, telling me to use an external geocoder. I didn't fight that too much, since gpt-4o was available. That model found all of the locations, geocoded them, and returned the GeoJSON. It also wrapped everything in Markdown, including the not-so-helpful text "Here are your results." That's why I included the specific formatting instructions in the prompt.
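
If you want to compare models side by side, a quick harness along these lines works. This is a sketch with a trivial prompt, not the exact comparison I ran:

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

prompt = "Extract all locations from the following text:\n\nHe drove from Kyiv to Lviv."
# Same prompt against each model; useful for eyeballing recall differences
for model_name in ("gpt-3.5-turbo", "gpt-4", "gpt-4o"):
    response = client.chat.completions.create(
        model=model_name,
        messages=[{"role": "user", "content": prompt}]
    )
    print(model_name, "->", response.choices[0].message.content)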

The main takeaway here is that different models need different prompt engineering. That's not surprising, given the nature of how vector embeddings work. This particular use case didn't require a high degree of precision, so as long as it caught most of the locations in the document and came close on the coordinates, it was successful. I haven't tried anything yet that would rely on more precision, but I suspect it won't take long to find the edges of reliability.
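
One cheap way to start probing those edges is to validate the structure of what comes back. A sketch like the following catches malformed output, though not a plausible-but-wrong coordinate:

def sanity_check(fc):
    # Lightweight structural checks on a parsed FeatureCollection
    assert fc.get("type") == "FeatureCollection"
    for feat in fc.get("features", []):
        geom = feat.get("geometry") or {}
        assert geom.get("type") == "Point"
        lon, lat = geom["coordinates"]
        # Coordinates must at least fall in valid longitude/latitude ranges
        assert -180.0 <= lon <= 180.0 and -90.0 <= lat <= 90.0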

So here is the map of the locations returned by the API. You’ll notice London is gone. That is my fault because I left the byline out when I pasted the text into the text file.

This experiment showed promising results. There is a place in geospatial workflows for "AI" to automate many tasks beyond NER, and the rapid advancement of pre-trained models is enabling more use cases all the time. If AI/ML is approached with realism, rather than the breathless hype the "AI" marketing machine has spun up, there is room for it in daily use.