Local GenAI Code Completion With Context

A couple of weeks ago, news broke about a new large language model (LLM) named DeepSeek-R1. The publisher, DeepSeek, is a Chinese company founded in 2023 that focuses on AI research and development. In response, I decided to undertake a small project: building a simple context-aware code-completion utility using a locally hosted LLM and retrieval-augmented generation (RAG).

DeepSeek R1 Title Slide Image

Some of the promises/claims about this new LLM on its release included the following:

  • It is open-sourced and released under the terms of the MIT License
  • It is available for use as a local model in systems like ollama, in addition to a DeepSeek-offered SaaS/PaaS cloud service
  • Compared to peer models, it was trained using a lot of lower-cost (and lower power) hardware
  • Its performance is claimed to be on par with much larger models
  • It remains performant when used even on some mid-range and lower-end hardware

As an open-source advocate, I was intrigued by the openness of the model as well as the performance claims. I had been curious about the generative AI space for a few years, and had even piloted GitHub Copilot while working at Vector0, where I found its popular Vim plugin to offer a decent amount of programming assistance.

The blog entry below contains some code and docker compose templates, which are also available on my GitHub account:


Disclaimer Regarding Cloud-Hosted AI and Security

One of the common criticisms of cloud-hosted AI assistant services, such as Copilot, is their data-collection potential and the privacy ramifications of that approach. Many companies have established policies that take a “deny by default” approach to Generative AI services, and many employees violate those same policies every day, whether unknowingly or intentionally. A lot of this is driven by product marketing departments, which often go so far as to collaborate with social media content creators to promote cloud-hosted, third-party generative AI products that offer to organize your calendar, listen to and take notes during meetings, and even rewrite company emails for you. This style of advertising presents the employee’s unilateral GenAI tool selection (remember, the “employee” is played by an actor) as normal everyday behavior, ignoring the important detail that in nearly every corporate environment, deciding which third-party services are permitted for employee use is the responsibility of the IT policymaking department.

Here’s one example of this pattern I have started to see proliferate:

In the above video, the speaker not only teaches the viewer how to use Notion, OpenAI, and Otter, but also up-sells the idea of using the “free of charge” tier, which largely limits the vendor’s liability in the event of a cybersecurity incident and often implies a permission grant for the vendor to use whatever content is provided to it for whatever purposes it sees fit. It is important to recognize that very few employees are authorized to grant such permission to a third party for company-private content such as meetings, schedules, slides, and emails. As a general rule, any regular employee who followed the instructions outlined in the video would almost certainly violate multiple corporate policies regarding authorized use of IT systems.

Don’t get me wrong, these tools are all really helpful, but remember that if you want to use them to optimize your work (the use case indicated by the video’s content creator), you need to complete the following steps before you can start:

  1. Reach out to your IT department or IT Compliance / Cyber Security team
  2. Provide the list of tools you are interested in evaluating (include even the social media content describing them)
  3. Get explicit written permission to use the tools, and make sure to inquire as to whether the company places limits on their authorized use
  4. If your employer denies the request, then that is the decision you will have to adhere to

Why am I exploring this tangent? Well, to bring it back to the engineering topic, I saw the embrace of open-source here as an opportunity to do some research and development toward a locally-hosted GenAI solution that could potentially be compatible with these entirely reasonable practices.

Docker Compose

For the examples below, I rely heavily on docker compose. It is very useful for managing smaller, single-instance service deployments that consist of multiple separate docker containers communicating with one another.

Installing and setting it up is outside the scope of this post, but the project website has documentation for Linux, Windows, and macOS:


Deploying Ollama and Fetching Models

Ollama with code image

Though DeepSeek-R1 was the catalyst for my interest in this area, this guide is by no means specific to DeepSeek’s models, and should be compatible with any of the models posted in the Ollama online model library:

One thing I learned while exploring the GenAI ecosystem is that package and system dependency issues are common. As I typically do, I elected to mitigate these by using docker and docker compose to run the software in containers as a managed cluster of services. This also allows the solution to be expanded with additional components, which can be started and stopped with the docker compose up, down, restart, and related commands.

To start off, I created the initial docker-compose.yml template, which defines a single service container that runs ollama, exposes access to the device nodes for my GPU (an AMD Radeon 6800 XT), and uses the container image from the Ollama project built for AMD’s ROCm API. For Nvidia GPUs, the process is slightly more involved and requires installing nvidia-ctk. Those instructions, along with a further explanation of the AMD example below, are available here.

services:
  ollama:
    # Use the rocm version of the image for AMD GPUs - :latest for non-AMD-GPU usage
    image: ollama/ollama:rocm
    container_name: ollama
    ports:
      # Expose port 11434 so that plugins like ollama-vim and others can use the service
      - "127.0.0.1:11434:11434"
    networks:
      - ollamanet
    devices:
      # kfd and dri device access needed for AMDGPU ROCm support to work
      - /dev/kfd:/dev/kfd
      - /dev/dri:/dev/dri
    volumes:
      # Mount the current user's ~/.ollama folder to /root/.ollama (change the first half of this to anywhere you prefer)
      - ~/.ollama:/root/.ollama
    restart: always

# Give an easy name to the virtual network that will be created for integrating the
# ollama and future co-hosted services as the example is built out
networks:
  ollamanet:

Note that the volumes section above maps filesystem activity to a specific folder on the host system. This keeps ollama’s persistent local data, such as downloaded models, intact even if the containers are destroyed (so you won’t lose your downloaded models).

After the above is created, you should be able to run the following to start the new ollama service:

docker compose up --wait

The following output, or similar, indicates success:

[+] Running 1/1
 ✔ Container ollama  Healthy

To see if the container started properly:

docker compose exec ollama ollama --version
ollama version is 0.5.7-0-ga420a45-dirty

Listing models (should be none after a fresh install) is as easy as:

docker compose exec ollama ollama list
NAME                        ID              SIZE      MODIFIED

Next, try pulling a new model (start with the deepseek-r1:7b model):

docker compose exec ollama ollama pull deepseek-r1:7b

Many of the models are multi-GB bundles, so downloads will take some time. The output should look something like this, with the bars filling in as the download proceeds:

pulling manifest 
pulling 96c415656d37... 100% ▕█████████████████████████████████████████▏ 4.7 GB                         
pulling 369ca498f347... 100% ▕█████████████████████████████████████████▏  387 B                         
pulling 6e4c38e1172f... 100% ▕█████████████████████████████████████████▏ 1.1 KB                         
pulling f4d24e9138dd... 100% ▕█████████████████████████████████████████▏  148 B                         
pulling 40fb844194b2... 100% ▕█████████████████████████████████████████▏  487 B                         
verifying sha256 digest 
writing manifest 
success

Re-running the “list models” command from earlier, the model should be visible:

docker compose exec ollama ollama list
NAME                        ID              SIZE      MODIFIED
deepseek-r1:7b              0a8c26691023    4.7 GB    2 minutes ago

Verify that the service is listening on port 11434:

ss -tunl | grep 11434
tcp   LISTEN 0      4096                                127.0.0.1:11434      0.0.0.0:*   

The API exposed by ollama is a JSON-based REST API (documented here). You can interact with it using any tool typically used for REST APIs. For example, the curl tool can send a prompt to the API to ask a question of the deepseek-r1:7b model, and piping its output to jq produces a more human-readable response.

curl -X POST -H 'Content-Type: application/json' -d '
{
  "model": "deepseek-r1:7b",
  "prompt": "Hello, tell me about yourself concisely",
  "stream": false
}' http://localhost:11434/api/generate | jq .

When not using a UI, the stream field should be set to false to ensure the response is not delivered as a real-time stream of generated text. UIs commonly use a UX pattern where AI responses appear to be “typed” as the LLM generates them, so the API does not wait until the full response is formed before beginning the HTTP reply; instead, it streams a sequence of partial responses to the client in real time as the text is generated. For scripted work, this is both noisy and more complex to parse.

The response:

{
  "model": "deepseek-r1:7b",
  "created_at": "2025-02-08T21:34:28.029426762Z",
  "response": "<think>\nAlright, someone just asked me to tell them about myself in a concise way. I need to figure out the best way to respond.\n\nFirst off, they probably want a quick overview without too much detail. Since I'm an AI, it's important to highlight that I don't have personal experiences or emotions. I should mention my purpose is to assist and provide helpful information or conversation.\n\nI should keep it friendly but professional. Maybe start by saying I'm an AI designed for help. Then explain my limitations in a polite way. Emphasize my goal of being as useful as possible without overstepping.\n\nAlso, adding something about looking forward to assisting them shows willingness to help. Keep the tone positive and open-ended so they feel comfortable asking more questions.\n</think>\n\nI am an artificial intelligence designed to assist with information, answer questions, and provide helpful conversations. My purpose is to support you in a wide range of tasks and inquiries. While I don't have personal experiences or emotions, I aim to be as useful as possible while remaining neutral and non-judgmental. Let me know how I can help!",
  "done": true,
  "done_reason": "stop",
  "context": [
    151644,
    9707,
....
    1492,
    0
  ],
  "total_duration": 5524044952,
  "load_duration": 2055321942,
  "prompt_eval_count": 12,
  "prompt_eval_duration": 32000000,
  "eval_count": 227,
  "eval_duration": 3435000000
}

The response field in the returned object contains the generated text to display to whoever submitted the prompt.

This validates that ollama is working properly, listening to the correct TCP port (11434) for API requests, and that it is able to load and execute models from the Ollama Library.
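
As noted above, leaving stream at its default of true causes /api/generate to return a sequence of JSON objects, one per line, rather than a single object. Here is a minimal Python sketch of consuming that streamed form, using the same endpoint, model, and prompt as the curl example:

import json
import requests

# Minimal sketch: consume the streamed (stream: true) form of /api/generate.
# Each line of the HTTP response body is a standalone JSON object containing a
# fragment of generated text; the final object has "done": true.
data = {
    "model": "deepseek-r1:7b",
    "prompt": "Hello, tell me about yourself concisely",
    "stream": True,
}

with requests.post("http://localhost:11434/api/generate", json=data, stream=True) as r:
    for line in r.iter_lines():
        if not line:
            continue
        chunk = json.loads(line)
        # Print each fragment as it arrives, mimicking the "typing" UX
        print(chunk.get("response", ""), end="", flush=True)
        if chunk.get("done"):
            print()
            break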

Feel free to try downloading other models using the above steps as a guide, and test out prompting them via the API. Some models that will be explored later on are:

  • deepseek-coder-v2:16b (or, simply: deepseek-coder-v2:latest)
  • qwen2.5-coder:14b
  • qwen2.5-coder:7b
  • nomic-embed-text:latest

Installing Open-WebUI AI Workspace Platform

The Open-WebUI project is a web-based AI workspace platform. It provides a user-friendly interface to supported GenAI systems, such as ollama. An example GIF of it in action is below:

Open WebUI example GIF

So, the next step in the process for improving the utility of our ollama installation is to add a container running open-webui to the services cluster.

Update the docker-compose.yml file by adding another service named open-webui:

services:
  ollama:
    # Use the rocm version of the image for AMD GPUs - :latest for non-AMD-GPU usage
    image: ollama/ollama:rocm
    container_name: ollama
    ports:
      # Expose port 11434 so that plugins like ollama-vim and others can use the service
      - "127.0.0.1:11434:11434"
    networks:
      - ollamanet
    devices:
      # kfd and dri device access needed for AMDGPU ROCm support to work
      - /dev/kfd:/dev/kfd
      - /dev/dri:/dev/dri
    volumes:
      # Mount the current user's ~/.ollama folder to /root/.ollama (change the first half of this to anywhere you prefer)
      - ~/.ollama:/root/.ollama
    restart: always

  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    container_name: open-webui
    ports:
      # Expose port 8080 to host port 3000 so that we can access the open-webui from http://localhost:3000
      - "127.0.0.1:3000:8080"
    environment:
      - OLLAMA_BASE_URL=http://ollama:11434
    networks:
      - ollamanet
    volumes:
      - ~/.open-webui-data:/app/backend/data
    restart: always
    depends_on:
      - ollama

# Give an easy name to the virtual network that will be created for integrating the
# ollama and future co-hosted services as the example is built out
networks:
  ollamanet:

Similar to the ollama service, I use the volumes section to expose a host-filesystem path for the container to store persistent data even when containers are destroyed and rebuilt.

Once added, the same command can be used to start it - docker compose will intelligently recognize that ollama is already running and just initialize the new open-webui service. In the above template, the new service is set up to listen on TCP port 3000 on the local system:

docker compose up --wait
[+] Running 2/2
 ✔ Container ollama      Healthy
 ✔ Container open-webui  Healthy

Similarly, we can verify the service is listening on port 3000:

ss -tunl | grep 3000
tcp   LISTEN 0      4096                    127.0.0.1:3000       0.0.0.0:*

And, finally, using a web browser, you should be able to visit http://localhost:3000 and get the initial splash page for the Open WebUI service, where you can log in. If this is the first run and the ~/.open-webui-data directory is empty, it will ask you to create a new admin user.

Once a new user is created and you’re logged in, you can interact with LLMs using a friendly chat interface, and use the workspace-management tools to organize your conversations. Open WebUI can also be used to pull new models into your ollama instance, rather than relying on the command-line approach used earlier. The platform provides a large amount of additional integration capability with ollama for more advanced activities, such as query and retrieval against a knowledge base, tasks implemented in Python code that perform automation actions and incorporate their results into responses, and other features.

Using Python to Prompt AI

Ollama can also be interacted with via Python. For example, the following Python code will send a short prompt to the API, save the response, and display both on STDOUT:

import requests
from pprint import pprint

model = "deepseek-r1:7b"

# Build a JSON prompt to send to the deepseek-r1:7b LLM
data = {
    "model": model,
    "prompt": "Hello how are you?",
    "stream": False,
}

# POST the prompt to ollama
r = requests.post("http://localhost:11434/api/generate", json=data)

try:
    # Display the original prompt
    print("Prompt:")
    print(data["prompt"])

    # Display the Response from the AI
    print("\nResponse:")
    print(r.json()['response'])
except Exception as e:
    # In the event of an exception, dump the associated error information
    pprint(e)
    pprint(f"Error: {r.status_code} - {r.text}")

When run (your exact response may vary):

Prompt:
Hello how are you?

Response:
<think>

</think>

Hello! I'm just a virtual assistant, so I don't have feelings, but I'm here and ready
to help you with whatever you need. How are you doing? 😊

Two Models Have a Conversation

Using Python, some more complex implementations are possible. For example, the following Python code facilitates a conversation between two different LLMs, with the script acting as a proxy between them and logging their conversation to STDOUT. Note the new remove_think(msg) function, which strips out the contents of the <think> tags; these are the LLM’s internal reasoning and would not be perceived by the other party in a normal conversation.

import requests
from pprint import pprint

# To remove the introspective <think>...</think> tags from any response text
# before it is sent to the other model.
def remove_think(msg):
    think_open = msg.find("<think>")
    while think_open >= 0:
        think_close = msg.find("</think>", think_open)
        if think_close >= 0:
            # Remove this <think>...</think> block (8 == len("</think>"))
            msg = msg[:think_open] + msg[think_close+8:]
        else:
            break
        # Look for any remaining <think> blocks before looping again
        think_open = msg.find("<think>")
    return msg

# We will try to have two different models engage in a proxied conversation, where
# this Python code handles proxying the responses from either as prompts for the
# other.
model1 = "deepseek-r1:7b"
model2 = "llama3.1:8b"

# Counter for the number of responses from each model that we want to stop at
count = 10

# Build an initial JSON prompt to bootstrap the conversation
data1 = {
    "model": model1,
    "prompt": "I need you to role-play with me. Your name is Alan and you are lost in the woods. I want you to introduce yourself to me, and ask me how to escape the woods. Do not explain that you are pretending, talk to me as if you are Alan. Also do not tell me what I would say. I will tell you my own responses.",
    "stream": False,
}

while count > 0:
    # POST the prompt to ollama using model1
    r = requests.post("http://localhost:11434/api/generate", json=data1)

    try:
        # Display the prompt sent to model1
        print("Prompt to Data1:")
        print(remove_think(data1["prompt"]))

        # Display the Response from model1
        print("\nResponse from Data1 to Data2:")
        print(r.json()['response'])

        # Format a prompt to model2 that uses model1's response as the prompt text
        data2 = {
            "model": model2,
            "prompt": remove_think(r.json()['response']),
            "stream": False,
        }

        # Send the new prompt to model2
        r = requests.post("http://localhost:11434/api/generate", json=data2)

        # Display the prompt sent to model2
        print("Prompt to Data2:")
        print(remove_think(data2["prompt"]))

        # Display the Response from model2
        print("\nResponse from Data2 to Data1:")
        print(r.json()['response'])

        # Overwrite the prompt to model1 with the response from model2
        data1 = {
            "model": model1,
            "prompt": remove_think(r.json()['response']),
            "stream": False,
        }

        # Decrement the count-down
        count -= 1

        # Loop

    except Exception as e:
        # In the event of an exception, dump the associated error information and stop
        pprint(e)
        pprint(f"Error: {r.status_code} - {r.text}")
        break

Try running the above and see what it produces. Try messing with the prompt to overcome some of the limitations in the initial response. If model1 and model2 are swapped, does that significantly change the conversation?

Code Completion

A common use case for LLMs is in-editor source-code auto-completion. Many of the models available in Ollama’s library can accept specially structured input indicating that you intend to use the response for text completion rather than conversation. Unfortunately, the tokens used to indicate this are model-specific and highly variable. In addition, whether the response contains only completion source code, or completion code plus some additional commentary, also varies from model to model.

Reading the relevant model documentation on Hugging Face explains how the input prompt should be formatted for code-completion tasks. In some cases, the Ollama library page documents this as well:

Picking deepseek-coder-v2 (the second URL in the list) as an example, the following shows how to use its code-completion syntax:

import requests
from pprint import pprint

# Model to use for code completion (not all models support code completion)
model = "deepseek-coder-v2:latest"

# Model-specific keywords for code completion use cases
# Note that sometimes (deepseek-coder-v2 is an example) these contain extended UTF characters
fim_begin  = "<|fim▁begin|>"
fim_cursor = "<|fim▁hole|>"
fim_end    = "<|fim▁end|>"

# Build a JSON prompt to send to the deepseek-coder-v2 LLM
data = {
    "model": model,
    # Example prompt pulled from https://huggingface.co/deepseek-ai/DeepSeek-Coder-V2-Instruct#code-insertion
    "prompt": f"""{fim_begin}def quick_sort(arr):
    if len(arr) <= 1:
        return arr
    pivot = arr[0]
    left = []
    right = []
{fim_cursor}
        if arr[i] < pivot:
            left.append(arr[i])
        else:
            right.append(arr[i])
    return quick_sort(left) + [pivot] + quick_sort(right){fim_end}""",
    "stream": False,

    # Set raw=True to get raw output without any additional formatting or processing
    "raw": True,
}

# POST the prompt to ollama
r = requests.post("http://localhost:11434/api/generate", json=data)

try:
    # Display the original prompt
    print("Prompt:")
    print(data["prompt"])

    # Display the Response from the AI (should only be the completion to insert where fim_cursor is)
    print("\nResponse:")
    print(r.json()['response'])
except Exception as e:
    # In the event of an exception, dump the associated error information
    pprint(e)
    pprint(f"Error: {r.status_code} - {r.text}")

Running this returns the following results:

Prompt:
<|fim▁begin|>def quick_sort(arr):
    if len(arr) <= 1:
        return arr
    pivot = arr[0]
    left = []
    right = []
<|fim▁hole|>
        if arr[i] < pivot:
            left.append(arr[i])
        else:
            right.append(arr[i])
    return quick_sort(left) + [pivot] + quick_sort(right)<|fim▁end|>

Response:
    for i in range(1, len(arr)):

In this case, the response tells us that the code for i in range(1, len(arr)) should be inserted on the line right after initially assigning right = [].
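
Splicing that response into the position marked by the fim_cursor token yields the completed function:

def quick_sort(arr):
    if len(arr) <= 1:
        return arr
    pivot = arr[0]
    left = []
    right = []
    for i in range(1, len(arr)):
        if arr[i] < pivot:
            left.append(arr[i])
        else:
            right.append(arr[i])
    return quick_sort(left) + [pivot] + quick_sort(right)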

Advanced Code Completion

Some models have more advanced code-completion and formatting options available. Take the Qwen2.5-Coder model, for instance. According to its documentation, it supports the following tokens, which map directly to the deepseek-coder-v2 tokens used above:

  • <|fim_prefix|>: Token marking the start of a code-completion prompt; the code before the cursor (the prefix) follows it
  • <|fim_suffix|>: Token marking where the completion should be inserted (where the cursor sits in the editor); any code after the cursor follows it
  • <|fim_middle|>: Token marking the end of the prompt, signaling the model to generate the code that belongs at the cursor

Additionally, the following tokens are supported to help provide extra context in the input prompt (a sketch of how they combine with the FIM tokens follows the list):

  • <|repo_name|>: Token labeling a repository name for repository code-completion
  • <|file_sep|>: Token marking a file separation point
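
As a rough sketch of how these repository-level tokens might be combined with the FIM tokens (the layout below follows my reading of the Qwen2.5-Coder documentation; treat the exact arrangement, file names, and contents as illustrative assumptions):

# Hypothetical sketch of a repository-level FIM prompt for qwen2.5-coder.
# The file names and contents below are placeholders, not part of any real project.
repo_name  = "<|repo_name|>"
file_sep   = "<|file_sep|>"
fim_prefix = "<|fim_prefix|>"
fim_suffix = "<|fim_suffix|>"
fim_middle = "<|fim_middle|>"

helpers_py = "def load_config(path):\n    ..."
main_py    = "from helpers import load_config\n"

prompt = (
    # Name the repository, then list the context files separated by <|file_sep|>
    f"{repo_name}my_project\n"
    f"{file_sep}helpers.py\n{helpers_py}\n"
    f"{file_sep}main.py\n{main_py}\n"
    # The file being edited: prefix, cursor position, then ask for the middle
    f"{fim_prefix}cfg = load_config({fim_suffix})\n{fim_middle}"
)

A working combination of <|file_sep|> context with the FIM tokens, where the context is pulled from ChromaDB rather than hard-coded files, appears later in this post.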

Installing ChromaDB

ChromaDB is another tool that can store and query embeddings. Given an input string of text, it returns the snippets stored in the database that most closely match that input. This is very useful when a larger codebase cannot fit entirely within the context window of an LLM, and it helps yield context-aware code-completion suggestions in those situations.

We can further expand the docker-compose.yml to deploy a chroma service:

services:
  ollama:
    image: ollama/ollama:rocm
    container_name: ollama
    ports:
      # Expose port 11434 so that plugins like ollama-vim and others can use the service
      - "127.0.0.1:11434:11434"
    networks:
      - ollamanet
    devices:
      # kfd and dri device access needed for AMDGPU ROCm support to work
      - /dev/kfd:/dev/kfd
      - /dev/dri:/dev/dri
    volumes:
      - ~/.ollama:/root/.ollama
    restart: always
  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    container_name: open-webui
    ports:
      # Expose port 8080 to host port 3000 so that we can access the open-webui from localhost:3000
      - "127.0.0.1:3000:8080"
    environment:
      - OLLAMA_BASE_URL=http://ollama:11434
    networks:
      - ollamanet
    volumes:
      - ~/.open-webui-data:/app/backend/data
    restart: always
    depends_on:
      - ollama
  chroma:
    image: chromadb/chroma:latest
    container_name: chroma
    environment:
      - IS_PERSISTENT=TRUE
    ports:
      - "127.0.0.1:3001:8000"
    networks:
      - ollamanet
    volumes:
      - ./chroma:/chroma/chroma
    restart: always
networks:
  ollamanet:

Then running docker compose up --wait again brings up the new chroma service, leaving the others in place and untouched.

Installing Python Packages

The following requirements.txt contains the dependencies I installed for the LangChain examples from here on out:

chromadb
langchain_community
langchain_chroma
langchain_text_splitters
langchain_ollama

Loading Source into Chroma

In order to work with large datasets, such as the source code and API of a codebase and its dependent libraries, the larger dataset needs to be broken up into a collection of embeddings, which can then be stored in ChromaDB. We can use ollama for this as well, along with a model suited to embedding generation. Earlier on, nomic-embed-text:latest was listed as one of the models to download.

If that hasn’t been done already, do it now, either through Open WebUI or on the command line:

docker compose exec ollama ollama pull nomic-embed-text:latest
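
To get a quick feel for what the embedding model produces, here is a minimal sketch (using the same langchain_ollama wrapper that the scripts below use) that embeds a single, arbitrary string and inspects the resulting vector:

from langchain_ollama import OllamaEmbeddings

# Minimal sketch: generate one embedding vector with the nomic-embed-text model
embed = OllamaEmbeddings(
    model="nomic-embed-text:latest",
    base_url="http://localhost:11434",
)

# Embed an arbitrary snippet of text and look at the resulting vector
vector = embed.embed_query("fm = currentProgram.getFunctionManager()")
print(len(vector))   # dimensionality of the embedding
print(vector[:5])    # the first few floating-point components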

The following script then can be used to store the desired source code into the ChromaDB embeddings store. The root_dir variable set in this script can be changed to point to whatever directory contains the source code desired to be stored for AI retrieval:

#!/usr/bin/env python
import chromadb
from pprint import pprint
from hashlib import sha256
from glob import iglob

# Import LangChain features we will use
from langchain_community.document_loaders.generic import GenericLoader
from langchain_community.document_loaders.parsers import LanguageParser
from langchain_chroma.vectorstores import Chroma
from langchain_ollama import OllamaEmbeddings

# Initialize a connection to ChromaDB to store embeddings
client = chromadb.HttpClient(host='localhost', port='3001')

# Initialize a connection to Ollama to generate embeddings
embed = OllamaEmbeddings(
    model='nomic-embed-text:latest',
    base_url='http://localhost:11434',
)

# Create a new collection (if it doesn't exist already) and open it
chroma = Chroma(
    collection_name='py_collection_test',
    client=client,
    embedding_function=embed,
)

# Start scanning below a Ghidra installation folder
# TODO: Change this to whatever folder under which you want to store your private code
#       for LLM retrieval
root_dir="../../ghidra_11.1.2_PUBLIC"

# Walk the filesystem below root_dir, and pick all *.py files
myglob = iglob("**/*.py",
               root_dir=root_dir,
               recursive=True)

# Even though the GenericLoader from langchain has a glob-based filesystem
# walking feature, if any of the files cause a parser exception, it will
# pass this exception up to the lazy_load() or load() iterator call, failing
# every subsequent file. Use the Python glob interface to load the files one
# at a time with GenericLoader and LanguageParser, and if any throw an exception,
# discard that one file, and continue on to try the next.
for fsentry in myglob:
    try:
        # Try creating a new GenericLoader for the file
        loader = GenericLoader.from_filesystem(
            f"{root_dir}/{fsentry}",
            parser=LanguageParser(language='python'),
            show_progress=False,
        )

        # Try loading the file through its parser
        for doc in loader.lazy_load():
            # Set some useful metadata on the document before it is stored
            doc.metadata = {
                # Store the filename the snippet came from
                "source": doc.metadata["source"],
                # I'm just using Python as an example, but any supported language will work
                "language": 'python',
            }

            # Generate a SHA256 "unique id" for each embedding, to help dedupe
            h = sha256(doc.page_content.encode('utf-8')).hexdigest()

            # Store the embedding into ChromaDB
            chroma.add_documents(documents=[doc], ids=[h])
    except Exception as e:
        # If the file failed to load and/or parse, then report it to STDOUT
        pprint(f"Failed with {fsentry}!")
        #pprint(e)  # To dump the exception details, if needed
        continue

The above code does a few things:

  1. Creates a new collection in chroma named py_collection_test - you can choose to isolate your sources into different collections by project or by language or by some other arbitrary criteria, and then limit retrieval to the snippets relevant to your project.
  2. Discovers all of the files under a provided “root” folder matching a file-glob pattern (in this case, **/*.py which is all Python source code stored in a folder or any of its subfolders)
  3. Attempts to load each file via langchain’s GenericLoader and parse it with the LanguageParser
  4. If it is successful in parsing, then the generated text embeddings are stored in ChromaDB

While the example above assumes Python, omitting the language= parameter altogether from the LanguageParser instantiation line will cause it to attempt to detect the language automatically based on the file extension and/or content. LangChain’s LanguageParser supports a long list of programming languages. If using the language auto-detection, the language parameter will be added to the ChromaDB metadata automatically, if the LanguageParser is certain of the source code language.
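
For example, the loader construction inside the loop would simply drop the language= argument (a minimal variation of the script above):

# Same as before, but without language=: LanguageParser attempts to detect the
# language from the file extension and/or content
loader = GenericLoader.from_filesystem(
    f"{root_dir}/{fsentry}",
    parser=LanguageParser(),
    show_progress=False,
)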

Additionally, readers familiar with the LangChain toolset may already know that the GenericLoader.from_filesystem helper natively supports file-globbing and recursive directory traversal, rather than loading a single file at a time. One limitation I have run into is that if any source file causes a loading or parsing error, the entire traversal job fails. A common issue is that LanguageParser does not handle extended UTF-8 characters in source code well, which is a problem for languages (such as Python, Rust, Go, and others) whose specifications allow arbitrary UTF-8. To overcome this limitation, I replicated the file-glob discovery and directory traversal using Python’s native glob.iglob function instead, calling GenericLoader (and then LanguageParser) individually for each file. If parsing any single file raises an exception, it is caught gracefully, the offending file is reported to the user and skipped, and the loop moves on to the next. In my opinion, this is the preferable way to handle defective input data: without it, I could not benefit from the data retrieval at all, whereas a large subset of my dataset, even a 90%-some portion of it, is close enough to complete to be reasonably usable for the context-informed code-completion tasks I want to perform.

Some further examples and documentation about how this is working is available here:


Querying From ChromaDB

The langchain_chroma library provides a Chroma interface with a similarity_search method that, given a piece of code (such as the partial user input to complete), returns the top k closest matches from ChromaDB.

For example, hypothetically we might have the following partial Python code:

from ghidra.program.model.listing import CodeUnit
from ghidra.program.model.symbol import SourceType

fm = currentProgram.getFunctionManager()
functions = fm.get

This snippet has been intentionally written to exercise some simple elements of the Ghidra API, and represents a user who stopped typing right after fm.get and requested a completion. The entire snippet could be stored in a string named code, and the query would then look something like:

r_docs = chroma.similarity_search(code, k=2)

The two closest matches (as specified by k=2) from ChromaDB would be returned to the caller.

An example of using the API to query embeddings from ChromaDB:

#!/usr/bin/env python
import chromadb
import requests
from pprint import pprint
from hashlib import sha256

from langchain_community.document_loaders.generic import GenericLoader
from langchain_community.document_loaders.parsers import LanguageParser
from langchain_chroma.vectorstores import Chroma
from langchain_ollama import OllamaEmbeddings

# Open a connection to the ChromaDB server
client = chromadb.HttpClient(host='localhost', port='3001')

# Connect to the same embedding model that was used to create the
# embeddings in load_chroma.py
embed = OllamaEmbeddings(
    model='nomic-embed-text:latest',
    base_url='http://localhost:11434',
)

# Open a session to query the py_collection_test collection within Chroma
# this was populated by load_chroma.py
chroma = Chroma(
    collection_name='py_collection_test',
    client=client,
    embedding_function=embed,
)

# An example snippet of Python code that we would like to use to query
# chroma for similarity
code = """from ghidra.program.model.listing import CodeUnit
from ghidra.program.model.symbol import SourceType

fm = currentProgram.getFunctionManager()
functions = fm.get"""

# Perform the similarity search against the chroma database. The k= param
# will control the number of "top results" to return. For this example, we'll
# use 2 of them, but in production more would be better for feeding more
# context to the LLM generating a code completion
r_docs = chroma.similarity_search(code, k=2)

# Iterate across each result from Chroma
for doc in r_docs:
    # Display which source file it came from (using the schema created in load_chroma.py)
    print('#  ' + doc.metadata['source'] + ':')

    # Display the embedding snippet content
    print(doc.page_content)

Running the above code against my database pre-loaded with the Python code from Ghidra earlier, I got the following 2 snippets in my output:

#  ../../ghidra_11.1.2_PUBLIC/Ghidra/Features/Python/ghidra_scripts/AddCommentToProgramScriptPy.py:
## ###
#  IP: GHIDRA
# 
#  Licensed under the Apache License, Version 2.0 (the "License");
#  you may not use this file except in compliance with the License.
#  You may obtain a copy of the License at
#  
#       http://www.apache.org/licenses/LICENSE-2.0
#  
#  Unless required by applicable law or agreed to in writing, software
#  distributed under the License is distributed on an "AS IS" BASIS,
#  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
#  See the License for the specific language governing permissions and
#  limitations under the License.
##
# Adds a comment to a program.

# DISCLAIMER: This is a recreation of a Java Ghidra script for example
# use only. Please run the Java version in a production environment.

#@category Examples.Python


from ghidra.program.model.address.Address import *
from ghidra.program.model.listing.CodeUnit import *
from ghidra.program.model.listing.Listing import *

minAddress = currentProgram.getMinAddress()
listing = currentProgram.getListing()
codeUnit = listing.getCodeUnitAt(minAddress)
codeUnit.setComment(codeUnit.PLATE_COMMENT, "AddCommentToProgramScript - This is an added comment!")
#  ../../ghidra_11.1.2_PUBLIC/Ghidra/Features/Python/ghidra_scripts/external_module_callee.py:
## ###
#  IP: GHIDRA
# 
#  Licensed under the Apache License, Version 2.0 (the "License");
#  you may not use this file except in compliance with the License.
#  You may obtain a copy of the License at
#  
#       http://www.apache.org/licenses/LICENSE-2.0
#  
#  Unless required by applicable law or agreed to in writing, software
#  distributed under the License is distributed on an "AS IS" BASIS,
#  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
#  See the License for the specific language governing permissions and
#  limitations under the License.
##
# Example of being imported by a Ghidra Python script/module
# @category: Examples.Python

# The following line will fail if this module is imported from external_module_caller.py,
# because only the script that gets directly launched by Ghidra inherits fields and methods
# from the GhidraScript/FlatProgramAPI.
try:
    print currentProgram.getName()
except NameError:
    print "Failed to get the program name"

# The Python module that Ghidra directly launches is always called __main__.  If we import
# everything from that module, this module will behave as if Ghidra directly launched it.
from __main__ import *

# The below method call should now work
print currentProgram.getName()

As can be seen above, there are a lot of comments in the code, particularly license text. It may be desirable to strip this type of content during the ChromaDB loading phase, to leave more room for actual code in the LLM prompt’s context, which may improve completions.
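
As one naive way to do that (a sketch, not part of the loader script above): strip the leading block of comment-only lines from each snippet before it is stored. The helper below is hypothetical and Python-specific.

def strip_leading_comments(text):
    # Drop the leading run of comment-only or blank lines (e.g. license headers).
    # Naive heuristic: keep everything from the first non-comment line onward;
    # if the snippet contains no code at all, return it unchanged.
    lines = text.splitlines()
    for i, line in enumerate(lines):
        stripped = line.strip()
        if stripped and not stripped.startswith("#"):
            return "\n".join(lines[i:])
    return text

In load_chroma.py, this could be applied to doc.page_content just before chroma.add_documents is called.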

Additionally, in this example only the internal API and library code were uploaded into ChromaDB. It could also be beneficial to collect a large number of usage examples to store in ChromaDB, possibly as a regular job fed by the private code repositories your team works on.

Combining ChromaDB Query with FIM Code Insertion

When I was looking at the different “coder” LLMs available, I came across the qwen2.5-coder model from Alibaba (link here), which supports the popular “fill-in-the-middle” (FIM) protocol, but also has an extension offering a <|file_sep|> tag. The earlier FIM code can be adapted to take a prompt from a user, and then extend it by adding some additional code embeddings pulled from ChromaDB, to produce a context-aware code completion utility.

The following Python code is a blend of the earlier FIM code and the more recent ChromaDB query code. In the version below, the data = data_norag line causes the LLM to be prompted with a query that doesn’t include any of the results from ChromaDB (they’re simply discarded). This is done for the purpose of illustrating what the LLM would produce when not context-aware about the Ghidra codebase.

#!/usr/bin/env python
import chromadb
import requests
from pprint import pprint
from hashlib import sha256

from langchain_community.document_loaders.generic import GenericLoader
from langchain_community.document_loaders.parsers import LanguageParser
from langchain_chroma.vectorstores import Chroma
from langchain_ollama import OllamaEmbeddings

# Model to use for code completion (not all models support code completion)
model = "qwen2.5-coder:7b"

# Model-specific keywords for code completion use cases
# Note that sometimes (deepseek-coder-v2 is an example) these contain extended UTF characters
fim_begin  = "<|fim_prefix|>"
fim_cursor = "<|fim_suffix|>"
fim_end    = "<|fim_middle|>"
file_sep   = "<|file_sep|>"

# Open a connection to the ChromaDB server
client = chromadb.HttpClient(host='localhost', port='3001')

# Connect to the same embedding model that was used to create the
# embeddings in load_chroma.py
embed = OllamaEmbeddings(
    model='nomic-embed-text:latest',
    base_url='http://localhost:11434',
)

# Open a session to query the py_collection_test collection within Chroma
# this was populated by load_chroma.py
chroma = Chroma(
    collection_name='py_collection_test',
    client=client,
    embedding_function=embed,
)

# An example snippet of Python code that we would like to use to query
# chroma for similarity
code = """from ghidra.program.model.listing import CodeUnit
from ghidra.program.model.symbol import SourceType

fm = currentProgram.getFunctionManager()
functions = fm.get"""


# Perform the similarity search against the chroma database. The k= param
# will control the number of "top results" to return. For this example, we'll
# grab 5 of them.
r_docs = chroma.similarity_search(code, k=5)

# First, an example of building the query without including the context from ChromaDB
data_norag = {
    "model": model,
    "prompt": f"{fim_begin}{code}{fim_cursor}{fim_end}",
    "stream": False,
    "raw": True,
}

# Then, use the ChromaDB query response as part of the input prompt to the coder LLM
data_rag = {
    "model": model,
    "prompt": "{fim_begin}{file_sep}\n{context}\n{file_sep}{code}{fim_cursor}{fim_end}".format(
        fim_begin=fim_begin, file_sep=file_sep, fim_cursor=fim_cursor, fim_end=fim_end,
        context=f"\n{file_sep}".join([doc.page_content for doc in r_docs]),
        code=code,
    ),
    "stream": False,
    "raw": True,
}

data = data_norag

# POST the request to ollama
r = requests.post("http://localhost:11434/api/generate", json=data)

try:
    # Display the prompt, followed by the ollama response
    print("Prompt:" + data["prompt"])
    print("---------------------------------------------------")

    # Denote where the cursor would be using >>> (the Python CLI prompt)
    print("Response: >>>" + r.json()['response'])
except Exception as e:
    # In the event of an exception, show the details that caused it
    pprint(e)
    pprint(f"Error: {r.status_code} - {r.text}")

The following output was produced by the above, with the >>> in the Response identifying where the completion string would start. Note that this is completion performed by the LLM without the results from ChromaDB.

Prompt:<|fim_prefix|>from ghidra.program.model.listing import CodeUnit
from ghidra.program.model.symbol import SourceType

fm = currentProgram.getFunctionManager()
functions = fm.get<|fim_suffix|><|fim_middle|>
---------------------------------------------------
Response: >>>Functions(True)

for function in functions:
    function.setName("NewName", SourceType.USER_DEFINED)

Change the earlier line assigning data = to read:

data = data_rag

Then, re-running the modified script produced the following output:

Prompt:<|fim_prefix|><|file_sep|>
## ###
#  ....
#@category Examples.Python


from ghidra.program.model.address.Address import *
from ghidra.program.model.listing.CodeUnit import *
from ghidra.program.model.listing.Listing import *

minAddress = currentProgram.getMinAddress()
listing = currentProgram.getListing()
codeUnit = listing.getCodeUnitAt(minAddress)
codeUnit.setComment(codeUnit.PLATE_COMMENT, "AddCommentToProgramScript - This is an added comment!")
<|file_sep|>## ###
#  IP: GHIDRA
#  ....
# The following line will fail if this module is imported from external_module_caller.py,
# because only the script that gets directly launched by Ghidra inherits fields and methods
# from the GhidraScript/FlatProgramAPI.
try:
    print currentProgram.getName()
except NameError:
    print "Failed to get the program name"

# The Python module that Ghidra directly launches is always called __main__.  If we import
# everything from that module, this module will behave as if Ghidra directly launched it.
from __main__ import *

# The below method call should now work
print currentProgram.getName()


<|file_sep|>## ###
#  IP: GHIDRA
#  ....
# @category: BSim.python

import ghidra.app.decompiler.DecompInterface as DecompInterface
import ghidra.app.decompiler.DecompileOptions as DecompileOptions

def processFunction(func):
    decompiler = DecompInterface()
    try:
        options = DecompileOptions()
        decompiler.setOptions(options)
        decompiler.toggleSyntaxTree(False)
        decompiler.setSignatureSettings(0x4d)
        if not decompiler.openProgram(currentProgram):
            print "Unable to initialize the Decompiler interface!"
            print "%s" % decompiler.getLastMessage()
            return
        language = currentProgram.getLanguage()
        sigres = decompiler.debugSignatures(func,10,None)
        for i,res in enumerate(sigres):
            buf = java.lang.StringBuffer()
            sigres.get(i).printRaw(language,buf)
            print "%s" % buf.toString()
    finally:
        decompiler.closeProgram()
        decompiler.dispose()

func = currentProgram.getFunctionManager().getFunctionContaining(currentAddress)
if func is None:
    print "no function at current address"
else:
    processFunction(func)

<|file_sep|>## ###
#  IP: GHIDRA
#  ....
##
# Prints out all the functions in the program that have a non-zero stack purge size

for func in currentProgram.getFunctionManager().getFunctions(currentProgram.evaluateAddress("0"), 1):
  if func.getStackPurgeSize() != 0:
    print "Function", func, "at", func.getEntryPoint(), "has nonzero purge size", func.getStackPurgeSize()

<|file_sep|>## ###
#  IP: GHIDRA
#  ....
##
from ctypes import *
from enum import Enum

from comtypes import IUnknown, COMError
from comtypes.automation import IID, VARIANT
from comtypes.gen import DbgMod
from comtypes.hresult import S_OK, S_FALSE
from pybag.dbgeng import exception

from comtypes.gen.DbgMod import *

from .iiterableconcept import IterableConcept
from .ikeyenumerator import KeyEnumerator


# Code for: class ModelObjectKind(Enum):


# Code for: class ModelObject(object):
<|file_sep|>from ghidra.program.model.listing import CodeUnit
from ghidra.program.model.symbol import SourceType

fm = currentProgram.getFunctionManager()
functions = fm.get<|fim_suffix|><|fim_middle|>
---------------------------------------------------
Response: >>>Functions(True)
for function in functions:
    entryPoint = function.getEntryPoint()
    codeUnit = listing.getCodeUnitAt(entryPoint)
    if isinstance(codeUnit, CodeUnit):
        codeUnit.setComment(SourceType.USER_DEFINED, "This is a user-defined comment!")

LLM Response Comparison

Cutting the two responses out of the program output lets them be compared side by side:

Response: >>>Functions(True)

for function in functions:
    function.setName("NewName", SourceType.USER_DEFINED)
Response: >>>Functions(True)
for function in functions:
    entryPoint = function.getEntryPoint()
    codeUnit = listing.getCodeUnitAt(entryPoint)
    if isinstance(codeUnit, CodeUnit):
        codeUnit.setComment(SourceType.USER_DEFINED, "This is a user-defined comment!")

As can be seen, the first response (without the added context from ChromaDB) did provide a simple iteration suggestion that appears to relate to Ghidra: SourceType.USER_DEFINED does exist in Ghidra’s Python API, as does the setName call. However, the suggestion is simpler than the second response. This may be because Ghidra itself is a well-known public codebase and was therefore likely present in the LLM’s training data.

The second response incorporates more API content from ChromaDB, and it shows, with more details in the suggested completion.

Conclusion

Hopefully the steps detailed above shed some light on how LLMs and vector stores (like ChromaDB) can be paired to accomplish Retrieval-Augmented Generation (RAG) tasks. The approach above is a fairly naive RAG solution, and more sophisticated ways of retrieving context for the LLM exist beyond a simple similarity search in ChromaDB. The database offers a number of other similarity algorithms, and there are also non-AI context-aware coding systems, such as python-ctags, a popular choice for integration with code editors on Linux-based systems.
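
As one example of a more involved retrieval strategy, the langchain Chroma wrapper also exposes a maximal-marginal-relevance search (max_marginal_relevance_search), which balances similarity against diversity among the returned snippets. A minimal sketch, reusing the chroma and code objects from the earlier scripts:

# Sketch: pick 5 snippets out of the 20 nearest neighbours, favouring results
# that are relevant to the query but not redundant with one another
r_docs = chroma.max_marginal_relevance_search(code, k=5, fetch_k=20)

for doc in r_docs:
    print('#  ' + doc.metadata['source'] + ':')
    print(doc.page_content)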

The associated code is available in the following GitHub repository: