llm_client: An Interface for Deterministic Signals from Probabilistic LLM Vibes

// Load Local LLMs
let llm_client = LlmClient::llama_cpp().available_vram(48).mistral7b_instruct_v0_3().init().await?;

// Build requests
let response: u32 = llm_client.reason().integer()
    .instructions()
    .set_content("Sally (a girl) has 3 brothers. Each brother has 2 sisters. How many sisters does Sally have?")
    .return_primitive().await?;

// Receive 'primitive' outputs
assert_eq!(response, 1);

This runs the reason one-round cascading prompt workflow with an integer output.

An example run of this workflow with these instructions.

The code for this workflow is here.

I have a full breakdown of this workflow in my blog post, "Step-Based Cascading Prompts: Deterministic Signals from the LLM Vibe Space."

From AI Vibes to the Deterministic Real World

Large Language Models (LLMs) are somewhere between conventional programming and databases: they process information like rule-based systems while storing and retrieving vast amounts of data like databases. But unlike the deterministic outputs of if statements and database queries, raw LLMs produce outputs that can be described as “vibes” — probabilistic, context-dependent results that may vary between runs. Building on vibes may work for creative applications, but in real-world applications, a lossy, vibey output is a step back from the reliability and consistency of traditional programming paradigms.

llm_client is an interface for building cascading prompt workflows from dynamic and novel inputs, running the steps of the workflows in a linear or reactive manner, and constraining and interpreting LLM inference as actionable signals. This allows the integration of LLMs into traditional software systems with the level of consistency and reliability that users expect.

llm_client Implementation Goals

  • As Friendly as Possible: the most intuitive and ergonomic Rust interface possible
  • Local and Embedded:
    • Designed for native, in-process integration—just like a standard Rust crate
    • No stand-alone servers, containers, or services
    • Minimal cloud dependencies; fully local use is fully supported
  • Works: Accurate, reliable, and observable results from available workflows

Traditional LLM Constraints Impact LLM Performance

Using an LLM as part of your business logic requires extracting usable signals from LLM responses. This means interpreting the text generated by an LLM using something like Regex or LLM constraints. Controlling the output of an LLM is commonly achieved with constraints like logit bias, stop words, or grammars.

However, from practical experience, as well as studies, constraints such as logit bias and grammars negatively impact the quality of LLM outputs. Furthermore, they constrain inference only at the token level, when we may wish to shape the structure of the entire generation.
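
As a minimal illustration of how brittle the "parse the text yourself" approach is, here is a plain-Rust sketch (using the regex crate; not part of llm_client) that pulls an integer answer out of free-form model output:

    // Illustration only: a regex-based extractor for free-form LLM output.
    // Any drift in the model's phrasing silently breaks the pattern.
    use regex::Regex;

    fn extract_answer(llm_output: &str) -> Option<u32> {
        // Expects phrasing like "the answer is 1"
        let re = Regex::new(r"(?i)the answer is\s+(\d+)").ok()?;
        re.captures(llm_output)?
            .get(1)?
            .as_str()
            .parse::<u32>()
            .ok()
    }

    fn main() {
        assert_eq!(extract_answer("After thinking it over, the answer is 1."), Some(1));
        // A harmless rewording defeats the pattern entirely.
        assert_eq!(extract_answer("Sally has one sister."), None);
    }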

Controlled Generation with Step-Based Cascade Workflows

llm_client's cascade prompting system runs pre-defined workflows that control and constrain both the overall structure of generation and individual tokens during inference. This allows the implementation of specialized workflows for specific tasks, shaping LLM outputs towards intended, reproducible outcomes.

This method significantly improves the reliability of LLM use cases. For example, there are test cases in this repo that can be used to benchmark an LLM. There is a large increase in accuracy when comparing basic inference with a constrained outcome to a CoT-style cascading prompt workflow. The decision workflow, which runs N CoT workflows across a temperature gradient, approaches 100% accuracy on the test cases.

Cascade Prompt Elements

  • Workflow: A workflow, or 'flow', is a high-level object that runs the individual elements.
  • Rounds: A workflow is made up of multiple rounds. Each round is a pair of a user turn and an assistant turn. Turns are also known as ‘messages’.
    • Both the user turn and the assistant turn can be pre-generated, or dynamically generated.
  • Tasks: The ‘user message’ in the user turn of a round. Generically referred to as a ‘task’ for the sake of brevity.
  • Steps: Each assistant turn consists of one or more steps.
    • Inference steps generate text via LLM inference.
    • Guidance steps generate text from pre-defined static inputs or dynamic inputs from your program.
  • Generation prefixes: Assistant steps can be prefixed with content prior to generation.
  • Dynamic suffixes: Assistant steps can also be suffixed with additional content after generation.
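
To make the nesting concrete, here is a conceptual sketch of these elements as plain Rust types. This is illustrative only and is not the crate's actual API:

    // Illustration only: a workflow owns rounds; each round pairs a task (user turn)
    // with an assistant turn made of one or more steps.
    struct Workflow {
        rounds: Vec<Round>,
    }

    struct Round {
        task: String,              // the 'user message' for this round
        assistant_turn: Vec<Step>, // one or more steps
    }

    enum Step {
        // Text generated via LLM inference, optionally wrapped with a
        // generation prefix and a dynamic suffix.
        Inference {
            prefix: Option<String>,
            suffix: Option<String>,
        },
        // Text taken from pre-defined static inputs or dynamic inputs from your program.
        Guidance { content: String },
    }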

Reasoning with Primitive Outcomes

A constraint-enforced CoT process for reasoning. First, we get the LLM to 'justify' an answer in plain English. This allows the LLM to 'think' by outputting the stream of tokens required to come to an answer. Then we take that 'justification' and prompt the LLM to parse it for the answer. See the workflow for implementation details.

  • Currently supports returning booleans, u32s, and strings from a list of options

  • Can be 'None' when run with return_optional_primitive()

    // boolean outcome
    let reason_request = llm_client.reason().boolean();
    reason_request
        .instructions()
        .set_content("Does this email subject indicate that the email is spam?");
    reason_request
        .supporting_material()
        .set_content("You'll never believe these low, low prices 💲💲💲!!!");
    let res: bool = reason_request.return_primitive().await.unwrap();
    assert_eq!(res, true);

    // u32 outcome
    let reason_request = llm_client.reason().integer();
    reason_request.primitive.lower_bound(0).upper_bound(10000);
    reason_request
        .instructions()
        .set_content("How many times is the word 'llm' mentioned in these comments?");
    reason_request
        .supporting_material()
        .set_content(hacker_news_comment_section);
    // Can be None
    let response: Option<u32> = reason_request.return_optional_primitive().await.unwrap();
    assert!(response > Some(9000));

    // string from a list of options outcome
    let mut reason_request = llm_client.reason().exact_string();
    reason_request
        .instructions()
        .set_content("Based on this readme, what is the name of the creator of this project?");
    reason_request
        .supporting_material()
        .set_content(llm_client_readme);
    reason_request
        .primitive
        .add_strings_to_allowed(&["shelby", "jack", "camacho", "john"]);
    let response: String = reason_request.return_primitive().await.unwrap();
    assert_eq!(response, "shelby");

See the reason example for more

Decisions with N Votes Across a Temperature Gradient

Uses the same process as above, run N times, where N is the number of votes required to reach a consensus. We dynamically alter the temperature across votes to ensure an accurate consensus. See the workflow for implementation details.

  • Supports primitives that implement the reasoning trait

  • The consensus vote count can be set with best_of_n_votes()

  • By default, dynamic temperature is enabled, and the temperature of each 'vote' increases across a gradient

    // An integer decision request
    let decision_request = llm_client.reason().integer().decision();
    decision_request.best_of_n_votes(5); 
    decision_request
        .instructions()
        .set_content("How many fingers do you have?");
    let response = decision_request.return_primitive().await.unwrap();
    assert_eq!(response, 5);

See the decision example for more
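
For intuition, here is a sketch of the temperature-gradient idea: spread the N votes across a temperature range so that later votes explore more than earlier ones. This is only an illustration; the crate's actual schedule and range may differ.

    // Illustration only: linearly space N vote temperatures between a minimum and maximum.
    fn vote_temperatures(n_votes: usize, min_temp: f32, max_temp: f32) -> Vec<f32> {
        (0..n_votes)
            .map(|i| {
                if n_votes <= 1 {
                    min_temp
                } else {
                    min_temp + (max_temp - min_temp) * i as f32 / (n_votes - 1) as f32
                }
            })
            .collect()
    }

    // vote_temperatures(5, 0.0, 1.5) -> [0.0, 0.375, 0.75, 1.125, 1.5]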

Structured Outputs and NLP

  • Data extraction, summarization, and semantic splitting on text.

  • Currently, the only implemented NLP workflow is URL extraction.

See the extract_urls example

Basic Primitives

A generation where the output is constrained to one of the defined primitive types. See the currently implemented primitive types. These are used within other workflows, and some also serve as the output type for specific workflows like reason and decision.

  • These are fairly easy to add, so feel free to open an issue if you'd like one added.

See the basic_primitive example
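
For intuition, here is a plain-Rust sketch of what an 'exact string' primitive means in practice: the generation is only valid if it is one of the allowed options. llm_client enforces this at generation time via constraints; this sketch merely validates after the fact.

    // Illustration only; not llm_client's API.
    fn parse_exact_string<'a>(generation: &str, allowed: &[&'a str]) -> Option<&'a str> {
        let trimmed = generation.trim();
        allowed
            .iter()
            .copied()
            .find(|option| option.eq_ignore_ascii_case(trimmed))
    }

    fn main() {
        let allowed = ["shelby", "jack", "camacho", "john"];
        assert_eq!(parse_exact_string(" Shelby ", &allowed), Some("shelby"));
        assert_eq!(parse_exact_string("someone else", &allowed), None);
    }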

LLM -> LLMs

  • Basic support for API-based LLMs. Currently: Anthropic, OpenAI, and Perplexity

  • Perplexity does not currently return documents, but it does create its responses from live data

    let llm_client = LlmClient::perplexity().sonar_large().init();
    let mut basic_completion = llm_client.basic_completion();
    basic_completion
        .prompt()
        .add_user_message()
        .set_content("Can you help me use the llm_client rust crate? I'm having trouble getting cuda to work.");
    let response = basic_completion.run().await?;

See the basic_completion example

Loading Custom Models from Local Storage

    // From local storage
    let llm_client = LlmClient::llama_cpp().local_quant_file_path(local_llm_path).init().await?;

    // From hugging face
    let llm_client = LlmClient::llama_cpp().hf_quant_file_url("https://huggingface.co/MaziyarPanahi/Meta-Llama-3-8B-Instruct-GGUF/blob/main/Meta-Llama-3-8B-Instruct.Q6_K.gguf").init().await?;

Configuring Requests

  • All requests and workflows implement the BaseRequestConfigTrait which gives access to the parameters sent to the LLM

  • These settings are normalized across both local and API requests

    let llm_client = LlmClient::llama_cpp()
        .available_vram(48)
        .mistral7b_instruct_v0_3()
        .init()
        .await?;

    let basic_completion = llm_client.basic_completion();

    basic_completion
        .temperature(1.5)
        .frequency_penalty(0.9)
        .max_tokens(200);

See all the settings here

Guides

Installation

llm_client currently relies on llama.cpp. As it's a C++ project, it's not bundled in the crate. In the near future, llm_client will support mistral-rs, an inference backend built on Candle that supports great features like ISQ. Once that integration is complete, llm_client will be pure Rust and can be installed as just a crate.

  • Clone repo:
git clone --recursive https://github.com/ShelbyJenkins/llm_client.git
cd llm_client
  • Add to Cargo.toml:
[dependencies]
llm_client = {path="../llm_client"}
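
With the path dependency in place, a minimal first program can be assembled from the snippets above. This sketch assumes LlmClient is exported at the crate root and a tokio async runtime; adjust available_vram and the model preset for your hardware.

    use llm_client::LlmClient;

    #[tokio::main]
    async fn main() -> Result<(), Box<dyn std::error::Error>> {
        // Load a local model through llama.cpp, as in the examples above.
        let llm_client = LlmClient::llama_cpp()
            .available_vram(48)
            .mistral7b_instruct_v0_3()
            .init()
            .await?;

        // Run the one-round reasoning workflow with an integer output.
        let response: u32 = llm_client
            .reason()
            .integer()
            .instructions()
            .set_content("Sally (a girl) has 3 brothers. Each brother has 2 sisters. How many sisters does Sally have?")
            .return_primitive()
            .await?;

        println!("answer: {response}");
        Ok(())
    }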

Roadmap

  • Migrate from llama.cpp to mistral-rs. This would greatly simplify consuming llm_client as an embedded crate. It's currently a WIP. llama.cpp may also end up behind a feature flag as a fallback.
    • Current blockers are grammar migration and multi-GPU support
  • Reasoning where the output can be multiple answers
  • Expanding the NLP functionality to include semantic splitting and further data extraction.
  • Refactor the benchmarks module

Dependencies

async-openai is used to interact with the OpenAI API. A modified version of the async-openai crate is used for the Llama.cpp server. If you just need an OpenAI API interface, I suggest using the async-openai crate.

clust is used to interact with the Anthropic API. If you just need an Anthropic API interface, I suggest using the clust crate.

llm_utils is a sibling crate that was split from llm_client. If you just need prompting, tokenization, model loading, etc., I suggest using the llm_utils crate on its own.

Contributing

Yes.

License

Distributed under the MIT License. See LICENSE.txt for more information.

Contact

Shelby Jenkins - Here or LinkedIn
