In the preceding article, we were introduced to AI/ML concepts and explored running a Large Language Model (LLM) locally with Ollama. We also interacted with it from Java using JBang and LangChain4j.
Now, let's explore what "chat memory" is and how LangChain4j takes over the cumbersome task of maintaining it.
To begin with, let's discuss why chat memory is necessary. Since LLMs are stateless and inherently lack the ability to preserve conversation state, supporting extended conversations requires careful handling of the dialogue context.
If we run the OllamaMistralExample class from the previous article, we get the following responses from the model:
Please enter your question - 'exit' to quit: My name is Kevin, the minion. I work for Gru!
Hello Kevin the Minion! It's great to meet you, the dedicated and hardworking minion from Gru's team. I'm here to help answer any questions or provide information you may need. What can I assist you with today?
Please enter your question - 'exit' to quit: Who is my boss?
I cannot determine who your boss is as I don't have the ability to access or interpret real-world information. Your boss would be the person who has authority over you in your workplace, such as a manager or supervisor. If you are unsure, it may be best to ask someone in a position of seniority within your organization or consult your employment contract or HR department for clarification.
Please enter your question - 'exit' to quit: What is my name?
I am an artificial intelligence and do not have a name or personal identity. I exist to provide information and answer questions to the best of my ability. How may I assist you today?
From the responses above, we can clearly see that the model does not remember the context of the conversation, because it keeps no state between requests. Hence, the application interacting with the LLM has to manage the conversation messages sent to and received from the LLM.
For sending multiple messages, LangChain4j's ChatLanguageModel interface provides the following methods:
default Response<AiMessage> generate(ChatMessage... messages);
Response<AiMessage> generate(List<ChatMessage> messages);
default Response<AiMessage> generate(List<ChatMessage> messages, ToolSpecification toolSpecification);
default Response<AiMessage> generate(List<ChatMessage> messages, List<ToolSpecification> toolSpecifications);
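Before moving on to the streaming example below, here is a minimal, blocking sketch of the second signature. It is not part of the article's code and assumes the synchronous OllamaChatModel from langchain4j-ollama, pointed at the same local Ollama setup used throughout this series:
//JAVA 21
//DEPS dev.langchain4j:langchain4j:0.28.0
//DEPS dev.langchain4j:langchain4j-ollama:0.28.0
import java.util.ArrayList;
import java.util.List;
import dev.langchain4j.data.message.AiMessage;
import dev.langchain4j.data.message.ChatMessage;
import dev.langchain4j.data.message.UserMessage;
import dev.langchain4j.model.chat.ChatLanguageModel;
import dev.langchain4j.model.ollama.OllamaChatModel;
import dev.langchain4j.model.output.Response;
class OllamaMistralBlockingSketch {
    public static void main(String[] args) {
        ChatLanguageModel model = OllamaChatModel.builder()
                .baseUrl("http://localhost:11434")
                .modelName("mistral")
                .build();
        // The conversation history that we manage ourselves and resend on every call
        List<ChatMessage> messages = new ArrayList<>();
        messages.add(UserMessage.from("My name is Kevin, the minion. I work for Gru!"));
        Response<AiMessage> first = model.generate(messages);
        messages.add(first.content()); // keep the reply so it becomes part of the context
        messages.add(UserMessage.from("What is my name?"));
        Response<AiMessage> second = model.generate(messages);
        System.out.println(second.content().text());
    }
}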
Now let's look at a complete interactive example that sends the full List<ChatMessage> on every turn. Since it uses the streaming StreamingChatLanguageModel, it calls the analogous generate(List<ChatMessage> messages, StreamingResponseHandler<AiMessage> handler) rather than the blocking Response<AiMessage> generate(List<ChatMessage> messages):
//JAVA 21
//DEPS dev.langchain4j:langchain4j:0.28.0
//DEPS dev.langchain4j:langchain4j-ollama:0.28.0
import java.io.Console;
import java.time.Duration;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import dev.langchain4j.data.message.AiMessage;
import dev.langchain4j.data.message.ChatMessage;
import dev.langchain4j.data.message.UserMessage;
import dev.langchain4j.model.StreamingResponseHandler;
import dev.langchain4j.model.chat.StreamingChatLanguageModel;
import dev.langchain4j.model.ollama.OllamaStreamingChatModel;
import dev.langchain4j.model.output.Response;
class OllamaMistralBasicMemory {
private static final String MODEL = "mistral";
private static final String BASE_URL = "http://localhost:11434";
private static Duration timeout = Duration.ofSeconds(120);
public static void main(String[] args) {
beginChatWithBasicMemory();
}
static void beginChatWithBasicMemory() {
Console console = System.console();
// The full conversation history, resent to the model on every request
List<ChatMessage> messages = new ArrayList<>();
StreamingChatLanguageModel model = OllamaStreamingChatModel.builder()
.baseUrl(BASE_URL)
.modelName(MODEL)
.timeout(timeout)
.temperature(0.0)
.build();
String question = console.readLine(
"\n\nPlease enter your question - 'exit' to quit: ");
while (!"exit".equalsIgnoreCase(question)) {
messages.add(UserMessage.from(question));
CompletableFuture<Response<AiMessage>> futureResponse = new CompletableFuture<>();
// Pass the entire history so the model has the context of previous turns
model.generate(messages, new StreamingResponseHandler<AiMessage>() {
@Override
public void onNext(String token) {
System.out.print(token);
}
@Override
public void onComplete(Response<AiMessage> response) {
// Keep the model's reply so it becomes part of the next request's context
messages.add(response.content());
futureResponse.complete(response);
}
@Override
public void onError(Throwable error) {
futureResponse.completeExceptionally(error);
}
});
futureResponse.join();
question = console.readLine("\n\nPlease enter your question - 'exit' to quit: ");
}
}
}
The OllamaMistralBasicMemory class is a modified version of the OllamaMistralExample class from the previous article. We use the StreamingChatLanguageModel, which lets us receive each token as soon as it is generated rather than having to wait for the full response.
Here we use an ArrayList to store the UserMessage and AiMessage objects, and this whole list is sent to the LLM whenever we want it to generate a response. After each user input, messages.add(UserMessage.from(question)) adds the question to the list, and once the response has been fully received, onComplete(Response<AiMessage> response) is triggered, which in turn adds the model's reply to the list with messages.add(response.content()).
Now, try executing OllamaMistralBasicMemory. This time the responses align with what we expect and the model appears to know the context. The following is the output for the same conversation as above:
Please enter your question - 'exit' to quit: My name is Kevin, the minion. I work for Gru!
Hello Kevin the Minion! It's great to meet you, the dedicated and hardworking minion from Gru's team. I'm here to help answer any questions or provide information you may need. What can I assist you with today?
Please enter your question - 'exit' to quit: What is my name?
I apologize for the confusion earlier, Kevin. You have introduced yourself as Kevin the Minion. So, your name is indeed Kevin! Is there something specific you would like to know or discuss related to Gru's lab or minion activities?
Please enter your question - 'exit' to quit: Who is my boss?
Your boss is Gru! He is the mastermind and leader of the evil organization that you and your fellow Minions work for. Gru is known for his cunning plans and schemes, and he relies on your help to carry them out. If you have any questions or need assistance with tasks related to Gru's plans, feel free to ask!
As we can see, the LLM now remembers the context and provides appropriate responses to the questions. However, there are a few problems with this implementation:
- First, LLMs possess a finite context window that accommodates only a certain number of tokens at any given moment. Conversations can easily surpass this limit.
- Second, each token comes with a cost, which increases progressively as more tokens are requested from the LLM.
- Third, resource usage on both the LLM and the application increases considerably over time as the list builds up.
Managing ChatMessages manually is an arduous task. To simplify this, LangChain4j provides the ChatMemory interface for managing ChatMessages. It is backed by a List and offers additional features such as persistence (provided by a ChatMemoryStore) and, crucially, an "eviction policy". The eviction policy addresses the issues described above.
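To make the persistence point concrete, here is a rough, hypothetical sketch of a ChatMemoryStore: a map-backed store plugged into a MessageWindowChatMemory through its builder. The builder's id() and chatMemoryStore() settings are assumptions about the 0.28.0 API and are not needed for the examples that follow; a real store would typically write to a database or a file:
//JAVA 21
//DEPS dev.langchain4j:langchain4j:0.28.0
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import dev.langchain4j.data.message.ChatMessage;
import dev.langchain4j.data.message.UserMessage;
import dev.langchain4j.memory.ChatMemory;
import dev.langchain4j.memory.chat.MessageWindowChatMemory;
import dev.langchain4j.store.memory.chat.ChatMemoryStore;
class PersistentChatMemorySketch {
    // Hypothetical store keeping one message list per conversation id
    static class MapBackedChatMemoryStore implements ChatMemoryStore {
        private final Map<Object, List<ChatMessage>> store = new HashMap<>();
        @Override
        public List<ChatMessage> getMessages(Object memoryId) {
            return store.computeIfAbsent(memoryId, id -> new ArrayList<>());
        }
        @Override
        public void updateMessages(Object memoryId, List<ChatMessage> messages) {
            store.put(memoryId, messages);
        }
        @Override
        public void deleteMessages(Object memoryId) {
            store.remove(memoryId);
        }
    }
    public static void main(String[] args) {
        ChatMemory memory = MessageWindowChatMemory.builder()
                .id("kevin-session") // hypothetical conversation id
                .maxMessages(10)
                .chatMemoryStore(new MapBackedChatMemoryStore())
                .build();
        memory.add(UserMessage.from("My name is Kevin, the minion."));
        // Every add() goes through the store, so the history survives as long as the store does
        System.out.println(memory.messages());
    }
}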
LangChain4j currently implements two algorithms for eviction policy:
- MessageWindowChatMemory provides sliding-window functionality, retaining the N most recent messages and evicting older ones once the list grows beyond the specified capacity N. However, the SystemMessage type of ChatMessage is retained and never evicted; the other message types (UserMessage, AiMessage and ToolExecutionResultMessage) can be evicted (a small sketch after this list illustrates this behavior).
- TokenWindowChatMemory also provides sliding-window functionality, but retains the N most recent tokens instead of messages. A Tokenizer needs to be specified to count the tokens in each ChatMessage. If there isn't enough space for a new message, the oldest one (or several) is evicted. Messages are indivisible: if a message doesn't fit, it is evicted completely. Like MessageWindowChatMemory, SystemMessage is never evicted.
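To see these eviction rules in action without calling the model at all, the following small sketch (reusing only the classes already mentioned) fills a MessageWindowChatMemory capped at 3 messages and prints what survives. Based on the rules above, the SystemMessage should be kept and the oldest UserMessage should be the one evicted:
//JAVA 21
//DEPS dev.langchain4j:langchain4j:0.28.0
import dev.langchain4j.data.message.AiMessage;
import dev.langchain4j.data.message.SystemMessage;
import dev.langchain4j.data.message.UserMessage;
import dev.langchain4j.memory.ChatMemory;
import dev.langchain4j.memory.chat.MessageWindowChatMemory;
class MessageWindowEvictionSketch {
    public static void main(String[] args) {
        // Retain at most 3 messages in total
        ChatMemory memory = MessageWindowChatMemory.withMaxMessages(3);
        memory.add(SystemMessage.from("You are a helpful assistant."));
        memory.add(UserMessage.from("My name is Kevin, the minion."));
        memory.add(AiMessage.from("Hello Kevin!"));
        memory.add(UserMessage.from("Who is my boss?"));
        // Expected: the SystemMessage plus the two most recent messages;
        // the UserMessage that introduced the name has been evicted
        memory.messages().forEach(System.out::println);
    }
}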
Now, let's rework OllamaMistralBasicMemory to use ChatMemory with the MessageWindowChatMemory eviction policy:
//JAVA 21
//DEPS dev.langchain4j:langchain4j:0.28.0
//DEPS dev.langchain4j:langchain4j-ollama:0.28.0
import java.io.Console;
import java.time.Duration;
import java.util.concurrent.CompletableFuture;
import dev.langchain4j.data.message.AiMessage;
import dev.langchain4j.data.message.UserMessage;
import dev.langchain4j.memory.ChatMemory;
import dev.langchain4j.memory.chat.MessageWindowChatMemory;
import dev.langchain4j.model.StreamingResponseHandler;
import dev.langchain4j.model.chat.StreamingChatLanguageModel;
import dev.langchain4j.model.ollama.OllamaStreamingChatModel;
import dev.langchain4j.model.output.Response;
class OllamaMistralChatMemory {
private static final String MODEL = "mistral";
private static final String BASE_URL = "http://localhost:11434";
private static Duration timeout = Duration.ofSeconds(120);
public static void main(String[] args) {
beginChatWithChatMemory();
System.exit(0);
}
static void beginChatWithChatMemory() {
Console console = System.console();
// Retain at most the 3 most recent messages; older ones are evicted automatically
ChatMemory memory = MessageWindowChatMemory.withMaxMessages(3);
StreamingChatLanguageModel model = OllamaStreamingChatModel.builder()
.baseUrl(BASE_URL)
.modelName(MODEL)
.timeout(timeout)
.temperature(0.0)
.build();
String question = console.readLine(
"\n\nPlease enter your question - 'exit' to quit: ");
while (!"exit".equalsIgnoreCase(question)) {
memory.add(UserMessage.from(question));
CompletableFuture<Response<AiMessage>> futureResponse = new CompletableFuture<>();
// Send whatever the memory currently holds instead of a manually managed list
model.generate(memory.messages(), new StreamingResponseHandler<AiMessage>() {
@Override
public void onNext(String token) {
System.out.print(token);
}
@Override
public void onComplete(Response<AiMessage> response) {
// Adding the reply may evict the oldest non-system message if the window is full
memory.add(response.content());
futureResponse.complete(response);
}
@Override
public void onError(Throwable error) {
futureResponse.completeExceptionally(error);
}
});
futureResponse.join();
question = console.readLine("\n\nPlease enter your question - 'exit' to quit: ");
}
}
}
Here we have set the maximum number of messages to 3 for the sake of testing the eviction quickly; a higher value can be set if needed. Therefore, at most 3 ChatMessages are retained, counting both questions (UserMessage) and responses (AiMessage).
If we run the program, state our name first, and then ask a few more questions, the message containing our name is eventually evicted once the window of 3 messages is exceeded. If we then ask the LLM for our name, it no longer has that context, because MessageWindowChatMemory has evicted those messages. This is where LangChain4j does the heavy lifting of managing the messages for us.
ChatMemory is a low-level component for managing messages. LangChain4j also provides higher-level components, AiServices and ConversationalChain, which we will explore in upcoming articles.
The code examples can be found here.
Happy Coding!