2024-10-18

Getting Started with LLMs

Introduction

Since you are reading this, you must be as excited and intrigued as I am about Artificial Intelligence (AI) and Large Language Models (LLMs). Thanks for coming!

AI is not new. Its origins can be traced back to the 1950s and 1960s. But something significant changed with Google's Transformer architecture in 2017 and the release of OpenAI's GPT-3 in 2020. To oversimplify, a convergence of powerful hardware, access to large datasets, increased investment, and perhaps a bit of surprise at how well the models worked resulted in the explosion we are seeing today.

I have been experimenting with LLMs since their recent surge in popularity. I primarily use the free versions from companies like Google (Gemini), Microsoft (Copilot), OpenAI (ChatGPT), and Anthropic (Claude). Each model is different, specializing in its own set of use cases, but all are incredibly valuable. These models have become an integral part of my daily workflow, often replacing my use of Google search.

I primarily use text-based LLMs but also experiment with image generation, mostly with Midjourney. I have not done much with voice other than generating transcripts. I have also worked with some specialty uses of LLMs, such as Google's NotebookLM, which I am very impressed with and use regularly.

I have also spent time building a Retrieval-Augmented Generation (RAG) application. This is a type of AI solution that augments the prompts sent to an LLM so that its responses are grounded in a specified set of documents.
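As a rough sketch of the pattern (not my actual application), the retrieve-then-prompt loop looks something like this. The keyword-overlap retriever and the two document strings are stand-ins for the vector search and real corpus a production RAG system would use:

```python
import re

def retrieve(query: str, documents: list[str]) -> str:
    """Return the document sharing the most words with the query.
    (Real RAG systems use embedding similarity instead of word overlap.)"""
    q_words = set(re.findall(r"\w+", query.lower()))
    return max(documents,
               key=lambda d: len(q_words & set(re.findall(r"\w+", d.lower()))))

def build_prompt(query: str, context: str) -> str:
    """Fold the retrieved document into the prompt sent to the LLM."""
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

docs = [
    "Ollama runs large language models on local hardware.",
    "Quantization reduces model size by lowering numeric precision.",
]
query = "How does quantization reduce model size?"
prompt = build_prompt(query, retrieve(query, docs))
print(prompt)  # the retrieved context is the quantization document
```

The final prompt, context and all, is what gets sent to the LLM, which is why the responses stay grounded in your documents.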

With all of this, I feel like I have only seen the tip of the iceberg. Each time I learn, I find that there is more that I don't know. My list of areas to dig into continues to grow, as does my desire to experiment.

Here are a few areas that I want to dig into:

1. How can I create a better RAG application by fine-tuning my own LLM?

2. How can I use an LLM to test my RAG application for accuracy?

3. For the RAG application that I have built, what is the best LLM to use?

I suspect you have your own list. The purpose of this document is to help you understand enough about LLMs to try things out and run experiments. It starts by describing the different LLMs that exist, followed by how to get access to them, how My Dev Server provides an affordable way to run them, and some experiments to get you started.

LLMs

What are LLMs?

Large Language Models (LLMs) are a type of artificial intelligence (AI) designed to process and generate human language, including text, code, and scripts. They are trained on massive datasets of text and code, allowing them to learn patterns, grammar, and context.

Who Builds LLMs?

LLMs are primarily built by technology companies and research institutions. Some of the most prominent developers include:

  • OpenAI: Known for models like GPT-3 and GPT-4.
  • Google: Developer of the BERT, PaLM, and Gemini models.
  • Meta (Facebook): Creator of the LLaMA family of models.
  • Anthropic: Developer of the Claude models.
  • DeepMind: Creator of models like Chinchilla and Gopher.
  • Research institutions: Universities and research labs worldwide also contribute to LLM development.

Additionally, Hugging Face serves as a platform hosting a large number of open-source LLMs, making them accessible to developers and researchers.

What are LLMs Used For?

LLMs have a wide range of applications, including:

  • Natural language generation: Creating human-quality text, such as articles, stories, and scripts.
  • Machine translation: Translating text from one language to another.
  • Text summarization: Condensing long pieces of text into shorter summaries.
  • Question answering: Answering questions based on provided information.
  • Code generation: Writing code based on natural language prompts.
  • Chatbots and virtual assistants: Providing conversational interfaces.

Where are LLMs Heading?

LLMs are rapidly evolving, and their capabilities are expanding. Future developments may include:

  • Increased understanding of context: LLMs will become better at understanding the nuances of language and context, leading to more accurate and relevant responses.
  • Multimodal capabilities: LLMs will be able to process and generate not only text but also images, audio, and other forms of media.
  • Ethical considerations: As LLMs become more powerful, addressing issues such as bias, misinformation, and privacy will become increasingly important.
  • New applications: LLMs will likely be used in a wider range of fields, such as healthcare, education, and finance.

Access to LLMs

There are several ways to access and utilize Large Language Models (LLMs). Here are some of the most common methods:

Cloud-Based Platforms

  • OpenAI API: Provides access to models like GPT-3 and GPT-4 through an API.
  • Google Vertex AI: Google Cloud's platform for building and deploying AI applications, including access to the Gemini models.
  • Hugging Face: Hosts thousands of open models and offers a hosted Inference API, in addition to the popular Transformers library.

Local Deployment

  • Hugging Face Transformers: You can download pre-trained models and run them locally using this library.
  • Custom Deployment: For more advanced users, it's possible to deploy LLMs on custom hardware or cloud infrastructure.

Ollama: A User-Friendly Option

Ollama is a particularly attractive option for individuals and small businesses due to its simplicity and low cost. It is a free, open-source tool that downloads and runs a variety of open models locally, including Meta's Llama family, Mistral, and Google's Gemma.

Key benefits of using Ollama:

  • Easy to use: Ollama offers a simple command-line interface that makes it easy to download and chat with LLMs.
  • Cost-effective: The software is free and open source; your only cost is the hardware you run it on.
  • Variety of models: Ollama's library offers a wide selection of open models in multiple sizes and quantization levels, allowing you to choose the best model for your specific needs.
  • Customizability: You can customize a model's behavior through settings and parameters, for example via a Modelfile.
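As a concrete sketch of a first session (assuming Ollama is installed and the `llama3.2:3b` tag is available in the Ollama library at the time you run this):

```shell
# Download a small model from the Ollama library, then ask it a question.
ollama pull llama3.2:3b
ollama run llama3.2:3b "Explain quantization in one sentence."

# Customize behavior with a Modelfile, then build and run the custom model.
cat > Modelfile <<'EOF'
FROM llama3.2:3b
PARAMETER temperature 0.2
SYSTEM You are a concise technical assistant.
EOF
ollama create concise-assistant -f Modelfile
ollama run concise-assistant "What is an instruct model?"
```

The Modelfile is where the customizability mentioned above lives: a base model, sampling parameters, and a system prompt, packaged under a name you choose.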

Finding the Right Ollama Model

When choosing a model for use with Ollama, it's essential to understand the key terms and factors that influence performance. Here's a breakdown:

Model Type

  • Instruct: These models are specifically trained to follow instructions and complete tasks based on prompts. They're ideal for tasks like summarization, translation, and creative writing.
  • Text: These are general-purpose text models capable of various text-related tasks, but might not be as adept at following specific instructions.

Model Size

The size of a model, typically measured in billions of parameters, strongly influences its capabilities. Larger models generally offer better performance and can handle more complex tasks, but they also require more memory and compute.

Quantization

This is a technique used to reduce the model's size and memory footprint by reducing the precision of its weights and biases.

  • FP16: Indicates 16-bit floating-point precision, offering a balance between accuracy and efficiency.
  • Q4_0, Q5_K_S: These represent different 4-bit or 5-bit quantization techniques, which can significantly reduce the model's size but might lead to some loss of accuracy.

Choosing the Right GPU

When selecting a GPU for running LLMs, several factors come into play, including model size, quantization level, desired performance, use case, and budget.

Here is a table with some example GPUs.

| GPU Model | Cost (Approx.) | Memory | Teraflops | LLM Use Cases |
| ----- | ----- | ----- | ----- | ----- |
| NVIDIA A100 | $10,000+ | 40GB/80GB (HBM2e) | 5,120 | Large-scale LLMs, research, production |
| NVIDIA H100 | $10,000+ | 80GB (HBM3) | 8,192 | Large-scale LLMs, research, production |
| NVIDIA A16 | $5,000+ | 64GB (GDDR6) | 11,056 | Multi-user inference, smaller LLMs |
| NVIDIA L40 | $5,000+ | 48GB (GDDR6) | 10,240 | Medium-to-large LLMs, inference, production |
| NVIDIA A30 | $3,000+ | 24GB (HBM2) | 5,504 | Medium-to-large LLMs, research, development |
| NVIDIA RTX 4090 | $1,500+ | 24GB (GDDR6X) | 24,576 | Medium-to-large LLMs, research, development |
| NVIDIA RTX 3090 | $1,000+ | 24GB (GDDR6X) | 10,496 | Medium-to-large LLMs, research, development |
| AMD Instinct MI250X | $10,000+ | 128GB (HBM2e) | 5,760 | Large-scale LLMs, research, production |
| AMD Radeon RX 6900 XT | $1,000+ | 16GB (GDDR6) | 20,480 | Smaller LLMs, research, development |
| AMD Radeon RX 6800 XT | $800+ | 16GB (GDDR6) | 16,384 | Smaller LLMs, research, development |
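Given a memory estimate for a model (the next section shows how to compute one), a quick first cut is simply to filter cards by memory. The figures below are illustrative round numbers for a few of the cards above, not quotes; always check current vendor specs:

```python
# Example GPUs and their memory in GB (illustrative figures only).
GPUS = {
    "NVIDIA A100 80GB": 80,
    "NVIDIA RTX 4090": 24,
    "NVIDIA RTX 3090": 24,
    "AMD Radeon RX 6800 XT": 16,
}

def candidates(required_gb: float) -> list[str]:
    """Return GPUs with at least required_gb of memory, smallest first."""
    fitting = [(mem, name) for name, mem in GPUS.items() if mem >= required_gb]
    return [name for mem, name in sorted(fitting)]

print(candidates(20))  # a model needing ~20 GB fits on 24 GB cards and up
```

Memory is only the first filter; throughput, cost, and cooling matter once several cards fit.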

GPU Memory Required

Here is a formula to estimate the GPU memory required to run an LLM, in gigabytes when P is given in billions of parameters. It includes a buffer of 20%:

M = (P * 4B) / (32 / Q) * 1.2

Here's a breakdown of the components:

  • P: The number of parameters in the model, in billions. For example, GPT-3 has 175 billion parameters, LLaMA-70b has 70 billion parameters, etc.
  • 4B: A constant representing the size of a full-precision (FP32) parameter: 4 bytes.
  • Q: The number of bits used to store each parameter; quantization is what produces the lower precisions. Common options include:
    • FP32 (32-bit floating point): Q = 32, i.e. 4 bytes per parameter
    • FP16/BF16 (16-bit floating point): Q = 16, i.e. 2 bytes per parameter
    • INT8 (8-bit integer): Q = 8, i.e. 1 byte per parameter
    • INT4 (4-bit integer): Q = 4, i.e. 0.5 bytes per parameter
  • 32 / Q: The compression factor relative to full 32-bit precision.
  • 1.2: Adds a 20% overhead to account for additional memory use such as activations, the KV cache, and other runtime state.
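The formula translates directly into a small helper, which you can use to check any model before downloading it:

```python
def gpu_memory_gb(params_billions: float, q_bits: int) -> float:
    """Estimate GPU memory in GB to run a model: M = (P * 4B) / (32 / Q) * 1.2.

    params_billions -- parameter count P, in billions
    q_bits          -- bits per parameter Q after quantization (32, 16, 8, or 4)
    """
    return (params_billions * 4) / (32 / q_bits) * 1.2

print(round(gpu_memory_gb(7.25, 4), 2))  # Mistral 7.25B at 4-bit -> 4.35 GB
print(round(gpu_memory_gb(236, 8), 1))   # Deepseek-v2.5 at 8-bit -> 283.2 GB
```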

Here are the GPU memory requirements for a set of models:

| Model | Parameters | Quantization | Formula | GPU Memory |
| ----- | ----- | ----- | ----- | ----- |
| LLaMA 3.2 | 3.21B | Q4_K_M | M = (3.21B * 4B) / (32 / 4) * 1.2 | 1.93 GB |
| Mistral | 7.25B | Q4_0 | M = (7.25B * 4B) / (32 / 4) * 1.2 | 4.35 GB |
| Qwen2.5 | 32.8B | Q4_K_M | M = (32.8B * 4B) / (32 / 4) * 1.2 | 19.68 GB |
| Deepseek-v2.5 | 236B | Q8_0 | M = (236B * 4B) / (32 / 8) * 1.2 | 283.2 GB |
| LLaMA 3.1-instruct-fp16 | 406B | F16 | M = (406B * 4B) / (32 / 16) * 1.2 | 974.4 GB |

Run LLMs on My Dev Server

Now that you have thought through what you want to do with an LLM, and how to select the right one for your purpose, we should talk about how to get access to the server that you need without breaking the bank.

This is the goal of My Dev Server.

My Dev Server is a platform that makes it easy to create, start, stop, and delete server instances in the cloud. It is designed for someone who has little or no experience with IT infrastructure: you can get a powerful server and pay only while you are using it.

My Dev Server is the easiest way to get a cloud development server, providing a simple, hassle-free solution for developers:

  • No need to buy an expensive server yourself.
  • No need to learn to manage complex cloud IT infrastructure.
  • Start your server when you need it, and stop it when you don't.
  • Preserve your files and data in between usage.
  • Pay only for what you use—no hidden costs.

Give it a try - experiment with an LLM.

Experiments

If you're a developer with access to a GPU-based server, you have a powerful platform for experimenting with LLMs. Here are some advanced experiments to consider:

Fine-Tuning and Customization

  • Fine-tune models for specific tasks: Train an LLM on a custom dataset to specialize it for a particular task, such as customer service or content generation.
  • Experiment with different hyperparameters: Adjust hyperparameters like learning rate, batch size, and number of epochs to optimize model performance.
  • Create custom modules or layers: Develop your own modules or layers to enhance the LLM's capabilities or address specific challenges.

Integration with Other Tools and Technologies

  • Combine LLMs with other AI techniques: Integrate LLMs with techniques like reinforcement learning, generative adversarial networks (GANs), or neural machine translation to create more sophisticated applications.
  • Build custom applications: Create custom applications that leverage LLMs for tasks such as chatbots, virtual assistants, or content generation.
  • Integrate with existing systems: Integrate LLMs into your existing software systems or workflows to automate tasks or improve user experiences.

Performance Optimization

  • Optimize for GPU utilization: Experiment with different techniques to maximize GPU utilization and improve inference speed.
  • Explore quantization and pruning: Use quantization and pruning techniques to reduce the model's size and memory footprint without sacrificing too much accuracy.
  • Benchmark performance: Measure the performance of different LLMs and hardware configurations to identify the best options for your specific needs.

Ethical Considerations

  • Address bias: Develop strategies to mitigate bias in LLMs and ensure that they are fair and equitable.
  • Protect privacy: Implement measures to protect user privacy and prevent the misuse of LLMs.
  • Consider ethical implications: Be mindful of the ethical implications of LLMs and their potential impact on society.

By exploring these advanced experiments, you can unlock the full potential of LLMs and create innovative applications that address real-world challenges.

Conclusion

As we've explored in this guide, Large Language Models (LLMs) represent a powerful and rapidly evolving technology with vast potential across numerous fields. From understanding the basics of what LLMs are and who builds them, to accessing them through various platforms and running experiments, there's a wealth of opportunities for developers and enthusiasts alike.

Whether you're using cloud-based platforms, local deployments, or user-friendly options like Ollama, the key is to start experimenting and learning. As you delve deeper into the world of LLMs, remember to consider factors such as model size, quantization, and ethical implications.

Platforms like My Dev Server offer an accessible way to get started with LLMs, providing the necessary computational power without the need for extensive IT knowledge or upfront investment. As you continue your journey with LLMs, stay curious, keep experimenting, and don't hesitate to push the boundaries of what's possible.

The field of AI and LLMs is constantly evolving, so keep learning, stay updated with the latest developments, and most importantly, enjoy the process of discovery and creation.

Happy experimenting!

© Copyright 2025. Make Life Great LLC. All rights reserved.