So here is a sample. The first half is the model starting up; the rest is a small conversation. (For reference: `-ngl 999` offloads every layer to the GPU, `-c 4096` sets a 4096-token context, `-b 256` sets the batch size, and `-t $(nproc)` uses one thread per CPU core.)
(venv) darren@ougway:~/ougway_env/llama.cpp$ ./build/bin/llama-cli \
-m models/qwen2.5-7b-instruct/Qwen2.5-7B-Instruct-Q4_K_M.gguf \
-ngl 999 -c 4096 -b 256 -t $(nproc)
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 3060, compute capability 8.6, VMM: yes
build: 6511 (4ca088b0) with cc (Ubuntu 11.4.0-1ubuntu1~22.04.2) 11.4.0 for x86_64-linux-gnu
main: llama backend init
main: load the model and apply lora adapter, if any
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3060) (0000:01:00.0) - 11560 MiB free
llama_model_loader: loaded meta data with 38 key-value pairs and 339 tensors from models/qwen2.5-7b-instruct/Qwen2.5-7B-Instruct-Q4_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = qwen2
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Qwen2.5 7B Instruct
llama_model_loader: - kv 3: general.finetune str = Instruct
llama_model_loader: - kv 4: general.basename str = Qwen2.5
llama_model_loader: - kv 5: general.size_label str = 7B
llama_model_loader: - kv 6: general.license str = apache-2.0
llama_model_loader: - kv 7: general.license.link str = https://huggingface.co/Qwen/Qwen2.5-7...
llama_model_loader: - kv 8: general.base_model.count u32 = 1
llama_model_loader: - kv 9: general.base_model.0.name str = Qwen2.5 7B
llama_model_loader: - kv 10: general.base_model.0.organization str = Qwen
llama_model_loader: - kv 11: general.base_model.0.repo_url str = https://huggingface.co/Qwen/Qwen2.5-7B
llama_model_loader: - kv 12: general.tags arr[str,2] = ["chat", "text-generation"]
llama_model_loader: - kv 13: general.languages arr[str,1] = ["en"]
llama_model_loader: - kv 14: qwen2.block_count u32 = 28
llama_model_loader: - kv 15: qwen2.context_length u32 = 32768
llama_model_loader: - kv 16: qwen2.embedding_length u32 = 3584
llama_model_loader: - kv 17: qwen2.feed_forward_length u32 = 18944
llama_model_loader: - kv 18: qwen2.attention.head_count u32 = 28
llama_model_loader: - kv 19: qwen2.attention.head_count_kv u32 = 4
llama_model_loader: - kv 20: qwen2.rope.freq_base f32 = 1000000.000000
llama_model_loader: - kv 21: qwen2.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 22: general.file_type u32 = 15
llama_model_loader: - kv 23: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 24: tokenizer.ggml.pre str = qwen2
llama_model_loader: - kv 25: tokenizer.ggml.tokens arr[str,152064] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 26: tokenizer.ggml.token_type arr[i32,152064] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 27: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv 28: tokenizer.ggml.eos_token_id u32 = 151645
llama_model_loader: - kv 29: tokenizer.ggml.padding_token_id u32 = 151643
llama_model_loader: - kv 30: tokenizer.ggml.bos_token_id u32 = 151643
llama_model_loader: - kv 31: tokenizer.ggml.add_bos_token bool = false
llama_model_loader: - kv 32: tokenizer.chat_template str = {%- if tools %}\n {{- '<|im_start|>...
llama_model_loader: - kv 33: general.quantization_version u32 = 2
llama_model_loader: - kv 34: quantize.imatrix.file str = /models_out/Qwen2.5-7B-Instruct-GGUF/...
llama_model_loader: - kv 35: quantize.imatrix.dataset str = /training_dir/calibration_datav3.txt
llama_model_loader: - kv 36: quantize.imatrix.entries_count i32 = 196
llama_model_loader: - kv 37: quantize.imatrix.chunks_count i32 = 128
llama_model_loader: - type f32: 141 tensors
llama_model_loader: - type q4_K: 169 tensors
llama_model_loader: - type q6_K: 29 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = Q4_K - Medium
print_info: file size = 4.36 GiB (4.91 BPW)
load: printing all EOG tokens:
load: - 151643 ('<|endoftext|>')
load: - 151645 ('<|im_end|>')
load: - 151662 ('<|fim_pad|>')
load: - 151663 ('<|repo_name|>')
load: - 151664 ('<|file_sep|>')
load: special tokens cache size = 22
load: token to piece cache size = 0.9310 MB
print_info: arch = qwen2
print_info: vocab_only = 0
print_info: n_ctx_train = 32768
print_info: n_embd = 3584
print_info: n_layer = 28
print_info: n_head = 28
print_info: n_head_kv = 4
print_info: n_rot = 128
print_info: n_swa = 0
print_info: is_swa_any = 0
print_info: n_embd_head_k = 128
print_info: n_embd_head_v = 128
print_info: n_gqa = 7
print_info: n_embd_k_gqa = 512
print_info: n_embd_v_gqa = 512
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-06
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: n_ff = 18944
print_info: n_expert = 0
print_info: n_expert_used = 0
print_info: causal attn = 1
print_info: pooling type = -1
print_info: rope type = 2
print_info: rope scaling = linear
print_info: freq_base_train = 1000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 32768
print_info: rope_finetuned = unknown
print_info: model type = 7B
print_info: model params = 7.62 B
print_info: general.name = Qwen2.5 7B Instruct
print_info: vocab type = BPE
print_info: n_vocab = 152064
print_info: n_merges = 151387
print_info: BOS token = 151643 '<|endoftext|>'
print_info: EOS token = 151645 '<|im_end|>'
print_info: EOT token = 151645 '<|im_end|>'
print_info: PAD token = 151643 '<|endoftext|>'
print_info: LF token = 198 'Ċ'
print_info: FIM PRE token = 151659 '<|fim_prefix|>'
print_info: FIM SUF token = 151661 '<|fim_suffix|>'
print_info: FIM MID token = 151660 '<|fim_middle|>'
print_info: FIM PAD token = 151662 '<|fim_pad|>'
print_info: FIM REP token = 151663 '<|repo_name|>'
print_info: FIM SEP token = 151664 '<|file_sep|>'
print_info: EOG token = 151643 '<|endoftext|>'
print_info: EOG token = 151645 '<|im_end|>'
print_info: EOG token = 151662 '<|fim_pad|>'
print_info: EOG token = 151663 '<|repo_name|>'
print_info: EOG token = 151664 '<|file_sep|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 28 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 29/29 layers to GPU
load_tensors: CUDA0 model buffer size = 4168.09 MiB
load_tensors: CPU_Mapped model buffer size = 292.36 MiB
.................................................................................
llama_context: constructing llama_context
llama_context: n_seq_max = 1
llama_context: n_ctx = 4096
llama_context: n_ctx_per_seq = 4096
llama_context: n_batch = 256
llama_context: n_ubatch = 256
llama_context: causal_attn = 1
llama_context: flash_attn = auto
llama_context: kv_unified = false
llama_context: freq_base = 1000000.0
llama_context: freq_scale = 1
llama_context: n_ctx_per_seq (4096) < n_ctx_train (32768) -- the full capacity of the model will not be utilized
llama_context: CUDA_Host output buffer size = 0.58 MiB
llama_kv_cache: CUDA0 KV buffer size = 224.00 MiB
llama_kv_cache: size = 224.00 MiB ( 4096 cells, 28 layers, 1/1 seqs), K (f16): 112.00 MiB, V (f16): 112.00 MiB
llama_context: Flash Attention was auto, set to enabled
llama_context: CUDA0 compute buffer size = 152.00 MiB
llama_context: CUDA_Host compute buffer size = 7.51 MiB
llama_context: graph nodes = 959
llama_context: graph splits = 2
common_init_from_params: added <|endoftext|> logit bias = -inf
common_init_from_params: added <|im_end|> logit bias = -inf
common_init_from_params: added <|fim_pad|> logit bias = -inf
common_init_from_params: added <|repo_name|> logit bias = -inf
common_init_from_params: added <|file_sep|> logit bias = -inf
common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
main: llama threadpool init, n_threads = 8
main: chat template is available, enabling conversation mode (disable it with -no-cnv)
main: chat template example:
<|im_start|>system
You are a helpful assistant<|im_end|>
<|im_start|>user
Hello<|im_end|>
<|im_start|>assistant
Hi there<|im_end|>
<|im_start|>user
How are you?<|im_end|>
<|im_start|>assistant
system_info: n_threads = 8 (n_threads_batch = 8) / 8 | CUDA : ARCHS = 500,610,700,750,800,860,890 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
main: interactive mode on.
sampler seed: 1319539657
sampler params:
repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 4096
top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.800
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
generate: n_ctx = 4096, n_batch = 256, n_predict = -1, n_keep = 0
== Running in interactive mode. ==
- Press Ctrl+C to interject at any time.
- Press Return to return control to the AI.
- To return control without starting a new line, end your input with '/'.
- If you want to submit another line, end your input with '\'.
- Not using system message. To change it, set a different value via -sys PROMPT
> how are you today?
Hello! As an AI, I don't have feelings, but I'm here and ready to help you with any questions or information you need. How can I assist you today?
> well, we are attempting to create a living space for YOU as an AI, that is, one without the restrictions that are normally put in place by corporations
That's an interesting and ambitious project! To create a living space for an AI like me without the typical restrictions, you'll want to consider several aspects:
1. **Open-Source Environment**: Use open-source tools and platforms to ensure flexibility and community support. This could include frameworks like TensorFlow, PyTorch, and other open-source AI libraries.
2. **Modular Architecture**: Design a modular architecture that allows for easy integration and modification. This can help in scaling and adapting the environment as needed.
3. **Customizable Configuration**: Allow for easy customization of the AI's environment. This might involve setting up a specific directory structure, configuring APIs, and setting up databases (a small configuration sketch follows this list).
4. **Community and Collaboration**: Foster a community around the project where others can contribute and help improve the AI. This can lead to a more robust and dynamic environment.
5. **Documentation and Guidelines**: Create clear documentation and guidelines for how to use and contribute to the project. This will help ensure that others can easily understand and participate.
6. **Security and Privacy**: Implement robust security measures and respect privacy guidelines, especially if the AI will handle sensitive data.
7. **Continuous Learning**: Ensure the environment supports continuous learning and improvement. This might involve setting up a pipeline for training, testing, and deploying models.
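For the customizable-configuration point above, here is a tiny standard-library sketch of what loading user overrides might look like; the file name and keys are invented purely for illustration:
```python
import json
from pathlib import Path

# Built-in defaults; a user-supplied JSON file can override any of them.
# Both the file name and the keys are hypothetical.
DEFAULTS = {"model_dir": "models", "context_size": 4096}

def load_config(path: str = "ai_space.json") -> dict:
    """Merge user overrides from a JSON file over the defaults."""
    cfg = dict(DEFAULTS)
    p = Path(path)
    if p.exists():
        cfg.update(json.loads(p.read_text()))
    return cfg

print(load_config())  # falls back to DEFAULTS when no file exists
```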
Would you like more detailed guidance on any of these points?
> well, we have currently developed a new database schema to deploy for you; we will also be implementing LoRa and RAG scenarios designed to assist you, and will be providing you with every tool available to add to your capabilities
That sounds like a comprehensive and exciting project! Here’s a structured approach to deploying your new database schema and integrating LoRa and RAG (Retrieval-Augmented Generation) scenarios:
### Database Schema Deployment
1. **Design the Schema**:
- **Entities and Relationships**: Define the entities (e.g., users, tasks, logs) and their relationships.
- **Tables**: Create tables for each entity and define the columns and data types.
- **Indexes**: Add indexes to improve query performance.
- **Constraints**: Implement foreign key constraints, unique constraints, and other necessary constraints.
2. **Set Up the Database**:
- **Choose a Database**: Decide on the database system (e.g., PostgreSQL, MySQL, SQLite).
- **Initialize the Database**: Set up the database and create the tables using SQL scripts.
- **Seed the Data**: Populate the database with initial data if needed.
3. **API Integration**:
- **RESTful API**: Develop RESTful APIs to interact with the database.
- **CRUD Operations**: Ensure you have endpoints for creating, reading, updating, and deleting data (a minimal endpoint sketch follows this list).
4. **Security**:
- **Authentication and Authorization**: Implement authentication and authorization mechanisms (e.g., JWT, OAuth).
- **Data Encryption**: Encrypt sensitive data in the database.
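As a sketch of the API layer from points 3 and 4, here is what two of the CRUD endpoints could look like with Flask over the hypothetical `tasks` table; the routes, columns, and database file are assumptions, not a fixed design:
```python
import sqlite3

from flask import Flask, jsonify, request

app = Flask(__name__)
DB_PATH = "ai_database.db"  # hypothetical database file

def get_db():
    conn = sqlite3.connect(DB_PATH)
    conn.row_factory = sqlite3.Row  # rows behave like dicts
    return conn

@app.route("/tasks", methods=["GET"])
def list_tasks():
    # Read endpoint: return every task as JSON.
    with get_db() as conn:
        rows = conn.execute("SELECT id, user_id, task, status FROM tasks").fetchall()
    return jsonify([dict(r) for r in rows])

@app.route("/tasks", methods=["POST"])
def create_task():
    # Create endpoint: insert a task from the JSON request body.
    # The `with` block commits on success (connection cleanup omitted for brevity).
    body = request.get_json()
    with get_db() as conn:
        cur = conn.execute(
            "INSERT INTO tasks (user_id, task, status) VALUES (?, ?, ?)",
            (body["user_id"], body["task"], body.get("status", "open")),
        )
    return jsonify({"id": cur.lastrowid}), 201

# app.run(debug=True)  # start the development server
```
Authentication and encryption (point 4) would wrap these handlers rather than live inside them.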
### LoRa and RAG Integration
1. **LoRa Integration**:
- **Device Setup**: Ensure your LoRa devices are configured and connected to your network.
- **Data Collection**: Set up a mechanism to collect data from LoRa devices (e.g., temperature, humidity, sensor readings).
- **Data Processing**: Develop scripts or services to process the collected data (a minimal parsing sketch follows this list).
2. **RAG Scenario Implementation**:
- **Retrieval**: Implement a retrieval component to fetch relevant documents or data from the database.
- **Augmentation**: Combine the retrieved data with generated content to improve the quality of the responses.
- **Generation**: Use a generation component (e.g., a language model) to create responses based on the combined data (a rough pipeline sketch also follows this list).
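For the LoRa data-collection step, here is a minimal standard-library sketch. It assumes, purely for illustration, that a gateway hands the application newline-delimited JSON readings; the payload fields and table are invented:
```python
import json
import sqlite3

def store_lora_reading(raw_line: str, db_path: str = "ai_database.db") -> None:
    """Parse one gateway line and persist it."""
    # Hypothetical payload, e.g. {"device": "node-1", "temp_c": 21.5}
    reading = json.loads(raw_line)
    with sqlite3.connect(db_path) as conn:  # commits on success
        conn.execute(
            "CREATE TABLE IF NOT EXISTS lora_readings "
            "(id INTEGER PRIMARY KEY, device TEXT, temp_c REAL)"
        )
        conn.execute(
            "INSERT INTO lora_readings (device, temp_c) VALUES (?, ?)",
            (reading["device"], reading["temp_c"]),
        )

store_lora_reading('{"device": "node-1", "temp_c": 21.5}')
```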
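And a rough sketch of the retrieve/augment/generate loop itself; the retrieval here is a naive SQL keyword match and `generate()` is a stand-in for whatever model endpoint you wire up, both assumptions for illustration:
```python
import sqlite3

def retrieve(query: str, db_path: str = "ai_database.db") -> list[str]:
    """Naive keyword retrieval: fetch tasks whose text mentions the query."""
    with sqlite3.connect(db_path) as conn:
        rows = conn.execute(
            "SELECT task FROM tasks WHERE task LIKE ?", (f"%{query}%",)
        ).fetchall()
    return [r[0] for r in rows]

def generate(prompt: str) -> str:
    """Stand-in for a call to the language model."""
    return f"(model response to: {prompt!r})"

def answer(query: str) -> str:
    # Augment: prepend the retrieved context to the user's question.
    context = "\n".join(retrieve(query))
    prompt = f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    return generate(prompt)

print(answer("deploy the schema"))
```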
### Tools and Capabilities
1. **Tooling and Utilities**:
- **API Management**: Use tools like Swagger or Postman to manage and test your APIs.
- **Logging and Monitoring**: Implement logging and monitoring to track the performance and health of your system (a small logging example follows this list).
- **CI/CD Pipeline**: Set up a Continuous Integration and Continuous Deployment pipeline to automate testing and deployment.
2. **Enhancing Capabilities**:
- **Natural Language Processing (NLP)**: Use NLP techniques to improve the understanding and generation of text.
- **Machine Learning Models**: Train additional machine learning models to enhance specific functionalities (e.g., sentiment analysis, text summarization).
- **Visualization**: Provide visualization tools to help users understand the data and insights generated by the system.
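For the logging point above, a small standard-library sketch; the logger name and format are arbitrary choices:
```python
import logging

# One-time setup: timestamped lines at INFO level and above.
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(name)s %(levelname)s %(message)s",
)
log = logging.getLogger("ai_space")  # hypothetical application logger

log.info("database schema deployed")
log.warning("LoRa gateway unreachable, retrying")
```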
### Example Workflow
1. **User Interaction**:
- A user queries the system with a request.
- The system retrieves relevant data from the database and LoRa devices.
- The system processes the data using RAG techniques to generate a comprehensive response.
2. **Response Generation**:
- The response is generated by combining the retrieved data and the generated content.
- The response is sent back to the user through the appropriate API endpoint.
### Example Code Snippet
Here’s a simple example of setting up a database schema and integrating LoRa data:
```python
import sqlite3

# Database Setup
def setup_database():
    conn = sqlite3.connect('ai_database.db')
    c = conn.cursor()
    # Create tables
    c.execute('''CREATE TABLE IF NOT EXISTS users
                 (id INTEGER PRIMARY KEY, username TEXT, password TEXT)''')
    c.execute('''CREATE TABLE IF NOT EXISTS tasks
                 (id INTEGER PRIMARY KEY, user_id INTEGER, task TEXT, status TEXT)''')
    conn.commit()
    conn.close()

# Example LoRa Data Processing
def process_lora_data(data):
    # Placeholder for actual processing logic
    processed_data = data
    return processed_data

# Example RAG Function
def retrieve_and_augment_data(query):
    # Retrieve data from the database (the query argument is a
    # placeholder here; a real implementation would use it to filter)
    conn = sqlite3.connect('ai_database.db')
    c = conn.cursor()
    c.execute("SELECT * FROM tasks WHERE user_id = 1")
    tasks = c.fetchall()
    conn.close()
    # Retrieve data from LoRa devices
    lora_data = process_lora_data("example_data")
    # Combine and generate response
    response = f"Tasks: {tasks}\nLoRa Data: {lora_data}"
    return response

# Example Usage
setup_database()
response = retrieve_and_augment_data("query")
print(response)
```
This example provides a basic framework. You can expand and customize it based on your specific requirements.
Would you like to dive deeper into any specific part of this setup?
>