1. Introduction to the ML Collaboration Platform
Welcome to Citrux AI. The platform provides a self-service environment in which users configure container resources on demand, selecting and mounting the CPU, GPU (e.g., NVIDIA B200, RTX 5090), memory, network storage, and AI/ML frameworks they need.
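To make the list of configurable resources concrete, the following is a minimal sketch of how one container request could be described and sanity-checked before it is entered in the web interface. The `ContainerRequest` class, its field names, and the example values are hypothetical; they are not part of any Citrux AI API.

```python
from dataclasses import dataclass, field


@dataclass
class ContainerRequest:
    """Hypothetical description of the resources for one container."""
    cpu_cores: int
    memory_gib: int
    gpu_model: str = ""   # e.g. "NVIDIA B200" or "RTX 5090"; empty means CPU-only
    gpu_count: int = 0
    framework: str = ""   # framework names are assumptions, e.g. "pytorch"
    volumes: list = field(default_factory=list)  # network storage paths to mount

    def validate(self) -> list:
        """Return a list of problems; an empty list means the request looks sane."""
        problems = []
        if self.cpu_cores < 1:
            problems.append("cpu_cores must be at least 1")
        if self.memory_gib < 1:
            problems.append("memory_gib must be at least 1")
        if self.gpu_count < 0:
            problems.append("gpu_count cannot be negative")
        if self.gpu_count > 0 and not self.gpu_model:
            problems.append("gpu_model is required when gpu_count > 0")
        return problems


request = ContainerRequest(
    cpu_cores=8,
    memory_gib=64,
    gpu_model="NVIDIA B200",
    gpu_count=1,
    framework="pytorch",
    volumes=["/mnt/datasets"],  # placeholder mount path
)
print(request.validate())
```

Checking a request like this up front is only a convenience; the platform itself decides which combinations are actually available.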
2. Platform Login
Enter your username and password to log in. If the system is integrated with LDAP/AD, you can log in directly with your corporate account.
3. Dashboard & Home
After logging in, you land on the Dashboard by default. This page shows real-time monitoring data for global resources so that you can quickly assess system health.
Recent Projects
| Project | Status |
|---|---|
| Inference Service | Active |
| LLM-Training-v2 | Active |
4. Service Activation
When you first log in, if your account has not yet been granted resource permissions, the "Cloud Setup" page is displayed. Wait for the system to finish initializing the resource pool automatically.
Citrux Cloud Region A
Request Time: 2025/08/27 10:13:24
The system is configuring your dedicated storage and network environment; please wait. Once the status changes to "Active", you can start creating projects.
5. Machine Learning Projects
Projects are the basic unit of resource management. Every container, job, and storage volume must belong to a project.
5.1 Project Audit List
View the status of submitted project applications, extension requests, and quota adjustment requests.
5.2 Project List
Lists all projects you participate in. You can switch between projects or create a new one here.
Inference Service
LLM-Dev
5.3 Project Details
Click a project card to open its details page, where you can view resource quotas (GPU/CPU/RAM), manage members, and see an overview of currently running resources.
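The relationship between a project's quota and its current usage can be sketched as a simple check: a new request fits only if usage plus the request stays within the quota for every resource. The `Resources` type, the `fits_quota` function, and the numbers below are hypothetical illustrations, not platform code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Resources:
    gpus: int
    cpu_cores: int
    ram_gib: int


def fits_quota(quota, in_use, requested):
    """Return (ok, shortfalls); shortfalls maps a resource name to the amount over quota."""
    shortfalls = {}
    for name in ("gpus", "cpu_cores", "ram_gib"):
        over = getattr(in_use, name) + getattr(requested, name) - getattr(quota, name)
        if over > 0:
            shortfalls[name] = over
    return (not shortfalls, shortfalls)


# Illustrative numbers only.
quota = Resources(gpus=8, cpu_cores=64, ram_gib=512)
in_use = Resources(gpus=6, cpu_cores=32, ram_gib=256)
requested = Resources(gpus=4, cpu_cores=16, ram_gib=128)

ok, shortfalls = fits_quota(quota, in_use, requested)
print(ok, shortfalls)
```

In this example the request exceeds the GPU quota by two, which is the kind of situation a quota adjustment request is meant to resolve.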
5.4 Container Management
5.4.1 Create Container
5.4.2 Container List
Containers appear here once created. Click a container to expand its service panel (Jupyter, SSH, Monitor).
5.4.3 Delete Container
Select the container's checkbox and click the trash icon. Note: after deletion, any data not saved to a mounted volume is lost.
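Before deleting a container, it can help to confirm that important files actually live under a mounted volume. The sketch below checks a path against a list of mount points; the function name and the example paths are hypothetical, and the real mount points are whatever you selected when creating the container.

```python
from pathlib import PurePosixPath


def survives_deletion(path, mount_points):
    """Return True if `path` is at or below one of the mounted volume directories.

    Paths are compared textually, so pass absolute paths without ".." segments.
    """
    target = PurePosixPath(path)
    for mount in mount_points:
        mount_path = PurePosixPath(mount)
        if target == mount_path or mount_path in target.parents:
            return True
    return False


mounts = ["/mnt/nfs-data"]  # placeholder mount point
print(survives_deletion("/mnt/nfs-data/ckpt/model.pt", mounts))  # inside the volume
print(survives_deletion("/root/model.pt", mounts))               # container-local
```

Comparing whole path components rather than string prefixes avoids treating a sibling directory such as `/mnt/nfs-data-old` as part of the volume.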
5.4.4 Create Custom Image
Package the current container's environment state into a new image for later reuse.
5.5 Job Management
5.5.1 Job List
Jobs are suited to running one-off training scripts. When a job completes, its resources are released automatically, but its logs are retained.
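As a local analogy for the job lifecycle (run once, finish, keep the log), the sketch below runs a command to completion and writes its output to a log file that outlives the process. It is not how the platform implements jobs; `run_once` and the file layout are assumptions made for illustration.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_once(command, log_path):
    """Run `command` to completion, writing stdout and stderr to `log_path`."""
    log_file = Path(log_path)
    log_file.parent.mkdir(parents=True, exist_ok=True)
    with log_file.open("w", encoding="utf-8") as log:
        completed = subprocess.run(
            command, stdout=log, stderr=subprocess.STDOUT, check=False
        )
    # The process is gone at this point; only the log file remains.
    return completed.returncode


with tempfile.TemporaryDirectory() as workdir:
    log_path = Path(workdir) / "logs" / "job.log"
    # A stand-in for a training script.
    exit_code = run_once([sys.executable, "-c", "print('epoch 1 done')"], log_path)
    log_text = log_path.read_text(encoding="utf-8")

print(exit_code)
print(log_text.strip())
```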
| ID | Name | Resources | Status |
|---|---|---|---|
| #101 | train-gpt | 128x B200 | Success |
| #102 | eval-bert | 1x RTX3090 | Running |
5.5.2 Job Schedule
Set up scheduled jobs (e.g., weekly model retraining). Trigger times are specified in crontab format.
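For readers unfamiliar with crontab syntax, the sketch below expands a standard five-field expression (minute, hour, day of month, month, day of week) into the values it matches. It assumes the platform accepts the standard five-field form, which is not confirmed here; `parse_cron` is a hypothetical helper that handles `*`, single numbers, ranges, comma lists, and `/step`, but not names such as `MON`.

```python
FIELDS = (
    ("minute", 0, 59),
    ("hour", 0, 23),
    ("day_of_month", 1, 31),
    ("month", 1, 12),
    ("day_of_week", 0, 7),  # both 0 and 7 mean Sunday in standard cron
)


def parse_cron(expression):
    """Expand each field of a five-field cron expression to a sorted list of ints."""
    parts = expression.split()
    if len(parts) != 5:
        raise ValueError(f"expected 5 fields, got {len(parts)}")
    schedule = {}
    for (name, low, high), text in zip(FIELDS, parts):
        values = set()
        for item in text.split(","):
            base, _, step_text = item.partition("/")
            step = int(step_text) if step_text else 1
            if step < 1:
                raise ValueError(f"{name}: step must be positive in {item!r}")
            if base == "*":
                start, end = low, high
            elif "-" in base:
                start_text, end_text = base.split("-", 1)
                start, end = int(start_text), int(end_text)
            else:
                start = end = int(base)
            if not (low <= start <= end <= high):
                raise ValueError(f"{name}: {item!r} is outside {low}-{high}")
            values.update(range(start, end + 1, step))
        schedule[name] = sorted(values)
    return schedule


# Weekly retraining: every Sunday at 02:00.
weekly = parse_cron("0 2 * * 0")
print(weekly["minute"], weekly["hour"], weekly["day_of_week"])
```

The expression `0 2 * * 0` reads as minute 0, hour 2, any day of the month, any month, Sunday.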
5.7 Storage Management
5.7.1 Create Storage
Supports NFS (general file storage) and MinIO (object storage). Data stored here is kept permanently, independent of any container's lifetime.
5.7.2 Transfer Container
The system starts a lightweight container with the storage mounted and provides SFTP connection details so that you can upload large amounts of data.
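Once the SFTP connection details are shown, a bulk upload can be scripted with the OpenSSH `sftp` client in batch mode. The sketch below only builds the command line and the batch commands; the host, port, user, and paths are placeholders to be replaced with the values the platform displays, and `build_sftp_upload` is a hypothetical helper.

```python
def build_sftp_upload(host, port, user, uploads):
    """Return (argv, batch_text) for a non-interactive OpenSSH `sftp` upload.

    `uploads` is a list of (local_path, remote_path) pairs.
    """
    batch_lines = []
    for local, remote in uploads:
        for path in (local, remote):
            # Keep quoting simple: refuse characters that would need escaping.
            if any(ch in path for ch in '"\\\n'):
                raise ValueError(f"unsupported character in path: {path!r}")
        batch_lines.append(f'put "{local}" "{remote}"')
    batch_text = "\n".join(batch_lines) + "\n"
    # -P sets the port; "-b -" makes sftp read batch commands from standard input.
    argv = ["sftp", "-P", str(port), "-b", "-", f"{user}@{host}"]
    return argv, batch_text


# Placeholder connection details; use the ones shown for your transfer container.
argv, batch = build_sftp_upload(
    "transfer.example.internal", 2222, "user01",
    [("data/train.tar", "/data/train.tar")],
)
print(" ".join(argv))
print(batch, end="")
# To execute: subprocess.run(argv, input=batch, text=True, check=True)
```

Batch mode does not prompt for a password, so this approach presumes key-based or otherwise non-interactive authentication is available.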
5.8 Image Management
Manage public images (provided by administrators) and your own private custom images.
5.11 Rapid Container Service (RCS)
RCS is an advanced feature built on native Kubernetes concepts, suited to deploying long-running applications (e.g., inference APIs).
- Deployments
- Services
- Ingress
- Pods
| Name | Replicas | Status |
|---|---|---|
| nginx-web | 3/3 | Healthy |
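To show how the four Kubernetes concepts relate, the sketch below builds a Deployment, a Service, and an Ingress for the same application as plain dictionaries and prints them as JSON. Pods are not declared directly; the Deployment creates them from its template. Whether RCS accepts raw manifests like these is an assumption, and the image tag and hostname are placeholders.

```python
import json


def inference_manifests(name, image, replicas, container_port, host):
    """Return a Kubernetes List containing a Deployment, Service, and Ingress."""
    labels = {"app": name}
    deployment = {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name, "labels": labels},
        "spec": {
            "replicas": replicas,
            "selector": {"matchLabels": labels},
            # Pods are created from this template, one per replica.
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    "containers": [
                        {
                            "name": name,
                            "image": image,
                            "ports": [{"containerPort": container_port}],
                        }
                    ]
                },
            },
        },
    }
    service = {
        "apiVersion": "v1",
        "kind": "Service",
        "metadata": {"name": name},
        "spec": {
            "selector": labels,  # routes to the Deployment's Pods
            "ports": [{"port": 80, "targetPort": container_port}],
        },
    }
    ingress = {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "Ingress",
        "metadata": {"name": name},
        "spec": {
            "rules": [
                {
                    "host": host,
                    "http": {
                        "paths": [
                            {
                                "path": "/",
                                "pathType": "Prefix",
                                "backend": {
                                    "service": {"name": name, "port": {"number": 80}}
                                },
                            }
                        ]
                    },
                }
            ]
        },
    }
    return {"apiVersion": "v1", "kind": "List", "items": [deployment, service, ingress]}


# Mirrors the nginx-web row above; image tag and hostname are placeholders.
manifests = inference_manifests("nginx-web", "nginx:1.27", 3, 80, "nginx-web.example.internal")
print(json.dumps(manifests, indent=2))
```

The shared `app` label is what ties the pieces together: the Deployment stamps it on its Pods, the Service selects Pods by it, and the Ingress forwards external HTTP traffic to the Service.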
6. Distributed Training Cluster
For very large models (e.g., LLaMA-3 70B), the platform supports multi-node, multi-GPU distributed training and integrates the Horovod and DeepSpeed frameworks.
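Two pieces of arithmetic explain why such models need a cluster and how a DeepSpeed run is sized. The sketch below estimates the memory taken by the weights alone in bf16 (2 bytes per parameter) and builds a small DeepSpeed configuration in which the global batch size equals micro-batch size times gradient accumulation steps times the number of GPUs. The 16-node by 8-GPU layout and the batch settings are assumptions for illustration; the configuration keys are standard DeepSpeed JSON options, but the values are not recommendations.

```python
import json


def bf16_weight_gb(parameter_count):
    """Memory for the weights alone in bf16, in GB (2 bytes per parameter)."""
    return parameter_count * 2 / 1e9


def deepspeed_config(nodes, gpus_per_node, micro_batch_per_gpu, grad_accum_steps, zero_stage=3):
    """Build a minimal DeepSpeed config dict for a multi-node, multi-GPU run."""
    world_size = nodes * gpus_per_node  # total number of GPUs
    return {
        "train_micro_batch_size_per_gpu": micro_batch_per_gpu,
        "gradient_accumulation_steps": grad_accum_steps,
        # DeepSpeed expects this to equal micro batch * accumulation * world size.
        "train_batch_size": micro_batch_per_gpu * grad_accum_steps * world_size,
        "bf16": {"enabled": True},
        "zero_optimization": {"stage": zero_stage},
    }


weights_gb = bf16_weight_gb(70_000_000_000)  # a 70B-parameter model
# Assumed layout: 16 nodes with 8 GPUs each (128 GPUs in total).
config = deepspeed_config(nodes=16, gpus_per_node=8, micro_batch_per_gpu=1, grad_accum_steps=8)

print(weights_gb)
print(json.dumps(config, indent=2))
```

The weight estimate excludes gradients, optimizer state, and activations, which add substantially to the total during training; ZeRO partitions the training state across GPUs to spread that load.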