1. Introduction to the ML Collaboration Platform

Welcome to Citrux AI. The platform provides a self-service environment in which you configure container resources on demand, selecting and mounting the CPU, GPU (e.g., NVIDIA B200, RTX 5090), memory, network drives, and AI/machine learning frameworks you need.

Recommended environment: Chrome 80 or later, or Firefox 80 or later.

2. Platform Login

Enter your username and password to log in. If the system is integrated with LDAP/AD, you can log in directly with your corporate account.

🚀 Citrux AI

User Login

3. Dashboard & Home

After logging in, you land on the Dashboard by default. This page shows real-time monitoring data for global resources so you can quickly assess system health.

Dashboard

Hello, JessieH

Total Cores: 1,792
Active Jobs: 187
Pending: 34
Monthly Cost: $847K

Recent Projects

Inference Service: Active
LLM-Training-v2: Active

4. Service Activation

On first login, if your account has not yet been provisioned with resource permissions, the "Cloud Setup" page is displayed. Wait while the system automatically initializes your resource pool.

Citrux Cloud Region A

Request Time: 2025/08/27 10:13:24
Provisioning...

The system is configuring your dedicated storage and network environment. Once the status changes to "Active", you can start creating projects.

5. Machine Learning Projects

A project is the basic unit of resource management. Every container, job, and storage device must belong to a project.

5.1 Project Audit List

View the status of submitted requests for new projects, project extensions, or quota adjustments.

5.2 Project List

Lists all projects you are a member of. You can switch between projects or create a new one here.

Inference Service

Status: Active
Expires: 2026/08/27

LLM-Dev

Status: Pending Review
Applied: 2025/09/01

5.3 Project Details

Click a project card to open its detail page, which shows resource quotas (GPU/CPU/RAM), member management, and an overview of currently running resources.

5.4 Container Management

5.4.1 Create Container

Create Container Wizard
1. Name
2. Image
3. Resources
4. Advanced (Mounts / Shared Memory)
Enable Shared Memory
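
The shared memory option matters mainly for frameworks that pass data between worker processes through /dev/shm (PyTorch's multi-process data loading is a common case). The following is a minimal sketch for checking, from inside a running container, how much shared memory is available. The `/dev/shm` path and the 1 GiB threshold are assumptions for illustration, not values documented by the platform.

```python
import os
import shutil


def shm_report(path="/dev/shm", min_gib=1.0):
    """Return (total_gib, free_gib, ok) for the filesystem mounted at `path`.

    `ok` is True when the free space is at least `min_gib` GiB.
    Both defaults are illustrative assumptions.
    """
    usage = shutil.disk_usage(path)
    total_gib = usage.total / 2**30
    free_gib = usage.free / 2**30
    return total_gib, free_gib, free_gib >= min_gib


if __name__ == "__main__":
    # Only report when the path exists (it may not on non-Linux hosts).
    if os.path.isdir("/dev/shm"):
        total, free, ok = shm_report()
        print(f"/dev/shm total={total:.2f} GiB free={free:.2f} GiB ok={ok}")
```

If the reported size is too small for your data-loading workers, recreating the container with Enable Shared Memory turned on is the likely remedy.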

5.4.2 Container List

Containers appear here once created. Click a container to expand its service panel (Jupyter, SSH, Monitor).

my-dev-env Running
ssh root@10.20.69.182 -p 30122
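
If you connect to the same container often, the SSH command shown in the panel can be turned into an entry for your local `~/.ssh/config`. The sketch below parses a command of the form shown above and renders such an entry. The alias `my-dev-env` is simply the container name reused for illustration, and the parser handles only the `user@host -p port` shape.

```python
import shlex


def parse_ssh_command(command):
    """Split an `ssh user@host -p port` string into user, host, and port."""
    tokens = shlex.split(command)
    if not tokens or tokens[0] != "ssh":
        raise ValueError("expected a command starting with 'ssh'")
    user, host, port = None, None, 22  # 22 is the SSH default port
    i = 1
    while i < len(tokens):
        tok = tokens[i]
        if tok == "-p":
            if i + 1 >= len(tokens):
                raise ValueError("-p requires a port number")
            port = int(tokens[i + 1])
            i += 2
            continue
        if not tok.startswith("-") and host is None:
            # "root@10.0.0.1" -> ("root", "@", "10.0.0.1"); no "@" -> user ""
            user_part, _, host = tok.rpartition("@")
            user = user_part or None
        i += 1
    if host is None:
        raise ValueError("no host found in command")
    return {"user": user, "host": host, "port": port}


def ssh_config_entry(alias, endpoint):
    """Render an OpenSSH client config stanza for the parsed endpoint."""
    lines = [
        f"Host {alias}",
        f"    HostName {endpoint['host']}",
        f"    Port {endpoint['port']}",
    ]
    if endpoint["user"]:
        lines.append(f"    User {endpoint['user']}")
    return "\n".join(lines)


if __name__ == "__main__":
    endpoint = parse_ssh_command("ssh root@10.20.69.182 -p 30122")
    print(ssh_config_entry("my-dev-env", endpoint))
```

With the printed stanza appended to `~/.ssh/config`, `ssh my-dev-env` would reach the same endpoint.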

5.4.3 Delete Container

Select the container's checkbox and click the trash icon. Note: after deletion, any data that was not saved to a mounted volume is lost.

5.4.4 Create Custom Image

Package the current state of a container's environment into a new image for later reuse.

5.5 Job Management

5.5.1 Job List

Jobs are suited to one-off training scripts. When a job finishes, its resources are released automatically, but its logs are retained.

ID    Name       Type        Status
#101  train-gpt  128x B200   Success
#102  eval-bert  1x RTX3090  Running

5.5.2 Job Schedule

Set up recurring jobs (e.g., retraining a model every week). Trigger times are specified in Crontab format.
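
For readers unfamiliar with Crontab syntax, the sketch below expands a standard five-field expression (minute, hour, day of month, month, day of week) into the values each field matches. It assumes the common five-field form with `*`, lists, ranges, and steps; whether the platform accepts extensions such as names or a seconds field is not stated here, so check before relying on them.

```python
# (name, lowest, highest) for each of the five standard crontab fields.
# Day of week accepts both 0 and 7 for Sunday.
FIELDS = [
    ("minute", 0, 59),
    ("hour", 0, 23),
    ("day_of_month", 1, 31),
    ("month", 1, 12),
    ("day_of_week", 0, 7),
]


def expand_field(expr, lo, hi):
    """Expand one crontab field into a sorted list of matching integers."""
    values = set()
    for part in expr.split(","):
        base, _, step = part.partition("/")
        step_n = int(step) if step else 1
        if step_n < 1:
            raise ValueError(f"invalid step in {part!r}")
        if base == "*":
            start, end = lo, hi
        elif "-" in base:
            a, b = base.split("-", 1)
            start, end = int(a), int(b)
        else:
            if step:
                raise ValueError(f"step needs '*' or a range: {part!r}")
            start = end = int(base)
        if not (lo <= start <= end <= hi):
            raise ValueError(f"{part!r} is outside {lo}-{hi}")
        values.update(range(start, end + 1, step_n))
    return sorted(values)


def parse_cron(expression):
    """Parse a five-field crontab expression into {field_name: [values]}."""
    parts = expression.split()
    if len(parts) != len(FIELDS):
        raise ValueError("expected exactly 5 fields")
    return {
        name: expand_field(part, lo, hi)
        for part, (name, lo, hi) in zip(parts, FIELDS)
    }


if __name__ == "__main__":
    # Weekly retraining: every Sunday at 02:00.
    print(parse_cron("0 2 * * 0"))
```

Validating an expression this way before saving a schedule catches out-of-range values such as minute 61.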

5.7 Storage Management

5.7.1 Create Storage

Two storage types are supported: NFS (general file storage) and MinIO (object storage). Data stored on these devices is kept permanently.

5.7.2 Transfer Container

The system starts a lightweight container with your storage mounted and provides SFTP connection details so you can upload large volumes of data.

5.8 Image Management

Manage public images (provided by administrators) and your own private custom images.

5.11 Rapid Container Service (RCS)

RCS is an advanced feature built on native Kubernetes concepts. It is suited to deploying long-running applications such as inference APIs.

Includes nine component types, among them Deployments, Services, Ingress, ConfigMaps, and Secrets.
RCS Dashboard
  • Deployments
  • Services
  • Ingress
  • Pods
Name       Replicas  Status
nginx-web  3/3       Healthy
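
Because RCS follows Kubernetes concepts, a row such as nginx-web with 3 replicas corresponds to a Deployment object, usually paired with a Service. The sketch below builds both as standard Kubernetes manifests and prints them as JSON. The image `nginx:stable` and port 80 are assumptions for illustration, and whether RCS accepts raw manifests or only its own forms is not stated in this guide.

```python
import json


def deployment_manifest(name, image, replicas, port):
    """Build a Kubernetes apps/v1 Deployment as a plain dict."""
    labels = {"app": name}
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name, "labels": labels},
        "spec": {
            "replicas": replicas,
            # The selector must match the pod template labels.
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    "containers": [
                        {
                            "name": name,
                            "image": image,
                            "ports": [{"containerPort": port}],
                        }
                    ]
                },
            },
        },
    }


def service_manifest(name, port, target_port):
    """Build a v1 Service that routes to pods labelled app=<name>."""
    return {
        "apiVersion": "v1",
        "kind": "Service",
        "metadata": {"name": name},
        "spec": {
            "selector": {"app": name},
            "ports": [{"port": port, "targetPort": target_port}],
        },
    }


if __name__ == "__main__":
    # Image and port are hypothetical values chosen for the example.
    deploy = deployment_manifest("nginx-web", "nginx:stable", 3, 80)
    svc = service_manifest("nginx-web", 80, 80)
    print(json.dumps(deploy, indent=2))
    print(json.dumps(svc, indent=2))
```

A Replicas value of 3/3 then reads as three ready pods out of the three requested in `spec.replicas`.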

6. Distributed Training Cluster

For very large models (e.g., LLaMA-3 70B), the platform supports multi-node, multi-GPU distributed training and integrates the Horovod and DeepSpeed frameworks.

6.1 Create Cluster

Framework
Worker Nodes
The system automatically configures passwordless SSH login and the hostfile, and mounts your Home directory so that code stays in sync across nodes.
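
To make the generated hostfile less opaque, the sketch below renders one in the `hostname slots=N` form that DeepSpeed reads, plus the equivalent `host:slots` list used by Horovod's `-H` option. The host names `worker-0` and `worker-1` and the count of 8 GPUs per node are hypothetical; the names, slot counts, and file location the platform actually writes may differ.

```python
def render_hostfile(hosts, slots_per_host):
    """Render hostfile text with one `hostname slots=N` line per node."""
    if not hosts:
        raise ValueError("at least one host is required")
    if slots_per_host < 1:
        raise ValueError("slots_per_host must be >= 1")
    return "".join(f"{host} slots={slots_per_host}\n" for host in hosts)


def horovod_hosts_arg(hosts, slots_per_host):
    """Render the comma-separated `host:slots` list for Horovod's -H flag."""
    if not hosts:
        raise ValueError("at least one host is required")
    if slots_per_host < 1:
        raise ValueError("slots_per_host must be >= 1")
    return ",".join(f"{host}:{slots_per_host}" for host in hosts)


def total_slots(hosts, slots_per_host):
    """Total number of worker processes (one per GPU slot) in the cluster."""
    return len(hosts) * slots_per_host


if __name__ == "__main__":
    # Hypothetical two-node cluster with 8 GPUs per node.
    nodes = ["worker-0", "worker-1"]
    print(render_hostfile(nodes, 8), end="")
    print(horovod_hosts_arg(nodes, 8))
    print(total_slots(nodes, 8))
```

The total slot count is the number of processes a launcher would start across the cluster, which is useful when sizing the global batch.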