1. Quick Start
Clone the latest release from GitHub:
git clone https://github.com/coomia-ai/datagpt.git
The repository includes:
- docker-compose.yml – preconfigured container setup
- .env.example – environment variables template
- Docs & setup instructions
Before running, make sure you have:
- A Qwen or ChatGPT account for LLM usage.
- PostgreSQL, Redis, and Trino configured.
- A license file license.json ready for import.
2. Configuration
Copy .env.example to .env and adjust parameters according to your setup:
# PostgreSQL
DB_DRIVER=postgresql
DB_HOST=postgres
DB_USER=postgres
DB_PASS=postgresql123
DB_PORT=5432
DB_NAME=datagpt
# Redis
REDIS_HOST=redis
REDIS_PORT=6379
REDIS_DB=0
# Trino
TRINO_USER=admin
TRINO_HOST=trino
TRINO_PORT=8080
TRINO_CATALOG=postgresql
TRINO_SCHEMA=public
# Vector DB (default: pgvector, optional: qdrant)
VECTOR_DB=pgvector
VECTOR_COLLECTION=schema_info
# Qdrant example:
# QDRANT_HOST=qdrant
# QDRANT_PORT=6333
# Embeddings (default: Ollama)
EMBEDDING_MODEL_TYPE=ollama
EMBEDDING_MODEL_NAME=nomic-embed-text
EMBEDDING_MODEL_DIMENSIONS=768
OLLAMA_HOST=http://ollama-service:11434
# OpenAI alternative:
# EMBEDDING_MODEL_TYPE=openai
# EMBEDDING_MODEL_NAME=text-embedding-3-small
# OPENAI_API_KEY=
# LLM selection:
# MODEL_ID=gpt-4o-mini / gpt-4o (OpenAI)
# MODEL_ID=qwen-flash-2025-07-28 (Qwen)
# MODEL_ID=moonshot-v1-32k (Kimi)
# MODEL_ID=deepseek-chat (Deepseek)
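As a sanity check after editing .env, you can parse the DB_* variables and assemble the PostgreSQL connection URL yourself. The sketch below is illustrative only: the parse_env and build_dsn helpers are hypothetical and not part of DataGPT.

```python
# Minimal .env parser and PostgreSQL DSN builder (illustrative sketch).
# parse_env/build_dsn are hypothetical helpers, not DataGPT APIs.

def parse_env(text: str) -> dict:
    """Parse KEY=VALUE lines, skipping blanks and # comments."""
    env = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        env[key.strip()] = value.strip()
    return env

def build_dsn(env: dict) -> str:
    """Assemble a postgresql:// URL from the DB_* variables."""
    return (
        f"{env['DB_DRIVER']}://{env['DB_USER']}:{env['DB_PASS']}"
        f"@{env['DB_HOST']}:{env['DB_PORT']}/{env['DB_NAME']}"
    )

sample = """\
# PostgreSQL
DB_DRIVER=postgresql
DB_HOST=postgres
DB_USER=postgres
DB_PASS=postgresql123
DB_PORT=5432
DB_NAME=datagpt
"""
print(build_dsn(parse_env(sample)))
# → postgresql://postgres:postgresql123@postgres:5432/datagpt
```

If the printed URL does not match what your database client expects, the corresponding .env values are likely the culprit.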
LLM & Embedding Recommendations
| Model | Use Case | Performance | Notes |
|---|---|---|---|
| OpenAI (gpt-4o / gpt-4o-mini) | General-purpose | Fast | Recommended for users with ChatGPT account; combine with text-embedding-3-small |
| Qwen / Kimi | Large-scale schema & analytics | Fast | Recommended for Chinese users |
| Deepseek | Advanced analytics | Slower | Good for deep querying but slower than others |
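The table above boils down to a small per-provider lookup of .env overrides. The sketch below is an assumption, not DataGPT's configuration API: the PRESETS dict and select_preset helper are hypothetical, and the embedding pairings for Qwen/Kimi/Deepseek simply reuse the Ollama default from the config above.

```python
# Illustrative per-provider .env presets based on the recommendations table.
# PRESETS and select_preset are hypothetical, not part of DataGPT.

PRESETS = {
    "openai": {
        "MODEL_ID": "gpt-4o-mini",  # or gpt-4o
        "EMBEDDING_MODEL_TYPE": "openai",
        "EMBEDDING_MODEL_NAME": "text-embedding-3-small",
    },
    "qwen": {
        "MODEL_ID": "qwen-flash-2025-07-28",
        "EMBEDDING_MODEL_TYPE": "ollama",   # assumed: Ollama default
        "EMBEDDING_MODEL_NAME": "nomic-embed-text",
    },
    "kimi": {
        "MODEL_ID": "moonshot-v1-32k",
        "EMBEDDING_MODEL_TYPE": "ollama",   # assumed: Ollama default
        "EMBEDDING_MODEL_NAME": "nomic-embed-text",
    },
    "deepseek": {
        "MODEL_ID": "deepseek-chat",
        "EMBEDDING_MODEL_TYPE": "ollama",   # assumed: Ollama default
        "EMBEDDING_MODEL_NAME": "nomic-embed-text",
    },
}

def select_preset(provider: str) -> dict:
    """Return the .env overrides for a provider; raise on unknown names."""
    try:
        return PRESETS[provider]
    except KeyError:
        raise ValueError(f"unknown provider: {provider}") from None

print(select_preset("openai")["MODEL_ID"])  # → gpt-4o-mini
```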
3. Running DataGPT
Use Docker Compose to bring the system up:
docker-compose up -d
Note: before bringing the stack up, replace the license.json file in the repository with your own license information.
Access the system:
- Web UI → http://localhost:3000
- API → http://localhost:8000
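The containers can take a moment to come up, so it is worth waiting until the ports accept connections before opening the UI. A minimal sketch of such a check, assuming the default ports above (the wait_for_port helper is an assumption, not shipped with DataGPT):

```python
# Poll a TCP port until it accepts connections or a timeout elapses.
# wait_for_port is a hypothetical helper, not part of DataGPT.
import socket
import time

def wait_for_port(host: str, port: int, timeout: float = 30.0) -> bool:
    """Return True once host:port accepts a TCP connection, False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=1.0):
                return True
        except OSError:
            time.sleep(0.5)
    return False

if __name__ == "__main__":
    # Short timeout for illustration; increase it for slow container startups.
    for name, port in [("Web UI", 3000), ("API", 8000)]:
        ok = wait_for_port("localhost", port, timeout=2.0)
        print(f"{name} on port {port}: {'up' if ok else 'not reachable yet'}")
```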
4. Scaling & Resources
Recommended system specs based on dataset size:
| Dataset | CPU | RAM | Usage |
|---|---|---|---|
| <1M rows | 2 cores | 8 GB | Development |
| 1–100M rows | 4–8 cores | 16–32 GB | Production |
| >100M rows | 8+ cores | 64+ GB | Distributed mode |
5. Tips & Best Practices
- Monitor API usage and cost when using large LLMs.
- For large embedding workloads, consider switching from the default pgvector to Qdrant.
- Always back up Redis and PostgreSQL data volumes.
- Use environment variables for API keys; avoid hardcoding secrets.
- Import your license (license.json) before running queries.