1. Quick Start
Clone the latest release from GitHub:
git clone https://github.com/coomia-ai/datagpt.git
The repository includes:
- docker-compose.yml – preconfigured container setup
- .env.example – environment variables template
- Docs & setup instructions
Before running, make sure you have:
- A Qwen or ChatGPT account for LLM usage.
- PostgreSQL, Redis, and Trino configured.
- A license file license.json ready for import.
2. Configuration
Copy .env.example to .env and adjust parameters according to your setup:
# PostgreSQL
DB_DRIVER=postgresql
DB_HOST=postgres
DB_USER=postgres
DB_PASS=postgresql123
DB_PORT=5432
DB_NAME=datagpt
# Redis
REDIS_HOST=redis
REDIS_PORT=6379
REDIS_DB=0
# Trino
TRINO_USER=admin
TRINO_HOST=trino
TRINO_PORT=8080
TRINO_CATALOG=postgresql
TRINO_SCHEMA=public
# Vector DB (default: pgvector, optional: qdrant)
VECTOR_DB=pgvector
VECTOR_COLLECTION=schema_info
# Qdrant example:
# QDRANT_HOST=qdrant
# QDRANT_PORT=6333
# Embeddings (default: Ollama)
EMBEDDING_MODEL_TYPE=ollama
EMBEDDING_MODEL_NAME=nomic-embed-text
EMBEDDING_MODEL_DIMENSIONS=768
OLLAMA_HOST=http://ollama-service:11434
# OpenAI alternative:
# EMBEDDING_MODEL_TYPE=openai
# EMBEDDING_MODEL_NAME=text-embedding-3-small
# OPENAI_API_KEY=
# LLM selection:
# MODEL_ID=gpt-4o-mini / gpt-4o (OpenAI)
# MODEL_ID=qwen-flash-2025-07-28 (Qwen)
# MODEL_ID=moonshot-v1-32k (Kimi)
# MODEL_ID=deepseek-chat (Deepseek)
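As a sanity check after editing .env, you can parse the DB_* variables and assemble the PostgreSQL connection URL yourself. The sketch below is illustrative only: the parse_env and build_dsn helpers are hypothetical and not part of DataGPT.

```python
# Minimal .env parser and PostgreSQL DSN builder (illustrative sketch).
# parse_env/build_dsn are hypothetical helpers, not DataGPT APIs.

def parse_env(text: str) -> dict:
    """Parse KEY=VALUE lines, skipping blanks and # comments."""
    env = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        env[key.strip()] = value.strip()
    return env

def build_dsn(env: dict) -> str:
    """Assemble a postgresql:// URL from the DB_* variables."""
    return (
        f"{env['DB_DRIVER']}://{env['DB_USER']}:{env['DB_PASS']}"
        f"@{env['DB_HOST']}:{env['DB_PORT']}/{env['DB_NAME']}"
    )

sample = """\
# PostgreSQL
DB_DRIVER=postgresql
DB_HOST=postgres
DB_USER=postgres
DB_PASS=postgresql123
DB_PORT=5432
DB_NAME=datagpt
"""
print(build_dsn(parse_env(sample)))
# → postgresql://postgres:postgresql123@postgres:5432/datagpt
```

If the printed URL does not match what your database client expects, the corresponding .env values are likely the culprit.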
LLM & Embedding Recommendations
| Model | Use Case | Performance | Notes |
|---|---|---|---|
| OpenAI (gpt-4o / gpt-4o-mini) | General-purpose | Fast | Recommended for users with ChatGPT account; combine with text-embedding-3-small |
| Qwen / Kimi | Large-scale schema & analytics | Fast | Recommended for Chinese users |
| Deepseek | Advanced analytics | Slower | Good for deep querying but slower than others |
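The table above boils down to a small per-provider lookup of .env overrides. The sketch below is an assumption, not DataGPT's configuration API: the PRESETS dict and select_preset helper are hypothetical, and the embedding pairings for Qwen/Kimi/Deepseek simply reuse the Ollama default from the config above.

```python
# Illustrative per-provider .env presets based on the recommendations table.
# PRESETS and select_preset are hypothetical, not part of DataGPT.

PRESETS = {
    "openai": {
        "MODEL_ID": "gpt-4o-mini",  # or gpt-4o
        "EMBEDDING_MODEL_TYPE": "openai",
        "EMBEDDING_MODEL_NAME": "text-embedding-3-small",
    },
    "qwen": {
        "MODEL_ID": "qwen-flash-2025-07-28",
        "EMBEDDING_MODEL_TYPE": "ollama",   # assumed: Ollama default
        "EMBEDDING_MODEL_NAME": "nomic-embed-text",
    },
    "kimi": {
        "MODEL_ID": "moonshot-v1-32k",
        "EMBEDDING_MODEL_TYPE": "ollama",   # assumed: Ollama default
        "EMBEDDING_MODEL_NAME": "nomic-embed-text",
    },
    "deepseek": {
        "MODEL_ID": "deepseek-chat",
        "EMBEDDING_MODEL_TYPE": "ollama",   # assumed: Ollama default
        "EMBEDDING_MODEL_NAME": "nomic-embed-text",
    },
}

def select_preset(provider: str) -> dict:
    """Return the .env overrides for a provider; raise on unknown names."""
    try:
        return PRESETS[provider]
    except KeyError:
        raise ValueError(f"unknown provider: {provider}") from None

print(select_preset("openai")["MODEL_ID"])  # → gpt-4o-mini
```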
3. Running DataGPT
Use Docker Compose to bring the system up:
docker-compose up -d
Note: before bringing the stack up, replace the license.json file in the repository with your own license information.
Access the system:
- Web UI → http://localhost:3000
- API → http://localhost:8000
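The containers can take a moment to come up, so it is worth waiting until the ports accept connections before opening the UI. A minimal sketch of such a check, assuming the default ports above (the wait_for_port helper is an assumption, not shipped with DataGPT):

```python
# Poll a TCP port until it accepts connections or a timeout elapses.
# wait_for_port is a hypothetical helper, not part of DataGPT.
import socket
import time

def wait_for_port(host: str, port: int, timeout: float = 30.0) -> bool:
    """Return True once host:port accepts a TCP connection, False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=1.0):
                return True
        except OSError:
            time.sleep(0.5)
    return False

if __name__ == "__main__":
    # Short timeout for illustration; increase it for slow container startups.
    for name, port in [("Web UI", 3000), ("API", 8000)]:
        ok = wait_for_port("localhost", port, timeout=2.0)
        print(f"{name} on port {port}: {'up' if ok else 'not reachable yet'}")
```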
4. Scaling & Resources
Recommended system specs based on dataset size:
| Dataset | CPU | RAM | Usage |
|---|---|---|---|
| <1M rows | 2 cores | 8 GB | Development |
| 1–100M rows | 4–8 cores | 16–32 GB | Production |
| >100M rows | 8+ cores | 64+ GB | Distributed mode |
5. Tips & Best Practices
- Monitor API usage and cost when using large LLMs.
- For large embedding workloads, consider switching from the default pgvector to Qdrant.
- Always back up Redis and PostgreSQL data volumes.
- Use environment variables for API keys; avoid hardcoding secrets.
- Import your license (license.json) before running queries.