## 1. Quick Start
Clone the latest release from GitHub:
```bash
git clone https://github.com/coomia-ai/datagpt.git
```
The repository includes:
- `docker-compose.yml` – preconfigured container setup
- `.env.example` – environment variables template
- Docs & setup instructions
Before running, make sure you have:
- A Qwen or ChatGPT account for LLM usage.
- PostgreSQL, a data source, and an LLM configured.
- A license file `license.json` ready for import.
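
The steps below rely on Docker Compose, so it is worth confirming the tooling is present first:

```bash
# Verify Docker and Compose are installed
docker --version
docker compose version   # older installs: docker-compose --version
```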
## 2. Configuration
Copy `.env.example` to `.env` and adjust the parameters to match your setup.
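From the repository root:

```bash
# Create a local configuration from the committed template
cp .env.example .env
```

The template covers the following settings: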
```bash
###########################################################
#                     DATABASE CONFIG                     #
###########################################################
# Primary metadata database for DataGPT
DB_DRIVER=postgresql
DB_HOST=postgres
DB_USER=postgres
DB_PASS=postgresql123
DB_PORT=5432
DB_NAME=datagpt

###########################################################
#                      CACHE CONFIG                       #
###########################################################
# Cache backend: "memory" (default) or "redis"
CACHE_TYPE=memory
# To use Redis instead, uncomment these:
# CACHE_TYPE=redis
# REDIS_HOST=redis
# REDIS_PORT=6379
# REDIS_DB=0
# REDIS_PASSWORD=123456  # Optional

###########################################################
#                   DATA SOURCE CONFIG                    #
###########################################################
# Database to search for data and analysis
# Supported: postgresql / mysql / trino / clickhouse / doris
SEARCH_DB_TYPE=postgresql
SEARCH_DB_HOST=postgres
SEARCH_DB_PORT=5432
SEARCH_DB_USER=postgres
SEARCH_DB_PASSWORD=postgresql123
SEARCH_DB_SCHEMA=public
SEARCH_DB_NAME=datagpt

###########################################################
#              DATA QUALITY EXECUTION ENGINE              #
###########################################################
TRINO_HOST=trino
TRINO_PORT=8080
TRINO_USER=admin
TRINO_CATALOG=postgresql  # Catalog used for data quality checks
TRINO_SCHEMA=public       # Schema used for data quality checks

###########################################################
#                 VECTOR DATABASE CONFIG                  #
###########################################################
# Default vector store is pgvector (embedded in PostgreSQL)
VECTOR_DB=pgvector
VECTOR_COLLECTION=schema_info
# To use Qdrant instead, uncomment:
# VECTOR_DB=qdrant
# QDRANT_HOST=qdrant
# QDRANT_PORT=6333
# Minimum interval (in seconds) between two vector reindex
# operations, used to detect schema changes; default 86400 (24 hours)
# VECTOR_REINDEX_INTERVAL_SECONDS=86400

###########################################################
#                 EMBEDDING MODEL CONFIG                  #
###########################################################
# Embedding model provider:
#   - ollama → local inference (fastest, no token cost)
#   - openai → best quality, recommended for production
EMBEDDING_MODEL_TYPE=ollama
EMBEDDING_MODEL_NAME=nomic-embed-text
EMBEDDING_MODEL_DIMENSIONS=768
OLLAMA_HOST=http://ollama-service:11434
# OpenAI embeddings example:
# EMBEDDING_MODEL_TYPE=openai
# EMBEDDING_MODEL_NAME=text-embedding-3-small
# OPENAI_API_KEY=your_openai_api_key

###########################################################
#                    LARGE LLM CONFIG                     #
###########################################################
# Supported LLM providers:
#   openai    → OpenAIChat
#   claude    → Claude / Anthropic
#   anthropic → Claude
#   deepseek  → DeepSeek
#   qwen      → DashScope (Alibaba)
#   kimi      → Kimi
#   doubao    → Doubao (ByteDance)
#   google    → Gemini
#   meta      → LlamaOpenAI
#   xai       → xAI (Grok)
# Recommended selections:
#   - China mainland: qwen / kimi / doubao (low latency)
#   - Global users: openai / claude / deepseek (high quality)
#   - Local inference: meta (Llama via OpenAI-compatible server)
MODEL_PROVIDER=qwen
MODEL_ID=qwen-flash-2025-07-28
MODEL_API_KEY=your_api_key_here
# MODEL_BASE_URL=https://dashscope.aliyuncs.com/compatible-mode/v1

###########################################################
#                    API SERVER CONFIG                    #
###########################################################
API_PORT=8000
AGENT_URL=http://api:8000/
```
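
Before starting the full stack, you can sanity-check the metadata database with the same credentials; a sketch, assuming the Postgres container publishes port 5432 to the host and `psql` is installed (inside the compose network the host would be `postgres` instead):

```bash
# Connectivity check against the metadata database, reusing the .env values
PGPASSWORD=postgresql123 psql -h localhost -p 5432 -U postgres -d datagpt -c 'SELECT 1;'
```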
### LLM & Embedding Recommendations
| Model | Use Case | Performance | Notes |
|---|---|---|---|
| OpenAI (gpt-4o / gpt-4o-mini) | General-purpose | Fast | Recommended for users with ChatGPT account; combine with text-embedding-3-small |
| Qwen / Kimi | Large-scale schema & analytics | Fast | Recommended for Chinese users |
| DeepSeek | Advanced analytics | Slower | Good for deep queries but slower than the others |
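
For example, to follow the first row, the corresponding `.env` lines might look like this (a sketch; the model IDs come from the table above, so verify current availability with your provider):

```bash
# Illustrative OpenAI selection (first table row); replace the key placeholders
MODEL_PROVIDER=openai
MODEL_ID=gpt-4o-mini
MODEL_API_KEY=your_openai_api_key
EMBEDDING_MODEL_TYPE=openai
EMBEDDING_MODEL_NAME=text-embedding-3-small
OPENAI_API_KEY=your_openai_api_key
```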
## 3. Running DataGPT
Use Docker Compose to bring the system up:
```bash
docker-compose up -d
```
Before starting, update the `license.json` file in the repository with your own license information.
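
Once the containers are up, a quick status check (the service name `api` is an assumption based on `AGENT_URL` in `.env`; confirm it against your `docker-compose.yml`):

```bash
docker-compose ps            # every service should report "Up"
docker-compose logs -f api   # follow the API logs while it initializes
```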
Access the system:
- Web UI → http://localhost:3000
- API → http://localhost:8000
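
To confirm the API is reachable from the host, probe the published port (the exact endpoints depend on the DataGPT build, but any HTTP response, even a 404, proves the server is listening):

```bash
curl -i http://localhost:8000/
```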
## 4. Scaling & Resources
Recommended system specs based on dataset size:
| Dataset size | CPU | RAM | Typical use |
|---|---|---|---|
| <1M rows | 2 cores | 8 GB | Development |
| 1–100M rows | 4–8 cores | 16–32 GB | Production |
| >100M rows | 8+ cores | 64+ GB | Distributed mode |
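
If you deploy with Docker, one way to apply these caps to an already-running container is `docker update`; a sketch, where the container name `datagpt-api-1` is hypothetical (list yours with `docker ps`):

```bash
# Cap a running container to the "Production" row: 8 CPUs, 32 GB RAM
docker update --cpus 8 --memory 32g --memory-swap 32g datagpt-api-1
```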
## 5. Tips & Best Practices
- Monitor API usage and cost when using large LLMs.
- For large embedding collections, consider Qdrant instead of the default pgvector.
- Always back up Redis and PostgreSQL data volumes (see the sketch after this list).
- Use environment variables for API keys; avoid hardcoding secrets.
- Import your license using `license.json` before running queries.
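
For the volume-backup tip above, a minimal sketch using a throwaway container; the volume name `datagpt_postgres_data` is an assumption (check `docker volume ls` for yours):

```bash
# Archive the PostgreSQL data volume to a tarball in the current directory
docker run --rm \
  -v datagpt_postgres_data:/data \
  -v "$PWD":/backup \
  alpine tar czf /backup/postgres-data.tgz -C /data .
```

The same pattern works for the Redis volume; for a consistent PostgreSQL backup, stop the container first or use `pg_dump` instead.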