a94c56a78fe39e8e0d928b498d9b866939435300
registry/deploy/pkg/k8s
GitHub
Ensure no downtime during rollouts (#854)
cc1e665b
Created on December 17, 2025
File
Last commit
Last updated
backup.go
Add database backup functionality with GCS integration (#297)

Implements automated database backup functionality with Google Cloud Storage integration, including retention policies and MinIO support for local development.

## Motivation and Context

This change adds critical database backup capabilities to ensure data durability and disaster recovery. The solution provides automated backups with configurable retention policies and supports both production (GCS) and development (MinIO) environments.

## How Has This Been Tested?

- Tested backup creation and restoration with MinIO in local development
- Verified GCS bucket lifecycle policies for automatic deletion after 60 days
- Tested backup retention and cleanup logic

## Breaking Changes

No breaking changes. This is an additive feature that doesn't affect existing functionality.

## Types of changes

- [ ] Bug fix (non-breaking change which fixes an issue)
- [x] New feature (non-breaking change which adds functionality)
- [ ] Breaking change (fix or feature that would cause existing functionality to change)
- [x] Documentation update

## Checklist

- [x] I have read the [MCP Documentation](https://modelcontextprotocol.io)
- [x] My code follows the repository's style guidelines
- [x] New and existing tests pass locally
- [x] I have added appropriate error handling
- [x] I have added or updated documentation as needed

## Additional context

- Implements 60-day retention period for backups as a safety net
- Uses GCS lifecycle rules for automatic cleanup
- Includes MinIO setup instructions for local development testing
- Port-forwarding commands updated for consistency across documentation
- Fixes #184

8 months ago
cert_manager.go
Refactor database implementation to use PostgreSQL instead of MongoDB (#289)

Fixes #228

## Motivation and Context

See #19

## How Has This Been Tested?

- Run locally
- Run in docker compose
- Run infra in minikube
- Unit and integration tests all passing :)

## Breaking Changes

Yes: users will not be able to access any data they used to have in custom mongodb. We are accepting this breaking change, given the registry is in experimental development status.

## Types of changes

- [ ] Bug fix (non-breaking change which fixes an issue)
- [ ] New feature (non-breaking change which adds functionality)
- [x] Breaking change (fix or feature that would cause existing functionality to change)
- [ ] Documentation update

## Checklist

- [x] I have read the [MCP Documentation](https://modelcontextprotocol.io)
- [x] My code follows the repository's style guidelines
- [x] New and existing tests pass locally
- [x] I have added appropriate error handling
- [x] I have added or updated documentation as needed

## Additional context

I have not implemented connection pooling support, both for simplicity and because cloudnative-pg already provides built-in connection pooling with pgBouncer, so if we want it later we can switch to that endpoint easily.

8 months ago
deploy.go
Add monitoring infrastructure: victoriametrics and grafana (#328)

This includes infra setup for metrics collection and visualisation.

## Motivation and Context

It is important to have a basic observability setup, so that common issues can be identified quickly.

## How Has This Been Tested?

- Tested on local
- Tested on one of the production setups

## Breaking Changes

No

## Types of changes

- [ ] Bug fix (non-breaking change which fixes an issue)
- [X] New feature (non-breaking change which adds functionality)
- [ ] Breaking change (fix or feature that would cause existing functionality to change)
- [ ] Documentation update

## Checklist

- [X] I have read the [MCP Documentation](https://modelcontextprotocol.io)
- [X] My code follows the repository's style guidelines
- [X] New and existing tests pass locally
- [X] I have added appropriate error handling
- [ ] I have added or updated documentation as needed

## Additional context

- This includes setup of the following components:
  - Victoriametrics single node cluster for storing metrics data, with persistent storage configuration for local and production env.
  - Vmagent with target discovery for mcp-registry pods, with a scrape interval of 30s.
  - Grafana for visualising metrics and setting alerts.
    - Datasource of victoriametrics is pre-configured as a config map
    - Persistent volume for basic configuration
    - Sqlite for local and Postgres for production setup, for storing dashboards, alerts, etc.

## Metrics in action

<img width="1422" height="666" alt="Screenshot 2025-08-31 at 4 35 28 AM" src="https://github.com/user-attachments/assets/e54ce846-653c-4fa3-aa7a-fe277f0810e6" />

---------

Co-authored-by: Adam Jones <adamj@anthropic.com>
Co-authored-by: adam jones <adamj+git@anthropic.com>

8 months ago
ingress.go
Set 429 status code responses in the config map (#849)

## Motivation and Context

It seems that per-ingress annotations were not working, so they had to be put in the ingress configmap. I tested this manually on staging, so it's confirmed to be working (I was getting 429s instead of 503s) and no traffic was getting into the registry pod.

## How Has This Been Tested?

## Breaking Changes

## Types of changes

- [ ] Bug fix (non-breaking change which fixes an issue)
- [ ] New feature (non-breaking change which adds functionality)
- [ ] Breaking change (fix or feature that would cause existing functionality to change)
- [ ] Documentation update

## Checklist

- [ ] I have read the [MCP Documentation](https://modelcontextprotocol.io)
- [ ] My code follows the repository's style guidelines
- [ ] New and existing tests pass locally
- [ ] I have added appropriate error handling
- [ ] I have added or updated documentation as needed

## Additional context

Signed-off-by: Radoslav Dimitrov <radoslav@stacklok.com>

5 months ago
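The ingress.go commit message above says the 429 status code was moved from per-ingress annotations into the ingress controller's ConfigMap. A minimal Go sketch of what that ConfigMap data could look like follows. It assumes the controller is ingress-nginx, which the commit message does not state; under that assumption `limit-req-status-code` and `limit-conn-status-code` are the documented ConfigMap keys, and both default to 503. The function name `rateLimitConfigMapData` is hypothetical and is not taken from ingress.go.

```go
package main

import "fmt"

// rateLimitConfigMapData returns the key/value pairs that would go into the
// data section of the controller ConfigMap.
// Assumption: the controller is ingress-nginx, where rejected rate-limited
// requests return 503 unless these keys are overridden.
func rateLimitConfigMapData() map[string]string {
	return map[string]string{
		"limit-req-status-code":  "429", // status for requests rejected by the request rate limit
		"limit-conn-status-code": "429", // status for requests rejected by the connection limit
	}
}

func main() {
	data := rateLimitConfigMapData()
	fmt.Println("limit-req-status-code:", data["limit-req-status-code"])
	fmt.Println("limit-conn-status-code:", data["limit-conn-status-code"])
}
```

Setting these values controller-wide affects every ingress served by that controller, which is the trade-off of using the ConfigMap rather than a per-ingress setting.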
monitoring.go
Add failure modes telemetry (#646)

Telemetry to cover failure modes which are not covered by container logs, plus metrics for finding resource constraints.

## Motivation and Context

When there is any issue with the registry container we should be notified.

## How Has This Been Tested?

- Local setup

## Breaking Changes

- No

## Types of changes

- [ ] Bug fix (non-breaking change which fixes an issue)
- [X] New feature (non-breaking change which adds functionality)
- [ ] Breaking change (fix or feature that would cause existing functionality to change)
- [ ] Documentation update

## Checklist

- [ ] I have read the [MCP Documentation](https://modelcontextprotocol.io)
- [ ] My code follows the repository's style guidelines
- [ ] New and existing tests pass locally
- [ ] I have added appropriate error handling
- [ ] I have added or updated documentation as needed

## Additional context

- No additional exporter is used; this takes advantage of the opentelemetry collector.
- It covers metrics related to resource constraints, currently limited to the default namespace.
- Takes care of kubernetes events as logs, which are the source for figuring out any problem with the service. This covers scenarios where a pod is not able to start and the failure gets missed because there are no container logs for such cases. Limited to the default namespace.
- Takes care of daemonset deployment, i.e. deploying the otel collector as an agent, by using correct filtering.
- Cardinality contributing factors are only pod ids (but have to observe more); node ids will not increase cardinality, as scale up will lead to limited nodes.
- Shipping of metrics for resources happens every 60s. The list of metrics that will be emitted: https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/receiver/kubeletstatsreceiver/metadata.yaml
- Container errors <img width="1440" height="816" alt="Screenshot 2025-10-10 at 1 21 14 AM" src="https://github.com/user-attachments/assets/ba90a217-2a49-4522-aa44-a98c02adf95b" />
- Resource metrics <img width="1440" height="816" alt="Screenshot 2025-10-10 at 1 23 51 AM" src="https://github.com/user-attachments/assets/3467be96-db3c-4930-afa2-3cbf5f0ced8b" />

6 months ago
postgres.go
Increase GKE node size to e2-standard-2 (#513)

## Summary

Bumps the node machine type from e2-small to e2-standard-2 to provide more CPU and memory headroom for the cluster. Also disables PodDisruptionBudgets (PDBs) for single-instance PostgreSQL clusters to prevent node upgrade operations from becoming blocked.

## Changes

- Increase GKE node machine type from e2-small to e2-standard-2
- Set `enablePDB: false` for registry and Grafana PostgreSQL clusters

## Context

The machine type increase addresses resource constraints causing deployment failures.

The PDB change prevents a specific issue where GKE node upgrades get stuck when trying to drain nodes containing single-instance PostgreSQL pods. With `minAvailable: 1` PDBs, evicting the only database instance violates the availability requirement, blocking the drain operation indefinitely. Disabling PDBs for these development/staging single-instance databases allows node maintenance to proceed smoothly. For production deployments requiring high availability, the instances count should be increased to 3 with PDBs enabled.

---------

Co-authored-by: Claude <noreply@anthropic.com>

7 months ago
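The drain deadlock described in the postgres.go commit message above comes down to simple arithmetic: a `minAvailable` PDB permits an eviction only while the number of healthy pods stays at or above the minimum afterwards. The Go sketch below models that rule. The function names are hypothetical, this is not the code in postgres.go, and using `minAvailable: 1` for the three-instance case is an assumption made only for illustration.

```go
package main

import "fmt"

// disruptionsAllowed returns how many pods may be voluntarily evicted under a
// PDB with the given minAvailable, never going below zero.
func disruptionsAllowed(healthyPods, minAvailable int) int {
	if allowed := healthyPods - minAvailable; allowed > 0 {
		return allowed
	}
	return 0
}

// canDrain reports whether evicting a single pod is permitted. Without a PDB
// nothing blocks a voluntary eviction.
func canDrain(healthyPods, minAvailable int, pdbEnabled bool) bool {
	if !pdbEnabled {
		return true
	}
	return disruptionsAllowed(healthyPods, minAvailable) >= 1
}

func main() {
	fmt.Println(canDrain(1, 1, true))  // single instance with PDB: false, drain blocks
	fmt.Println(canDrain(1, 1, false)) // single instance, enablePDB false: true
	fmt.Println(canDrain(3, 1, true))  // three instances with PDB: true
}
```

This is why the commit disables PDBs only for single-instance clusters and recommends three instances where PDBs stay enabled.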
registry.go
Ensure no downtime during rollouts (#854)

## Motivation and Context

This PR ensures we don't have downtime when doing rollouts during deployments/promotions.

## How Has This Been Tested?

## Breaking Changes

## Types of changes

- [ ] Bug fix (non-breaking change which fixes an issue)
- [ ] New feature (non-breaking change which adds functionality)
- [ ] Breaking change (fix or feature that would cause existing functionality to change)
- [ ] Documentation update

## Checklist

- [ ] I have read the [MCP Documentation](https://modelcontextprotocol.io)
- [ ] My code follows the repository's style guidelines
- [ ] New and existing tests pass locally
- [ ] I have added appropriate error handling
- [ ] I have added or updated documentation as needed

## Additional context

Signed-off-by: Radoslav Dimitrov <radoslav@stacklok.com>

5 months ago