Code Style Rule
Enforce code style standards for ML projects to ensure maintainability and consistency.
Core Principles
Small File Principle (200-400 lines)
- Keep each file within 200-400 lines
- Split into multiple modules when exceeding 400 lines
- Organize related functionality under the same directory
Example structure:
src/model_module/
├── brain_decoder/
│   ├── __init__.py      # Factory & Registry (50 lines)
│   ├── base_model.py    # Base class (200 lines)
│   ├── transformer.py   # Transformer impl (300 lines)
│   └── cnn.py           # CNN impl (250 lines)
Immutability First
- Use dataclass for configuration (immutable)
- Avoid mutating input parameters inside functions
- Use @dataclass(frozen=True) to ensure config immutability
from dataclasses import dataclass
@dataclass(frozen=True)
class ModelConfig:
    hidden_dim: int
    num_layers: int
    dropout: float = 0.1
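As a minimal sketch of what the frozen config above buys you: assigning to a field raises an error, and dataclasses.replace builds a modified copy instead. The field values (256, 4, 512) and the variable names are illustrative assumptions, not project settings.

```python
from dataclasses import FrozenInstanceError, dataclass, replace


@dataclass(frozen=True)
class ModelConfig:
    hidden_dim: int
    num_layers: int
    dropout: float = 0.1


# Illustrative values only.
base = ModelConfig(hidden_dim=256, num_layers=4)

# Assigning to a field of a frozen dataclass raises FrozenInstanceError.
try:
    base.dropout = 0.5  # type: ignore[misc]
    mutated = True
except FrozenInstanceError:
    mutated = False

# replace() returns a new instance and leaves the original untouched.
wider = replace(base, hidden_dim=512)
```

Deriving variants with replace keeps every experiment's config traceable to an unmodified base object.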
Error Handling
- Use try/except for exception handling
- Catch specific exception types, avoid bare except
- Log error information for debugging
try:
    data = load_data(path)
except FileNotFoundError:
    # Log for debugging, then re-raise so the caller still sees the failure.
    logger.error(f"Data file not found: {path}")
    raise
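The same pattern can be written as a self-contained function. This is a sketch under assumptions: load_text is a hypothetical helper that stands in for load_data, and it reads a plain UTF-8 text file rather than any project-specific format.

```python
import logging
from pathlib import Path

logger = logging.getLogger(__name__)


def load_text(path: Path) -> str:
    """Hypothetical loader: read a UTF-8 text file, logging before re-raising."""
    try:
        return path.read_text(encoding="utf-8")
    except FileNotFoundError:
        # Catch only the failure we expect; anything else propagates unchanged.
        logger.error("Data file not found: %s", path)
        raise
```

Re-raising with a bare raise preserves the original traceback, so the log entry adds context without hiding where the failure occurred.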
Type Hints
- All functions must have type hints
- Use types from the typing module
- Use TypeVar for generic functions and classes
from typing import Dict, List, Optional, TypeVar
T = TypeVar('T', bound=Dataset)
def process_data(data: List[Dict], config: Config) -> Optional[DataFrame]:
    ...
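The snippet above depends on project types (Dataset, Config, DataFrame) that are not shown here. A runnable sketch of the same ideas using only the standard library might look like this; first_or_none and count_labels are hypothetical helpers introduced purely for illustration.

```python
from typing import Dict, List, Optional, Sequence, TypeVar

T = TypeVar("T")


def first_or_none(items: Sequence[T]) -> Optional[T]:
    """Return the first element, or None for an empty sequence."""
    return items[0] if items else None


def count_labels(records: List[Dict[str, str]]) -> Dict[str, int]:
    """Count how often each value of the "label" key appears."""
    counts: Dict[str, int] = {}
    for record in records:
        label = record["label"]
        counts[label] = counts.get(label, 0) + 1
    return counts
```

The TypeVar ties the return type to the element type of the argument, so a type checker such as mypy can infer Optional[int] for a list of ints without a separate overload.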
Python Specific Standards
Import Order
# 1. Standard library
import os
from pathlib import Path
# 2. Third-party libraries
import torch
import numpy as np
from hydra import compose, initialize
# 3. Local modules
from src.data_module import DataLoader
from src.model_module import Model
Naming Conventions
# Class names: PascalCase
class DataLoader:
    pass
# Functions/variables: snake_case
def load_config():
    batch_size = 32
# Constants: UPPER_SNAKE_CASE
MAX_EPOCHS = 100
DEFAULT_LR = 0.001
# Private: underscore prefix
def _internal_function():
    pass
Docstrings
def train_model(cfg: Config) -> Model:
    """Train the model.

    Args:
        cfg: Training configuration object.

    Returns:
        Trained model instance.

    Raises:
        ValueError: When configuration is invalid.
    """
    ...
ML Project Specific Standards
Factory & Registry Pattern
All modules must use factory and registry patterns:
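# dataset/__init__.py
from typing import Callable, Dict, Type
DATASET_FACTORY: Dict[str, Type[Dataset]] = {}
def register_dataset(name: str) -> Callable[[Type[Dataset]], Type[Dataset]]:
    def decorator(cls: Type[Dataset]) -> Type[Dataset]:
        DATASET_FACTORY[name] = cls
        return cls
    return decorator
def DatasetFactory(name: str) -> Type[Dataset]:
    # Unknown names fall back to SimpleDataset.
    return DATASET_FACTORY.get(name, SimpleDataset)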
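To see the registry end to end, here is a self-contained sketch. Dataset and SimpleDataset are stand-in classes defined only so the example runs, and EEGDataset with the key "eeg" is a hypothetical registration; a real project would use its own base class and names.

```python
from typing import Callable, Dict, Type


class Dataset:
    """Stand-in base class (assumption; not the project's actual Dataset)."""


class SimpleDataset(Dataset):
    """Fallback returned for unknown names, mirroring the snippet above."""


DATASET_FACTORY: Dict[str, Type[Dataset]] = {}


def register_dataset(name: str) -> Callable[[Type[Dataset]], Type[Dataset]]:
    """Return a class decorator that stores the class under `name`."""
    def decorator(cls: Type[Dataset]) -> Type[Dataset]:
        DATASET_FACTORY[name] = cls
        return cls
    return decorator


def DatasetFactory(name: str) -> Type[Dataset]:
    """Look up a registered class, falling back to SimpleDataset."""
    return DATASET_FACTORY.get(name, SimpleDataset)


@register_dataset("eeg")
class EEGDataset(Dataset):
    """Hypothetical dataset used only for illustration."""
```

Registration happens when the module defining the class is imported, so the package __init__.py has to import each dataset module for its name to be available. The silent fallback also means a misspelled name yields SimpleDataset rather than an error, which may or may not be the desired behavior.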
Config-Driven Models
Model __init__ should only accept a cfg parameter:
@register_model('MyModel')
class MyModel(nn.Module):
    def __init__(self, cfg: Config):
        super().__init__()
        # All hyperparameters from cfg
        self.hidden_dim = cfg.model.hidden_dim
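The shape of cfg is not spelled out above. One possible layout, sketched with frozen dataclasses and a plain class standing in for the nn.Module subclass so it runs without PyTorch, is shown below. ModelSection, the num_layers field, and the values 128 and 2 are assumptions for illustration; a Hydra-based project would more likely pass a DictConfig with the same attribute paths.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelSection:
    """Hypothetical `model` section of the config."""
    hidden_dim: int
    num_layers: int


@dataclass(frozen=True)
class Config:
    """Hypothetical top-level config holding one section per module."""
    model: ModelSection


class MyModel:
    """Plain-Python stand-in for an nn.Module subclass."""

    def __init__(self, cfg: Config) -> None:
        # Every hyperparameter is read from cfg; nothing is hardcoded here.
        self.hidden_dim = cfg.model.hidden_dim
        self.num_layers = cfg.model.num_layers


# Illustrative values only.
cfg = Config(model=ModelSection(hidden_dim=128, num_layers=2))
model = MyModel(cfg)
```

Because the constructor signature is identical for every model, a factory can instantiate any registered class with the same single argument.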
Directory Structure
run/
├── conf/                # Hydra configs
├── pipeline/            # Workflow scripts
└── outputs/             # Output directory
src/
├── data_module/         # Data module
│   ├── dataset/
│   ├── augmentation/
│   └── utils.py
├── model_module/        # Model module
├── trainer_module/      # Trainer module
└── utils/               # Shared utilities
Prohibited Patterns
❌ Prohibited:
- Files exceeding 800 lines
- Nesting deeper than 4 levels
- Mutable default arguments: def foo(a=[]):
- Global variables (use config instead)
- Bare except: except:
- Hardcoded hyperparameters (use cfg)
- Unused imports
- print() debug statements (use logger)
✅ Recommended:
- Split large files
- Use early returns to reduce nesting
- None as the default for mutable arguments: def foo(a=None):
- Config-driven parameters
- Specific exception catching
- Type hints
- Docstrings
- Logger for logging
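The reason mutable defaults are on the prohibited list is that the default object is created once, when the function is defined, and then shared by every call. The sketch below contrasts the two forms; append_bad and append_good are hypothetical names used only for this illustration.

```python
from typing import List, Optional


def append_bad(item: int, bucket: List[int] = []) -> List[int]:  # noqa: B006
    # The default list is created once and reused, so items accumulate across calls.
    bucket.append(item)
    return bucket


def append_good(item: int, bucket: Optional[List[int]] = None) -> List[int]:
    # A fresh list is created per call, and the caller's list is never mutated.
    if bucket is None:
        bucket = []
    return [*bucket, item]
```

Calling append_bad(1) and then append_bad(2) returns the same list object both times, now holding both items, whereas each append_good call starts from an empty list. Returning a new list in append_good also follows the rule against mutating input parameters.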
Verification Checklist
Before committing code, run:
# Type checking
mypy src/
# Code style
ruff check .
# Tests
pytest
Violations of these rules will be flagged by the code-reviewer agent.
Logging Standards
Logger Naming
import logging
# Use module-level logger with __name__
logger = logging.getLogger(__name__)
Log Levels
| Level | Usage |
|---|---|
| DEBUG | Detailed diagnostic info (tensor shapes, config values) |
| INFO | Training progress, epoch results, key milestones |
| WARNING | Recoverable issues (fallback behavior, deprecation) |
| ERROR | Failures that need attention but don't crash |
| CRITICAL | Unrecoverable errors |
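A short sketch of how the levels above might be used in a training loop follows. The logger name "style_demo", the run_epoch function, and the loss threshold of 10.0 are assumptions chosen for illustration; real modules would use logging.getLogger(__name__) and their own criteria.

```python
import logging

# Hypothetical fixed name so the example is reproducible; modules would use __name__.
logger = logging.getLogger("style_demo")
logger.setLevel(logging.INFO)  # DEBUG records are dropped at this threshold


def run_epoch(epoch: int, loss: float) -> None:
    """Emit one record per level that applies to this epoch."""
    logger.debug("epoch %d: raw loss value %r", epoch, loss)
    logger.info("epoch %d finished, loss=%.4f", epoch, loss)
    if loss > 10.0:  # arbitrary threshold, for illustration only
        logger.warning("epoch %d: loss %.4f is unusually high", epoch, loss)
```

Passing values as arguments instead of pre-formatting the message means the string is only built when the record is actually emitted, which keeps verbose DEBUG calls cheap when the level is set to INFO.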
Module __init__.py Standards
Every package __init__.py must define __all__ to declare its public API explicitly:
# src/data_module/__init__.py
from .dataset import DatasetFactory, register_dataset
from .augmentation import AugmentationFactory
__all__ = [
    "DatasetFactory",
    "register_dataset",
    "AugmentationFactory",
]