Let's cut to the chase. You've got the DeepSeek R1 model files, you know Azure is the cloud platform for this, and GitHub is where your team lives. The goal isn't just to make it run—it's to build a scalable, secure, and cost-effective inference pipeline that doesn't fall over when traffic spikes. I've built and torn down this stack more times than I care to admit, both for startups and larger teams. The tutorials out there get you to a "Hello World" endpoint. This guide is about getting you to a production system you can trust.

Why This Stack Matters for Real AI Work

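DeepSeek R1 isn't just another open-source model. Its architecture demands specific attention to memory bandwidth and parallel compute, which is where Azure's GPU-optimized VMs come into their own. Size the VM to the variant you deploy: a T4-based NCas_T4_v3 node (16 GB of GPU memory) only fits the smaller distilled R1 checkpoints, while the full 671B-parameter model needs multi-GPU nodes with far more memory. But here's the thing most blogs don't mention: raw compute power is useless without a smooth deployment and iteration cycle.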

That's the GitHub piece. It's not just a code repo. It's your control center for model versioning, infrastructure-as-code (using Terraform or Bicep), and—critically—continuous integration and deployment (CI/CD) via GitHub Actions. I've seen teams waste weeks because they treated model deployment as a one-off manual task. When you need to roll back to a previous version of DeepSeek R1 at 2 AM, you'll thank yourself for having a Git-tagged container image and a one-click rollback pipeline.

Azure ties it together with managed services that reduce operational overhead. You could manage your own Kubernetes cluster, but Azure Kubernetes Service (AKS), or Azure Container Instances (ACI) for simpler setups, handles much of the networking, scaling, and maintenance for you. The official Microsoft Azure Architecture Center has patterns that are directly applicable, but you need to adapt them for the specific resource hunger of a modern LLM.

The Non-Consensus View: Everyone rushes to put the model on the biggest GPU. The real bottleneck often becomes the network latency between the container serving the API and the model files, or the cold-start time of your serverless endpoint. Optimizing the container image size and using Azure Premium SSD for model storage can have a bigger impact on user-perceived latency than a 10% faster GPU.

Your Deployment Architecture Blueprint

You need a design that balances simplicity with scalability. For most projects, I recommend a two-tier approach.

Tier 1: The Core Inference Service

This is where DeepSeek R1 lives. You package it into a Docker container with a lightweight API server—FastAPI is my go-to. This container exposes a standard POST endpoint (e.g., /v1/completions). The key is to bake the model weights into the container image for fast loading, but store them in a separate volume if the image gets too large for your registry.

Where do you run this container?

  • Azure Container Instances (ACI): Perfect for prototypes, low-traffic demos, or batch jobs. These are serverless containers: you define one, it runs, and you pay per second with no cluster to manage. The downside is that scaling is slower than on AKS.
  • Azure Kubernetes Service (AKS): The production choice. It manages scaling, health checks, and rolling updates. You define a Kubernetes Deployment and a Horizontal Pod Autoscaler that scales based on CPU or custom metrics (like request queue length).

Tier 2: The Orchestration & Delivery Layer

This is the traffic cop. It handles authentication, rate limiting, request queuing, and routing. You can implement this as a separate, lighter container (using something like NGINX or a Python app) or use Azure API Management (APIM). APIM is powerful but adds cost. For smaller teams, I often start with a simple Python gateway in the same AKS cluster.
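To make the "simple Python gateway" idea more concrete, here is a minimal sketch of its authentication and rate-limiting checks using a token bucket per API key. The class names (`TokenBucket`, `Gateway`), the `known_keys` set, and the default limits are all hypothetical choices for illustration, not part of any Azure or FastAPI API. A real gateway would also forward the request to the inference service and load its keys from a secret store.

```python
import time
from typing import Callable, Dict, Set


class TokenBucket:
    """Allows bursts up to `capacity`, refilled at `rate` tokens per second."""

    def __init__(self, capacity: float, rate: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.capacity = capacity
        self.rate = rate
        self.clock = clock
        self.tokens = capacity          # start with a full bucket
        self.updated = clock()

    def allow(self, cost: float = 1.0) -> bool:
        # Refill according to the time elapsed since the last call.
        now = self.clock()
        elapsed = now - self.updated
        self.updated = now
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


class Gateway:
    """Hypothetical gateway front door: authenticate, then rate-limit."""

    def __init__(self, known_keys: Set[str], capacity: float = 5,
                 rate: float = 1.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.known_keys = known_keys
        self.capacity = capacity
        self.rate = rate
        self.clock = clock
        self.buckets: Dict[str, TokenBucket] = {}

    def check(self, api_key: str) -> int:
        """Return the HTTP status the gateway would answer with."""
        if api_key not in self.known_keys:
            return 401  # unknown caller
        bucket = self.buckets.setdefault(
            api_key, TokenBucket(self.capacity, self.rate, self.clock))
        return 200 if bucket.allow() else 429  # 429 = Too Many Requests
```

The injectable `clock` is only there so the limiter can be exercised without sleeping. In production the default `time.monotonic` is enough, though buckets held in process memory are not shared across gateway replicas.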

Here’s how the pieces connect in a typical, resilient setup:

| Component | Azure Service | GitHub's Role | Critical Configuration Tip |
| --- | --- | --- | --- |
| Model Registry & CI | Azure Container Registry (ACR) | GitHub Actions builds & pushes the Docker image on every git tag. | Enable ACR geo-replication if your users are global to reduce pull latency. |
| Compute / Orchestration | Azure Kubernetes Service (AKS) | GitHub Actions applies Kubernetes manifests (kubectl) from your repo. | Use node pools with GPU VMs for inference and cheaper CPU nodes for the gateway. |
| Secrets & Config | Azure Key Vault | GitHub Actions retrieves secrets at deploy time, injects as env vars. | Never store API keys or connection strings in GitHub, even in private repos. |
| Monitoring & Logs | Azure Monitor / Log Analytics | GitHub repository hosts your dashboard definitions (Grafana as code). | Create alerts for GPU memory utilization >90% and request latency spikes. |

Step-by-Step Deployment from Zero to Inference

Let's walk through a concrete setup. I'm assuming you have the DeepSeek R1 model weights (check the official DeepSeek website for access) and basic familiarity with the command line.

Step 1: Local Preparation & Containerization

First, create a simple directory structure in your GitHub repo. The model weights are large, so we use a .gitignore to exclude them. Your Dockerfile is the most important piece.

```dockerfile
# Dockerfile
FROM pytorch/pytorch:2.0.1-cuda11.7-cudnn8-runtime
WORKDIR /app

# Install dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy model files (assumed to be in ./model/ locally)
COPY ./model ./model

# Copy your inference API code
COPY app.py .

# Expose port
EXPOSE 8000

CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "8000"]
```

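Your app.py contains the FastAPI logic to load the model and handle requests. Test this locally first. The big gotcha? The CUDA runtime in your image (11.7 in the base image above) has to be supported by the NVIDIA driver on the Azure GPU node. A mismatch can fail quietly, for example with PyTorch falling back to CPU, so confirm that torch.cuda.is_available() returns True inside the running container.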

Step 2: Azure & GitHub Integration

Create an Azure Container Registry and an AKS cluster. You can do this via the portal, but I define everything as code in a /infra folder using Bicep. This way, your entire infrastructure is versioned in GitHub.

The magic happens in the GitHub Actions workflow (.github/workflows/deploy.yml). This workflow should:

  1. Trigger on a push to the main branch or a new tag.
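  2. Log in to Azure and your ACR. A service-principal secret stored as a GitHub Secret (AZURE_CREDENTIALS) works, though OIDC federation avoids keeping a long-lived credential in GitHub.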
  3. Build and tag the Docker image.
  4. Push it to ACR.
  5. Update the Kubernetes deployment manifest in your repo with the new image tag.
  6. Use kubectl (configured with the AKS cluster credentials) to apply the manifest.
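Step 5 is the one that tends to be hand-waved, so here is a rough sketch of what it could look like as a small Python helper called from the workflow. The function names (`image_tag_from_ref`, `set_image`) and the registry name in the usage example are hypothetical. The only platform facts assumed are that GitHub exposes the ref as `refs/tags/<name>` or `refs/heads/<name>` and the commit as a full SHA.

```python
import re


def image_tag_from_ref(git_ref: str, sha: str) -> str:
    """Use the tag name for tag pushes, otherwise the short commit SHA."""
    prefix = "refs/tags/"
    if git_ref.startswith(prefix):
        return git_ref[len(prefix):]
    return sha[:7]


def set_image(manifest: str, repository: str, tag: str) -> str:
    """Rewrite every `image: <repository>[:old-tag]` line to use `tag`.

    Works on the manifest as plain text so comments and formatting
    are preserved. Raises if the repository is not referenced at all.
    """
    pattern = re.compile(
        r"^([ \t]*(?:-[ \t]+)?image:[ \t]*)"   # indentation, optional "- "
        + re.escape(repository)
        + r"(?::\S+)?[ \t]*$",                  # optional existing tag
        re.MULTILINE,
    )
    updated, count = pattern.subn(
        lambda m: f"{m.group(1)}{repository}:{tag}", manifest)
    if count == 0:
        raise ValueError(f"no image line found for {repository}")
    return updated


if __name__ == "__main__":
    example = (
        "containers:\n"
        "  - name: inference\n"
        "    image: myregistry.azurecr.io/deepseek-r1:v1.0.0\n"
    )
    new_tag = image_tag_from_ref("refs/tags/v1.2.0", "0123456789abcdef")
    print(set_image(example, "myregistry.azurecr.io/deepseek-r1", new_tag))
```

Committing the rewritten manifest keeps Git as the record of what is deployed. If you would rather not commit from CI, `kubectl set image` or a Kustomize image override are alternatives that achieve the same rollout.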

This is where most guides stop. But the deployment is the easy part.

The Real Cost: Optimization and Monitoring

An idle GPU VM is burning money. A misconfigured auto-scaler can bankrupt you overnight. Here's what you actually need to watch.

Pitfall: Configuring the Horizontal Pod Autoscaler to scale on average CPU utilization is nearly useless for LLM inference. (The AKS cluster autoscaler is a separate piece: it adds nodes when pods can't be scheduled, not in response to CPU.) The model sits idle waiting for requests, using minimal CPU, but consumes all its GPU memory just by being loaded. Your scaling metric must be request-based or custom.

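Implement a request queue in your gateway layer. Expose a metric like pending_requests. Configure the Kubernetes Horizontal Pod Autoscaler (HPA) to scale the number of inference pods on this custom metric. The HPA can't read it directly, so you need a metrics adapter in the cluster (KEDA or a Prometheus adapter are common choices) to surface it, for example from Azure Monitor. This ensures you spin up new pods only when there's actual work.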
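To get a feel for how a queue-based metric turns into pod counts, the sketch below reproduces the replica calculation described in the Kubernetes HPA documentation: desired = ceil(current replicas × current metric / target metric), skipped when the ratio is within a tolerance of 1.0. The function name, the per-pod reading of pending_requests, and the min/max defaults are assumptions for illustration. The real controller also applies stabilization windows and readiness handling that are left out here.

```python
import math


def desired_replicas(current_replicas: int,
                     pending_per_pod: float,
                     target_per_pod: float,
                     min_replicas: int = 1,
                     max_replicas: int = 10,
                     tolerance: float = 0.1) -> int:
    """Approximate the HPA replica calculation for a per-pod queue metric.

    pending_per_pod: average pending_requests observed per inference pod.
    target_per_pod:  the queue depth per pod you consider healthy.
    """
    if current_replicas < 1:
        raise ValueError("current_replicas must be at least 1")
    if target_per_pod <= 0:
        raise ValueError("target_per_pod must be positive")

    ratio = pending_per_pod / target_per_pod
    if abs(ratio - 1.0) <= tolerance:
        desired = current_replicas          # close enough: do not scale
    else:
        desired = math.ceil(current_replicas * ratio)

    # Clamp to the bounds configured on the autoscaler.
    return max(min_replicas, min(max_replicas, desired))
```

For example, two pods each holding 12 pending requests against a target of 4 would ask for six pods. Because GPU pods are slow to start, a lower target per pod (scaling earlier) is usually a safer starting point than a tight one.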

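For development or highly variable traffic, consider scale-to-zero. You can use Kubernetes Event-Driven Autoscaling (KEDA) to scale your deployment down to zero pods when no requests arrive for a period, and spin them back up when a request hits. The savings only materialize if the GPU node pool can also scale down (via the cluster autoscaler) once the pods are gone; otherwise you're still paying for an idle VM. The trade-off is a cold-start penalty of several minutes while a node starts, the image is pulled, and the model loads.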

Set up Azure Budget alerts immediately. Go to Cost Management + Billing, create a budget for your resource group, and set alerts at 50%, 90%, and 100% of your monthly limit. I learned this the hard way.
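The alert thresholds are simple enough to reason about in a few lines. This sketch shows which of the 50%, 90%, and 100% alerts a given spend would have tripped, plus a naive linear month-end projection. Both function names are hypothetical, and the projection is a simplification of my own rather than how Azure computes its forecasts.

```python
from typing import List, Sequence


def crossed_thresholds(spend: float, budget: float,
                       thresholds: Sequence[float] = (0.5, 0.9, 1.0)
                       ) -> List[float]:
    """Return the alert thresholds (as fractions) already reached."""
    if budget <= 0:
        raise ValueError("budget must be positive")
    used = spend / budget
    return [t for t in sorted(thresholds) if used >= t]


def projected_month_end(spend: float, day_of_month: int,
                        days_in_month: int) -> float:
    """Naive projection: assume the daily burn rate so far continues."""
    if day_of_month < 1:
        raise ValueError("day_of_month must be at least 1")
    return spend / day_of_month * days_in_month
```

With a 1,000 budget, a spend of 920 has tripped the 50% and 90% alerts. A spend of 300 on day 10 of a 30-day month projects to 900, which is the kind of early signal worth acting on before a GPU node pool runs unattended over a weekend.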

Security Gotchas Everyone Misses

Deploying an open AI model doesn't mean being open to attacks.

  • Container Registry Scanning: Enable vulnerability scanning in ACR. Your base PyTorch image will have CVEs. You need to assess and patch.
  • Network Isolation: Your AKS cluster should be in a private virtual network (VNet). Use an Azure Application Gateway or APIM as a public ingress point, not a public LoadBalancer service directly on your inference pods.
  • Secret Management: Your inference API might need keys for external services. Store these in Azure Key Vault. Use the Azure Key Vault Provider for Secrets Store CSI Driver in AKS to mount them as volumes in your pod. That way they never appear in your GitHub repo, your container image, or plain environment variables.
  • Model Weights: The base DeepSeek R1 weights are openly published, but fine-tuned weights are valuable IP. While they're in your container image in ACR, ensure the registry is private and that AKS pulls images via Managed Identity, not admin keys.

Your DeepSeek R1, Azure, and GitHub Questions Answered

What's the most cost-effective Azure VM for running DeepSeek R1 for intermittent, low-volume testing?
Skip the big NCv3 series for testing. Look at the NCas_T4_v3 series (with NVIDIA T4 GPUs), which offers a good balance of GPU memory and cost per hour, provided you're running a distilled R1 variant that fits in the T4's 16 GB. For true "burst" testing where you can tolerate a 5-7 minute cold start, Azure Container Instances with a GPU SKU is an option where it's offered; GPU support in ACI has been limited in availability, so confirm it for your region first. You pay only for the seconds the container runs, which suits a weekend proof-of-concept without leaving a VM running all week.
How do I manage multiple versions of DeepSeek R1 (or different fine-tunes) in the same AKS cluster without conflict?
Use Kubernetes namespaces and distinct service names. Create a namespace like deepseek-r1-v1-2 and another for deepseek-r1-finance-ft. Deploy each model into its own namespace. Your gateway or APIM can then route requests based on a path (/v1.2/completions) or header to the correct backend service. This keeps deployments isolated, and you can apply resource quotas per namespace to prevent one model from hogging all the cluster resources.
My GitHub Actions workflow fails during kubectl deploy with authentication errors. What's the most secure way to connect GitHub to AKS?
Avoid using raw kubeconfig files or service account tokens as GitHub Secrets. The secure method is to use Azure's OpenID Connect (OIDC) federation. You create an application registration in Microsoft Entra ID (formerly Azure Active Directory) and add a federated credential for your GitHub repository. The workflow then exchanges its GitHub-issued OIDC token for a short-lived Azure access token, which it uses to authenticate to AKS. Microsoft's documentation on "Configure OpenID Connect in Azure" outlines this. It eliminates long-lived, powerful secrets in your GitHub repository.
The model loads slowly on pod startup, causing Kubernetes health probes to fail and the pod to restart in a loop. How do I fix this?
This is a classic issue. Your probes are checking the endpoint before the model has finished loading. Note that it's a failing livenessProbe that restarts the container; a failing readinessProbe only keeps the pod out of the Service's endpoints. The fix is two-fold. First, give the model time to load from disk into GPU memory: set initialDelaySeconds to 180-300 seconds on the probes, or add a startupProbe with a generous failureThreshold so the other probes wait until startup succeeds. Second, point the readinessProbe at a specific lightweight health endpoint (e.g., /health) that returns 200 OK only once the model is fully loaded in your application code, not at the root API endpoint. Don't rely on the default settings.
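
For the health endpoint in that last answer, the application side might look something like the sketch below. It is framework-agnostic on purpose: `ModelState` and its methods are hypothetical names, and `loader` stands in for whatever function actually loads your model. In a FastAPI app you would return the result of `health()` from a GET /health route.

```python
import threading
from typing import Any, Callable, Dict, Optional, Tuple


class ModelState:
    """Tracks whether the (slow) model load has finished."""

    def __init__(self) -> None:
        self._ready = threading.Event()
        self.model: Optional[Any] = None

    def load_in_background(self, loader: Callable[[], Any]) -> threading.Thread:
        """Start loading without blocking the web server from starting."""
        def _run() -> None:
            self.model = loader()   # may take minutes for large weights
            self._ready.set()       # only now does /health report ready

        thread = threading.Thread(target=_run, daemon=True)
        thread.start()
        return thread

    def health(self) -> Tuple[int, Dict[str, str]]:
        """Return (status_code, body) for the health endpoint."""
        if self._ready.is_set():
            return 200, {"status": "ok"}
        return 503, {"status": "loading"}
```

Loading in a background thread means the server can answer the probe with 503 during startup instead of refusing connections, which makes the pod's state easier to read in `kubectl describe`.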

Building this pipeline feels like a lot of moving parts. It is. But the payoff is a system where updating your model, scaling for a product launch, or diagnosing a performance issue becomes a structured, repeatable process—not a panic-fueled scramble. Start with the simplest working version (maybe just ACI and a manual image push), then layer in the automation from GitHub Actions, then the scaling from AKS. Iterate like you would on any other piece of software. The model is just the beginning; the platform you build around it is what delivers real value.

This guide is based on hands-on deployment experience and architecture patterns. Configuration details for Azure and GitHub services should be verified against their respective official documentation as the platforms evolve.