Enabling UK healthcare data compliance with scalable AI infrastructure

About the project

Our client is a UK-based healthtech company delivering state-of-the-art digital solutions for the healthcare sector. Operating at the intersection of service and product development, they expertly leverage their know-how across diverse projects to help businesses build and launch compliant medtech software tools in the United Kingdom and globally.  

Client

A UK-based healthtech company delivering compliant digital solutions for the healthcare sector

Our role

Data Engineering, MLOps, LLMOps, DevOps, ML Engineering

Country

The UK

Industry

Healthcare, Medtech, Healthtech

Team members

1 MLOps Engineer, 1 ML Engineer

Duration

2 months for the initial phase; ongoing support

Technologies and tools we used
Mixedbread
Ollama
Karpenter
LLaMA3
Challenges

Our healthtech client faced a critical challenge due to stringent UK regulations on healthcare data. Any healthcare-related patient information, including data processed by AI, must remain within the UK. This extends beyond storage: all data processing must also run on UK-based servers.

Non-compliance with this mandate carries significant penalties under the UK’s data protection framework, including substantial fines, reputational damage, and even operational shutdown. The UK’s Data Protection Act 2018, for instance, strictly controls how personal information is collected, stored, and used.

Global AI infrastructure is mostly hosted outside the UK, which posed a critical problem for our client: no commercial cloud LLM processing service available in the UK met their specific needs. Compliance therefore required using only UK-based servers and region-specific AWS services.

Adding to this complexity, the anticipated AI model workloads were highly irregular, characterized by unpredictable spikes rather than consistent demand. Scaling LLMs for high traffic also posed a significant challenge, as achieving seamless scaling isn’t straightforward with large models.

Given the substantial cost of the GPU resources essential for AI models, our client needed a cost-efficient, dynamically scalable solution to meet fluctuating demand without prohibitive continuous expenses.

This complex array of challenges demanded an in-depth understanding of both technical intricacies and healthcare sector specifics.

Key challenges in building UK-compliant AI infrastructure
The solution

Honeycomb Software addressed the client’s multifaceted challenge by designing a custom-built, flexible, and highly cost-effective AI infrastructure solution.

Approach to the challenges

We deployed a dedicated infrastructure optimized for open-source LLM models directly within the UK region, bypassing the lack of commercial cloud services that could manage the data locally.

Our team engineered a “scale-to-zero” architecture to handle unpredictable workloads and high GPU costs. It automatically activates resources and the AI model as needed, shutting down expensive components during idle periods.

For flexible LLMOps and future adaptability, we adopted containerization and microservices, leveraging tools like Docker and Kubernetes. This allowed us to containerize models with dependencies for consistent deployment.

Kubernetes manages these containers across servers, dynamically scaling them with demand. This inherently modular and flexible approach simplifies updates and enables easy interchangeability of different LLM models as our client’s needs evolve.
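
As an illustration of what one such containerized microservice can look like, here is a minimal sketch assuming a FastAPI wrapper around Ollama’s HTTP API; the endpoint URL, model tag, and route are illustrative, not the client’s actual service:

```python
# Minimal sketch of a containerized inference microservice (illustrative, not
# the client's production code): a FastAPI wrapper around Ollama's REST API.
import os

import httpx
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

# Assumed: an Ollama sidecar/service reachable inside the cluster.
OLLAMA_URL = os.getenv("OLLAMA_URL", "http://localhost:11434")
MODEL_NAME = os.getenv("MODEL_NAME", "llama3")  # illustrative model tag


class Prompt(BaseModel):
    text: str


@app.post("/generate")
async def generate(prompt: Prompt) -> dict:
    # Ollama's /api/generate returns the full completion when stream=False.
    async with httpx.AsyncClient(timeout=120.0) as client:
        resp = await client.post(
            f"{OLLAMA_URL}/api/generate",
            json={"model": MODEL_NAME, "prompt": prompt.text, "stream": False},
        )
        resp.raise_for_status()
    return {"response": resp.json()["response"]}
```

Because the model name and endpoint come from environment variables, swapping in a different LLM only requires changing the container’s configuration, which is what makes the modules interchangeable.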

To address GPU scarcity and cost, we developed a solution using autoscaling groups with multiple GPU types, ensuring that if one type is unavailable, others can be spun up. Our solution also supports multi-GPU inference, allowing several smaller GPUs to be combined as a substitute for a larger one.
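
In the deployed system this fallback is handled declaratively by the autoscaler, but the underlying idea can be sketched in a few lines of boto3 (the instance types and AMI ID below are placeholders):

```python
# Illustrative sketch (not the production code) of the fallback idea behind
# multi-GPU-type autoscaling: try a prioritized list of GPU instance types and
# fall back when AWS reports insufficient capacity for one of them.
import boto3
from botocore.exceptions import ClientError

# Assumed values for illustration only.
GPU_INSTANCE_TYPES = ["g5.xlarge", "g4dn.xlarge", "p3.2xlarge"]
AMI_ID = "ami-0123456789abcdef0"  # placeholder GPU-enabled AMI

ec2 = boto3.client("ec2", region_name="eu-west-2")  # London, for UK residency


def launch_gpu_node() -> str:
    """Launch one GPU instance, falling back across instance types."""
    for instance_type in GPU_INSTANCE_TYPES:
        try:
            result = ec2.run_instances(
                ImageId=AMI_ID,
                InstanceType=instance_type,
                MinCount=1,
                MaxCount=1,
            )
            return result["Instances"][0]["InstanceId"]
        except ClientError as err:
            # A capacity error triggers a fallback to the next GPU type.
            if err.response["Error"]["Code"] == "InsufficientInstanceCapacity":
                continue
            raise
    raise RuntimeError("No GPU capacity available in any configured type")
```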

Our deep LLMOps and ML engineering expertise was crucial for navigating the complexities of deploying and managing LLMs in a regulated environment, allowing us to optimize model performance, manage dependencies, and ensure data integrity within UK data regulations.

As an AWS-certified partner, we adopted operational best practices and built the foundational cloud architecture on the AWS Well-Architected Framework, essential for a robust, secure, and scalable deployment. This gave the infrastructure the high availability and reliability critical for healthcare applications.

To optimize performance and handle large models efficiently, we used advanced inference optimization techniques like model sharding, which distributes models across multiple GPUs when they don’t fit on a single one, and multi-GPU inference.

Finally, our business-centric approach went beyond just software development. We embedded ourselves in client operations to deeply understand their core business drivers and regulatory pressures, allowing us to prioritize a cost-efficient, compliant solution that delivered tangible business value.

Technical Implementation & Architecture Overview

Honeycomb Software built a fully localized, autoscaling LLM platform on AWS in the London region. The implementation is divided into five core layers, each optimized for performance, cost-efficiency, and compliance:

1. Infrastructure Provisioning & Security

  • AWS EU-West-2 (London) only: All compute, storage, and networking resources live in private subnets within the London region, satisfying UK Data Protection Act 2018 residency requirements.
  • IaC & GitOps: Terraform/Terragrunt codify VPC, EKS, EFS, S3, RDS, Lambda, IAM roles/policies; FluxCD drives Kubernetes manifests for consistent, auditable deployments.
  • Network Controls: VPC endpoints for S3, RDS, EFS; Security Groups and IAM policies lock down access to UK-only endpoints; audit trails via CloudTrail & GuardDuty.
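
To illustrate how such residency controls can be spot-checked (a sketch, not part of the delivered tooling), a short boto3 script can verify that every S3 bucket in the account reports the London region:

```python
# Sketch of a data-residency spot check (illustrative): verify every S3 bucket
# in the account reports the London region, per the UK-only requirement.
import boto3

s3 = boto3.client("s3", region_name="eu-west-2")

for bucket in s3.list_buckets()["Buckets"]:
    name = bucket["Name"]
    # LocationConstraint is None for us-east-1; anything but eu-west-2 fails.
    region = s3.get_bucket_location(Bucket=name)["LocationConstraint"]
    status = "OK" if region == "eu-west-2" else "NON-COMPLIANT"
    print(f"{name}: {region} [{status}]")
```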

2. Document Indexer LLM Flow (Schema & Pre-Processing)

S3 → Lambda Trigger:

  • When a PDF or image is uploaded to an S3 bucket, a Lambda function is triggered (Diagram 1).

OCR Extraction:

  • A serverless OCR step extracts unstructured raw text from the uploaded document.

LLM Structuring:

  • The Indexer sends the unstructured raw text to the LLM, which returns structured output (see the sketch after this flow).

Cost Optimization:

  • If the LLM receives no requests for a defined period of time, the GPU node pool is scaled down to zero. Models are stored on EFS within the region, so the data is available for the next scale-up in any AZ.
Document Indexer LLM Flow
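
To make the flow concrete, below is a minimal sketch of the triggering Lambda. The OCR engine (Amazon Textract here) and the internal LLM endpoint are assumptions for illustration; the case study does not name the production OCR service:

```python
# Illustrative Lambda handler for the indexer flow (assumptions: Textract for
# OCR and an in-cluster LLM endpoint; neither is confirmed by the case study).
import json
import urllib.request

import boto3

textract = boto3.client("textract", region_name="eu-west-2")

LLM_ENDPOINT = "http://llm-inference.internal/generate"  # hypothetical URL


def handler(event, context):
    # S3 put events carry the bucket name and object key of the upload.
    record = event["Records"][0]["s3"]
    bucket = record["bucket"]["name"]
    key = record["object"]["key"]

    # OCR: extract raw text lines from the uploaded image (multi-page PDFs
    # would instead use Textract's asynchronous API).
    ocr = textract.detect_document_text(
        Document={"S3Object": {"Bucket": bucket, "Name": key}}
    )
    raw_text = "\n".join(
        block["Text"] for block in ocr["Blocks"] if block["BlockType"] == "LINE"
    )

    # LLM structuring: send the raw text to the model for structured output.
    req = urllib.request.Request(
        LLM_ENDPOINT,
        data=json.dumps({"text": raw_text}).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())
```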

3. Containerization & Autoscaled Inference

EKS + Docker & Helm:

  • All inference services run as containerized microservices on EKS, with Helm charts managed by FluxCD.

Karpenter + HPA Scale-to-Zero:

  • Karpenter provisions GPU and CPU node pools on demand. Kubernetes HPA scales pods from zero up to N based on request rate and GPU metrics.
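
Stock HPA only reaches zero with extra help (an alpha feature gate or an add-on such as KEDA), so here is a hedged sketch of one way the idle-to-zero step can work, using the official Kubernetes Python client; the deployment name, namespace, and metrics hook are illustrative:

```python
# Sketch of a scale-to-zero idle watcher (illustrative; the real stack relies
# on Karpenter + HPA). Scales the inference Deployment to zero replicas after
# a period with no requests, letting Karpenter reclaim the idle GPU nodes.
import time

from kubernetes import client, config

IDLE_SECONDS = 600  # assumed idle threshold
NAMESPACE = "inference"  # hypothetical namespace
DEPLOYMENT = "llm-server"  # hypothetical deployment name


def get_request_count() -> int:
    # Placeholder: wire this to a real metrics backend (e.g., Prometheus).
    return 0


def main() -> None:
    config.load_incluster_config()  # running as an in-cluster controller
    apps = client.AppsV1Api()
    last_count, last_active = get_request_count(), time.time()

    while True:
        time.sleep(30)
        count = get_request_count()
        if count != last_count:
            last_count, last_active = count, time.time()
        elif time.time() - last_active > IDLE_SECONDS:
            # No traffic for IDLE_SECONDS: drop the Deployment to zero pods.
            apps.patch_namespaced_deployment_scale(
                DEPLOYMENT, NAMESPACE, {"spec": {"replicas": 0}}
            )


if __name__ == "__main__":
    main()
```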

Cold-Start Optimization:

  • To reduce cold-start delays when scaling from zero, the GPU node pool supports multiple GPU types, maximizing the chance that capacity is immediately available. The architecture also enables multi-GPU inference, allowing several smaller GPUs to be used in place of a single larger one.
GPU node pool with EFS (schema)

4. Inference Performance Optimizations 

Model Sharding & Multi-GPU Inference:

  • Large models are split across GPUs with NCCL communication, ensuring high utilization.
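
As a hedged illustration of the sharding idea (the production serving path is not detailed here; this sketch instead uses Hugging Face transformers with accelerate-style automatic placement, and the checkpoint name is illustrative):

```python
# Illustrative multi-GPU sharding sketch (an assumption, not the production
# serving path): transformers + accelerate split the model's layers across all
# visible GPUs, with device_map="auto" deciding the placement.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "meta-llama/Meta-Llama-3-8B-Instruct"  # illustrative checkpoint

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    device_map="auto",          # shard layers across every available GPU
    torch_dtype=torch.float16,  # halve memory so shards fit on smaller GPUs
)

inputs = tokenizer("Summarize this discharge note:", return_tensors="pt")
inputs = inputs.to(model.device)  # device holding the first shard
output = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```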

Latency & Utilization Benchmarks:

  • We benchmarked under 100–1,000 req/s loads to confirm sub-50 ms latency and >90% GPU utilization with sharding.
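
For reference, benchmarks of this kind can be driven by a small async load generator along these lines (the target URL, payload, and volumes are placeholders, not the client’s harness):

```python
# Sketch of a simple async load generator for latency benchmarking
# (illustrative; the actual benchmark harness is not described here).
import asyncio
import statistics
import time

import httpx

TARGET_URL = "http://llm-inference.internal/generate"  # hypothetical endpoint
CONCURRENCY = 100
REQUESTS = 1_000


async def one_request(client: httpx.AsyncClient, latencies: list[float]) -> None:
    start = time.perf_counter()
    resp = await client.post(TARGET_URL, json={"text": "ping"})
    resp.raise_for_status()
    latencies.append((time.perf_counter() - start) * 1000)  # ms


async def main() -> None:
    latencies: list[float] = []
    sem = asyncio.Semaphore(CONCURRENCY)  # cap in-flight requests

    async def bounded(client: httpx.AsyncClient) -> None:
        async with sem:
            await one_request(client, latencies)

    async with httpx.AsyncClient(timeout=30.0) as client:
        await asyncio.gather(*(bounded(client) for _ in range(REQUESTS)))

    latencies.sort()
    print(f"p50: {statistics.median(latencies):.1f} ms")
    print(f"p95: {latencies[int(len(latencies) * 0.95)]:.1f} ms")


asyncio.run(main())
```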

By combining UK-only AWS services, serverless OCR, open-source LLMs, Kubernetes autoscaling, and inference optimizations, we delivered a compliant, resilient, and cost-efficient AI platform, fully aligned with strict data-residency rules and real-world usage patterns.

The results

Honeycomb Software successfully delivered a fully operational and compliant AI infrastructure for our client, effectively addressing their core challenges.

Key results

Ensured regulatory compliance: The implemented solution guarantees that all healthcare data processing, including sensitive personal information handled by AI, occurs entirely on UK-based servers in strict adherence to UK data regulations.

Achieved optimized cost efficiency: The scale-to-zero architecture cut GPU costs by 80%, letting our client shut down expensive resources during periods of no demand and scale rapidly to meet peak capacity needs.

Delivered high scalability and reliability: The architecture is designed to handle thousands of requests per second, ensuring high performance and reliability even during peak loads.

Provided architectural flexibility: The modular design allows our client the agility to interchange different LLM models as their requirements evolve, future-proofing their AI capabilities.

Benefits of the scalable AI infrastructure for the UK
Future Plans  

The completion of the infrastructure phase marks a significant milestone in this long-term project. Our client now possesses a foundational, compliant, and scalable AI infrastructure, ready for ongoing utilization and further development. Honeycomb Software continues its partnership to support future phases of this project.
