
Introduction
On August 2, 2026, EU AI Act obligations for high-risk AI systems become fully enforceable. For any financial services company that uses machine learning to assess creditworthiness or assign a credit score to an individual, that date is not a policy milestone; it is an engineering deadline. Annex III of the Act explicitly lists creditworthiness assessment and credit scoring of natural persons as a high-risk use case. Penalties for non-compliance reach €15 million or 3% of global annual turnover, whichever is higher. As of May 2026, that window is fourteen weeks. Most MLOps teams are not ready, and the gap is not where they think it is.
Why Credit Scoring AI Triggers Full Annex III Obligations
The EU AI Act does not require your model to be unreliable or biased to put it in scope. It requires only that the model’s output has the potential to meaningfully affect an individual’s access to credit. That covers essentially every automated lending decision model in production: FICO-style scorecards, gradient boosting models for loan origination, and pricing engines that adjust interest rates by risk tier. One caveat: Annex III carves out AI systems used to detect financial fraud, so a deep-learning-based fraud risk score is not in scope on that basis alone. It is pulled in when its output is used in the approval flow to assess the applicant's creditworthiness rather than only to flag fraud.
The distinction regulators draw is between narrow statistical reporting (out of scope) and decision-support systems that inform or automate individual credit outcomes (in scope). If your model’s output touches an applicant’s decision flow, you are in scope.
What this triggers is not a one-time audit. It is an ongoing engineering obligation. Article 9 requires a continuous risk management system throughout the model’s lifecycle, not a pre-launch checklist. Article 12 requires automatic event logging with enough detail to enable post-hoc reconstruction of the system’s behavior on any given inference call.
Your credit model’s training pipeline is now regulatory infrastructure.
The Logging Problem Nobody Talks About
Explainability has received most of the industry attention. Practitioners debate SHAP vs. LIME, argue about counterfactual explanations, and invest in model cards. Those efforts are real and necessary. But they are not the hardest part.
The hardest part is Article 12 logging, and most MLOps platforms are not built for it.
Article 12 requires logs to capture the operating conditions of the system, the input data used to produce each output, and the decisions or recommendations made. For a credit scoring model running at scale, that means logging at the individual inference level, not at the batch level. It means correlating model version, feature values, output score, and outcome back to a specific applicant decision. It means storing those logs in a tamper-evident format for a period sufficient to support regulatory review. Articles 19 and 26 set the floor for that period at six months, unless other applicable law provides otherwise.
The gap is not between your model’s accuracy and the benchmark. The gap is between what your model does and what you can prove it did.
A digital-first NBFC we worked with in India had built a solid MLOps pipeline: automated retraining, drift monitoring with Evidently, champion-challenger scoring, and weekly business reviews. None of that touched Article 12 compliance. Their inference logs were aggregated. Their feature values were not persisted. Their model version at inference time was not recorded in the data store that held approval outcomes. The compliance gap was not in the model. It was in the plumbing.
The Credit Model Compliance Stack (CMCS)
The CMCS is a five-layer framework for bringing a credit scoring MLOps pipeline into EU AI Act compliance. Work through the layers in order. Each layer is a prerequisite for the one above it.
Layer 1: Model Registry with Lineage
Every model version in production must be traceable to its training data, training code, hyperparameters, and evaluation metrics at the point of deployment. Tools: MLflow Model Registry, Vertex AI Model Registry, or equivalent. Acceptance criterion: you can reconstruct the exact model artifact that produced any given inference.
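A minimal sketch of what a lineage record could hold, written against the Python standard library rather than any particular registry's API. The `LineageRecord` class, its field names, and the example values are all hypothetical; in practice these values would more likely be stored as metadata in whichever registry you already run.

```python
import hashlib
import json
from dataclasses import dataclass, asdict


def sha256_bytes(data: bytes) -> str:
    """Return the hex SHA-256 digest of a byte string."""
    return hashlib.sha256(data).hexdigest()


@dataclass(frozen=True)
class LineageRecord:
    # Field names are illustrative assumptions, not a registry schema.
    model_version: str
    artifact_sha256: str       # hash of the serialized model artifact
    training_data_sha256: str  # hash of the training dataset snapshot
    code_commit: str           # commit id of the training code
    hyperparameters: dict
    evaluation_metrics: dict

    def fingerprint(self) -> str:
        """Deterministic hash over the whole record, for later comparison."""
        canonical = json.dumps(asdict(self), sort_keys=True, separators=(",", ":"))
        return sha256_bytes(canonical.encode("utf-8"))


def artifact_matches(record: LineageRecord, artifact: bytes) -> bool:
    """Check that a stored artifact is the one the record describes."""
    return sha256_bytes(artifact) == record.artifact_sha256


# Made-up example values, for illustration only.
artifact = b"serialized-model-bytes"  # stand-in for a real model file
record = LineageRecord(
    model_version="credit-gbm-1.4.2",
    artifact_sha256=sha256_bytes(artifact),
    training_data_sha256=sha256_bytes(b"training-snapshot"),
    code_commit="0000000",  # placeholder commit id
    hyperparameters={"max_depth": 6, "learning_rate": 0.05},
    evaluation_metrics={"auc": 0.81},
)

print(artifact_matches(record, artifact))     # the untouched artifact
print(artifact_matches(record, b"tampered"))  # any other bytes
```

Hashing the artifact and the dataset snapshot, rather than storing only their paths, is what lets you later show that the file in storage is the one that was deployed.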
Layer 2: Inference-Level Event Logging
Every inference call must generate a structured log record containing: model version ID, input feature vector (or a hash linked to a retrievable record), output score, timestamp, and the downstream decision applied (approved, declined, referred). Logs must be append-only and stored separately from the application database. Acceptance criterion: you can reconstruct the decision path for any individual application within 24 hours of a regulatory request.
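One way to make such a log tamper-evident is to chain each record to the hash of the one before it, so that editing any earlier record breaks verification. The sketch below assumes this hash-chain approach; it is not something the Act prescribes. The function names, the record fields, and the in-memory list are hypothetical stand-ins, and a real deployment would write to append-only storage kept apart from the application database.

```python
import hashlib
import json
from datetime import datetime, timezone

GENESIS_HASH = "0" * 64  # assumed starting value for the chain


def _digest(body: dict) -> str:
    """Hash a record body using a canonical JSON encoding."""
    canonical = json.dumps(body, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def append_record(log, *, application_id, model_version, features,
                  score, decision, timestamp=None):
    """Append one inference record, chained to the previous record's hash."""
    body = {
        "application_id": application_id,
        "model_version": model_version,
        "features": features,
        "score": score,
        "decision": decision,  # "approved", "declined" or "referred"
        "timestamp": timestamp or datetime.now(timezone.utc).isoformat(),
        "prev_hash": log[-1]["record_hash"] if log else GENESIS_HASH,
    }
    record = {**body, "record_hash": _digest(body)}
    log.append(record)
    return record


def verify_chain(log) -> bool:
    """Recompute every hash and check each link to its predecessor."""
    prev = GENESIS_HASH
    for record in log:
        body = {k: v for k, v in record.items() if k != "record_hash"}
        if body["prev_hash"] != prev or _digest(body) != record["record_hash"]:
            return False
        prev = record["record_hash"]
    return True


# Made-up example records.
log = []
append_record(log, application_id="app-001", model_version="credit-gbm-1.4.2",
              features={"income": 52000, "utilization": 0.37},
              score=0.82, decision="approved")
append_record(log, application_id="app-002", model_version="credit-gbm-1.4.2",
              features={"income": 18000, "utilization": 0.91},
              score=0.21, decision="declined")
chain_ok_before = verify_chain(log)

log[0]["score"] = 0.99  # simulate an after-the-fact edit
chain_ok_after = verify_chain(log)
```

A chain like this only shows that something was altered. It does not prevent alteration, so it complements write-once storage and access controls rather than replacing them.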
Layer 3: Data Governance for Training Sets
Article 10 requires that training data be relevant, sufficiently representative, and, to the best extent possible, free of errors and complete. Your data governance documentation must record the source, preprocessing steps, bias assessment methodology, and any exclusions applied to training datasets. Acceptance criterion: a written data governance record exists for every model version in the registry.
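The acceptance criterion lends itself to an automated check. The sketch below assumes a hypothetical `DataGovernanceRecord` structure whose fields mirror the items listed above. It reports which registry versions have no record or an incomplete one. It checks that the documentation exists, not that it is of adequate quality.

```python
from dataclasses import dataclass, field


@dataclass
class DataGovernanceRecord:
    # Hypothetical structure; field names are illustrative assumptions.
    model_version: str
    data_sources: list
    preprocessing_steps: list
    bias_assessment_method: str
    exclusions: list = field(default_factory=list)  # may legitimately be empty

    def missing_fields(self) -> list:
        """Names of the required fields that were left empty."""
        required = {
            "data_sources": self.data_sources,
            "preprocessing_steps": self.preprocessing_steps,
            "bias_assessment_method": self.bias_assessment_method,
        }
        return [name for name, value in required.items() if not value]


def governance_gaps(registry_versions, records) -> dict:
    """Map each model version to what its governance record is missing."""
    by_version = {r.model_version: r for r in records}
    gaps = {}
    for version in registry_versions:
        record = by_version.get(version)
        if record is None:
            gaps[version] = ["no governance record"]
        elif record.missing_fields():
            gaps[version] = record.missing_fields()
    return gaps


# Made-up example: three versions in the registry, two records on file.
records = [
    DataGovernanceRecord(
        model_version="v1",
        data_sources=["loan_book_snapshot"],
        preprocessing_steps=["dedupe", "impute_income"],
        bias_assessment_method="approval rate comparison across groups",
    ),
    DataGovernanceRecord(
        model_version="v2",
        data_sources=["loan_book_snapshot"],
        preprocessing_steps=[],
        bias_assessment_method="",
    ),
]
gaps = governance_gaps(["v1", "v2", "v3"], records)
```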
Layer 4: Human Oversight Mechanism
High-risk AI systems require a human override mechanism. For credit scoring, this means a review queue for edge-case decisions, a defined escalation protocol, and audit logs showing when human reviewers were engaged and what decisions they made. Acceptance criterion: the override rate and review queue disposition are reportable metrics in your risk management dashboard.
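As a rough illustration of the routing and the reportable metric, the sketch below refers scores inside a band around the approval threshold to human review and computes an override rate from the reviewers' outcomes. The band, the threshold, the assumption that higher scores mean lower risk, and all names are hypothetical choices that your own risk policy would set.

```python
from dataclasses import dataclass

APPROVE_THRESHOLD = 0.5     # hypothetical cut-off; higher score = lower risk
REVIEW_BAND = (0.45, 0.55)  # hypothetical edge-case band sent to a human


def model_decision(score: float) -> str:
    """What the model alone would decide, ignoring the review band."""
    return "approved" if score >= APPROVE_THRESHOLD else "declined"


def route(score: float) -> str:
    """Refer edge cases to the review queue; otherwise apply the model decision."""
    low, high = REVIEW_BAND
    if low <= score <= high:
        return "referred"
    return model_decision(score)


@dataclass
class ReviewOutcome:
    application_id: str
    model_decision: str  # what the model alone would have decided
    reviewer_id: str
    final_decision: str  # what the human reviewer decided

    @property
    def overridden(self) -> bool:
        return self.final_decision != self.model_decision


def override_rate(outcomes) -> float:
    """Share of reviewed cases where the reviewer changed the model's decision."""
    if not outcomes:
        return 0.0
    return sum(o.overridden for o in outcomes) / len(outcomes)


# Made-up review log entries.
outcomes = [
    ReviewOutcome("app-101", model_decision(0.52), "reviewer-7", "approved"),
    ReviewOutcome("app-102", model_decision(0.47), "reviewer-7", "approved"),
]
rate = override_rate(outcomes)
```

Recording the model's standalone decision next to the reviewer's final decision is what makes the override rate computable after the fact.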
Layer 5: Continuous Risk Monitoring
Article 9 requires continuous risk management. For MLOps, this translates to: population stability index (PSI) monitoring for input drift, performance monitoring against a labeled ground-truth sample at defined intervals, and an incident response protocol for when thresholds are crossed. Acceptance criterion: automated alerts fire when model performance or input distribution deviates beyond defined thresholds, and the response protocol is documented.
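For reference, PSI compares the share of observations falling in each bin of a baseline sample (e_i) against the same bins in a recent sample (a_i): PSI = sum over bins of (a_i - e_i) * ln(a_i / e_i). The sketch below is a standard-library implementation under a few assumptions: bin edges are supplied by the caller, empty bins are floored at a small `eps` to avoid taking the log of zero, and the alert threshold of 0.25 is a widely used rule of thumb, not a figure from the Act.

```python
import math
from bisect import bisect_right


def psi(expected, actual, bin_edges, eps=1e-6):
    """Population stability index between a baseline and a recent sample.

    bin_edges are interior cut points, so values fall into
    len(bin_edges) + 1 bins. Empty bins are floored at eps.
    """
    if not expected or not actual:
        raise ValueError("both samples must be non-empty")

    def proportions(values):
        counts = [0] * (len(bin_edges) + 1)
        for v in values:
            counts[bisect_right(bin_edges, v)] += 1
        return [max(c / len(values), eps) for c in counts]

    e = proportions(expected)
    a = proportions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ai, ei in zip(a, e))


PSI_ALERT_THRESHOLD = 0.25  # assumed rule-of-thumb threshold

# Synthetic data: a uniform baseline and a copy shifted upward by 0.3.
baseline = [i / 100 for i in range(100)]
shifted = [v + 0.3 for v in baseline]
edges = [0.25, 0.5, 0.75]

psi_same = psi(baseline, baseline, edges)
psi_shifted = psi(baseline, shifted, edges)
alert = psi_shifted > PSI_ALERT_THRESHOLD
```

In practice the bin edges are usually fixed from the training distribution (deciles are common) so that successive monitoring runs are comparable.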
What You Can Realistically Ship in Fourteen Weeks
Fourteen weeks is enough to achieve compliance on Layer 1, Layer 2, and Layer 4 if the engineering team is focused and the scope is limited to existing production models. It is not enough to rebuild your data governance documentation from scratch, especially if training datasets were assembled without audit-trail discipline.
A phased approach:
Weeks 1 to 3: Audit existing inference logs and identify gaps against Article 12. Stand up inference-level logging in staging. Define the structured log schema and storage architecture.
Weeks 4 to 7: Deploy inference logging to production. Validate log completeness by replaying a sample of historical decisions and confirming reconstruction. Backfill model registry entries for all current production model versions.
Weeks 8 to 10: Build the human oversight queue. Define the decision boundary conditions that trigger mandatory human review. Instrument the override log.
Weeks 11 to 12: Complete the data governance documentation for the three to five highest-risk model versions. Run a bias assessment and record the methodology.
Weeks 13 to 14: Conduct an internal compliance review against the five CMCS layers. Identify residual gaps and triage by risk level. Prepare the technical documentation package.
This is aggressive. It requires a dedicated engineering resource for eight weeks minimum. It also requires a compliance function that can review and sign off on documentation at each stage, not at the end.
The teams that will not make August 2 are the ones that are still treating this as a legal project with an IT dependency.
What This Means for Financial Services Leaders
The EU AI Act transforms model risk management from a best practice into an operational requirement with enforcement teeth. For lending institutions, this is not incremental compliance work. It requires rearchitecting the MLOps stack around observability and auditability.
The concrete step you can take this week without engaging anyone externally: pull a sample of inference records from your top three credit models and check whether you can reconstruct a specific individual decision (applicant ID, feature values, model version, output, outcome) in under an hour. If you cannot, that is your compliance gap, and it is the one that matters most.
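That check can be scripted against whatever sample you pull. The sketch below assumes each sampled record has already been joined into a single dictionary, with hypothetical key names for the five elements listed above. It reports which records are missing something.

```python
# Hypothetical key names for the five elements needed to reconstruct a decision.
REQUIRED_FIELDS = ("applicant_id", "feature_values", "model_version",
                   "output", "outcome")


def reconstruction_gaps(record: dict) -> list:
    """Return the required fields that are absent or empty in one record."""
    return [f for f in REQUIRED_FIELDS if record.get(f) in (None, "", {}, [])]


def audit_sample(records) -> dict:
    """Summarize how many sampled records can be fully reconstructed."""
    gaps = []
    for index, record in enumerate(records):
        missing = reconstruction_gaps(record)
        if missing:
            gaps.append((index, missing))
    return {
        "total": len(records),
        "complete": len(records) - len(gaps),
        "gaps": gaps,
    }


# Made-up sample: one complete record, one without features or model version.
sample = [
    {"applicant_id": "A1", "feature_values": {"income": 52000},
     "model_version": "credit-gbm-1.4.2", "output": 0.82, "outcome": "approved"},
    {"applicant_id": "A2", "feature_values": {},
     "output": 0.21, "outcome": "declined"},
]
report = audit_sample(sample)
```

If producing that joined dictionary for a single applicant is itself the hard part, that difficulty is the finding.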
The next step after that is scoping the inference logging build. Most teams can ship the core logging layer in three to four weeks with two engineers. The data governance documentation takes longer and requires a different skill set: someone who understands both the training pipeline and the regulatory documentation obligation.
About the author: The Codelynks ML engineering team has delivered production MLOps systems for lending and risk platforms across India and the GCC.
FAQs
Any AI system used to assess the creditworthiness of individuals or assign credit scores falls under Annex III of the EU AI Act as a high-risk system, regardless of the underlying model type or the lender’s size.
Article 12 requires automatic, tamper-evident event logging at the inference level, capturing the model version, input data, output, and operating conditions for each decision. Aggregate or batch logs do not satisfy the requirement.
Under Article 99, non-compliance with high-risk AI system obligations carries penalties of up to €15 million or 3% of global annual turnover, whichever is higher. National competent authorities in each EU member state have enforcement powers.
For a team with an existing MLOps stack, building inference-level logging to Article 12 standards typically takes three to six weeks, depending on the complexity of the model-serving infrastructure and the number of models in scope.
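The EU AI Act also reaches companies headquartered outside the EU. It applies to providers and deployers of AI systems when the system is placed on the EU market or its output is used in the EU, regardless of where the company is headquartered.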