Jentic API AI-Readiness Framework (JAIRF) Specification¶
Version 0.2.0¶
This document uses the key words MUST, MUST NOT, SHOULD, SHOULD NOT, and MAY as defined in BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all capitals, as shown here.
This document is licensed under The Apache License, Version 2.0.
Introduction¶
The Jentic API AI-Readiness Framework (JAIRF) defines a standardized methodology for evaluating the suitability of an API for consumption by intelligent agents, LLM-driven orchestration systems, and automated integrations. This specification defines six scored dimensions, grouped into three pillars, and a unified scoring and weighting model that produces an overall AI-Readiness Index between 0 and 100.
JAIRF is designed to evaluate an API’s ability to be:
- Interpretable — easily understood by reasoning systems
- Operable — safe and predictable to execute
- Discoverable — findable, indexable, and contextually exposed
- Governable — aligned with secure and trustworthy practices
The framework may be applied to:
- Single APIs
- Entire API portfolios
- API gateways or registries
- Design-time and runtime validation tools
JAIRF MAY be implemented as a feedback layer during API delivery, a CI/CD quality gate, a governance control, an automated readiness classifier, or as part of an AI-agent execution platform.
Scope¶
This specification defines:
- The conceptual model for API AI-readiness
- The scoring dimensions and required signals
- The weighting and aggregation model
- The readiness levels and classification thresholds
- The normative behaviours required for compliant scoring engines
This specification explicitly does not define:
- How an API provider MUST design APIs
- A proprietary scoring algorithm inaccessible to auditors
- Enforcement or certification process
JAIRF is intended to be implementation-agnostic, vendor-neutral, and API-format-agnostic, supporting (for now):
- OpenAPI 3.x
- OpenAPI 2.x
Terminology¶
The following terms are used in this specification:
API¶
A machine-accessible interface defined by an operational contract (e.g., OpenAPI).
Signal¶
A measurable property extracted from an API (e.g., summary coverage, auth strength). Signals are usually normalised to a value between 0 and 1. In other cases, a signal MAY be binary.
Dimension¶
A thematic grouping of signals that together represent a pillar of API AI-readiness. JAIRF defines six dimensions.
Group¶
A higher-level category that bundles dimensions.
JAIRF defines three groups:
- FDX — Foundational & Developer Experience
- AIRU — AI-Readiness & Usability
- TSD — Trust / Safety / Discoverability
Aggregation¶
The mathematical process of combining normalised signals into dimension scores and dimension scores into the final readiness score.
Harmonic Mean¶
The weighted harmonic mean is used at the final aggregation stage to penalise imbalanced APIs where one weak dimension would otherwise be masked by strong dimensions.
Normalisation¶
The method by which raw measurements (counts, ratios, textual assessments) are converted to a consistent scale. All global normalisation rules are defined in Normalisation Rules.
Readiness Level¶
A human-interpretable categorization mapped from the final readiness score:
| Level | Range | Interpretation |
|---|---|---|
| Level 0 | < 40 | Not Ready |
| Level 1 | 40 - 60 | Foundational |
| Level 2 | 60 - 75 | AI-Aware |
| Level 3 | 75 - 90 | AI-Ready |
| Level 4 | > 90 | Agent-Optimized |
Grading¶
A letter grade applied to each dimension and to the overall score.
Framework Architecture¶
JAIRF defines a three-layer scoring architecture:
- Signals Layer (raw metrics → 0–1 scale)
- Dimension Layer (normalised signals → 0–100)
- Aggregation Layer (weighted harmonic mean → final score)
This architecture MUST be preserved by any conformant implementation.
A conformant engine MUST:
- Compute each dimension exactly as defined
- Apply weights as defined (or disclose weights)
- Use weighted harmonic aggregation
- Apply gating rules prior to final scoring
- Apply readiness classification and grading
The scoring mechanism MUST be deterministic and reproducible for any API input.
Dimensional Model Overview¶
The Jentic API AI-Readiness Framework (JAIRF) evaluates an API’s maturity across six dimensions, grouped into three pillar categories. These pillars represent a staged view of maturity: beginning with structural soundness (can this be used at all), then semantic readiness (can AI understand it), and finally trust & exposure (can it be used safely and found by the right agents).
The purpose of this section is conceptual orientation. Normative definitions of each dimension appear in the Dimension Definitions section.
Pillar Categories¶
Foundational and Developer Experience (FDX)¶
Assesses whether an API is structurally valid, standards-compliant, and usable by humans and tooling. This establishes the prerequisite conditions for any form of automated interpretation.
Dimensions included:
- Foundational Compliance
- Developer Experience & Tooling Compatibility
AI-Readiness & Usability (AIRU)¶
Evaluates the semantic clarity, intent expression, operational structure, and agent-operability of an API. These dimensions determine how effectively intelligent agents can interpret, plan, and execute API operations.
Dimensions included:
- AI-Readiness & Agent Experience
- Agent Usability
Trust, Safety, & Discoverability (TSD)¶
Ensures that the API can be safely exposed to AI systems and that it can be effectively located, classified, and reasoned about within automated discovery environments.
Dimensions include:
- Security
- AI Discoverability
Dimensional Structure¶
Each dimension represents a distinct aspect of AI readiness and MUST be evaluated independently. The framework assumes that:
- No dimension compensates for another: a weakness in any dimension constrains overall readiness.
- Dimensions progress in logical maturity order (Foundational correctness → semantic clarity → safe exposure and discoverability).
- All dimensions score in the range [0–100], derived from normalised signals.
- The final AI-Readiness score is computed later using a weighted harmonic mean of the six dimensions.
Dimension Definitions¶
This section defines normative requirements for each of the six dimensions of the Jentic API AI-Readiness Framework (JAIRF).
Foundational Compliance (FC)¶
The Foundational Compliance dimension evaluates whether an API is structurally valid, standards-conformant, and parsable by modern API tooling. To do this, it measures specification-level correctness, resolution integrity, structural soundness, and linting quality.
An API MUST NOT be considered AI-ready if it fails foundational parsing or contains severe structural defects.
Required Signals¶
| Signal | Description | Normalisation Rule |
|---|---|---|
spec_validity | MUST indicate whether the API parses as valid OpenAPI. | binary |
resolution_completeness | MUST represent the proportion of $ref references that successfully resolve | coverage |
lint_results | Aggregated quality score from linter diagnostics, weighted by severity. | inverse weighted categorical ratio |
structural_integrity | MUST score schema correctness and coherence (e.g., oneOf misuse, contradictory typing, impossible constraints). | logarithmic dampening |
Spec Validity (spec_validity)¶
spec_validity = 1 # if specification parses successfully
spec_validity = 0 # otherwise
Resolution Completeness (resolution_completeness)¶
resolution_completeness = resolved_refs / total_refs
If total_refs = 0, the value MUST be 1.0.
Example¶
resolution_completeness = 0.92
# (23 of 25 $ref links resolved successfully)
Lint Results (lint_results)¶
This signal leverages the core rulesets of Spectral and Redocly, with Jentic opinions applied, as well as a Jentic custom ruleset.
weighted_cost = SQRT((1.0 * critical) + (0.6 * errors) + (0.0025 * warnings) + (0.001 * info) + (0.0 * hint))
lint_results = max(0, 1 - (weighted_cost / 25))
Structural Integrity (structural_integrity)¶
The structural integrity signal measures whether the API's underlying data model is coherent enough for automated reasoning. Unlike linting errors, which often relate to style, documentation, or optional best practices, structural issues reflect semantic or logical defects that prevent reliable interpretation by developers, tools, and AI agents.
Formula:
Structural integrity MUST be calculated using a logarithmic dampening curve:
structural_integrity = max(0, 1 - log10(1 + structural_issues) / log10(1 + structural_issue_threshold))
Where:
- `structural_issues` is the total count of structural defects detected.
- `structural_issue_threshold` represents the point where structural reliability collapses. Once an API has more than ~15 schema-breaking or integrity flaws, automated interpretation is no longer trustworthy.
The formula yields a smooth decay curve, prevents early collapse, but penalises structural issues more heavily than cosmetic issues.
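The dampening curve is straightforward to implement. The sketch below is a minimal, non-normative Python illustration assuming the default threshold of 15 noted above; the function and parameter names are illustrative.

```python
import math

def structural_integrity(structural_issues: int, threshold: int = 15) -> float:
    """Logarithmic dampening: early issues cost little; reliability collapses as the threshold nears."""
    damped = 1 - math.log10(1 + structural_issues) / math.log10(1 + threshold)
    return max(0.0, damped)

# Illustrative decay: 0 issues -> 1.0, 3 issues -> 0.5, 15+ issues -> 0.0
print(structural_integrity(0), structural_integrity(3), structural_integrity(15))
```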
A structural issue MUST be recorded when any of the following occur:
| Category | Examples |
|---|---|
| Invalid model shape | type: object but no properties defined; objects with additionalProperties: false but empty |
| Contradictory typing | type: string + format: int32; arrays using incompatible items definitions |
| Impossible constraints | minimum > maximum; exclusiveMinimum > exclusiveMaximum; enum values violating type, and contradictory schema constructs |
| Broken polymorphism | oneOf/anyOf/allOf inconsistent; missing or invalid discriminator; unreachable or contradictory sub-schemas |
| Response/request undefined | requestBody: {}; missing schemas; empty content definitions |
| Non-evaluable example | Examples that are invalid JSON, violate the declared schema, or contradict field constraints |
| Unresolvable or circular schema structures | Schemas that reference non-existent fields; recursive references without a valid base schema |
Dimension Score¶
FC = 100 × (spec_validity + resolution_completeness + lint_results + structural_integrity) / 4
Example¶
spec_validity: 1.0 # spec parsed successfully
resolution_completeness: 0.92 # 92% of refs resolved
lint_results: 0.85 # post-weighting
structural_integrity: 0.88 # schema irregularities
FC = 100 * (1.0 + 0.92 + 0.85 + 0.88) / 4
≈ 91.25
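For illustration, the worked example above can be reproduced with a short non-normative sketch; the equal-weight mean over the four normalised signals is as defined in this section, and the function name is illustrative.

```python
def foundational_compliance(spec_validity, resolution_completeness, lint_results, structural_integrity):
    """FC is the unweighted mean of its four normalised signals, scaled to 0-100."""
    signals = [spec_validity, resolution_completeness, lint_results, structural_integrity]
    return 100 * sum(signals) / len(signals)

print(foundational_compliance(1.0, 0.92, 0.85, 0.88))  # ≈ 91.25
```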
Developer Experience & Tooling Compatibility (DXJ)¶
DXJ evaluates clarity, documentation quality, example coverage, and compatibility with Jentic’s ingestion pipeline. An API with strong DXJ SHOULD be predictable and pleasant for both human developers and automated tooling.
Required Signals¶
| Signal | Description | Normalisation Rule |
|---|---|---|
example_density | MUST indicate presence of examples across eligible locations. | coverage |
example_validity | MUST show schema-conformance of examples. | coverage |
doc_clarity | MUST quantify linguistic clarity of summaries and descriptions. | min-max inverted |
response_coverage | MUST indicate presence of meaningful success and error responses. | coverage |
tooling_readiness | MUST measure ingestion/bundling health within Jentic pipelines. | inverse error |
Example Density (example_density)¶
example_density MUST measure coverage of examples across all eligible specification locations. It represents whether each location includes at least one example, and NOT how many examples are provided.
example_density = present_examples / expected_examples
Where:
- each eligible location contributes one (`1`) to `expected_examples`, regardless of whether the location supports both `example` and `examples` fields from an OpenAPI perspective.
- multiple examples defined inside an `examples` array MUST NOT increase `present_examples` beyond `1`.
If expected_examples = 0, the value MUST be 1.0.
Example Validity (example_validity)¶
example_validity = valid_examples / total_examples
Doc Clarity (doc_clarity)¶
doc_clarity = 1 - ((readability_score - 8) / (16 - 8))
Where readability_score ∈ [8, 16] (8 is easy to read, 16 would be legalese / hard to parse). See the readability_score definition in Appendix A: General Definitions.
Response Coverage (response_coverage)¶
Measures the percentage of operations that define complete response outcomes. Uses graded scoring per operation based on response categories present.
Formula:
response_coverage = sum(operation_response_coverage) / total_operations
Where:
Each operation receives an operation_response_coverage from 0.0 to 1.0:
- +0.25 if at least one 2XX response is defined
- +0.25 if at least one 4XX response is defined
- +0.25 if at least one 5XX response is defined
- +0.25 if default response is defined
If total_operations = 0, the value MUST be 1.0.
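A minimal, non-normative sketch of this graded scoring, assuming each operation's responses are available as a collection of status-code keys (e.g., "200", "404", "default"); the helper names are illustrative.

```python
def operation_response_coverage(response_codes) -> float:
    """Grade one operation: +0.25 each for a 2XX, a 4XX, a 5XX, and a default response."""
    codes = {str(code).lower() for code in response_codes}
    score = 0.0
    score += 0.25 if any(c.startswith("2") for c in codes) else 0.0
    score += 0.25 if any(c.startswith("4") for c in codes) else 0.0
    score += 0.25 if any(c.startswith("5") for c in codes) else 0.0
    score += 0.25 if "default" in codes else 0.0
    return score

def response_coverage(operations) -> float:
    """Average the per-operation grades; an API with no operations scores 1.0."""
    if not operations:
        return 1.0
    return sum(operation_response_coverage(op) for op in operations) / len(operations)

# One fully graded operation and one defining only a success response -> 0.625
print(response_coverage([["200", "404", "500", "default"], ["201"]]))
```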
Tooling Readiness (tooling_readiness)¶
tooling_readiness = max(0, 1 - (ingestion_errors / 15))
Tooling Readiness Threshold: Unlike structural integrity, tooling_readiness should not be treated as a correctness gate. It reflects how much effort is required for Jentic (or a developer) to prepare the API for downstream use in our tools. For independent implementers, this threshold should reflect the gates of the relevant target tooling. 15 is chosen as an initial threshold.
| Tooling Ingestion Issues | Score | Interpretation |
|---|---|---|
| 0 - 3 | 0.85 - 1.0 | Easily ingested |
| 4 - 8 | 0.6 - 0.8 | Cleanup recommended |
| 9 - 14 | 0.3 - 0.5 | High friction |
| 15+ | 0.0 | Cannot reliably ingest |
Dimension Score¶
DXJ = 100 × (example_density + example_validity + doc_clarity + response_coverage + tooling_readiness) / 5
Example¶
example_density: 0.66 # 2/3 expected example points populated
example_validity: 0.80 # valid & typed
doc_clarity: 0.75 # readable, non-legalistic
response_coverage: 0.70 # most methods cover non-2xx cases
tooling_readiness: 0.90 # imports cleanly
DXJ = 100 * (0.66 + 0.80 + 0.75 + 0.70 + 0.90) / 5
≈ 76.2
AI-Readiness & Agent Experience (ARAX)¶
ARAX evaluates whether an API is semantically interpretable by AI systems—specifically whether it provides enough contextual meaning for intelligent agents to infer intent, constraints, and expected behaviors. ARAX measures semantic clarity and expressiveness, including descriptive coverage, datatype specificity, and error semantics.
Required Signals¶
| Signal | Description | Normalisation Rule |
|---|---|---|
summary_coverage | MUST represent presence of concise summaries across specification objects with a summary field (e.g., operations/tags/info etc.). | coverage
description_coverage | MUST represent descriptive completeness across applicable API specification objects with a description field. | coverage |
type_specificity | MUST quantify richness of datatype modelling. | weighted categorical |
policy_presence | SHOULD represent inclusion of SLA/rate-limit/policy metadata. | coverage |
error_standardization | SHOULD favour structured error formats (RFC 9457/7807). | coverage |
opid_quality | MUST evaluate operationId coverage, uniqueness (with collision penalty), and casing consistency. | composite |
ai_semantic_surface | MAY provide bonus uplift for AI-oriented metadata. | bonus multiplier |
Summary Coverage (summary_coverage)¶
summary_coverage = summaries_present / summaries_expected
Where:
- `summaries_expected` MUST take into account every specification object with a `summary` fixed field.
Example¶
summary_coverage = 0.78
# 78% of operations/tags/info objects etc. include a summary field
Description Coverage (description_coverage)¶
description_coverage = described_elements / describable_elements
Where:
- `describable_elements` MUST take into account every specification object with a `description` fixed field.
Example¶
description_coverage = 0.82
# 82% of info objects, operations, schemas, parameters etc. include a description
Type Specificity (type_specificity)¶
type_specificity =
(
(1.0 × strong_types)
+ (0.75 × formatted_strings)
+ (0.50 × enum_fields)
+ (0.25 × weak_strings)
) / total_fields
This rewards APIs that model values semantically, not just as loosely typed strings.
Policy Presence (policy_presence)¶
policy_presence = operations_with_policy_metadata / total_operations
This helps AI reason about risk, performance, and operational constraints.
Error Standardisation (error_standardisation)¶
error_standardization = operations_using_RFC9457 / total_operations
This should also cater for RFC 7807, which was obsoleted by RFC 9457.
This is important for helping AI reason about failure modes, not just success paths.
OperationId Quality (opid_quality)¶
coverage = ops_with_operation_id / total_operations
uniqueness = unambiguous_operation_ids / ops_with_operation_id / total_collision_issues
casing_consistency = dominant_casing_count / ops_with_operation_id
opid_quality = coverage × uniqueness × casing_consistency
Where:
- `unambiguous_operation_ids` is the count of operationIds whose lowercase form appears exactly once (i.e., no case-insensitive duplicates)
- `total_collision_issues` is the total number of case-sensitive operationId collisions (e.g., getUser vs GetUser vs getUser would count as 2 collision issues)
- `dominant_casing_count` is the count of operationIds using the most common casing style
If total_operations = 0, then opid_quality MUST be 1.0. If total_collision_issues = 0, the collision divisor has no effect on uniqueness (i.e., it is treated as 1.0). If ops_with_operation_id = 0, then uniqueness = 1.0 and casing_consistency = 1.0.
Casing styles detected:
- camelCase, PascalCase, snake_case, kebab-case
- SCREAMING_SNAKE_CASE, lowercase, UPPERCASE
If multiple operations appear to offer a “getUser” capability, uniqueness drops and AI inference is harmed.
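A non-normative sketch of this composite follows. It assumes simple regex heuristics for casing detection (the specification names the styles but not their detection rules); all names and the example inputs are illustrative.

```python
import re
from collections import Counter

CASING_PATTERNS = {  # heuristic detectors; the spec names the styles but not exact rules
    "camelCase":            r"^[a-z]+(?:[A-Z][a-z0-9]*)+$",
    "PascalCase":           r"^(?:[A-Z][a-z0-9]*)+$",
    "snake_case":           r"^[a-z0-9]+(?:_[a-z0-9]+)+$",
    "kebab-case":           r"^[a-z0-9]+(?:-[a-z0-9]+)+$",
    "SCREAMING_SNAKE_CASE": r"^[A-Z0-9]+(?:_[A-Z0-9]+)+$",
    "lowercase":            r"^[a-z0-9]+$",
    "UPPERCASE":            r"^[A-Z0-9]+$",
}

def detect_casing(op_id: str) -> str:
    for style, pattern in CASING_PATTERNS.items():
        if re.match(pattern, op_id):
            return style
    return "other"

def opid_quality(operation_ids, total_operations: int) -> float:
    """coverage x uniqueness x casing_consistency, with the zero-count defaults defined above."""
    if total_operations == 0:
        return 1.0
    ops_with_id = len(operation_ids)
    if ops_with_id == 0:
        return 0.0  # coverage is 0; uniqueness and casing consistency default to 1.0
    coverage = ops_with_id / total_operations
    lowered = Counter(op_id.lower() for op_id in operation_ids)
    unambiguous = sum(1 for count in lowered.values() if count == 1)
    collision_issues = sum(count - 1 for count in lowered.values() if count > 1)
    uniqueness = 1.0 if collision_issues == 0 else unambiguous / ops_with_id / collision_issues
    dominant = Counter(detect_casing(op_id) for op_id in operation_ids).most_common(1)[0][1]
    casing_consistency = dominant / ops_with_id
    return coverage * uniqueness * casing_consistency

# 3 of 4 operations have ids; "getUser" collides case-insensitively with "GetUser"
print(round(opid_quality(["getUser", "listUsers", "GetUser"], 4), 3))  # ≈ 0.167
```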
AI Semantic Surface (ai_semantic_surface)¶
ai_semantic_surface_bonus = 1 + (0.05 * ai_hint_coverage)
Example¶
bonus = 1.015
# 30% of operations include hints like x-intent, workflows, Arazzo links
Dimension Score¶
core_arax = (summary_coverage + description_coverage + type_specificity + policy_presence + error_standardization + opid_quality) / 6
ARAX = 100 × core_arax × (1 + 0.05 × ai_semantic_surface)
Example¶
summary_coverage: 0.80 # 80% described
description_coverage: 0.82 # 82% described
type_specificity: 0.76 # formats & enums used meaningfully
policy_presence: 0.40 # limited SLA/policy metadata
error_standardization: 0.50 # only half define RFC9457 (or RFC7807)
opid_quality: 0.90 # coverage & uniqueness & consistency
ai_semantic_surface: 0.30 # minimal Arazzo/AI hints
core = (0.8 + 0.82 + 0.76 + 0.40 + 0.50 + 0.90) / 6
= 0.6966666667
bonus = 1 + (0.05 * 0.30) = 1.015
ARAX ≈ 100 * 0.6966666667 * 1.015
≈ 70.71
Agent Usability (AU)¶
Agent Usability evaluates whether autonomous agents can operate the API reliably, safely, and efficiently. The dimension measures operational composability: navigation, complexity, redundancy, safety, and tool-calling alignment.
Required Signals¶
| Signal | Description | Normalisation Rule |
|---|---|---|
complexity_comfort | Measures document size, endpoint density, and schema complexity, penalised using a logistic curve. | logistic shaping |
distinctiveness | MUST quantify semantic separation between operations. | inverse semantic similarity |
pagination | MUST represent the ratio of paginated GET resources. | coverage
hypermedia_support | HATEOAS/JSON:API/HAL affordances. | coverage |
intent_legibility | MUST represent verb-object semantic clarity. | similarity (LLM assisted) |
safety | MUST evaluate idempotency & sensitive operation protection. | heuristic penalty |
tool_calling_alignment | SHOULD represent alignment with LLM tool-calling expectations. | coverage |
navigation | MUST represent pagination & hypermedia affordances. | composite |
Complexity Comfort (complexity_comfort)¶
raw_complexity = 0.5 × normalised_endpoint_count
+ 0.5 × normalised_schema_depth
complexity_comfort = 1 / (1 + exp(6 × (raw_complexity - 0.45)))
Normalised Endpoint Count (normalised_endpoint_count)¶
A normalised measure of how large an API is relative to a typical operational complexity threshold.
API usability for agents degrades as endpoint count grows — but only after a certain point. A 3-endpoint API and a 12-endpoint API should not have radically different usability penalties. But a 200-endpoint platform should carry a greater complexity signal.
Formula:
normalised_endpoint_count = min(1, total_operations / endpoint_baseline)
Where:
- `total_operations` is the count of unique forward-callable operations (method + path pairs). Callbacks and webhooks MUST NOT be counted as endpoints.
- `endpoint_baseline` is set at `50`.
Normalised Schema Depth (normalised_schema_depth)¶
A normalised indicator of schema nesting and structure depth across all schemas.
Agent reasoning difficulty can be correlated to object depth, degree of polymorphism, and highly nested schemas.
Formula:
normalised_schema_depth = min(1, max_schema_depth / depth_baseline)
Where:
- `max_schema_depth` is the deepest nesting found across all schemas
- `depth_baseline` is set at `8`.
Schemas referenced by callbacks/webhooks MUST be included in normalised_schema_depth, because they contribute to the overall semantic model complexity.
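A minimal, non-normative sketch combining the two normalised inputs with the logistic shaping defined above; the baselines shown are the defaults from this section and the names are illustrative.

```python
import math

def complexity_comfort(total_operations: int, max_schema_depth: int,
                       endpoint_baseline: int = 50, depth_baseline: int = 8) -> float:
    """Logistic shaping: small, shallow APIs stay comfortable; penalties grow smoothly with size and nesting."""
    normalised_endpoint_count = min(1.0, total_operations / endpoint_baseline)
    normalised_schema_depth = min(1.0, max_schema_depth / depth_baseline)
    raw_complexity = 0.5 * normalised_endpoint_count + 0.5 * normalised_schema_depth
    return 1 / (1 + math.exp(6 * (raw_complexity - 0.45)))

# 20 operations with schemas nested 4 levels deep sits at the neutral midpoint (~0.5)
print(round(complexity_comfort(20, 4), 3))
```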
Distinctiveness (distinctiveness)¶
distinctiveness = 1 - avg_semantic_similarity
Pagination (pagination)¶
pagination = paginated_collection_GETs / collection_GETs
Hypermedia Support (hypermedia_support)¶
hypermedia_support = hypermedia_responses / total_navigable_responses
Navigation (navigation)¶
navigation_readiness = (0.6 × pagination) + (0.4 × hypermedia_support)
navigation = navigation_readiness × (1 + 0.03 × links_coverage)
Example¶
navigation = 0.443 # usable navigation with modest hint enrichment
pagination = 0.60 # 60% of eligible list endpoints are paginated
hypermedia_support = 0.20 # HAL/JSON:API/HATEOAS limited
links_bonus = 1.0075 # 25% of responses include link relations
Intent Legibility (intent_legibility)¶
intent_legibility = mean_semantic_alignment
Distinctiveness is simply the opposite of redundancy: if multiple endpoints do the same thing, the API becomes harder for AI to choose between correctly.
For distinctiveness, operation-to-operation comparison uses embedding-based semantic similarity (cosine similarity), which captures meaning rather than string matching. Levenshtein distance MAY be used as a fallback for very small APIs (but likely not too valuable).
For intent_legibility, each operation is compared to a set of canonical verb-object pairs (e.g., create-resource, list-resources), ensuring that agent routing matches human-intended meaning.
Safety (safety)¶
safety = ((idempotent_correctness + sensitive_ops_protected) / (2 * total_operations))
Tool Calling Alignment (tool_calling_alignment)¶
tool_calling_alignment = operations_mappable_to_ai_tool_calls / total_operations
Dimension Score¶
# Pagination and hypermedia combine into `navigation` (with links bound [0,1]).
navigation_readiness = 0.6 * pagination + 0.4 * hypermedia_support
navigation = navigation_readiness * (1 + 0.03 * links_coverage)
AU = 100 × (complexity_comfort + distinctiveness + navigation + intent_legibility + safety + tool_calling_alignment) / 6
Example¶
complexity_comfort: 0.74 # logistic-shaped
distinctiveness: 0.71 # low semantic collision
pagination: 0.60 # most list GETs paginated
hypermedia_support: 0.20 # few HAL/JSON:API affordances
links_coverage: 0.25 # some link relations present
intent_legibility: 0.80
safety: 0.68
tool_calling_alignment: 0.72
navigation_readiness = (0.6 * 0.60) + (0.4 * 0.20)
= 0.44
links_bonus = 1 + (0.03 * 0.25) = 1.0075
navigation = 0.44 * 1.0075 ≈ 0.443
AU = 100 * (0.74 + 0.71 + 0.443 + 0.80 + 0.68 + 0.72) / 6
≈ 68.2
Security (SEC)¶
Security evaluates trustworthiness, authentication strength, and operational risk posture in the context of agent-based automation. This dimension measures authentication coverage, secret hygiene, transport security, sensitive data handling, and OWASP risk.
Required Signals¶
| Signal | Description | Normalisation Rule |
|---|---|---|
auth_coverage | Evaluates whether authentication is correctly applied to sensitive or modifying operations, using intent-aware heuristics. | heuristic penalty |
auth_strength | MUST evaluate strength of security schemes. | weighted categorical |
transport_security | MUST require HTTPS for externally exposed hosts. | binary |
secret_hygiene | MUST detect and penalise hardcoded credentials. | binary |
sensitive_handling | MUST score protection of PII/sensitive fields. | coverage ratio |
owasp_posture | SHOULD reflect severity-weighted risk findings. | severity weighted inverse with dampening |
Auth Coverage (auth_coverage)¶
auth_coverage = protected_sensitive_ops / sensitive_ops_expected
If sensitive_ops_expected = 0, then auth_coverage MUST be 1.0.
Sensitive Operation Determination (sensitive_ops_expected)¶
sensitive_ops_expected represents the count of operations that ought to require authentication. This value is NOT the same as “operations that declare security”; it reflects intent-aware inference of security requirements.
As a guiding principle, an operation SHOULD be classified as a sensitive operation if any of the following are true:
- it performs a state changing action
- it uses HTTP methods such as `POST`, `PUT`, `PATCH`, `DELETE`
- it has summaries/descriptions which suggest state change (e.g., "approve", "update", "assign", "create", "cancel"), even if the HTTP verb is misused
- it accesses or returns sensitive or personal data (customer records, user profiles, payment data, or any OpenAPI Schema Object containing detected PII fields)
- it performs privileged or administrative actions
- it exposes operational or system-level behaviours (configuration management details, system logs, workflow executions)
LLM reasoning MAY be used to help perform classification.
Auth Strength (auth_strength)¶
The auth_strength signal measures the robustness and correctness of authentication mechanisms declared by the API. It evaluates the average strength of all security schemes using normative scores based on IANA auth-scheme definitions, OAuth2 best practices, OIDC, API Key placement, and mutual TLS.
Formula:
auth_strength = sum(strength_scores) / schemes_count
Where:
- `strength_scores` is the list of calculated scheme strengths
- `schemes_count` is the number of defined schemes
The following table outlines the auth_strength scoring weights:
| Scheme Type | Description | Example | Strength | Rationale |
|---|---|---|---|---|
none | No authentication mechanism | no security: block | 0.00 | Unsafe for sensitive APIs; permitted only when sensitive_ops_expected = 0. |
http / basic | Base64 user:pass | scheme: basic | 0.10 | Plaintext credentials; easily leaked (RFC7617). |
http / oauth | OAuth 1.0 | scheme: oauth | 0.20 | Deprecated; insecure signature model (RFC5849). |
http / digest | Digest Access Auth | scheme: digest | 0.20 | Outdated; limited protection (RFC7616). |
apiKey (query) | API key in query string | in: query | 0.15 | Very high leakage risk (logs, proxies, URLs). |
apiKey (header/cookie) | API key in header or cookie | in: header | 0.50 | Moderate security; lacks identity, scoping, or rotation controls. |
http / scram-sha-1 | SCRAM with SHA-1 | scheme: scram-sha-1 | 0.25 | Uses deprecated SHA-1 hashing (RFC7804). |
http / negotiate | Kerberos/NTLM | scheme: negotiate | 0.35 | Legacy; violates HTTP semantics (RFC4559). |
http / bearer (opaque) | Opaque bearer token | scheme: bearer | 0.60 | Security depends entirely on token distribution (RFC6750). |
http / vapid | WebPush VAPID | scheme: vapid | 0.60 | Token model similar to bearer; moderate trust (RFC8292). |
http / scram-sha-256 | SCRAM with SHA-256 | scheme: scram-sha-256 | 0.65 | Modern and stronger, still password-based (RFC7804). |
http / bearer (JWT) | Signed JWT bearer token | bearerFormat: JWT | 0.75 | Cryptographically verifiable claims; supports scopes. |
http / privatetoken | Privacy Pass | scheme: privatetoken | 0.75 | Strong privacy-preserving cryptographic identity (RFC9577). |
http / hoba | HTTP Origin-Bound Authentication | scheme: hoba | 0.80 | Asymmetric client-bound authentication (RFC7486). |
http / concealed | Concealed HTTP authentication | scheme: concealed | 0.85 | Modern, high-assurance privacy-preserving authentication (RFC9729). |
http / dpop | Demonstration of Proof-of-Possession | scheme: dpop | 0.90 | Prevents replay; binds token to client (RFC9449). |
http / gnap | GNAP framework | scheme: gnap | 0.90 | Modern alternative to OAuth 2.0 (RFC9635). |
http / mutual | HTTP Mutual Authentication | scheme: mutual | 0.95 | Cryptographically binding client/server identities (RFC8120). |
oauth2 / password | Resource Owner Password Credentials | flow: password | 0.30 | Deprecated; violates least-privilege; insecure. |
oauth2 / implicit | Browser implicit flow | flow: implicit | 0.35 | Deprecated; exposes tokens via redirects. |
oauth2 / clientCredentials | Server-to-server | flow: clientCredentials | 0.85 | Strong, scoped, recommended for machine-to-machine. |
oauth2 / authorizationCode (PKCE) | Best practice auth flow | flow: authorizationCode | 0.90 | Most secure OAuth2 flow; protects public clients. |
openIdConnect | OIDC Discovery + JWKs | type: openIdConnect | 1.00 | Gold-standard identity-bound access. |
mutualTLS | Client TLS certificates | type: mutualTLS | 1.00 | Hardware-backed identity; strongest available. |
If no security schemes are defined, auth_strength MUST return 1.0 (not applicable—no schemes to evaluate). This does not imply the API is secure; gating rules handle misconfigurations involving sensitive operations.
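A non-normative sketch of the averaging rule, using a partial weight map drawn from the table above (a complete implementation would cover every listed scheme and flow); the tuple keys are an illustrative representation of scheme type plus flow or placement.

```python
# Partial weight map copied from the strength table above; keys are illustrative.
AUTH_STRENGTH_WEIGHTS = {
    ("http", "basic"): 0.10,
    ("apiKey", "query"): 0.15,
    ("apiKey", "header"): 0.50,
    ("http", "bearer"): 0.60,
    ("http", "bearer-jwt"): 0.75,
    ("oauth2", "clientCredentials"): 0.85,
    ("oauth2", "authorizationCode"): 0.90,
    ("openIdConnect", None): 1.00,
    ("mutualTLS", None): 1.00,
}

def auth_strength(schemes) -> float:
    """Average the normative strength of each declared security scheme; 1.0 when none are declared."""
    if not schemes:
        return 1.0  # not applicable; gating handles unsecured sensitive operations
    scores = [AUTH_STRENGTH_WEIGHTS.get(scheme, 0.0) for scheme in schemes]
    return sum(scores) / len(scores)

# An API declaring a header API key and an authorization-code OAuth2 flow
print(auth_strength([("apiKey", "header"), ("oauth2", "authorizationCode")]))  # 0.7
```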
Transport Security (transport_security)¶
transport_security = secure_public_endpoints / public_endpoints
Transport security is evaluated only for endpoints intended to be externally reachable. Internal, localhost, cluster, or sandbox endpoints are excluded from the penalty. The score is based on whether externally exposed endpoints use HTTPS.
Public endpoints include:
- FQDN-based hosts (e.g., api.example.com)
- Partner-facing or customer-facing URLs
- Any endpoint not explicitly marked internal
Internal endpoints include:
- localhost / 127.0.0.*
- service mesh / `.internal` / `.cluster` hosts
- explicitly flagged `x-internal`, `dev`, `sandbox`, `mock`
Secret Hygiene (secret_hygiene)¶
secret_hygiene = 1 if no secrets embedded else 0
Sensitive Handling (sensitive_handling)¶
sensitive_handling = protected_pii_fields / detected_pii_fields
If no PII is detected, the value MUST be 1.0.
OWASP Posture (owasp_posture)¶
weighted_cost = (1.0 × critical) + (0.6 × errors) + (0.025 × warnings) + (0.005 × info)
owasp_posture = max(0, 1 - (sqrt(weighted_cost) / 5))
Dimension Score¶
Base Security¶
base_security = (auth_coverage + auth_strength + transport_security +
secret_hygiene + sensitive_handling + owasp_posture) / 6
Sensitivity Factor¶
Based on the intent of an operation / endpoint, we determine sensitivity_factor as:
- None/Low: 1.0
- Moderate: 0.9
- High: 0.75
Exposure Factor¶
Based on the intended audience or exposure of an endpoint, we determine exposure_factor as:
- internal: 1.0
- partner: 0.9
- public: 0.8
Scaled Security Score¶
security_scaled = base_security × sensitivity_factor × exposure_factor
Where:
- sensitivity_factor ∈ {1.00, 0.90, 0.75}
- exposure_factor ∈ {1.00, 0.90, 0.80}
Gating¶
Gating caps apply after scaling:
| Condition | Cap |
|---|---|
| Hardcoded credentials | ≤ 20 |
| Sensitive op w/o auth | ≤ 40 |
| PII unprotected & non-internal | ≤ 50 |
| Public HTTP, not HTTPS | ×0.8 multiplier |
Security Formula¶
SEC = 100 × security_final
# where security_final is security_scaled after capping rules
Example¶
auth_coverage: 0.85
auth_strength: 0.80
transport_security: 1.00 # all https
secret_hygiene: 1.00
sensitive_handling: 0.70
owasp_posture: 0.65
base_security = (0.85+0.80+1.00+1.00+0.70+0.65)/6 = 0.833
sensitivity_factor: 0.90 # moderate (profile data)
exposure_factor: 0.80 # public API
SEC = 100 * 0.833 * 0.90 * 0.80
≈ 60.0
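The scaling and capping behaviour can be sketched as below. This is non-normative: the boolean flag names are illustrative, and the caps and ×0.8 HTTP multiplier follow the table in this section.

```python
def security_score(base_security: float, sensitivity_factor: float, exposure_factor: float,
                   hardcoded_credentials: bool = False,
                   sensitive_op_without_auth: bool = False,
                   unprotected_pii_external: bool = False,
                   public_http: bool = False) -> float:
    """Scale base security by context, then apply the gating caps above (most restrictive wins)."""
    scaled = 100 * base_security * sensitivity_factor * exposure_factor
    caps = []
    if hardcoded_credentials:
        caps.append(20)
    if sensitive_op_without_auth:
        caps.append(40)
    if unprotected_pii_external:
        caps.append(50)
    if caps:
        scaled = min(scaled, min(caps))
    if public_http:
        scaled *= 0.8
    return scaled

# Reproduces the worked example: base 0.833, moderate sensitivity, public exposure, no gating hits
print(round(security_score(0.833, 0.90, 0.80), 1))  # ≈ 60.0
```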
AI Discoverability (AID)¶
AID evaluates how easily AI systems can locate, classify, and route to the API across registries, workflows, and knowledge bases.
The scoring framework does NOT hide unsafe APIs, but we apply a risk-aware discount so that high risk reduces discoverability rather than erasing it.
Required Signals¶
| Signal | Description | Normalisation Rule |
|---|---|---|
descriptive_richness | MUST assess depth and clarity of descriptions. | coverage with semantic weights |
intent_phrasing | MUST evaluate verb-object clarity of summaries and descriptions. | semantic similarity (LLM assisted) |
workflow_context | MAY include Arazzo/MCP/workflow references. | coverage |
registry_signals | SHOULD detect llms.txt, APIs.json, MCP, externalDocs. | coverage (multi-indicator) |
domain_tagging | SHOULD detect domain/taxonomy classification. | coverage |
Descriptive Richness (descriptive_richness)¶
The descriptive_richness signal evaluates the semantic value of textual descriptions within an API description. It measures whether descriptions are sufficiently clear and detailed for AI systems to infer purpose, behaviour, and domain context.
Applies to all describable elements, including but not limited to:
info.descriptioninfo.summary- operation-level
summaryanddescription - parameter, header, and response descriptions
- schema descriptions
Elements MAY be excluded if they cannot reasonably carry semantic meaning (e.g., empty marker schemas).
descriptive_richness = Σ(element_descriptive_score) / (2 × number_of_describable_elements)
# Each describable element MAY earn up to 2 points (1 for clarity + 1 for depth)
Where: element_descriptive_score = clarity_score + depth_score
The final value MUST be normalised to the range [0, 1].
Clarity Score¶
The clarity_score evaluates how understandable and readable a description is.
| Level | Description | Score |
|---|---|---|
| High clarity | Clear, direct, specific wording; purpose-first phrasing; low cognitive load | 1.0 |
| Moderate clarity | Understandable but verbose, generic, or weakly phrased | 0.5 |
| Low clarity | Boilerplate text, legalese, placeholders, or content likely to confuse AI systems | 0.0 |
Clarity MAY be evaluated using an LLM-based classifier. Flesch–Kincaid or equivalent readability indicators SHOULD be considered as supporting signals but MUST NOT be used alone.
Descriptions containing placeholder text (e.g., “foo”, “Lorem ipsum”, or obviously templated strings) MUST receive a clarity score of 0.0.
Depth Score¶
The depth_score evaluates the degree of semantic specificity and operational detail. Descriptions providing actionable content SHALL receive higher depth scores.
| Level | Description | Score |
|---|---|---|
| High depth | Contains domain cues AND behavioural or structural detail | 1.0 |
| Medium depth | Contains domain cues but minimal behavioural detail | 0.5 |
| Low depth | Generic text without meaningful semantics or context | 0.0 |
Depth SHOULD consider:
- domain-specific terminology (e.g., “booking segment”, “AML profile”)
- entity relationships
- state constraints and lifecycle behaviour
- when/why an operation SHOULD be called
- key field semantics or constraints
Descriptions that merely restate a field name or summary MUST be scored at 0.0.
Intent Phrasing (intent_phrasing)¶
...To be defined...
Workflow Context (workflow_context)¶
workflow_context = operations_with_workflow_refs / total_operations
Registry Signals (registry_signals)¶
Presence of machine-readable artifacts such as:
- llms.txt
- APIs.json
- API Gateway registry metadata
- MCP registry metadata
- externalDocs link to developer portals etc.
registry_signals = present_indicators / total_indicators
Domain Tagging (domain_tagging)¶
domain_tagging = ops_with_domain_tags / total_operations
Dimension Score¶
AID_raw = 100 × (descriptive_richness + intent_phrasing + workflow_context + registry_signals + domain_tagging) / 5
Soft Risk Discount¶
risk_index = exposure_weight × sensitivity_weight × (1 − base_security)
risk_discount = 1 − 0.5 × risk_index, clamped to [0.6, 1.0]
AID = AID_raw × risk_discount
Example¶
descriptive_richness: 0.80
intent_phrasing: 0.75
workflow_context: 0.20
registry_signals: 0.40
domain_tagging: 0.60
AID_raw = 100 * (0.80+0.75+0.20+0.40+0.60) / 5 ≈ 55.0
Security from above gave base_security ≈ 0.833;
exposure_weight: 1.0 # public
sensitivity_weight: 0.9
risk_index = 1.0 * 0.9 * (1 - 0.833) = 0.1503
risk_discount = 1 - (0.5 * 0.1503) ≈ 0.9248
AID ≈ 55.0 * 0.9248 ≈ 50.9
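A minimal, non-normative sketch of the soft discount, reproducing the example above; the function and parameter names are illustrative.

```python
def ai_discoverability(aid_raw: float, base_security: float,
                       exposure_weight: float, sensitivity_weight: float) -> float:
    """Soft risk discount: weak security reduces discoverability but never erases it (floor of 0.6)."""
    risk_index = exposure_weight * sensitivity_weight * (1 - base_security)
    risk_discount = max(0.6, min(1.0, 1 - 0.5 * risk_index))
    return aid_raw * risk_discount

print(round(ai_discoverability(55.0, 0.833, 1.0, 0.9), 1))  # ≈ 50.9
```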
Normalisation Rules¶
This section defines the global normalisation functions used throughout the JAIRF framework. All signals MUST be normalised into the range [0, 1] before dimensional aggregation. Unless otherwise noted, higher values represent better API quality or AI-readiness.
Normalisation ensures that heterogeneous signals (e.g., counts, proportions, categorical values, text-derived scores) can be consistently aggregated within and across dimensions.
Coverage Normalisation¶
Coverage is used when measuring presence versus absence of some feature (e.g., summaries, examples, pagination).
Formula:
coverage = present / expected
If there are zero expected occurrences, then:
coverage = 1.0
This prevents false penalties where the concept is not applicable.
Inverse Error Normalisation¶
Used when the presence of errors decreases quality (linting findings, structural issues, ingestion errors).
Formula:
inverse = max(0, 1 - (issue_count / threshold))
Notes:
- `threshold` represents the point where the score SHOULD drop to zero. Thresholds are dimension-specific.
- Once `issue_count ≥ threshold`, the signal is floored at 0.
Min–Max Inverted Normalisation¶
Applied when lower input values are better (e.g., readability burden).
Formula:
inverted = 1 - (x - min) / (max - min)
- If `x ≤ min`, then `inverted = 1.0`.
- If `x ≥ max`, then `inverted = 0.0`.
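These three rules reduce to small helpers. The sketch below is non-normative and the parameter names are illustrative; the example values echo earlier sections (23/25 resolved refs, 4 ingestion issues against a threshold of 15, a readability grade of 10 on the [8, 16] scale).

```python
def coverage(present: int, expected: int) -> float:
    """Coverage normalisation: ratio of present to expected, 1.0 when nothing is expected."""
    return 1.0 if expected == 0 else present / expected

def inverse_error(issue_count: int, threshold: int) -> float:
    """Inverse error normalisation: linear penalty, floored at 0 once the threshold is reached."""
    return max(0.0, 1 - issue_count / threshold)

def minmax_inverted(x: float, lo: float, hi: float) -> float:
    """Min-max inverted normalisation: lower raw values are better; clamped outside [lo, hi]."""
    if x <= lo:
        return 1.0
    if x >= hi:
        return 0.0
    return 1 - (x - lo) / (hi - lo)

print(coverage(23, 25), inverse_error(4, 15), minmax_inverted(10, 8, 16))
```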
Weighted Categorical Normalisation¶
Used for discrete categories that map to quality levels (e.g., security scheme strength).
Each category is assigned a weight in [0, 1]. This allows qualitative or enumerated factors to be converted into normalised numeric values.
Formula:
weighted = category_weight / max_weight
Where max_weight is the maximum weight in the category set.
Composite Signal Normalisation¶
Used when a signal is derived from multiple measurable sub-signals (e.g., operationId quality).
Formula:
composite = Σ(sub_score[i] × weight[i]) / Σ(weight[i])
# Example for illustration:
opid_quality = coverage × uniqueness × casing_consistency
All sub-scores MUST first be individually normalised into [0, 1].
Severity-Weighted Inverse¶
Applies weighted penalties for different severity levels.
Formula:
weighted_cost =
(1.0 × critical)
+ (0.6 × errors)
+ (0.025 × warnings)
+ (0.005 × info)
signal = max(0, 1 - (weighted_cost / max_cost))
Where max_cost is an upper bound chosen per dimension (e.g., 25 for foundational lint).
Logarithmic Dampening¶
Used where a smooth decline is preferred rather than linear penalty (e.g., structural complexity).
Formula:
log_dampened = 1 - ( logBaseN(1 + issues) / logBaseN(1 + threshold) )
Where:
- Base MAY be 10 or e, depending on implementation preference. Base 10 is RECOMMENDED for easier interpretation.
- Provides early penalty, slower collapse near threshold.
Semantic Similarity¶
similarity(i, j) ∈ [0, 1]
distinctiveness = 1 - similarity
Similarity is computed from a combination of:
- operationId
- summary
- description
- path + HTTP method
- optional LLM semantic embedding comparison
Weighted approach:
similarity(i, j) =
0.35 * embedding_similarity
+ 0.25 * opId_similarity
+ 0.20 * summary_similarity
+ 0.20 * path_similarity
Where similarity = cosine similarity or Levenshtein-based fallback for small APIs.
The evaluator MUST ensure:
- identical or near-identical meanings → similarity ≥ 0.85
- opposites or different purposes → similarity ≤ 0.20
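A non-normative sketch of the weighted blend, leaving the embedding comparison as an input and using a standard-library string matcher as a stand-in for the surface-level similarities; the field names and example operations are illustrative.

```python
from difflib import SequenceMatcher

def _string_similarity(a: str, b: str) -> float:
    """Lightweight string similarity stand-in; an embedding model is preferred in practice."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def operation_similarity(op_a: dict, op_b: dict, embedding_similarity: float) -> float:
    """Weighted blend of embedding and surface-level similarities, per the weights above."""
    return (0.35 * embedding_similarity
            + 0.25 * _string_similarity(op_a["operationId"], op_b["operationId"])
            + 0.20 * _string_similarity(op_a["summary"], op_b["summary"])
            + 0.20 * _string_similarity(op_a["path"], op_b["path"]))

a = {"operationId": "getUser", "summary": "Retrieve a user", "path": "GET /users/{id}"}
b = {"operationId": "listUsers", "summary": "List all users", "path": "GET /users"}
print(round(operation_similarity(a, b, embedding_similarity=0.55), 2))
```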
Logistic Shaping¶
Used to avoid over-penalising large APIs for complexity if they are well-structured.
Formula:
logistic = 1 / (1 + exp(k × (value - midpoint)))
- `k` controls steepness (recommended: 5–7).
- `midpoint` defines the neutrality point (recommended: ~0.4–0.5).
Binary Checks¶
Used when a single condition SHOULD set a score to exactly one or zero.
Examples:
- hardcoded credentials
- presence/absence of HTTPS
- presence of required auth
Formula:
binary = 1 if condition_passes else 0
Bonus Multipliers¶
Used when optional metadata adds value especially in terms of agent usability and AI-discoverability (e.g., hypermedia links, AI intent hints, registry presence).
Formula:
score = base_score * (1 + bonus_factor * coverage)
Where:
- `bonus_factor` typically ∈ [0.01, 0.10]
- `coverage` is a normalised [0, 1] value
Bonus MUST NOT push the signal above 1.0 after dimension aggregation.
Context Scaling¶
Used when sensitivity or exposure magnifies or attenuates risk.
Formula:
scaled = base_score × sensitivity_factor × exposure_factor
Where:
- sensitivity_factor ∈ {1.0, 0.9, 0.75}
- exposure_factor ∈ {1.0, 0.9, 0.8}
Heuristic Penalty Normalisation¶
Used when qualitative rule-based deductions apply (e.g., unsafe idempotency patterns).
Formula:
score = 1.0 - Σ(penalty[i] × severity_weight[i])
Clamped to [0, 1].
Example:
# Agent Usability- safety:
# An API begins with a score of 1.0, then receives deductions for risk findings:
# Missing auth on sensitive write: −0.15
# Non-idempotent PUT/DELETE: −0.10
# Overlapping unsafe verbs: −0.05
safety = 1 − (0.15 + 0.10 + 0.05) = 0.70
Soft Risk Discounts¶
Applied in Discoverability scoring when security posture SHOULD diminish visibility, not erase it.
Formula:
risk_discount = 1 - (0.5 × risk_index)
Clamped to [0.6, 1.0], to avoid total suppression.
Where:
risk_index = exposure_weight × sensitivity_weight × (1 - base_security)
Scoring Model & Formulae¶
This section NORMATIVELY defines how signals, dimensions, and the final AI-Readiness score MUST be computed.
Scoring Pipeline¶
The scoring process MUST proceed in the following order:
- Raw Measurements (unbounded counts, coverage, structural errors, semantic density, etc.)
- Normalised Signals (each MUST be converted to a value ∈ [0,1])
- Dimension Scores (each MUST be computed on a 0–100 scale)
- Weighted Aggregation (weighted harmonic mean MUST be used)
- Gating Rules (MUST be applied BEFORE readiness level classification)
- Final Readiness Score (0–100)
- Readiness Level Classification
Normalised Signals¶
JAIRF requires each signal to be normalised to a number on the interval [0,1]. Signal normalisation rules are defined in the Normalisation Rules section and MUST be applied consistently across all implementations.
A signal value of:
- `1.0` MUST represent optimal quality
- `0.0` MUST represent unusable, absent, contradictory, or failing input
Signals MUST NOT exceed the range [0,1] after normalisation.
Dimension Scoring¶
Each dimension score MUST be the arithmetic mean of its normalised signals:
DimensionScore = 100 × (Σ normalised_signals) / (count_of_signals)
Where:
- All signals MUST be normalised first
- All signals MUST contribute equally (v1.0 does NOT use per-signal weighting)
- Result MUST be clamped into the range [0,100]
Implementations MAY emit warnings if insufficient data is present (e.g., zero expected counts across multiple signals) but MUST NOT penalise a dimension for irrelevance.
Dimension Weights¶
The JAIRF final score MUST use the following weight distribution across the six dimensions:
| Dimension | Weight |
|---|---|
| Foundational Compliance (FC) | 0.16 |
| Developer Experience & Tooling Compatibility (DXJ) | 0.18 |
| AI-Readiness & Agent Experience (ARAX) | 0.24 |
| Agent Usability (AU) | 0.20 |
| Security (SEC) | 0.12 |
| AI Discoverability (AID) | 0.10 |
These weights MUST sum to 1.0.
Weighted Harmonic Aggregation¶
JAIRF uses a weighted harmonic mean, not an arithmetic mean. The harmonic mean MUST be used to enforce that weaknesses in one dimension CANNOT be offset by strengths in others.
Implementations MUST compute:
FinalScore = (Σ weights) / (Σ (weight / (dimensionScore + epsilon)))
Where:
- `epsilon` MUST be a small positive constant to avoid division-by-zero. Recommended: `epsilon = 0.000001`
- `dimensionScore` MUST be in the range [0,100] before inclusion in the harmonic calculation
- The final score MUST also be clamped to [0,100]
The harmonic mean MUST be considered core to the JAIRF model.
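A minimal, non-normative sketch of the aggregation, using the normative weight table and the recommended epsilon; the dimension scores shown are approximately the worked examples from earlier sections.

```python
JAIRF_WEIGHTS = {"FC": 0.16, "DXJ": 0.18, "ARAX": 0.24, "AU": 0.20, "SEC": 0.12, "AID": 0.10}
EPSILON = 0.000001

def final_score(dimension_scores: dict) -> float:
    """Weighted harmonic mean of the six dimension scores (each 0-100), clamped to [0, 100]."""
    numerator = sum(JAIRF_WEIGHTS.values())  # weights sum to 1.0 by definition
    denominator = sum(w / (dimension_scores[d] + EPSILON) for d, w in JAIRF_WEIGHTS.items())
    return max(0.0, min(100.0, numerator / denominator))

scores = {"FC": 91.25, "DXJ": 76.2, "ARAX": 70.71, "AU": 68.2, "SEC": 60.0, "AID": 50.9}
print(round(final_score(scores), 1))  # ≈ 69.4
```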
Gating Rules¶
Gating rules MUST override or constrain dimension scores to ensure safety and correctness. They MUST be applied immediately before readiness-level classification.
| Condition | Effect | Rationale |
|---|---|---|
| Foundational Compliance score < 40 | API MUST be classified as Level 0 ("Not Ready") | If the API cannot be structurally validated, no higher-order AI reasoning is safe or possible. |
| Hardcoded credentials detected | Security score MUST be capped at 20 | Hardcoded secrets represent an immediate, systemic security failure and cannot be compensated for by other strengths. |
| Sensitive operations lacking auth (internal) | Security score MUST be capped at 40 | Internal APIs may permit limited trust boundaries, but unauthenticated sensitive operations remain high-risk. |
| Sensitive operations lacking auth (partner) | Security score MUST be capped at 30 | Partner-facing APIs must enforce authentication on sensitive operations; failure is a severe but not catastrophic risk. |
| Sensitive operations lacking auth (public) | Security score MUST be capped at 20 | Public unauthenticated sensitive operations are critical vulnerabilities and must be treated as near-fail conditions. |
| Unprotected PII on partner/public APIs | Security score MUST be capped at 50 | Exposure of identifiable data without proper controls violates trust and regulatory expectations. |
| Non-TLS public endpoints (http://) | Security score MUST be multiplied by 0.5 | Plaintext transport exposes tokens, credentials, and PII; catastrophic for external integrations. |
Gating MUST NOT alter the raw signals or other dimension scores directly; gating applies only to the affected dimension score.
If multiple caps apply, the most restrictive MUST be applied.
Final Score & Readiness Levels¶
The final readiness score MUST be mapped to one of the following readiness levels.
| Final Score | Level | Name | Meaning |
|---|---|---|---|
| < 40 | Level 0 | Not Ready | Fundamentally unsuitable for AI or agents |
| 40–60 | Level 1 | Foundational | Developer-ready, partially AI-usable |
| 60–75 | Level 2 | AI-Aware | Semantically interpretable, safe for guided use |
| 75–90 | Level 3 | AI-Ready | Structurally rich, semantically clear, agent-friendly |
| 90+ | Level 4 | Agent-Optimized | Highly composable, predictable, automation-ready |
This table MUST be treated as normative. Scoring libraries MUST return both the numeric score and readiness level.
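A non-normative sketch of the classification. The table leaves boundary values ambiguous, so this sketch assumes half-open intervals (a score of exactly 60 maps to Level 2, and 90 to Level 4).

```python
def readiness_level(final_score: float) -> str:
    """Map a final readiness score (0-100) to its readiness level name."""
    if final_score < 40:
        return "Level 0 - Not Ready"
    if final_score < 60:
        return "Level 1 - Foundational"
    if final_score < 75:
        return "Level 2 - AI-Aware"
    if final_score < 90:
        return "Level 3 - AI-Ready"
    return "Level 4 - Agent-Optimized"

print(readiness_level(69.4))  # Level 2 - AI-Aware
```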
Grading Bands¶
Implementations MAY additionally compute a letter grade for UX/visualisation.
| Letter Grade | Score Range |
|---|---|
| A+ | 90–100 |
| A | 80–89 |
| A− | 70–79 |
| B+ | 67–69 |
| B | 63–66 |
| B− | 60–62 |
| C+ | 57–59 |
| C | 53–56 |
| C− | 50–52 |
| D+ | 47–49 |
| D | 43–46 |
| D− | 40–42 |
| F | < 40 |
Grades SHOULD NOT be used as substitutes for readiness levels.
Appendix A: General Definitions¶
| Term | Description |
|---|---|
| Normalisation | The process of converting raw measurements into a standard 0–1 scale so that different signals can be compared fairly. A normalised value of 1 represents “ideal” and 0 represents “unusable or absent.” |
| Signal | A measurable property inside a dimension (e.g., auth_coverage or summary_coverage). Dimensions are made of multiple signals, and each signal is normalised before aggregation. |
| Dimension | A thematic scoring category of API readiness (e.g., Foundational Compliance, AI-Readiness). Each dimension reflects a different aspect of what makes an API usable by AI systems. |
| Aggregation | The step where multiple normalised signals are mathematically combined into a single score. Aggregation happens first at the dimension level, and then across all dimensions for the final score. |
| Harmonic Mean | A type of average that penalises imbalanced scores. If one dimension is very weak, the harmonic mean reduces the overall score more than a standard average would. Used to ensure that no single “strong” category can mask a critical weakness in another. |
| Clamped or Bounded | When a value is restricted to remain within a minimum and maximum range. For example, risk_discount ∈ [0.6, 1.0] means it cannot go below 0.6 or above 1.0. |
| Weighted | A calculation where some values count more than others. Weights may allow emphasis (e.g., security > discovery). |
| Logistic Shaping | A smoothing technique that stops large APIs from being unfairly penalised simply because they have many endpoints. It scales penalties gradually rather than linearly. |
| Coverage | The percentage of “where this should exist” versus “where it actually exists.” For example: auth_coverage = operations_with_auth / ops_that_require_auth. |
| Penalty | A downward adjustment applied when a risk or deficiency is identified (e.g., missing pagination, weak auth, or unsafe error handling). |
| Bonus (or Uplift) | A small upward modifier applied when extra AI-friendly metadata is present (e.g., workflows, AI intent hints). |
| Readability Score | A measure of how easy text is to understand, based on approximate grade-level complexity (e.g., Flesch–Kincaid). Lower scores indicate simpler, more direct language. APIs targeting AI consumption benefit from the 8–12 range; >16 introduces interpretation risk for models. ≤ 8th grade: universally readable; 9–12: general technical audience (expected for API docs); >16: post-graduate, cognitively expensive, increases model misinterpretation risk. This score is determined by an LLM evaluating readability. |