Amit Siddhartha, Metapercept Technology Services LLP
May 15, 2025

A Path to Accuracy and Efficiency

As AI adoption accelerates across industries, the challenge shifts from model size to data quality. Unstructured and inconsistently formatted content introduces ambiguity, leading to hallucinations and inefficiencies in AI systems.

This article explores a structured approach to content curation and data enrichment using standards like DITA-XML, RDF, and OWL ontologies to build semantically rich datasets that are optimized for training and dynamic retrieval. By converting raw content into modular, typed DITA topics and enriching them with domain-specific metadata and relationships, enterprises can construct knowledge graphs that support accurate, explainable, and context-aware AI outputs.

We present real-world implementations across information technology, software, banking, financial compliance, MedTech research and compliance, and legal contract automation, where semantically enriched, intent-aligned content has improved retrieval accuracy, reduced content duplication, and enabled intelligent search using a combination of SPARQL queries, knowledge graphs, and vector-based retrieval models.

Technical implementation details include the use of Protégé for ontology design, Neo4j for knowledge graph storage, and the integration of structured content with Retrieval-Augmented Generation (RAG) pipelines for dynamic AI performance.

The paper demonstrates how semantic enrichment not only enhances content usability but also reduces computational cost and training time. This approach provides a scalable path to building trustworthy, domain-aligned AI systems driven by meaningfully structured knowledge.

Introduction

As the adoption of AI accelerates across industries, the focus has shifted from building large-scale models to curating high-quality training data. Structured semantic data—rich with contextual meaning, metadata, and defined relationships—enhances AI’s reasoning, retrieval accuracy, and explainability.

This article presents a pragmatic approach to transforming unstructured information into semantically enriched data using RDF standards and DITA-XML standards to support ontology-driven AI systems and reduce computational overhead.

Computational Impacts of Unstructured Input

Organizations store vast amounts of unstructured content in formats such as DOCX, PDF, email, and HTML. However, such data lacks clarity, consistency, and machine-understandable semantics. Training AI models directly on this content often leads to:

  • Hallucinations and incorrect outputs
  • Redundant or conflicting knowledge
  • High compute costs in retrieval and indexing
  • Inability to trace or explain AI decisions

Large Language Models (LLMs) perform better with structured input—especially content annotated with taxonomy and domain semantics. Without structuring and enrichment, these models operate in an ambiguous knowledge space.

Beyond accuracy concerns, unstructured content imposes a computational and financial burden.

  • Increased Processing Time
  • Higher GPU/CPU Utilization
  • Memory Overhead
  • Complexity in Querying

Example

In one contract management project, the organization was using an LLM-based assistant to extract clause-specific insights from thousands of agreements. Without content structuring, the model had to process entire documents each time, resulting in query response times of 30–40 seconds per file and GPU consumption spikes of over 60%. After converting contracts to modular DITA topics and mapping them to an ontology, response time dropped to under 5 seconds, with a 40% reduction in compute resource usage.

This clearly illustrates how structured, semantically enriched data not only improves AI accuracy but also reduces the total cost of ownership (TCO) for AI implementation.
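Much of the compute saving in the example above comes from pre-filtering: once contracts exist as modular, typed topics, only the clauses relevant to a query need to reach the model. The sketch below illustrates that idea in plain Python; the topic records and metadata field names are hypothetical, not the project's actual schema.

```python
# Hypothetical sketch: pre-filter modular topics by metadata so that only
# relevant clauses are sent to the LLM, instead of whole documents.

def select_clauses(topics, clause_type):
    """Return only topics whose metadata matches the requested clause type."""
    return [t for t in topics if t["metadata"].get("clause_type") == clause_type]

topics = [
    {"id": "t1", "metadata": {"clause_type": "indemnity"}, "body": "..."},
    {"id": "t2", "metadata": {"clause_type": "termination"}, "body": "..."},
    {"id": "t3", "metadata": {"clause_type": "indemnity"}, "body": "..."},
]

relevant = select_clauses(topics, "indemnity")
print([t["id"] for t in relevant])  # only the indemnity clauses reach the model
```

The filter runs on cheap metadata lookups rather than GPU inference, which is why the token volume, and therefore cost, drops so sharply.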

Structured Semantic Enrichment with DITA-XML and Ontologies

A robust solution begins by converting unstructured or semi-structured files into DITA-XML, a modular, topic-based authoring format. This standard introduces:

  • Topic typing (concept, task, reference)
  • Metadata tagging (audience, subject, keywords)
  • Conditional processing for reuse and variation
  • Content modularization for easy classification and reuse
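For illustration, a minimal DITA concept topic using these features might look like the following; the topic id, metadata values, and body text are invented for this sketch.

```
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
<concept id="indemnity-clause-overview">
  <title>Indemnity Clause Overview</title>
  <prolog>
    <metadata>
      <audience type="user" experiencelevel="expert"/>
      <keywords>
        <keyword>indemnity</keyword>
        <keyword>contract</keyword>
      </keywords>
    </metadata>
  </prolog>
  <conbody>
    <p>An indemnity clause allocates risk between contracting parties.</p>
  </conbody>
</concept>
```

The prolog metadata is what later enables conditional processing and metadata-driven retrieval without parsing the body text.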

The next layer involves enriching the content using RDF triples, OWL-based ontologies, and semantic labels. These define domain-specific relationships, such as “Regulation X applies to Clause Y” or “Procedure Z treats Condition A,” making data suitable for knowledge graph integration.

Technical Implementation Overview

The implementation involved a multi-layer architecture:

  1. Data Layer – Custom scripts + oXygen XML for transforming raw content to DITA.
  2. Enrichment Layer – RDF annotation with Protégé and SPARQL rule sets.
  3. DB Layer – Neo4j with domain-specific ontologies.
  4. Retrieval & AI Layer – Combined use of vector databases and symbolic graph traversal via SPARQL.
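At a high level, the layers behave as a pipeline, with each stage's output feeding the next. The toy sketch below mimics that flow in plain Python; each function is only a stand-in for the real tool named above (oXygen XML, Protégé, Neo4j, a vector database).

```python
# Illustrative pipeline sketch; data shapes and names are invented.

def to_dita(raw_doc):             # Data layer: raw content -> modular typed topic
    return {"topic": raw_doc.strip(), "type": "concept"}

def enrich(topic):                # Enrichment layer: attach RDF-style facts
    topic["triples"] = [(topic["topic"], "hasType", topic["type"])]
    return topic

def store(topic, graph_db):       # DB layer: persist triples in the graph store
    graph_db.extend(topic["triples"])
    return graph_db

def retrieve(graph_db, subject):  # Retrieval layer: symbolic lookup by subject
    return [t for t in graph_db if t[0] == subject]

graph = []
store(enrich(to_dita("  Indemnity clause  ")), graph)
print(retrieve(graph, "Indemnity clause"))
# -> [('Indemnity clause', 'hasType', 'concept')]
```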


Real-World Case Studies

Case Study 1. Financial Compliance AI System

Legacy content from FASB and IFRS was converted into DITA-XML with rich domain-specific labels (e.g., assets, provisions, disclosures). A Neo4j-based ontology mapped the compliance documents, cutting content turnaround time from one month to seven business days and conversion costs by 50%.

Case Study 2. MedTech R&D – Semantic PubMed Retrieval

Medical literature from PubMed was processed into DITA topics and enriched with MeSH (Medical Subject Headings) terms. A biomedical knowledge graph was constructed to power smart search over drug-disease-treatment relationships, aiding clinical research teams.

Case Study 3. Contract Lifecycle Automation

A contract management firm transitioned its legacy clauses to DITA, labelled by contract type, risk factor, and action. SPARQL-enabled clause retrieval allowed the AI to recommend compliant or risk-optimized content, streamlining legal reviews. Here is a sample ontology overview for content categorization and query logic.


Benefits and Key Takeaways

  • Improved AI Accuracy: Semantic data reduces ambiguity and hallucination in AI outputs.
  • Traceable and Explainable AI: Ontologies offer logical structure and reasoning paths.
  • Reduced Computing Overhead: Well-labelled data allows efficient pre-filtering and targeting.
  • Domain Alignment: Semantic tags map content directly to enterprise taxonomy.
  • Dynamic AI Retrieval: Combining Vector DB + Knowledge Graph enables intent-based content answers.
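The last point, combining a vector database with a knowledge graph, can be sketched in a few lines: rank candidates by embedding similarity, but only among items the graph links to the user's intent. Everything here (the toy two-dimensional embeddings, the allow-list) is illustrative.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# Toy embeddings standing in for a vector database.
docs = {
    "clause-1": [1.0, 0.0],
    "clause-2": [0.9, 0.1],
    "clause-3": [0.0, 1.0],
}
# Clauses the knowledge graph links to the user's intent (illustrative).
graph_allowed = {"clause-2", "clause-3"}

query_vec = [1.0, 0.0]
ranked = sorted(
    (d for d in docs if d in graph_allowed),
    key=lambda d: cosine(query_vec, docs[d]),
    reverse=True,
)
print(ranked)  # clause-2 first: highest similarity among graph-permitted clauses
```

The graph constraint is what makes the answer intent-based rather than merely similar: clause-1 is the closest vector overall, but it is excluded because the graph does not connect it to the query's intent.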

Cost estimation for AI compute on unstructured vs. structured (semantic) content

Conclusion

AI is only as smart as the data it’s trained on. Enterprises must invest in structured, semantic-rich content pipelines before building or scaling their AI solutions. By leveraging DITA-XML for modularity, RDF and OWL for enrichment, and Neo4j for knowledge storage, organizations can build systems that are smarter, faster, and more aligned with human understanding. These foundations empower LLMs and SLMs to retrieve, reason, and respond with greater reliability and explainability—turning static documents into actionable knowledge.

Investing in semantic structuring of content enables explainable, cost-effective AI that retrieves information by meaning—not just keywords.

Connect with Amit Siddhartha on LinkedIn – https://www.linkedin.com/in/amit-siddhartha-28a7b38/
Amit is a seasoned expert in technical documentation, content architecture, and AI-driven knowledge systems. As the CEO of Metapercept Technology Services LLP and CTO of DITAXPRESSO, he leads enterprise transformation projects using DITA-XML, ontology engineering, and semantic enrichment for AI and LLMs. He is also a professional member of IEEE and ACM, contributing to research in AI content modelling and retrieval.