Building Intelligent AI Systems with Structured Semantic Data
Amit Siddhartha
May 15, 2025
A Path to Accuracy and Efficiency
As AI adoption accelerates across industries, the challenge shifts from model size to data quality. Unstructured and inconsistently formatted content introduces ambiguity, leading to hallucinations and inefficiencies in AI systems.
This article explores a structured approach to content curation and data enrichment using standards like DITA-XML, RDF, and OWL ontologies to build semantically rich datasets that are optimized for training and dynamic retrieval. By converting raw content into modular, typed DITA topics and enriching them with domain-specific metadata and relationships, enterprises can construct knowledge graphs that support accurate, explainable, and context-aware AI outputs.
We present real-world implementations across information technology, software, banking, financial compliance, MedTech research and compliance, and legal contract automation, where semantically enriched, intent-aligned content has improved retrieval accuracy, reduced content duplication, and enabled intelligent search using a combination of SPARQL queries, knowledge graphs, and vector-based retrieval models.
Technical implementation details include the use of Protégé for ontology design, Neo4j for knowledge graph storage, and the integration of structured content with Retrieval-Augmented Generation (RAG) pipelines for dynamic retrieval at inference time.
This article demonstrates how semantic enrichment not only enhances content usability but also reduces computational cost and training time. This approach provides a scalable path to building trustworthy, domain-aligned AI systems driven by meaningfully structured knowledge.
Introduction
As the adoption of AI accelerates across industries, the focus has shifted from building large-scale models to curating high-quality training data. Structured semantic data—rich with contextual meaning, metadata, and defined relationships—enhances AI’s reasoning, retrieval accuracy, and explainability.
This article presents a pragmatic approach to transforming unstructured information into semantically enriched data using DITA-XML and RDF standards to support ontology-driven AI systems and reduce computational overhead.
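To make "semantically enriched" concrete, the short sketch below uses the open-source rdflib library to express one modular topic as RDF triples carrying domain metadata and an explicit relationship. The namespace, class, and property names are illustrative assumptions rather than a published ontology; they simply show the kind of machine-readable semantics the rest of this article relies on.

```python
# Minimal sketch: the ex: namespace, class names, and property names below are
# assumptions for illustration, not part of any published ontology.
from rdflib import Graph, Literal, Namespace, RDF, RDFS

EX = Namespace("http://example.org/contracts#")  # illustrative namespace

g = Graph()
g.bind("ex", EX)

topic = EX["clause-termination-001"]             # one modular DITA topic
g.add((topic, RDF.type, EX.TerminationClause))   # typed against a domain ontology class
g.add((topic, RDFS.label, Literal("Termination for convenience")))
g.add((topic, EX.appliesToJurisdiction, Literal("US")))
g.add((topic, EX.partOfAgreement, EX["msa-2024-017"]))  # explicit relationship to its source agreement

# Serialize to Turtle so the triples can be inspected or loaded into a triple store
print(g.serialize(format="turtle"))
```

Even this tiny graph gives a downstream system something a DOCX paragraph cannot: a typed topic, a labeled relationship, and metadata it can filter and reason over.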
Computational Impacts of Unstructured Input
Organizations store vast amounts of unstructured content in formats such as DOCX, PDF, email, and HTML. However, such data lacks clarity, consistency, and machine-understandable semantics. Training AI models directly on this content often leads to:
- Hallucinations and incorrect outputs
- Redundant or conflicting knowledge
- High compute costs in retrieval and indexing
- Inability to trace or explain AI decisions
Large Language Models (LLMs) perform better with structured input, especially content annotated with taxonomy terms and domain semantics. Without structuring and enrichment, these models operate in an ambiguous knowledge space.
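As a simple illustration of why such annotation helps, the hedged sketch below contrasts annotated chunks with raw text: when every chunk carries a topic type and taxonomy terms, the retrieval step can hand the model only the content that matches the user's intent. The field names and taxonomy values are assumptions made up for this example.

```python
# Hedged sketch: field names and taxonomy values are assumptions chosen only to
# contrast annotated chunks with an undifferentiated wall of text.
from typing import Dict, List


def select_context(chunks: List[Dict], topic_type: str, domain: str) -> List[str]:
    """Return only the bodies of chunks whose metadata matches the query intent,
    instead of passing the model everything and hoping it finds the answer."""
    return [
        c["body"]
        for c in chunks
        if c["topic_type"] == topic_type and domain in c["taxonomy"]
    ]


chunks = [
    {"topic_type": "reference", "taxonomy": ["banking", "kyc"], "body": "KYC document checklist ..."},
    {"topic_type": "task", "taxonomy": ["banking", "onboarding"], "body": "How to open an account ..."},
]

# Only the KYC reference topic is returned as context for a KYC-related query
print(select_context(chunks, topic_type="reference", domain="kyc"))
```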
Beyond accuracy concerns, unstructured content imposes a computational and financial burden.
- Increased processing time
- Higher GPU/CPU utilization
- Memory overhead
- Complexity in querying
Example
In one contract management project, the organization was using an LLM-based assistant to extract clause-specific insights from thousands of agreements. Without content structuring, the model had to process entire documents for each query, resulting in response times of 30–40 seconds per file and GPU consumption spikes of over 60%. After the contracts were converted to modular DITA topics and mapped to an ontology, response time dropped to under 5 seconds, with a 40% reduction in compute resource usage.
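As a rough sketch of what the post-conversion retrieval step can look like, the example below queries a Neo4j knowledge graph through its official Python driver and returns only the clause topics of the requested type, so the model receives a handful of focused DITA topics instead of the whole agreement. The node labels, relationship type, property names, and credentials are illustrative assumptions, not the project's actual schema.

```python
# Illustrative sketch only: node labels, relationship types, and property names
# stand in for the project's ontology-mapped graph schema.
from neo4j import GraphDatabase

CYPHER = """
MATCH (c:Contract {id: $contract_id})-[:HAS_CLAUSE]->(t:ClauseTopic {clauseType: $clause_type})
RETURN t.title AS title, t.body AS body
"""


def fetch_clause_topics(uri: str, auth: tuple, contract_id: str, clause_type: str):
    """Fetch only the clause topics relevant to the query, so the LLM prompt
    contains a few focused DITA topics rather than the entire agreement."""
    driver = GraphDatabase.driver(uri, auth=auth)
    try:
        with driver.session() as session:
            result = session.run(CYPHER, contract_id=contract_id, clause_type=clause_type)
            return [{"title": r["title"], "body": r["body"]} for r in result]
    finally:
        driver.close()


# Example call with placeholder connection details:
# topics = fetch_clause_topics("bolt://localhost:7687", ("neo4j", "password"),
#                              contract_id="msa-2024-017", clause_type="Termination")
```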
This example clearly illustrates how structured, semantically enriched data not only improves AI accuracy but also reduces the total cost of ownership […]