Data Governance — Datasets & Art. 10 Compliance
Article 10 of the EU AI Act establishes strict requirements for the data used to train, validate, and test high-risk AI systems. Providers must implement appropriate data governance and management practices, examine datasets for possible biases, ensure that datasets are relevant and sufficiently representative, and comply with personal data protection requirements. The Data Governance module in Venvera provides a comprehensive registry for documenting datasets, tracking their quality, assessing bias, and managing privacy considerations. This article covers every feature of the module.
List View
Navigate to EU AI Act → Datasets from the sidebar. The list view displays all registered datasets in a paginated table, sorted by last-updated date. Each row shows the dataset name, linked AI system, type badge, size, quality score, bias assessment status, and whether the dataset contains personal data.
The search bar filters datasets by name and description using case-insensitive partial matching. This is useful for finding datasets related to specific AI systems, domains, or data types. For example, searching "customer" will match "Customer Transaction Training Data" and "Customer Feedback Validation Set".
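In effect, the search is a case-insensitive substring test over the two text fields. The sketch below illustrates the matching behaviour only; the field names are assumptions, not Venvera's implementation.

```python
def matches(dataset: dict, query: str) -> bool:
    """Case-insensitive partial match over name and description."""
    q = query.lower()
    return (q in dataset.get("name", "").lower()
            or q in dataset.get("description", "").lower())

datasets = [
    {"name": "Customer Transaction Training Data", "description": ""},
    {"name": "Customer Feedback Validation Set", "description": ""},
    {"name": "Sensor Calibration Testing Set", "description": ""},
]
print([d["name"] for d in datasets if matches(d, "customer")])
# ['Customer Transaction Training Data', 'Customer Feedback Validation Set']
```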
Use the Type dropdown to filter by dataset type:
- Training — Data used to train the AI model. This is typically the largest dataset and the most critical for Art. 10 compliance. Training data must be relevant, sufficiently representative, and, to the best extent possible, free of errors and complete.
- Validation — Data used to tune hyperparameters and evaluate model performance during development. Validation data should be kept independent of the training data to avoid overfitting.
- Testing — Data used for final evaluation of model performance before deployment. Testing data should be representative of real-world conditions and must not have been used during training or validation.
- Operational — Data processed by the AI system during its live operation. Tracking operational data is important for post-market monitoring and detecting data drift.
Use the Bias Assessment filter to identify datasets based on the state of their bias assessment:
- Not Started — No bias assessment has been initiated for this dataset. Datasets used by high-risk AI systems should undergo a bias assessment before deployment.
- In Progress — A bias assessment is underway. The assessor has begun documenting findings but has not yet completed the evaluation.
- Completed — The bias assessment has been completed and documented. The dataset record includes the assessment findings and any mitigation actions taken.
Creating a New Dataset
Click + Add Dataset to open the creation form. Complete the fields described below; only Name is required. A sketch of the resulting record follows the table.
| Field | Type | Required | Description |
|---|---|---|---|
| Name | Text input | Required | A descriptive name for the dataset. The name should clearly identify the data content, the AI system it relates to, and its purpose. For example: "Credit Risk Model — Training Dataset v3.2 (2024 Customer Transactions)". Maximum 300 characters. Names should be unique within the same AI system to avoid confusion. |
| Description | Textarea | Optional | A detailed description of the dataset contents, provenance, and characteristics. Include information about the data sources, the time period covered, the number of records, the feature set, and any preprocessing steps applied. This description serves as the primary reference for anyone reviewing the dataset and is included in technical documentation exports. Maximum 5,000 characters. |
| AI System | Dropdown | Optional | Select the AI system that uses this dataset. The dropdown lists all AI systems in your inventory. Linking a dataset to an AI system enables the system detail page to show related datasets and supports completeness tracking. A dataset can be linked to one AI system; if the same data is used by multiple systems, create separate dataset records for each. |
| Type | Dropdown | Optional | Select the dataset type: Training, Validation, Testing, or Operational. The type determines the compliance checks and recommendations that apply to this dataset. Training datasets trigger the most comprehensive requirements under Art. 10. |
| Size | Text input | Optional | The size of the dataset in a human-readable format (e.g., "2.4 GB", "1.2 million records", "500,000 rows × 42 features"). This is a free-text field that accommodates whatever size description is most meaningful for your context. Including both the storage size and the record count is recommended. |
| Personal Data | Checkbox | Optional | Check this box if the dataset contains personal data as defined by the GDPR (Regulation (EU) 2016/679). When checked, additional privacy-related fields become relevant, including the GDPR Legal Basis. Datasets containing personal data are subject to additional scrutiny under both the EU AI Act and the GDPR, and must be handled with particular care regarding data minimisation, purpose limitation, and storage limitation. |
| GDPR Legal Basis | Dropdown | Optional | If the dataset contains personal data, select the GDPR legal basis for processing under Art. 6(1) GDPR: Consent (Art. 6(1)(a)), Contract (Art. 6(1)(b)), Legal Obligation (Art. 6(1)(c)), Vital Interests (Art. 6(1)(d)), Public Task (Art. 6(1)(e)), or Legitimate Interests (Art. 6(1)(f)). The selected basis feeds the data protection considerations shown on the dataset detail page. |
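Taken together, the form fields map onto a record shaped roughly like the following. This is a hypothetical sketch of the data model for orientation; Venvera's actual field names, enums, and storage types may differ.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class DatasetType(Enum):
    TRAINING = "training"
    VALIDATION = "validation"
    TESTING = "testing"
    OPERATIONAL = "operational"

class GDPRLegalBasis(Enum):                        # Art. 6(1) GDPR
    CONSENT = "consent"                            # Art. 6(1)(a)
    CONTRACT = "contract"                          # Art. 6(1)(b)
    LEGAL_OBLIGATION = "legal_obligation"          # Art. 6(1)(c)
    VITAL_INTERESTS = "vital_interests"            # Art. 6(1)(d)
    PUBLIC_TASK = "public_task"                    # Art. 6(1)(e)
    LEGITIMATE_INTERESTS = "legitimate_interests"  # Art. 6(1)(f)

class BiasStatus(Enum):
    NOT_STARTED = "not_started"
    IN_PROGRESS = "in_progress"
    COMPLETED = "completed"

@dataclass
class DatasetRecord:
    name: str                                   # required, max 300 chars
    description: str = ""                       # optional, max 5,000 chars
    ai_system_id: Optional[str] = None          # one linked AI system per record
    dataset_type: Optional[DatasetType] = None
    size: str = ""                              # free text, e.g. "2.4 GB, 1.2M records"
    personal_data: bool = False
    gdpr_legal_basis: Optional[GDPRLegalBasis] = None
    bias_status: BiasStatus = BiasStatus.NOT_STARTED
    quality_score: Optional[int] = None         # 0-100, see Quality Score below
```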
Dataset Detail Page
Clicking a dataset in the list view opens its detail page, which is organised into several sections:
Dataset Information
The top section displays the fields entered during creation: name, description, linked AI system (shown as a clickable link), type, and size, together with the created and last-updated timestamps. This provides a quick reference for anyone reviewing the dataset record.
Privacy Card
If the dataset is flagged as containing personal data, a dedicated privacy card is displayed. This card shows:
- Personal Data Indicator — A prominent badge confirming the dataset contains personal data.
- GDPR Legal Basis — The selected legal basis with a brief explanation of what it means.
- Data Protection Considerations — Auto-generated recommendations based on the dataset type and legal basis. For example, if the dataset is a Training dataset with a Consent legal basis, the system recommends verifying that consent covers AI training use, implementing data minimisation, and establishing retention limits. A rule sketch follows this list.
- Cross-Reference to GDPR Module — If your organisation uses Venvera's GDPR processing activities module, a link to the relevant processing activity record is provided for integrated compliance management.
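To picture how these recommendations are derived, think of a lookup keyed on dataset type and legal basis. This is an illustrative sketch only: the Training/Consent entry reflects the example above, while the second entry and all identifiers are hypothetical placeholders, not Venvera's actual rule set.

```python
# Illustrative sketch: (dataset_type, legal_basis) -> recommended actions.
# Only the Training/Consent entry is documented above; the other is a
# hypothetical placeholder.
RECOMMENDATIONS = {
    ("training", "consent"): [
        "Verify that consent covers AI training use",
        "Implement data minimisation",
        "Establish retention limits",
    ],
    ("operational", "legitimate_interests"): [
        "Document the balancing test for legitimate interests",  # placeholder
    ],
}

def recommendations_for(dataset_type: str, legal_basis: str) -> list[str]:
    """Return the data protection recommendations for a type/basis pair."""
    return RECOMMENDATIONS.get((dataset_type, legal_basis), [])

print(recommendations_for("training", "consent"))
```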
Bias Assessment
The bias assessment section is a structured evaluation of the dataset for potential biases that could affect the fairness and non-discrimination of the AI system. The section includes:
- Status — Not Started, In Progress, or Completed. Update this as you progress through the assessment.
- Assessment Notes — A rich text field for documenting bias findings, methodologies used, protected characteristics examined, statistical measures applied, and mitigation actions taken. Include references to specific tools or techniques used (e.g., demographic parity analysis, equalised odds testing, disparate impact ratios; a worked sketch of the last of these follows this list).
- Identified Biases — A summary of any biases detected, their potential impact, and the mitigation measures implemented.
- Mitigation Actions — A record of steps taken to address identified biases (e.g., resampling, reweighting, data augmentation, removal of proxy variables).
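To make one of the techniques named above concrete, the snippet below computes a disparate impact ratio: the favourable-outcome rate of a protected group divided by that of a reference group. This is a minimal sketch with toy data; the function, data layout, and the 0.8 rule-of-thumb threshold are illustrative conventions, not Venvera functionality.

```python
def disparate_impact_ratio(
    outcomes: list[tuple[str, int]], protected: str, reference: str
) -> float:
    """Favourable-outcome rate of the protected group divided by the
    reference group's rate. Outcomes are (group, favourable 1/0) pairs."""
    def rate(group: str) -> float:
        rows = [y for g, y in outcomes if g == group]
        return sum(rows) / len(rows) if rows else 0.0
    return rate(protected) / rate(reference)

# Toy data for illustration only.
data = [("A", 1), ("A", 0), ("A", 1), ("B", 1), ("B", 1), ("B", 1), ("B", 0)]
print(f"{disparate_impact_ratio(data, 'A', 'B'):.2f}")  # 0.89
# A common rule of thumb (the "four-fifths rule") flags ratios below 0.8.
```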
Quality Score
The quality score is a percentage (0–100%) reflecting the overall data quality of the dataset. This score is displayed as a progress bar with colour coding:
- Green (≥80%) — The dataset meets or exceeds data quality requirements. Data is well-documented, errors are minimal, coverage is comprehensive, and bias has been assessed and mitigated.
- Amber (≥50% and <80%) — The dataset has room for improvement. Some quality issues exist but are not critical. Review the dataset description and bias assessment for areas to improve.
- Red (<50%) — The dataset has significant quality concerns. Major issues exist with completeness, accuracy, bias, or documentation. Address these issues before using the dataset for training or validation of high-risk AI systems.
The quality score can be set manually based on your data quality assessment, or it can be informed by automated data profiling tools if integrated. Update the score as you improve the dataset and address identified issues.
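In pseudocode terms, the banding reduces to two threshold checks. The sketch below mirrors the thresholds described above; it is illustrative, not Venvera source code.

```python
def quality_band(score: int) -> str:
    """Map a 0-100 quality score onto the colour bands described above."""
    if score >= 80:
        return "green"
    if score >= 50:
        return "amber"
    return "red"

for s in (92, 65, 30):
    print(s, quality_band(s))  # 92 green, 65 amber, 30 red
```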