Data Governance — Datasets & Art. 10 Compliance
Article 10 of the EU AI Act establishes strict requirements for the data used to train, validate, and test high-risk AI systems. Providers must implement appropriate data governance and management practices, examine datasets for possible biases, ensure that datasets are relevant and sufficiently representative, and comply with personal data protection requirements. The Data Governance module in Venvera provides a comprehensive registry for documenting datasets, tracking their quality, assessing bias, and managing privacy considerations. This article covers every feature of the module.
List View
Navigate to EU AI Act → Datasets from the sidebar. The list view displays all registered datasets in a paginated table, sorted by last-updated date. Each row shows the dataset name, linked AI system, type badge, size, quality score, bias assessment status, and whether the dataset contains personal data.
The search bar filters datasets by name and description using case-insensitive partial matching. This is useful for finding datasets related to specific AI systems, domains, or data types. For example, searching "customer" will match "Customer Transaction Training Data" and "Customer Feedback Validation Set".
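In effect, the search is a case-insensitive substring test over the two text fields. The sketch below illustrates the matching behaviour only; the field names are assumptions, not Venvera's implementation.

```python
def matches(dataset: dict, query: str) -> bool:
    """Case-insensitive partial match over name and description."""
    q = query.lower()
    return (q in dataset.get("name", "").lower()
            or q in dataset.get("description", "").lower())

datasets = [
    {"name": "Customer Transaction Training Data", "description": ""},
    {"name": "Customer Feedback Validation Set", "description": ""},
    {"name": "Sensor Calibration Testing Set", "description": ""},
]
print([d["name"] for d in datasets if matches(d, "customer")])
# ['Customer Transaction Training Data', 'Customer Feedback Validation Set']
```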
Use the Type dropdown to filter by dataset type:
- Training — Data used to train the AI model. This is typically the largest dataset and the most critical for Art. 10 compliance. Training data must be relevant, sufficiently representative, and, to the best extent possible, free of errors and complete.
- Validation — Data used to tune hyperparameters and evaluate model performance during development. Validation data should be kept independent of the training data to avoid overfitting.
- Testing — Data used for final evaluation of model performance before deployment. Testing data should be representative of real-world conditions and must not have been used during training or validation.
- Operational — Data processed by the AI system during its live operation. Tracking operational data is important for post-market monitoring and detecting data drift.
Use the Bias Assessment filter to identify datasets based on the state of their bias assessment:
- Not Started — No bias assessment has been initiated for this dataset. Datasets used by high-risk AI systems should undergo a bias assessment before deployment.
- In Progress — A bias assessment is underway. The assessor has begun documenting findings but has not yet completed the evaluation.
- Completed — The bias assessment has been completed and documented. The dataset record includes the assessment findings and any mitigation actions taken.
Creating a New Dataset
Click + Add Dataset to open the creation form. Complete the fields described below; only Name is required. A sketch of the resulting record follows the table.
| Field | Type | Required | Description |
|---|---|---|---|
| Name | Text input | Required | A descriptive name for the dataset. The name should clearly identify the data content, the AI system it relates to, and its purpose. For example: "Credit Risk Model — Training Dataset v3.2 (2024 Customer Transactions)". Maximum 300 characters. Names should be unique within the same AI system to avoid confusion. |
| Description | Textarea | Optional | A detailed description of the dataset contents, provenance, and characteristics. Include information about the data sources, the time period covered, the number of records, the feature set, and any preprocessing steps applied. This description serves as the primary reference for anyone reviewing the dataset and is included in technical documentation exports. Maximum 5,000 characters. |
| AI System | Dropdown | Optional | Select the AI system that uses this dataset. The dropdown lists all AI systems in your inventory. Linking a dataset to an AI system enables the system detail page to show related datasets and supports completeness tracking. A dataset can be linked to one AI system; if the same data is used by multiple systems, create separate dataset records for each. |
| Type | Dropdown | Optional | Select the dataset type: Training, Validation, Testing, or Operational. The type determines the compliance checks and recommendations that apply to this dataset. Training datasets trigger the most comprehensive requirements under Art. 10. |
| Size | Text input | Optional | The size of the dataset in a human-readable format (e.g., "2.4 GB", "1.2 million records", "500,000 rows × 42 features"). This is a free-text field that accommodates whatever size description is most meaningful for your context. Including both the storage size and the record count is recommended. |
| Personal Data | Checkbox | Optional | Check this box if the dataset contains personal data as defined by the GDPR (Regulation (EU) 2016/679). When checked, additional privacy-related fields become relevant, including the GDPR Legal Basis. Datasets containing personal data are subject to additional scrutiny under both the EU AI Act and the GDPR, and must be handled with particular care regarding data minimisation, purpose limitation, and storage limitation. |
| GDPR Legal Basis | Dropdown | Optional | If the dataset contains personal data, select the GDPR legal basis for processing under Art. 6(1) GDPR: Consent (Art. 6(1)(a)), Contract (Art. 6(1)(b)), Legal Obligation (Art. 6(1)(c)), Vital Interests (Art. 6(1)(d)), Public Task (Art. 6(1)(e)), or Legitimate Interests (Art. 6(1)(f)). The selected basis feeds the data protection considerations shown on the dataset detail page. |
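Taken together, the form fields map onto a record shaped roughly like the following. This is a hypothetical sketch of the data model for orientation; Venvera's actual field names, enums, and storage types may differ.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class DatasetType(Enum):
    TRAINING = "training"
    VALIDATION = "validation"
    TESTING = "testing"
    OPERATIONAL = "operational"

class GDPRLegalBasis(Enum):                        # Art. 6(1) GDPR
    CONSENT = "consent"                            # Art. 6(1)(a)
    CONTRACT = "contract"                          # Art. 6(1)(b)
    LEGAL_OBLIGATION = "legal_obligation"          # Art. 6(1)(c)
    VITAL_INTERESTS = "vital_interests"            # Art. 6(1)(d)
    PUBLIC_TASK = "public_task"                    # Art. 6(1)(e)
    LEGITIMATE_INTERESTS = "legitimate_interests"  # Art. 6(1)(f)

class BiasStatus(Enum):
    NOT_STARTED = "not_started"
    IN_PROGRESS = "in_progress"
    COMPLETED = "completed"

@dataclass
class DatasetRecord:
    name: str                                   # required, max 300 chars
    description: str = ""                       # optional, max 5,000 chars
    ai_system_id: Optional[str] = None          # one linked AI system per record
    dataset_type: Optional[DatasetType] = None
    size: str = ""                              # free text, e.g. "2.4 GB, 1.2M records"
    personal_data: bool = False
    gdpr_legal_basis: Optional[GDPRLegalBasis] = None
    bias_status: BiasStatus = BiasStatus.NOT_STARTED
    quality_score: Optional[int] = None         # 0-100, see Quality Score below
```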
Dataset Detail Page
Clicking a dataset in the list view opens its detail page, which is organised into several sections:
Dataset Information
The top section displays the fields entered during creation: name, description, linked AI system (shown as a clickable link), type, and size, together with the created and last-updated timestamps. This provides a quick reference for anyone reviewing the dataset record.
Privacy Card
If the dataset is flagged as containing personal data, a dedicated privacy card is displayed. This card shows:
- Personal Data Indicator — A prominent badge confirming the dataset contains personal data.
- GDPR Legal Basis — The selected legal basis with a brief explanation of what it means.
- Data Protection Considerations — Auto-generated recommendations based on the dataset type and legal basis. For example, if the dataset is a Training dataset with a Consent legal basis, the system recommends verifying that consent covers AI training use, implementing data minimisation, and establishing retention limits. A rule sketch follows this list.
- Cross-Reference to GDPR Module — If your organisation uses Venvera's GDPR processing activities module, a link to the relevant processing activity record is provided for integrated compliance management.
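To picture how these recommendations are derived, think of a lookup keyed on dataset type and legal basis. This is an illustrative sketch only: the Training/Consent entry reflects the example above, while the second entry and all identifiers are hypothetical placeholders, not Venvera's actual rule set.

```python
# Illustrative sketch: (dataset_type, legal_basis) -> recommended actions.
# Only the Training/Consent entry is documented above; the other is a
# hypothetical placeholder.
RECOMMENDATIONS = {
    ("training", "consent"): [
        "Verify that consent covers AI training use",
        "Implement data minimisation",
        "Establish retention limits",
    ],
    ("operational", "legitimate_interests"): [
        "Document the balancing test for legitimate interests",  # placeholder
    ],
}

def recommendations_for(dataset_type: str, legal_basis: str) -> list[str]:
    """Return the data protection recommendations for a type/basis pair."""
    return RECOMMENDATIONS.get((dataset_type, legal_basis), [])

print(recommendations_for("training", "consent"))
```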
Bias Assessment
The bias assessment section is a structured evaluation of the dataset for potential biases that could affect the fairness and non-discrimination of the AI system. The section includes:
- Status — Not Started, In Progress, or Completed. Update this as you progress through the assessment.
- Assessment Notes — A rich text field for documenting bias findings, methodologies used, protected characteristics examined, statistical measures applied, and mitigation actions taken. Include references to specific tools or techniques used (e.g., demographic parity analysis, equalised odds testing, disparate impact ratios; a worked sketch of the last of these follows this list).
- Identified Biases — A summary of any biases detected, their potential impact, and the mitigation measures implemented.
- Mitigation Actions — A record of steps taken to address identified biases (e.g., resampling, reweighting, data augmentation, removal of proxy variables).
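To make one of the techniques named above concrete, the snippet below computes a disparate impact ratio: the favourable-outcome rate of a protected group divided by that of a reference group. This is a minimal sketch with toy data; the function, data layout, and the 0.8 rule-of-thumb threshold are illustrative conventions, not Venvera functionality.

```python
def disparate_impact_ratio(
    outcomes: list[tuple[str, int]], protected: str, reference: str
) -> float:
    """Favourable-outcome rate of the protected group divided by the
    reference group's rate. Outcomes are (group, favourable 1/0) pairs."""
    def rate(group: str) -> float:
        rows = [y for g, y in outcomes if g == group]
        return sum(rows) / len(rows) if rows else 0.0
    return rate(protected) / rate(reference)

# Toy data for illustration only.
data = [("A", 1), ("A", 0), ("A", 1), ("B", 1), ("B", 1), ("B", 1), ("B", 0)]
print(f"{disparate_impact_ratio(data, 'A', 'B'):.2f}")  # 0.89
# A common rule of thumb (the "four-fifths rule") flags ratios below 0.8.
```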
Quality Score
The quality score is a percentage (0–100%) reflecting the overall data quality of the dataset. This score is displayed as a progress bar with colour coding:
- Green (≥80%) — The dataset meets or exceeds data quality requirements. Data is well-documented, errors are minimal, coverage is comprehensive, and bias has been assessed and mitigated.
- Amber (≥50% and <80%) — The dataset has room for improvement. Some quality issues exist but are not critical. Review the dataset description and bias assessment for areas to improve.
- Red (<50%) — The dataset has significant quality concerns. Major issues exist with completeness, accuracy, bias, or documentation. Address these issues before using the dataset for training or validation of high-risk AI systems.
The quality score can be set manually based on your data quality assessment, or it can be informed by automated data profiling tools if integrated. Update the score as you improve the dataset and address identified issues.
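In pseudocode terms, the banding reduces to two threshold checks. The sketch below mirrors the thresholds described above; it is illustrative, not Venvera source code.

```python
def quality_band(score: int) -> str:
    """Map a 0-100 quality score onto the colour bands described above."""
    if score >= 80:
        return "green"
    if score >= 50:
        return "amber"
    return "red"

for s in (92, 65, 30):
    print(s, quality_band(s))  # 92 green, 65 amber, 30 red
```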