CompTIA DA0-002 Practice Test 2026

Updated On: 4-Feb-2026

Prepare smarter and boost your chances of success with our CompTIA DA0-002 practice test 2026. These CompTIA Data+ Exam (2025) practice questions help you assess your knowledge, pinpoint strengths, and target areas for improvement. Surveys and user data from multiple platforms show that individuals who use a DA0-002 practice exam are 40–50% more likely to pass on their first attempt.

Start practicing today and take the fast track to becoming CompTIA DA0-002 certified.

111 Questions | CompTIA Data+ Exam (2025) | Rated 4.8/5.0 | 11,110 already prepared

Page 1 out of 12 Pages


A data analyst created a dashboard to illustrate the traffic volume and mean response time for a call center. The traffic data is current, but the mean response time has not updated for more than an hour. Which of the following is the best way to verify the data's freshness?

A. Refactoring the code base

B. Testing for network connectivity issues

C. Checking the last time the calculation script ran

D. Determining the number of calls with no timestamps

C.   Checking the last time the calculation script ran

Summary

This question tests the ability to assess data freshness within a dashboard or reporting environment. When metrics fail to update, analysts must first determine whether the pipeline or calculation scripts feeding the data have run as expected. Directly checking the time of the last execution is the most reliable way to confirm data freshness before looking for more complex causes like connectivity or code refactoring.

Correct Option

Option C – Checking the last time the calculation script ran: This is the best way to verify data freshness. By confirming when the calculation or ETL script last executed, you can determine if the data pipeline stopped or failed. If it hasn’t run in over an hour, that explains the outdated mean response time. This method directly targets the root of data freshness rather than unrelated issues like code structure or timestamps.
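
For example, a minimal Python sketch of this check, assuming the calculation script writes its result to a known output file (the path and one-hour threshold below are hypothetical):

import time
from pathlib import Path

# Hypothetical output file that the calculation script refreshes on each run
OUTPUT_FILE = Path("/var/data/call_center/mean_response_time.csv")
MAX_AGE_SECONDS = 60 * 60  # anything older than one hour is considered stale

age_seconds = time.time() - OUTPUT_FILE.stat().st_mtime
if age_seconds > MAX_AGE_SECONDS:
    print(f"Stale: last update {age_seconds / 60:.0f} minutes ago; check the scheduler or ETL job")
else:
    print(f"Fresh: last update {age_seconds / 60:.0f} minutes ago")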

Incorrect Options

Option A – Refactoring the code base: Refactoring improves code readability and maintainability but does not verify the freshness of data. It involves restructuring existing code without changing external behavior. If the data isn’t updating, refactoring will not reveal when or why the pipeline stopped. This is more of a development practice than a troubleshooting step for stale data.

Option B – Testing for network connectivity issues: While network problems can disrupt data transfers, testing connectivity does not directly confirm when the data was last updated. If the calculation script or ETL process didn’t run, connectivity may not be the issue. Connectivity tests are usually secondary after confirming pipeline execution status.

Option D – Determining the number of calls with no timestamps: This checks data quality, not data freshness. Missing timestamps can affect metrics but do not confirm whether the data pipeline is running or when it last ran. This step might be useful for cleaning data but not for verifying if the metric updates are current.

Reference
CompTIA Data+ (DA0-002) Exam Objectives – Official CompTIA Website

A data analyst team needs to segment customers based on customer spending behavior. Given one million rows of data like the information in the following sales order table:

Customer_ID   Region   Amount_spent   Product_category   Quantity_of_items
00123         East     20000          Baby               4
00124         West     30000          Home               6
00125         South    40000          Garden             7
00126         North    50000          Furniture          8
00127         East     60000          Baby               10

Which of the following techniques should the team use for this task?

A. Standardization

B. Concatenate

C. Binning

D. Appending

C.   Binning

Summary

This question assesses your understanding of data-preparation and analysis techniques for segmenting customers based on spending behavior. When a dataset contains large amounts of numerical data, analysts often group or categorize values into intervals to simplify patterns and identify customer segments. This process—called binning—is frequently used in analytics and data science to classify customers into spending tiers or behavioral groups.

Correct Option

Option C – Binning: Binning (or discretization) groups continuous data values into ranges or “bins.” In this case, the team could create bins such as “low spenders,” “medium spenders,” and “high spenders” based on the Amount_spent column. This makes it easier to analyze patterns and behaviors among customer segments. Binning is especially useful for large datasets like the one described with one million rows.
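
As an illustration, a minimal pandas sketch of binning the Amount_spent column into spending tiers (the cut points and labels below are assumptions, not values from the scenario):

import pandas as pd

# Small slice of the sales order table from the question
orders = pd.DataFrame({
    "Customer_ID": ["00123", "00124", "00125", "00126", "00127"],
    "Amount_spent": [20000, 30000, 40000, 50000, 60000],
})

# Group the continuous Amount_spent values into three bins
orders["Spend_segment"] = pd.cut(
    orders["Amount_spent"],
    bins=[0, 25000, 45000, float("inf")],
    labels=["Low spender", "Medium spender", "High spender"],
)
print(orders)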

Incorrect Options

Option A – Standardization: Standardization transforms numerical variables to a standard scale (e.g., z-scores). While useful for some statistical models, it does not inherently segment customers into spending groups. Standardization focuses on normalization of data rather than categorizing or grouping continuous variables.

Option B – Concatenate: Concatenation means combining strings or columns together. For instance, merging Region and Product_category fields into one column. This process doesn’t create spending segments or group customers. It’s about text/data manipulation, not data grouping or classification.

Option D – Appending: Appending refers to adding new rows or records to an existing dataset. While this can expand the dataset, it does not help with customer segmentation or grouping based on spending behavior. Appending is a data-loading operation, not a segmentation or transformation technique.

Reference
CompTIA Data+ (DA0-002) Exam Objectives – Official CompTIA Website

A senior manager needs a report that can be generated and accessed at any time. Which of the following delivery methods should a data analyst use?

A. Ad hoc

B. Dynamic

C. Self-service

D. Static

C.   Self-service

Question Explanation
1. Question Restatement
A senior manager needs a report that can be generated and accessed at any time. Which of the following delivery methods should a data analyst use?
A. Ad hoc
B. Dynamic
C. Self-service
D. Static



Correct Answer: C) Self-service


2. Correct Answer Justification

Self-service is a reporting and analytics delivery model that provides business users (like a senior manager) with direct, controlled access to data tools and dashboards. This allows them to generate reports and answer their own questions on-demand, at any time, without needing to submit a request to a data analyst or IT department for each new report.
This approach empowers the end-user, reduces the burden on the data team for repetitive requests, and drastically improves the speed of decision-making. The data analyst's role in a self-service model is to create and maintain the underlying data models, dashboards, and tools that enable this access, ensuring data governance and quality are maintained.
This aligns directly with CompTIA Data+ Objective 5.3: "Given a scenario, use the appropriate method for dashboard delivery.", which includes self-service as a key delivery method.

3. Incorrect Answer Analysis
A) Ad hoc: An ad hoc report is a one-off, custom report created by a data analyst to answer a specific, non-routine question. It is not designed for ongoing, "any time" access by a business user. Once created and delivered, it is typically static. The question describes an ongoing need, not a single request.

B) Dynamic: A dynamic report is interactive and allows the user to change parameters (like filters, date ranges, or dimensions) within the framework of a pre-built report. While this offers flexibility, it still implies that the report itself was built and delivered by an analyst. "Dynamic" describes a characteristic of a report (interactivity) rather than the primary delivery method that enables "any time" access. Self-service platforms are the conduit for delivering dynamic reports to users.

D) Static: A static report is a "snapshot in time" delivered in a fixed format (e.g., a PDF, a printed report, a non-editable spreadsheet). It contains data that was current at the moment of its creation and does not change. It cannot be regenerated or accessed on-demand by the manager; it must be re-created and re-sent by the analyst, which contradicts the requirement of being available "at any time."

4. Key Concepts and Terminology
Self-Service Analytics: A business intelligence (BI) approach that enables non-technical users to access and work with corporate data to perform queries and generate reports themselves, typically through user-friendly dashboards and interfaces.
Ad Hoc Report: A report generated for a specific, immediate, and unique business question, not part of a standard reporting suite.
Static Report: A non-interactive report containing historical data that does not update automatically.
Dynamic Report: An interactive report that allows users to manipulate parameters to change the data view without altering the underlying dataset or structure.
Dashboard: A visual display of the most important information needed to achieve one or more objectives, consolidated and arranged on a single screen so the information can be monitored at a glance. This is the primary tool for self-service delivery.

5. Real-World Application

A data analyst builds an interactive dashboard in a tool like Tableau, Power BI, or Qlik that displays key sales, marketing, and operational metrics. The analyst publishes this dashboard to a portal and grants access to the senior manager. The manager can now:
Log in to the portal at any time, whether during a meeting, at night, or on the weekend.
Refresh the data to see the most current information.
Use filters to view data for specific regions, time periods, or product lines.
Drill down into details without ever emailing the analyst for a new report.


6. References and Resources

CompTIA Data+ Exam Objectives (DA0-002): Domain 5.0: Data Visualization, Section 5.3: "Given a scenario, use the appropriate method for dashboard delivery." This objective explicitly covers self-service, static, and ad hoc reporting.
Industry practice: Self-service BI is a cornerstone of modern data-driven organizations, as promoted by leading analyst firms (Gartner, Forrester).


7. Visual Aids (if applicable)

The following table contrasts the delivery methods based on who generates the report and its flexibility:

Delivery Method   Generated By                          Flexibility                        Best For
Self-Service      End User                              High (On-demand)                   Ongoing, ad hoc exploration
Ad Hoc            Data Analyst                          Medium (One-time)                  Unique, non-routine questions
Static            Data Analyst                          None (Fixed Snapshot)              Formal, archived reporting
Dynamic           Data Analyst (built) / User (used)    Medium (Pre-built Interactivity)   Parameterized, routine analysis

8. Common Mistakes
Confusing "ad hoc" (a one-time request) with "self-service" (the ability to perform ad hoc analysis yourself). The key differentiator is who generates the report.
Selecting "dynamic" because it sounds like it enables action. While dynamic reports are a feature of self-service platforms, the term itself does not capture the empowerment of the user to generate reports independently.
Overlooking the phrase "accessed at any time." This keyword phrase directly points to a system that is always available to the user, which is the definition of a self-service portal.

9. Cross-Reference to Exam Domains
Primary Domain: 5.0 Data Visualization (This domain covers all aspects of reporting and delivery methods).
Secondary Domain: 1.0 Data Concepts and Environments (Understanding how users interact with data systems).


10. Summary

The correct answer is C) Self-service because it is the only delivery method that empowers the senior manager to independently generate and access the report at any time without relying on the data analyst for each new request. The analyst provides the tool (a dashboard), and the manager uses it on their own schedule, which is the core principle of self-service analytics. The other options either require the analyst's intervention for each use (Ad hoc, Static) or describe a characteristic of a report rather than the empowering delivery method itself (Dynamic).

A data analyst is preparing a survey for Paralympic Games athletes. Which of the following should the analyst consider when creating this survey?

A. Idioms

B. Color contrast

C. Refresh speed

D. Granularity

B.   Color contrast

Question

A data analyst is preparing a survey for Paralympic Games athletes. Which of the following should the analyst consider when creating this survey?

Options:
A. Idioms
B. Color contrast
C. Refresh speed
D. Granularity


Correct Answer: B. Color contrast

Question Restatement

When designing a survey for Paralympic athletes — who may have various disabilities — what is the most important accessibility consideration to ensure the survey is usable by everyone?

Correct Answer Justification — Why B Is Correct

Color contrast is a core principle of accessibility. People with low vision or color blindness can have difficulty reading text or distinguishing visual elements if there is insufficient contrast between text and background. The Web Content Accessibility Guidelines (WCAG) set minimum color contrast ratios to ensure readability. Because this survey targets Paralympic athletes, the analyst must make sure the design accommodates visual impairments, making color contrast the most important factor of the options given.
This aligns with CompTIA Data+ objectives on ethical and accessible data collection practices (Domain 1.3). It demonstrates inclusivity and compliance with accessibility standards.
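
To make the WCAG rule concrete, here is a small Python sketch that computes the WCAG 2.1 contrast ratio between two colors; the sample hex values are illustrative:

def relative_luminance(hex_color):
    # WCAG 2.1 relative luminance of an sRGB color such as "#AAAAAA"
    channels = [int(hex_color.lstrip("#")[i:i + 2], 16) / 255 for i in (0, 2, 4)]
    linear = [c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4 for c in channels]
    return 0.2126 * linear[0] + 0.7152 * linear[1] + 0.0722 * linear[2]

def contrast_ratio(color_a, color_b):
    lighter, darker = sorted((relative_luminance(color_a), relative_luminance(color_b)), reverse=True)
    return (lighter + 0.05) / (darker + 0.05)

print(round(contrast_ratio("#000000", "#FFFFFF"), 1))  # 21.0, black text on white passes easily
print(round(contrast_ratio("#AAAAAA", "#FFFFFF"), 1))  # about 2.3, below the 4.5:1 minimum for normal text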

Incorrect Answer Analysis

Idioms: While avoiding culturally specific phrases is good practice in general, idioms mainly create language or cultural barriers, not accessibility barriers for people with disabilities.
Refresh speed: This relates to dashboards or live-reporting environments. Surveys are static forms and do not depend on refresh speed, so it’s irrelevant here.
Granularity: This refers to the level of detail in the data you collect. While important for analysis, it does not address making the survey accessible to people with disabilities.

Key Concepts and Terminology

Color Contrast: the difference in brightness or hue between text and its background. High contrast (like black text on a white background) improves readability.
Accessibility: the practice of designing systems usable by people with different abilities, including visual, auditory, motor, or cognitive impairments.
WCAG (Web Content Accessibility Guidelines): the international standard for digital accessibility, including color contrast ratios.
Inclusive Design: designing surveys and systems with all potential users in mind from the beginning, rather than adapting later.

Real-World Application

For example, if a survey uses dark gray text on a slightly lighter gray background, many visually impaired users may find it unreadable. Instead, using black text on a white background or providing an accessible theme helps all users, including those with low vision or color blindness. This approach also supports compliance with laws like the ADA (Americans with Disabilities Act) or Section 508 in the US.

References and Resources

CompTIA Data+ (DA0-002) Exam Objectives: Domain 1.3 (data acquisition ethics and accessibility).
WCAG 2.1 Guidelines on color contrast.
Section 508 Accessibility Standards (U.S.).


Common Mistakes

Many test-takers confuse accessibility with data detail (granularity) or technical performance (refresh speed). Another common error is assuming idioms or wording are the primary barrier for Paralympic athletes, when visual and functional accessibility is actually more important.

Domain Cross-Reference

This question touches on Domain 1 (Data Concepts and Environments) and Domain 3 (Data Analysis and Visualization), both of which stress ethical and inclusive data collection methods.

Summary

The correct answer is color contrast because ensuring high contrast between text and background is essential to make a survey accessible to Paralympic athletes, some of whom may have visual impairments. The other options do not directly address accessibility in survey design.

Which of the following tables holds relational keys and numeric values?

A. Fact

B. Graph

C. Dimensional

D. Transactional

A.   Fact

Question

Which of the following tables holds relational keys and numeric values?

Options:
A. Fact
B. Graph
C. Dimensional
D. Transactional


Correct Answer: A. Fact

Question Restatement

The question is asking: In a typical data warehouse or relational database used for analytics, which table type stores the measures (numbers) and the keys linking to descriptive tables?

Correct Answer Justification — Why A Is Correct

A fact table is a central table in a star or snowflake schema within a data warehouse. It primarily stores:
Foreign keys (relational keys) referencing dimension tables.
Quantitative data or measures, such as sales revenue, quantity sold, or hours worked.
Fact tables provide the numeric values analysts aggregate (sum, average, min/max) during analysis. They are designed for fast retrieval and aggregation.
This matches the CompTIA Data+ Domain 1.4 objective, which covers data structures and environments used for analytics.


Incorrect Answer Analysis

Graph: Graph databases store data as nodes and edges, often used for network relationships or social graph analysis. They do not hold relational keys and numeric measures in the same way a fact table does.
Dimensional: This term usually refers to dimension tables, which store descriptive (categorical) data such as customer names, regions, or product categories. They rarely hold numeric measures—mainly descriptive attributes.
Transactional: Transactional tables store individual events or transactions (like each purchase record). While they do contain data, they are optimized for processing transactions, not for analytical aggregations like a fact table.

Key Concepts and Terminology

Fact Table: A table containing measures (quantitative data) and keys to dimension tables.
Dimension Table: A table containing descriptive or categorical attributes to give context to measures.
Star Schema: A database schema where a central fact table connects to multiple dimension tables.
Foreign Key: A column in one table linking to a primary key in another, enabling relational joins.
Measures vs. Attributes: Measures are numeric and aggregatable; attributes describe or categorize.

Real-World Application

In a sales data warehouse:
The fact table holds Order_ID, Customer_ID, Product_ID, Date_ID, along with numeric measures like Quantity_Sold, Unit_Price, and Total_Revenue.
Dimension tables hold descriptive data like customer demographics, product descriptions, or calendar information.
Analysts can then sum or average these measures by joining the fact table to dimension tables.
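
A minimal pandas sketch of this star-schema pattern (the tables and values below are illustrative, not part of the exam scenario):

import pandas as pd

# Fact table: foreign keys plus numeric measures
fact_sales = pd.DataFrame({
    "Product_ID": [1, 2, 1, 3],
    "Date_ID": [20240101, 20240101, 20240102, 20240102],
    "Quantity_Sold": [5, 2, 7, 1],
    "Total_Revenue": [49.95, 17.78, 69.93, 50.15],
})

# Dimension table: descriptive attributes keyed by Product_ID
dim_product = pd.DataFrame({
    "Product_ID": [1, 2, 3],
    "Product_Name": ["USB Cords", "Charging Block", "Headphones"],
    "Category": ["Accessories", "Accessories", "Audio"],
})

# Join the fact table to the dimension, then aggregate the measures
report = (
    fact_sales.merge(dim_product, on="Product_ID")
    .groupby("Category", as_index=False)[["Quantity_Sold", "Total_Revenue"]]
    .sum()
)
print(report)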

References and Resources

CompTIA Data+ (DA0-002) Exam Objectives: Domain 1.4 (data structures for analytics).
Kimball & Ross, The Data Warehouse Toolkit — foundational book on fact and dimension tables.
Microsoft, Snowflake, AWS Redshift documentation on data warehouse schemas.

Common Mistakes

Confusing dimension tables (descriptive data) with fact tables (numeric measures).
Assuming transactional tables automatically equal fact tables. Transactional tables capture events but aren’t structured for analysis like fact tables.
Thinking graph tables are part of standard relational warehousing—they are a different data model altogether.

Domain Cross-Reference

Domain 1: Data Concepts and Environments (data structures for analytics).
Domain 3: Data Analysis and Visualization (aggregating and summarizing measures from fact tables).

Summary

The correct answer is Fact Table (A) because it holds both the relational keys linking to dimensions and the numeric measures used in analysis. Dimension tables hold descriptive data, transactional tables store event-level records, and graph databases store nodes/edges—not numeric measures tied to relational keys.

The following SQL code returns an error in the program console:

SELECT firstName, lastName, SUM(income)

FROM companyRoster

SORT BY lastName, income

Which of the following changes allows this SQL code to run?

A. SELECT firstName, lastName, SUM(income) FROM companyRoster HAVING SUM(income) > 10000000

B. SELECT firstName, lastName, SUM(income) FROM companyRoster GROUP BY firstName, lastName

C. SELECT firstName, lastName, SUM(income) FROM companyRoster ORDER BY firstName, income

D. SELECT firstName, lastName, SUM(income) FROM companyRoster

B.   SELECT firstName, lastName, SUM(income) FROM companyRoster GROUP BY firstName, lastName

Question

The following SQL code returns an error in the program console:

SELECT firstName, lastName, SUM(income)
FROM companyRoster
SORT BY lastName, income;


Which of the following changes allows this SQL code to run?

Options:
A. Use HAVING SUM(income) > 10000000
B. Use GROUP BY firstName, lastName
C. Use ORDER BY firstName, income
D. Remove the SORT BY clause and leave the rest of the query as-is


Correct Answer: B. GROUP BY firstName, lastName

Question Restatement

This question asks: When using an aggregate function like SUM() in SQL, how can you correctly group and retrieve non-aggregated columns to avoid an error?

Correct Answer Justification — Why B Is Correct

In SQL, when you include an aggregate function like SUM(income) in a SELECT statement alongside non-aggregated columns (firstName, lastName), you must use a GROUP BY clause to tell the database how to group the rows before aggregation. Without GROUP BY, SQL does not know how to combine multiple rows for each person.

The corrected code:

SELECT firstName, lastName, SUM(income)
FROM companyRoster
GROUP BY firstName, lastName
ORDER BY lastName, SUM(income);

This syntax correctly groups income by each first and last name, then allows sorting or ordering by the aggregated values. This is exactly what Option B specifies — adding GROUP BY firstName, lastName fixes the error.

Incorrect Answer Analysis

A. HAVING SUM(income) > 10000000: HAVING is used to filter after aggregation, but it does not resolve the requirement to group non-aggregated columns. You still need a GROUP BY even if you use HAVING.

C. ORDER BY firstName, income: ORDER BY alone does not fix the aggregation issue. The original error is not about sorting—it’s about missing GROUP BY.

D. SELECT firstName, lastName, SUM(income) FROM companyRoster: Leaving it as-is still produces an error because of mixing aggregated and non-aggregated columns without a GROUP BY.

Key Concepts and Terminology

Aggregate Function: A function (SUM, COUNT, AVG, MAX, MIN) that returns a single value from multiple rows.

GROUP BY Clause: Groups rows sharing values of specified columns into summary rows, one for each unique group. Required when mixing aggregates and non-aggregates.

HAVING Clause: Filters groups after aggregation (similar to WHERE but for grouped data).

ORDER BY Clause: Sorts the returned rows; does not affect grouping or aggregation.

Real-World Application

If you’re analyzing company payroll, you might want to see total income per employee. Using GROUP BY firstName, lastName aggregates multiple paychecks or commissions under each employee’s name. Without it, SQL cannot compute a single SUM per employee and throws an error.

References and Resources

CompTIA Data+ (DA0-002) Exam Objectives: Domain 2.1 (data manipulation using SQL).
W3Schools SQL GROUP BY documentation.
ANSI SQL standard on aggregate functions.

Common Mistakes

Trying to mix aggregate and non-aggregate columns without using GROUP BY.
Confusing HAVING (filter groups) with WHERE (filter rows before grouping).
Using SORT BY instead of ORDER BY—SQL uses ORDER BY as the correct syntax.

Domain Cross-Reference

Domain 2: Data Mining and Manipulation (performing aggregations and grouping).
Domain 3: Data Analysis and Visualization (summarizing and displaying data properly).

Summary

The SQL fails because it mixes aggregated and non-aggregated columns without grouping. Adding GROUP BY firstName, lastName (Option B) fixes the issue, letting SQL correctly aggregate income per person. HAVING filters groups after aggregation but does not replace GROUP BY. ORDER BY only sorts results and does not resolve the aggregation error.

A user needs a report that shows the main causes of customer churn rate in a three-year period. Which of the following methods provides this information?

A. Inferential

B. Descriptive

C. Prescriptive

D. Predictive

B.   Descriptive

Question

A user needs a report that shows the main causes of customer churn rate in a three-year period. Which of the following methods provides this information?

Options:
A. Inferential
B. Descriptive
C. Prescriptive
D. Predictive

Correct Answer: B. Descriptive

Question Restatement

This question asks: If someone wants to understand the causes of customer churn over the past three years, which data analysis method is appropriate?

Correct Answer Justification — Why B Is Correct

Descriptive analytics focuses on summarizing historical data to understand what happened. It identifies patterns, trends, and relationships in past data but does not predict or prescribe future actions.
In this scenario, the user wants a report of the main causes of churn over a three-year historical period. This is a classic descriptive use case:
It looks backward (historical data).
It identifies and summarizes causes or factors contributing to churn.
It presents aggregated metrics like percentages, averages, or rankings.
CompTIA Data+ Domain Alignment:
Domain 3.1 (Data Analysis) includes recognizing appropriate types of analysis—descriptive analytics is used for summarizing historical data.

Incorrect Answer Analysis

Inferential: Involves drawing conclusions about a population from a sample (using statistical inference). While inferential stats can estimate population parameters, the question is about summarizing causes from full historical data, not inferring from samples.
Prescriptive: Goes beyond prediction to recommend actions or strategies to achieve desired outcomes. It would tell you how to reduce churn, not just describe its causes.
Predictive: Uses historical data to forecast future trends. It would predict future churn rates or risk but not summarize past causes.


Key Concepts and Terminology

Descriptive Analytics: Analyzes past data to answer “what happened” or “what is happening.”
Inferential Statistics: Uses samples to make conclusions about larger populations.
Predictive Analytics: Uses statistical or machine learning models to forecast future outcomes.
Prescriptive Analytics: Recommends decisions or actions to achieve optimal outcomes.
Customer Churn Rate: Percentage of customers who stop doing business over a given time period.

Real-World Application

A telecom company collects data over three years about customer cancellations. By using descriptive analytics, they identify patterns such as:
High churn rates in certain geographic regions.
Churn linked to contract type or billing disputes.
This report helps business leaders understand why churn happened but does not predict future churn or prescribe remedies directly.
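
A short pandas sketch of this kind of descriptive summary, using made-up churn records:

import pandas as pd

# Hypothetical churn records covering a three-year window
churn = pd.DataFrame({
    "year": [2023, 2023, 2024, 2024, 2025, 2025],
    "region": ["East", "West", "East", "South", "West", "East"],
    "churn_reason": ["Billing dispute", "Price", "Billing dispute", "Contract terms", "Price", "Billing dispute"],
})

# Summarize what happened: rank the causes and count churn by year and region
print(churn["churn_reason"].value_counts())
print(churn.groupby(["year", "region"]).size())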


References and Resources

CompTIA Data+ Exam Objectives: Domain 3 (Data Analysis and Visualization).
Gartner Analytics Maturity Model (descriptive → diagnostic → predictive → prescriptive).
Industry standards: Business Intelligence dashboards summarizing KPIs and causes.


Common Mistakes

Choosing predictive analytics because of the assumption churn always involves forecasting—this question explicitly says “in a three-year period” (historical), so it’s descriptive.
Thinking prescriptive applies because you want to reduce churn—prescriptive would tell you how to act, but the question asks only for main causes.
Confusing inferential (sample-based) with descriptive (full historical dataset).

Domain Cross-Reference

Domain 3: Data Analysis and Visualization (identifying the appropriate analysis type).
Domain 1: Data Concepts (understanding different analytics methods).


Summary

The correct answer is B. Descriptive because the report summarizes historical data to identify main causes of churn over the last three years. Predictive and prescriptive analytics look forward or recommend actions, and inferential statistics deal with drawing conclusions from samples.

Which of the following best describes the semi-structured data that is gathered when web scraping?

A. JSON

B. CSV

C. CSS

D. HTML

A.   JSON

Question

Which of the following best describes the semi-structured data that is gathered when web scraping?

Options:
A. JSON
B. CSV
C. CSS
D. HTML

Correct Answer: A. JSON

Question Restatement

This question asks: When you scrape data from a website, which data format typically represents semi-structured data?

Correct Answer Justification — Why A Is Correct

JSON (JavaScript Object Notation) is a lightweight, text-based format often used to transmit data between web servers and applications. It is semi-structured, meaning it does not require a fixed schema like a relational database but still uses a structured key–value pair format.
When web scraping, data is frequently returned in JSON form via APIs or embedded JavaScript objects on the page. JSON is widely used in modern web applications and is easier to parse programmatically than raw HTML.
This aligns with CompTIA Data+ Domain 1.2, which covers data formats and recognizing structured, semi-structured, and unstructured data types.

Incorrect Answer Analysis

CSV (Comma-Separated Values): This is a structured, tabular format (rows and columns). While easy to work with, CSV lacks the nested structure of JSON and is more rigid. It’s not the typical format sent during web scraping.

CSS (Cascading Style Sheets): This defines how HTML elements are displayed (fonts, colors, layout). It’s a styling language, not a data format.

HTML (HyperText Markup Language): HTML is primarily unstructured or loosely structured markup for displaying content. While you scrape HTML, the extracted data is often parsed and transformed into JSON for analysis.

Key Concepts and Terminology

Structured Data: Organized in rows and columns, stored in relational databases (examples: SQL tables, CSV).

Semi-Structured Data: Does not follow a rigid table structure but has some organizational properties, such as tags or key–value pairs (examples: JSON, XML).

Unstructured Data: Raw content without predefined structure (examples: text documents, images, audio).

JSON: A semi-structured data format using key–value pairs and arrays.

Web Scraping: Extracting data from websites or APIs, often returning JSON for programmatic use.

Real-World Application

A data analyst scraping a retail website for product prices might:
Use Python's requests library to hit an API endpoint.
Receive the response in JSON format containing product IDs, prices, and descriptions.
Load the JSON into a pandas DataFrame for further analysis.
This semi-structured approach allows for flexible handling of hierarchical data without predefined tables.
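
A minimal Python sketch of that workflow; the endpoint URL is a placeholder, not a real API:

import pandas as pd
import requests

# Hypothetical product API exposed by the retail site
response = requests.get("https://example.com/api/products", timeout=10)
response.raise_for_status()

products = response.json()  # semi-structured JSON: nested objects, arrays, key-value pairs

# Flatten the JSON into a tabular DataFrame for analysis
df = pd.json_normalize(products)
print(df.head())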

References and Resources

CompTIA Data+ (DA0-002) Exam Objectives: Domain 1.2 (data types and formats).
The JSON specification (https://www.json.org).
Python libraries such as json and BeautifulSoup for web scraping.

Common Mistakes

Assuming HTML is semi-structured. HTML is markup meant for display, not data exchange, and scraping it often requires cleaning and transforming.
Confusing CSV with semi-structured data. CSV is structured, not semi-structured.
Thinking CSS contains data. CSS only controls presentation.

Domain Cross-Reference

Domain 1: Data Concepts and Environments (structured vs. semi-structured vs. unstructured data).
Domain 2: Data Mining and Manipulation (data acquisition, including web scraping).

Summary

The best answer is A. JSON because JSON is a semi-structured format commonly returned when web scraping. CSV is structured, HTML is markup, and CSS is for styling—not for holding semi-structured data.

A data analyst is analyzing the following dataset:

Transaction Date   Quantity   Item             Item Price
12/12/12           11         USB Cords        9.99
11/11/11           3          Charging Block   8.89
10/10/10           5          Headphones       50.15

Which of the following methods should the analyst use to determine the total cost for each transaction?

A. Parsing

B. Scaling

C. Compressing

D. Deriving

D.   Deriving

Question

A data analyst is analyzing the following dataset:

Transaction Date   Quantity   Item             Item Price
12/12/12           11         USB Cords        9.99
11/11/11           3          Charging Block   8.89
10/10/10           5          Headphones       50.15


Which of the following methods should the analyst use to determine the total cost for each transaction?

Options:
A. Parsing
B. Scaling
C. Compressing
D. Deriving

Correct Answer: D. Deriving

Question Restatement

This question asks: If you have quantity and item price per transaction, what method do you use to create a new column (total cost) from existing columns?

Correct Answer Justification — Why D Is Correct

Deriving refers to creating a new variable or field from one or more existing variables. In this case:
Existing fields: Quantity and Item Price.
Derived field: Total Cost = Quantity × Item Price.
This is a textbook example of a calculated or derived field — you’re not cleaning or scaling data, you’re computing a new metric from existing data.
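
For instance, a minimal pandas sketch of deriving the new column from the table above:

import pandas as pd

transactions = pd.DataFrame({
    "Quantity": [11, 3, 5],
    "Item": ["USB Cords", "Charging Block", "Headphones"],
    "Item_Price": [9.99, 8.89, 50.15],
})

# Derive a new field from two existing fields
transactions["Total_Cost"] = transactions["Quantity"] * transactions["Item_Price"]
print(transactions)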

CompTIA Data+ Domain Alignment:

Domain 2.3 (Data Manipulation): Using transformations, derived fields, and calculated metrics to enrich datasets.

Incorrect Answer Analysis

Parsing: Breaking data into smaller parts or extracting specific elements from a string (like splitting “12/12/12” into month, day, year). Parsing is not about performing calculations between fields.

Scaling: Adjusting data to a different magnitude or range (like normalization or standardization for machine learning). Scaling changes existing values, not creating new ones.

Compressing: Reducing the size of data for storage efficiency. This has nothing to do with creating new fields or calculating totals.

Key Concepts and Terminology

Derived Field (or Calculated Column): A new data field created using a formula or expression applied to existing data fields.
Data Transformation: The process of converting, combining, or deriving new variables to make data more useful for analysis.
Parsing: Extracting meaningful components from strings or raw data.
Scaling: Adjusting numbers to a consistent range or unit.

Real-World Application

In a retail analytics scenario, analysts routinely create derived fields such as:
Total Revenue per transaction (Quantity × Price).
Profit Margin (Revenue − Cost).
Customer Lifetime Value (sum of transactions per customer).
This process enriches the dataset and allows for more meaningful metrics and KPIs.

References and Resources

CompTIA Data+ Exam Objectives: Domain 2 (Data Mining and Manipulation).
Data Warehousing & BI Concepts: derived columns in ETL tools like Informatica and Talend, or SQL calculated columns.
Microsoft Power BI / Tableau calculated fields documentation.

Common Mistakes

Thinking parsing applies because of the text columns (dates/items). Parsing is only about breaking text, not multiplying fields.

Mistaking scaling for any numeric change. Scaling changes the magnitude, not the structure.
Overlooking derived fields as a basic data manipulation technique.

Domain Cross-Reference

Domain 2: Data Mining and Manipulation (deriving and transforming fields).
Domain 3: Data Analysis (calculating metrics from existing data).

Summary

The correct answer is D. Deriving because determining the total cost for each transaction requires creating a new metric (Quantity × Item Price) from existing data fields. Parsing, scaling, and compressing don’t create calculated fields.

A company has a document that includes the names of key metrics and the standard for how those metrics are calculated company-wide. Which of the following describes this documentation?

A. Data dictionary

B. Data explainability report

C. Data lineage

D. Data flow diagram

A.   Data dictionary

Question

A company has a document that includes the names of key metrics and the standard for how those metrics are calculated company-wide. Which of the following describes this documentation?

Options:
A. Data dictionary
B. Data explainability report
C. Data lineage
D. Data flow diagram


Correct Answer: A. Data dictionary

Question Restatement

This question asks: What do you call the document that defines metrics, their names, and standardized calculations across the organization?

Correct Answer Justification — Why A Is Correct

A data dictionary is a centralized repository or document that defines:
Data elements (field names, data types, permissible values).
Key metrics and how they are calculated.
Standard definitions across departments to ensure consistency.
This is exactly what the scenario describes — a document holding key metric names and calculation standards for company-wide use.
CompTIA Data+ Domain Alignment:
Domain 4.1 (Data Governance): Understanding and applying data dictionaries, data catalogs, and business glossaries.

Incorrect Answer Analysis

Data explainability report: This describes how a model or system produces its results (common in AI/ML for transparency), not the standard definitions of metrics.

Data lineage: This shows the data’s origin, movement, and transformations over time (where data came from, how it changed), not its definitions or calculation standards.

Data flow diagram: This is a visual map of how data flows between systems or processes, not a document of metric definitions.

Key Concepts and Terminology

Data Dictionary: A detailed description of each data element in a system, including name, definition, data type, constraints, and sometimes calculations.

Business Glossary: Similar to a data dictionary but more business-focused, defining terms and metrics used across departments.

Data Governance: A framework ensuring consistent, reliable, and secure data management practices.

Data Lineage: The life cycle and transformations of data as it moves through systems.

Real-World Application

In a retail company, a data dictionary might define “Customer Lifetime Value” and the exact calculation formula used, ensuring marketing and finance teams are consistent.
In healthcare, a data dictionary may standardize definitions like “Readmission Rate” to comply with HIPAA or regulatory reporting.
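
As a rough illustration, a single data dictionary entry might look like the following Python sketch (the field names and formula shown are assumptions, not an official schema):

# Hypothetical data dictionary entry for one company-wide metric
clv_entry = {
    "metric_name": "Customer Lifetime Value",
    "definition": "Total revenue expected from a customer over the relationship",
    "calculation": "SUM(order_total) per customer_id over the customer's tenure",
    "data_type": "decimal(12,2)",
    "owner": "Finance",
    "source_tables": ["fact_orders", "dim_customer"],
}
print(clv_entry["calculation"])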

References and Resources

CompTIA Data+ Exam Objectives: Domain 4 — Data Governance, Quality, and Controls.
DAMA-DMBOK (Data Management Body of Knowledge) on Data Dictionaries and Glossaries.
NIST Data Governance Standards.

Common Mistakes

Confusing data lineage (data origin and flow) with data dictionary (definitions and standards).
Thinking data flow diagrams or ERDs are equivalent to data dictionaries — they’re visual, not definitional.
Believing a “data catalog” and “data dictionary” are exactly the same — a catalog is broader (includes metadata), whereas a dictionary focuses on definitions and attributes.

Domain Cross-Reference

Domain 4: Data Governance, Quality, and Controls — understanding documentation like data dictionaries, business glossaries, and standards.

Summary

The correct answer is A. Data dictionary because it defines key metrics and their calculation standards across the company, ensuring consistency and compliance. Data lineage, explainability reports, and flow diagrams serve other purposes (movement tracking, transparency, or visuals) rather than defining metrics.
