Enhance Data Analysis with APPROX_COUNT_DISTINCT

Radu PARVU

2 months ago

Introduction

Oracle introduced approximate query processing features to help clients analyse massive datasets with dramatically better performance and lower resource consumption.

One of the most useful functions in this category is: APPROX_COUNT_DISTINCT, this function provides an approximate count of unique values in a dataset while using significantly less memory and CPU than a traditional COUNT(DISTINCT ...).

For modern analytics, data warehousing, telemetry, clickstream processing, IoT, fraud analytics, and real-time dashboards, this feature can reduce query execution time from minutes to seconds.

What is `APPROX_COUNT_DISTINCT`?

APPROX_COUNT_DISTINCT is an Oracle aggregate function that estimates the number of distinct values in a column.

Traditional distinct counting:

COUNT(DISTINCT customer_id)

select COUNT(DISTINCT emp_id) from emp;

Approximate distinct counting:

select APPROX_COUNT_DISTINCT(emp_id) from emp;

The approximate version trades a very small amount of accuracy for major gains in:

Speed
Scalability
Memory efficiency
Parallel execution performance

Why Traditional `COUNT(DISTINCT)` is Expensive

Distinct counting is computationally expensive because typically Oracle must:

Read all rows
Sort or hash unique values
Eliminate duplicates
Maintain large memory structures

On very large datasets (billions of rows), this can cause:

High PGA memory usage
Temp tablespace spills
Long execution times
Heavy CPU utilization

Example:

			
SELECT COUNT(DISTINCT customer_id)
FROM sales_transactions;

If the table contains 5 billion rows, Oracle may need substantial sorting and hashing resources.

How would `APPROX_COUNT_DISTINCT` Work

Based on own deduction I would suspect that internally, Oracle uses probabilistic algorithms based on variants of:

HyperLogLog (HLL)
Sketch-based cardinality estimation

Instead of storing every distinct value, Oracle stores a compact statistical representation (“sketch”) of the data.

This allows Oracle to estimate cardinality with very small error rates.

Typical error rate:

Around 1–2%
Often even lower in practical workloads

Disclaimer: please note that the statements in this paragraph are solely baed on own deduction bout how I would implement a similar functionality and not any source related o connected in any way to Oracle Corp.

Basic Syntax

APPROX_COUNT_DISTINCT(expression)

Example:

			
SELECT APPROX_COUNT_DISTINCT(customer_id)
FROM orders;

Simple Example

Sample Table

			
CREATE TABLE website_visits (
    visitor_id NUMBER,
    page_name  VARCHAR2(100)
);

Insert sample data:

			
INSERT INTO website_visits VALUES (101, 'HOME');
INSERT INTO website_visits VALUES (102, 'HOME');
INSERT INTO website_visits VALUES (103, 'PRODUCT');
INSERT INTO website_visits VALUES (101, 'CHECKOUT');
INSERT INTO website_visits VALUES (104, 'HOME');
COMMIT;

		

Exact Distinct Count:

			
SELECT COUNT(DISTINCT visitor_id)
FROM website_visits;

Approximate Distinct Count:

			
SELECT APPROX_COUNT_DISTINCT(visitor_id)
FROM website_visits;

On small datasets, results are often exact.

Performance Comparison (based on observations)

Traditional Query:

			
SELECT COUNT(DISTINCT customer_id)
FROM large_sales;

Approximate Query

			
SELECT APPROX_COUNT_DISTINCT(customer_id)
FROM large_sales;

Typical improvements on huge datasets:

Metric	COUNT(DISTINCT)	APPROX_COUNT_DISTINCT
CPU Usage	High	Low
Memory Usage	High	Very Low
Temp Space	Large	Minimal
Scalability	Moderate	Excellent
Speed	Slower	Much Faster

Real-World Use Cases

Use Case 1: Website Analytics

Problem

A large e-commerce platform wants to know:

Unique daily visitors
Unique product viewers
Unique buyers

Exact precision is not necessary for dashboards.

Solution

			
SELECT
    visit_date,
    APPROX_COUNT_DISTINCT(visitor_id) AS unique_visitors
FROM web_logs
GROUP BY visit_date
ORDER BY visit_date;

		

Benefits

Near real-time reporting
Reduced query cost
Faster dashboard refreshes
Smaller development and testing costs

Use Case 2: Telecom Call Analysis

Telecom companies OLTP databases typically process billions of call records.

Requirements:

Estimate unique subscribers per tower
estimate usage per network element

Sample SQL:

			
SELECT
    tower_id,
    APPROX_COUNT_DISTINCT(subscriber_id)
FROM call_records
GROUP BY tower_id;

		

Benefits:

Fast network utilization analysis
Efficient operational dashboards

Use Case 3: IoT Sensor Analytics

Millions of devices stream events continuously.

Requirement:

Estimate unique active devices hourly

Sample SQL would look like:

			
SELECT
    TRUNC(event_time, 'HH24') AS hour_slot,
    APPROX_COUNT_DISTINCT(device_id)
FROM sensor_events
GROUP BY TRUNC(event_time, 'HH24');

		

Use Case 4: Fraud Detection

Banks, financial institutions monitor unique card usage patterns.

Sample SQL which could help identify unusual activity spikes quickly.

			
SELECT
    merchant_id,
    APPROX_COUNT_DISTINCT(card_number)
FROM card_transactions
GROUP BY merchant_id;

		

Use Case 5: Data Warehousing

Data warehouses often aggregate massive fact tables.

			
SELECT
    region,
    APPROX_COUNT_DISTINCT(customer_id)
FROM sales_fact
GROUP BY region;

		

This is extremely useful in Large star schema environments

Grouped Aggregations

APPROX_COUNT_DISTINCT works naturally with GROUP BY.

Example:

			
SELECT
    department_id,
    APPROX_COUNT_DISTINCT(employee_id)
FROM employee_attendance
GROUP BY department_id;

		

So, for my demo table:

I get:

Using with Date Dimensions

Useful for:

Monthly active users
Quarterly customer trends
Annual analytics

Sample:

			
SELECT
    TRUNC(order_date, 'MM') AS sales_month,
    APPROX_COUNT_DISTINCT(customer_id)
FROM sales
GROUP BY TRUNC(order_date, 'MM')
ORDER BY sales_month;

		

Combining with Other Aggregates

Sample:

			
SELECT
    product_category,
    COUNT(*) AS total_sales,
    SUM(amount) AS revenue,
    APPROX_COUNT_DISTINCT(customer_id) AS unique_customers
FROM sales
GROUP BY product_category;

		

Exact vs Approximate Results

For analytics workloads, we have measured 0.36%. This is usually acceptable.

When to Use `APPROX_COUNT_DISTINCT`

Ideal scenarios.Use It When:

Working with billions of rows
Building dashboards
Running exploratory analytics
Processing streaming data
Real-time reporting matters
Small estimation error is acceptable

When NOT to Use It

Avoid approximate counting for:

Financial reconciliation
Regulatory reporting
Auditing
Billing systems
Legal compliance reports
Scientific computations requiring precision

Use exact counting instead:

COUNT(DISTINCT ...)

Parallel Query Benefits

Example:

			
SELECT /*+ parallel(8) */
       APPROX_COUNT_DISTINCT(customer_id)
FROM huge_sales_table;

Benefits:

Excellent scalability
Efficient distributed aggregation
Lower inter-process communication

Partitioned Tables

We consistently observed that it works very efficiently on partitioned fact tables. Example:

Sample SQL:

			
SELECT
    sales_region,
    APPROX_COUNT_DISTINCT(customer_id)
FROM sales_partitioned
WHERE sales_date >= DATE '2026-01-01'
GROUP BY sales_region;

		

Materialized Views and Approximation

Approximate aggregation can significantly improve refresh performance for analytical materialized views.

Accuracy Expectations

Approximation algorithms are designed for:

Statistical consistency
Stable error bounds
High scalability

In most practical environments:

Error rates are very low
Results are repeatable
Performance gains are dramatic

Best Practices

Use for Analytics

Best suited for BI/reporting workloads.

Combine with Partitioning

Partitioned tables improve performance further.

Use Parallelism

Avoid for Financial Accuracy

Use exact counting where precision is mandatory.

Validate Accuracy

Before production deployment:

Compare approximate vs exact results
Measure acceptable error thresholds

Common Mistakes

Mistake 1: Using in Auditing

Approximation should never replace precise audit reporting.

Mistake 2: Expecting Exact Results

Approximate means estimated!!

Mistake 3: Testing Only on Small Data

Benefits become visible on large datasets.

Key Advantages Summary

Advantage	Benefit
Faster execution	Better user experience
Lower memory usage	Reduced resource pressure
Scales to billions of rows	Enterprise-ready
Parallel-friendly	High throughput
Ideal for dashboards	Real-time analytics

Final Thoughts

APPROX_COUNT_DISTINCT is one of Oracle Database’s most powerful analytical optimization features for modern large-scale workloads.

It enables organizations to:

Analyze massive datasets quickly
Build responsive dashboards
Reduce infrastructure strain
Improve query scalability
Support near real-time analytics

While it should not replace exact counting in precision-critical systems, it is an outstanding solution for:

Data warehousing
Big data analytics
Streaming platforms
Operational dashboards
Customer behavior analysis
IoT and telemetry systems

For modern enterprise analytics, approximate query processing is no longer optional—it is becoming essential.

Introduction

What is APPROX_COUNT_DISTINCT?

Why Traditional COUNT(DISTINCT) is Expensive

How would APPROX_COUNT_DISTINCT Work

Basic Syntax

Example:

Simple Example

Sample Table

Insert sample data:

Exact Distinct Count:

Approximate Distinct Count:

Performance Comparison (based on observations)

Real-World Use Cases

Use Case 1: Website Analytics

Problem

Solution

Benefits

Use Case 2: Telecom Call Analysis

Benefits:

Use Case 3: IoT Sensor Analytics

Use Case 4: Fraud Detection

Use Case 5: Data Warehousing

Grouped Aggregations

Using with Date Dimensions

Combining with Other Aggregates

Exact vs Approximate Results

When to Use APPROX_COUNT_DISTINCT

Ideal scenarios.Use It When:

When NOT to Use It

Use exact counting instead:

Parallel Query Benefits

Partitioned Tables

Materialized Views and Approximation

Accuracy Expectations

Best Practices

Use for Analytics

Combine with Partitioning

Use Parallelism

Avoid for Financial Accuracy

Validate Accuracy

Common Mistakes

Mistake 1: Using in Auditing

Mistake 2: Expecting Exact Results

Mistake 3: Testing Only on Small Data

Key Advantages Summary

Final Thoughts

Share this:

What is `APPROX_COUNT_DISTINCT`?

Why Traditional `COUNT(DISTINCT)` is Expensive

How would `APPROX_COUNT_DISTINCT` Work

When to Use `APPROX_COUNT_DISTINCT`