Site icon Radu Pârvu

Enhance Data Analysis with APPROX_COUNT_DISTINCT

Introduction

Oracle introduced approximate query processing features to help clients analyse massive datasets with dramatically better performance and lower resource consumption.

One of the most useful functions in this category is: APPROX_COUNT_DISTINCT, this function provides an approximate count of unique values in a dataset while using significantly less memory and CPU than a traditional COUNT(DISTINCT ...).

For modern analytics, data warehousing, telemetry, clickstream processing, IoT, fraud analytics, and real-time dashboards, this feature can reduce query execution time from minutes to seconds.

What is APPROX_COUNT_DISTINCT?

APPROX_COUNT_DISTINCT is an Oracle aggregate function that estimates the number of distinct values in a column.

Traditional distinct counting:

COUNT(DISTINCT customer_id)
select COUNT(DISTINCT emp_id) from emp;

Approximate distinct counting:

select APPROX_COUNT_DISTINCT(emp_id) from emp;

The approximate version trades a very small amount of accuracy for major gains in:

Why Traditional COUNT(DISTINCT) is Expensive

Distinct counting is computationally expensive because typically Oracle must:

  1. Read all rows
  2. Sort or hash unique values
  3. Eliminate duplicates
  4. Maintain large memory structures

On very large datasets (billions of rows), this can cause:

Example:

SELECT COUNT(DISTINCT customer_id)
FROM sales_transactions;

If the table contains 5 billion rows, Oracle may need substantial sorting and hashing resources.

How would APPROX_COUNT_DISTINCT Work

Based on own deduction I would suspect that internally, Oracle uses probabilistic algorithms based on variants of:

Instead of storing every distinct value, Oracle stores a compact statistical representation (“sketch”) of the data.

This allows Oracle to estimate cardinality with very small error rates.

Typical error rate:

Disclaimer: please note that the statements in this paragraph are solely baed on own deduction bout how I would implement a similar functionality and not any source related o connected in any way to Oracle Corp.

Basic Syntax

APPROX_COUNT_DISTINCT(expression)

Example:

SELECT APPROX_COUNT_DISTINCT(customer_id)
FROM orders;

Simple Example

Sample Table

CREATE TABLE website_visits (
visitor_id NUMBER,
page_name VARCHAR2(100)
);

Insert sample data:

INSERT INTO website_visits VALUES (101, 'HOME');
INSERT INTO website_visits VALUES (102, 'HOME');
INSERT INTO website_visits VALUES (103, 'PRODUCT');
INSERT INTO website_visits VALUES (101, 'CHECKOUT');
INSERT INTO website_visits VALUES (104, 'HOME');
COMMIT;

Exact Distinct Count:

SELECT COUNT(DISTINCT visitor_id)
FROM website_visits;

Approximate Distinct Count:

SELECT APPROX_COUNT_DISTINCT(visitor_id)
FROM website_visits;

On small datasets, results are often exact.

Performance Comparison (based on observations)

Traditional Query:

SELECT COUNT(DISTINCT customer_id)
FROM large_sales;

Approximate Query

SELECT APPROX_COUNT_DISTINCT(customer_id)
FROM large_sales;

Typical improvements on huge datasets:

MetricCOUNT(DISTINCT)APPROX_COUNT_DISTINCT
CPU UsageHighLow
Memory UsageHighVery Low
Temp SpaceLargeMinimal
ScalabilityModerateExcellent
SpeedSlowerMuch Faster

Real-World Use Cases

Use Case 1: Website Analytics

Problem

A large e-commerce platform wants to know:

Exact precision is not necessary for dashboards.

Solution
SELECT
visit_date,
APPROX_COUNT_DISTINCT(visitor_id) AS unique_visitors
FROM web_logs
GROUP BY visit_date
ORDER BY visit_date;
Benefits

Use Case 2: Telecom Call Analysis

Telecom companies OLTP databases typically process billions of call records.

Requirements:

Sample SQL:

SELECT
tower_id,
APPROX_COUNT_DISTINCT(subscriber_id)
FROM call_records
GROUP BY tower_id;
Benefits:

Use Case 3: IoT Sensor Analytics

Millions of devices stream events continuously.

Requirement:

Sample SQL would look like:

SELECT
TRUNC(event_time, 'HH24') AS hour_slot,
APPROX_COUNT_DISTINCT(device_id)
FROM sensor_events
GROUP BY TRUNC(event_time, 'HH24');

Use Case 4: Fraud Detection

Banks, financial institutions monitor unique card usage patterns.

Sample SQL which could help identify unusual activity spikes quickly.

SELECT
merchant_id,
APPROX_COUNT_DISTINCT(card_number)
FROM card_transactions
GROUP BY merchant_id;

Use Case 5: Data Warehousing

Data warehouses often aggregate massive fact tables.

SELECT
region,
APPROX_COUNT_DISTINCT(customer_id)
FROM sales_fact
GROUP BY region;

This is extremely useful in Large star schema environments

Grouped Aggregations

APPROX_COUNT_DISTINCT works naturally with GROUP BY.

Example:

SELECT
department_id,
APPROX_COUNT_DISTINCT(employee_id)
FROM employee_attendance
GROUP BY department_id;

So, for my demo table:

I get:

Using with Date Dimensions

Useful for:

Sample:

SELECT
TRUNC(order_date, 'MM') AS sales_month,
APPROX_COUNT_DISTINCT(customer_id)
FROM sales
GROUP BY TRUNC(order_date, 'MM')
ORDER BY sales_month;

Combining with Other Aggregates

Sample:

SELECT
product_category,
COUNT(*) AS total_sales,
SUM(amount) AS revenue,
APPROX_COUNT_DISTINCT(customer_id) AS unique_customers
FROM sales
GROUP BY product_category;

Exact vs Approximate Results

For analytics workloads, we have measured 0.36%. This is usually acceptable.

When to Use APPROX_COUNT_DISTINCT

Ideal scenarios.Use It When:

When NOT to Use It

Avoid approximate counting for:

Use exact counting instead:
COUNT(DISTINCT ...)

Parallel Query Benefits

Example:

SELECT /*+ parallel(8) */
APPROX_COUNT_DISTINCT(customer_id)
FROM huge_sales_table;

Benefits:

Partitioned Tables

We consistently observed that it works very efficiently on partitioned fact tables. Example:

Sample SQL:

SELECT
sales_region,
APPROX_COUNT_DISTINCT(customer_id)
FROM sales_partitioned
WHERE sales_date >= DATE '2026-01-01'
GROUP BY sales_region;

Materialized Views and Approximation

Approximate aggregation can significantly improve refresh performance for analytical materialized views.

Accuracy Expectations

Approximation algorithms are designed for:

In most practical environments:

Best Practices

Use for Analytics

Best suited for BI/reporting workloads.


Combine with Partitioning

Partitioned tables improve performance further.


Use Parallelism

Avoid for Financial Accuracy

Use exact counting where precision is mandatory.

Validate Accuracy

Before production deployment:


Common Mistakes

Mistake 1: Using in Auditing

Approximation should never replace precise audit reporting.


Mistake 2: Expecting Exact Results

Approximate means estimated!!


Mistake 3: Testing Only on Small Data

Benefits become visible on large datasets.

Key Advantages Summary

AdvantageBenefit
Faster executionBetter user experience
Lower memory usageReduced resource pressure
Scales to billions of rowsEnterprise-ready
Parallel-friendlyHigh throughput
Ideal for dashboardsReal-time analytics

Final Thoughts

APPROX_COUNT_DISTINCT is one of Oracle Database’s most powerful analytical optimization features for modern large-scale workloads.

It enables organizations to:

While it should not replace exact counting in precision-critical systems, it is an outstanding solution for:

For modern enterprise analytics, approximate query processing is no longer optional—it is becoming essential.


Exit mobile version