Enhance Data Analysis with APPROX_COUNT_DISTINCT


Introduction

Oracle introduced approximate query processing features to help clients analyse massive datasets with dramatically better performance and lower resource consumption.

One of the most useful functions in this category is: APPROX_COUNT_DISTINCT, this function provides an approximate count of unique values in a dataset while using significantly less memory and CPU than a traditional COUNT(DISTINCT ...).

For modern analytics, data warehousing, telemetry, clickstream processing, IoT, fraud analytics, and real-time dashboards, this feature can reduce query execution time from minutes to seconds.

What is APPROX_COUNT_DISTINCT?

APPROX_COUNT_DISTINCT is an Oracle aggregate function that estimates the number of distinct values in a column.

Traditional distinct counting:

COUNT(DISTINCT customer_id)
select COUNT(DISTINCT emp_id) from emp;

Approximate distinct counting:

select APPROX_COUNT_DISTINCT(emp_id) from emp;
A SQL query interface displaying two queries: one for counting distinct employee IDs and another for approximating that count. The result shows an approximate count of 3.

The approximate version trades a very small amount of accuracy for major gains in:

  • Speed
  • Scalability
  • Memory efficiency
  • Parallel execution performance

Why Traditional COUNT(DISTINCT) is Expensive

Distinct counting is computationally expensive because typically Oracle must:

  1. Read all rows
  2. Sort or hash unique values
  3. Eliminate duplicates
  4. Maintain large memory structures

On very large datasets (billions of rows), this can cause:

  • High PGA memory usage
  • Temp tablespace spills
  • Long execution times
  • Heavy CPU utilization

Example:

SELECT COUNT(DISTINCT customer_id)
FROM sales_transactions;

If the table contains 5 billion rows, Oracle may need substantial sorting and hashing resources.

How would APPROX_COUNT_DISTINCT Work

Based on own deduction I would suspect that internally, Oracle uses probabilistic algorithms based on variants of:

  • HyperLogLog (HLL)
  • Sketch-based cardinality estimation

Instead of storing every distinct value, Oracle stores a compact statistical representation (“sketch”) of the data.

This allows Oracle to estimate cardinality with very small error rates.

Typical error rate:

  • Around 1–2%
  • Often even lower in practical workloads

Disclaimer: please note that the statements in this paragraph are solely baed on own deduction bout how I would implement a similar functionality and not any source related o connected in any way to Oracle Corp.

Basic Syntax

APPROX_COUNT_DISTINCT(expression)

Example:

SELECT APPROX_COUNT_DISTINCT(customer_id)
FROM orders;
SQL query result showing the approximate count of distinct employee IDs, with a result of 3.

Simple Example

Sample Table

CREATE TABLE website_visits (
visitor_id NUMBER,
page_name VARCHAR2(100)
);

Insert sample data:

INSERT INTO website_visits VALUES (101, 'HOME');
INSERT INTO website_visits VALUES (102, 'HOME');
INSERT INTO website_visits VALUES (103, 'PRODUCT');
INSERT INTO website_visits VALUES (101, 'CHECKOUT');
INSERT INTO website_visits VALUES (104, 'HOME');
COMMIT;
SQL script showing the creation of a 'website_visits' table, including data insertion commands and confirmation messages.

Exact Distinct Count:

SELECT COUNT(DISTINCT visitor_id)
FROM website_visits;
SQL query to count distinct visitor IDs from the website visits table, displaying the result of 4.

Approximate Distinct Count:

SELECT APPROX_COUNT_DISTINCT(visitor_id)
FROM website_visits;
Screenshot of a SQL query displaying the approximate count of distinct visitor IDs from a database table, showing the result as 4.

On small datasets, results are often exact.

Performance Comparison (based on observations)

Traditional Query:

SELECT COUNT(DISTINCT customer_id)
FROM large_sales;

Approximate Query

SELECT APPROX_COUNT_DISTINCT(customer_id)
FROM large_sales;

Typical improvements on huge datasets:

MetricCOUNT(DISTINCT)APPROX_COUNT_DISTINCT
CPU UsageHighLow
Memory UsageHighVery Low
Temp SpaceLargeMinimal
ScalabilityModerateExcellent
SpeedSlowerMuch Faster

Real-World Use Cases

Use Case 1: Website Analytics

Problem

A large e-commerce platform wants to know:

  • Unique daily visitors
  • Unique product viewers
  • Unique buyers

Exact precision is not necessary for dashboards.

Solution
SELECT
visit_date,
APPROX_COUNT_DISTINCT(visitor_id) AS unique_visitors
FROM web_logs
GROUP BY visit_date
ORDER BY visit_date;
Benefits
  • Near real-time reporting
  • Reduced query cost
  • Faster dashboard refreshes
  • Smaller development and testing costs

Use Case 2: Telecom Call Analysis

Telecom companies OLTP databases typically process billions of call records.

Requirements:

  • Estimate unique subscribers per tower
  • estimate usage per network element

Sample SQL:

SELECT
tower_id,
APPROX_COUNT_DISTINCT(subscriber_id)
FROM call_records
GROUP BY tower_id;
Benefits:
  • Fast network utilization analysis
  • Efficient operational dashboards

Use Case 3: IoT Sensor Analytics

Millions of devices stream events continuously.

Requirement:

  • Estimate unique active devices hourly

Sample SQL would look like:

SELECT
TRUNC(event_time, 'HH24') AS hour_slot,
APPROX_COUNT_DISTINCT(device_id)
FROM sensor_events
GROUP BY TRUNC(event_time, 'HH24');

Use Case 4: Fraud Detection

Banks, financial institutions monitor unique card usage patterns.

Sample SQL which could help identify unusual activity spikes quickly.

SELECT
merchant_id,
APPROX_COUNT_DISTINCT(card_number)
FROM card_transactions
GROUP BY merchant_id;

Use Case 5: Data Warehousing

Data warehouses often aggregate massive fact tables.

SELECT
region,
APPROX_COUNT_DISTINCT(customer_id)
FROM sales_fact
GROUP BY region;

This is extremely useful in Large star schema environments

Grouped Aggregations

APPROX_COUNT_DISTINCT works naturally with GROUP BY.

Example:

SELECT
department_id,
APPROX_COUNT_DISTINCT(employee_id)
FROM employee_attendance
GROUP BY department_id;

So, for my demo table:

A table showing employee names, departments, and salaries, with entries for HR, IT, and Finance.

I get:

SQL query output showing department names and approximate distinct employee counts for 'FINANCE', 'HR', and 'IT' departments.

Using with Date Dimensions

Useful for:

  • Monthly active users
  • Quarterly customer trends
  • Annual analytics

Sample:

SELECT
TRUNC(order_date, 'MM') AS sales_month,
APPROX_COUNT_DISTINCT(customer_id)
FROM sales
GROUP BY TRUNC(order_date, 'MM')
ORDER BY sales_month;

Combining with Other Aggregates

Sample:

SELECT
product_category,
COUNT(*) AS total_sales,
SUM(amount) AS revenue,
APPROX_COUNT_DISTINCT(customer_id) AS unique_customers
FROM sales
GROUP BY product_category;

Exact vs Approximate Results

For analytics workloads, we have measured 0.36%. This is usually acceptable.

When to Use APPROX_COUNT_DISTINCT

Ideal scenarios.Use It When:

  • Working with billions of rows
  • Building dashboards
  • Running exploratory analytics
  • Processing streaming data
  • Real-time reporting matters
  • Small estimation error is acceptable

When NOT to Use It

Avoid approximate counting for:

  • Financial reconciliation
  • Regulatory reporting
  • Auditing
  • Billing systems
  • Legal compliance reports
  • Scientific computations requiring precision
Use exact counting instead:
COUNT(DISTINCT ...)

Parallel Query Benefits

Example:

SELECT /*+ parallel(8) */
APPROX_COUNT_DISTINCT(customer_id)
FROM huge_sales_table;

Benefits:

  • Excellent scalability
  • Efficient distributed aggregation
  • Lower inter-process communication

Partitioned Tables

We consistently observed that it works very efficiently on partitioned fact tables. Example:

Sample SQL:

SELECT
sales_region,
APPROX_COUNT_DISTINCT(customer_id)
FROM sales_partitioned
WHERE sales_date >= DATE '2026-01-01'
GROUP BY sales_region;

Materialized Views and Approximation

Approximate aggregation can significantly improve refresh performance for analytical materialized views.

Accuracy Expectations

Approximation algorithms are designed for:

  • Statistical consistency
  • Stable error bounds
  • High scalability

In most practical environments:

  • Error rates are very low
  • Results are repeatable
  • Performance gains are dramatic

Best Practices

Use for Analytics

Best suited for BI/reporting workloads.


Combine with Partitioning

Partitioned tables improve performance further.


Use Parallelism

Avoid for Financial Accuracy

Use exact counting where precision is mandatory.

Validate Accuracy

Before production deployment:

  • Compare approximate vs exact results
  • Measure acceptable error thresholds

Common Mistakes

Mistake 1: Using in Auditing

Approximation should never replace precise audit reporting.


Mistake 2: Expecting Exact Results

Approximate means estimated!!


Mistake 3: Testing Only on Small Data

Benefits become visible on large datasets.

Key Advantages Summary

AdvantageBenefit
Faster executionBetter user experience
Lower memory usageReduced resource pressure
Scales to billions of rowsEnterprise-ready
Parallel-friendlyHigh throughput
Ideal for dashboardsReal-time analytics

Final Thoughts

APPROX_COUNT_DISTINCT is one of Oracle Database’s most powerful analytical optimization features for modern large-scale workloads.

It enables organizations to:

  • Analyze massive datasets quickly
  • Build responsive dashboards
  • Reduce infrastructure strain
  • Improve query scalability
  • Support near real-time analytics

While it should not replace exact counting in precision-critical systems, it is an outstanding solution for:

  • Data warehousing
  • Big data analytics
  • Streaming platforms
  • Operational dashboards
  • Customer behavior analysis
  • IoT and telemetry systems

For modern enterprise analytics, approximate query processing is no longer optional—it is becoming essential.



Discover more from Radu Pârvu

Subscribe to get the latest posts sent to your email.

, , , ,

Leave a Reply

Discover more from Radu Pârvu

Subscribe now to keep reading and get access to the full archive.

Continue reading

Discover more from Radu Pârvu

Subscribe now to keep reading and get access to the full archive.

Continue reading