Introduction
Oracle introduced approximate query processing features to help clients analyse massive datasets with dramatically better performance and lower resource consumption.
One of the most useful functions in this category is: APPROX_COUNT_DISTINCT, this function provides an approximate count of unique values in a dataset while using significantly less memory and CPU than a traditional COUNT(DISTINCT ...).
For modern analytics, data warehousing, telemetry, clickstream processing, IoT, fraud analytics, and real-time dashboards, this feature can reduce query execution time from minutes to seconds.
What is APPROX_COUNT_DISTINCT?
APPROX_COUNT_DISTINCT is an Oracle aggregate function that estimates the number of distinct values in a column.
Traditional distinct counting:
COUNT(DISTINCT customer_id)
select COUNT(DISTINCT emp_id) from emp;
Approximate distinct counting:
select APPROX_COUNT_DISTINCT(emp_id) from emp;
The approximate version trades a very small amount of accuracy for major gains in:
- Speed
- Scalability
- Memory efficiency
- Parallel execution performance
Why Traditional COUNT(DISTINCT) is Expensive
Distinct counting is computationally expensive because typically Oracle must:
- Read all rows
- Sort or hash unique values
- Eliminate duplicates
- Maintain large memory structures
On very large datasets (billions of rows), this can cause:
- High PGA memory usage
- Temp tablespace spills
- Long execution times
- Heavy CPU utilization
Example:
SELECT COUNT(DISTINCT customer_id)FROM sales_transactions;
If the table contains 5 billion rows, Oracle may need substantial sorting and hashing resources.
How would APPROX_COUNT_DISTINCT Work
Based on own deduction I would suspect that internally, Oracle uses probabilistic algorithms based on variants of:
- HyperLogLog (HLL)
- Sketch-based cardinality estimation
Instead of storing every distinct value, Oracle stores a compact statistical representation (“sketch”) of the data.
This allows Oracle to estimate cardinality with very small error rates.
Typical error rate:
- Around 1–2%
- Often even lower in practical workloads
Disclaimer: please note that the statements in this paragraph are solely baed on own deduction bout how I would implement a similar functionality and not any source related o connected in any way to Oracle Corp.
Basic Syntax
APPROX_COUNT_DISTINCT(expression)
Example:
SELECT APPROX_COUNT_DISTINCT(customer_id)FROM orders;
Simple Example
Sample Table
CREATE TABLE website_visits ( visitor_id NUMBER, page_name VARCHAR2(100));
Insert sample data:
INSERT INTO website_visits VALUES (101, 'HOME');INSERT INTO website_visits VALUES (102, 'HOME');INSERT INTO website_visits VALUES (103, 'PRODUCT');INSERT INTO website_visits VALUES (101, 'CHECKOUT');INSERT INTO website_visits VALUES (104, 'HOME');COMMIT;
Exact Distinct Count:
SELECT COUNT(DISTINCT visitor_id)FROM website_visits;
Approximate Distinct Count:
SELECT APPROX_COUNT_DISTINCT(visitor_id)FROM website_visits;
On small datasets, results are often exact.
Performance Comparison (based on observations)
Traditional Query:
SELECT COUNT(DISTINCT customer_id)FROM large_sales;
Approximate Query
SELECT APPROX_COUNT_DISTINCT(customer_id)FROM large_sales;
Typical improvements on huge datasets:
| Metric | COUNT(DISTINCT) | APPROX_COUNT_DISTINCT |
|---|---|---|
| CPU Usage | High | Low |
| Memory Usage | High | Very Low |
| Temp Space | Large | Minimal |
| Scalability | Moderate | Excellent |
| Speed | Slower | Much Faster |
Real-World Use Cases
Use Case 1: Website Analytics
Problem
A large e-commerce platform wants to know:
- Unique daily visitors
- Unique product viewers
- Unique buyers
Exact precision is not necessary for dashboards.
Solution
SELECT visit_date, APPROX_COUNT_DISTINCT(visitor_id) AS unique_visitorsFROM web_logsGROUP BY visit_dateORDER BY visit_date;
Benefits
- Near real-time reporting
- Reduced query cost
- Faster dashboard refreshes
- Smaller development and testing costs
Use Case 2: Telecom Call Analysis
Telecom companies OLTP databases typically process billions of call records.
Requirements:
- Estimate unique subscribers per tower
- estimate usage per network element
Sample SQL:
SELECT tower_id, APPROX_COUNT_DISTINCT(subscriber_id)FROM call_recordsGROUP BY tower_id;
Benefits:
- Fast network utilization analysis
- Efficient operational dashboards
Use Case 3: IoT Sensor Analytics
Millions of devices stream events continuously.
Requirement:
- Estimate unique active devices hourly
Sample SQL would look like:
SELECT TRUNC(event_time, 'HH24') AS hour_slot, APPROX_COUNT_DISTINCT(device_id)FROM sensor_eventsGROUP BY TRUNC(event_time, 'HH24');
Use Case 4: Fraud Detection
Banks, financial institutions monitor unique card usage patterns.
Sample SQL which could help identify unusual activity spikes quickly.
SELECT merchant_id, APPROX_COUNT_DISTINCT(card_number)FROM card_transactionsGROUP BY merchant_id;
Use Case 5: Data Warehousing
Data warehouses often aggregate massive fact tables.
SELECT region, APPROX_COUNT_DISTINCT(customer_id)FROM sales_factGROUP BY region;
This is extremely useful in Large star schema environments
Grouped Aggregations
APPROX_COUNT_DISTINCT works naturally with GROUP BY.
Example:
SELECT department_id, APPROX_COUNT_DISTINCT(employee_id)FROM employee_attendanceGROUP BY department_id;
So, for my demo table:
I get:
Using with Date Dimensions
Useful for:
- Monthly active users
- Quarterly customer trends
- Annual analytics
Sample:
SELECT TRUNC(order_date, 'MM') AS sales_month, APPROX_COUNT_DISTINCT(customer_id)FROM salesGROUP BY TRUNC(order_date, 'MM')ORDER BY sales_month;
Combining with Other Aggregates
Sample:
SELECT product_category, COUNT(*) AS total_sales, SUM(amount) AS revenue, APPROX_COUNT_DISTINCT(customer_id) AS unique_customersFROM salesGROUP BY product_category;
Exact vs Approximate Results
For analytics workloads, we have measured 0.36%. This is usually acceptable.
When to Use APPROX_COUNT_DISTINCT
Ideal scenarios.Use It When:
- Working with billions of rows
- Building dashboards
- Running exploratory analytics
- Processing streaming data
- Real-time reporting matters
- Small estimation error is acceptable
When NOT to Use It
Avoid approximate counting for:
- Financial reconciliation
- Regulatory reporting
- Auditing
- Billing systems
- Legal compliance reports
- Scientific computations requiring precision
Use exact counting instead:
COUNT(DISTINCT ...)
Parallel Query Benefits
Example:
SELECT /*+ parallel(8) */ APPROX_COUNT_DISTINCT(customer_id)FROM huge_sales_table;
Benefits:
- Excellent scalability
- Efficient distributed aggregation
- Lower inter-process communication
Partitioned Tables
We consistently observed that it works very efficiently on partitioned fact tables. Example:
Sample SQL:
SELECT sales_region, APPROX_COUNT_DISTINCT(customer_id)FROM sales_partitionedWHERE sales_date >= DATE '2026-01-01'GROUP BY sales_region;
Materialized Views and Approximation
Approximate aggregation can significantly improve refresh performance for analytical materialized views.
Accuracy Expectations
Approximation algorithms are designed for:
- Statistical consistency
- Stable error bounds
- High scalability
In most practical environments:
- Error rates are very low
- Results are repeatable
- Performance gains are dramatic
Best Practices
Use for Analytics
Best suited for BI/reporting workloads.
Combine with Partitioning
Partitioned tables improve performance further.
Use Parallelism
Avoid for Financial Accuracy
Use exact counting where precision is mandatory.
Validate Accuracy
Before production deployment:
- Compare approximate vs exact results
- Measure acceptable error thresholds
Common Mistakes
Mistake 1: Using in Auditing
Approximation should never replace precise audit reporting.
Mistake 2: Expecting Exact Results
Approximate means estimated!!
Mistake 3: Testing Only on Small Data
Benefits become visible on large datasets.
Key Advantages Summary
| Advantage | Benefit |
|---|---|
| Faster execution | Better user experience |
| Lower memory usage | Reduced resource pressure |
| Scales to billions of rows | Enterprise-ready |
| Parallel-friendly | High throughput |
| Ideal for dashboards | Real-time analytics |
Final Thoughts
APPROX_COUNT_DISTINCT is one of Oracle Database’s most powerful analytical optimization features for modern large-scale workloads.
It enables organizations to:
- Analyze massive datasets quickly
- Build responsive dashboards
- Reduce infrastructure strain
- Improve query scalability
- Support near real-time analytics
While it should not replace exact counting in precision-critical systems, it is an outstanding solution for:
- Data warehousing
- Big data analytics
- Streaming platforms
- Operational dashboards
- Customer behavior analysis
- IoT and telemetry systems
For modern enterprise analytics, approximate query processing is no longer optional—it is becoming essential.

