PROWAREtech
.NET: Remove Data Outliers for Better Data
Remove data outliers using the Interquartile Range (IQR) method with logarithmic transformation; written in C#.
This code removes outliers from a dataset using the Interquartile Range (IQR) method, applied to the log-transformed values rather than to the raw values.
- The function takes two parameters:
- A list of records, where each record has some data and an associated numeric metric
- An iqrMultiplier (defaulting to 1.5) which controls how aggressive the outlier removal is
- The code applies a log10 transformation to all metric values. This is useful when dealing with data that:
- Has a wide range of values
- Follows a log-normal distribution
- Contains only positive numbers
- It then calculates three important statistical measures in log space:
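- Q1 (First quartile): The value 25% of the way through the sorted data (the element at index count / 4, with no interpolation)
- Q3 (Third quartile): The value 75% of the way through the sorted data (the element at index 3 × count / 4, with no interpolation)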
- IQR (Interquartile Range): The difference between Q3 and Q1
- Using these measures, it calculates bounds for what constitutes an outlier:
- Lower bound = Q1 - (iqrMultiplier × IQR)
- Upper bound = Q3 + (iqrMultiplier × IQR)
- These bounds are then transformed back from log space to normal space using 10^x
- Finally, it filters the original dataset to keep only records whose metrics fall within these bounds
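To see what the transformation back from log space does, the bounds can be written out directly. In this sketch, q1 and q3 are the normal-space values at the quartile positions and k stands for iqrMultiplier; these symbols and the sample numbers are chosen here purely for illustration.

```latex
\begin{aligned}
\text{lower} &= 10^{\,\log_{10} q_1 - k\,(\log_{10} q_3 - \log_{10} q_1)} = q_1 \left(\frac{q_1}{q_3}\right)^{k} \\
\text{upper} &= 10^{\,\log_{10} q_3 + k\,(\log_{10} q_3 - \log_{10} q_1)} = q_3 \left(\frac{q_3}{q_1}\right)^{k} \\[4pt]
\text{e.g. } q_1 = 10,\ q_3 = 100,\ k = 1.5:\quad
\text{lower} &= 10 \cdot 0.1^{1.5} \approx 0.316, \qquad
\text{upper} = 100 \cdot 10^{1.5} \approx 3162.28
\end{aligned}
```

In other words, the bounds are multiplicative in normal space: they scale the quartiles by a power of the ratio q3 / q1 instead of adding a fixed distance, which is why the method suits data spanning several orders of magnitude.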
// Requires: using System; using System.Collections.Generic; using System.Linq;
// A larger iqrMultiplier widens the bounds and removes fewer records; a smaller one removes more
public static List<(object data, double metric)> RemoveOutliers(List<(object data, double metric)> records, float iqrMultiplier = 1.5f)
{
    // log10 is undefined for zero and negative values, so only positive metrics are used for the quartiles
    var logs = records.Where(r => r.metric > 0).Select(r => Math.Log10(r.metric)).OrderBy(p => p).ToList();
    if (logs.Count == 0)
        return new List<(object data, double metric)>(); // no usable values to base the bounds on

    // Calculate Q1, Q3 and IQR for the log-transformed metrics
    int q1Index = logs.Count / 4;
    int q3Index = (3 * logs.Count) / 4;
    double logQ1 = logs[q1Index];
    double logQ3 = logs[q3Index];
    double logIqr = logQ3 - logQ1;

    // Place the bounds [iqrMultiplier] IQRs below Q1 and above Q3 in log space
    double logLowerBound = logQ1 - (iqrMultiplier * logIqr);
    double logUpperBound = logQ3 + (iqrMultiplier * logIqr);

    // Convert bounds back to normal space
    double lowerBound = Math.Pow(10, logLowerBound);
    double upperBound = Math.Pow(10, logUpperBound);

    // Keep only the records whose metric lies within the bounds
    var filtered = records.Where(r => r.metric >= lowerBound && r.metric <= upperBound).ToList();
    return filtered;
}
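A short usage sketch may help show the function in action. The class name OutlierDemo, the Run helper, and the sample records (item labels paired with prices) are hypothetical and exist only for this illustration; the method body is a condensed copy of the core logic of RemoveOutliers, repeated so the sample compiles on its own.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class OutlierDemo
{
    // Condensed copy of the RemoveOutliers logic, repeated here so the sample is self-contained.
    public static List<(object data, double metric)> RemoveOutliers(List<(object data, double metric)> records, float iqrMultiplier = 1.5f)
    {
        var logs = records.Select(r => Math.Log10(r.metric)).OrderBy(p => p).ToList();
        double logQ1 = logs[logs.Count / 4];
        double logQ3 = logs[(3 * logs.Count) / 4];
        double logIqr = logQ3 - logQ1;
        double lowerBound = Math.Pow(10, logQ1 - (iqrMultiplier * logIqr));
        double upperBound = Math.Pow(10, logQ3 + (iqrMultiplier * logIqr));
        return records.Where(r => r.metric >= lowerBound && r.metric <= upperBound).ToList();
    }

    // Builds hypothetical sample data and returns the records that survive filtering.
    public static List<(object data, double metric)> Run()
    {
        var records = new List<(object data, double metric)>
        {
            ("A", 95.0), ("B", 100.0), ("C", 105.0), ("D", 110.0),
            ("E", 120.0), ("F", 125.0), ("G", 130.0), ("H", 5000.0),
        };
        return RemoveOutliers(records);
    }

    public static void Main()
    {
        // Print each record that was kept
        foreach (var (data, metric) in Run())
            Console.WriteLine($"{data}: {metric}");
    }
}
```

With these eight values, Q1 is taken at index 2 (105) and Q3 at index 6 (130), which gives bounds of roughly 76 to 179 in normal space. Records A through G are kept and H (5000) is removed.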