PROWAREtech
.NET: K-Means Clustering Algorithm
See also: Unsupervised Machine Learning (Clustering)
This code implements a K-means clustering algorithm specifically designed to group numerical data (like prices) into bands or ranges. It works in logarithmic space to better handle data that may be exponentially distributed, which is common with prices and financial data.
The function takes a list of records (each containing some data and a metric value) and a parameter kBands
(defaulting to 5) which determines how many price bands or clusters to create. It first transforms all the metric values into logarithmic space, which helps handle the wide range of values more effectively. The algorithm then initializes cluster centers (centroids) evenly spread between the minimum and maximum log values.
The core of the algorithm is an iterative process that runs up to 100 times or until convergence. In each iteration, it performs two main steps: first, it assigns each data point to its nearest centroid based on absolute distance. Then, it updates each centroid to be the average of all points assigned to it. This process continues until the centroids stop moving significantly (convergence) or the maximum iterations are reached. This implementation uses a fixed random seed (42) for reproducibility.
After clustering in log space, the code converts the results back to normal space and computes the actual minimum and maximum values for each cluster. It creates bands based on these ranges and sorts them by minimum value. Finally, it prints out each band's range and the count of items in that band before returning the list of bands. Each band is represented as a tuple containing its minimum and maximum values. This kind of clustering can be particularly useful for creating price brackets, salary bands, or any other numerical categorization where you want to group similar values together while accounting for exponential distribution patterns.
public static List<(double min, double max)> FindClustersKMeans(List<(object data, double metric)> records, int kBands = 5)
{
var metrics = records.Select(r => r.metric).ToList();
var logs = metrics.Select(p => Math.Log(p)).ToList();
// Initialize centroids
var random = new Random(42);
var centroids = new List<double>();
double min = logs.Min();
double max = logs.Max();
double step = (max - min) / (kBands - 1);
for (int i = 0; i < kBands; i++)
centroids.Add(min + i * step);
// K-means iterations
for (int iter = 0; iter < 100; iter++)
{
// Assign points to nearest centroid
var clusters = new List<List<double>>();
for (int i = 0; i < kBands; i++)
clusters.Add(new List<double>());
foreach (var price in logs)
{
int nearestCentroid = 0;
double minDist = double.MaxValue;
for (int i = 0; i < kBands; i++)
{
double dist = Math.Abs(price - centroids[i]);
if (dist < minDist)
{
minDist = dist;
nearestCentroid = i;
}
}
clusters[nearestCentroid].Add(price);
}
// Update centroids
bool changed = false;
for (int i = 0; i < kBands; i++)
{
if (clusters[i].Any())
{
double newCentroid = clusters[i].Average();
if (Math.Abs(newCentroid - centroids[i]) > 0.0001f)
{
changed = true;
centroids[i] = newCentroid;
}
}
}
if (!changed)
break;
}
// Convert back from log space and find min/max for each band
var bands = new List<(double min, double max)>();
var assignments = new int[metrics.Count];
for (int i = 0; i < metrics.Count; i++)
{
double logPrice = Math.Log(metrics[i]);
int nearestCentroid = 0;
double minDist = double.MaxValue;
for (int j = 0; j < kBands; j++)
{
double dist = Math.Abs(logPrice - centroids[j]);
if (dist < minDist)
{
minDist = dist;
nearestCentroid = j;
}
}
assignments[i] = nearestCentroid;
}
// Create bands based on actual price ranges in each cluster
for (int i = 0; i < kBands; i++)
{
var clusterPrices = metrics.Where((p, idx) => assignments[idx] == i);
if (clusterPrices.Any())
{
bands.Add((
min: clusterPrices.Min(),
max: clusterPrices.Max()
));
}
}
// Sort bands by min price
bands = bands.OrderBy(b => b.min).ToList();
return bands;
}