PROWAREtech

articles » current » dot-net » kmeans-data-clustering

.NET: K-Means Clustering Algorithm

Clustering code designed to group numerical data into bands or ranges; written in C#.

See also: Unsupervised Machine Learning (Clustering)

This code implements a K-means clustering algorithm specifically designed to group numerical data (like prices) into bands or ranges. It works in logarithmic space to better handle data that may be exponentially distributed, which is common with prices and financial data.

The function takes a list of records (each containing some data and a metric value) and a parameter kBands (defaulting to 5) which determines how many price bands or clusters to create. It first transforms all the metric values into logarithmic space, which helps handle the wide range of values more effectively. The algorithm then initializes cluster centers (centroids) evenly spread between the minimum and maximum log values.

The core of the algorithm is an iterative process that runs up to 100 times or until convergence. In each iteration, it performs two main steps: first, it assigns each data point to its nearest centroid based on absolute distance. Then, it updates each centroid to be the average of all points assigned to it. This process continues until the centroids stop moving significantly (convergence) or the maximum iterations are reached. This implementation uses a fixed random seed (42) for reproducibility.

After clustering in log space, the code converts the results back to normal space and computes the actual minimum and maximum values for each cluster. It creates bands based on these ranges and sorts them by minimum value. Finally, it prints out each band's range and the count of items in that band before returning the list of bands. Each band is represented as a tuple containing its minimum and maximum values. This kind of clustering can be particularly useful for creating price brackets, salary bands, or any other numerical categorization where you want to group similar values together while accounting for exponential distribution patterns.


public static List<(double min, double max)> FindClustersKMeans(List<(object data, double metric)> records, int kBands = 5)
{
	var metrics = records.Select(r => r.metric).ToList();
	var logs = metrics.Select(p => Math.Log(p)).ToList();

	// Initialize centroids
	var random = new Random(42);
	var centroids = new List<double>();
	double min = logs.Min();
	double max = logs.Max();
	double step = (max - min) / (kBands - 1);

	for (int i = 0; i < kBands; i++)
		centroids.Add(min + i * step);

	// K-means iterations
	for (int iter = 0; iter < 100; iter++)
	{
		// Assign points to nearest centroid
		var clusters = new List<List<double>>();
		for (int i = 0; i < kBands; i++)
			clusters.Add(new List<double>());

		foreach (var price in logs)
		{
			int nearestCentroid = 0;
			double minDist = double.MaxValue;

			for (int i = 0; i < kBands; i++)
			{
				double dist = Math.Abs(price - centroids[i]);
				if (dist < minDist)
				{
					minDist = dist;
					nearestCentroid = i;
				}
			}

			clusters[nearestCentroid].Add(price);
		}

		// Update centroids
		bool changed = false;
		for (int i = 0; i < kBands; i++)
		{
			if (clusters[i].Any())
			{
				double newCentroid = clusters[i].Average();
				if (Math.Abs(newCentroid - centroids[i]) > 0.0001f)
				{
					changed = true;
					centroids[i] = newCentroid;
				}
			}
		}

		if (!changed)
			break;
	}

	// Convert back from log space and find min/max for each band
	var bands = new List<(double min, double max)>();
	var assignments = new int[metrics.Count];

	for (int i = 0; i < metrics.Count; i++)
	{
		double logPrice = Math.Log(metrics[i]);
		int nearestCentroid = 0;
		double minDist = double.MaxValue;

		for (int j = 0; j < kBands; j++)
		{
			double dist = Math.Abs(logPrice - centroids[j]);
			if (dist < minDist)
			{
				minDist = dist;
				nearestCentroid = j;
			}
		}
		assignments[i] = nearestCentroid;
	}

	// Create bands based on actual price ranges in each cluster
	for (int i = 0; i < kBands; i++)
	{
		var clusterPrices = metrics.Where((p, idx) => assignments[idx] == i);
		if (clusterPrices.Any())
		{
			bands.Add((
				min: clusterPrices.Min(),
				max: clusterPrices.Max()
			));
		}
	}

	// Sort bands by min price
	bands = bands.OrderBy(b => b.min).ToList();

	return bands;
}

PROWAREtech

Hello there! How can I help you today?
Ask any question

PROWAREtech

This site uses cookies. Cookies are simple text files stored on the user's computer. They are used for adding features and security to this site. Read the privacy policy.
ACCEPT REJECT