PROWAREtech
.NET: CNN v2.0 for Supervised Deep Learning Example
This convolutional neural network uses backpropagation to learn its kernels, a.k.a. filters. For a convolutional neural network where the programmer chooses the kernels, see this article, which is good for simpler CNNs such as those used for the MNIST database of handwritten digits.
Download
Download these files, including example training code and the MNIST image files with labels: NEURALNETWORK.zip. This is an ongoing project, so check back for updates.
Project Overview
This code implements a modern convolutional neural network (CNN) framework in C# that supports both standard feed-forward networks and complex CNN architectures. It's designed for high performance with SIMD vectorization, parallel processing, and memory optimization techniques.
The framework includes comprehensive support for various layer types including convolutional (both 1D and 2D), pooling, batch normalization, dropout, dense/fully-connected, various activation functions like LeakyReLU and ELU (Exponential Linear Unit), and Gated Recurrent Unit (GRU) layers. It also features parallel convolutional blocks and embedding layers for processing sequential data. Each layer type implements both forward propagation and backpropagation with gradient computation.
The implementation uses the ADAM optimizer with momentum for training and includes advanced features like gradient clipping, batch size correction, and dynamic learning rate scheduling. The framework handles different loss functions including Mean Squared Error (MSE), Root Mean Squared Error (RMSE), Binary Cross-Entropy, and Softmax Cross-Entropy, making it suitable for both regression and classification tasks.
The code includes robust model persistence capabilities, allowing trained networks to be saved to and loaded from files using Brotli compression. It's built with production use in mind, featuring extensive error handling, proper resource management, and optimization techniques like cache-friendly blocking and thread-local gradient accumulation.
LeakyReLU-to-ELU Activation Comparison
- Complexity and Computation:
- LeakyReLU: Simple and fast, just a linear operation in both regions.
- ELU: More computationally complex due to the exponential function.
- Gradient Flow in Negative Region:
- LeakyReLU: Provides a constant, small gradient for negative inputs.
- ELU: Provides a varying gradient that tends toward zero for very negative inputs, but crucially remains nonzero, potentially enabling richer dynamics.
- Zero-Centering of Activations:
- LeakyReLU: Does not enforce zero mean activations.
- ELU: Naturally shifts mean activations closer to zero, which often helps optimization.
- Practical Performance:
- Empirical results often find ELU can outperform ReLU and sometimes LeakyReLU, especially in deeper networks, due to better gradient properties and zero-mean shifting. However, this gain might come at a slight computational cost.
- LeakyReLU still offers benefits over ReLU with minimal overhead and complexity. It’s a “low-hanging fruit” improvement with virtually no trade-offs apart from a small hyperparameter.
Here are graphs showing LeakyReLU with different alpha values of 0.01, 0.1 and 0.2, and ELU with different alpha values of 0.3, 1.0 and 3.0:
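For reference, both activations reduce to simple scalar formulas. Below is a minimal sketch of the forward functions and their derivatives in C# (an illustration only, not the framework's internal, vectorized implementation):
// Scalar sketches of the two activations and their derivatives (illustration only).
static float LeakyReLU(float x, float alpha) => x > 0f ? x : alpha * x;
static float LeakyReLUDerivative(float x, float alpha) => x > 0f ? 1f : alpha;
static float ELU(float x, float alpha) => x > 0f ? x : alpha * (MathF.Exp(x) - 1f);
static float ELUDerivative(float x, float alpha) => x > 0f ? 1f : alpha * MathF.Exp(x); // equals ELU(x) + alpha for x <= 0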
GRU Cell Flow Diagram
Public Instance Methods
AddEmbeddingLayer(int vocabSize, int embeddingDim)
Add2DConvolutionLayer(int filterSize, int filterCount, int stride, int? padding)
Add1DConvolutionLayer(int filterSize, int filterCount, int stride, int? padding)
AddParallel2DConvolutionBlockLayer(List<(int filterSize, int filterCount, int stride, int? padding)> branchConfigs, bool useBatchNormalization, float alpha, Pooling pooling, int maxPoolHeight, int maxPoolWidth)
AddParallel1DConvolutionBlockLayer(List<(int filterSize, int filterCount, int stride, int? padding)> branchConfigs, bool useBatchNormalization, float alpha, Pooling pooling, int maxPoolSize)
AddGlobalMaxPoolingLayer()
AddGlobalAveragePoolingLayer()
AddGRULayer(int hiddenSize, bool useLayerNormalization, bool returnSequences, float dropoutRate = 0.0f)
AddBiGRULayer(int hiddenSize, BiGRUMergeMode mergeMode = BiGRUMergeMode.Concat, bool useLayerNormalization = false, bool returnSequences = true, float dropoutRate = 0.0f)
AddBatchNormLayer(int depth)
Add2DMaxPoolingLayer(int poolHeight, int poolWidth)
Add1DMaxPoolingLayer(int size)
AddDropoutLayer(float rate = 0.2f)
AddELULayer(float alpha)
AddLeakyReLULayer(float alpha)
AddSoftmaxOutputLayer()
AddSigmoidOutputLayer(LossFunction lossFunction)
AddLinearOutputLayer(LossFunction lossFunction)
AddFullyConnectedLayer(int outputNodes, bool isFinalFCLayer = false)
AddFinalFullyConnectedLayer(int outputNodes)
SaveModelToFile(string filePath)
TrainBatch(List<(float[] inputs, float[] targets)> batchInputs, float learningRate, int threadCount, float clipThreshold = 5.0f)
TrainBatch(List<(float[][] inputs, float[] targets)> batchInputs, float learningRate, int threadCount, float clipThreshold = 5.0f)
TrainBatch(List<(float[][][] inputs, float[] targets)> batchInputs, float learningRate, int threadCount, float clipThreshold = 5.0f)
Train(List<(float[] inputs, float[] targets)> allTrainingInputs, float learningRate, int threadCount, float clipThreshold = 5.0f, int epochCount = 10, bool shuffle = true, Action<float, int, int>? progress = null)
Train(List<(float[][] inputs, float[] targets)> allTrainingInputs, float learningRate, int threadCount, float clipThreshold = 5.0f, int epochCount = 10, bool shuffle = true, Action<float, int, int>? progress = null)
Train(List<(float[][][] inputs, float[] targets)> allTrainingInputs, float learningRate, int threadCount, float clipThreshold = 5.0f, int epochCount = 10, bool shuffle = true, Action<float, int, int>? progress = null)
Predict(float[] input)
Predict(float[][] input)
Predict(float[][][] input)
PredictBatch(List<float[]> inputs)
PredictBatch(List<float[][]> inputs)
PredictBatch(List<float[][][]> inputs)
PredictClassIndex(float[] input)
PredictClassIndex(float[][] input)
PredictClassIndex(float[][][] input)
CalculateAccuracy(List<(float[], float[])> inputs)
CalculateAccuracy(List<(float[][], float[])> inputs)
CalculateAccuracy(List<(float[][][], float[])> inputs)
Public Static Methods
CreateNetwork(int batchSize, int sequenceLength, bool useCuda = false, float? beta1 = null, float? beta2 = null, float? epsilon = null)
CreateNetwork(int batchSize, int inputWidth, int inputDepth, bool useCuda = false, float? beta1 = null, float? beta2 = null, float? epsilon = null)
CreateNetwork(int batchSize, int inputHeight, int inputWidth, int inputDepth, bool useCuda = false, float? beta1 = null, float? beta2 = null, float? epsilon = null)
GetOptimumBatchSize(int desiredBatchSize)
GetOptimumThreadCount()
GetWarmupThenDecayLearningRate(float baseLearningRate, int batchIndex, int totalBatchesPerEpoch, int epochIndex, int totalEpochs, int decayStartEpochIndex, float warmupMinLearningRate, float decayEndLearningRate, int warmupBatches, string schedule)
LoadModelFromFile(string filePath)
TrueClassIndex(float[] inputOneHotEncoding)
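A typical call sequence using these methods might look like the sketch below. It assumes 28x28 single-channel images already loaded and normalized, that LoadModelFromFile returns the reconstructed network, and that PredictClassIndex returns the zero-based index of the highest-scoring class; the file name is hypothetical.
// Hypothetical end-to-end usage; trainingData is a List<(float[][][] inputs, float[] targets)> with one-hot targets.
int batchSize = ConvolutionalNeuralNetworkV2.GetOptimumBatchSize(128);
int threadCount = ConvolutionalNeuralNetworkV2.GetOptimumThreadCount();
var cnn = ConvolutionalNeuralNetworkV2.CreateNetwork(batchSize: batchSize, inputHeight: 28, inputWidth: 28, inputDepth: 1);
cnn.Add2DConvolutionLayer(filterSize: 3, filterCount: 32, stride: 1, padding: 1);
cnn.AddLeakyReLULayer(alpha: 0.1f);
cnn.Add2DMaxPoolingLayer(poolHeight: 2, poolWidth: 2);
cnn.AddFinalFullyConnectedLayer(outputNodes: 10);
cnn.AddSoftmaxOutputLayer();
cnn.Train(trainingData, learningRate: 0.0001f, threadCount: threadCount, clipThreshold: 5.0f, epochCount: 10, shuffle: true);
cnn.SaveModelToFile("model.bin"); // hypothetical file name
var loaded = ConvolutionalNeuralNetworkV2.LoadModelFromFile("model.bin");
int predictedClass = loaded.PredictClassIndex(testImage); // testImage is a float[][][]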
Learning Rate
The larger the network architecture, the smaller the learning rate generally needs to be: more parameters require smaller adjustments, and a smaller learning rate also helps guard against exploding gradients, as does gradient clipping.
Why does the CNN learn to a very low loss and then, at certain epochs, the loss suddenly begins increasing out of control? This behavior is a classic sign that the learning rate is still too high, especially with ADAM optimization and dropout. The initial fast drop in loss followed by a rising loss suggests the optimizer is overshooting the optimal weights. Lower the learning rate.
While ADAM's adaptive learning rates provide robustness across different parameters, research has shown that the base learning rate (often called alpha or η) still needs to be reasonably well-chosen for optimal performance. ADAM decides how to distribute the learning across different parameters, but the base learning rate determines the overall magnitude of updates. If the base rate is too high, even the adaptive scaling can't prevent overshooting. If it's too low, convergence will be unnecessarily slow despite good adaptive scaling.
A common range for ADAM's base learning rate is often between 0.0001 and 0.001, though this varies by problem.
A larger CNN typically needs a smaller learning rate, whether the size comes from more fully connected layers or from more convolutional layers:
- More Hidden (Fully Connected) Neurons:
- More parameters means more complex error surface
- Higher chance of overshooting optimal weights
- Gradients can become larger due to more connections
- Example: Going from 128 to 1024 neurons might need 2-4x smaller learning rate
- More Convolutional Layers:
- Even with fewer parameters than fully connected layers
- Deep networks face vanishing/exploding gradient issues
- Each additional layer compounds gradient instability
- Deeper networks are more sensitive to learning rate
- Example: Going from 2 to 8 conv layers might need 4-8x smaller learning rate
Key Factors Why Deeper CNNs Need Smaller Rates:
- Gradient Chain:
- Each layer multiplies gradients during backpropagation
- More layers = longer multiplication chain
- Higher risk of gradient explosion/vanishing
- Feature Hierarchy:
- Early layers learn basic features (edges, corners)
- Deep layers learn complex combinations
- Changes in deep layers cascade back through network
- Parameter Interaction:
- Changes in early layers affect all subsequent layers
- More conv layers = more complex interactions
- Small changes can have amplified effects
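The GetWarmupThenDecayLearningRate helper listed above can drive per-batch scheduling when training manually with TrainBatch. The sketch below shows one way to wire it up; the warm-up, decay, and schedule values (including the "cosine" string) are illustrative assumptions only, so check the downloaded source for the supported schedule names.
// Hypothetical per-batch training loop with a warmup-then-decay learning rate.
// Assumes cnn, threadCount and batches (List<List<(float[][][] inputs, float[] targets)>>) already exist.
int totalEpochs = 20;
int totalBatchesPerEpoch = batches.Count;
for (int epoch = 0; epoch < totalEpochs; epoch++)
{
    for (int batchIndex = 0; batchIndex < totalBatchesPerEpoch; batchIndex++)
    {
        float lr = ConvolutionalNeuralNetworkV2.GetWarmupThenDecayLearningRate(
            baseLearningRate: 0.0005f, batchIndex: batchIndex, totalBatchesPerEpoch: totalBatchesPerEpoch,
            epochIndex: epoch, totalEpochs: totalEpochs, decayStartEpochIndex: 5,
            warmupMinLearningRate: 0.00005f, decayEndLearningRate: 0.00001f,
            warmupBatches: 200, schedule: "cosine"); // assumed schedule name
        cnn.TrainBatch(batches[batchIndex], lr, threadCount, clipThreshold: 5.0f);
    }
}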
ADAM's Beta1 and Beta2
For CNNs, the default values of β₁ = 0.9 and β₂ = 0.999 typically work very well, which is why they're the default in most frameworks. These values were suggested in the original ADAM paper and have proven robust across many architectures.
However, some variations that can be worth trying:
- Standard Choice:
- β₁ = 0.9
- β₂ = 0.999
- More Conservative:
- β₁ = 0.85-0.9
- β₂ = 0.99
- This can help if training seems unstable, as it reduces the influence of past gradients.
- More Aggressive:
- β₁ = 0.95
- β₂ = 0.9999
- This can sometimes help convergence in a very deep network where training is progressing slowly.
Generally, it's more important to tune other hyperparameters first (learning rate, batch size, architecture) before adjusting β₁ and β₂. The defaults are robust enough that they rarely need tuning for CNNs.
If training is unstable (especially early in training), trying a slightly lower β₁ (such as 0.85) can help by making the optimizer more responsive to recent gradients.
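Both values can be supplied when the network is created. A short sketch using the "more conservative" settings above:
// Slightly more conservative moment decay rates than the ADAM defaults.
var cnn = ConvolutionalNeuralNetworkV2.CreateNetwork(batchSize: 128, inputHeight: 224, inputWidth: 224, inputDepth: 3, beta1: 0.875f, beta2: 0.99f);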
Dropout Layers
Dropout layers are a regularization technique that randomly zeroes out a percentage of neurons during training; they are used in neural networks to prevent overfitting and improve generalization.
Like L2 regularization, dropout can make the training loss quite erratic.
Typical dropout rates:
- First layer: None
- Convolutional layers: 0.1 to 0.25 (10-25%)
- Fully connected (hidden) layers: 0.15 to 0.4 (15-40%)
- 0.5 (50%) is quite aggressive and typically only used in cases of severe overfitting
The rationale:
- The first layer is where data are introduced to the network
- Convolution layers have fewer parameters and built-in regularization from weight sharing
- Fully connected layers have more parameters and are more prone to overfitting
- Earlier layers need less dropout than later layers
- With batch normalization layers already providing regularization, dropout rates can be lowered
Parameters
For optimum performance, including learning, both the batch size and the processor thread count should be a power of 2, such as 32, 64, 128 or 256, but no greater. On an 88-thread machine, for example, use a thread count of 64; this still takes full advantage of a processor with hyper-threading (44 cores, 88 threads - all cores will be busy). Do not exceed a thread count of 256, because 256 is the maximum effective batch size for the CNN and this is hard-coded. It is possible to use a thread count of 512 by changing the maximum effective batch size to 512 in the code; while not tested, this should cause no problems.
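As a short sketch, both values can be requested from the library rather than chosen by hand:
// Let the library pick hardware-appropriate values (both are powers of 2, capped as described above).
int threadCount = ConvolutionalNeuralNetworkV2.GetOptimumThreadCount(); // per the text above, 64 on an 88-thread machine
int batchSize = ConvolutionalNeuralNetworkV2.GetOptimumBatchSize(200);  // presumably rounds the desired size to an effective power of 2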
Data Augmentation
These data augmentations are available:
- original
- gaussian blur
- random brightness
- channel shift
- color jitter
- crop
- cutout
- elastic
- horizontal flip
- noise
- rotation
- shear
- translation
- vertical flip
- zoom
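The augmentation implementations ship with the download. Purely as an illustration of one of them, a hypothetical horizontal flip over an image stored as float[height][width][channels] could look like the following; the project's actual in-memory layout and augmentation API may differ.
// Hypothetical horizontal flip; assumes image[row][column][channel] layout.
static float[][][] HorizontalFlip(float[][][] image)
{
    int height = image.Length, width = image[0].Length, channels = image[0][0].Length;
    var flipped = new float[height][][];
    for (int y = 0; y < height; y++)
    {
        flipped[y] = new float[width][];
        for (int x = 0; x < width; x++)
        {
            flipped[y][x] = new float[channels];
            Array.Copy(image[y][width - 1 - x], flipped[y][x], channels);
        }
    }
    return flipped;
}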
Example Architectures
MNIST Digits
This simple network architecture was used to train with the MNIST database of handwritten digits to 99% accuracy using a learning rate of 0.0001.
private static ConvolutionalNeuralNetworkV2 CreateCNN(int batchSize)
{
var cnn = ConvolutionalNeuralNetworkV2.CreateNetwork(batchSize: batchSize, inputHeight: IMAGE_HEIGHT, inputWidth: IMAGE_WIDTH, inputDepth: 1, beta1: 0.9f, beta2: 0.999f);
// First Block
cnn.Add2DConvolutionLayer(filterCount: 32, filterSize: 3, stride: 1, padding: 1);
cnn.AddBatchNormLayer(depth: 32);
cnn.AddLeakyReLULayer(alpha: 0.1f);
cnn.Add2DConvolutionLayer(filterCount: 32, filterSize: 3, stride: 1, padding: 1);
cnn.AddBatchNormLayer(depth: 32);
cnn.AddLeakyReLULayer(alpha: 0.1f);
cnn.Add2DMaxPoolingLayer(poolHeight: 2, poolWidth: 2);
// Second Block
cnn.Add2DConvolutionLayer(filterCount: 64, filterSize: 3, stride: 1, padding: 1);
cnn.AddBatchNormLayer(depth: 64);
cnn.AddLeakyReLULayer(alpha: 0.1f);
cnn.Add2DConvolutionLayer(filterCount: 64, filterSize: 3, stride: 1, padding: 1);
cnn.AddBatchNormLayer(depth: 64);
cnn.AddLeakyReLULayer(alpha: 0.1f);
cnn.Add2DMaxPoolingLayer(poolHeight: 2, poolWidth: 2);
cnn.AddDropoutLayer(rate: 0.25f);
// Third Block
cnn.Add2DConvolutionLayer(filterCount: 128, filterSize: 3, stride: 1, padding: 1);
cnn.AddBatchNormLayer(depth: 128);
cnn.AddLeakyReLULayer(alpha: 0.1f);
cnn.Add2DConvolutionLayer(filterCount: 128, filterSize: 3, stride: 1, padding: 1);
cnn.AddBatchNormLayer(depth: 128);
cnn.AddLeakyReLULayer(alpha: 0.1f);
cnn.Add2DMaxPoolingLayer(poolHeight: 2, poolWidth: 2);
cnn.AddDropoutLayer(rate: 0.25f);
// Fourth Block
cnn.Add2DConvolutionLayer(filterCount: 256, filterSize: 3, stride: 1, padding: 1);
cnn.AddBatchNormLayer(depth: 256);
cnn.AddLeakyReLULayer(alpha: 0.1f);
cnn.Add2DConvolutionLayer(filterCount: 256, filterSize: 3, stride: 1, padding: 1);
cnn.AddBatchNormLayer(depth: 256);
cnn.AddLeakyReLULayer(alpha: 0.1f);
cnn.AddDropoutLayer(rate: 0.25f);
// 2,304 input nodes
cnn.AddFullyConnectedLayer(outputNodes: 1024);
cnn.AddBatchNormLayer(depth: 1024);
cnn.AddLeakyReLULayer(alpha: 0.1f);
cnn.AddDropoutLayer(rate: 0.3f);
cnn.AddFullyConnectedLayer(outputNodes: 512);
cnn.AddBatchNormLayer(depth: 512);
cnn.AddLeakyReLULayer(alpha: 0.1f);
cnn.AddDropoutLayer(rate: 0.2f);
cnn.AddFullyConnectedLayer(outputNodes: 256);
cnn.AddBatchNormLayer(depth: 256);
cnn.AddLeakyReLULayer(alpha: 0.1f);
cnn.AddDropoutLayer(rate: 0.1f);
cnn.AddFinalFullyConnectedLayer(outputNodes: 10); // ten classes for the digits 0 to 9
cnn.AddSoftmaxOutputLayer();
return cnn;
}
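To reproduce the accuracy figure, the trained network can be evaluated on the held-out MNIST test set with CalculateAccuracy. A sketch, assuming mnistTrain and mnistTest are List<(float[][][], float[])> collections with one-hot labels and that the method returns the fraction of correct predictions:
// Hypothetical training and evaluation of the architecture above.
var cnn = CreateCNN(ConvolutionalNeuralNetworkV2.GetOptimumBatchSize(128));
cnn.Train(mnistTrain, learningRate: 0.0001f, threadCount: ConvolutionalNeuralNetworkV2.GetOptimumThreadCount(), epochCount: 20);
float accuracy = cnn.CalculateAccuracy(mnistTest); // ~0.99 (99%) per the result above, assuming a fractional return value
Console.WriteLine("Test accuracy: " + accuracy);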
Cat, Dog & Human
This architecture creates a large, resource-hungry network with many inputs (50,176 nodes) feeding into the bottleneck convolutional layer. It was used to train the CNN to recognize cats, dogs and humans.
While it's common to place a nonlinearity such as ReLU, ELU, or a BatchNorm + ReLU/ELU combo after every convolution (including bottlenecks), it is not an absolute rule. Whether or not to include an activation after a bottleneck depends on the role that layer is playing in the architecture and the desired information flow. Here are a few relevant points:
- The main purpose of a bottleneck layer is often to reduce the number of parameters by decreasing the channel dimension before a more expensive operation or a fully connected layer. If the intention is purely dimensionality reduction and the subsequent layer (e.g., a pooling or a fully-connected layer) will handle nonlinearity or final classification logic, then having no activation at this point might be a deliberate design choice. Essentially, if the designer wants a linear projection of features into a lower-dimensional space without distorting them through a nonlinear activation, they may choose not to include one.
- Trade-offs and Experimental Reasons:
- Including an activation can introduce beneficial nonlinearities, potentially improving the representational capacity of the model.
- However, some complex architectures find that adding an activation at every single layer is not always beneficial. A linear bottleneck layer can act as a form of linear dimensionality reduction, preserving more subtle linear correlations that might be lost after a nonlinear activation.
- If the network has batch normalization and dropout layers strategically placed after this bottleneck, one might argue that the primary nonlinearity influencing these features occurs slightly downstream. This can still be effective, as the BN and dropout can help regularize and shape the distribution of the features without immediately applying a nonlinearity.
It is not mandatory to follow the bottleneck layer with an activation function. While many architectures do, some intentionally omit it to maintain a linear mapping at that stage. The decision often comes down to the specific architectural philosophy, the role of the bottleneck within the model, and empirical performance results. So omitting the activation can be considered normal practice in certain scenarios, especially if the goal is to simply reduce parameters before a subsequent stage in the network.
YMMV.
private static ConvolutionalNeuralNetworkV2 CreateCNN(int batchSize, int firstDenseLayerOutput, int secondDenseLayerOutput)
{
// IMAGE_HEIGHT = 224, IMAGE_WIDTH = 224, Input Depth = 3 Color Channels
var cnn = ConvolutionalNeuralNetworkV2.CreateNetwork(batchSize: batchSize, inputHeight: IMAGE_HEIGHT, inputWidth: IMAGE_WIDTH, inputDepth: 3, beta1: 0.9f, beta2: 0.999f);
// First block
cnn.Add2DConvolutionLayer(filterCount: 32, filterSize: 3, stride: 1, padding: 1);
cnn.AddBatchNormLayer(depth: 32);
cnn.AddELULayer(alpha: 1f);
cnn.Add2DConvolutionLayer(filterCount: 32, filterSize: 3, stride: 1, padding: 1);
cnn.AddBatchNormLayer(depth: 32);
cnn.AddELULayer(alpha: 1f);
cnn.Add2DMaxPoolingLayer(poolHeight: 2, poolWidth: 2);
cnn.AddDropoutLayer(rate: 0.1f); // Light dropout in early layers
// Second block - Doubled filters
cnn.Add2DConvolutionLayer(filterCount: 64, filterSize: 3, stride: 1, padding: 1);
cnn.AddBatchNormLayer(depth: 64);
cnn.AddELULayer(alpha: 1f);
cnn.Add2DConvolutionLayer(filterCount: 64, filterSize: 3, stride: 1, padding: 1);
cnn.AddBatchNormLayer(depth: 64);
cnn.AddELULayer(alpha: 1f);
cnn.Add2DMaxPoolingLayer(poolHeight: 2, poolWidth: 2);
cnn.AddDropoutLayer(rate: 0.2f);
// Third block - Increased complexity
cnn.Add2DConvolutionLayer(filterCount: 128, filterSize: 3, stride: 1, padding: 1);
cnn.AddBatchNormLayer(depth: 128);
cnn.AddELULayer(alpha: 1f);
cnn.Add2DConvolutionLayer(filterCount: 128, filterSize: 3, stride: 1, padding: 1);
cnn.AddBatchNormLayer(depth: 128);
cnn.AddELULayer(alpha: 1f);
cnn.Add2DMaxPoolingLayer(poolHeight: 2, poolWidth: 2);
cnn.AddDropoutLayer(rate: 0.25f);
// Fourth block - Added more feature extraction before
cnn.Add2DConvolutionLayer(filterCount: 256, filterSize: 3, stride: 1, padding: 1);
cnn.AddBatchNormLayer(depth: 256);
cnn.AddELULayer(alpha: 1f);
cnn.Add2DConvolutionLayer(filterCount: 256, filterSize: 3, stride: 1, padding: 1);
cnn.AddBatchNormLayer(depth: 256);
cnn.AddELULayer(alpha: 1f);
cnn.Add2DMaxPoolingLayer(poolHeight: 2, poolWidth: 2);
cnn.AddDropoutLayer(rate: 0.3f);
// 50,176 input nodes
// Bottleneck convolutional layer w/BatchNorm, Activation and ...
cnn.Add2DConvolutionLayer(filterCount: 128, filterSize: 1, stride: 1, padding: 0);
cnn.AddBatchNormLayer(depth: 128);
cnn.AddELULayer(alpha: 1f);
cnn.Add2DMaxPoolingLayer(poolHeight: 2, poolWidth: 2);
cnn.AddDropoutLayer(rate: 0.2f);
// 6,272 input nodes
cnn.AddFullyConnectedLayer(outputNodes: firstDenseLayerOutput); // e.g., 3072
cnn.AddBatchNormLayer(depth: firstDenseLayerOutput);
cnn.AddELULayer(alpha: 1f);
cnn.AddDropoutLayer(rate: 0.4f);
// Additional dense layer for more complex feature combinations
cnn.AddFullyConnectedLayer(outputNodes: secondDenseLayerOutput); // e.g., 1536
cnn.AddBatchNormLayer(depth: secondDenseLayerOutput);
cnn.AddELULayer(alpha: 1f);
cnn.AddDropoutLayer(rate: 0.3f);
cnn.AddFinalFullyConnectedLayer(outputNodes: 3); // three classes: cat, dog and human
cnn.AddSoftmaxOutputLayer();
return cnn;
}
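Once trained, inference is just an index lookup. A minimal sketch, assuming PredictClassIndex returns the zero-based index of the winning class and the targets were one-hot encoded in the order cat, dog, human:
// Hypothetical inference; image is a preprocessed 224x224x3 float[][][] input.
string[] classes = { "cat", "dog", "human" };
int index = cnn.PredictClassIndex(image);
Console.WriteLine("Predicted: " + classes[index]);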
Mumbai Real-estate Pricing
This large feed-forward-only network (not actually a CNN) was used to train a 2024 Mumbai Real Estate Price Predictor.
private static ConvolutionalNeuralNetworkV2 CreateFFN(int batchSize)
{
// Calculate total input features:
int numericalFeatures = 6; // area, bedrooms, bathrooms, balconies, floors, age
int localityFeatures = localityMap.Count; // One-hot encoded localities
int propertyTypeFeatures = propertyTypeMap.Count; // One-hot encoded property types
int furnishedFeatures = furnishedMap.Count; // One-hot encoded furnished states
int locationIndicator = 1; // Price indicator from lat/long
int totalInputFeatures = numericalFeatures + localityFeatures + propertyTypeFeatures + furnishedFeatures + locationIndicator;
var ffn = ConvolutionalNeuralNetworkV2.CreateNetwork(batchSize: batchSize, inputHeight: 1, inputWidth: 1, inputDepth: totalInputFeatures);
// Create a complex feed-forward network
ffn.AddFullyConnectedLayer(outputNodes: 512);
ffn.AddBatchNormLayer(depth: 512);
ffn.AddLeakyReLULayer(alpha: 0.1f);
ffn.AddFullyConnectedLayer(outputNodes: 256);
ffn.AddBatchNormLayer(depth: 256);
ffn.AddLeakyReLULayer(alpha: 0.1f);
ffn.AddDropoutLayer(0.3f);
ffn.AddFullyConnectedLayer(outputNodes: 128);
ffn.AddBatchNormLayer(depth: 128);
ffn.AddLeakyReLULayer(alpha: 0.1f);
ffn.AddDropoutLayer(0.2f);
ffn.AddFullyConnectedLayer(outputNodes: 64);
ffn.AddBatchNormLayer(depth: 64);
ffn.AddLeakyReLULayer(alpha: 0.1f);
ffn.AddDropoutLayer(0.1f);
ffn.AddFinalFullyConnectedLayer(outputNodes: 1);
ffn.AddSigmoidOutputLayer(LossFunction.BoundedRMSE); // Must use Sigmoid final activation for bounded MSE and RMSE networks
return ffn;
}
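Because the bounded loss functions expect outputs in [0,1], the price targets have to be scaled into that range before training and scaled back after prediction. Below is a sketch of min-max scaling under that assumption; the preprocessing actually used for the price predictor may differ, and Predict is assumed to return a float[] holding the single output value.
// Hypothetical min-max scaling of the regression target to [0,1] and back.
// prices is a List<float> of raw sale prices (requires using System.Linq for Min/Max).
float minPrice = prices.Min(), maxPrice = prices.Max();
float Normalize(float price) => (price - minPrice) / (maxPrice - minPrice);
float Denormalize(float output) => output * (maxPrice - minPrice) + minPrice;
// Each target becomes a single-element array: new float[] { Normalize(price) }
// and a prediction is recovered with: Denormalize(ffn.Predict(features)[0])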
IMDB Review Sentiment
This CNN/RNN/CNN+RNN architecture is for natural language processing (NLP) and was used to train a model on IMDB movie reviews to predict whether a review is negative or positive. The RNN uses Bidirectional Gated Recurrent Unit (BiGRU) layers. The CNN alone struggles to learn these data.
The parallel convolutional block uses 0.3 as the threshold between LeakyReLU and ELU activation: LeakyReLU is used when the alpha parameter is 0.3 or less, and ELU when it is greater than 0.3. In this example, with alpha set to 1.0, ELU activation is used.
Loss Plateauing: If the loss drops initially and then plateaus around 0.69 (about 50% accuracy) after three or so epochs, do not be alarmed; 0.693 ≈ ln 2 is the binary cross-entropy of a model that always predicts a probability of 0.5. The pattern of an initial plateau around 50% accuracy and ~0.693 loss, followed by a gradual decrease in loss and increase in accuracy after five to seven epochs once the embeddings begin to make sense, is a common phenomenon. It is a sign that the model is starting from scratch and needs those few epochs to learn meaningful word representations before real sentiment-classification capability emerges. Training word embeddings from scratch takes considerable time, and before the embeddings and other parameters start to capture meaningful patterns, the model's predictions often default to one class. This happens because the network has not yet learned any discriminative features, so it takes a "safe bet" by consistently outputting whichever class it finds easiest to predict. As a result, the confusion matrix during the initial training phase will likely show that all (or nearly all) examples are predicted as the same sentiment, leading to a large number of false positives or false negatives.
private static ConvolutionalNeuralNetworkV2 CreateCNN(int vocabSize, int embeddingDim, int maxSequenceLen, int numClasses, int batchSize)
{
var cnn = ConvolutionalNeuralNetworkV2.CreateNetwork(batchSize: batchSize, sequenceLength: maxSequenceLen);
// Embedding layer
cnn.AddEmbeddingLayer(vocabSize: vocabSize, embeddingDim: embeddingDim);
var convBlockConfig1 = new List<(int filterSize, int filterCount, int stride, int? padding)>
{
(filterSize: 2, filterCount: 192, stride: 1, padding: null), // null padding means "same" padding is used
(filterSize: 3, filterCount: 192, stride: 1, padding: null),
(filterSize: 4, filterCount: 192, stride: 1, padding: null),
(filterSize: 5, filterCount: 192, stride: 1, padding: null)
};
// Single convolutional block to capture n-gram features
// maxPoolSize only used when pooling: Pooling.Max
cnn.AddParallel1DConvolutionBlockLayer(branchConfigs: convBlockConfig1, useBatchNormalization: true, alpha: 1f, pooling: Pooling.GlobalMax, maxPoolSize: 0);
cnn.AddDropoutLayer(0.2f);
// 192 * 4 = 768 inputs
cnn.AddFullyConnectedLayer(outputNodes: 512);
cnn.AddBatchNormLayer(depth: 512);
cnn.AddELULayer(alpha: 1f);
cnn.AddDropoutLayer(rate: 0.2f);
cnn.AddFullyConnectedLayer(outputNodes: 256);
cnn.AddBatchNormLayer(depth: 256);
cnn.AddELULayer(alpha: 1f);
cnn.AddDropoutLayer(rate: 0.15f);
// Output layer
cnn.AddFinalFullyConnectedLayer(outputNodes: numClasses);
if (numClasses == 1)
cnn.AddSigmoidOutputLayer(LossFunction.BinaryCrossEntropy);
else
cnn.AddSoftmaxOutputLayer();
return cnn;
}
private static ConvolutionalNeuralNetworkV2 CreateRNN(int vocabSize, int embeddingDim, int maxSequenceLen, int numClasses, int batchSize)
{
var cnn = ConvolutionalNeuralNetworkV2.CreateNetwork(batchSize: batchSize, sequenceLength: maxSequenceLen);
// Embedding layer
cnn.AddEmbeddingLayer(vocabSize: vocabSize, embeddingDim: embeddingDim);
// Bidirectional GRU layer; returnSequences: false returns only the final hidden state
cnn.AddBiGRULayer(hiddenSize: 160, useLayerNormalization: true, returnSequences: false, dropoutRate: 0.2f);
// 320 inputs
cnn.AddFullyConnectedLayer(outputNodes: 256);
cnn.AddBatchNormLayer(depth: 256);
cnn.AddELULayer(alpha: 1f);
cnn.AddDropoutLayer(rate: 0.2f);
cnn.AddFullyConnectedLayer(outputNodes: 128);
cnn.AddBatchNormLayer(depth: 128);
cnn.AddELULayer(alpha: 1f);
cnn.AddDropoutLayer(rate: 0.15f);
// Output layer
cnn.AddFinalFullyConnectedLayer(outputNodes: numClasses);
if (numClasses == 1)
cnn.AddSigmoidOutputLayer(LossFunction.BinaryCrossEntropy);
else
cnn.AddSoftmaxOutputLayer();
return cnn;
}
private static ConvolutionalNeuralNetworkV2 CreateCNNandRNN(int vocabSize, int embeddingDim, int maxSequenceLen, int numClasses, int batchSize)
{
var cnn = ConvolutionalNeuralNetworkV2.CreateNetwork(batchSize: batchSize, sequenceLength: maxSequenceLen);
// Embedding layer
cnn.AddEmbeddingLayer(vocabSize: vocabSize, embeddingDim: embeddingDim);
var convBlockConfig1 = new List<(int filterSize, int filterCount, int stride, int? padding)>
{
(filterSize: 2, filterCount: 192, stride: 1, padding: null), // null padding means "same" padding is used
(filterSize: 3, filterCount: 192, stride: 1, padding: null),
(filterSize: 4, filterCount: 192, stride: 1, padding: null),
(filterSize: 5, filterCount: 192, stride: 1, padding: null)
};
// Single convolutional block to capture n-gram features
// maxPoolSize only used when pooling: Pooling.Max
cnn.AddParallel1DConvolutionBlockLayer(branchConfigs: convBlockConfig1, useBatchNormalization: true, alpha: 1f, pooling: Pooling.None, maxPoolSize: 0);
cnn.AddDropoutLayer(0.2f);
// 384,000 inputs
// Bidirectional GRU layer
cnn.AddBiGRULayer(hiddenSize: 256, useLayerNormalization: true, returnSequences: false, dropoutRate: 0.2f);
// 256 x 2 = 512 inputs
cnn.AddFullyConnectedLayer(outputNodes: 512);
cnn.AddBatchNormLayer(depth: 512);
cnn.AddELULayer(alpha: 1f);
cnn.AddDropoutLayer(rate: 0.2f);
cnn.AddFullyConnectedLayer(outputNodes: 256);
cnn.AddBatchNormLayer(depth: 256);
cnn.AddELULayer(alpha: 1f);
cnn.AddDropoutLayer(rate: 0.15f);
// Output layer
cnn.AddFinalFullyConnectedLayer(outputNodes: numClasses);
if (numClasses == 1)
cnn.AddSigmoidOutputLayer(LossFunction.BinaryCrossEntropy);
else
cnn.AddSoftmaxOutputLayer();
return cnn;
}
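The embedding layer consumes integer token ids, so each review has to be tokenized, mapped through a vocabulary, and padded or truncated to maxSequenceLen before training. The following is a generic sketch of that preprocessing under the assumption that the network accepts a float[] of token ids per review; the tokenizer included in the download may differ.
// Hypothetical tokenization: map words to vocabulary ids, pad/truncate to a fixed length.
static float[] Encode(string review, Dictionary<string, int> vocab, int maxSequenceLen)
{
    var tokens = review.ToLowerInvariant().Split(new[] { ' ', '.', ',', '!', '?' }, StringSplitOptions.RemoveEmptyEntries);
    var ids = new float[maxSequenceLen]; // 0 is assumed to be the padding/unknown id
    for (int i = 0; i < tokens.Length && i < maxSequenceLen; i++)
        ids[i] = vocab.TryGetValue(tokens[i], out int id) ? id : 0;
    return ids;
}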
Power Trends
Here is another example of using Gated Recurrent Unit (GRU) layers. GRU is a specialized neural network architecture designed to handle sequential data and maintain memory of past information. This was used to train a network on power usage trends.
private static ConvolutionalNeuralNetworkV2 CreateCNN(int batchSize, int sequenceLength, int inputFeatures, int outputFeatures)
{
var cnn = ConvolutionalNeuralNetworkV2.CreateNetwork(batchSize: batchSize, inputWidth: sequenceLength, inputDepth: inputFeatures, beta1: 0.9f, beta2: 0.999f);
// First convolution block
var branchConfig1 = new List<(int filterSize, int filterCount, int stride, int? padding)>
{
(3, 48, 1, null), // uses "same" padding when null
(5, 48, 1, null),
(7, 48, 1, null)
};
cnn.AddParallel1DConvolutionBlockLayer(branchConfigs: branchConfig1, useBatchNormalization: true, alpha: 1f, pooling: Pooling.None, maxPoolSize: 0);
// Second convolution block
var branchConfig2 = new List<(int filterSize, int filterCount, int stride, int? padding)>
{
(3, 96, 1, null),
(5, 96, 1, null),
(7, 96, 1, null)
};
cnn.AddParallel1DConvolutionBlockLayer(branchConfigs: branchConfig2, useBatchNormalization: true, alpha: 1f, pooling: Pooling.None, maxPoolSize: 0);
// GRU layers with correct sequence length
cnn.AddGRULayer(hiddenSize: 256, useLayerNormalization: false, returnSequences: true, dropoutRate: 0.15f);
cnn.AddGRULayer(hiddenSize: 128, useLayerNormalization: false, returnSequences: false, dropoutRate: 0.15f);
// Dense layers
cnn.AddFullyConnectedLayer(outputNodes: 64);
cnn.AddELULayer(alpha: 1f);
cnn.AddDropoutLayer(rate: 0.2f);
cnn.AddFinalFullyConnectedLayer(outputNodes: outputFeatures);
cnn.AddLinearOutputLayer(LossFunction.RMSE);
return cnn;
}
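The sequence inputs for a network created this way are two-dimensional: sequenceLength timesteps by inputFeatures values. Below is a generic sketch of building sliding-window training pairs from a single usage series, assuming a [timestep][feature] layout (the dataset preparation in the download may differ):
// Hypothetical sliding windows: predict the next value from the previous sequenceLength steps.
static List<(float[][] inputs, float[] targets)> BuildWindows(float[] series, int sequenceLength)
{
    var data = new List<(float[][] inputs, float[] targets)>();
    for (int start = 0; start + sequenceLength < series.Length; start++)
    {
        var window = new float[sequenceLength][];
        for (int t = 0; t < sequenceLength; t++)
            window[t] = new float[] { series[start + t] }; // single feature per timestep
        data.Add((window, new float[] { series[start + sequenceLength] }));
    }
    return data;
}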
Training Monitor
Included is code that implements a TrainingMonitor class that tracks and displays the progress of training a neural network, specifically designed to work with this convolutional neural network (CNN) implementation. It is a sophisticated monitoring system that provides real-time insight during the training process.
This code shows careful attention to practical training concerns, such as handling noise in measurements, providing accurate time estimates, and tracking long-term improvement trends while accounting for the natural volatility introduced by dropout layers. This makes it a robust tool for monitoring and debugging neural network training processes.
Monitor output:
############################## Training Progress ###############################
============================== Overall Loss Trend ==============================
Max: 0.777943   [ASCII loss-trend chart]   Min: 0.683328
============================== Last 30 Loss Trend ==============================
Max: 0.751115   [ASCII loss-trend chart]   Min: 0.695009
Epoch Progress: [████████████░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░] 21.56%
Total Progress: [█░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░] 1.80%
| Epoch | Batch | Samples |Processed| Average |SmoothEMA|SmoothMin|ImprvSinc|
|  1/12 |    80 | 570,000 |  10,240 |0.723810 |0.723896 |0.722736 |      57 |
30 Batch Loss Trend: Stable (Confidence: 41%)
Speed: 1.05 samples/sec; Epoch ETA: 9h 48m; Total ETA: 6d 3h
--------------------------------------------------------------------------------
Supported Loss Functions
namespace ML.CNN
{
public enum LossFunction
{
MSE = 1, // MSE (Mean Squared Error): Good for regression tasks, use Linear final layer
RMSE = 2, // RMSE (Root Mean Squared Error): Like MSE, but the error is in the same units as the predictions and it is less sensitive to outliers; use Linear final layer
BoundedMSE = 3, // Bounded MSE network will normalize output to [0,1], use Sigmoid or Linear final layer
BoundedRMSE = 4, // Bounded RMSE network will normalize output to [0,1], use Sigmoid or Linear final layer
BinaryCrossEntropy = 5, // Binary Cross Entropy: For binary/two-class classification, use Sigmoid or Linear final layer
SoftmaxCrossEntropy = 6 // Softmax Cross Entropy: Most common for multi-class classification, use Softmax or Linear final layer
}
}
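In practice, the enum value is passed to the matching output-layer method when the final layer is added (one of these per network). A few sketches of the pairings described in the comments above; AddSoftmaxOutputLayer takes no loss argument and presumably implies SoftmaxCrossEntropy:
// Regression with unbounded outputs:
cnn.AddLinearOutputLayer(LossFunction.RMSE);
// Regression with targets normalized to [0,1]:
cnn.AddSigmoidOutputLayer(LossFunction.BoundedRMSE);
// Binary classification with a single output node:
cnn.AddSigmoidOutputLayer(LossFunction.BinaryCrossEntropy);
// Multi-class classification:
cnn.AddSoftmaxOutputLayer();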
Download: NEURALNETWORK.zip.