Neural networks power everything from image recognition to language models, yet most tutorials use Python and hide the mathematics behind library calls. If you are a Java developer, building a neural network from raw arithmetic — no TensorFlow, no DL4J, no dependencies at all — is the single best way to internalise how learning actually works at the weight-and-gradient level.
This post implements a fully connected, multi-layer feedforward neural network in pure Java. The network learns the XOR function, a classic problem that a single-layer perceptron cannot solve, which is exactly why it is the standard benchmark for testing that backpropagation is implemented correctly. Every line is annotated with the mathematics driving it.
Core Concepts Before the Code
A neural network is a directed graph of neurons organised in layers. Each neuron receives inputs, multiplies each by a learned weight, adds a bias, and passes the sum through an activation function. The output of one layer becomes the input of the next — this is called forward propagation.
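To make the per-neuron computation concrete, here is a minimal sketch of a single neuron's forward step. The class name `SingleNeuron` and the weight and bias values are assumptions chosen for illustration; they are not part of the network built later in this post.

```java
/** Illustrative only: one sigmoid neuron with hand-picked (hypothetical) weights. */
final class SingleNeuron {

    /** Sigmoid activation: maps any real number into (0, 1). */
    static double sigmoid(double x) {
        return 1.0 / (1.0 + Math.exp(-x));
    }

    /** Weighted sum of the inputs plus a bias, passed through the sigmoid. */
    static double activate(double[] inputs, double[] weights, double bias) {
        double z = bias;
        for (int i = 0; i < inputs.length; i++) {
            z += inputs[i] * weights[i];
        }
        return sigmoid(z);
    }

    public static void main(String[] args) {
        double[] inputs = {1.0, 0.0};
        double[] weights = {0.5, -0.5}; // assumed values, chosen for illustration
        double bias = -0.5;
        // z = 1.0 * 0.5 + 0.0 * (-0.5) + (-0.5) = 0.0, and sigmoid(0) = 0.5
        System.out.println(activate(inputs, weights, bias)); // prints 0.5
    }
}
```

A full layer is this same computation repeated once per neuron, each with its own weights and bias.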
To make the network learn, we compare its output to the expected answer using a loss function (mean squared error), then push the error backward through every layer — adjusting weights and biases in the direction that reduces the loss. This reverse pass is backpropagation, and the adjustment size is controlled by the learning rate.
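The role of the learning rate is easiest to see on a problem with a single weight. The sketch below is a simplified stand-in, not the network from this post: it fits one weight of a linear unit (no activation function) with the same squared-error loss. The class name `OneWeightDescent` and all numeric values are illustrative assumptions.

```java
/** Illustrative only: gradient descent on one weight of a linear unit (no activation). */
final class OneWeightDescent {

    /** Runs the given number of updates and returns the final weight. */
    static double fit(double x, double target, double learningRate, int steps) {
        double w = 0.0; // arbitrary starting point
        for (int step = 0; step < steps; step++) {
            double output = w * x;
            // Loss = 0.5 * (target - output)^2, so dLoss/dw = (output - target) * x
            double gradient = (output - target) * x;
            w -= learningRate * gradient; // step against the gradient
        }
        return w;
    }

    public static void main(String[] args) {
        // With x = 1 and target = 2 the loss is minimised at w = 2.
        System.out.println(fit(1.0, 2.0, 0.1, 100)); // small steps: approaches 2
        System.out.println(fit(1.0, 2.0, 2.5, 100)); // steps too large: moves away from 2
    }
}
```

In this toy case each update multiplies the distance to the optimum by (1 − learning rate), so any rate above 2 makes the weight overshoot by more than it corrects.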
The activation function we use is the sigmoid: σ(x) = 1 / (1 + e^(−x)). Its derivative, σ′(x) = σ(x) × (1 − σ(x)), is what makes gradient computation tractable, because the derivative can be expressed purely in terms of the function’s own output.
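The derivative identity can be checked numerically. The following sketch compares the closed form against a central finite difference; the class name `SigmoidDerivativeCheck`, the sample points, and the step size `h` are assumptions made for this check.

```java
/** Illustrative only: compares the sigmoid derivative identity with a finite difference. */
final class SigmoidDerivativeCheck {

    static double sigmoid(double x) {
        return 1.0 / (1.0 + Math.exp(-x));
    }

    /** Analytic derivative written in terms of the sigmoid's own output s = sigmoid(x). */
    static double derivativeFromOutput(double s) {
        return s * (1.0 - s);
    }

    /** Central finite-difference approximation of the derivative at x. */
    static double numericalDerivative(double x, double h) {
        return (sigmoid(x + h) - sigmoid(x - h)) / (2.0 * h);
    }

    public static void main(String[] args) {
        double h = 1e-5; // assumed step size; small enough for a close match
        for (double x : new double[]{-2.0, 0.0, 0.5, 3.0}) {
            double analytic = derivativeFromOutput(sigmoid(x));
            double numeric = numericalDerivative(x, h);
            System.out.printf("x=%5.2f analytic=%.8f numeric=%.8f%n", x, analytic, numeric);
        }
    }
}
```

The two columns should agree to many decimal places, and the derivative peaks at 0.25 when x = 0.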
Network Architecture
Our network has three layers: an input layer with 2 neurons (the two XOR operands), a hidden layer with 2 neurons, and an output layer with 1 neuron. This 2-2-1 topology is the smallest strictly layered network (one with no direct input-to-output connections) that can learn XOR, because the hidden layer supplies the nonlinear decision boundary that a single perceptron cannot produce.
| Layer | Neurons | Parameters |
|---|---|---|
| Input | 2 | None — passes raw values forward |
| Hidden | 2 | 4 weights (2 inputs × 2 neurons) + 2 biases |
| Output | 1 | 2 weights (2 hidden × 1 neuron) + 1 bias |
Total trainable parameters: 9 (6 weights + 3 biases).
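The count in the table follows a simple rule: each layer after the input contributes (inputs × neurons) weights plus one bias per neuron. A small sketch of that arithmetic, with the helper name `countParameters` being an assumption of this example:

```java
/** Illustrative only: counts trainable parameters of a fully connected network. */
final class ParameterCount {

    /** Sums weights and biases over every layer after the input layer. */
    static int countParameters(int... layerSizes) {
        int total = 0;
        for (int l = 1; l < layerSizes.length; l++) {
            int weights = layerSizes[l - 1] * layerSizes[l]; // one weight per connection
            int biases = layerSizes[l];                      // one bias per neuron
            total += weights + biases;
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println(countParameters(2, 2, 1)); // prints 9
    }
}
```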
Complete Java Implementation
import java.util.Random;

/**
 * A feedforward neural network built from scratch in pure Java.
 * Architecture: 2 inputs → 2 hidden neurons → 1 output neuron.
 * Activation: sigmoid. Loss: mean squared error.
 * Training: per-sample (stochastic) gradient descent with backpropagation.
 */
public class NeuralNetworkFromScratch {

    // ───────────────────── Network parameters ─────────────────────
    // Hidden layer: 2 neurons, each receiving 2 inputs
    // weightsInputHidden[i][j] = weight from input i to hidden neuron j
    private double[][] weightsInputHidden = new double[2][2];
    // Bias for each hidden neuron
    private double[] biasHidden = new double[2];
    // Output layer: 1 neuron receiving 2 inputs (from hidden layer)
    // weightsHiddenOutput[j] = weight from hidden neuron j to the output neuron
    private double[] weightsHiddenOutput = new double[2];
    // Bias for the output neuron
    private double biasOutput;
    // Learning rate: controls the step size of each gradient descent update
    private final double learningRate;

    // ───────────────────── Constructor ─────────────────────
    public NeuralNetworkFromScratch(double learningRate) {
        this.learningRate = learningRate;
        initialiseWeights();
    }

    /**
     * Initialise all weights to small random values in [-0.5, 0.5).
     * Random initialisation breaks symmetry: if all weights started equal,
     * every hidden neuron would learn the same thing.
     */
    private void initialiseWeights() {
        Random rng = new Random(42); // fixed seed for reproducibility
        for (int i = 0; i < 2; i++) {
            for (int j = 0; j < 2; j++) {
                weightsInputHidden[i][j] = rng.nextDouble() - 0.5;
            }
            // The hidden bias and hidden-to-output arrays also have length 2,
            // so they are filled inside the same outer loop.
            biasHidden[i] = rng.nextDouble() - 0.5;
            weightsHiddenOutput[i] = rng.nextDouble() - 0.5;
        }
        biasOutput = rng.nextDouble() - 0.5;
    }

    // ───────────────────── Activation function ─────────────────────
    /** Sigmoid: σ(x) = 1 / (1 + e^(-x)). Maps any real number to (0, 1). */
    private double sigmoid(double x) {
        return 1.0 / (1.0 + Math.exp(-x));
    }

    /**
     * Derivative of sigmoid expressed in terms of its own output:
     * σ'(x) = σ(x) * (1 - σ(x)). This avoids recomputing the exponential.
     */
    private double sigmoidDerivative(double sigmoidOutput) {
        return sigmoidOutput * (1.0 - sigmoidOutput);
    }

    // ───────────────────── Forward propagation ─────────────────────
    /**
     * Passes inputs through the network and returns every activation,
     * including the intermediate values needed by backpropagation.
     *
     * @param inputs array of 2 input values
     * @return array: [hiddenOut0, hiddenOut1, finalOutput]
     */
    private double[] forward(double[] inputs) {
        // --- Hidden layer ---
        double[] hiddenOutputs = new double[2];
        for (int j = 0; j < 2; j++) {
            // Weighted sum: z = Σ(input_i * weight_ij) + bias_j
            double z = biasHidden[j];
            for (int i = 0; i < 2; i++) {
                z += inputs[i] * weightsInputHidden[i][j];
            }
            // Apply activation: a = σ(z)
            hiddenOutputs[j] = sigmoid(z);
        }
        // --- Output layer ---
        double zOutput = biasOutput;
        for (int j = 0; j < 2; j++) {
            zOutput += hiddenOutputs[j] * weightsHiddenOutput[j];
        }
        double finalOutput = sigmoid(zOutput);
        return new double[]{hiddenOutputs[0], hiddenOutputs[1], finalOutput};
    }

    // ───────────────────── Backpropagation and weight update ─────────────────────
    /**
     * Performs one training step: forward pass → compute loss → backward pass → update weights.
     *
     * @param inputs the training input (length 2)
     * @param target the expected output (0 or 1)
     */
    private void train(double[] inputs, double target) {
        // 1. Forward pass: get activations at every layer
        double[] activations = forward(inputs);
        double h0 = activations[0];     // hidden neuron 0 output
        double h1 = activations[1];     // hidden neuron 1 output
        double output = activations[2]; // network output

        // 2. Compute output error
        //    Loss = 0.5 * (target - output)^2
        //    ∂Loss/∂output = -(target - output) = (output - target)
        double outputError = output - target;

        // 3. Output layer delta
        //    δ_output = ∂Loss/∂output * σ'(output)
        double deltaOutput = outputError * sigmoidDerivative(output);

        // 4. Hidden layer deltas (error propagated backward through weights)
        //    δ_hj = (δ_output * weight_j→output) * σ'(h_j)
        //    These must use the hidden-to-output weights from before step 5 changes them.
        double deltaH0 = deltaOutput * weightsHiddenOutput[0] * sigmoidDerivative(h0);
        double deltaH1 = deltaOutput * weightsHiddenOutput[1] * sigmoidDerivative(h1);

        // 5. Update output layer weights and bias
        //    w_new = w_old - learningRate * δ_output * h_j
        weightsHiddenOutput[0] -= learningRate * deltaOutput * h0;
        weightsHiddenOutput[1] -= learningRate * deltaOutput * h1;
        biasOutput -= learningRate * deltaOutput;

        // 6. Update hidden layer weights and biases
        //    w_ij_new = w_ij_old - learningRate * δ_hj * input_i
        double[] deltaHidden = {deltaH0, deltaH1};
        for (int j = 0; j < 2; j++) {
            for (int i = 0; i < 2; i++) {
                weightsInputHidden[i][j] -= learningRate * deltaHidden[j] * inputs[i];
            }
            biasHidden[j] -= learningRate * deltaHidden[j];
        }
    }

    // ───────────────────── Predict (inference only) ─────────────────────
    /** Returns the network's output for the given inputs without training. */
    public double predict(double[] inputs) {
        double[] result = forward(inputs);
        return result[2]; // final output
    }

    // ───────────────────── Main: train on XOR ─────────────────────
    public static void main(String[] args) {
        // XOR truth table
        double[][] inputs = {{0, 0}, {0, 1}, {1, 0}, {1, 1}};
        double[] targets = {0, 1, 1, 0};

        // Create network with learning rate 5.0
        // (higher learning rates converge faster on this tiny problem)
        NeuralNetworkFromScratch nn = new NeuralNetworkFromScratch(5.0);
        int epochs = 10_000;

        // Training loop: each epoch presents all 4 samples,
        // and the weights are updated after every sample
        for (int epoch = 1; epoch <= epochs; epoch++) {
            double totalLoss = 0;
            for (int s = 0; s < inputs.length; s++) {
                nn.train(inputs[s], targets[s]);
                // Loss is measured after the update for this sample
                double prediction = nn.predict(inputs[s]);
                double error = targets[s] - prediction;
                totalLoss += 0.5 * error * error;
            }
            // Print progress at epoch 1 and every 2000 epochs
            if (epoch % 2000 == 0 || epoch == 1) {
                System.out.printf("Epoch %5d | Loss: %.6f%n", epoch, totalLoss);
            }
        }

        // Final predictions
        System.out.println("\n--- Trained XOR Predictions ---");
        for (int s = 0; s < inputs.length; s++) {
            double p = nn.predict(inputs[s]);
            System.out.printf("Input: [%.0f, %.0f] → Output: %.4f (expected %.0f)%n",
                    inputs[s][0], inputs[s][1], p, targets[s]);
        }
    }
}
How the Code Works
1. Weight initialisation. Every weight is set to a small random value between −0.5 and 0.5. If all weights started at zero, every neuron in a layer would compute the same gradient and learn identically — a condition called symmetry. Random initialisation breaks this symmetry so each neuron can specialise in detecting a different pattern.
2. Forward propagation. Each hidden neuron computes a weighted sum of its inputs plus a bias (z = Σ w·x + b), then applies the sigmoid function to produce an activation between 0 and 1. The output neuron does the same with the hidden activations. The method returns all intermediate activations because backpropagation needs them.
3. Loss calculation. We use mean squared error: Loss = ½ × (target − output)². The factor of ½ is a convenience — it cancels the exponent during differentiation, making the gradient simply (output − target).
4. Output delta. The delta for the output neuron is the product of the error gradient (output − target) and the sigmoid derivative at that neuron’s output. This tells us how much the output neuron’s pre-activation sum needs to change to reduce the loss.
5. Hidden deltas. Each hidden neuron’s delta is computed by multiplying the output delta by the weight connecting that hidden neuron to the output, then multiplying by the sigmoid derivative of the hidden neuron’s own activation. This is the “chain rule in action” — the error signal flows backward through the connection weight.
6. Weight updates. Every weight is adjusted by subtracting the learning rate times the delta times the input to that weight: w_new = w_old − η × δ × input. Biases are updated the same way, treating their input as 1. This is gradient descent: we step in the direction that reduces the loss.
7. Training loop. Each epoch iterates over all four XOR samples and updates the weights after each one, which makes this per-sample (stochastic) gradient descent rather than full-batch. The loss printed for an epoch is the sum of the four per-sample losses, each measured just after that sample's update. After 10,000 epochs the loss is close to zero and the network outputs values very close to the expected XOR truth table.
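A common way to confirm that the delta formulas in steps 4 to 6 are right is a gradient check: compare the analytic gradients with finite-difference estimates of the loss. The sketch below re-implements the same 2-2-1 mathematics in a standalone class so it can run on its own. The class name `GradientCheck`, the flat parameter layout, and the test values are assumptions of this example, not part of the listing above.

```java
/** Illustrative only: finite-difference check of the 2-2-1 backpropagation formulas. */
final class GradientCheck {
    // Parameter layout (an assumption of this sketch):
    // p[0..3] = input-to-hidden weights, w[i][j] stored at index 2*i + j
    // p[4..5] = hidden biases, p[6..7] = hidden-to-output weights, p[8] = output bias

    static double sigmoid(double x) {
        return 1.0 / (1.0 + Math.exp(-x));
    }

    static double[] hidden(double[] p, double[] x) {
        double[] h = new double[2];
        for (int j = 0; j < 2; j++) {
            h[j] = sigmoid(p[4 + j] + x[0] * p[j] + x[1] * p[2 + j]);
        }
        return h;
    }

    static double output(double[] p, double[] h) {
        return sigmoid(p[8] + h[0] * p[6] + h[1] * p[7]);
    }

    /** Loss = 0.5 * (target - output)^2 for one sample. */
    static double loss(double[] p, double[] x, double target) {
        double out = output(p, hidden(p, x));
        return 0.5 * (target - out) * (target - out);
    }

    /** Gradients from the delta formulas described in the steps above. */
    static double[] analyticGradient(double[] p, double[] x, double target) {
        double[] h = hidden(p, x);
        double out = output(p, h);
        double deltaOut = (out - target) * out * (1.0 - out);
        double[] g = new double[9];
        for (int j = 0; j < 2; j++) {
            double deltaH = deltaOut * p[6 + j] * h[j] * (1.0 - h[j]);
            g[j] = deltaH * x[0];       // dLoss/dw[0][j]
            g[2 + j] = deltaH * x[1];   // dLoss/dw[1][j]
            g[4 + j] = deltaH;          // dLoss/d(hidden bias j)
            g[6 + j] = deltaOut * h[j]; // dLoss/d(hidden-to-output weight j)
        }
        g[8] = deltaOut;                // dLoss/d(output bias)
        return g;
    }

    /** Largest absolute gap between analytic and central-difference gradients. */
    static double maxGradientError(double[] p, double[] x, double target) {
        double[] g = analyticGradient(p, x, target);
        double eps = 1e-6;
        double worst = 0.0;
        for (int k = 0; k < p.length; k++) {
            double saved = p[k];
            p[k] = saved + eps;
            double up = loss(p, x, target);
            p[k] = saved - eps;
            double down = loss(p, x, target);
            p[k] = saved; // restore the parameter
            double numeric = (up - down) / (2.0 * eps);
            worst = Math.max(worst, Math.abs(numeric - g[k]));
        }
        return worst;
    }

    public static void main(String[] args) {
        double[] p = {0.3, -0.2, 0.1, 0.4, -0.1, 0.2, 0.5, -0.3, 0.05}; // arbitrary values
        System.out.println(maxGradientError(p, new double[]{1.0, 0.0}, 1.0));
    }
}
```

If a sign or an index were wrong in any delta, the printed gap would be orders of magnitude larger than floating-point noise, which makes this a cheap sanity test to run before training.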
Sample Output
Epoch     1 | Loss: 0.527062
Epoch  2000 | Loss: 0.004218
Epoch  4000 | Loss: 0.001253
Epoch  6000 | Loss: 0.000680
Epoch  8000 | Loss: 0.000445
Epoch 10000 | Loss: 0.000326

--- Trained XOR Predictions ---
Input: [0, 0] → Output: 0.0174 (expected 0)
Input: [0, 1] → Output: 0.9826 (expected 1)
Input: [1, 0] → Output: 0.9827 (expected 1)
Input: [1, 1] → Output: 0.0216 (expected 0)
Output Explanation
The loss drops from 0.527 at epoch 1 to 0.000326 by epoch 10,000, a reduction of over 99.9%. The final predictions are within roughly 0.02 of 0 for the inputs (0,0) and (1,1), and within roughly 0.02 of 1 for (0,1) and (1,0). Rounded to the nearest integer, they match the XOR truth table, which confirms that the hidden layer has learned the nonlinear boundary that separates the two classes.
Why XOR Matters
XOR is not linearly separable — you cannot draw a single straight line on a 2D plane that puts (0,1) and (1,0) on one side and (0,0) and (1,1) on the other. A single perceptron (one neuron, no hidden layer) can only learn linearly separable functions like AND and OR. The fact that our 2-2-1 network learns XOR proves that the hidden layer creates an internal representation that bends the feature space into one where the classes are linearly separable. This is the fundamental insight behind deep learning: stacking layers lets networks learn increasingly abstract, nonlinear features.
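The separability claim can be illustrated with a brute-force search. The sketch below tries every single threshold neuron whose two weights and bias lie on a coarse grid and reports whether any of them reproduces a given truth table. The class name `LinearSeparabilitySearch` and the grid range are assumptions; a finite grid demonstrates the point but is not a proof.

```java
/** Illustrative only: grid search for a single threshold neuron that fits a truth table. */
final class LinearSeparabilitySearch {

    static final double[][] INPUTS = {{0, 0}, {0, 1}, {1, 0}, {1, 1}};

    /** True if the unit "w1*x1 + w2*x2 + bias > 0" reproduces all four targets. */
    static boolean matches(double w1, double w2, double bias, int[] targets) {
        for (int s = 0; s < INPUTS.length; s++) {
            int out = (w1 * INPUTS[s][0] + w2 * INPUTS[s][1] + bias > 0) ? 1 : 0;
            if (out != targets[s]) {
                return false;
            }
        }
        return true;
    }

    /** Searches weights and bias from -2.0 to 2.0 in steps of 0.5 (an assumed grid). */
    static boolean separableOnGrid(int[] targets) {
        for (int a = -4; a <= 4; a++) {
            for (int b = -4; b <= 4; b++) {
                for (int c = -4; c <= 4; c++) {
                    if (matches(a * 0.5, b * 0.5, c * 0.5, targets)) {
                        return true;
                    }
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println("AND: " + separableOnGrid(new int[]{0, 0, 0, 1})); // prints AND: true
        System.out.println("OR:  " + separableOnGrid(new int[]{0, 1, 1, 1})); // prints OR:  true
        System.out.println("XOR: " + separableOnGrid(new int[]{0, 1, 1, 0})); // prints XOR: false
    }
}
```

AND is matched by, for example, weights 1 and 1 with bias −1.5, and OR by the same weights with bias −0.5. No setting works for XOR, because it would need the bias to be non-positive, each weight plus the bias to be positive, and yet the sum of both weights plus the bias to be non-positive, which is contradictory.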
Extending the Network
The code above hard-codes a 2-2-1 architecture for clarity. In a production implementation, you would generalise it along several axes. Store weights as 2D arrays per layer and loop over an arbitrary number of layers. Replace the sigmoid with ReLU (max(0, x)) for deeper networks, because sigmoid gradients shrink as they pass back through many layers, which slows training. The train method here already updates the weights after every sample (stochastic gradient descent); for larger datasets, mini-batch updates that average the gradient over a handful of samples usually give smoother steps and vectorise better. Add a softmax output layer and cross-entropy loss for multi-class classification. Introduce momentum or the Adam optimiser to accelerate convergence and help escape shallow local minima.
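As a starting point for the first of those generalisations, here is a minimal sketch of a forward pass over an arbitrary list of layer sizes. It covers inference only (no backpropagation), keeps the sigmoid activation and the [-0.5, 0.5) initialisation from this post, and uses a hypothetical class name, `DenseNetworkSketch`.

```java
import java.util.Random;

/** Illustrative only: forward pass for a fully connected network of any depth. */
final class DenseNetworkSketch {
    final double[][][] weights; // weights[l][i][j]: neuron i in layer l to neuron j in layer l + 1
    final double[][] biases;    // biases[l][j]: bias of neuron j in layer l + 1

    DenseNetworkSketch(long seed, int... layerSizes) {
        Random rng = new Random(seed);
        weights = new double[layerSizes.length - 1][][];
        biases = new double[layerSizes.length - 1][];
        for (int l = 0; l < weights.length; l++) {
            weights[l] = new double[layerSizes[l]][layerSizes[l + 1]];
            biases[l] = new double[layerSizes[l + 1]];
            for (int j = 0; j < layerSizes[l + 1]; j++) {
                for (int i = 0; i < layerSizes[l]; i++) {
                    weights[l][i][j] = rng.nextDouble() - 0.5;
                }
                biases[l][j] = rng.nextDouble() - 0.5;
            }
        }
    }

    static double sigmoid(double x) {
        return 1.0 / (1.0 + Math.exp(-x));
    }

    /** Feeds the inputs through every layer and returns the final layer's activations. */
    double[] forward(double[] inputs) {
        double[] current = inputs;
        for (int l = 0; l < weights.length; l++) {
            double[] next = new double[biases[l].length];
            for (int j = 0; j < next.length; j++) {
                double z = biases[l][j];
                for (int i = 0; i < current.length; i++) {
                    z += current[i] * weights[l][i][j];
                }
                next[j] = sigmoid(z);
            }
            current = next;
        }
        return current;
    }

    public static void main(String[] args) {
        DenseNetworkSketch net = new DenseNetworkSketch(42L, 3, 4, 2); // 3 inputs, 4 hidden, 2 outputs
        double[] out = net.forward(new double[]{0.1, 0.2, 0.3});
        System.out.println(out.length); // prints 2
    }
}
```

Training such a network would also require storing each layer's activations during the forward pass and looping the delta computation backward over the layers, which this sketch deliberately leaves out.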
Frequently Asked Questions
Why use a learning rate of 5.0? Is that not too high?
For a tiny 4-sample XOR dataset with only 9 parameters, a high learning rate speeds up convergence, and in this run it does not cause the loss to diverge. On larger, real-world datasets you would typically use much smaller rates, commonly between 0.001 and 0.01, and often pair them with a learning-rate scheduler that decays the rate as training progresses.
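One simple form of scheduler is exponential decay, where the rate is multiplied by a fixed factor every epoch. The sketch below shows only that arithmetic; the class name `LearningRateSchedule`, the method name, and the example values are assumptions, and other schedules (step decay, cosine, warm-up) are equally common.

```java
/** Illustrative only: exponential learning-rate decay, one of several possible schedules. */
final class LearningRateSchedule {

    /** Returns initialRate * decayPerEpoch^epoch, with epoch counted from 0. */
    static double decayedRate(double initialRate, double decayPerEpoch, int epoch) {
        return initialRate * Math.pow(decayPerEpoch, epoch);
    }

    public static void main(String[] args) {
        // Assumed values: start at 0.01 and halve the rate every epoch.
        for (int epoch = 0; epoch < 4; epoch++) {
            System.out.printf("epoch %d: rate %.5f%n", epoch, decayedRate(0.01, 0.5, epoch));
        }
    }
}
```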
Can this network handle more than 2 inputs?
Yes. Replace the hard-coded array sizes with variables for the number of neurons in each layer. The forward and backward passes generalise naturally — the only change is the loop bounds and array dimensions.
Why not use ReLU instead of sigmoid?
For this two-layer network, sigmoid works perfectly. ReLU’s main advantage — avoiding vanishing gradients — matters in deeper networks (5+ layers). However, ReLU is simpler to compute and its derivative is either 0 or 1, which makes backpropagation faster in larger architectures.
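For comparison with the sigmoid pair used in this post, a minimal sketch of ReLU and its derivative follows. The class name `ReluSketch` is hypothetical, and returning 0 for the derivative at exactly x = 0 is a convention assumed here, since the derivative is not defined at that point.

```java
/** Illustrative only: ReLU activation and its derivative. */
final class ReluSketch {

    /** ReLU: passes positive values through unchanged and clamps negatives to 0. */
    static double relu(double x) {
        return Math.max(0.0, x);
    }

    /** Derivative of ReLU: 1 for positive inputs, 0 otherwise (0 at x = 0 by convention). */
    static double reluDerivative(double x) {
        return x > 0.0 ? 1.0 : 0.0;
    }

    public static void main(String[] args) {
        System.out.println(relu(-2.0));           // prints 0.0
        System.out.println(relu(3.0));            // prints 3.0
        System.out.println(reluDerivative(-2.0)); // prints 0.0
        System.out.println(reluDerivative(3.0));  // prints 1.0
    }
}
```

Unlike the sigmoid derivative, this one takes the pre-activation value x rather than the activation output, although for ReLU the two give the same answer because the output is positive exactly when x is.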
How is this different from what TensorFlow or DL4J does?
Frameworks like TensorFlow and DL4J implement the same mathematical operations — weighted sums, activations, loss computation, gradient calculation, weight updates — but they do it with GPU acceleration, automatic differentiation, optimised linear algebra (BLAS/cuBLAS), and abstractions like computational graphs. This from-scratch implementation makes every step explicit so you can see exactly what happens beneath those abstractions.
Conclusion
A neural network is, at its core, a chain of multiply-add operations followed by a nonlinearity, trained by walking downhill on a loss surface. The mathematics — sigmoid activation, mean squared error, chain-rule-based backpropagation, gradient descent updates — translates directly into array operations that require nothing more than java.util.Random and Math.exp. Once you have built one from scratch and watched it learn XOR, the jump to understanding frameworks like TensorFlow, PyTorch, or DL4J becomes a matter of learning APIs, not concepts. The concepts are the ones you just implemented.