Activations
Activation functions introduce non-linearity into a neural network, enabling it to learn complex mappings. Simplegrad provides the most common activations as differentiable functional ops. All of them operate element-wise (except softmax, which normalizes along a chosen dimension) and return a new Tensor with gradients wired into the computation graph.
import simplegrad as sg
x = sg.Tensor([-1.0, 0.5, 2.0], requires_grad=True)
out = sg.relu(x) # [0. 0.5 2.0]
probs = sg.softmax(x, dim=0) # normalize along dimension 0; entries sum to 1
relu
relu(x: Tensor) -> Tensor
Apply ReLU activation element-wise: max(0, x).
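As a rough illustration of the formula above, and independent of Simplegrad's own implementation, ReLU and the gradient it passes back can be sketched in plain Python. The helper names relu_forward and relu_backward are hypothetical, and using a zero gradient at exactly x = 0 is an assumed convention.

```python
def relu_forward(values):
    """Element-wise max(0, x) over a list of floats."""
    return [max(0.0, v) for v in values]


def relu_backward(values, upstream):
    """Pass the upstream gradient through where x > 0, block it elsewhere.

    Assumption: the gradient at x == 0 is taken to be 0.
    """
    return [g if v > 0 else 0.0 for v, g in zip(values, upstream)]


inputs = [-1.0, 0.5, 2.0]
outputs = relu_forward(inputs)
grads = relu_backward(inputs, [1.0, 1.0, 1.0])
```

Here outputs holds the clipped values and grads is zero for the negative input, which is why a unit whose inputs stay negative stops receiving updates.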
tanh
tanh(x: Tensor) -> Tensor
Apply hyperbolic tangent element-wise, mapping inputs to (-1, 1).
sigmoid
sigmoid(x: Tensor) -> Tensor
Apply sigmoid activation element-wise: 1 / (1 + exp(-x)), mapping inputs to (0, 1).
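The sigmoid formula overflows in floating point if exp(-x) is evaluated directly for very negative x. The following is a minimal standard-library sketch of one common way to avoid that; it is not taken from Simplegrad, and the name stable_sigmoid is hypothetical. It also checks the identity tanh(x) = 2 * sigmoid(2x) - 1 that links the two activations.

```python
import math


def stable_sigmoid(x: float) -> float:
    """Compute 1 / (1 + exp(-x)) without overflowing for large |x|."""
    if x >= 0:
        # exp(-x) is at most 1 here, so it cannot overflow.
        return 1.0 / (1.0 + math.exp(-x))
    # For negative x, rewrite as exp(x) / (1 + exp(x)); exp(x) is below 1.
    z = math.exp(x)
    return z / (1.0 + z)


x = 0.7
tanh_via_sigmoid = 2.0 * stable_sigmoid(2.0 * x) - 1.0
gap = abs(math.tanh(x) - tanh_via_sigmoid)  # difference is at rounding level
```

Both branches evaluate the same function; they only differ in which exponential is computed, so the output always stays within [0, 1].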
elu
elu(x: Tensor, alpha: float = 1.0) -> Tensor
Apply Exponential Linear Unit activation element-wise.
ELU is defined as:
    elu(x) = x                      if x > 0
    elu(x) = alpha * (exp(x) - 1)   if x <= 0
Unlike ReLU, ELU has a smooth, non-zero output for negative inputs,
which allows the mean activation to be closer to zero and reduces the
"dying neuron" problem. The alpha parameter controls the saturation
value for strongly negative inputs (the limit as x → -∞ is -alpha).
Parameters:
- x (Tensor) – Input tensor of any shape.
- alpha (float, default: 1.0) – Scale of the negative region; the output saturates at -alpha. Must be > 0.
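The piecewise ELU definition and its saturation at -alpha can be checked with a small scalar sketch. This uses only the standard library and is not Simplegrad's implementation; the name elu_reference is hypothetical, and math.expm1 is used as an assumed refinement for accuracy near zero.

```python
import math


def elu_reference(x: float, alpha: float = 1.0) -> float:
    """Scalar ELU: x for x > 0, alpha * (exp(x) - 1) otherwise."""
    if alpha <= 0:
        raise ValueError("alpha must be > 0")
    if x > 0:
        return x
    # expm1(x) computes exp(x) - 1 without cancellation for small |x|.
    return alpha * math.expm1(x)


positive = elu_reference(2.0)                # identity branch
negative = elu_reference(-1.0)               # exp(-1) - 1
saturated = elu_reference(-50.0, alpha=1.5)  # approaches -alpha
```

For strongly negative inputs the result approaches -alpha, so a larger alpha lets negative activations pull the mean further below zero.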
gelu
gelu(x: Tensor, mode: str = 'erf') -> Tensor
Apply Gaussian Error Linear Unit activation element-wise.
GELU is defined as x * Φ(x), where Φ is the standard Gaussian CDF. It is the standard activation in transformer architectures such as BERT and GPT. Its authors motivate it as combining properties of dropout, zoneout, and ReLU. Unlike ReLU, it is smooth and weights each input by Φ(x), the probability that a standard normal variable falls below it, rather than gating by sign.
Two computation modes are provided:
"erf" (default, exact):
    gelu(x) = 0.5 * x * (1 + erf(x / sqrt(2)))
"tanh" (fast approximation, from the original paper):
    gelu(x) ≈ 0.5 * x * (1 + tanh(sqrt(2/π) * (x + 0.044715 * x³)))
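The tanh approximation was proposed by Hendrycks & Gimpel (2016). It tracks the exact form closely, with an absolute error below 1e-3 for all x; the relative error is larger in the negative tail, where GELU itself approaches zero. It is the mode used by most production implementations when a fast path is desired.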
Parameters:
- x (Tensor) – Input tensor of any shape.
- mode (str, default: 'erf') – Computation mode: "erf" for the exact formula or "tanh" for the fast approximation.
Returns:
- Tensor – Tensor of the same shape as x.
Raises:
- ValueError – If mode is not "erf" or "tanh".
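To make the two modes concrete, the sketch below evaluates both GELU formulas on scalars with the standard library and measures how far they differ. It is an illustration of the formulas only, not Simplegrad's implementation; gelu_reference is a hypothetical name, and the mode handling simply mirrors the behaviour documented above.

```python
import math


def gelu_reference(x: float, mode: str = "erf") -> float:
    """Scalar GELU using either the exact erf form or the tanh approximation."""
    if mode == "erf":
        return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))
    if mode == "tanh":
        inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
        return 0.5 * x * (1.0 + math.tanh(inner))
    raise ValueError(f"mode must be 'erf' or 'tanh', got {mode!r}")


# Largest absolute gap between the two modes on a grid over [-6, 6].
grid = [i / 100.0 for i in range(-600, 601)]
max_gap = max(
    abs(gelu_reference(x, "erf") - gelu_reference(x, "tanh")) for x in grid
)
```

On this grid max_gap stays below 1e-3, which is why the tanh form is usually an acceptable substitute when erf is slow or unavailable.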
softmax
softmax(x: Tensor, dim: int | None = None) -> Tensor
Apply softmax along the given dimension.
Softmax converts a vector of real values into a probability distribution:
each output is in (0, 1) and the values along dim sum to 1. It is
defined as exp(x_i) / sum_j(exp(x_j)).
The implementation is numerically stable: it subtracts max(x) along dim before exponentiation, which leaves the result unchanged and prevents overflow in exp.
Parameters:
- x (Tensor) – Input tensor of any shape.
- dim (int | None, default: None) – Dimension to normalize over. If None, normalizes over all elements.
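The max-subtraction trick described above can be sketched for a single vector in plain Python. This is a one-dimensional illustration under the stated definition, not Simplegrad's implementation, and softmax_reference is a hypothetical name.

```python
import math


def softmax_reference(values):
    """Softmax over a list of floats, shifted by the maximum for stability."""
    shift = max(values)
    # Every exponent is <= 0 after the shift, so exp cannot overflow.
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax_reference([-1.0, 0.5, 2.0])
# Inputs this large would overflow a direct exp(x) in floating point.
large = softmax_reference([1000.0, 1001.0])
```

Subtracting a constant from every input multiplies the numerator and denominator by the same factor, so the shifted version returns the same distribution as the textbook formula.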