Activations

Activation functions introduce non-linearity into a neural network, enabling it to learn complex mappings. Simplegrad provides the most common activations as differentiable functional ops. All of them operate element-wise (except softmax, which normalizes along a chosen dimension) and return a new Tensor with gradients wired into the computation graph.

import simplegrad as sg

x = sg.Tensor([-1.0, 0.5, 2.0], requires_grad=True)
out = sg.relu(x)               # [0.0, 0.5, 2.0]
probs = sg.softmax(x, dim=0)   # normalizes over dimension 0

relu

\[ f(x) = \max(0, x) \]

`relu(x: Tensor) -> Tensor`

Apply ReLU activation element-wise: max(0, x).

tanh

\[ f(x) = \tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}} \]

`tanh(x: Tensor) -> Tensor`

Apply hyperbolic tangent element-wise, mapping inputs to (-1, 1).

sigmoid

\[ f(x) = \frac{1}{1 + e^{-x}} \]

`sigmoid(x: Tensor) -> Tensor`

Apply sigmoid activation element-wise: 1 / (1 + exp(-x)), mapping inputs to (0, 1).
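
As a minimal scalar sketch of the formula above, the helper below evaluates 1 / (1 + exp(-x)) in plain Python. The name `sigmoid_ref` is hypothetical, and the split by sign is an assumption made here to avoid overflow in `exp`. It is not a description of how Simplegrad computes sigmoid internally.

```python
import math


def sigmoid_ref(x: float) -> float:
    """Scalar reference for 1 / (1 + exp(-x)) (hypothetical helper)."""
    if x >= 0:
        # exp(-x) <= 1 here, so the textbook form is safe.
        return 1.0 / (1.0 + math.exp(-x))
    # For x < 0, use the equivalent form exp(x) / (1 + exp(x)),
    # which keeps the argument of exp non-positive.
    z = math.exp(x)
    return z / (1.0 + z)


values = [sigmoid_ref(v) for v in (-1.0, 0.5, 2.0)]
```

In floating point the outputs can round to exactly 0.0 or 1.0 for very large |x|, even though the mathematical range is the open interval (0, 1).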

elu

\[ f(x) = \begin{cases} x & \text{if } x > 0 \\ \alpha(e^x - 1) & \text{if } x \leq 0 \end{cases} \]

`elu(x: Tensor, alpha: float = 1.0) -> Tensor`

Apply Exponential Linear Unit activation element-wise.

ELU is defined as:

elu(x) = x                    if x > 0
         alpha * (exp(x) - 1) if x <= 0

Unlike ReLU, ELU has a smooth, non-zero output for negative inputs, which allows the mean activation to be closer to zero and reduces the "dying neuron" problem. The alpha parameter controls the saturation value for strongly negative inputs (the limit as x → -∞ is -alpha).

Parameters:

x (Tensor) – Input tensor of any shape.

alpha (float, default: 1.0) – Scale of the negative region; the output saturates at -alpha for strongly negative inputs. Must be > 0.
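
The piecewise definition and the saturation at -alpha can be checked with a small scalar sketch. The helper name `elu_ref` is hypothetical and the code is only a plain-Python illustration of the formula, not Simplegrad's implementation. It uses `math.expm1`, which computes exp(x) - 1 accurately for small x.

```python
import math


def elu_ref(x: float, alpha: float = 1.0) -> float:
    """Scalar reference for ELU (hypothetical helper, assumes alpha > 0)."""
    if x > 0:
        return x
    # Negative branch: alpha * (exp(x) - 1), which tends to -alpha as x -> -inf.
    return alpha * math.expm1(x)


positive = elu_ref(2.0)               # identity for x > 0
saturated = elu_ref(-50.0, alpha=2.0)  # very close to -alpha
```

For a strongly negative input the result is indistinguishable from -alpha in double precision, which matches the limit stated above.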

gelu

\[ f(x) = x \cdot \Phi(x) = \frac{x}{2}\left(1 + \operatorname{erf}\!\left(\frac{x}{\sqrt{2}}\right)\right) \approx \frac{x}{2}\left(1 + \tanh\!\left(\sqrt{\frac{2}{\pi}}\left(x + 0.044715\,x^3\right)\right)\right) \]

`gelu(x: Tensor, mode: str = 'erf') -> Tensor`

Apply Gaussian Error Linear Unit activation element-wise.

GELU is defined as x * Φ(x), where Φ is the standard Gaussian CDF. It is the standard activation in transformer architectures such as BERT and GPT. Its authors motivated it as combining properties of dropout, zoneout, and ReLU: unlike ReLU it is smooth, and it weights each input by the probability Φ(x) instead of gating on sign alone.

Two computation modes are provided:

"erf" (default, exact):

gelu(x) = 0.5 * x * (1 + erf(x / sqrt(2)))

"tanh" (fast approximation from the original paper):

gelu(x) ≈ 0.5 * x * (1 + tanh(sqrt(2/π) * (x + 0.044715 * x³)))

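The tanh approximation was proposed by Hendrycks & Gimpel (2016). Its absolute error relative to the exact form stays below roughly 1e-3 for all x, although the relative error grows for strongly negative inputs, where the exact value approaches zero. Many implementations offer it as a fast path.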

Parameters:

x (Tensor) – Input tensor of any shape.

mode (str, default: 'erf') – Computation mode: "erf" for the exact formula or "tanh" for the fast approximation.

Returns:

Tensor – Tensor of the same shape as x.

Raises:

ValueError – If mode is not "erf" or "tanh".
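
To get a feel for how close the two modes are, the sketch below evaluates both formulas on scalars and measures their largest absolute difference over a grid. The names `gelu_erf` and `gelu_tanh` are hypothetical stand-ins for the two modes, written in plain Python. They do not call Simplegrad and make no claim about its internals.

```python
import math


def gelu_erf(x: float) -> float:
    """Exact form: 0.5 * x * (1 + erf(x / sqrt(2)))."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def gelu_tanh(x: float) -> float:
    """Tanh approximation: 0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 x^3)))."""
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))


# Compare the two forms on a grid from -6 to 6 in steps of 0.01.
grid = [i / 100.0 for i in range(-600, 601)]
max_abs_diff = max(abs(gelu_erf(v) - gelu_tanh(v)) for v in grid)
```

On this grid `max_abs_diff` comes out well below 1e-3, which is usually negligible next to other sources of noise in training. The exact mode remains the safer choice when results must match a reference bit for bit.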

softmax

\[ \text{softmax}(x_i) = \frac{e^{x_i}}{\sum_j e^{x_j}} \]

`softmax(x: Tensor, dim: int | None = None) -> Tensor`

Apply softmax along the given dimension.

Softmax converts a vector of real values into a probability distribution: each output is in (0, 1) and the values along dim sum to 1. It is defined as exp(x_i) / sum_j(exp(x_j)).

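Numerically stable: subtracts max(x) along dim before exponentiation. The shift cancels between numerator and denominator, so the result is unchanged while exp never receives a positive argument and cannot overflow.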

Parameters:

x (Tensor) – Input tensor of any shape.

dim (int | None, default: None) – Dimension to normalize over. If None, normalizes over all elements.
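
The max-subtraction trick described above can be sketched for a flat list of floats. The helper `softmax_ref` is hypothetical and handles only the one-dimensional case. It illustrates the technique and is not a copy of Simplegrad's code.

```python
import math


def softmax_ref(values: list[float]) -> list[float]:
    """One-dimensional softmax with max subtraction (hypothetical helper)."""
    shift = max(values)
    # Every exponent is <= 0 after the shift, so exp cannot overflow.
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax_ref([-1.0, 0.5, 2.0])
# A naive exp(1000.0) would overflow; the shifted version does not.
large = softmax_ref([1000.0, 1001.0, 1002.0])
small = softmax_ref([0.0, 1.0, 2.0])
```

Because softmax is invariant to adding a constant to every input, `large` and `small` hold the same probabilities.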