Activations

Activation functions introduce non-linearity into a neural network, enabling it to learn complex mappings. Simplegrad provides the most common activations as differentiable functional ops. All of them operate element-wise, except softmax, which normalizes along a chosen dimension. Each returns a new Tensor whose gradients are wired into the computation graph.

import simplegrad as sg

x = sg.Tensor([-1.0, 0.5, 2.0], requires_grad=True)
out = sg.relu(x)              # [0.0, 0.5, 2.0]
probs = sg.softmax(x, dim=0)  # normalizes along dimension 0; values sum to 1

relu

\[ f(x) = \max(0, x) \]

relu(x: Tensor) -> Tensor

Apply ReLU activation element-wise: max(0, x).
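The forward rule above and its gradient can be illustrated with a minimal plain-Python sketch. This is not Simplegrad's implementation: the helper names `relu_forward` and `relu_backward` are hypothetical, and the derivative at exactly x == 0 is assumed to be 0, which is a common convention.

```python
def relu_forward(xs):
    """Element-wise max(0, x) over a flat list of floats."""
    return [max(0.0, x) for x in xs]


def relu_backward(xs, upstream):
    """Pass the upstream gradient through where x > 0 and block it elsewhere.

    Assumption: the derivative at x == 0 is taken to be 0.
    """
    return [g if x > 0 else 0.0 for x, g in zip(xs, upstream)]


xs = [-1.0, 0.5, 2.0]
print(relu_forward(xs))                      # [0.0, 0.5, 2.0]
print(relu_backward(xs, [1.0, 1.0, 1.0]))    # [0.0, 1.0, 1.0]
```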

tanh

\[ f(x) = \tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}} \]

tanh(x: Tensor) -> Tensor

Apply hyperbolic tangent element-wise, mapping inputs to (-1, 1).
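As a sanity check on the formula above, the sketch below evaluates the exponential form directly and compares the analytic derivative 1 - tanh(x)^2 against a finite difference. It uses only the standard library; `tanh_from_exp` and `tanh_grad` are hypothetical names, and nothing here is taken from Simplegrad's source.

```python
import math


def tanh_from_exp(x):
    """Direct evaluation of (e^x - e^-x) / (e^x + e^-x).

    Fine for moderate x; math.exp overflows for very large |x|.
    """
    ep, en = math.exp(x), math.exp(-x)
    return (ep - en) / (ep + en)


def tanh_grad(x):
    """Analytic derivative of tanh: 1 - tanh(x)^2."""
    t = math.tanh(x)
    return 1.0 - t * t


x, h = 0.5, 1e-6
# Central finite difference as an independent estimate of the derivative.
numeric = (math.tanh(x + h) - math.tanh(x - h)) / (2 * h)
formula_gap = abs(tanh_from_exp(x) - math.tanh(x))
grad_gap = abs(numeric - tanh_grad(x))
print(formula_gap < 1e-12, grad_gap < 1e-6)
```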

sigmoid

\[ f(x) = \frac{1}{1 + e^{-x}} \]

sigmoid(x: Tensor) -> Tensor

Apply sigmoid activation element-wise: 1 / (1 + exp(-x)), mapping inputs to (0, 1).
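Evaluating 1 / (1 + exp(-x)) literally can overflow for strongly negative x. The sketch below shows one common way to avoid that by branching on the sign of x. Whether Simplegrad does the same is not stated here; the function is a standalone illustration in plain Python.

```python
import math


def stable_sigmoid(x):
    """Sigmoid that never calls exp() on a large positive argument."""
    if x >= 0:
        # exp(-x) is in (0, 1], so this cannot overflow.
        return 1.0 / (1.0 + math.exp(-x))
    # For x < 0, use the equivalent form exp(x) / (1 + exp(x)).
    z = math.exp(x)
    return z / (1.0 + z)


print(stable_sigmoid(0.0))      # 0.5
print(stable_sigmoid(-1000.0))  # underflows to 0.0 instead of raising OverflowError
```

In exact arithmetic the output lies strictly in (0, 1); in floating point it can round to 0.0 or 1.0 for extreme inputs.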

elu

\[ f(x) = \begin{cases} x & \text{if } x > 0 \\ \alpha(e^x - 1) & \text{if } x \leq 0 \end{cases} \]

elu(x: Tensor, alpha: float = 1.0) -> Tensor

Apply Exponential Linear Unit activation element-wise.

ELU is defined as:

elu(x) = x                    if x > 0
         alpha * (exp(x) - 1) if x <= 0

Unlike ReLU, ELU has a smooth, non-zero output for negative inputs, which allows the mean activation to be closer to zero and reduces the "dying neuron" problem. The alpha parameter controls the saturation value for strongly negative inputs (the limit as x → -∞ is -alpha).

Parameters:

  • x (Tensor) –

    Input tensor of any shape.

  • alpha (float, default: 1.0 ) –

    Scale factor for the negative region; the output saturates at -alpha. Must be > 0. Defaults to 1.0.
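The piecewise ELU definition and the role of alpha can be checked with a small scalar sketch. It is a plain-Python illustration under the definition given above, not Simplegrad's code; `elu_scalar` is a hypothetical name, and `math.expm1` is used only because it computes exp(x) - 1 accurately near zero.

```python
import math


def elu_scalar(x, alpha=1.0):
    """x for x > 0, alpha * (exp(x) - 1) otherwise."""
    if alpha <= 0:
        raise ValueError("alpha must be > 0")
    return x if x > 0 else alpha * math.expm1(x)


print(elu_scalar(2.0))               # positive inputs pass through unchanged
print(elu_scalar(-1.0))              # smooth negative output, about -0.632
print(elu_scalar(-50.0, alpha=2.0))  # saturates near -alpha
```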

gelu

\[ f(x) = x \cdot \Phi(x) \approx \frac{x}{2}\left(1 + \tanh\!\left(\sqrt{\frac{2}{\pi}}\left(x + 0.044715\,x^3\right)\right)\right) \]

gelu(x: Tensor, mode: str = 'erf') -> Tensor

Apply Gaussian Error Linear Unit activation element-wise.

GELU is defined as x * Φ(x), where Φ is the standard Gaussian CDF. It is widely used in transformer architectures (BERT, GPT). Unlike ReLU, it is smooth and weights each input by the probability Φ(x) rather than gating it by sign. The original paper motivates it as combining ideas from dropout, zoneout, and ReLU.

Two computation modes are provided:

"erf" (default, exact):

gelu(x) = 0.5 * x * (1 + erf(x / sqrt(2)))

"tanh" (fast approximation given in the original paper):

gelu(x) ≈ 0.5 * x * (1 + tanh(sqrt(2/π) * (x + 0.044715 * x³)))

The tanh approximation was given by Hendrycks & Gimpel (2016). Its absolute error against the exact form stays small (roughly below 1e-3), although the relative error grows for strongly negative inputs, where GELU itself approaches zero. Many implementations offer it as a fast path.

Parameters:

  • x (Tensor) –

    Input tensor of any shape.

  • mode (str, default: 'erf' ) –

    Computation mode — "erf" for the exact formula or "tanh" for the fast approximation. Defaults to "erf".

Returns:

  • Tensor

    Tensor of the same shape as x.

Raises:

  • ValueError

    If mode is not "erf" or "tanh".
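The two GELU modes described above can be compared numerically with a short standard-library sketch. It mirrors the documented formulas and the ValueError on an unknown mode, but it is a scalar illustration written for this page, not Simplegrad's implementation; `gelu_scalar` and the grid range are assumptions.

```python
import math


def gelu_scalar(x, mode="erf"):
    """Scalar GELU using the documented 'erf' or 'tanh' formula."""
    if mode == "erf":
        return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))
    if mode == "tanh":
        inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
        return 0.5 * x * (1.0 + math.tanh(inner))
    raise ValueError(f"mode must be 'erf' or 'tanh', got {mode!r}")


# Largest absolute gap between the two modes on a grid over [-6, 6].
grid = [i / 100.0 for i in range(-600, 601)]
max_gap = max(abs(gelu_scalar(x, "erf") - gelu_scalar(x, "tanh")) for x in grid)
print(f"max |erf - tanh| on grid: {max_gap:.2e}")
```

Measuring the gap in absolute terms is a deliberate choice: the exact GELU tends to zero for strongly negative x, so a relative measure becomes large there even when both modes agree closely.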

softmax

\[ \text{softmax}(x_i) = \frac{e^{x_i}}{\sum_j e^{x_j}} \]

softmax(x: Tensor, dim: int | None = None) -> Tensor

Apply softmax along the given dimension.

Softmax converts a vector of real values into a probability distribution: each output is in (0, 1) and the values along dim sum to 1. It is defined as exp(x_i) / sum_j(exp(x_j)).

Numerically stable: subtracts max(x) along dim before exponentiation, which leaves the result unchanged but prevents overflow.

Parameters:

  • x (Tensor) –

    Input tensor of any shape.

  • dim (int | None, default: None ) –

    Dimension to normalize over. If None, normalizes over all elements.
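The max-subtraction trick mentioned above can be sketched for a single vector in plain Python. This is an illustration of the technique rather than Simplegrad's code; `softmax_1d` is a hypothetical helper that handles only a flat list, with no `dim` argument.

```python
import math


def softmax_1d(xs):
    """Softmax over a flat list, shifted by max(xs) for numerical stability."""
    m = max(xs)
    # After the shift every exponent is <= 0, so exp() cannot overflow.
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax_1d([-1.0, 0.5, 2.0])
print(probs, sum(probs))

# Shifting all inputs by a constant does not change the result, and the
# large values below would overflow math.exp without the max subtraction.
shifted = softmax_1d([1000.0, 1001.0, 1002.0])
baseline = softmax_1d([0.0, 1.0, 2.0])
print(shifted == baseline)
```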