Description
One of the key challenges in optimizing neural networks is the inherent high dimensionality and non-convexity of the objective function. Even a single neuron with sigmoid activation, trained under the square loss, is known to have a number of local minima that grows exponentially with the input dimension. Properly tuned gradient-based methods converge to a stationary point, which prompts the question: which stationary points do these methods typically find, and how can we bound their convergence rates?
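As a minimal illustration of this setting, the sketch below runs gradient descent on the square loss of a single sigmoid neuron; the synthetic Gaussian data, teacher labels, and step size are illustrative assumptions, not the setup analyzed in the talk.

```python
import numpy as np

# Minimal sketch: gradient descent on the square loss of a single sigmoid neuron,
#   L(w) = (1/n) * sum_i (sigmoid(w . x_i) - y_i)^2,
# with synthetic Gaussian data and teacher labels (illustrative assumptions).
rng = np.random.default_rng(0)
n, d = 200, 10
X = rng.standard_normal((n, d))
w_star = rng.standard_normal(d)            # "teacher" weights generating the labels
sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))
y = sigmoid(X @ w_star)

w = rng.standard_normal(d)                 # random initialization
lr = 0.5
for step in range(5000):
    p = sigmoid(X @ w)
    # gradient of the square loss: (2/n) * X^T [(p - y) * p * (1 - p)]
    grad = (2.0 / n) * X.T @ ((p - y) * p * (1 - p))
    w -= lr * grad

print("final loss:   ", np.mean((sigmoid(X @ w) - y) ** 2))
print("gradient norm:", np.linalg.norm(grad))   # small norm => near a stationary point
```

Which stationary point this run lands in depends on the random initialization; rerunning with different seeds is a quick way to see the multiplicity of such points.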
Beyond the popular ReLU and sigmoid functions, recent work has explored polynomial and rational activations. Polynomial activations have shown promise in computer vision tasks, while rational activations have been applied to solving PDEs and training Generative Adversarial Networks. Notably, well-known activations such as ReLU, as well as many smooth activations, can be approximated by polynomials or rational functions to any desired accuracy.
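A minimal sketch of such an approximation is given below: a low-degree rational function is fitted to ReLU on [-1, 1] by a linearized least-squares fit. The chosen degrees and fitting procedure are illustrative assumptions, not the constructions referenced above.

```python
import numpy as np

# Sketch: fit a rational function p(x)/q(x) to ReLU on [-1, 1] by solving the
# linearized problem p(x_i) - y_i * q(x_i) ~ 0 with the constant term of q fixed
# to 1. Degrees (3, 2) are an illustrative assumption.
x = np.linspace(-1.0, 1.0, 400)
y = np.maximum(x, 0.0)                           # ReLU samples

m, k = 3, 2                                      # numerator / denominator degrees
P = np.vander(x, m + 1, increasing=True)         # columns: 1, x, ..., x^m
Q = np.vander(x, k + 1, increasing=True)[:, 1:]  # columns: x, ..., x^k

# Stack coefficients z = (a, b) and solve min || P a - y - y * (Q b) ||^2.
A = np.hstack([P, -(y[:, None]) * Q])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
a, b = coef[: m + 1], coef[m + 1:]

num = P @ a
den = 1.0 + Q @ b
print("max |ReLU(x) - p(x)/q(x)| on [-1,1]:", np.max(np.abs(y - num / den)))
# A small value of min |q(x)| would signal a pole near the interval -- exactly the
# kind of complication for gradient-based training discussed below.
print("min |q(x)| on [-1,1]:", np.min(np.abs(den)))
```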
In this talk, I will (1) describe the training dynamics of shallow neural networks with these algebraic activations, focusing on rational networks as a representative case, (2) characterize their stationary points and investigate how poles, factorization symmetries, and higher-dimensional parameter spaces complicate gradient-based optimization, and (3) discuss the existence and elimination of “spurious valleys” (connected components of sub-level sets that exclude a global minimum) in different architectures. I will demonstrate the theoretical findings with numerical experiments.
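To give a flavor of such an experiment, here is a minimal sketch of full-batch gradient descent on a one-hidden-layer network with a fixed rational activation; the activation, data, and hyperparameters are illustrative assumptions and not the talk's actual experiments.

```python
import torch

# Sketch: training dynamics of a shallow network with a rational activation
# r(t) = t / (1 + t^2), trained by full-batch gradient descent on the square loss.
# Activation, data, and hyperparameters are illustrative assumptions.
torch.manual_seed(0)

def rational(t):
    return t / (1.0 + t ** 2)          # degree-(1, 2) rational activation

n, d, width = 256, 5, 16
X = torch.randn(n, d)
y = torch.sin(X.sum(dim=1, keepdim=True))          # synthetic regression target

W1 = torch.nn.Parameter(torch.randn(d, width) / d ** 0.5)
W2 = torch.nn.Parameter(torch.randn(width, 1) / width ** 0.5)
opt = torch.optim.SGD([W1, W2], lr=0.1)

for step in range(2001):
    pred = rational(X @ W1) @ W2
    loss = torch.mean((pred - y) ** 2)
    opt.zero_grad()
    loss.backward()
    opt.step()
    if step % 500 == 0:
        grad_norm = torch.cat([W1.grad.flatten(), W2.grad.flatten()]).norm()
        print(f"step {step:4d}  loss {loss.item():.4f}  grad norm {grad_norm.item():.2e}")
```

Tracking the loss and gradient norm along such trajectories, over many random initializations, is one simple way to probe which stationary points are reached and whether the iterates get trapped away from a global minimum.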