Autoscaling

By far the most powerful form of scaling is autoscaling. Under this model, App Engine monitors key application metrics such as requests per second, latency, errors, and resource utilization. As these metrics change, the App Engine scheduler decides whether to route additional requests to existing instances or to scale the service up or down. Beyond these performance metrics, the scheduler also takes into account factors such as request queue depth and application startup time in order to stay ahead of traffic spikes.

For more information on how the App Engine scheduler performs autoscaling, refer to https://cloud.google.com/appengine/docs/standard/python/how-instances-are-managed.

Autoscaling services must operate under relatively strict conditions compared to basic and manual scaling. For example, background threading is not allowed, and relatively short request timeouts are enforced on the service. These restrictions are more severe for services running in the standard environment than in the flexible environment. In return, as we covered earlier, services running in the standard environment are able to scale much more rapidly than those in the flexible environment, a difference that becomes significant for services that experience regular spikes in traffic.
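For contrast, the two other strategies mentioned above are selected with their own blocks in the service's application configuration file. The following is a minimal sketch; the instance counts and the timeout are illustrative values rather than recommendations, and a service declares only one scaling block at a time:

```yaml
# Alternative 1: manual scaling keeps a fixed number of instances running.
manual_scaling:
  instances: 3          # illustrative value

# Alternative 2: basic scaling creates instances on demand and shuts them
# down once they have been idle for the given period.
# (Commented out here because only one scaling block may be present.)
# basic_scaling:
#   max_instances: 5    # illustrative value
#   idle_timeout: 10m   # illustrative value
```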

When scaling down dynamic instances, App Engine stops the instances without destroying them. They remain in an inactive state and do not contribute to billable hours. To see this in the App Engine dashboard, go to Navigation menu | App Engine | Instances and select a service that uses automatic scaling. In the metrics drop-down list (Summary, by default), select the Instances option. The resulting graph displays the number of created instances, the number of active instances, and the billed instance estimate.

For a service that has not recently received traffic, the number of active instances and the billed instance estimate should both be zero, depicted as follows:

With autoscaling, App Engine scales the number of running instances to zero when the service is not in use

Autoscaling is the default strategy for App Engine services. The behavior can be customized through the automatic_scaling configuration properties in the service's application configuration file as follows:

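automatic_scaling:
  min_idle_instances: 2            # idle instances kept ready for traffic
  max_idle_instances: 10           # upper bound on idle instances
  max_concurrent_requests: 100     # concurrent requests per instance
  min_pending_latency: 10ms        # minimum time a request waits in the pending queue
  max_pending_latency: automatic   # let App Engine choose the maximum wait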

Developers may choose to set min_idle_instances to a value greater than zero to reduce startup latency after periods of inactivity, at the potential cost of increased instance hours. Conversely, max_idle_instances caps the number of idle instances held in reserve, which limits instance hours at the potential cost of performance under peak traffic. The remaining properties define the conditions under which the service is scaled. max_concurrent_requests is the number of requests a single instance may handle at once before the scheduler looks for another instance. min_pending_latency and max_pending_latency are the shortest and longest times App Engine allows a request to wait in its internal pending queue before starting a new instance to handle it; a request that is still waiting can be dispatched to an existing instance as soon as one has capacity.
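As an illustration of this trade-off, the following sketch shows how a cost-sensitive service might be configured. The runtime, the service name, and every numeric value are assumptions chosen for the example, and the settings shown apply to the standard environment:

```yaml
# app.yaml for a hypothetical service; all values are illustrative only.
runtime: python39
service: reports                 # hypothetical service name

automatic_scaling:
  min_idle_instances: 0          # keep no warm spares; the first request after
                                 # an idle period pays the startup cost
  max_idle_instances: 1          # retain at most one idle instance after a spike
  max_concurrent_requests: 20    # requests one instance handles at a time
  min_pending_latency: 100ms     # let requests queue briefly before scaling up
  max_pending_latency: 500ms     # beyond this wait, more instances are started
```

A latency-sensitive service would lean the other way, with a higher min_idle_instances and lower pending latency values, accepting the additional instance hours.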