DownFlux

Local Collision Avoidance

Sun, 19 Dec 2021 00:00:00 -0800

See ORCA in action at github.com/downflux/go-orca.

Consider a rectangle. If we double both the length and width of the box, we quadruple the total area – the area of a 2D object grows much faster than its characteristic length. We sometimes refer to this phenomenon, and others like it, as the curse of dimensionality. The basic idea is that every additional factor a decision has to consider multiplies the amount of work needed to make it.

We are using this rectangle to represent the world map in DownFlux. One of the fundamental things we need to do in a real-time strategy game is to order units to move around the map. Pathfinding techniques such as A* are great at finding the optimal global path – but for large maps, we have to search a lot due to the curse. This problem balloons when we consider the number of units an RTS typically fields, e.g. on the order of thousands. Furthermore, remember that all of these units have hitboxes – when we run pathfinding, not only do we need to ensure units do not collide with walls, we also need to make sure units don’t collide with one another. As the units are all moving, the only way we can do this within the A* framework is to recalculate the paths. A lot.

Putting all of this together, we expect that pathfinding will be a significant drain on our computing resources, of which we have very little when taking into consideration these computations will all need to be completed within the fraction of a second comprising a server tick.

In order to reduce the pathfinding computation time then, it appears we need to tackle the problem on two fronts –

find a way to apply the results of a single A* calculation to multiple units, and
reduce the number of A* pathfinding calculations which need to occur due to potential collisions

A common command pattern in real-time strategy games is for a player to issue a move command to an entire group of units. A natural inclination then, is to run A* only on the single move target for all the units currently selected. But doing so will naturally generate numerous collision events as the units converge on a common target – so we need a way to calculate local unit movement without falling back to A*. We need local collision avoidance detection.

ORCA

Optimal Reciprocal Collision Avoidance (ORCA) is a technique which guarantees local collision avoidance for a set of independent agents; that is, we can simulate a bunch of moving objects, and ensure that the objects do not overlap, without (or with very limited) knowledge of any global state. This is incredibly applicable to our problem because we can bypass all the A* invocations otherwise triggered by collision detection, which in theory will drastically reduce the computational load.

ORCA achieves collision avoidance in two steps –

calculating all agent-agent interactions and coming up with a characteristic velocity which avoids collisions
```
 f(a, b) -> v
```
given all such velocities, calculate a velocity for each agent that accounts for all potential upcoming collisions i.e., a fold operation
```
 g({v_a}) -> v
```

When we apply these steps to all agents, we will get an agent {a: v} map, where there is a guarantee no collisions will occur if the agent sets their velocity to the prescribed output. What remains is to describe how these two steps actually work. We will focus on the characteristic collision avoidance velocity here, and leave the second step to a future post.
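The two steps compose as below. This is only a structural sketch: `Agent`, `Vec2`, `PairFunc`, `FoldFunc` and `Step` are hypothetical names, and the actual bodies of f and g are passed in as parameters rather than implemented. The loop visits every pair of agents for simplicity.

```go
package main

// Vec2 is a 2D velocity; Agent is a hypothetical stand-in for a moving unit.
type Vec2 struct{ X, Y float64 }

type Agent struct {
	ID  int
	Vel Vec2
}

// PairFunc is step one, f(a, b) -> v: the characteristic velocity for agent a
// with respect to agent b.
type PairFunc func(a, b Agent) Vec2

// FoldFunc is step two, g({v}) -> v: it folds all per-pair velocities of one
// agent into a single velocity.
type FoldFunc func(vs []Vec2) Vec2

// Step applies f to every ordered pair of distinct agents, then folds the
// results with g, yielding the {a: v} map keyed by agent ID.
func Step(agents []Agent, f PairFunc, g FoldFunc) map[int]Vec2 {
	out := make(map[int]Vec2, len(agents))
	for _, a := range agents {
		var vs []Vec2
		for _, b := range agents {
			if a.ID != b.ID {
				vs = append(vs, f(a, b))
			}
		}
		out[a.ID] = g(vs)
	}
	return out
}
```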

Consider two agents that are currently moving towards each other.

Figure 1: Two agents in position (p-)space heading towards one another.

In order to determine if these two objects will collide, we can systematically construct a velocity obstacle (VO) object in velocity (v-)space (Figure 2). The VO object is defined by two fundamental properties –

the shape of the central blockage¹ of the VO object, and
how far the center of mass of this blockage sits from the origin of v-space, which can be calculated from the relative velocities of the two objects.

The shape of the central blockage is defined to be the set of all relative² velocities between two objects which will result in collision, and the VO “cone” is built from extending a line from the origin to the edge of the blockage.

Any relative velocity between the two objects that falls within this cone indicates that the two objects will collide at some point in the future, assuming the velocities stay constant.

Figure 2: Velocity obstacle between the two agents. The left figure demonstrates the intuitive construction of a velocity cone between two objects – here, we find the velocities of the bottom agent that will result in the two agents colliding. The right figure demonstrates a rough construction of the velocity obstacle object created by the agents in Figure 1. Note that the circle here defines the characteristic “width” of the cone, and its radius is proportional to r_A + r_B.

We note that the distance from the v-space origin to the collision artifact (i.e. disc) is a function of time – that is, we will only achieve a collision if the velocity remains unchanged while the distance between the two agents shrinks. If our simulation time is very short (smaller than the time it would take to achieve collision) we should be able to proceed with the given velocity, even if it will eventually cause a collision. Thus, we can consider a truncated VO object whose base is a circle of radius r₀ / 𝜏, where r₀ is the radius of the solid circle in Figure 2 and 𝜏 is the simulation timestep. For example, if we set

𝜏 = 1

in Figure 2, the base of the truncated VO object is just the solid circle, and the VO object points away from the origin. To reiterate, relative velocities between the two agents which fall inside this truncated cone will cause a collision within the next timestep.

Given a VO cone, it becomes fairly simple to generate a velocity which will avoid collision – this is the projected normal vector u onto the edge of VO. Because the algorithm is reciprocal, we can assume³ the opposing agent will also move to avoid the collision – thus, we only need to alter the velocity of each agent by ||u||/2 (directed away from one another in p-space). Note any relative velocity outside the VO object will ensure the two agents will not collide – because of reasons⁴, we narrow this search space to a half-plane⁵ ORCA_A|B for agent A, which is orthogonal to u and passes through the minimally-adjusted velocity v_A + u/2 (see Figure 3).

Figure 3: Construction of the ORCA half-plane of agent A given agent B. Note that u points to the closest point on the VO object from the relative velocity, and thus by definition is perpendicular to the surface of VO. Here, F(ORCA_A|B) indicates the direction of the half-plane – that is, the region in v-space containing the permissible velocities for agent A.
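Given u, the half-plane itself is cheap to represent: a point on the boundary line and a unit normal. The sketch below assumes the relative velocity currently lies inside the VO (so that u points out of it), that u is non-zero, and that both agents share the correction equally. `HalfPlane`, `ORCA` and `Permits` are illustrative names, not the go-orca API, and computing u itself is not shown.

```go
package main

import "math"

type Vec2 struct{ X, Y float64 }

// HalfPlane is the set of velocities v satisfying (v - P)·N >= 0.
type HalfPlane struct {
	P Vec2 // a point on the boundary line
	N Vec2 // unit normal pointing into the permitted side
}

// ORCA builds agent A's half-plane from its current velocity vA and the
// correction vector u. The boundary is orthogonal to u and passes through
// vA + u/2. Assumes u is non-zero and points out of the VO.
func ORCA(vA, u Vec2) HalfPlane {
	l := math.Hypot(u.X, u.Y)
	return HalfPlane{
		P: Vec2{vA.X + u.X/2, vA.Y + u.Y/2},
		N: Vec2{u.X / l, u.Y / l},
	}
}

// Permits reports whether v is a permissible velocity for agent A.
func (h HalfPlane) Permits(v Vec2) bool {
	return (v.X-h.P.X)*h.N.X+(v.Y-h.P.Y)*h.N.Y >= 0
}
```

A weighted split, as described in the notes, would replace the factor of one half with a per-agent weight.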

We will leave discussion of how to use these ORCA planes to the next part.

Works Cited

van den Berg et al. “Reciprocal n-Body Collision Avoidance.” 2011.
Snape et al. “Reciprocal Collision Avoidance and Navigation for Video Games.” 2012.
Sunshine-Hill, Ben. “RVO and ORCA: How They Really Work.” 2017.
Snape, James. snape/RVO2. 2021.⁶

Notes

For two circular agents, this is a disc. ↩
Velocity obstacles are generally constructed for non-relativistic agents, so the relative velocity is just the ordinary vector difference v_A - v_B. ↩
This is a configurable value – for example, we may make an agent with more mass less liable to change its own velocity. This can be done either implicitly, by refusing to alter the actual velocity of the more massive agent, at the cost of potential collisions if the timestep is too large, or by feeding the VO-generation library with a local weighting function (e.g. giving the more massive agent a weighted velocity change value of ||u||/10, with the less massive agent moving the remainder 9||u||/10). ↩
Why we define this geometric object is due to math™, but more details can be found in van den Berg et al. ↩
Technically a half-space in N-dimensional ambient space (e.g. the region on one side of a plane if our velocity vectors have a z-component). ↩
A lot of the work in our ORCA implementation is based on the official ORCA implementation under the Apache 2.0 license. We thank the original author for their work. ↩

Commanding RTS Commands

Fri, 05 Feb 2021 00:00:00 -0800

Scaling State Mutations via FSM Visitors

DownFlux is a real-time strategy game in active development at github.com/downflux. The goal of this project is simply to learn and have fun. I have several years of professional software development experience, none of which is in the game industry. This document does not advocate a general form solution for all state mutation problems, but rather demonstrates a different view of the command pattern. For a more technical and detailed overview of this approach, take a look at the design doc.

I mix first person plural in this document liberally because it sounds awkward to keep saying “I” all the time, not because I’m royalty.

Abstract

A major problem we’re facing while working on DownFlux has been finding a scalable approach to state mutations. Scalability here represents the ability for us to remain agile when implementing new mutation flows – this encompasses general good software development guidelines like testability, code “fragrance” (i.e. lack of smell), and framework flexibility.

Our model of a mutation flow consists of a command scheduler object, housing a metadata object per distinct flow invocation. These metadata objects are a thin wrapper around a finite state machine (FSM), and expose a minimal subset of the game state to a visitor object.

Our metadata objects may only call read-only queries to the game state, and return a calculated state to the visitor. The visitor may invoke write operations on both the metadata and the underlying state.

See a snapshot of our repo for more details. Feel free to reach out on Reddit or Twitter with questions or comments.

Jargon

state mutations, flows, commands: a series of changes to the game state (e.g. map, entities, etc.) which achieve a specific end-goal (e.g. move)

Flow Examples

move(source, dest): move the source object to the destination location.
chase(source, target): series of serialized moves, which are updated as the destination object moves.
attack(source, target): chase target asynchronously; if target is within attack range and the source can attack (off cooldown), then commit state change.

An ad hoc Approach

The first attempt we made at implementing a state mutation “framework” skipped any consideration of scalability or maintainability for the sake of an MVP. Here is our single move command:

func (s *Server) doTick() {
  for {
    for _, c := range s.Commands() {
      // Client calls mutate this CommandQueue object by appending
      // pending commands.
      c.Execute(s.q[c.Type()])
    }
  }
}

type Command interface {
  Execute(args interface{}) error
  Type() CommandType
}

func (c *MoveCommand) Execute(args interface{}) error {
  a := args.(MoveCommandArg)

  // p is a list of Position objects (i.e. (x, y) tuples).
  p := c.Map.GetPath(a.Source.Location.Get(a.Tick), a.Destination)

  // Source merges the positions with internal velocity in
  // the curve.
  a.Source.Location.Update(p)
  return nil
}

Figure 1: Simple implementation of the move command.

Yup. This moves things. How do we start overengineering this?

Our second order approximation takes into consideration that GetPath is expensive – we’re making a full A* search. But in an RTS game, it is very often the case that the player directs units to a different location before the unit reaches the target, wasting a lot of compute cycles.¹ Therefore, we want to calculate and set a partial trajectory instead, with delayed execution of the rest of the path.²

Figure 2: Partial path diagram. The command should only calculate p₀ first; at some time t in the future, recalculate the path (which may involve further sub-path iterations).

With the partial path logic, our command now looks something like this:³

func (s *Server) doTick() {
  for _, args := range s.q {
    c.Execute(curTick, args)
  }
}

// Called by client API as well as internally.
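func (c *MoveCommand) Schedule(
  t Tick,
  e Entity,
  d Destination) error {

  // Schedule if nothing is queued for e yet, or if the precedence check
  // allows the queued action to be overwritten.
  scheduledAction := c.q.Get(e)
  if scheduledAction == nil || scheduledAction.Precedence(t) {
    c.q.Set(t, e, d)
  }
  return nil
}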

func (c *MoveCommand) Execute(t Tick, args interface{}) {
  const pathLen = 10
  arg := args.(MoveCommandArg)

  // Return a path of a specific length instead.
  p := c.Map.GetPath(
    arg.Source.Location.Get(t),
    arg.Destination,
    pathLen,
  )

  arg.Source.Location.Update(p)

  // Schedule partial path execution if the last element of the path is not
  // the "true" destination. c.Schedule() also needs to calculate if there are
  // any existing commands that need to be overwritten.
  if p[len(p) - 1] != arg.Destination {
    c.Schedule(
      t + arg.Source.CalculateTravelTime(p),
      arg.Source,
      arg.Destination)
  } else {
    c.Delete(arg)
  }
}

Figure 3: Toy move command implementation v2 – here we enqueue a delayed move command into the main queue. This queue may have client- or other server-initiated command scheduling, so when we update the queue, we need to ensure there is a single, canonical execution flow; this logic is packed into the Schedule() function, meaning a single command will need to know the implementation logic / hierarchy of all other commands.

Kind of a pain, but still doable.

This model worked well enough for us to get a rudimentary frontend client running; however, a gut check seems to indicate major scalability issues with this approach.⁴ In particular,

A command may act on multiple entity types, and an entity may have multiple mutation flows – the implementations so far already demonstrate this vulnerability IMO.
Because the command queue contains commands from all command implementations (i.e. move, attack, etc.) and the command may mutate the queue (e.g. partial move enqueues), the command must know the details of all sibling flows.
The command must manually check the global state each time it is invoked, e.g. if the source has reached the destination. It is unclear how each command will implement this state read, which will impede maintainability.
Command.Execute() reads and writes to the global state; from our simple move example, this already seems like a testability nightmare and needs to be addressed.

A common theme to these issues is the broad scope and authority we have conferred upon the command object; how can we clamp down on this?

(An Accidental) Tour de Entities

The first concern seems like a classic double dispatch problem between the command and the entities (e.g. tanks) that they mutate. This seems to suggest we should break out the command into a visitor pattern implementation.

func (s *Server) doTick() {
  for _, v := range s.Commands() {
    for _, e := range s.Entities {
      e.Accept(v)
    }
  }
}

func (e *EntityImpl) Accept(v Visitor) { v.Visit(e) }

func (c *MoveCommand) Visit(e Entity) {
  if !e.IsMoveable() { return }

  if c.q.Has(e) {
    // This is the same implementation as in Figure 3.
    c.Execute(c.Status.CurrentTick(), ...)
  }
}

Figure 4: The architectural change counterpart to the changes made in Figure 3.

There are some flaws here.

The Acceptor object is a single game entity – this is not abstract enough. Consider the attack command which mutates both the attacker and target – how do we visit target in an AttackCommand? Do we need a DealDamageVisitor? If so, this suggests we will need a message broker between attacking and taking damage, which seems unnecessarily overwrought.
The command still has to deal with the schedule (c.q), which is a global mutable state. As mentioned in Figure 3, the schedule may be edited by both sides of the network divide, and having our command deal with that logic directly seems messy.

Note that this refactor was actually useless in terms of reducing tech debt, but was very important in exposing the points of friction that we will need to address.

Finite State Metadata

Let’s examine the first concern above, where we’re dealing with pain points brought up by iterating over the entities themselves in a command. Because we’re visiting the entity, that means any broader details about the execution (including e.g. partial move cached data) still need to be managed by the command object:

type MoveCommand struct {
  // Reference to global state.
  q []MoveCommandArg
  ...
}

func (c *MoveCommand) Visit(e Entity) {
  if c.q.Has(e, ...) { ... }  // See Figure 4.
  ...
}

This seems inefficient – why are we accepting a non-scheduled entity as valid input? In fact, our first approach was probably closer to the mark – let’s just pass the command metadata as input instead!

func (c *MoveCommand) Visit(m MoveCommandArg) { ... }

One key difference between this and our initial implementation is how we’re approaching the metadata object here – we’re promoting the metadata into a “real” data struct, and as such, we need to consider the exported metadata API. What does a command need from the metadata?

In the case of move (with partial implementation), we need to track when the next iteration of partial paths needs to be calculated. Seems like a job for an FSM!

type CommandMetadata interface {
  Status() FSMState
  Transitions() map[FSMState]FSMState

  // Used to determine which command needs to be canceled.
  Precedence(o CommandMetadata) bool

  // Triggered by Schedule or a Command.
  Cancel()
}

// MoveCommandArg will implement the CommandMetadata interface.
type MoveCommandArg struct {
  scheduledTick Tick
  source        Moveable
  destination   Position
}

Figure 5: Expanded MoveCommandArg type from Figure 1.

The FSM DAG for MoveCommandArg is as follows:

Figure 6: move state diagram.

The most straightforward way to link this into MoveCommand.Visit() looks something like this:

func (c *MoveCommand) Visit(m *MoveCommandArg) {
  if m.Tick() == curTick {
    m.SetStatusOrDie(EXECUTING)
  }
  if m.Status() == EXECUTING {
    p = c.Map.GetPath(..., pathLen)
    ...

    // Need to schedule next iteration.
    if m.Destination() != p[len(p) - 1] {
      m.SetTick(...)
      m.SetStatusOrDie(PENDING)
    }  
  }
  if m.Source().Location(curTick) == m.Destination() {
    m.SetStatusOrDie(FINISHED)
  }
}

Figure 7: move implementation with partial paths and FSM metadata inputs.

This seems cleaner than what we had before! We have a formal FSM structure validating the partial command action being executed. Additionally, because we’re passing a reference to the metadata object into Visit(), we can migrate the schedule away from the command.

This still seems a bit messy though, when we have to call SetStatusOrDie so many times. Is there a way we can not do that?

(Yes.)

Read-Only FSMs

We observe that the state of an FSM is an explicit representation of the underlying system. It does not matter how we calculate this state! In Figure 7, we “calculated” the state by storing it as an internal variable via SetStatusOrDie(), but we can also treat the state as a generic read-only operation on the system.

As an example, let’s consider the state diagram of the move command:⁵

FINISHED: A move command is finished if the source entity has arrived at the given destination.
PENDING: If the internal m.scheduledTick does not equal the current tick, the command is not yet ready to execute; this accounts for both when the source is already moving, or still needs to calculate the next partial move.
EXECUTING: If m.scheduledTick equals the current game tick, the command needs to take action and actually calculate the path of the object. At the end of the execution phase, the scheduled tick should be updated.
CANCELED:⁶ An externally triggered transition if e.g. the client specifies another move command in the meantime. This may need to be explicitly set.

So in code form, this looks something like this:

// MoveCommandArg will implement the CommandMetadata interface.
type MoveCommandArg struct {
  scheduledTick Tick
  isCanceled    bool

  // References the actual game state.
  status        *TickStatus  // Exports CurrentTick().
  source        Moveable
  destination   Position
}

func (m *MoveCommandArg) Status() FSMState {
  if m.isCanceled { return CANCELED }
  if m.source.Location.Get(m.status.CurrentTick()) == m.destination {
    return FINISHED
  }
  if m.scheduledTick == m.status.CurrentTick() {
    return EXECUTING
  }
  return PENDING
}

// Visit takes a pointer so that SetScheduledTick() updates the queued
// metadata rather than a copy.
func (c *MoveCommand) Visit(m *MoveCommandArg) {
  if m.Status() == EXECUTING {
    ...
    m.SetScheduledTick(...)
  }
}

Figure 8: Toy implementation of the move command with smart metadata objects.

By making the metadata a bit smarter, we’ve greatly reduced the burden on the execution logic. Note that the metadata object itself is read-only – we are ensuring that only the command object has the ability to write to the game state, as well as to the metadata object (e.g. SetScheduledTick()). Our server tick logic currently looks like this:

func (s *Server) doTick() {
  for _, v := range s.Visitors() {
    for _, q := range s.q[v.Type()] {
      // It is up to each metadata list to decide if it may be run in parallel
      // or not.
      q.Accept(v)
    }
  }
}

To pause a second, here is what our infrastructure looks like at the moment:

Figure 9: FSM / Visitor relationship diagram. The dirty state component is outside the scope of this blog post, but is explained in the design doc.

Two-Pass Scheduler

The other friction point we had was with regards to the complexity of having the command pushing into a two-way schedule (i.e., one that is directly mutated by both the client and server). We need a way to control the timing of when schedule mutations are made.

Our solution to this problem was to implement a client-only schedule object which is used as a scratchpad for incoming requests. At the beginning of each tick, we merge this into our actual source-of-truth schedule:

type Schedule interface {
  Append(t VisitorType, m CommandMetadata)
  RemoveCanceledAndFinished()

  // Requires CommandMetadata to implement Precedence().
  Merge(o Schedule)
}

func (s *Server) doTick() {
  s.q.RemoveCanceledAndFinished()
  s.q.Merge(s.clientSchedule)

  for _, v := range s.Visitors() { ... }
}

Figure 10: Two-pass schedule implementation.

This ensures when commands are running, the command has exclusive write access to the schedule – since only instances of the same command are executing at the same time, reasoning about concurrency becomes greatly simplified.
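A minimal sketch of the merge step follows. It assumes one pending command per entity and a "newest request wins" precedence rule. Both are simplifications chosen for this example, and `Metadata`, `IssuedAt` and `NewSchedule` are hypothetical names rather than the DownFlux types.

```go
package main

// Metadata is a pared-down, hypothetical stand-in for CommandMetadata.
type Metadata struct {
	EntityID string
	IssuedAt int // tick at which the command was requested
	Canceled bool
}

// Precedence reports whether m outranks o. Assumption: the newer request wins.
func (m *Metadata) Precedence(o *Metadata) bool { return m.IssuedAt >= o.IssuedAt }

// Schedule holds at most one pending command per entity.
type Schedule struct {
	byEntity map[string]*Metadata
}

func NewSchedule() *Schedule { return &Schedule{byEntity: map[string]*Metadata{}} }

// Append adds m, canceling whichever of m and the existing entry for the same
// entity has lower precedence.
func (s *Schedule) Append(m *Metadata) {
	old, ok := s.byEntity[m.EntityID]
	if ok && !m.Precedence(old) {
		m.Canceled = true
		return
	}
	if ok {
		old.Canceled = true
	}
	s.byEntity[m.EntityID] = m
}

// Merge drains the client scratchpad o into the source-of-truth schedule s.
func (s *Schedule) Merge(o *Schedule) {
	for id, m := range o.byEntity {
		s.Append(m)
		delete(o.byEntity, id)
	}
}
```

Because Merge runs once at the top of the tick, client requests that arrive while commands are executing only ever touch the scratchpad.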

Conclusions

An interesting tangent: a core application of the visitor pattern is for a double-dispatch table; however, note that we have a strict one-to-one relationship between a single CommandMetadata implementation and a command. There is no double dispatch here.

However, the visitor pattern suits double dispatch precisely because it forces a decoupling of the underlying data object from the mutations. It is our good fortune that we chose to view the problem through this lens, even if we originally applied the pattern inappropriately.

If we wished, we could migrate back to using a simple for loop to call the commands, as we originally did, but safe in the knowledge that we have arrived at a scalable approach to building state mutation flows.

func (s *Server) doTick() {
  ...
  for _, c := range s.Commands() {
    for _, m := range s.q[c.GetType()] {
      // If the command wants to run serially, it may employ a class-level lock
      // on Execute().
      go c.Execute(m)
    }
    // Wait for all invocations to return before continuing to next command.
    ...
  }
}
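The "wait for all invocations" step is elided above. One conventional way to write it in Go is with a sync.WaitGroup, sketched here with a hypothetical `Executor` interface and plain ints standing in for metadata. `SerialSum` shows a command opting into serial execution by holding a mutex inside Execute().

```go
package main

import "sync"

// Executor is a hypothetical stand-in for a command; Execute handles one
// queued metadata item (an int here, for brevity).
type Executor interface {
	Execute(m int)
}

// RunAll executes every queued item concurrently and returns only once all
// invocations have finished.
func RunAll(c Executor, queue []int) {
	var wg sync.WaitGroup
	for _, m := range queue {
		wg.Add(1)
		go func(m int) {
			defer wg.Done()
			c.Execute(m)
		}(m)
	}
	wg.Wait()
}

// SerialSum is an executor that opts into serial execution by taking a
// command-level lock for the duration of Execute.
type SerialSum struct {
	mu    sync.Mutex
	Total int
}

func (s *SerialSum) Execute(m int) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.Total += m
}
```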

Chaining Commands

Let’s apply the same pattern to the attack command, a flow which has a dependent chase action.

type AttackMetadata struct {
  s     CanAttack
  t     CanDie  // Mortal?
  chase *ChaseMetadata
}

func (m *AttackMetadata) Status() Status {
  if m.chase.Status() == CANCELED { return CANCELED }
  if m.t.Health(curTick) <= 0 {
    return FINISHED  // Cleaned up next tick.
  }
  // d() is the distance between the attacker and the target.
  if d(m.s, m.t) < m.s.AttackRange() && m.s.OffCooldown(curTick) {
    return EXECUTING
  }
  return PENDING
}

func (c *AttackCommand) Visit(m *AttackMetadata) {
  if m.Status() == EXECUTING {
    m.t.Damage(m.s.Strength())
  }
}

Figure 11: Simplified attack command implementation.⁷

Dependencies in our framework are modeled by a pointer in the metadata to another metadata object; the encompassing flow can then incorporate the dependent flow status when reporting its own status. We have yet to encounter a case where the command needs to query a dependent step’s status directly.

A command may need to enqueue a dependent flow. For example, consider an entity commanded to guard an area – when an enemy enters the entity’s line of sight, guard may decide to enqueue an attack. In this case, the guard command will have a reference to the attack schedule and call q.Append().

Canceling Commands

q.Append() and q.Merge() will invoke CommandMetadata.Precedence(), which tests for the relative priority of two metadata objects. The lower priority one will be canceled.

CommandMetadata.Cancel() is command-dependent, but should also trigger the Cancel() function of dependencies. An upstream / parent command which needs the child to finish can then query the child flow status when reporting its own Status().
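As a rough illustration of both directions (a parent canceling its dependency, and a parent reporting a canceled dependency as its own cancellation), consider the sketch below. The `Flow` interface and the `Chase` / `Attack` structs are simplified, hypothetical versions of the metadata types, with all game-state checks removed.

```go
package main

type State int

const (
	Pending State = iota
	Executing
	Finished
	Canceled
)

// Flow is a hypothetical, minimal metadata interface.
type Flow interface {
	Status() State
	Cancel()
}

// Chase is a leaf flow with an explicit cancel flag.
type Chase struct{ canceled bool }

func (c *Chase) Cancel() { c.canceled = true }

func (c *Chase) Status() State {
	if c.canceled {
		return Canceled
	}
	return Pending
}

// Attack depends on a chase flow.
type Attack struct {
	canceled bool
	chase    Flow
}

// Cancel cancels the attack and propagates to its dependency.
func (a *Attack) Cancel() {
	a.canceled = true
	a.chase.Cancel()
}

// Status folds the dependency's status into the attack's own status.
func (a *Attack) Status() State {
	if a.canceled || a.chase.Status() == Canceled {
		return Canceled
	}
	return Pending
}
```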

Addendum

Recontextualizing as Event Flows

I came across a rather interesting tech talk while writing this article which talks about the Event-carried State Transfer software pattern (indeed, from what little research I’ve done on this, it seems like this talk is actually the talk which introduced the concept to the wider public).

There are some interesting parallels here between the event-driven approach described and ours here. Indeed, the state query in the command executor is just detecting if an event occurred between the last and current server tick. Moreover, the event-carried state transfer pattern seems to emphasize minimizing data access to the underlying state. The event pattern achieves this through some level of caching, packed into the event data in order to reduce resource contention. Our implementation instead minimizes the API surface area that is exposed through the command metadata.

It is true that we could massage our current approach into an event-driven approach; however, this seems both overengineered and antithetical to how we view our code.

Remember that we are treating the game system as deterministic. When an object moves, the partial move schedule is already preordained – there is no additional user input that is necessary in order to make the system behave correctly. Our framework accounts for this by doing a series of state reads. However, if we were to transform the state transitions into broadcasted events, we’re asserting instead the system is always in flux, and we’re “promoting” deterministic behavior into the category of “unexpected” inputs. This seems like a less elegant approach, and at the same time will require a large system overhaul for questionable value (for our use-case).
A single server tick will execute a list of commands in a known order, e.g. we process all move commands, then all attack commands, etc. Event queues are very useful when we are decoupling execution order from our server; however, if we were to do this, then a whole new, scary world of consistency problems appears. We can leave that problem to concurrent text editors and CRDTs.

A Digression on Attack Variants

While editing this document, a friend pointed out the toy implementation of the attack command does not fully specify some edge-case behavior –

I know you said this is simplified, but how do you handle situations where the command calls for a stationary source (e.g. tesla coil) to attack a target which then leaves its range?

Does it stay in the command queue in case the target comes back into range, with some lower-priority “auto attack” command dealing damage to nearby enemies in the meantime? Or does it cancel itself?

This question demonstrates a nice property of the FSM / visitor approach, which is the flexibility of implementation. The implementation in Figure 11 assumes that the target can move, and will always try to attack the same target until the target dies. How do we extend this command?

We can envision an attack variant that forgets the target after the target goes out of range:

type ForgetfulAttackMetadata struct {
  s           CanAttack
  t           CanDie
  hasAttacked bool
}

func (m *ForgetfulAttackMetadata) Status() Status {
  if m.t.Health(curTick) <= 0 {
    return FINISHED  // Cleaned up next tick.
  }
  if d(m.s, m.t) < m.s.AttackRange() && m.s.OffCooldown(curTick) {
    return EXECUTING
  }
  if m.hasAttacked && d(m.s, m.t) >= m.s.AttackRange() {
    return CANCELED  // Cleaned up next tick.
  }

  return PENDING
}

// Visit takes a pointer so that SetHasAttacked() persists on the queued
// metadata.
func (c *ForgetfulAttackCommand) Visit(m *ForgetfulAttackMetadata) {
  if m.Status() == EXECUTING {
    m.t.Damage(m.s.Strength())
    m.SetHasAttacked()
  }
}

Figure 12: Alternative attack command implementation, which cancels itself via a read-only operation if the target exits range.

Partial Tick Execution

Because the metadata is stored in a separate queue, we can pause command execution at any given time during a tick – this means we can smooth out large server loads over several ticks, allowing us to enforce a consistent server tick rate (at the expense of some additional end-to-end latency). This feature is not currently implemented, but may be of use later.
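Since this is not implemented, the following is purely a sketch of what resumable execution could look like. It budgets by item count rather than wall-clock time, which is a simplification made for this example, and `Budgeted` and `Tick` are hypothetical names.

```go
package main

// Budgeted visits queued metadata across several ticks, remembering where the
// previous tick stopped.
type Budgeted struct {
	queue  []string
	cursor int
}

// Tick visits up to budget items, resuming from the cursor, and reports
// whether the whole queue has now been processed.
func (b *Budgeted) Tick(budget int, visit func(string)) (done bool) {
	end := b.cursor + budget
	if end > len(b.queue) {
		end = len(b.queue)
	}
	for _, m := range b.queue[b.cursor:end] {
		visit(m)
	}
	b.cursor = end
	return b.cursor == len(b.queue)
}
```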

Notes

Partial pathfinding is implemented via hierarchical A*, though this may / will change in the future. The point is that there may be additional complexity introduced into commands. As an interesting sidenote, partial pathfinding allows us to spread out pathfinding to multiple workers after the initial coarse-grain search. This may be a nice optimization route to go down in the future. ↩
Future implementations of pathfinding, e.g. via flow fields or navmesh-based solutions, may eliminate the need for partial paths. ↩
In reality, this step was implemented along with initial visitor pattern migration (explained later), but we’re highlighting a rather important motivating point for seeking better approaches to the problem. ↩
While interface{} inputs are undesirable, they aren’t necessarily an architectural problem. We’re concerned with what are potential project-terminators due to non-maintainability. ↩
For more information on this, see Time-Invariant Finite State Machines. State transitions are traditionally triggered by an “external” user; we are expanding the FSM here to allow for the possibility that transitions may be triggered without an explicit outside trigger action. This allowance gives us a lot of flexibility in modeling semi-autonomous commands. ↩
Sidenote, I learned the objectively better “cancelled” spelling is British, and so have reverted to the inferior but semantically consistent American spelling. ↩
For a more in-depth discussion of the attack command implementation details, see A Digression on Attack Variants ↩

Arbitrary Command Execution

Wed, 13 Jan 2021 00:00:00 -0800

Scaling Complex Flows with FSM Metadata

Status	final
Author(s)	minke.zhang@gmail.com
Contributor(s)	bleh777777777777@gmail.com
Last Updated	2021-01-12

Background

DownFlux is a real-time strategy game which potentially requires a wide variety of state-mutating flows. Because both the state and the flows are complex, we need a formal framework to describe the work that needs to be done to change the state. In a world where we add ad hoc state mutations, we will very quickly see the pains of a complex chain of code without a clear debug entry point.

Overview

We will break any given state mutation into two parts – a command metadata object, and a command executor. The metadata describes the overall command, exposes a specific subset of the game state, and tracks the work that is currently being done and will need to be done. The metadata may hold a reference to a child command metadata struct as well.

On every game tick, the executor object queries the metadata for what work (if any) needs to be done. The metadata queries only its internal references – notably, this is a read-only operation; the metadata object does not have authority to mutate state on its own. If the metadata signals to the executor that work needs to be done, the executor will then explicitly mutate both the game and metadata object as appropriate.

Detailed Design

Figure 1: FSM / Visitor relationship diagram.

Game State

The game state represents the totality of game data. This state may include game entities e.g. tank instances, the curves representing an entity property over time, as well as any other general data e.g. server status, the current game tick, etc.

A subset of the game state is broadcast per tick to all connected clients.

FSM (Command Metadata)

A command is represented with a finite state machine with a fully defined transition graph. For example, the move command consists of the PENDING, EXECUTING, FINISHED, and CANCELED states, with transitions

PENDING → EXECUTING
PENDING → FINISHED
PENDING → CANCELED

This command will have references to the underlying game state as part of the data struct, e.g.

type MoveCommand struct {
  serverStatus        *Status
  positionCurve       Curve
  nextPartialMoveTick float64
}

The command may offer a set of utility functions to mutate the referenced subset of the game state, or its own internal state (e.g. nextPartialMoveTick), but must not invoke these mutations on its own. The command manager must make the mutations explicitly.

Virtual State Transitions

The command metadata will be used to calculate the “real” state of the command at any given point in time – if we schedule a move command to occur ten ticks in the future, we want to make sure the command itself knows when it needs to execute. This alleviates processing logic that otherwise will need to be handled by the iterator examining the commands (which in our case is the move visitor).

See Time Invariant Finite State Machines for more details.
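As a minimal sketch of a virtual state, the command below derives its visible state from the current tick without anyone writing to it. The field names (`internal`, `scheduledTick`) and the tick parameter on `State` are assumptions made for illustration; the real API reads the tick from the referenced game state instead.

```go
package main

import "fmt"

// State enumerates the FSM states named in this document.
type State int

const (
	Pending State = iota
	Executing
	Finished
	Canceled
)

// MoveCommand is a hypothetical, stripped-down command metadata struct.
type MoveCommand struct {
	internal      State   // last state explicitly set by the executor
	scheduledTick float64 // hypothetical: earliest tick at which to execute
}

// State returns the virtual state. A PENDING command reports EXECUTING
// once the current tick reaches the scheduled tick; nothing is mutated.
func (c MoveCommand) State(currentTick float64) State {
	if c.internal == Pending && currentTick >= c.scheduledTick {
		return Executing
	}
	return c.internal
}

func main() {
	c := MoveCommand{internal: Pending, scheduledTick: 10}
	fmt.Println(c.State(5) == Pending)    // before the scheduled tick
	fmt.Println(c.State(10) == Executing) // at the scheduled tick
}
```

The visitor only has to compare the returned value against EXECUTING; it never needs to know when the command was scheduled.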

API

func (c Command) ID() ID

The command may need to generate a UUID at init time – this ID will be used to check for duplicates of the command, and for calculating what commands of the same type may conflict with one another, e.g. two move commands on the same unit.

func (c Command) Accept(v Visitor) error

A command must allow an entry point for the visitor. This is part of the standard visitor pattern API.

func (c Command) State() (State, error)

A command will return its current, virtual (i.e. calculated) state. This will be used by the caller to determine what actions (if any) should be taken at the current point in time.

func (c Command) To(State) error

A command will surface a way to transition between different states in the internal FSM. This function will error out if there is no valid transition path from the current internal virtual state to the target.
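One way to enforce this is a lookup table of allowed edges, sketched below. The table contains only the three transitions listed earlier in this document, and the `Command` struct and `ErrInvalidTransition` names are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
)

type State string

const (
	Pending   State = "PENDING"
	Executing State = "EXECUTING"
	Finished  State = "FINISHED"
	Canceled  State = "CANCELED"
)

// transitions holds only the edges listed in this document; a real
// command would register its full graph here.
var transitions = map[State][]State{
	Pending: {Executing, Finished, Canceled},
}

// ErrInvalidTransition is a hypothetical sentinel error.
var ErrInvalidTransition = errors.New("invalid state transition")

// Command is a hypothetical minimal FSM holder.
type Command struct{ state State }

// To moves the command to the target state if an edge exists, and
// returns an error (leaving the state untouched) otherwise.
func (c *Command) To(target State) error {
	for _, s := range transitions[c.state] {
		if s == target {
			c.state = target
			return nil
		}
	}
	return fmt.Errorf("%w: %s -> %s", ErrInvalidTransition, c.state, target)
}

func main() {
	c := &Command{state: Pending}
	fmt.Println(c.To(Canceled))  // allowed edge
	fmt.Println(c.To(Executing)) // rejected: no edge out of CANCELED
}
```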

func (c Command) Precedence(d Command) bool

A command must know if it may be superseded by another higher-priority command. This function returns true if the input Command arg is of lower priority (i.e. “c precedes d”).

Chaining FSMs

Figure 2: Chaining FSMs; note that a visitor may queue additional dependent flows, but never accesses the dependent FSM itself, nor the dependent flow visitor.

The command may be a part of a larger, more intricate chain of commands – an attack-move command consists of both chasing a target and actually attacking when the target is within range. This is a valid pattern.

The parent command in this case may also need to expose an API endpoint to allow the visitor to change this reference. In our implementation of the Chase visitor, we regularly cancel and replace the referenced Move command with a new destination – this pointer is set via

func (c ChaseCommand) SetMove(m MoveCommand) error

See Figure 2 for more details.

Example

Consider our Attack command; logically, we have a background task in which the attacker constantly chases the target. If the target is within attack range, we then signal to the Visitor that this step is ready to execute:

// Simplified API for brevity.
func (c *AttackCommand) Status() Status {
  if c.chaseCommand.Status() == CANCELED {
    return CANCELED
  }
  // Execute only when the target is in range and the attacker is off cooldown.
  inRange := d(c.source.Position(), c.destination.Position()) < c.source.AttackRange()
  if inRange && c.source.OffAttackCooldown() {
    return EXECUTING
  }
  return PENDING
}

FSM List

An FSM list will keep track of all commands of a specific type (e.g. all Move commands). This list may be mutated by an arbitrary visitor (e.g. when a Chase visitor needs to spawn in a new Move command). The default access pattern is provided via the Accept function.

N.B.: Technically this may be implemented as a simple slice, but our underlying implementation uses a map struct instead for fast queries.

API

func (l List) Clear() error

At the beginning of a game tick, the list will be required to delete any references to FINISHED or CANCELED-state commands.

Any dependent commands which have references to these deleted commands will still have access to the data structs – in our Golang implementation, the underlying memory is not freed until the last reference is deleted.

func (l List) Merge(m List) error

Our engine implementation keeps two lists per command type – one for incoming user requests (the “cache”), and one as our source of truth (“source”). At the beginning of the game tick, after the source deletes canceled and finished commands, the cache will be merged into the source.

This merge may cancel some commands in the source, as user commands take priority; in this case, we will delete the reference to the canceled command, and replace it with the new user command. As in the case of the Clear() function, the chained command(s) will still have access to the data of the deleted command.
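The merge can be sketched as follows, assuming each list is a map keyed by whatever identifies conflicting commands. The `UnitID` key and the struct layout are hypothetical stand-ins, not the engine's actual types.

```go
package main

import "fmt"

type State string

const (
	Pending  State = "PENDING"
	Canceled State = "CANCELED"
)

// Command is a hypothetical minimal command; UnitID stands in for the
// key the real engine uses to detect conflicting commands.
type Command struct {
	UnitID string
	State  State
}

// List maps a conflict key to the single live command for that key.
type List map[string]*Command

// Merge moves every cached command into the source. A conflicting source
// command is marked CANCELED and its reference is replaced; anything
// still holding the old pointer keeps access to its data.
func (source List) Merge(cache List) {
	for key, incoming := range cache {
		if old, ok := source[key]; ok {
			old.State = Canceled
		}
		source[key] = incoming
		delete(cache, key)
	}
}

func main() {
	old := &Command{UnitID: "tank-1", State: Pending}
	source := List{"tank-1": old}
	cache := List{"tank-1": {UnitID: "tank-1", State: Pending}}

	source.Merge(cache)
	fmt.Println(old.State, source["tank-1"].State, len(cache))
}
```

Because `old` is still referenced by the caller after the merge, a chained command holding that pointer would observe the CANCELED state rather than a dangling reference.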

func (l List) Append(c Command) error

The list will expose a generic way to add a new command.

func (l List) Accept(v Visitor) error

The list will also be exposed to the visitor – this function usually only acts as an iterator wrapper around the tracked commands. Commands here may be mutated serially or concurrently, depending on the list implementation.

Merge

Figure 3: FSM List merge operation.

As stated above, for each command type, we keep two FSM list instances. The cache is used for keeping track of client (e.g. player) input, while the source is used to keep track of the actual work items that need to be done. After we clear the stale commands from the source list, we will then merge in the cache – this allows us to atomically schedule the client input, and to override existing commands in the queue.

Visitors

The visitor is our execution phase in the game. As stated above, this is the standard visitor of the visitor pattern; the key difference is that while many references to this pattern use the visitor to mutate actual objects (game entities in our case), we have opted for an additional layer of indirection, and have the visitors mutate the metadata instead. This gives us a formalized definition for each command type, and greatly increases the scalability of our game as we add more and more commands to the execution model.

API

func (v Visitor) Visit(a Agent) error

The Visitor mutates the game state and the underlying command via the Visit() function. This function generally

queries the command’s State(), and
decides what action to take based on the returned value.

For example, the Move visitor will do a no-op if the Move command returns any state other than EXECUTING. In the EXECUTING phase, the visitor will

calculate a partial path for the entity,
update the entity curve, and
schedule when the next partial move should be calculated (via MoveCommand.SchedulePartialMove(float64)).

Chaining Commands

It is in the Visitor that any dependent flows for the visitor-specific command may be generated – e.g. the newly-created Move commands that make up the Chase-chain are created here.

In the case the visitor needs to create new commands, the visitor will need a reference to the associated FSM List of the dependent command type. The visitor is responsible for scheduling the newly created command, and will schedule the command in the source of truth, not the cache.

See Figure 2; note that Visitor_B does not have a data dependency on FSM_A – setting this limitation greatly simplifies the separation of responsibilities between the visitor and the command metadata, and should allow for more scalability.

Also note that although we have said FSMs are read-only, there is a read-write dependency from FSM_B to FSM_A. This write operation is just a Command.To(CANCELED) call in case we need to halt the operation, and should not be any other mutation.

Example

For our Attack command defined above, our visitor would query for the state, and mutate the game state:

// Simplified API for brevity.
func (v *AttackVisitor) Visit(c *AttackCommand) error {
  // No-op unless the command reports it is ready to execute.
  if c.Status() != EXECUTING {
    return nil
  }
  c.Source().Attack()
  c.Target().Damage(c.Source().Strength())
  // Record the mutated entities so they are broadcast this tick.
  v.dirtyState.Add(c.Source(), c.Target())
  return nil
}

Note that neither the Attack command nor the visitor modifies the dependent Chase flow - this independent execution of commands is crucial for scalability.

Dirty State

In the case the Visitor updates the game state via a curve, or creates a new entity, it is responsible for updating the game’s dirty state list. This list keeps track of the broadcastable data per tick, and is reset at the beginning of every game tick. For more information, see the design doc.

Work Estimates

Work Item	Time Estimate	Status
implement FSM interface	1 week	DONE
implement Move FSM	1 week	DONE
implement Produce FSM	1 week	DONE
implement Chase FSM	1 week	DONE
implement Attack FSM	1 week	DONE
implement Move Visitor	1 week	DONE
implement Produce Visitor	1 week	DONE
implement Chase Visitor	1 week	DONE
implement Attack Visitor	1 week	DONE
implement Client Move API	1 day	DONE
implement Client Produce API	1 day	DONE
implement Client Attack API	1 day	DONE
demonstrate feasibility	N/A	DONE

Client Disconnect Handling in DownFlux

Mon, 16 Nov 2020 00:00:00 -0800

Client-Server Networking Model for a Large-Scale RTS

Status	draft
Author(s)	minke.zhang@gmail.com
Contributor(s)
Last Updated	2020-11-16

Goals

handle client reconnects
resolve game state cleanly
deal with connection spam
detect client / server network outages

Background

DownFlux is an ongoing open-source RTS game built from scratch (rather rashly). Because DownFlux is not built on top of any existing gaming engine, we need to design a way for client-server network connections to be resilient to network flakiness.

Overview

DownFlux uses a client-server networking model, with gRPC serving as the API layer. The client issues player commands (Move, Attack, etc.) via a blocking API, whereas the server passes game state changes through a persistent stream. During the normal course of a game, the client may experience transient network outages – this design doc focuses on one implementation of client disconnect / reconnect logic which can handle this in a graceful and scalable way.

StreamData API

The server communicates with the client via the StreamData RPC endpoint; each message sent along this API contains a list of entities and a separate list of curves, as well as the server time at which the message was generated. These data points communicate the game state delta between subsequent points in time; by merging all data messages, the client will have the complete game state.

These messages are sent once per server tick. To save on bandwidth, a message will be sent only if a delta exists – if both the entity and curve delta lists are empty, then the server will skip sending the message for that tick.

Game State Monotonicity

The monotonically increasing¹ game state S may be totally represented by the set E of game entities and C the set of curves representing game metrics evolving over time. We represent the merging of an existing, valid game state with an incoming StreamData message as

S’ := S ∪ ΔS == (E ∪ ΔE, C ∪ ΔC)

The set of entities here is an append-only mathematical set, i.e. there are no duplicate elements. Because entities are uniquely identified by a UID, we can send along just the newly generated entities per server tick.

A curve is uniquely specified by its

parent entity UID,
the entity property this curve represents (e.g. location, health, etc.),
and the last time the curve was updated by the server.

When we merge two curves, the data generated by the most recently-updated curve takes precedence – that is, if the older curve and newer curve have conflicting extrapolated data, we replace the older curve’s extrapolated data. If the newer curve does not have information on a specific time interval, we keep the older curve’s data. In this way, we can guarantee that the curve itself is idempotent under merge requests (of a specific destination curve), and the prediction of the curve over time becomes more accurate (since we’re merging only new predictions).

We can formalize these definitions as

S ≤ S’ ⇔ E ⊆ E’ ∧ C ≤ C’

We can compare the curves by comparing the server tick.
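The curve merge described above can be sketched as a tail replacement. This assumes a curve is a tick-sorted list of samples and that the newer curve is authoritative from its first sample onward; both the `Datum` type and that reading of "conflicting extrapolated data" are assumptions for illustration.

```go
package main

import "fmt"

// Datum is a hypothetical (tick, value) sample; a curve is a slice of
// samples sorted by ascending tick.
type Datum struct {
	Tick  float64
	Value float64
}

// Merge keeps the older samples strictly before the newer curve begins
// and takes everything from the newer curve after that point.
func Merge(older, newer []Datum) []Datum {
	if len(newer) == 0 {
		return append([]Datum(nil), older...)
	}
	merged := []Datum{}
	for _, d := range older {
		if d.Tick >= newer[0].Tick {
			break
		}
		merged = append(merged, d)
	}
	return append(merged, newer...)
}

func main() {
	older := []Datum{{0, 0}, {10, 10}, {20, 20}}
	newer := []Datum{{10, 10}, {15, 12}}

	once := Merge(older, newer)
	twice := Merge(once, newer) // re-applying the same delta changes nothing
	fmt.Println(once)
	fmt.Println(len(once) == len(twice))
}
```

Applying the same newer curve twice yields the same result, which is the idempotence property the re-sync logic relies on.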

Client Work

The game client will treat the incoming StreamData messages as game state deltas and merge them into the local state via the process described above; because the end user (player) of the client only cares about the current tick, any data older than the current server tick may be thrown out², and it’s okay if the old data is invalid.

Therefore, we can see a framework for leveraging the game state delta as a game re-sync tool.

Detailed Implementation

Disconnect Detection

We can implement client / server disconnect detection via the gRPC keepalive flags. These may be specified on the server at start up time, and on the client at connect time. gRPC supports heartbeat messages sent at specific intervals, and allows the underlying channel to auto-close when a heartbeat timeout occurs.

Because this is handled at the gRPC layer, we may abstract it away from the game executor.Executor instance, as long as we ensure that the gRPC server always receives the StreamDataResponse messages sent by the executor, e.g. by buffering them in a server-local slice object per client.

Server

Once the client channel is closed, the client-side StreamData gRPC endpoint will also terminate.

gRPC

The gRPC server on startup will set the flags specified in keepalive.md and the Golang module so that

the client may periodically send keepalive messages;
the server will send periodic keepalive messages; and
there is a definite, non-infinite timeout for these server-initiated keepalives, after which
1. the gRPC stream will be closed, and
2. the gRPC server will mark the underlying executor Client object as dirty, which then instructs the component to tear down the client channel and mark it as in need of a sync.

The gRPC server will implement a client-specific local message queue and listener Goroutine – these constructs will listen on the executor client channel and enqueue any messages sent along it, guaranteeing that the channel will never be blocked.
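A minimal sketch of such a queue and listener follows. `Message` stands in for a StreamDataResponse, and the `Queue`, `Listen`, and `Drain` names are hypothetical. The buffer is unbounded, so a send from the executor waits only for the listener goroutine and never for the network.

```go
package main

import (
	"fmt"
	"sync"
)

// Message stands in for a StreamDataResponse.
type Message struct{ Tick int }

// Queue is a hypothetical per-client buffer owned by the gRPC server.
type Queue struct {
	mu   sync.Mutex
	msgs []Message
}

// Listen drains ch into the queue until ch is closed. The returned
// channel is closed once the listener goroutine has exited.
func (q *Queue) Listen(ch <-chan Message) <-chan struct{} {
	done := make(chan struct{})
	go func() {
		defer close(done)
		for m := range ch {
			q.mu.Lock()
			q.msgs = append(q.msgs, m)
			q.mu.Unlock()
		}
	}()
	return done
}

// Drain returns and clears everything buffered so far; the stream
// handler would call this before writing to the client.
func (q *Queue) Drain() []Message {
	q.mu.Lock()
	defer q.mu.Unlock()
	out := q.msgs
	q.msgs = nil
	return out
}

func main() {
	ch := make(chan Message)
	q := &Queue{}
	done := q.Listen(ch)
	for i := 0; i < 3; i++ {
		ch <- Message{Tick: i} // executor side
	}
	close(ch)
	<-done
	fmt.Println(len(q.Drain()))
}
```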

Executor

The executor will provide a StopClientStreamError function, which will be used to tear down the client channel struct and mark the associated client as out of sync with the game state.

Client State Metadata

The executor will model each client connection as a state machine. It will keep an executor-specific client metadata object whose transitions follow the flow diagram in Figure 1. The metadata object will store a Golang channel object, used to send data to the gRPC server.

Figure 1: Executor client flow diagram.

We are defining the states NEW, DESYNCED, OK, and TEARDOWN as follows:

A client is in the NEW state when
- the client is first created, or
- when a network error is detected while streaming game state.
In this state, the channel does not exist, and no data will be broadcast to this client.
A client enters the DESYNCED state once a call to the gRPC StreamData endpoint is made – in this state, the channel is created, and the client is marked as needing the full game state update. The executor will provide the appropriate data upon the next tick to the client channel.
A client is in the steady OK state once the full state has been sent. Future messages sent along this channel are state deltas, as defined above.
A client enters the TEARDOWN state once the game shuts down – at this point, the client may not reconnect, and the channel is permanently closed.
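The state descriptions above can be read as an event-driven transition function, sketched below. The event names are hypothetical labels for the triggers in the text, and since the figure is not reproduced here, the exact edge set is an assumption.

```go
package main

import "fmt"

type ClientState string

const (
	New      ClientState = "NEW"
	Desynced ClientState = "DESYNCED"
	OK       ClientState = "OK"
	Teardown ClientState = "TEARDOWN"
)

// Event names are hypothetical labels for the triggers described above.
type Event string

const (
	StreamRequested Event = "STREAM_REQUESTED" // StreamData RPC was called
	FullStateSent   Event = "FULL_STATE_SENT"
	NetworkError    Event = "NETWORK_ERROR"
	GameShutdown    Event = "GAME_SHUTDOWN"
)

// Next returns the state following an event, or an error if the event
// does not apply in the current state.
func Next(s ClientState, e Event) (ClientState, error) {
	switch {
	case s == Teardown:
		return s, fmt.Errorf("client is torn down; %s ignored", e)
	case e == GameShutdown:
		return Teardown, nil
	case e == NetworkError:
		return New, nil
	case s == New && e == StreamRequested:
		return Desynced, nil
	case s == Desynced && e == FullStateSent:
		return OK, nil
	}
	return s, fmt.Errorf("no transition from %s on %s", s, e)
}

func main() {
	s := New
	// Connect, sync, lose the connection, then reconnect.
	for _, e := range []Event{StreamRequested, FullStateSent, NetworkError, StreamRequested} {
		next, err := Next(s, e)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("%s --%s--> %s\n", s, e, next)
		s = next
	}
}
```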

Resync

With our flow diagram, it becomes apparent that the client upon a server disconnect will only need to reissue a StreamData gRPC call with its stored internal client ID. The gRPC server will handle the reconnect by marking the client as DESYNCED, just as it would have done upon the initial stream request. The next message sent from the server will be the full game state.

Footnotes

This is not necessarily the right wording, but there doesn’t seem to be such a phrase which describes our game state assumptions. ↩
This is not true for the case of the replay client, but that should be connected to the server locally, where network flakiness is not an issue. ↩

DownFlux Client Design Doc

Fri, 23 Oct 2020 00:00:00 -0700

Client-Facing Engine Design

Status	Draft
Author(s)	minke.zhang@gmail.com
Contributors
Last Updated	2020-10-23

Objective

Outline the core mechanics necessary for rendering the server-calculated game state to the players.

Background

DownFlux is a collaborative RTS.

Overview

The client will consist of two parts – the actual rendering engine (e.g. drawing tanks and units on screen) and the client API component that does the lower level communications with the server.

Detailed Design

Client

The API client is tasked with the lower-level communications with the server, and forwards the transformed user inputs given by the Renderer. The client also runs a daemon thread to process incoming data provided by the StreamCurves endpoint.

class APIClient {
  public string ID;  // Client ID used to identify the player to the server.

  // Invokes AddClient and announces to server the current tick of the client
  // (in case of reconnection)
  public string Connect(string tickID);

  // Returns the current buffer of received data from server stream. Will clear
  // buffer after invocation.
  public StreamData Data;

  public void Move(
    string tickID,
    List<string> entityIDs,
    Position destination,
    MoveType moveType
  );

  public async Task StreamCurvesLoop(string tickID);
}

Renderer

The rendering engine will process actual user input (key presses, mouse clicks, etc.), transform them into useful game intents, and forward them via the API client to the server.

The engine will also be tasked with the core loop of processing game state and displaying that in some form to the user.

Core Loop

using EntityID = string;
using CurveID = string;

public Dictionary<EntityID, Entity> EntityLookup;
public Dictionary<CurveID, Curve> CurveLookup;

public Queue<PlayerAction> Actions;

Tick Rate

The server and rendering engine will run at differing tick rates – because we aim to minimize the network traffic for communicating game state, data will be sent to the client at a much lower rate than the rate at which the screen needs to be redrawn. For example, the canonical server tick rate is ~10Hz (from server design), but the renderer needs to draw at 30 - 60Hz, if not higher.

To account for this tick discrepancy, the renderer will interpolate the server curve data for the relevant partial server tick times.
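As an illustration of this interpolation, the sketch below evaluates a linear curve at fractional server ticks. It is written in Go for consistency with the server examples rather than in the client's language. The `Datum` type and the clamping behavior outside the sampled range are assumptions.

```go
package main

import "fmt"

// Datum is a hypothetical (tick, value) sample of a curve.
type Datum struct{ Tick, Value float64 }

// Interpolate returns the linearly interpolated value at tick t for a
// curve sorted by ascending tick. Values are clamped to the first and
// last samples outside the sampled range.
func Interpolate(curve []Datum, t float64) float64 {
	if len(curve) == 0 {
		return 0
	}
	if t <= curve[0].Tick {
		return curve[0].Value
	}
	for i := 1; i < len(curve); i++ {
		a, b := curve[i-1], curve[i]
		if t <= b.Tick {
			return a.Value + (b.Value-a.Value)*(t-a.Tick)/(b.Tick-a.Tick)
		}
	}
	return curve[len(curve)-1].Value
}

func main() {
	// Position samples from two consecutive server ticks (10Hz).
	curve := []Datum{{Tick: 100, Value: 0}, {Tick: 101, Value: 6}}
	// At 60Hz the renderer draws six frames per server tick.
	for frame := 0; frame <= 6; frame++ {
		t := 100 + float64(frame)/6
		fmt.Printf("tick %.3f -> %.2f\n", t, Interpolate(curve, t))
	}
}
```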

The renderer may also need different curve rates for different phases in the core loop – for example, if the server only broadcasts game state at 10Hz, the renderer doesn’t need to poll for every frame (60Hz).

Server Reconciliation

public static int TickRate = 10;

The renderer will need to query the API client regularly to update its internal curve state and for new entity announcements. The API client holds the actual server stream.

If any new data is provided by the API client (whether new entity announcements or curve updates), the renderer will update EntityLookup and CurveLookup, either with Curve.ReplaceTail or by creating a new entity row.

N.B.: EntityLookup and CurveLookup are add-only sets. If an entity should no longer be rendered, it will be marked as tombstoned instead.

Rendering

The renderer will iterate through the set of curves and draw appropriate data for the values interpolated at the current client tick time.

The renderer will also need to iterate through the Actions queue and render any client-only changes.

Process Player Input

TODO(minkezhang): Is this async?

At this phase, the renderer will take note of the current player actions (e.g. physical clicks, mouse drags, etc.), transform them into a usable struct, and append them to the PlayerAction queue.

Process Player Actions

Depending on the action specified (e.g. Move or ScrollViewport), the renderer may need to communicate with the server – this phase leaves room for calling out to action-specific handlers.

The renderer will call the server (via the API client) with appropriate commands at this time.

DownFlux Networking Design

Fri, 09 Oct 2020 00:00:00 -0700

Client-Server Model for a Large-Scale RTS

Status	final
Author(s)	minke.zhang@gmail.com
Contributor(s)
Last Updated	2020-10-09

Objective

Design a communications model between a small number of clients concurrently mutating a complex RTS game state.

Background

Relevant to this document, DownFlux will be an RTS game for a small number (~10) of players within normal RTS parameters. We are exploring different networking models for the game, and have proposed the following design as a potential implementation of client-state interaction.

Overview

Assumptions

Metric	Estimated Bound
Players	10
Controllable Entities	1k
Map Size	1k x 1k tiles
Server Tick Rate	10Hz (see justification)
Network Latency	100ms
Network Bandwidth	1Mbps / player (see justification)

A list of historically relevant papers and articles can be found in the DownFlux docs repo.

Infrastructure

The networking model will be a client-server model, as opposed to the more commonly implemented lockstep framework used in most RTS games. We’ve decided that this should dramatically simplify the design, and dodge around tricky issues like creating a consistency model for a fully connected P2P system from scratch. One of the main benefits of the lockstep model is saving on computation and bandwidth costs (on the centralized server)¹, but modern consumer processing power and bandwidth should be enough to handle the workload.

The client consists of a renderer² and an API component, with the API component sending player commands to the server, e.g. “build barracks here”, “attack enemy infantry”, “use special ability”, etc.

The server consists of two message queues (input and output), and a core loop which handles the burdensome task of simulating the game state. The input queue regularly issues a list of player-issued messages to the core loop, sorted by the server clock time. The core loop then runs through an ordered series of internal subprocesses which take these messages and the existing game state as input. A list of output messages will then be piped to the outgoing queue, which will fire them back to the clients immediately, along with information on when the messages should be rendered. The client will merge these messages into the existing game state to be picked up by the rendering component.

Detailed Design

Types

type ClientID string
type EntityID string
type CurveID string

type TickID string
type Tick float64

type Entity interface {
    ID() EntityID
    Curve(t (HP|PRIMARY_COOLDOWN|...)) CurveID
}

type Curve interface {
    ID() CurveID
    Type() (LINEAR|STEP|PULSE|...)
    Parent() EntityID
    Tick(float64) Tick
    Value(Tick) float64
}

Client

CURRENT_TICK_ID TickID = ""

API Component

Breaking down the commands that will be issued to the server, we can broadly speculate these would include

Build(c Coordinate, t (BARRACKS|REFINERY|...))
Ping(c Coordinate, t (ATTACK|GUARD|...))
Move(entity_id string, c Coordinate, t (NORMAL|REVERSE|ATTACK_MOVE|GUARD|...))
Attack(entity_id, target_entity_id string)
UseAbility(entity_id, target_entity_id string, c Coordinate, t (PRIMARY|SECONDARY|ULTIMATE))

Each API call will attach the client’s CURRENT_TICK_ID to the message sent to the server.

Server

The server simulates the entire game state and facilitates player-player interaction (e.g. combat). At the heart of the server is a linear game loop consisting of phases. Phases are run serially, but logic within each phase should exploit concurrency where possible.

SERVER_TICK_RATE int = 10
CURRENT_TICK Tick = 0  // At 60Hz, we will need to run the game for
                       // 400+ days before encountering overflow
CURRENT_TICK_ID TickID = ""
TICK_ID_LOOKUP map[Tick]TickID = nil  // length TICK_ID_WINDOW_SIZE
TICK_ID_WINDOW_SIZE int = 2  // Number of ticks in the past the
                             // server will accept as valid (initial)
                             // input

New Tick Phase

This phase is a trivial subroutine which

increments CURRENT_TICK,
generates a new CURRENT_TICK_ID, and
updates TICK_ID_LOOKUP with the new data, and drops the oldest row.
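The three steps above can be sketched as a single method on a sliding window. The `TickWindow` struct is a hypothetical holder for the globals listed earlier, and the new ID is passed in by the caller because this document does not specify how tick IDs are generated.

```go
package main

import "fmt"

type Tick float64
type TickID string

// TickWindow is a hypothetical holder for the server globals.
type TickWindow struct {
	Current   Tick            // CURRENT_TICK
	CurrentID TickID          // CURRENT_TICK_ID
	Lookup    map[Tick]TickID // TICK_ID_LOOKUP
	Size      int             // TICK_ID_WINDOW_SIZE
}

// Advance runs the new tick phase: increment the tick, record the new
// ID, and drop the row that just fell out of the window.
func (w *TickWindow) Advance(newID TickID) {
	w.Current++
	w.CurrentID = newID
	w.Lookup[w.Current] = newID
	delete(w.Lookup, w.Current-Tick(w.Size))
}

func main() {
	w := &TickWindow{Lookup: map[Tick]TickID{}, Size: 2}
	for _, id := range []TickID{"a", "b", "c"} {
		w.Advance(id)
	}
	// Only the two most recent ticks remain in the lookup.
	fmt.Println(w.Current, w.CurrentID, len(w.Lookup))
}
```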

Input Queue Phase

CLIENT_RECENT_TICK map[ClientID]Tick = nil
MESSAGE_QUEUE []PlayerCommand = nil

The input phase keeps a buffer of incoming player messages. The buffer is sorted by the received timestamp of the server.

For each incoming message, the input phase will do some basic precondition tests:

if message has an embedded TickID which does not show up in the values of TICK_ID_LOOKUP, discard message and relay error to sender;
then, if message has an embedded TickID corresponding to a Tick before the Tick found in CLIENT_RECENT_TICK, discard message and relay error to sender;
then update the client’s CLIENT_RECENT_TICK entry with the corresponding Tick and enqueue the message
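The precondition tests above might look like the following sketch. The `InputQueue` struct, its inverted `window` map (TickID to Tick, for lookup by ID), and the error values are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
)

type (
	ClientID string
	TickID   string
	Tick     float64
)

// PlayerCommand carries only the fields needed for validation here.
type PlayerCommand struct {
	Client ClientID
	TickID TickID
}

// Hypothetical error values relayed back to the sender.
var (
	ErrUnknownTick = errors.New("tick ID outside accepted window")
	ErrStaleTick   = errors.New("tick ID older than client's most recent message")
)

type InputQueue struct {
	window map[TickID]Tick   // inverse of TICK_ID_LOOKUP
	recent map[ClientID]Tick // CLIENT_RECENT_TICK
	queue  []PlayerCommand   // MESSAGE_QUEUE
}

// Enqueue applies the two precondition tests in order, then records the
// client's most recent tick and buffers the message.
func (q *InputQueue) Enqueue(m PlayerCommand) error {
	tick, ok := q.window[m.TickID]
	if !ok {
		return ErrUnknownTick
	}
	if last, seen := q.recent[m.Client]; seen && tick < last {
		return ErrStaleTick
	}
	q.recent[m.Client] = tick
	q.queue = append(q.queue, m)
	return nil
}

func main() {
	q := &InputQueue{
		window: map[TickID]Tick{"a": 1, "b": 2},
		recent: map[ClientID]Tick{},
	}
	fmt.Println(q.Enqueue(PlayerCommand{Client: "p1", TickID: "b"})) // accepted
	fmt.Println(q.Enqueue(PlayerCommand{Client: "p1", TickID: "a"})) // stale
	fmt.Println(q.Enqueue(PlayerCommand{Client: "p1", TickID: "z"})) // unknown
}
```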

At the beginning of each tick, the queue will

discard duplicate messages,
log the queue alongside CURRENT_TICK, and
send the queue off to the core loop for processing.

Output Queue Phase

MESSAGE_QUEUE []Curves = nil

The output queue keeps a buffer of outgoing messages to each player. An outgoing message may either be an entity mutation (e.g. creating or destroying a building or unit), or a curve³ mutation (e.g. altering a path, starting an attack, etc.).

For each player, we will need to

filter the outgoing messages by the player POV, i.e. don’t broadcast stealthed units or units under the fog of war, and
filter any curve message by the player POV, i.e. don’t show a position curve (movement) that goes 50 ticks in the future if the position of the unit will exit the player POV; we can optionally skip this step if the domain of the curve extends into the future only a little bit (e.g. less than 1s of rendering time)

This filter step can be a no-op for now while we implement everything else, but if we do not enable filtering, the player can exploit the additional information in the form of map hacks.

At the end of the tick, the output phase will

log unfiltered queue with CURRENT_TICK,
send players their respective messages, along with the new CURRENT_TICK_ID

Core Loop Phase

TRIGGER_QUEUE map[Tick]CurveID = nil
ENTITY_LOOKUP map[EntityID]Entity = nil  // Entity.Curves() is a list
                                         // of CurveIDs
CURVE_LOOKUP map[CurveID]Curve = nil  // Curve.Parent() is a single
                                      // EntityID

The core loop is tasked with updating the actual game state, including editing the map terrain and mutating the curves. This is a heavy subroutine.

Delete Entities

For the current server tick, we check the TRIGGER_QUEUE⁴ for any curves that have significant effects, e.g. setting health to 0. For these curves, we need to

delete the parent entity (e.g. structure or unit)
delete the row from the queue

Create Entities

New structures and units may be instructed to be built, either by a player command (new structure) or when a production facility finishes production (unit ready). For the former, we will read from the message queue, whereas the latter will need to be checked against TRIGGER_QUEUE.

New entities will need to be added to the ENTITY_LOOKUP master list.

For buildings which have a set construction time, we will

generate a new curve for the new entity representing when the structure may be used (e.g. start producing units),
add the curve to the output queue

Update Curves

Each subphase of curve mutation may be done concurrently.

Collision Detection

We need to check if any entities (units, buildings, projectiles, crates, etc.) overlap hitboxes – if they do, we need to resolve any actions that may occur (taking damage, grabbing upgrade, redo pathing, etc.). For entities which require deletion (e.g. projectiles and crates), add to TRIGGER_QUEUE – do not delete in this step.

Collision detection may be implemented via a quadtree, but can be a no-op for now. This phase may also be run asynchronously by a separate background process.

Pathing

For this phase, we will read from the message queue and update / create curves with new paths (and add to the output queue). This may

be done in parallel, and
is abstracted away from the actual implementation of pathfinding and may be either generated by HPF⁵ or flow fields⁶.

If flow fields are used for pathfinding, the steering forces are all simulated on the server.

Attack Resolution

Attacks from the input queue will need to be processed and aggregated. For each attack command from the queue, we will need to find the target entity and connect the curve with the target entity. This will be in the form of an HP curve for the target entity, which itself is an aggregated curve formed by the summation of all attack (and heal) curves targeting the entity.

If the HP curve has changed, add / update the TRIGGER_QUEUE row for when the HP curve reaches 0.

Any updated curves will need to be added to the output queue.
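To make the trigger calculation concrete, the sketch below assumes every attack or heal contributes a constant HP change per tick, so the aggregated HP curve is linear. That linearity, and the `ZeroTick` helper itself, are assumptions for illustration; the real curves may have other shapes.

```go
package main

import "fmt"

type Tick float64

// ZeroTick sums the per-tick HP rates (negative for damage, positive
// for healing) and returns the tick at which HP reaches 0. The second
// return value is false if HP never reaches 0 at the current rates.
func ZeroTick(now Tick, hp float64, rates []float64) (Tick, bool) {
	if hp <= 0 {
		return now, true
	}
	net := 0.0
	for _, r := range rates {
		net += r
	}
	if net >= 0 {
		return 0, false
	}
	return now + Tick(hp/-net), true
}

func main() {
	// Two attackers (-10 and -5 HP per tick) and one healer (+5 HP per tick).
	at, ok := ZeroTick(100, 50, []float64{-10, -5, 5})
	fmt.Println(at, ok)
}
```

Whenever an attack or heal is added or removed, the tick would be recomputed and the TRIGGER_QUEUE row replaced.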

Ability Resolution

Abilities like shields, speed boosts, etc. will have associated curves and must be updated and added to the output queue.

Caveats

Tick Rate Tuning

The server tick rate will have a significant impact on the end-to-end latency in this model – if we assume (per Assumptions) a 100ms one-way client-server travel time, and that the server itself will take up to another 100ms to process a tick (at 10Hz), our total end-to-end latency is 100ms + 100ms + 100ms = 300ms. While it may not matter much ultimately in the game results⁷, we will still need to employ around 20 frames (at 60Hz) of client-side prediction to smooth out user input. How fast can we make the server tick rate to cut down on the minimal latency?

Client API Component

We could alternatively consolidate Move, Attack, and UseAbility into more general API endpoints:

EntityTargetAbility(entity_id, target_entity_id string, t (MOVE|REVERSE|ATTACK_MOVE|ATTACK|PRIMARY|SECONDARY|...))
EntityTargetAoE(entity_id string, c Coordinate, t (GUARD|PRIMARY|SECONDARY|...))

We decided this is needlessly generic and will create more problems server side in decoding the intent of the API than it solves in a unified API.

Scalability

Multi-Server Processing

Look into Redis or an in-memory SQL implementation for shared state, as our Event and Curve data are rather tabular (as are the message queues). This is useful if / when we scale up to multiple server nodes.

Redundancy and Reliability

TBD

Security

TBD

Privacy

The server will be aware of

identifiable user data (IP, username, etc.)
user game input (game commands) and the time the commands were received

The server will not keep track of the user IP, or other non-game related identifiable data. The user ID, username, and game input will be tracked for replay purposes.

Work Estimates

Work Item	Time Estimate	Status
barebones client and server	1 week	DONE
implement tick phase	1 day	DONE
implement input queue	1 week	DONE
implement output queue	1 week	DONE
implement create entities	1 week	DEMO
implement pathfind	1 week	MVP (no flow field)

Footnotes

1. Terrano, Mark. “1500 Archers on a 28.8: Network Programming in Age of Empires and Beyond.” 2001.
2. Discussing the specifics of the rendering engine is out of scope of this design document.
3. Stolen from The Tech of Planetary Annihilation: ChronoCam. Curves are linear transformations of a variable trajectory. This transformation saves on data being sent to the client.
4. We need to decide if we want a generic trigger queue, or a queue broken down by category, with the CurveID still mapping back to a global lookup map.
5. Botea, Adi. “Near optimal hierarchical path-finding.” 2004.
6. Emerson, Elijah. “Crowd Pathfinding and Steering Using Flow Field Tiles.” 2020.
7. Claypool, Mark. “The effect of latency on user performance in Real-Time Strategy games.” 2005.

Networking Brainstorm

Mon, 05 Oct 2020 00:00:00 -0700

Articles

1500 Archers on a 28.8: Network Programming in Age of Empires and Beyond 2001

Seminal paper on how Age of Empires II dealt with the networking problem by implementing client lockstep simulations. Lockstep implementation here requires N² stable (but slow) connections.

The effect of latency on user performance in Real-Time Strategy games 2005

Important finding that overall network latency didn’t actually impact the outcome of RTS games much.

Rokkatan: scaling an RTS game design to the massively multiplayer realm 2006

Scales up clients connecting to a game by implementing multiple proxy servers. Servers are totally connected but each proxy is only allowed to make a specific set of (mutually exclusive) mutations. Features dynamic rebalancing in case a server crashes.

Covers the evolution of networking models in gaming in some detail, including how modern rewind-and-replay lag compensation works.

Networking in real-time strategy games 2011
RTS Game Protocol 2011

Short, accessible explanation of one possible lockstep implementation.

Synchronous RTS Engines and a Tale of Desyncs 2011

Accessible explanation of how Supreme Commander (2007) implemented and optimized lockstep. Specifically deals with global tick synchronization implementation – may also be applicable in client / server model.

Synchronous RTS Engines 2: Sync Harder 2011

Goes into detail about how Supreme Commander implemented the communications and sync protocol. Very useful reference.

Question on Synchronous networking 2011

Includes some links to sample implementations of lockstep.

Opinion: Synchronous RTS Engines And A Tale of Desyncs 2011

More information on lockstep implementation and explanation.

Very good article on a possible approach to a client-server model of RTS network communications – by sending linear transformations of data trajectories (instead of frame-by-frame updates) to save on bandwidth.

Convenient back-of-the-envelope bandwidth calculations for RTS games.

Don’t use Lockstep in RTS games 2016
Friday Facts #76 - MP inside out 2015

Factorio originally used lockstep to deal with broadcasting game state. This is an interesting case as it similarly has to deal with large maps and changing terrain with large amounts of data being transferred across multiple clients.

Friday Facts #147 - Multiplayer rewrite 2016

Describes in detail the networking model migration for Factorio from lockstep to server-elect model.

Building a Multiplayer RTS in Unreal Engine 2018

Summarizes lockstep and client-server implementations and provides sample command messages.

StarCraft II’s internal tick rate is 16 - 20Hz; apparently other RTS games can run as low as 8Hz. Good reference / justification for our own tick rate.

Code References

Queries

RTS game networking query

DownFlux

Local Collision Avoidance

ORCA

Works Cited

Notes

Commanding RTS Commands

Abstract

Jargon

Flow Examples

An ad hoc Approach

(An Accidental) Tour de Entities

Finite State Metadata

Read-Only FSMs

Two-Pass Scheduler

Conclusions

Chaining Commands

Canceling Commands

See Also

Addendum

Recontextualizing as Event Flows

A Digression on Attack Variants

Partial Tick Execution

Notes

Arbitrary Command Execution

Background

Overview

Detailed Design

Game State

FSM (Command Metadata)

Virtual State Transitions

API

func (c Command) ID() ID

func (c Command) Accept(v Visitor) error

func (c Command) State() (State, error)

func (c Command) To(State) error

func (c Command) Precedence(d Command) bool

Chaining FSMs

Example

FSM List

API

func (l List) Clear() error

func (l List) Merge(m List) error

func (l List) Append(c Command) error

func (l List) Accept(v Visitor) error

Merge

Visitors

API

func (v Visitor) Visit (a Agent) error

Chaining Commands

Example

Dirty State

Work Estimates

See Also

Client Disconnect Handling in DownFlux

Goals

Background

Overview

StreamData API

Game State Monotonicity

Client Work

Detailed Implementation

Disconnect Detection

Server

gRPC

Executor

Client State Metadata

Resync

Footnotes

DownFlux Client Design Doc

Objective

Background

Overview

Detailed Design

Client

Renderer

Core Loop

Tick Rate

Server Reconciliation

Rendering

Process Player Input