Mini Patch 830-2g - GameObject Refactor & Startup Optimisations

Razmataz · November 19, 2021

Preamble

Hi guys!

This is a complicated announcement, so it has warranted its own thread.

Also, this is marked as 8.3.0-2g because there's still an 8.3.0-2f update in the pipeline that ideally is done before this one.

TLDR

Up until now, all phases have shared the same container for object data.
Following a maintenance period required for this change-set, we will divide the container up for each phase.
This allows us to load phases separately from every other process in the server, allowing server startup to be a separate process from phase loading.
Upon login after a crash, most recently active phases will be loaded first, and phases up to 30 days old will be added to the phase loading queue.
Inactive phases can be manually re-activated after the queue is completed by an officer+.
The server is currently scheduled to have extended maintenance on the 29th of November to facilitate this update. The window is expected to be 2 to 3 hours.
Rollback steps are prepared and will be able to be performed on a much shorter time span.

History

Veteran members of Epsilon will remember a completely different state of server performance and optimisation. A time when server optimisations were unnecessary and performance was acceptable.

Since then, the active population has grown substantially. We've gone through two major phases of player increase, and by incurring those additional players, bottlenecks have been made apparent and we've taken steps to address them. For the most part (at least, outside of the backup period) there is a low world latency so things feel quite responsive.

However, the steps we've taken to address the long startup period have seen some success, but with diminishing returns. We now face a startup time of an alarming 10 minutes. The breakdown of a typical server crash & restart is:

08:39 - Server session crashes
08:40 - Core file dumped & moved to high volume disk
08:40 - Startup of new server session
08:41 - Server session begins loading objects
08:47 - 6 and a half minutes later, objects are fully loaded.
08:48 - Server is now connectable

That's a 10 minute process, of which three fifths is dedicated to objects. We already know how to deal with the core dump (and this should take a whole minute off the process), but we need to address the object load time.

This is not helped by how rapidly the object count on the server is growing. With blueprints and gobject group copies, and then with doodad imports, more players logging in and building on the server is a catalyst for these numbers skyrocketing.

In the past we had this issue with startup times, but were able to partially remedy it by switching the object loading process to a multi-threaded one. The same growth is also apparent in the increasing time that backups take, and in how long degraded performance is noticeable on the server.

So it's time to change something fundamental with how the server works. If you're interested in the technicals, I've gone into how it used to work, how it works now and how it will work:

Old Workflow

Every single object is stored inside a single container called the "GameObject Data Store", which contains an integer (the object's GUID) as a key, and the object data as the value for that key. You can think of it like a dictionary with the key "word" and the value "definition".

When the server starts up, the server performs a query on the phase table to initialise all phases. Then, a query on the gameobject database asking it for every single object. The query eventually returns the millions of rows, and the server goes through each row and converts the numeric based data into the object data. Once completed, it adds this object to the GameObject Data Store, and proceeds onto the next row.

Once all of the rows have been parsed, the server proceeds to the next step of initialisation.
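
As a rough illustration of the flow above, here is a minimal Python sketch. It is not the server's code: the row layout and the names `GameObjectData`, `fetch_all_rows`, `parse_row` and `load_all_objects` are all hypothetical stand-ins, and the sample rows are made up.

```python
from dataclasses import dataclass


@dataclass
class GameObjectData:
    """Hypothetical, trimmed-down object record."""
    guid: int
    phase_id: int
    entry: int


def fetch_all_rows():
    """Stand-in for the single query that returns every gameobject row."""
    # (guid, phase_id, entry): made-up sample rows
    return [(1, 100, 5001), (2, 100, 5002), (3, 200, 5001)]


def parse_row(row):
    """Convert one numeric database row into object data."""
    guid, phase_id, entry = row
    return GameObjectData(guid=guid, phase_id=phase_id, entry=entry)


def load_all_objects():
    """Old workflow: one query, one loop, one store keyed by GUID."""
    store = {}  # the single GameObject Data Store
    for row in fetch_all_rows():
        obj = parse_row(row)
        store[obj.guid] = obj  # add the object, then move on to the next row
    return store


store = load_all_objects()
```

Everything happens on one thread, so the time taken grows directly with the number of rows returned.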

In Layman's Terms

The server runs one big query on the gameobjects database, and spawns every object from that. Nothing can happen until this finishes.

Current Workflow

Every single object is stored inside a single container, which contains an integer (the object's GUID) as a key, and the object data as the value for that key. You can think of it like a dictionary with the key "word" and the value "definition".

When the server starts up, the server performs a query on the phase table to initialise all phases. Then, it generates a list of all of the phase ids. It then creates eight separate threads (which can execute code simultaneously) to alternate between the two following steps:

Query the gameobject database asking it for every single object in a given phase id;
Go through each row, generate the object data, and add it to the GameObject Data Store.

Each thread will check the list of all the phase ids, take one (and remove it from the list) and then handle it.

Once the list of phases is exhausted, the server proceeds to the next step of initialisation.

A key limitation of this is that the GameObject Data Store is one container; it is a very, very, very bad idea to attempt to have multiple threads adding data to it at the same time. (A few people might remember this incident where the server started up in a fraction of the time, but because of this limitation, very few objects were added to the GameObject Data Store and thus not loaded into the world).

This limitation costs a significant amount of the speed we could otherwise get when loading gameobjects. Even so, the change reduced startup times by about 20% to 30%. Ideally, it would have cut them in half.
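
The current workflow and its bottleneck can be sketched roughly as follows in Python. This is an illustration only: `FAKE_DB`, `query_phase`, `worker` and `load_all_phases` are hypothetical names, and using a lock to serialise writes to the shared store is an assumption about how "cannot be added simultaneously" might be enforced, not a description of the real code.

```python
import queue
import threading

# Made-up sample data: phase id -> rows of (guid, entry)
FAKE_DB = {
    100: [(1, 5001), (2, 5002)],
    200: [(3, 5001)],
    300: [(4, 5003), (5, 5004), (6, 5005)],
}

store = {}                     # the single, shared GameObject Data Store
store_lock = threading.Lock()  # assumption: writes are serialised with a lock


def query_phase(phase_id):
    """Stand-in for the per-phase database query."""
    return FAKE_DB[phase_id]


def worker(phase_ids):
    while True:
        try:
            phase_id = phase_ids.get_nowait()  # take one id, removing it from the list
        except queue.Empty:
            return  # list exhausted, this thread is done
        rows = query_phase(phase_id)  # queries from different threads can overlap
        for guid, entry in rows:
            with store_lock:  # only one thread may add to the store at a time
                store[guid] = (phase_id, entry)


def load_all_phases(thread_count=8):
    phase_ids = queue.Queue()
    for phase_id in FAKE_DB:
        phase_ids.put(phase_id)
    threads = [
        threading.Thread(target=worker, args=(phase_ids,))
        for _ in range(thread_count)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()  # startup waits here until every phase has been loaded


load_all_phases()
```

The queries run in parallel, but every write funnels through the same lock, which is why the gain falls short of the ideal.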

In Layman's Terms

The server has multiple scripts running at the same time, making many small queries on the gameobjects database and spawning every object from that. Objects can be queried, but cannot be added simultaneously. Nothing can happen until this finishes.

New Workflow

Every single object is stored inside a container of containers. The top level container contains an integer (the phase ID) as a key, and the GameObject Data Store for that phase as the value for that key. You can think of it like a library, with a dictionary for each language.

When the server starts up, the server performs a query on the phase table to initialise all phases, plus room for extras. It reviews the last time the phase was entered by an officer or above and determines if it is an active or inactive phase. Based on its activity, it is added into a queue.
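
A minimal Python sketch of how the queue could be built from last-entry times. The function name `build_load_queue`, the input mapping and the sample dates are hypothetical; the 30 day cutoff and the "most recent first" ordering come from this post.

```python
from datetime import datetime, timedelta

INACTIVE_AFTER = timedelta(days=30)


def build_load_queue(last_officer_entry, now):
    """Split phases into a load queue (most recent first) and an inactive set.

    last_officer_entry maps phase id -> datetime of the last officer+ entry.
    """
    active = [
        (entered, phase_id)
        for phase_id, entered in last_officer_entry.items()
        if now - entered <= INACTIVE_AFTER
    ]
    active.sort(reverse=True)  # most recently entered phases come first
    load_queue = [phase_id for _, phase_id in active]
    inactive = set(last_officer_entry) - set(load_queue)
    return load_queue, inactive


now = datetime(2021, 11, 29, 12, 0)
phases = {
    100: now - timedelta(hours=1),   # entered an hour ago
    200: now - timedelta(days=45),   # not entered for over 30 days
    300: now - timedelta(days=3),    # entered three days ago
}
load_queue, inactive = build_load_queue(phases, now)
```

With this sample data, phase 100 is loaded before phase 300, and phase 200 is left out of the queue as inactive.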

The server spawns up to eight separate threads to handle the global phases (main phase, NPE, etc.) and waits for them to complete. Because each phase has its own GameObject Data Store, all threads can query and load objects into memory simultaneously.

Following this, the server spawns eight separate threads to handle phases as determined by the queue. These threads are detached from the main thread and do not need to be waited upon. Therefore, as soon as these eight threads are created, the server continues with initialisation.

Each thread will take the next phase from the queue, drop it from the queue, load in the gameobjects, load in the gameobject groups (as this process attaches data to each gameobject) and change the state of the phase from queued to active.
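
The two loading stages can be sketched in Python like this. It is an illustration under assumptions: all names and sample data are hypothetical, phase 0 is just a pretend global phase, daemon threads stand in for "detached" threads, and the sketch uses one thread per global phase rather than modelling the cap of eight.

```python
import queue
import threading

# Made-up sample data: phase id -> rows of (guid, entry)
FAKE_DB = {
    0: [(1, 9001)],               # pretend 0 is a global phase
    100: [(1, 5001), (2, 5002)],
    300: [(1, 5003)],
}
GLOBAL_PHASES = [0]

stores = {}       # phase id -> that phase's own GameObject Data Store
phase_state = {}  # phase id -> "queued" or "active"


def init_phases(phase_ids):
    """Create every per-phase store up front, on the main thread."""
    for phase_id in phase_ids:
        stores[phase_id] = {}
        phase_state[phase_id] = "queued"


def load_phase(phase_id):
    for guid, entry in FAKE_DB[phase_id]:
        stores[phase_id][guid] = entry  # only this thread writes to this store
    # Gameobject group loading would go here.
    phase_state[phase_id] = "active"


def queue_worker(load_queue):
    while True:
        try:
            phase_id = load_queue.get_nowait()  # take the next phase, dropping it
        except queue.Empty:
            return
        load_phase(phase_id)


def start_server(queued_phase_ids, thread_count=8):
    init_phases(GLOBAL_PHASES + queued_phase_ids)

    # Stage 1: global phases are loaded in parallel and waited upon.
    global_threads = [
        threading.Thread(target=load_phase, args=(p,)) for p in GLOBAL_PHASES
    ]
    for t in global_threads:
        t.start()
    for t in global_threads:
        t.join()

    # Stage 2: queued phases load in the background.
    load_queue = queue.Queue()
    for p in queued_phase_ids:
        load_queue.put(p)
    background = [
        threading.Thread(target=queue_worker, args=(load_queue,), daemon=True)
        for _ in range(thread_count)
    ]
    for t in background:
        t.start()
    return background  # startup continues without joining these


background = start_server([100, 300])
for t in background:
    t.join()  # demo only, so the result can be inspected afterwards
```

Note that GUID 1 appears in three different phases here without clashing, because each phase writes into its own store.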

For phases that are inactive (not entered by officer+ for >30 days), gameobjects are not loaded. There are no queries done, and the phase is initially inaccessible. If I attempt to enter the phase as a player, I am told it is inactive. If I attempt to enter it as an officer+, it will spawn a separate, detached thread and load in both gameobjects and groups. After that's done, the phase is made active & accessible.
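
The inactive-phase behaviour might look something like the Python sketch below. Everything here is hypothetical: the function names, the returned messages, the sample data, and in particular the intermediate "loading" state, which is an assumption added so that a second officer does not start a second load.

```python
import threading

# Made-up sample data for one inactive phase
FAKE_DB = {200: [(1, 5001), (2, 5002)]}
stores = {200: {}}
phase_state = {200: "inactive"}


def activate_phase(phase_id):
    """Load an inactive phase's objects, then mark it active."""
    for guid, entry in FAKE_DB[phase_id]:
        stores[phase_id][guid] = entry
    # Gameobject group loading would go here.
    phase_state[phase_id] = "active"


def try_enter_phase(phase_id, is_officer):
    """Return (message, loader thread or None)."""
    state = phase_state[phase_id]
    if state == "active":
        return "entered", None
    if state == "loading":
        return "phase is still loading", None
    if not is_officer:
        return "phase is inactive", None
    # An officer+ entering an inactive phase kicks off a background load.
    phase_state[phase_id] = "loading"
    loader = threading.Thread(target=activate_phase, args=(phase_id,), daemon=True)
    loader.start()
    return "phase is loading", loader


player_result, _ = try_enter_phase(200, is_officer=False)
officer_result, loader = try_enter_phase(200, is_officer=True)
loader.join()  # demo only, so the next call sees the finished load
after_result, _ = try_enter_phase(200, is_officer=False)
```

The sketch does not model the rule that re-activation has to wait until the initial loading queue is complete.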

This whole process of phases being loaded in optionally is thanks to the tiny change in the memory structure. Because we now have a container of containers, we can multi-thread write to each of the phase key GameObject Data Stores, rather than risk multiple writes to the global GameObject Data Store and risk a crash.

In Layman's Terms

The server has multiple scripts running at the same time, making many small queries on the gameobjects database and spawning every object from that. Objects can be queried and added simultaneously. The server can continue, and complete, server startup while this is ongoing, but an exception is made for global phases (main phase, NPE, etc.) and these have to be loaded before the server can proceed.

Impact of New Workflow

1. Because the server doesn't need to wait for your phase objects to load, the startup time is reduced by more than half and it will never be affected by the gameobject count again! When you log into the server, you spawn in the main phase. When you attempt to rejoin the phase you were just in, either you're too fast and it hasn't loaded yet (but will imminently) or it will already be accessible because it was high up in the queue due to a very recent last activity time.
2. The Global GameObject Data Store controls the GUID of an object. This is a value of up to 2.1 trillion (so a number you won't ever hit). We're currently at a GUID of > 200 million and this is a number that each phase taps into. When we move into a Phase based GameObject Data Store, each phase will have its own GUIDs. You'd be hard pressed to get into 7 digits.
3. Phases will be loaded over the next ~# minutes, in order of descending recent activity. This might result in your attempts to enter a phase being rejected initially, but this should rarely be the case. Additionally, if you want to load into a phase that hasn't been active for a while, this has to wait until the initial loading is done.

A bit more on 2: Because of the magnitude of data involved, this requires extended maintenance in order to reallocate GUIDs for every object in the game. This is estimated at a ~2 to 3 hour process, as detailed in the Deployment section. We are accounting for GUID changes in groups, in teleporters, in creature scripts, etc. Unfortunately we can't edit your macros for you, so you'll need to amend them if you specify GUIDs in there.

In Layman's Terms

The server boots up fast. Real fast. The required phases are loaded in under 1 second, so the server is no longer bottlenecked for 6 minutes loading in everyone's objects. Because of the change in the gameobject container structure, gameobject GUIDs will be made unique on a per phase basis. Your first spawned object GUID of 200,000,000 will become GUID 1. This may cause your macros to no longer be aligned to the object GUIDs to which they refer.

Deployment

The server is currently scheduled to have extended maintenance on the 29th of November to facilitate this update. The window is expected to be 2 to 3 hours.

This is to permit me to do the following steps without interference:

Run the reallocation of GUID script on all gameobjects currently in the database
Import the gameobject table anew with the updated GUIDs.
Run the database query updates on all of the tables that depend on the gameobject GUIDs to also specify the phase, since GUIDs are now only unique on a per phase basis
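
To give a feel for what the steps above involve, here is a small Python sketch of per-phase GUID reallocation. It is not the actual migration script: the function names, the row shapes and the `tele_a` teleporter row are hypothetical, and numbering each phase's objects in ascending order of their old GUID is an assumption.

```python
from collections import defaultdict


def reallocate_guids(rows):
    """rows: iterable of (old_guid, phase_id).

    Returns a mapping of old_guid -> (phase_id, new_guid), where new GUIDs
    start at 1 within each phase.
    """
    next_guid = defaultdict(lambda: 1)  # per-phase counter starting at 1
    mapping = {}
    for old_guid, phase_id in sorted(rows):  # assumption: keep old GUID order
        mapping[old_guid] = (phase_id, next_guid[phase_id])
        next_guid[phase_id] += 1
    return mapping


def update_dependents(dependent_rows, mapping):
    """Rewrite (name, old_guid) rows as (name, phase_id, new_guid)."""
    return [(name,) + mapping[old_guid] for name, old_guid in dependent_rows]


# Made-up sample rows: (old_guid, phase_id)
gameobjects = [(200000000, 100), (200000001, 300), (200000002, 100)]
mapping = reallocate_guids(gameobjects)
teleporters = update_dependents([("tele_a", 200000002)], mapping)
```

After the change a GUID on its own is ambiguous, so anything that used to store just a GUID has to carry the phase alongside it, as the `teleporters` row does here.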

Contingency Plans

Every plan works until it meets the opposition. There is a chance that, like other releases, something is fundamentally broken with the way things are done and the changes have to be reverted.

The previous iterations of the gameobject tables and all of the updated ones will be copied and retained for up to a week. If the performance is considered absolutely garbage, then we will revert the codebase changes and switch to the old versions of the tables. It is, fortunately, possible for us to recover any builds that are done after the migration but this will need to be manually done.

Ideally, the reason for us to roll back these changes will either be apparent immediately or not at all, so there shouldn't be much of an opportunity for any new gameobjects to be added in a large enough capacity to warrant migrating them across the rollback.

In Closing

Hopefully this migration will go successfully! Hopefully it will have been worth it, and hopefully I won't be spending the following 8 hours pulling my hair out over new issues.

I've done the whole migration on our dev box with a clone of the production database and it changed startup times from 12m down to 4m. Our dev server is a tad slower than production to begin with for some reason, so this should reflect as a 10m down to 3m process on Apertus.

Anyway, time for me to get cracking with the backup optimisations.
