Wednesday 23 April 2014

A BS Analysis Based on Legit Data

tl;dr We took a look at the LegitBS Defcon 21 scoring database to see if we could verify their results and to understand whether the stated scoring rules appear to have been followed in generating scores for the participating teams. Based on our analysis it remains clear that PPP was the convincing winner of the Defcon 21 CTF. Beyond that things get pretty foggy pretty quickly.

Here begins the too long part

"Open" data is only meaningful if people take the time to analyze that data. The current organizers of Defcon Capture the Flag (CTF), LegitBS [1],  have been very forthcoming with their process for running CTF as well as the data they gathered while running the live game. To date they have dropped [2] several items of interest to the CTF community.

As aspiring Defcon CTF participants, we thought that analyzing some of this data might help us get acquainted with just how the scoring of the Defcon CTF works. This article attempts to explain what and how we learned from digging into the Defcon 21 scoring database. Taking LegitBS up on their tag: "I want to download and audit the Scorebot from DEF CON 21 CTF", we visited the announcement page for their database drop [3] and grabbed the database dump via bit torrent [4]. LegitBS gives instructions for restoring the data into postgres and a link to an ER diagram [5] in case you find that useful.

Before we get to deep into this, let's state that our efforts are all amateur stuff, but we have made every effort to be accurate and attempt to state any assumptions that we had to make in order to make progress with our analysis. We invite LegitBS, or anyone else, to offer clarification where detail is missing and corrections where errors have been made. Rather than littering this post with SQL query results we will link to result pages that show both the exact query that was performed and the exact results returned using the LegitBS data.

The rough flow of this article will be as follows:
1. Review of the scoring rules using available LegitBS documents
2. Description of the database
3. Interesting? data extracts from the database
4. Replaying the game attempting to validate the published results
5. Conclusions

1. Review of the scoring rules using available LegitBS documents

We refer people to the LegitBS postings [6][7] for a high level view of how scoring worked and hit only the high points here.

  1. Teams began the game with 2500 flags each from a pool of 50000 flags. In theory this was a zero-sum game with various actions in the game resulting in redistribution of flags from one team to another. Since flags were neither created nor destroyed, one would assume that the sum of all scores at the end of the game would be 50000. However this was not the case as a logic error (to their credit, openly admitted by LegitBS) left 7628 flags in the possession of LegitBS. As far we can tell, these flags became completely inaccessible to the teams and were thus effectively destroyed, invalidating the zero-sum assertion.
  2. According to the published rules it seems that flags were intended to remain associated with specific services. However, it appears that flags were never initially paired to specific services (see discussion of flags table in the database section below) making it difficult to prove that this in fact was the case throughout the contest.
  3. Tokens differ from flags. One token, per team, per service, per round was generated and exposed to teams that could successfully exploit a service. The final number of tokens that may enter play is dependent on the number of active services and the number of scoring rounds that the organizers choose to conduct.
  4. A stolen token could be redeemed (within some expiration period?) for a share of the victims flags for that round. Each exploited team/service instance was worth up to 19 flags (assuming the victim had 19 flags left) which were evenly distributed among all teams that presented the token associated with that team/service instance. Frequently, even distribution meant that some remainder flags went undistributed and these flags appear to have been claimed by LegitBS. For example if teams A,B,C, and D all presented the token associated with team E's service 1, then the 19 flags associated with E/1 would be divided 4 per team to A, B, C, and D with LegitBS presumably keeping the remaining 3 flags.
  5. During each scoring round the organizers also tested each team/service instance to verify the correct operation of the service. A failed availability check resulted in the loss of 19 flags (presumably from that service's pool of flags, but again we can't tell) which were redistributed among the teams whose same service was deemed to be up in the same scoring period. For example if team A's service 1 is deemed to be down, they lose 19 flags which are distributed among those teams that have service 1 up and running correctly. If only 10 teams had service 1 up and running, then each would get 19/10 or 1 flag while LegitBS kept the remaining 9. Since a total of 10 teams would have been scored down in this example, each of the 10 "up" teams would have received a total of 10 free flags just for being good citizens of the game and keeping their services running. It is not clear from the published rules exactly how the flag accounting was performed for failed availability checks. The database provides no transaction records for flags that were redistributed as a result of failed availability checks. Using a slightly different accounting method we might say that 10 teams failed the check resulting in a pool of available "failure" flags of 10 * 19 = 190. These flags would then be redistributed as 190 / 10 = 19 flags per "up" team, with no remainder flags going to LegitBS in this case. This is a drastic difference from the first method of accounting and we have no idea which method was applied by LegitBS.
2. Description of the database

The "scorebot" database is made up of 12 tables [Fig 1]. The "teams" and "services" tables allow us to map integer IDs back to readable names.

The "flags" table contains exactly 50000 entries and tells us which team currently owns each flag, but provides no information about what service the flags is associated with or how the team came to possess the flag. The final CTF score [Fig 2] appears to be derived directly from this table. We can clearly see LegitBS holding on to a significant number of flags which we assume are the remainder flags.

The "rounds" table contains start and end times for each scoring round/period. We could use the end time as the effective time of all scoring actions within the period in order to develop a scoring time line of the entire game.

The "instances" table provides a list of team/service pairs representing active services within the game. We can use this table to understand the order and time at which individual services were activated in the game [Fig 3]. The "tokens" and "availabilities" tables (see below) each reference this table.

The "availabilities" table contains the results of availability checks (aka service polls) for each scoring period in the game.  Each entry references an instance (team/service) to which the poll result applies and contains the "status" of the poll along with a "memo" field that contains some text associated with the poll, in some cases hinting at the reason a poll failed. ASSUMPTION: Lacking any other guidance, we assume for the purposes of this analysis that any availability status other than zero represents a failed availability check.

The "tokens" table provides a record of all the tokens that were generated in the game, when they were generated (both a time and a round) and what instance (team/service) the token was associated with.

The "redemptions" table provides a record of which tokens (stolen items)were turned in by which teams (attackers). This is the record of successful attacks that took place throughout the game. To understand exactly where a token was stolen from we need to do a join on both the tokens table, to obtain an instance id and a round number (the round_id field in the redemptions table is unused), and the instances table to obtain a service_id and team_id. From the redemptions table we can derive attack time lines, learn which services and which teams were attacked most frequently, and see which teams were able to steal the most tokens broken down by service, by victim or both.

Finally, the "captures" tables allows us to observe the redistribution of flags as a result of the redemption process. A single capture record is associated with a single redemption record. Conversely, a single redemption record may be associated with many capture records. This reflects the fact that a single redemption may result in the awarding of many flags when those flags are being distributed among a small number of teams. In order to provide detail about the root cause of the capture, we need to join captures with redemptions to learn the attacker and the token that was redeemed, further join with tokens to learn the team/service instance to which the token is tied, and conduct a final join with instances to learn the specific team and service from which the token was obtained. Convenience joins with services and teams allow us to use readable team and service names in our result tables.

It is important to understand that the captures table contains the only record of flags transfers within the game and that this record is further limited to flags transfers that resulted from token redemptions. There is no data in the database that allows us to audit the transfer of flags that resulted from failed availability checks.

The "schema_migrations", "timers", and "messages" tables are of no interest to our analysis.

3. Interesting? data extracts from the database

In attempting to understand and audit the scorebot database, we will focus particular attention on the availabilities and redemptions tables. These two tables contain the results of observed events that can be attributed to specific teams. In theory, with a perfect understanding of the rules of the game and the algorithms used by LegitBS to redistribute flags we should be able to reproduce the final score by processing transactions from these two tables. The captures table in particular should be able to be derived from data contained in the redemptions table. In auditing the database, one of the things we will look for is whether our derived version of flags distribution matches the version recorded in the captures table.

In the process of auditing the database, many queries were conducted to help us explore the game in various ways. Having developed a number of queries we will also share the results of those queries as we strive to explain what types of things took place during the game. For example one things we can say is that 18880 tokens were redeemed during the game and 6089 of these redemptions resulted in no flags changing hands [Fig 4]. ASSUMPTION: a redemption with no associated capture record indicates that the team whose token was redeemed had no flags remaining to redistribute. There is an interesting edge case where this assumption may not be valid and it relates to the fact that LegitBS either knowingly allowed or failed to anticipate that teams might submit their own tokens for redemption.


According to the published scoring algorithm, this practice should not have increased a team's score, it merely should have stopped their score from decreasing as quickly as it might otherwise have. Example 1: Team A is the only team to submit its own token. 19 flags are taken from A because they have been owned, then 19 flags are awarded back to A as the sole "owner". This is a net change of zero or no benefit to their score. Example 2: Teams A and B submit a token from A. 19 flags are taken from A because they have been owned, then 9 flags (19 / 2) are awarded back to A and 9 flags are awarded to B as the shared "owners". A's score decreases but not as much as if B had been the lone owner. B has been deprived of 10 flags. Example 3: In the extreme case that every team steals and submits A's token and A additionally submits their own token, the 19 flags from A are evenly distributed among the 20 teams that turned in A's token. A loses 19 flags and each of the 20 teams that turned in A's token are awarded 19 / 20 = 0 new flags. In this case, while A may have 19 flags, no associated capture records would be present in the captures tables since no team earned a whole flag. We can find 3 cases [Fig 5] during the game where 19 teams turned in the exact same token, but no instances in which the same token was turned in by all 20 teams.


Here are some conclusions that we can draw from the data contained in the availabilities table. First, availability checks were performed in 260 of the 263 game rounds [Fig 6]. In each round that availability was checked, all teams were checked [Fig 7]. From rounds 1-99, LegitBS was also checked for availability and they failed 16 of these checks [Fig 8]. We have no idea whether there is any significance to this or not. We have no idea whether LegitBS redistributed any of their flags as a result of failing their own polls or not. It does not make any sense that they would, but details are lacking. Why were they polling themselves in the first place? Why did they stop polling themselves?

We congratulate More Smoked Leet Chicken on failing the fewest availability checks [Fig 9] and encourage Robot Mafia to re-examine their game strategy. The flags_lost column in Figure 9 is a theoretical maximum number of flags lost based solely on the number of failed polls. During the game, teams are also receiving flags as a result of other team's availability failures and as a result of redeeming tokens. A team may have lost fewer than flags_lost flags if their score was already zero when they failed an availability check. You can't lose what you don't possess. More detailed breakdowns of checks failed by team, by service can be seen in Figure 10 [Fig 10]. A summary of failed checks by service may be found in Figure 11 [Fig 11]. Again, the number of flags_lost in these tables is a theoretical maximum that assumes a team still had 19 flags left to take away at the time a check failed. Shortly we will attempt to run through the game one round at a time while simulating the published redistribution algorithm to compute round by round scores taking into account both availability checks as well as token redemptions. But first...


The redemptions table is the record of all tokens turned in during the game. Figure 12 [Fig 12] is the complete history of steals broken down by attacker, victim, and service and ordered by submission (created_at?) time. More Smoked Leet Chicken again tops the list with the first token submitted, and in round one no less! Of course they were also the first team to be owned having turned in their own token, which they did again in round 2 and then stopped, perhaps because their score did not change and they perceived little benefit? Maybe they will comment on their rationale? We can break the redemptions down in many different ways to come up with lots of interesting facts. Because submitting your own tokens is not very interesting from a skills perspective we will show how prolific the practice was and then we will start to filter "self steals"  out of the summary data until we attempt to recreate the final scores.

From the data [Fig 13] is it clear that there is one team that spent more time touching itself than every other team combined. To understand the impact of this strategy we present Figure 14 [Fig 14] which is read as follows: "$team submitted their own token $token for service $svc which was submitted by a total of $submitters allowing $team to keep $kept flags it would otherwise have lost and depriving each of the other submitters of $deprived flags each" Men in Black Hats clearly benefited the most from this strategy, keeping more than 200 flags out of other teams hands and holding onto more than 200 flags they would otherwise have lost.

Now let's filter out the self submissions to look at some statistics related to pwning services. Figure 15 [Fig 15] is a revised redemption time line with self steals filtered out. We see that Men in Black Hats were the first to steal another team's flag 6 hours and 17 minutes into the game (based on a round 1 start time of 17:10). Figure 16 [Fig 16] shows total steals by team. We can see that PPP had 800 more steals than the next most prolific team raon_ASRT. This helps to explain the large margin by which PPP won the game. It is important to keep in mind that one steal does not necessarily equate to one flag captured. One steal earns a share of the 19 (at best) flags taken from the victim by the organizers and distributed evenly amongst each team that stole the same token. A steal may be worth as little as one flag if every single team was able to obtain and redeem the same token. Further a steal might be worth nothing if the victim team had no flags remaining. We will shortly turn our attention to the captures table which will help us compute things like average flags per steal.

Before we move on here is a breakdown of the number of steals by service [Fig 17], where atmail is easily the most abused service. The number of steals by team by service [Fig 18] is presented sorted in two different ways, by team name and by service name. Takeaways here include the fact that only three teams managed to own every service (PPP, More Smoked Leet Chicken, and raon_ASRT) and that PPP held a decisive advantage over every other team on the lonetuna service. The number of steals by attacker and victim [Fig 19] is the "who picked on who" table. APT8 (Always Providing Tokens 8) topped most team's list of most stolen from. Four teams (PPP, Men in Black Hats, More Smoked Leet Chicken, and raon_ASRT) managed to steal tokens from every one of their opponents. Each of these tables filters out self steals.


As mentioned previously, there is not necessarily a one to one mapping between tokens and flags. The "captures" table contains the record of how LegitBS distributed flags whenever a token was successfully redeemed. The captures table tells us nothing about how flags were redistributed following failed availability checks, and the captures tables tells us nothing about how many "remainder" flags LegitBS may have kept for themselves following each successful redemption. This data is virtually impossible to accurately recover given the data that LegitBS has made available to date. Later on we attempt to recover it as best we can by applying our understanding of the rules to simulate a replay of the game.

As background, here is an example of the remainder problem as it manifests itself in the captures table. Every capture record is linked to a specific redemption record. No redemption records exist for LegitBS as they were not redeeming tokens for themselves, therefore no capture records exist to document what happened to remainder tokens. Looking at a hypothetical case in which 10 teams redeem the same token, if the victim currently holds at least 19 flags, then each redeemer will receive 19 / 10 = 1 flag and, as we understand it, 9 remainder flags would go to LegitBS. Now if the victim held between 10 and 18 flags, the redeemers would still receive 1 flag each and LegitBS would receive anywhere from zero to eight remainder flags, the exact number is undocumented and must be derived in this case. Finally, if the victim currently held fewer than 10 flags, the redeemers would each receive nothing and LegitBS would receive between 0 and 9 flags, effectively confiscating the victim's last remaining flags in a transaction that also would not be documented. As we saw in [Fig 4] there were 6089 redemptions, or nearly 1/3 of all redemptions for which the redeemer was awarded zero flags, so it appears that this last case was fairly common.

Let's pulls some data from the captures table. Without filtering self steals Figure 20 [Fig 20] provides a summary of the number of flags awarded each team as a result of redemptions. Keep in mind that each flag that is being awarded also represents a flag that is being taken away from another team. Figure 21 [Fig 21] is the same table with self steals removed (since these are given right back to the redeemer, they do not change the redeemers score). Flipping things around, Figure 22 [Fig 22] details flags losses as a results of redemptions (including self steals), and Figure 23 [Fig 23] shows the same data with self steals removed.

Another data pull that may be interesting to some shows the efficiency with which each team was able to convert tokens into flags. Figure 24 [Fig 24] like Figure 21, shows how many flags each team was awarded (with self steals filtered out) along with the average number of flags that were awarded per token that the team was able to steal. It is fairly obvious that PPP was both efficient, earning on average 7.8 flags per stolen token, and prolific.

Something that is fairly essential for us to understand before we can carry out any simulation of the game is whether flags remain tied to a specific service for the duration of the game. The rules seem to imply that this is the case. Quoting directly from the rules document:

* Each stolen flag will be placed in a bin for the same service it was stolen
  ** Those flags can be lost again through that service when it is exploited
     by others.
* Given enough time, it is possible to lose all flags for a service.
  ** There will be nothing left to steal from you until you steal flags from
     another team.

We conclude that if we had 100 flags in each of 6 service bins and then lost all of our flags from one of the bins while remaining steady in the other 5, that a successful token steal from the service with the empty bin would result in no loss of flags to us (even though we hold 500 flags) and no award of flags to the attacker. This is a critical point in attempting to model the game. To add confusion, the rules are silent on the point of whether flags that are confiscated as a result of availability failures are similarly confined to service specific bins, so on this point we will need to guess. We welcome LegitBS clarification on any/all of these points.

What can the database tell us about binning? The only two tables that reference flags are the flags table itself and the captures table. At game start, the captures table is empty. The table schema for flags is shown here:

scorebot_production=# \d flags
           Table "public.flags"
   Column   |            Type             | ...
------------+-----------------------------+ ...
 id         | integer                     | ...
 team_id    | integer                     |
 created_at | timestamp without time zone |
 updated_at | timestamp without time zone |

It is quite clear that at game start, flags are associated only with teams and not with services. Unless there is a secret formula for associating flag IDs with specific services, it is difficult to believe that any enforcement of service related bins was possible. If binning was not enforced, one implication is that it was entirely possible to lose every flag in your possession through a single exploited service. Another impact is that repeated failing of availability checks on only one service might drain every flag in your posession.

Since the captures table provides our only insight into the migration of flags between teams, perhaps we can draw conclusions about binning by further examining it. First some observations. Since the flags table reflects the final owner of each flag, we have no way to identify the original owner of each flag. Figure 25 [Fig 25] shows that for 12341 flags, the created time and the update time are identical. If it is true that the update_at field was in fact updated with each change of ownership, then we may conclude that these 12341 flags at least never changed hands. 12341 also corresponds to the number of flags whose update_at time precedes the beginning of the game. We also see that no id associated with any of these flags ever appears in the captures table while 25282 different flags do appear in the captures table. This accounts in some respect for 37623 flags. ASSUMPTION: the remaining 12377 flags changed hands exclusively as a result of failed availability checks for which no transaction records are available.

With 44680 entries in the captures table referencing only 25282 flags, it is clear that many flags are referenced more that one time. Perhaps the capture records associated with one or more of these flags will support our assertion that flags were not confined to service specific bins. Figure 26 [Fig 26] shows just the beginning of a list of flags ordered by reference count taken from the captures table. Several flags change hand more than 100 times. Any one of these may suffice. Figure 27 [Fig 27] details the available capture records for flag 17153, a flag that changed hands via captures 144 times, sometimes several times in a single round. The first eight entries in this table show that in round 253 alone, the flag migrated through all six services with the last capture (the result of a self steal) being taken out of the avoir "bin" and being allocated right back to men in Black Hats. In round 254, flag 17153 reappears, now associated with reeses and owned by Shellphish (we can only assume that it was transferred to shellphish as part of an availability penalty) and gets reallocated to raon_ASRT who promptly lose the flag to PPP in the very next round. Figure 28 [Fig 28] shows the availability results for Men in Black Hats from rounds 253 and 254. The only availability check that they failed was a check for service avoir. This would have resulted in the loss of some flags, one of which must have been 17153 which was apparently reallocated to Shellphish (who passed their avoir checks) though apparently not to their avoir bin given that Shellphish lost the flag through their reeses service. We may also be able to conclude that availability checks are reconciled before redemptions as 17153 has migrated to Shellphish in round 254 as the result of an availability failure which must take place before it subsequently passes to raon_ASRT as a result of a round 254 capture. ASSUMPTIONS: Flags are not restricted to any particular bin and availability check flags are distributed prior to capture flags.

4. Replaying the game attempting to validate the published results

At this point we may have enough familiarity with the data to contemplate auditing the final scores so lets think about what a final score formula might look like:

 +       capture_flags_awarded
 +  availability_flags_awarded
 -      capture_flags_deducted
 - availability_flags_deducted
 =                 final score

As shown above, one source for the capture_flags numbers is from the captures table. Alternatively, below, we will attempt to independently derive these same numbers, by stepping through the game data one round at a time.

The availability related numbers are harder to come by without running a simulation of the game. If we can assume that a team's score never approached zero and therefore the team always had sufficient flags available to absorb an availability failure penalty, then availability_flags_deducted is simply 19 * #of availability failures. However seven teams finished with zero flags and others may have hovered around zero at times. Any team that possessed fewer that 19 flags at the time of an availability failure lost fewer than 19, perhaps even no flags so the formula can't be used in all case and we must rely on a simulation of the game to compute the correct deductions. We are also unable to query the database to determine accurate numbers for the availability_flags_awarded value. This number must be computed on a round by round basis and summed across the entire game. For a single sound, the partial scores for availability of each service must be summed to even compute a per round value. A score for a specific service in a specific round requires that you first pass your own availability check, and then compute the number of other teams that also passed the check. This gives us the number of teams that will share the flags confiscated from the teams that failed the availability check which we know is not as simple as 19*N where N is the number of failures since any one or more of the failing teams may have fewer than 19 flags left to confiscate.  4. Replaying the game attempting to validate the published results In order to verify the published results of the game we need to attempt to replay the game using event data recorded in the LegitBS database. The principal tables we draw upon will be availabilities and redemptions. We will join to other tables as necessary to build a per round availability picture and a per round redemption picture. We completely ignore the flags and captures tables since we are attempting to verify the data contained in those tables by re-running the game. The teams and services tables are really just conveniences that allow us to work with names rather than numbers.

We will employ the following algorithm to step through the game one round at a time:

   for each round
      #process availability data
      for each service
         compute the group of "up" teams
         compute the group of "down" teams
         deduct MIN(19, current score) flags from each down team
         add deducted flags to fail_pool
         if up.size != 0
            add fail_pool.size / up.size flags to each up team
         add unallocated fail_pool flags to LegitBS pool
      #process redemption data
      for each redeemed token
         compute the group of redeemers
         determine the victim
         loss = MIN(19, current victim score)
         deduct loss from victim
         add loss / redeemers.size flags to each redeemer
         add loss % redeemers.size flags to LegitBS pool
      print the current scores if we want per round numbers
   print the final scores

Code here. We make no attempt to keep flags sorted into bins as we do not believe this was done during the game. We allow teams to submit their own tokens and receive flags back in return as was done in the game.

The query we use to aggregate availability data is:

   select i.team_id, a.status
   from availabilities a, instances i
   where a.round_id = ? and i.service_id = ?
         and a.instance_id = and i.team_id != 21;

This gives us availability data on a per round/per service basis and excludes LegitBS because we have no idea what to do with poll results for them.

the query we use to aggregate redemption data is:

   select r.team_id as attacker, i.team_id as victim
   from redemptions r, tokens t, instances i
   where t.round_id = ? and i.service_id = ?
         and r.token_id = and t.instance_id =

and then we dump some data which may be seen in Figure 29 [Fig 29]. The primary result is shown here:

LegitBS remainder flags: 10782

Estimated final scores
 PPP                           | 14451
 more smoked leet chicken      | 7939
 raon_ASRT                     | 6850
 men in black hats             | 6098
 routards                      | 1779
 Alternatives                  | 1164
 sutegoma2                     | 428
 shellphish                    | 323
 9447                          | 186
 ... all other teams ...       | 0

We show more keys being held in the end by LegitBS and leave it to LegitBS to clarify whether they ever returned any keys into circulation and if so what algorithm they used for doing so. The numbers we generated for capture related gains and losses do not match Figures 20 and 22 exactly but are at least similar in magnitude and ordering. This leaves the availability data as the most likely cause of the disparity in the final standings. In particular, More Smoked Leet Chicken appears to have both the most flags gained from other team's failed availability checks and the least flags lost from their own failed availability checks. This is the likely cause of their higher finish as computed by our methodology.

We hope this look at the LegitBS data generates discussion and invite feedback as to where we may have made mistakes or erroneous assumptions. We especially invite LegitBS to add any details that would eliminate some of the guess work we have needed in performing our analysis.

5. Conclusions

We applaud LegitBS' effort to open up the process they use to run Defcon CTF and hope they continue to be as open in the future. We caution that just because data is open does not mean that it is possible to accurately audit that data. In this particular case we feel that the LegitBS database lacks sufficient detail to reconstruct every action in the game. Additionally the scoring rules lack clarity and in any case do not seem to have been enforced as stated in the rules document. Nowhere is the handling of remainder flags explained. We wonder if allowing for the distribution of fractional flags wouldn't have greatly simplified things. Finally we hope that LegitBS takes the opportunity clarify and/or modify both their rules and their scoring infrastructure to eliminate any disparities between the two and allow for more accurate, repeatable auditing of this year's results.



No comments:

Post a Comment