Reliability Score: A Deep Dive with Data and Analytics Senior Consultant Scott Mendelssohn

June 30, 2024

Improving the DUPR algorithm is an ongoing process that spans honing the statistical methods that underpin the research, fine-tuning the look and feel of the product, and increasing the number and value of the applications and integrations within the community. Some iterations, like the Win Intensity model we released earlier this year or the instantaneous algo release of last year, touch on one or maybe two of those categories. It’s a particularly fun project when all three pillars of the DUPR ecosystem are involved; DUPR’s Reliability Score has the distinct honor of spanning all three.

For those of you who missed the release, or are perhaps still curious about what this new circle is that appeared beside your rating, there’s a great FAQ the team put together that walks you through the basics: what the score is at a high level, how to interact with it and understand it, and how to progress along your own path to ratings reliability. For many folks, that’ll cover everything you need to know, but for those of you who read my last blog post, put out when the new dynamic algorithm was released in January, and like to go behind the scenes: this one’s for you.

Model Inspiration

In the theme of assured and steady incremental progress, we took a look at each component of the rating algorithm and decided on a research path forward. One such component was the halflife that we had inherited from previous versions of DUPR. The name “halflife” was borrowed from a concept in physics (amongst other fields) describing the rate of processes like radioactive decay in unstable atoms. In physics, the term is a parameter denoting the typical time it takes for half of something to decay, and that concept was applied to the decay of the original reliability metric over time. Eventually, the metric itself simply took on the name of its parameter in common parlance.

As it stood, the halflife measured the frequency and recency of your play on DUPR: matches played today counted for 1 “halflife” each, and matches played 3 months ago counted for approximately ½ “halflife” each; the halflife shown on your profile was the sum of these values. Higher halflife values were then read as more stable player ratings, because more information had gone into those ratings. But there wasn’t a clean line from the halflife number we provided to how to understand it in practice, and users were left trying to establish good rules of thumb as filters for things like clubs, leagues, or tournaments.
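As a rough sketch of how the legacy halflife behaved, consider the following. The 90-day parameter and the exact decay curve here are assumptions inferred from the description above (“3 months ago ≈ ½”), not DUPR’s actual implementation:

```python
# Hypothetical sketch of the legacy halflife metric. Each match contributes
# 0.5 ** (days_ago / HALFLIFE_DAYS): 1.0 for a match played today, 0.5 for a
# match played ~90 days ago. The profile number is the sum over all matches.
HALFLIFE_DAYS = 90  # assumed from "3 months ago counted for approximately 1/2"

def halflife(match_ages_in_days):
    return sum(0.5 ** (age / HALFLIFE_DAYS) for age in match_ages_in_days)

# Ten matches played today plus ten played 90 days ago:
print(round(halflife([0] * 10 + [90] * 10), 1))  # 15.0
```

Note how opaque the output is: a halflife of 15.0 tells you nothing on its own about whether the rating behind it can be trusted.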

The components of halflife weren’t wrong, by any means. Frequent play pretty clearly represents this concept of ratings reliability–the more you play, the more data we have and the more we can average out the inherent noise within pickleball data. Recent play is also clearly important–if you played 10 matches 10 years ago, can we really expect your current form to be well captured by those results? But the halflife was a simple approximation of what we were really trying to capture, namely, “is this number well informed?”, and the elements motivating progress and bringing users back to the platform were indirect at best.

There are so many other features besides frequency and recency that go into how information spreads through the pickleball network. And even something as simple as play frequency has its nuances. For instance, playing with the same partner against the same two opponents five million times, and never once playing with anybody else, does not make for a particularly informed rating. The algo would absolutely learn the relative rating between the two teams, but learning the level of play of each of the members of the team relative to one another would be impossible with the granularity of the data we have access to–we wouldn’t be able to tell by simply looking at the scores which of the teammates contributed more to the winning percentage. And even if we did have a good understanding of the players relative to one another, we would have no way to numerically compare them to the rest of the population because they’ve only ever played amongst themselves.
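A toy example makes this identifiability problem concrete. Suppose, purely for illustration (this is not DUPR’s actual model), that a doubles result is predicted from the difference of the two teams’ rating sums. Then any two lineups with the same team sum are indistinguishable from scores alone:

```python
import math

# Illustrative model: win probability from the difference of team rating sums.
def win_prob(team_a, team_b):
    return 1 / (1 + math.exp(sum(team_b) - sum(team_a)))

opponents = (4.0, 4.0)

# Shift half a point of rating from one teammate to the other: the team sum
# is unchanged, so every prediction against every opponent is identical.
# Match scores alone can never separate these two lineups.
assert win_prob((3.5, 4.5), opponents) == win_prob((4.25, 3.75), opponents)
```

No amount of repeated play inside such a closed group breaks that symmetry; only matches that cross into the wider network can.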

While never quite as drastic as this, the network of pickleball matches has plenty of pockets of hyper-localized information in this general theme, most notably amongst clusters of geography, age, and gender, where the community currently builds up these insular networks. A universal rating can and should account for this, both in how the rating algo learns and in how we express the informational content, or reliability, of its result.

Our research project was clear. How can we extend the concept of the halflife to:

  • More accurately and precisely measure the intended statistical concept
  • Be built in such a way that we can later integrate it cleanly into the rating itself
  • Provide a clear format to the community for engaging with it and using it to improve the DUPR experience

More Accurate and Precise

Starting from the specific goal of measuring the flow and collection of information in the DUPR network, we re-envisioned the rating as a large graph of connections rather than as a series of isolated interactions. The context of a match matters significantly, as does the context of each of its participants; e.g., competing in tournaments with many other reliable players allows your rating to soak up a lot of the information present in each of your matches, and playing across multiple clusters of subpopulations gives your rating the opportunity to meld into the universal rating level that DUPR tracks across the globe. So we took the network of every match ever entered on DUPR and enriched it with measurements of the information flow we’d expect from each match based on all of the metadata available to us. Was this with a new partner? Rec play or a tournament? Did you play with some DUPR newbies or with some seasoned veterans? Was the match against an evenly-matched opponent, or did this look like a blowout from the get-go? All of these things and more compile into just how statistically informative this match was in the journey toward a reliable rating.
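In code, one could imagine each match contributing an information weight assembled from its metadata. Everything below–the feature names, the multipliers, and the functional form–is a hypothetical sketch, not DUPR’s actual weighting:

```python
# Hypothetical per-match information weight built from match metadata.
def match_information(new_partner, tournament, opponent_reliability, expected_closeness):
    """Return a relative information weight for one match.

    opponent_reliability and expected_closeness are in [0, 1]: playing
    reliable opponents in a close, competitive match is assumed to carry
    the most information.
    """
    weight = 1.0
    if new_partner:
        weight *= 1.3                        # a new pairing reveals more about each player
    if tournament:
        weight *= 1.5                        # tournament play assumed more informative than rec play
    weight *= 0.5 + opponent_reliability     # reliable opponents anchor your rating
    weight *= 0.5 + expected_closeness       # an expected blowout teaches the algorithm little
    return weight

# A first-time tournament match against reliable, evenly-matched opponents
# carries far more weight than a lopsided rec game against newcomers.
print(match_information(True, True, 1.0, 1.0) > match_information(False, False, 0.0, 0.0))  # True
```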

The result is a way for the algorithm to see the web of relationships we model in a complex, multifaceted way, far surpassing the quality of the simplistic halflife measure that came before. Every match you play now, regardless of partner, opponent, match type, etc., has its informational value measured and added to your Reliability Score, which accumulates as you play more and begins to fade over time if you aren’t adding many matches on the app.

Ratings Integration

While not yet integrated into your rating itself (like any sustainable research project, we wanted to roll this out first and make sure it was ironed out completely before it affects your rating), the goal is to have the Reliability Score go hand in hand with the way the rating moves over time. More reliable ratings should move less; less reliable ratings should move more. Playing in a match that provides a lot of information to your rating (entering a tournament for the first time, or joining a league) should move your rating relatively more than one that is less informative (perhaps some pickup games with a close group of friends you play with all the time). When you take a break from league play to take lessons and come back and dominate next season, the algorithm should respond appropriately. When you venture into new territory and play a match in which you learn something about yourself, DUPR should too. The intuition around your own experience should be well mirrored by the algorithm. Stay tuned for more here as the research continues, but DUPR’s Reliability Score was designed intentionally to make a significant impact on the intuition and feel of the DUPR rating algorithm itself.
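Directionally, the integration described above resembles a gain-scheduled update, where the step size shrinks as reliability grows and grows with the match’s informational value. This is a sketch of the design intent only, not the shipped algorithm; the function, constants, and scales are all made up:

```python
# Sketch of a reliability-aware rating update (assumed, not DUPR's algorithm).
def update_rating(rating, reliability, observed, expected,
                  match_information, base_step=0.5):
    """Move the rating toward the observed result.

    reliability is in (0, 1]; observed/expected are result measures in [0, 1].
    Informative matches and unreliable ratings produce bigger steps.
    """
    gain = base_step * match_information * (1 - reliability)
    return rating + gain * (observed - expected)

# The same surprising win moves an unreliable rating much more
# than a highly reliable one.
new_low = update_rating(3.5, reliability=0.2, observed=1.0, expected=0.4, match_information=1.0)
new_high = update_rating(3.5, reliability=0.95, observed=1.0, expected=0.4, match_information=1.0)
assert new_low > new_high > 3.5
```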

Community Engagement

The raw output of the reliability algorithm is basically unintelligible to anything that isn’t a computer, so a crucial step beyond getting the analysis right was converting that analysis into a format the community could use. We determined that the most sensible format (in part to avoid confusion with the rating itself) was a Reliability Score on a 0-100% scale. In order to map something opaque like 0.000947392 to a Reliability Score Meant for Humans, we had to figure out a transformation that provided us with key milestones. For instance, 60% is where we marked a “passing grade”, but the specific number 60% is somewhat arbitrary; the important part is what the value we’re calling “60%” actually represents.
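One way such a transformation could work (a sketch only; the anchor value and curve shape here are invented, and the real calibration comes from the stability analysis) is a saturating curve pinned so that a chosen raw threshold lands exactly at 60%:

```python
import math

PASSING_RAW = 0.001  # invented raw value that should map to the 60% "passing grade"

def reliability_score(raw):
    """Map a raw reliability value in [0, inf) onto a 0-100% score.

    Exponential saturation: 0 -> 0%, PASSING_RAW -> exactly 60%, and
    large raw values approach (but never exceed) 100%.
    """
    rate = -math.log(1 - 0.60) / PASSING_RAW
    return 100 * (1 - math.exp(-rate * raw))

print(round(reliability_score(0.001)))        # 60: the anchored passing grade
print(round(reliability_score(0.000947392)))  # just shy of passing
```

Any monotone curve preserves the ordering of players; the anchoring is what gives one point on the scale a concrete meaning.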

To determine what constitutes a passing grade, we looked at the entire universe of DUPR players and matches and performed a stability analysis, looking for key features that correlated with accurate predictions and stable ratings. Identifying a threshold of stability across these features, we were able to find the raw reliability value that strongly predicted whether a rating was going to be steady and accurate for most of the population. Rarely does a model predict something perfectly, and the Reliability Score is no different. Leagues, clubs, and tournaments should continue to use DUPR’s Reliability Score as a strong indicator of the reliability of the DUPR rating, but we continue to urge organizers to look holistically at players and profiles and make their own evaluations. The Reliability Score is meant to enrich the ways the pickleball community can come together for fair, level-based play, and shouldn’t be used to blindly limit or restrict.

We’re excited about what the Reliability Score can bring to the pickleball community and look forward to continuing to research and advance our product to bring as much to this fantastic community of players, coaches, organizers, and fans!

Thanks for reading, and talk to you next time!
