Investigate surprising "10% Other" portion of Analytics Browsers report
Open, Needs Triage · Public · 13 Estimated Story Points

Description

Background

For the past 8 years or so, it's been the case that the Browsers report at analytics.wikimedia.org reports a portion of 10-12% "Other".

In my view, this continuously calls into question the validity of the data and makes it difficult to trust. This lack of trust is due to the following assumptions being difficult to believe in their totality. The assumptions below are based solely on my own interpretations, and I expect them to be incorrect or incomplete. Having said that, I've shared these interpretations many times over the years, and have yet to encounter an alternate explanation, including from conversations with PMs, Directors, and engineers working on the measurement methods and underlying datasets.

  1. 100% = only from "user" type (excludes known bots and spiders).
  2. 100% = only page views (excludes page loads during other actions such as edit, history, search, special pages, and non-HTML requests; regardless of user agent string).
  3. Other = other browser families, representing a long tail of lesser known and "fake" browsers.

I have no trouble believing there are thousands or even millions of lesser known browser families or distinct (unparsable) user agent strings seen in pageview traffic.

The part that's unbelievable is that we have a combined 12% of (assumed-human) page view traffic worldwide coming from lesser known browsers. If true, this would be a very significant thing to talk about publicly and widely, to support organisations like Open Web Advocacy in speaking up for browser diversity. And we should then strive to publish some kind of dataset that provides more insight into what some of the "biggest" of the smallest browsers are.

What's also hard to believe is that the market share of the main browsers is as high or as low as reported, based on other information available.

  • 31.9% Chrome Mobile
  • 21.2% Mobile Safari
  • 16.4% Chrome
  • 11.7% Other
  • 3.4% Edge
  • 3.3% Firefox

Impact

When we make product decisions around which part of our global audience we can support at a certain level (Basic/Modern, Grade C/A, as per mw:Compatibility#Browsers), we have a theoretical maximum of 88%. That's a pretty low ceiling.

Based on stats.wikimedia.org, that's more than 2 billion page views every month (of 24 billion), and 190 million unique devices (of 1.6 billion), that we can't account for.

Examples:

Compared to other data

https://en.wikipedia.org/wiki/Usage_share_of_web_browsers

Source | Chrome | Firefox | Other
analytics.wikimedia.org (Week of 2023-07-09) | 48% | 3.3% | 12%
StatCounter (Worldwide: June 2023) | 62.58% | 2.81% | 1.37%
W3Counter (Dec 2022) | 71.2% | 3.0% | 5.7%

Investigation so far

I've raised this numerous times internally, including to Jon Katz (then-Director in WMF Product) in 2020, in the hopes someone could analyze this.

The issue is also raised regularly when I publicly share Wikimedia's browser statistics (example 1). Most recently, Sime Vidas (of WebPlatformNews fame) raised it again on social media (example 2).

So, I'll try to investigate it now and report my findings here.

Event Timeline

First off, we can query the underlying pageviews Hadoop dataset using Turnilo to get a rough sense of the data (https://w.wiki/75NW, restricted):

  • Filter: Last 7 days, Agent Type: user.
  • Split: Ua Browser Family
  • Measure: View Count
Chrome | Firefox | Other | Total
217 million (61%) | 43 million (12%) | 3.4k (0.001%) | 352 million

Note how much larger Chrome and Firefox are here, even in our own dataset.

Note that in this somewhat "raw" query to the pageview dataset in Turnilo/Hadoop, "Other" refers to "not recognised by ua-parser". This differs from the public Browsers report, which goes through an anonymisation phase that folds all entries under the 0.1% threshold into "Other". That's why the public dataset has a more than 10% "Other" segment.
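For reference, the Turnilo view above corresponds roughly to a query like the following against the underlying table (an approximation for illustration, not the literal Druid query Turnilo issues; the date range stands in for "Last 7 days"):

SELECT user_agent_map['browser_family'] AS browser_family,
       SUM(view_count)                  AS view_count
  FROM wmf.pageview_hourly
 WHERE year = 2023 AND month = 7 AND day BETWEEN 4 AND 10   -- stand-in for "Last 7 days"
   AND agent_type = 'user'
 GROUP BY user_agent_map['browser_family']
 ORDER BY view_count DESC
 LIMIT 25;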

I'm sorry I've been so slow to take this more seriously. I did take it seriously; there's just always other stuff. Ok, so here's what I'm doing to at least satisfy our curiosity here.

Take the base query that all these reports run on top of. Make a separate table to store our investigation (copy the create table statement from wmf.browser_general and change the db and hdfs location):

CREATE TABLE `milimetric`.`browser_general` (
  `access_method` STRING COMMENT '(desktop|mobile web|mobile app)',
  `os_family` STRING COMMENT 'OS family: Windows, Android, etc.',
  `os_major` STRING COMMENT 'OS major version: 8, 10, etc.',
  `browser_family` STRING COMMENT 'Browser family: Chrome, Safari, etc.',
  `browser_major` STRING COMMENT 'Browser major version: 47, 11, etc.',
  `view_count` BIGINT COMMENT 'Number of pageviews.',
  `year` INT COMMENT 'Unpadded year of request',
  `month` INT COMMENT 'Unpadded month of request',
  `day` INT COMMENT 'Unpadded day of request')
USING parquet
PARTITIONED BY (year, month, day)
LOCATION 'hdfs://analytics-hadoop/user/milimetric/wmf/data/wmf/browser/general'
;

Then call the query with the appropriate params (note to self: the doc string should be updated to use spark3 and --master yarn):

spark3-sql --master yarn -f browser_general.hql       \
    -d coalesce_partitions=1                          \
    -d projectview_source=wmf.projectview_hourly      \
    -d pageview_source=wmf.pageview_hourly            \
    -d year=2023                                      \
    -d month=7                                        \
    -d day=10                                         \
    -d threshold=0                                    \
    -d os_family_unknown=Unknown                      \
    -d os_major_unknown=Unknown                       \
    -d browser_family_unknown=Unknown                 \
    -d browser_major_unknown=Unknown                  \
    -d destination_table=milimetric.browser_general

And then we can look at that day in detail, as if that threshold doesn't apply. I'll update in a bit when the query finishes.

This gives me ~800 browser families that our user agent parser figures out. I start to not recognize families around the 20th, and by the 50th I really have never heard of any of these. But the view counts in the long tail are really low, checking on that...

use analytics_hive.milimetric;

with totals as (
select sum(view_count) as total_views
  from browser_general
),

views_by_family as (
select browser_family,
       sum(view_count) as view_count
  from browser_general
 group by browser_family
),

percentages_by_family as (
select browser_family,
       round(cast(view_count as double) / total_views, 4) as percent_views
  from views_by_family
           cross join
       totals
)

select * from percentages_by_family

Ok... looking at this on this day, it looks to me like "Other" is closer to 1.2%, but I have to look more carefully. If you want to take a look, the output of above is available as /home/milimetric/browser-family-stats-2023-07-10.tsv on stat1004.

Down to the source

I'll work backwards from the Dashiki frontend at analytics.wikimedia.org. The frontend links to, and itself loads data from, the following directory of TSV files: https://analytics.wikimedia.org/published/datasets/periodic/reports/metrics/browser/

I use Codesearch (query) to try to find the source of one of these files, e.g. all_sites_by_browser_family_and_major. This leads to a config file in the analytics/reportupdater-queries repository. I don't know what system interprets this or how, but it appears to set a schedule for running queries in Hadoop and to produce these TSV files. The surrounding directory (analytics/reportupdater-queries.git:/browser) contains files that each contain SQL-like queries, which I imagine are compatible with Hive.

Looking at all_sites_by_browser_family_percent, for example, it appears to target a Hadoop table called wmf.browser_general. I'm not familiar with this dataset, but based on the query being very simple (no logic about "Other" or the 0.1% threshold), I'm assuming this is an intermediary dataset, which I'm not interested in.

Next, I searched Wikitech wiki for "browser_general" to hopefully find out more about how the sausage is made. This brings us to Analytics/Data Lake/Traffic/Browser general. This wiki page describes the schema well, and I could query it from a stats server using the Hive command line at this point. But, as mentioned above, this intermediary dataset is too far down the pipeline; the mystery has already occurred at this point. Fortunately, the wiki page also says exactly where it comes from:

Oozie job

wmf.browser_general is loaded through this oozie job. It is a basic query that extracts data from wmf.pageview_hourly […], with a threshold-based anonymization applied to it. For privacy […] reasons, any rows that initially represent less than K% of the total view count are grouped into a single row, where all columns […] are assigned to "Other" […]

The Oozie-run query starts to shine a light on what's going on.

  1. It queries wmf.projectview_hourly to quickly compute a total pageview count (presumably identical to summing wmf.pageview_hourly, but quicker as it's a simpler dataset).
  2. It queries wmf.pageview_hourly to compute the sum of pageviews and their percentages relative to the total, grouped by (os_family, os_version, browser_family, browser_version).
  3. It folds all rows having <0.1% into "Other" (aka unknown); see the sketch after this list.
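As a rough sketch of that logic (a paraphrase for illustration, not the actual Oozie-run query; 0.001 corresponds to the 0.1% threshold, and "Other" stands in for whatever unknown-value the job is configured with):

WITH total AS (
  SELECT SUM(view_count) AS total_views
    FROM wmf.projectview_hourly
   WHERE year = 2023 AND month = 6 AND day = 10
),
by_combination AS (
  SELECT user_agent_map['os_family']      AS os_family,
         user_agent_map['os_major']       AS os_major,
         user_agent_map['browser_family'] AS browser_family,
         user_agent_map['browser_major']  AS browser_major,
         SUM(view_count)                  AS view_count
    FROM wmf.pageview_hourly
   WHERE year = 2023 AND month = 6 AND day = 10
     AND agent_type = 'user'
   GROUP BY user_agent_map['os_family'], user_agent_map['os_major'],
            user_agent_map['browser_family'], user_agent_map['browser_major']
)
SELECT os_family, os_major, browser_family, browser_major,
       SUM(view_count) AS view_count
  FROM (
    -- step 3: a combination below the threshold loses all of its labels at once
    SELECT IF(c.view_count >= t.total_views * 0.001, c.os_family, 'Other')      AS os_family,
           IF(c.view_count >= t.total_views * 0.001, c.os_major, 'Other')       AS os_major,
           IF(c.view_count >= t.total_views * 0.001, c.browser_family, 'Other') AS browser_family,
           IF(c.view_count >= t.total_views * 0.001, c.browser_major, 'Other')  AS browser_major,
           c.view_count
      FROM by_combination c
           CROSS JOIN total t
  ) folded
 GROUP BY os_family, os_major, browser_family, browser_major;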

You might already see the problem. I didn't when I first read it, but the following queries highlighted the problem that is now obvious in retrospect.

Confirm the problem

I SSH'ed to one of the stat machines (this is restricted behind the data access group analytics-privatedata-users), logged in with kinit (Kerberos), and launched the Hive command line to start exploring the data.

Generally when exploring data in Hadoop, I find it's quickest to focus on a single year-month-day slice. I picked 10 June 2023.

Query 1 (krinkle@stat1008.eqiad.wmnet: hive)
SELECT browser_family,
browser_major,
SUM(view_count) as view_count
FROM wmf.browser_general WHERE year=2023 AND month=6 AND day=10
GROUP BY browser_family, browser_major
ORDER BY view_count DESC
LIMIT 1000;
browser_family      browser_major   view_count
Chrome Mobile       114             111765204
Mobile Safari       16              92692217
Other               -               57242449 # ←
…
Chrome Mobile iOS   114             11630463
Edge                114             11540794
…
Samsung Internet    21              9778647
……
Chrome Mobile       95              273008

The wmf.browser_general table is more or less identical to the public TSV files that we publish and render in our Browsers report dashboard. As expected, "Other" is very large here. But, I did not expect that "Other" is the third-largest entry in the entire dataset!

Peeling the onion

To get answers about what "Other" is made of, we have to dive into the wmf.pageview_hourly table.

I re-created the "Oozie job" query (linked above) that is behind this browser_general table. I took note of the last row above (Chrome Mobile 95 with 273,008 pageviews in one day). This is effectively the cut-off, as otherwise there would have been another row below it. I limited my results to those smaller than 273008, with an additional limit of 20 rows, as there is no doubt a very long tail of garbage data here from the Internet's sewage sending malformed requests at our servers.

Beware: If you try this yourself, do not publish the results of your queries unless you've learned from specialists at Wikimedia on how to avoid leaking identifying data. Aggregate numbers are not automatically considered anonymous, because they can be used in combination with other datasets to identify people's activity. Our global audience is especially sensitive to privacy attacks and it is our responsibility to protect their data.

Query 2 (krinkle@stat1008.eqiad.wmnet: hive)
SELECT
user_agent_map['browser_family'] AS browser_family,
user_agent_map['browser_major'] AS browser_major,
SUM(view_count) as view_count
FROM wmf.pageview_hourly
WHERE year=2023 AND month=6 AND day=10
AND agent_type = 'user'
GROUP BY user_agent_map['browser_family'], user_agent_map['browser_major']
HAVING SUM(view_count) < 273008
ORDER BY view_count DESC LIMIT 20;
browser_family  browser_major   view_count
Chrome Mobile   79              271242
Firefox iOS     114             270017
Firefox         105             269201
Chrome Mobile   74              268443
Edge            109             263393
Firefox         111             250197
Mobile Safari   17              248484
Opera           95              239790
Opera Mobile    70              224850
IE              8               214597
IE              6               207486
Chrome          107             202719
Edge            106             202667

The main realization here is that the 0.1% threshold is applied to individual rows, and then rolled up straight into the global "Other" entry. The vast majority of page views represented by "12% Other" are from top-5 browsers such as Chrome, Firefox, Edge, Safari, and Opera. However, their individual versions have such a long tail (due to rapid upgrade cycles with literally hundreds of versions in active use) that each one can be <=0.1% and yet still add up to 5-10% globally.

For example, it seems likely that the older versions of Chrome trapped under "Other" alone add up to more than Edge's total market share.
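One could sanity-check that by summing the sub-threshold per-version rows back up by family. A hypothetical follow-up along these lines (not run here; it re-uses the 273,008 cut-off from Query 1, which only approximates the real per-OS-and-browser threshold):

SELECT browser_family,
       SUM(view_count) AS sub_threshold_views
  FROM (
    SELECT user_agent_map['browser_family'] AS browser_family,
           user_agent_map['browser_major']  AS browser_major,
           SUM(view_count)                  AS view_count
      FROM wmf.pageview_hourly
     WHERE year = 2023 AND month = 6 AND day = 10
       AND agent_type = 'user'
     GROUP BY user_agent_map['browser_family'], user_agent_map['browser_major']
    HAVING SUM(view_count) < 273008   -- per-version cut-off observed in Query 1
  ) per_version
 GROUP BY browser_family
 ORDER BY sub_threshold_views DESC
 LIMIT 20;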

Note that we are not talking about Chrome-like browsers here. The ua-parser software we use recognises most lesser-known browser families very well, and classifies them as something else (e.g. Vivaldi, Pale Moon, SeaMonkey, Brave, etc.). We're just talking about straight-up Google Chrome and its non-current versions.

Another realization here is that the 0.1% threshold appears to be applied before rows are rolled up by browser_family. If it were "only" a matter of unpopular versions being taken out of their family and into "Other", we would not have a "12% Other" share. The threshold is applied to (os_family, os_version, browser_family, browser_version), with all other datasets derived from that, after the anonymisation threshold.

Speaking of lesser known browsers, I figured I'd run one more query to see what the tail end looks like after the top-5 browser families drop off.

Query 3 (krinkle@stat1008.eqiad.wmnet: hive)
SELECT
SUM(view_count) as view_count
FROM wmf.pageview_hourly
WHERE year=2023 AND month=6 AND day=10
AND agent_type = 'user';
# total view_count = 510,934,018 ≈ 500M

SELECT
user_agent_map['browser_family'] AS browser_family,
SUM(view_count) as view_count
FROM wmf.pageview_hourly
WHERE year=2023 AND month=6 AND day=10
AND agent_type = 'user'
GROUP BY user_agent_map['browser_family']
HAVING SUM(view_count) < 273008
ORDER BY view_count DESC LIMIT 100;

browser_family | view_count | percent
Crosswalk | 218873 | 0.4%
BackgroundShortcutRunner | 94880 | 0.2%
Vivaldi | 92493 | 0.2%
HeadlessChrome | 84088 | 0.2%
YandexSearch | 60088 | 0.1%
Chromium | 42115 | 0.008%
Pale Moon | 32754 | 0.007%
SeaMonkey | 23669 | 0.005%
Waterfox | 21321 | 0.004%

Geeky side-note: BackgroundShortcutRunner appears to be Apple Shortcuts (powered by Swift CFNetwork, Git source), an iOS app that offers a "visual programming" way to automate various tasks, in which it is not uncommon to have steps like "open browser". I'm guessing this is what that is!

Conclusion

Overall, I no longer see reason to doubt the underlying validity of the pageview dataset. Per T342267#9028708, the "true Other" is only 0.001%. Contrary to what I suspected in March this year (at T178356#8709512), I do not think we have a big problem with the version of ua-parser we run. It still correctly identifies >99.999% of browsers and does so without much noise or variability.

My suspicion that the browser dataset (derived from the pageview dataset) isn't what it appears to be was confirmed. The biggest surprise was that we fold rows under <0.1% into "Other" before we perform OS-level and global browser-level aggregations.

I believe this means that, if Firefox 89 were used by 0.5% of page views (well above the threshold), but its use is spread across various different Android, Windows, and macOS versions, it may end up omitted from the dataset entirely, captured in "Other". And this happens even when browsing dashboards built on the generic "browser family + version" dataset, which is OS-agnostic.

Likewise, a browser like Vivaldi may very well have more than 0.1% of pageviews, but with its use spread across too many different versions.

Next questions:

  • Is it feasible to compute this data such that the threshold is applied last?
  • Do we have pageview_hourly (or equivalent at daily/weekly resolution) going back far enough, to be able to recompute the data from 2015 onwards if/when we agree on a better query? Note that we don't need the raw user agent string (which I know we don't store beyond 90 days), just the aggregate ua_map.

I can think of two approaches.

  1. One is that we'd roll-up from the edges first. E.g. first roll up "Android x + y + z, Windows x + y +z - Firefox 89" into "Android other, Windows other, Firefox 89". And first roll up Vivaldi 10,11,12 into Vivaldi other. This has the benefit of keeping the derived nature of the datasets, continuing to come from one publicly-sharable source and the rest trivially derived from it. The downside is that it would create a lot of "other" buckets and would still omit more data than our threshold allows.
  2. Another is that we'd apply the threshold last after all other aggregations. This would offer maximum detail with the threshold we set. The downside is that it would no longer make the datasets individually correlate and add-up to the same information. E.g. there may be a higher percentage of "Android 8.0" in the OS-only dataset, than in the "OS+Browser" dataset, where more rows would have been replaced with "Other".
  1. Is it feasible to compute this data such that the threshold is applied last?

Yeah, definitely

  1. Do we have pageview_hourly (or equivalent at daily/weekly resolution) going back far enough, to be able to recompute the data from 2015 onwards if/when we agree on a better query? Note that we don't need the raw user agent string (which I know we don't store beyond 90 days), just the aggregate ua_map.

Yes, we have all of pageview_hourly since we started populating it, May 2015

I believe the new query I left in T342267#9029101 goes most of the way towards addressing the problems we found. Namely, it finds percentages by family first. We can join to this and choose to do pretty much anything we want. Do you have a sense of what would be most useful?

We can join to this and choose to do pretty much anything we want. Do you have a sense of what would be most useful?

[…] I can think of two approaches:

  1. One is that we'd roll-up from the edges first. E.g. first roll up "Android x + y + z, Windows x + y +z - Firefox 89" into "Android other, Windows other, Firefox 89". And first roll up Vivaldi 10,11,12 into Vivaldi other. This has the benefit of keeping the derived nature of the datasets, continuing to come from one publicly-sharable source and the rest trivially derived from it. The downside is that it would create a lot of "other" buckets and would still omit more data than our threshold allows.
  2. Another is that we'd apply the threshold last after all other aggregations. This would offer maximum detail with the threshold we set. The downside is that it would no longer make the datasets individually correlate and add-up to the same information. E.g. there may be a higher percentage of "Android 8.0" in the OS-only dataset, than in the "OS+Browser" dataset, where more rows would have been replaced with "Other".

From a high level, the main constraint here is the cut-off threshold. This forces us to aggregate. However, even with the same 0.1% threshold, there are actually multiple ways to slice the data, where some ways lead to more detail in the output.

  • Status quo (Great simplicity, Poor detail). We release 1 main dataset with all low-level datapoints combined, and the threshold applied. The other datasets are derived from this one. This is very simple, but also means a 12% global bucket of "Other", larger than the non-Other portions of Edge, Samsung, and Firefox combined. And it means people often wrongly assume that "Firefox 5%" in the browser-only dataset is a total, when actually it is only the total of the individual (Firefox, version, OS) combinations that are each >0.1%.
  • Idea 1 ("Roll up") (Good simplicity, High detail). We release 1 main dataset, like before. But rather than only 1 large global "Other" bucket, we create smaller "Other" buckets within a given OS or browser family. This means we would, for example, have "Firefox 8%" made up of:
    • 3% (Firefox 112, Windows 10)
    • 2% (Firefox 112, macOS)
    • 2% (Firefox 112, other), covers OS/version combos too small, including uncommon Linux/Windows versions.
    • 1% (Firefox other, Windows 10), covers uncommon versions of Firefox on Windows 10.
  • Idea 2 ("Last-mile") (Low simplicity, High detail). We release separate datasets, each having a fairly large "Other" entry for data below the threshold, but only within that dataset. This means we would, for example, get close to the (true) "Firefox 8%" in the browser-only dataset, but in the OS+browser dataset it would add up to less, like today. To get the detail you seek, you'd have to find the closest dataset first.

Of these, I believe "Idea 1" would be most useful.

I originally thought "Idea 2" would offer more detail than "Idea 1", but I realize now that it wouldn't offer anything more. We don't have a "browser-only" dataset. What we have is a "browser_family + browser_version" dataset, which means the "Other" bucket would still eat up entries for individual Firefox versions below 0.1%. You'd want those to roll up into "Firefox version=other" at least, at which point you might as well implement Idea 1 instead!

The level of complexity and inability to correlate with "Idea 2" seems... undesirable. It's not impossible to justify, for example, if it did offer more detail within the same privacy threshold, and if that detail was very valuable to us. Right now, though, we have a whopping 12% Other. More than Edge, Samsung and Firefox combined. I expect "Idea 1" will cut this to less than 1%. If someone manages to find a way to squeeze more out of that 1%, and if that is valuable enough when weighed against the increased complexity in understanding the data, then that's perhaps worth exploring later? We have some low-hanging fruit to pick first, I think :)

@SNowick_WMF @nettrom_WMF @mpopov can you share ways this change would positively or negatively affect your work supporting product decision makers?

Idea 1 has the level of information and detail that my teams would find useful, especially in conversations around browser versions, support levels, and such.

@Krinkle: Thank you for sharing the results of your queries in a manner consistent with the Data Publication Guidelines.

From Product Analytics side, idea 1 would make it less likely for P+T teams to request our support for self-serviceable questions around browser/OS usage.

I can see Idea 1 potentially being helpful to teams answering questions like these themselves:
T303301: Support on understanding traffic and behaviors for users on legacy browsers (somewhat timely)
T250385: Estimate readership from Internet Explorer 8
although most folks at the Foundation have access to tools like Turnilo (which was used to answer that first one) and Superset. Staff also have access to more detailed data than what is released publicly. Maybe the browsers report on analytics.wikimedia.org is also more accessible / easier to use for staff than Turnilo and Superset are? (Although I'd like to get to a state where everyone in P+T is data-literate and proficient with both of those.)

I think where the public data is really helpful is for conversations with communities / product decisions made in collaboration with communities, since volunteers don't have access to the more detailed data that is only available internally.

Idea 1 ("Roll up") (Good simplicity, High detail). We release 1 main dataset, like before. But rather than only 1 large global "Other" bucket, we create smaller "Other" buckets within a given OS or browser family. This means we would, for example, have "Firefox 8%" made up of:

3% (Firefox 112, Windows 10)
2% (Firefox 112, macOS)
2% (Firefox 112, other), covers OS/version combos too small, including uncommon Linux/Windows versions.
1% (Firefox other, Windows 10), covers uncommon versions of Firefox on Windows 10.

Seems like folks are onboard for Idea 1. Speak now or forever hold your peace. Will likely start work on this starting June 16th.

CC @Htriedman

Reading up on this thread now! I think that idea 1 sounds good and shouldn't be privacy-breaking if we report counts/percentages that are above 250 views. Given that we get something like 600m views per day, that lower threshold accounts for 0.000042% of our traffic.
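For the arithmetic behind that: 250 / 600,000,000 × 100% ≈ 0.0000417%, which rounds to the 0.000042% quoted above; in other words, the per-row floor is vanishingly small relative to daily traffic.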

@Krinkle

I think we could modify (split) the schema of the table wmf.browser_general and recompute it, so that idea 2 does indeed drastically reduce the other bucket for charts that are browser-only or OS-only.
We could just have 3 tables:

  • one grouped by browser_family and browser_major (would back up browser-based charts)
  • one grouped by os_family and os_major (would back up OS-based charts)
  • the one that we already have now, grouped by all fields (would back up browser+OS-based charts)

When aggregating the first 2 tables, since there are fewer dimensions, buckets would be bigger, and I think applying the same 0.1% threshold would indeed result in a much smaller "Other" bucket.
The only drawback of this solution is that if we want the change to apply retroactively, we'll have to re-run the calculations since 2015, but I think it is possible.
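For illustration, the two extra tables could look something like this (a sketch only, with hypothetical names and columns mirroring wmf.browser_general; not a concrete schema proposal):

CREATE TABLE `wmf`.`browser_only_general` (
  `access_method`  STRING COMMENT '(desktop|mobile web|mobile app)',
  `browser_family` STRING COMMENT 'Browser family: Chrome, Safari, etc.',
  `browser_major`  STRING COMMENT 'Browser major version: 47, 11, etc.',
  `view_count`     BIGINT COMMENT 'Number of pageviews.',
  `year` INT, `month` INT, `day` INT)
USING parquet
PARTITIONED BY (year, month, day);

CREATE TABLE `wmf`.`os_only_general` (
  `access_method` STRING COMMENT '(desktop|mobile web|mobile app)',
  `os_family`     STRING COMMENT 'OS family: Windows, Android, etc.',
  `os_major`      STRING COMMENT 'OS major version: 8, 10, etc.',
  `view_count`    BIGINT COMMENT 'Number of pageviews.',
  `year` INT, `month` INT, `day` INT)
USING parquet
PARTITIONED BY (year, month, day);

Each would be populated with the same threshold logic as today, just grouped by fewer dimensions before the 0.1% cut is applied.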


Also,
I think "Other" is not a good name for that bucket.
It implies that everything that is in it, is different from what is outside of it.
But actually, the current "Other" bucket in, say, ("Windows", "iOS", "Android", "Other", ...) does mostly contain requests from "Windows", "iOS", "Android", etc.
So, I think the name is misleading.
I'm not sure what could be a good name, but some ideas: "Unreported", "Hidden", "Protected", "Small buckets", "Privacy sensitive", "Unknown", ...?

Made up some slides to help think about this data:

https://docs.google.com/presentation/d/1jnL4E0LVEnh1WsNK7IM-mS2tQpIXxZJN2tz7iP05_OA/edit#slide=id.g2e65179e474_0_328

What Marcel says above is, I think, more valuable for the browser-only and OS-only reports than anything we can do with the combined report, because of what I show in the slides. Adding to that, for the combined report, we could continue with idea 1 in two ways:

  1. keep browser_general the way it is now, with a threshold of 0.1%
  2. recompute browser_general with a threshold of 250 views

If we recompute, we'd have to bucket at some point, not for privacy but for readability of the report. So a question for @Timo and others: what level of detail is useful here? Do we need a list with everything that is above 250 views? I looked at just the Chrome and Firefox browser families, and for 1 day we have 1500 different combinations of all dimensions that have more than 250 hits. That seems more like noise than information to me, so what would be a good size for this list? I can then try to optimize the output towards that.

I think "Other" is not a good name for that bucket. […]
But actually, the current "Other" bucket in, say, ("Windows", "iOS", "Android", "Other", ...) does mostly contain requests from "Windows", "iOS", "Android", etc.
So, I think the name is misleading. […]

I'd agree. My thinking around "Idea 1" is that this would become less common. The global "Other" bucket would, I believe, mostly be reported under buckets like "Windows 10 / Firefox other". This is based on the assumption that most lost buckets today are in the long tail of browser versions, under OS versions and browser families that are themselves fairly large. A small contingent of noise and uncommon browser or OS families would continue to fall under "Other", of course.

What I am starting to realize, though, is that if we go with "Idea 1" and follow its implication of continuing a single dataset in the same flat, re-usable format we have today, we have a choice to make. Rolling up browser_version first or os_version first seems to produce the same output in my limited testing. I expect the main difference and trade-off will be in rolling up browser_family vs os_family first. This will dictate whether you end up with more "Other / real browser" rows or more "Real OS / Other" rows.

My instinct is that we have far more diversity in browser_family/version (and thus more loss due to division) than in os_family/version, which is more homogeneous and resistant to data loss, I think? If that's the case, then rolling up by os_family would be more valuable, I imagine. Example at P65115.

ok, I have some results for us to peruse, from rolling up in different ways. First of all, my query so we can debate whether or not it's accurate.

-- CTEs omitted for brevity, quick descriptions here (all data from 2024-06-10):
-- pageview_sanitized: filtered to (user) pageviews on (desktop, mobile_web)
--                     AGGREGATED by all dimensions, filtered to HAVING > 250 views
-- totals: the total view_count for all requests on this day, no filtering

SELECT
    'all' as aggregation_type,
    p.access_method,
    p.os_family,
    p.os_major,
    p.browser_family,
    p.browser_major,
    p.view_count,
    p.view_count * 100 / totals.total_view_count AS percent_of_total
FROM
    pageview_sanitized p
    CROSS JOIN totals
WHERE
    p.view_count * 100 / totals.total_view_count > 0.1

UNION ALL

SELECT
    CASE
        WHEN os_family is null THEN 'only_browser_family'
        WHEN browser_family is null THEN 'only_os_family'
        WHEN browser_major is null and os_major is null THEN 'browser_and_os_families'
        WHEN os_major is null THEN 'all_but_os_major'
        WHEN browser_major is null THEN 'all_but_browser_major'
        ELSE 'unknown'
    END as aggregation_type,
    p.access_method,
    COALESCE(p.os_family, 'other-families') as os_family,
    COALESCE(p.os_major, 'other-versions') as os_major,
    COALESCE(p.browser_family, 'other-families') as browser_family,
    COALESCE(p.browser_major, 'other-versions') as browser_major,
    sum(p.view_count) as view_count,
    sum(p.view_count) * 100 / MIN(totals.total_view_count) AS percent_of_total
FROM
    pageview_sanitized p
    CROSS JOIN totals
WHERE
    p.view_count * 100 / totals.total_view_count <= 0.1
GROUP BY
    access_method,
    os_family,
    os_major,
    browser_family,
    browser_major
GROUPING SETS (
    (access_method, os_family, browser_family),                 -- 'browser_and_os_families'
    (access_method, os_family, os_major, browser_family),       -- 'all_but_browser_major'
    (access_method, os_family, browser_family, browser_major),  -- 'all_but_os_major'
    (access_method, browser_family),                            -- 'only_browser_family'
    (access_method, os_family)                                  -- 'only_os_family'
)
HAVING
    sum(p.view_count) * 100 / MIN(totals.total_view_count) > 0.1

So, the first SELECT grabs all data which, aggregated by all columns (os family, browser family, os major, browser major), represents more than 0.1% of the total pageviews on this day.

And the second SELECT then tries five different grouping sets and labels results by them:

(access_method, os_family, browser_family),                 -- 'browser_and_os_families'
(access_method, os_family, os_major, browser_family),       -- 'all_but_browser_major'
(access_method, os_family, browser_family, browser_major),  -- 'all_but_os_major'
(access_method, browser_family),                            -- 'only_browser_family'
(access_method, os_family)                                  -- 'only_os_family'

In the next comment I'll dump all the data so we can all look at it and make decisions. For those who want to look, I've inserted it in an Iceberg table, analytics_iceberg.milimetric.bg_01. Access this on SQLLab, or from a stat machine:

kinit
...
presto
use analytics_iceberg.milimetric;
desc bg_01;

This spreadsheet [1] has all the different aggregations in separate sheets. The name of each sheet is the aggregation type, described here:

all | aggregate by browser, br. major, os, os major, and keep only if total views > 0.1%
browser_and_os_families | aggregate anything NOT > 0.1% by browser and os families and keep results that are now > 0.1%
all_but_browser_major | aggregate anything NOT > 0.1% by all but browser major and keep results that are now > 0.1%
all_but_os_major | aggregate anything NOT > 0.1% by all but os major and keep results that are now > 0.1%
only_browser_family | aggregate anything NOT > 0.1% by browser family and keep results that are now > 0.1%
only_os_family | aggregate anything NOT > 0.1% by os family and keep results that are now > 0.1%

[1] https://docs.google.com/spreadsheets/d/1a6JiDupQn4I7JujyoIvcR8oDJ942DPYTyCBQ_P60ARo/edit?usp=sharing

The long and the short of it is that we can get that "Other" down to about 2% if we simply roll up the remaining data by browser family and os family. We could get fancier, but let's see what folks think about just this approach.

From a discussion with @Krinkle about the data, a preliminary idea of how to roll up is:

first (os, os_major, browser, browser_major), then the rest by (os, browser, browser_major), then the rest by (os, browser), then the rest by (browser), and everything else into "Other", where we will think very hard about what word to use instead of "Other".

By "the rest" here we mean everything that didn't previously roll up over the threshold. We discussed threshold a bit, and found that if an item has more than 100k hits in a week, it feels substantial enough to include. This corresponds to roughly 0.01%, an order of magnitude more than before. I'll run some queries and see where this lands us in terms of output and what assumptions about our current jobs would need to be re-evaluated as a result. For intuition, 100k hits in a week is roughly 10 hits per minute, so it makes sense to report on these.

We get a ton more detailed results this way, and the total coverage increases to 99.7%. Still not 99.9%, but I think we may have too much detail at some point. I'm fairly happy with these results, and I'm going to prepare the new browser general query as a gerrit change. It'll be good to get some review.

 select coalesce(aggregation_type, 'overall') as agg_type,
        sum(view_count) views,
        sum(percent_of_total) percent_of_total,
        count(*) as distinct_results
   from bg_15k
  group by aggregation_type
  grouping sets ((), (aggregation_type))
;

Below, agg_type says which columns we're aggregating by, with overall being an aggregate of the other four aggregates. We progressively aggregate as described in the previous comment.

agg_type | views | percent_of_total | distinct_results
overall | 498502429 | 99.7339 | 1182
browser_major_and_os_major (all) | 485338098 | 97.1001 | 904
all_but_os_major | 5306051 | 1.0616 | 209
browser_and_os_families | 7706058 | 1.5417 | 62
only_browser_family | 152222 | 0.0305 | 7
The simpler way to do this, just two phases as opposed to progressive, gets us fairly similar results, with about 200 fewer rows which are all detailing specific browser versions.

select coalesce(aggregation_type, 'overall bg_250') as agg_type,
       sum(view_count) views,
       sum(percent_of_total) percent_of_total,
       count(*) as distinct_results
  from bg_15k_simple
 group by aggregation_type
 grouping sets ((), (aggregation_type))
agg_type | views | percent_of_total | distinct_results
overall | 498408744 | 99.7151 | 979
browser_major_and_os_major (all) | 485338098 | 97.1001 | 904
browser_and_os_families | 13070646 | 2.6150 | 75

So cool @Milimetric! Thanks for the data study.
The process makes sense to me!
Only question I have is: the query cuts the data 3 times,
once in pageview_sanitized (mentioned in the CTE description "filtered to HAVING > 250 views");
once in the first select "p.view_count * 100 / totals.total_view_count > 0.1";
and lastly in the second select "sum(p.view_count) * 100 / MIN(totals.total_view_count) > 0.1".
Wouldn't it be better to just cut once at the very end? Wouldn't it further reduce the Other bucket size?

Great question, @mforns. This was mostly for performance reasons. I couldn't find a way to get Spark to optimally work on the full day of pageviews without first aggregating it like this to > 250. But the execution plan I ended up with looks pretty wild. Let's talk tomorrow when you have some time. I'm attaching the change here.

Change #1049281 had a related patch set uploaded (by Milimetric; author: Milimetric):

[analytics/refinery@master] [WIP] Implement new way to aggregate browser statistics

https://gerrit.wikimedia.org/r/1049281

Milimetric set the point value for this task to 13. (Wed, Jul 3, 7:50 PM)

Apologies for the week delay here, I was out sick, picking it back up soon.

Ok, sent updated code. It's fast now due to a CACHE statement, but that doesn't change the query plan, which is still absolutely nuts. Check this out:

Crazy Query Plan.png (4×3 px, 1 MB)
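For anyone following along, the kind of statement meant is Spark SQL's CACHE TABLE, which materializes the heavy intermediate once instead of recomputing it for every grouping set. This is an illustrative sketch only; the real statement lives in the Gerrit change above:

-- Illustrative only: cache the pre-aggregated, thresholded intermediate once.
CACHE TABLE pageview_sanitized AS
SELECT user_agent_map['os_family']      AS os_family,
       user_agent_map['os_major']       AS os_major,
       user_agent_map['browser_family'] AS browser_family,
       user_agent_map['browser_major']  AS browser_major,
       SUM(view_count)                  AS view_count
  FROM wmf.pageview_hourly
 WHERE year = 2024 AND month = 6 AND day = 10
   AND agent_type = 'user'
 GROUP BY user_agent_map['os_family'], user_agent_map['os_major'],
          user_agent_map['browser_family'], user_agent_map['browser_major']
HAVING SUM(view_count) > 250;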

Heya @Milimetric, sorry for taking so long to review this.
I left a comment and a +1, I think that the code looks great and that we can deploy this 👍
This new query is so cool! Kudos :-)

ok, moving to ready to deploy. I'm going to ping @Krinkle one more time for data review. I executed this as I was testing and the results are available in milimetric.browser_general_test. You can query it like this:

https://superset.wikimedia.org/sqllab/?savedQueryId=928 (using presto_analytics_iceberg as the database, since it's an iceberg table like in prod)

Or from a stat box you can do:

presto
use analytics_iceberg.milimetric;

 select *
   from milimetric.browser_general_test
  where day = date '2024-06-14'
  order by view_count asc

And for dates, I ran 2024-06-10 through 2024-06-17, so those should all be available. The "leftover" or "Other" bucket is now identified by this query:

select *
  from milimetric.browser_general_test
 where day = date '2024-06-17'
   and os_family = 'Redacted'
   and os_major = 'Redacted'
   and browser_family = 'Redacted'
   and browser_major = 'Redacted'
 order by view_count asc

And for the 17th of June, it's a total of about 1.2 million hits, or about 0.26%. Let me know what you think and if I can help make the data more accessible for testing.