Initial Analysis of Plugin Search Logs

In preparation for relevancy testing of the new WordPress.org plugin search, here is some analysis of the search logs from the existing plugin search. This data should help us improve the search query more methodically, and it may also be of use when thinking about the overall search user experience.

Aggregated Stats

  • 53 days of aggregated, anonymized search records with 4,952,788 total searches and 805,488 unique searches. This excludes API searches(*)
  • Almost 100k searches a day on average.
  • Focusing on en_US: 499,489 unique searches from 3,929,003 total searches
  • Top 100 en_US terms: 944,243 searches (24% of the total)
  • Top 1000 en_US terms: 1,834,942 total searches (46.7% of the total)
  • Bottom 400k en_US terms: 479,604 searches (12.2% of total)
  • 320,390 en_US terms occurred once in 53 days (8.1% of total)
  • 472,500 en_US terms occurred ten times or fewer: 839,123 searches (21% of total)
  • 492,500 en_US terms occurred less than once a day: 1,271,786 searches (32% of total)
  • 245,151 of 472,500 search queries pass the aspell spell check (aspell is limited, though, and there are lots of fun spelling mistakes: “composor”, “conditon”, “taxanomy”, “produkt”)
  • Here is a random set of queries that only occurred once in the 53 days to get a sense of what the long tail of queries looks like.
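
Shares like the ones above can be recomputed from a table of per-query counts. The following is only a minimal sketch, not the script used for this post: it assumes the logs have already been reduced to a mapping of query string to search count, and the function name `coverage_stats` and its parameters are hypothetical.

```python
from collections import Counter


def coverage_stats(counts, top_ns=(100, 1000), rare_threshold=10):
    """Summarize how search volume is spread across unique queries.

    counts: mapping of query string -> number of searches (assumed input shape).
    """
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no searches to summarize")

    # Per-query counts, most frequent first.
    ranked = sorted(counts.values(), reverse=True)
    stats = {"total_searches": total, "unique_queries": len(ranked)}

    # Share of all searches covered by the N most frequent queries.
    for n in top_ns:
        stats[f"top_{n}_share"] = sum(ranked[:n]) / total

    # Queries seen at most `rare_threshold` times, and their combined share.
    rare = [c for c in ranked if c <= rare_threshold]
    stats["rare_queries"] = len(rare)
    stats["rare_share"] = sum(rare) / total

    # Queries seen exactly once; each contributes one search.
    singletons = sum(1 for c in ranked if c == 1)
    stats["singleton_queries"] = singletons
    stats["singleton_share"] = singletons / total
    return stats


# Toy data for illustration only; these are not real log counts.
toy = Counter({"seo": 6, "contact form": 2, "composor": 1, "taxanomy": 1})
toy_stats = coverage_stats(toy, top_ns=(1, 2), rare_threshold=1)
print(toy_stats)
```

Run against the real en_US counts with the default arguments, the `top_100_share` and `top_1000_share` entries would correspond to the 24% and 46.7% figures above.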

It is a little hard to draw conclusions from the non-English data because there aren’t currently any search indices for non-English languages, so the top searches in other languages look about the same as in English. About 20% of all searches are not in English. Here are the search counts by language. This likely undercounts the demand, since search in non-English languages almost certainly doesn’t work well.

(*) A very large volume of searches comes through the API, but a lot of these appear to be bots or direct lookups of an exact plugin name rather than organic searches. These should be looked at separately and should maybe not use the same search query.

Some Takeaways from the Data

Hopefully this data makes it clear why we can’t just focus on the top 100 queries. They account for only 24% of en_US searches, so doing so ignores roughly three quarters of searches. Even the top 1000 covers less than 50% of all searches. So in testing our search quality, at the very least we should look at the top 1000. Ideally we’ll also do some random sampling of the remaining searches (a little over half) and find ways to address spelling mistakes and other edge cases that affect a very large percentage of users. Similarly, adding better support for non-English languages could help as many users as improving the top 100 queries.
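
One way the random sampling of the long tail could be done is sketched below. This is an illustration under assumptions, not an agreed test plan: it again assumes a mapping of query string to search count, and the name `sample_tail` and its parameters are hypothetical. It offers two modes. Weighting by search count approximates what a randomly chosen search looks like, while uniform sampling over unique queries puts more emphasis on one-off queries and misspellings.

```python
import random


def sample_tail(counts, skip_top=1000, k=200, weighted=True, seed=0):
    """Draw a test sample from the queries outside the `skip_top` most frequent.

    counts: mapping of query string -> number of searches (assumed input shape).
    """
    # Rank by count, most frequent first; break ties alphabetically so the
    # ranking is reproducible.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    tail = ranked[skip_top:]
    if not tail:
        return []

    rng = random.Random(seed)  # fixed seed so a test set can be regenerated
    queries = [query for query, _ in tail]
    if weighted:
        # Sampling with replacement, proportional to search volume.
        # The same query can therefore appear more than once.
        weights = [count for _, count in tail]
        return rng.choices(queries, weights=weights, k=k)
    # Uniform over unique queries, without replacement.
    return rng.sample(queries, min(k, len(queries)))


# Toy data for illustration only; these are not real log counts.
toy_counts = {"seo": 50, "contact form": 30, "slider": 5, "conditon": 1, "produkt": 1}
uniform_sample = sample_tail(toy_counts, skip_top=2, k=3, weighted=False)
weighted_sample = sample_tail(toy_counts, skip_top=2, k=10, weighted=True)
```

Deduplicating the weighted sample before judging relevancy would avoid rating the same query twice.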

#plugin-directory