Wikistats reports on article revert trends

At Wikimania Gdansk I talked about the newest addition to Wikistats: article revert trends.
This week the new set of tables and charts was released for almost 800 Wikimedia wikis. For example, here is the report for the English Wikipedia.

In recent years there has been much talk within the Wikimedia community about edit reverts. Is there a growing tendency to revert (undo) new contributions, especially edits by anonymous users? If so, does this discourage new users from contributing? How large is the share of reverts anyway? Is an initiative like Flagged Revisions (FlagRevs), which seeks to shield the general public from deliberate but mostly short-lived lapses in article quality, effective? Is it also non-disruptive? The latter questions are also relevant to the current Pending Changes initiative, which was inspired by FlagRevs.

For the German Wikipedia in particular, where FlagRevs first took off, a set of relevant statistics already exists, notably these trend charts by user ParaDox. Some months ago the German Chapter also asked Felipe Ortega to look at the effect of FlagRevs. Felipe presented at Wikimania Gdansk as well. Although our approaches differ and our research only partially overlapped, I am glad our conclusions are roughly the same (see our presentations).

As always, I looked for an approach that is language independent, and therefore applicable to all our wikis, yet yields solid results. In this case the crucial factor was how to recognize reverts in our archives (XML dumps).

One way is to look for keywords in the edit comments that correlate well with reverts. The advantage is that it can detect incomplete reverts, where a user manually undoes most of the last changes but does not completely restore the previous state of the article. The drawbacks are that many false positives are inevitable, and of course it does not scale well to 280 languages.
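
The keyword approach can be sketched in a few lines of Python. This is only an illustration, not the code behind the reports: the keyword list is a hypothetical one for English edit comments, and the function name `looks_like_revert` is made up for this example.

```python
import re

# Hypothetical keyword list for English edit comments (an assumption for
# this sketch; every wiki language would need its own list).
REVERT_PATTERN = re.compile(
    r"\b(?:revert(?:ed|ing|s)?|rvv?|undid|undo)\b",
    re.IGNORECASE,
)

def looks_like_revert(comment):
    """Guess from the edit comment alone whether an edit is a revert."""
    # Edits without a comment can never be flagged by this method.
    return bool(REVERT_PATTERN.search(comment or ""))
```

A comment such as "expanded section on revert policy" is flagged as well, which is exactly the kind of false positive mentioned above.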

There is a way to detect complete reverts efficiently: calculate a checksum for each revision and compare it with the checksums of earlier revisions. The MD5 checksum algorithm guarantees with near certainty that equal checksums imply equal input, byte for byte, in this case the full revision contents. Another important advantage: it is completely irrelevant what language or encoding is used in any wiki. For a few large wikis I tried both approaches: a wiki-specific set of keywords in comments, and MD5 checksums. However, the tables and charts in the new reports are almost solely based on the more reliable MD5 approach.
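
As a rough illustration of the checksum idea, here is a minimal Python sketch. It is not the actual Wikistats code. It assumes the revisions of one article are available as a list of strings in chronological order, that the whole history is searched for a match, and that a revision identical to its immediate predecessor counts as a null edit rather than a revert. The function name `find_reverts` is hypothetical.

```python
import hashlib

def find_reverts(revision_texts):
    """Return (reverting_index, restored_index) pairs for one article history."""
    last_seen = {}  # checksum -> index of the latest revision with that content
    reverts = []
    for index, text in enumerate(revision_texts):
        # MD5 only sees bytes; the language of the text plays no role.
        digest = hashlib.md5(text.encode("utf-8")).hexdigest()
        previous = last_seen.get(digest)
        # Assumption: identical to the immediate predecessor = null edit.
        if previous is not None and previous < index - 1:
            reverts.append((index, previous))
        last_seen[digest] = index
    return reverts
```

For the history `["A", "A vandal", "A", "A plus", "A plus"]` this returns `[(2, 0)]`: the third revision restores the first one, and the repeated last revision is ignored.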

Here is a summary of the tables and charts that together make up the new report. Tables: rankings of the most reverted users, most reverting users, and most reverted articles. Charts: edit trends (some published earlier) and revert trends. Revert trends come in two variations: revert ratio per class of editor (registered editor, anonymous editor, or bot) and a breakdown of anonymous edits by class of reverting editor. All charts come in two versions: one shows the raw numbers, the other shows trends (raw data minus seasonal patterns and random variations).
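
To make the two chart ingredients concrete, here is a small Python sketch. It assumes each edit has already been labelled with an editor class and a reverted flag, and it uses a trailing 12-month mean as a stand-in for trend extraction. The smoothing method used in the actual reports is not described here, so treat that part as an assumption. Both function names are hypothetical.

```python
from collections import Counter

def revert_ratios(edits):
    """edits: iterable of (editor_class, was_reverted) pairs."""
    totals, reverted = Counter(), Counter()
    for editor_class, was_reverted in edits:
        totals[editor_class] += 1
        if was_reverted:
            reverted[editor_class] += 1
    # Share of reverted edits per editor class.
    return {cls: reverted[cls] / totals[cls] for cls in totals}

def moving_average(monthly_values, window=12):
    """Trailing mean over up to `window` months (assumed smoothing method)."""
    smoothed = []
    for i in range(len(monthly_values)):
        chunk = monthly_values[max(0, i - window + 1): i + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed
```

A 12-month window averages over one full year, which dampens patterns that repeat yearly; shorter windows keep more of the seasonal signal.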

Surely there is ample risk of information overload. But I am convinced that for many of our wikis some people will be motivated to dig into these stats and come up with a context and some explanations, a story which will have wider implications beyond that particular wiki.

Hopefully the new stats can bring some solid numbers to discussions about our revert policies. Of course these numbers are influenced by methodology, are selective (meaning they don’t address all issues), and probably raise some new questions. The data files generated in this project can be reused in further research, e.g. to determine what share of bad edits is detected and undone before an article revision is released to the public.

Disclaimer: complex social phenomena will never be explained by quantitative data alone. They seldom yield to what-if questions, nor do they lend themselves easily to double-blind experiments.

See also the slides for my Wikimania Gdansk presentation.

Note: in the new reports some language links are missing. Until this is fixed, please use this table to access all reports.

Update: anyone interested in studying revert patterns is advised to look into the work of Palo Alto Research Center scientist Ed Chi and his colleagues. Ed has been studying Wikipedia and its growth patterns for many years now. Vandalism and reverts were part of that research. Several of his many publications focus on Wikipedia; two recent ones relevant to this topic are The Singularity is Not Near: Slowing Growth of Wikipedia (2009) and What’s in Wikipedia? Mapping Topics and Conflict Using Socially Annotated Category Structure (2009).

7 Responses to Wikistats reports on article revert trends

  1. Graham87 says:

    Interesting stats! However, in the table of reverts for the English Wikipedia, the page “Portal:Arts” is listed as “Portail:Arts” and its link goes to “100:Arts” instead of the intended location.

  2. Thanks for the post. For me, one of the most interesting things was the overall patterns that you showed (slide 6, from http://stats.wikimedia.org/EN/PlotsPngEditHistoryTop.htm and slide 7). The seasonal patterns (slide 22) were also quite interesting.

    There was an unresolved question about MD5 and the Japanese language edition; I’m wondering if anyone has verified that the MD5 implementation you’re using is handling Japanese characters correctly?

  3. Erik says:

    Jodi, MD5 works on any sequence of bytes, regardless of the content or meaning, be it plain text, images, audio or anything else. There is no relation with language or encoding.

  4. emijrp says:

    Hi Erik;

    Very useful and interesting stats, thanks. Looking at this graph[1], we know that about 20% of anonymous edits are being reverted. I think that is a high ratio, and that we need to reduce it. But if we want to do that, we need to know why these edits are being reverted. I mean, is it vandalism? Good-faith edits without proper style? NPOV? Test edits?

    So, if reverts of anon edits (in English Wikipedia) are due to 90% vandalism, 7% test edits and 3% other, we need to create the best anti-vandal bot in history, or find a way to discourage vandals (show anons a warning above the edit box: “YOUR IP WILL BE STORED. VANDALISM IS FORBIDDEN, YOU WILL BE BLOCKED.”), make blocks longer, etc. But if it is 30% vandalism, 40% tests, and 30% other, we need to show a big warning with a big link to the Sandbox. Today, we show this warning[2], and I think that it can be improved in many ways.

    I think that FlaggedRevs is not so bad. It doesn’t discourage vandals, but the vandalism is not shown because it doesn’t pass the filter (the change is not approved). The downside is that good edits have to wait to be approved too.

    Another question: why do languages such as Japanese, Korean or Chinese have such a low anon revert ratio? Like Esperanto and Latin.

    Regards,
    emijrp

    [1] http://stats.wikimedia.org/EN/PlotRevertsEN.png
    [2] http://en.wikipedia.org/wiki/MediaWiki:Anoneditwarning

  5. Abhay Natu says:

    Hi,

    Thanks for providing this great service. We at Marathi wikipedia wait for these numbers every month to decide our tactical (and to some extent, strategic) approach…however, Marathi (w:mr) does not seem to be updated often. Is there a way you can include that in your monthly updates more regularly?

    Thanks again for the great service.

    Abhay Natu
    Admin/Bureaucrat, w:mr

  6. Erik says:

    I assume you refer to these revert reports specifically. These are produced from the full archive dumps. Processing these monthly is no longer possible with current hardware, due to their ever increasing size.

    For other wikistats reports (and Marathi is supposedly refreshed as often as other Wikipedias) bad luck struck twice. This week we got news that the dump server has been repaired. Almost at the same time I heard that the wikistats server now has a hardware failure. I am eagerly awaiting repair. A new power unit is on order. Sorry to keep you waiting.

  7. Abhay Natu says:

    Thanks for the prompt response. Hopefully, servers will be repaired soon.