Wikipedia talk:Wikipedia Signpost/2013-02-04/News and notes

Discuss this story

  • Article Feedback Tool: I've said this before many, many times, and I'll say it again here. The Foundation's passion for stats and feedback does not always contribute to the improvement of Wikipedia and its management by the volunteer community. WMF projects thrust upon the Wiki have required massive community incentive to carry out the cleanups when they misfire, and reasonable solutions for improvement in quality of new articles required by community consensus have been summarily rejected by the Foundation. While a truly excellent tool in the hands of the right users, NewPagesFeed/CurationTool does not address these issues and has not improved the quantity and quality of new-page patrolling. AfT creates more work for this community than is justified by the net useful information that it is designed to produce. Someone recently stated words to the effect that the Foundation's answer to the community's claim that a car (project) is broken is 'Keep pushing'. The only real solution is to deploy Foundation funds and resources to re-launch development of the Article Creation Workflow as a proper landing page for new users/page creators. Rather than simply wanting quantity instead of quality, the Foundation would probably rejoice at the result, which would greatly reduce the burdens and backlogs in such areas as Articles for Creation, Deletions and AfD, largely resolve the issues surrounding the work of admins and their appointment at WP:RfA, and reduce the endemic hat-collecting of minor rights. Meta areas, including WP:NPP, WP:AfC, WP:AfD, and possibly also the AfT, are a magnet to inexperienced users who cannot, or prefer not to, expand or create content. Kudpung กุดผึ้ง (talk) 02:15, 6 February 2013 (UTC)
    I'm sorry that you don't feel statistics and hard data have a role in helping the volunteer community with its workload, but I must confess to being bemused by how ACTRIAL or the problem(s) with patrolling incoming content have anything to do with AFT5 (or how AFT5 can be creating work for the community when we've said 'if you guys want to turn it off, we'll turn it off' and people seem to be heading in that direction). The Foundation is not looking at quantity instead of quality; it's looking to raise the number of people who can help with maintenance tasks. And yes, sometimes this involves not only training but also making the software easier, as we did with Page Curation, or pointing people towards those tasks that need to be done. The vast majority of users do not engage in meta areas, which is why it would surprise me to find that a majority or substantial chunk of inexperienced users did; to resort to statistics for a moment, I ran a quick database query against the patrolling tables. In the last 30 days, there have been 51 patrollers with fewer than 500 edits - that's 14 percent of the patrollers overall. They are responsible for 484 patrols, which is...6 percent of patrols. If they were doing that terrible a job, presumably people would be un-reviewing their pages - and yet in the time period specified, experienced users (>= 500 edits) unreviewed...8 pages. In total. Not sure if the initial reviews were by new people or not. I'm happy to accept that quantitative and qualitative information go hand in hand, but your argument doesn't seem to be backed up by either as you've presented it. Okeyes (WMF) (talk) 18:45, 6 February 2013 (UTC)
    'In the last 30 days, there have been 51 patrollers with fewer than 500 edits - that's 14 percent of the patrollers overall' - you've just backed it up for me, and it's far too many. The reason their patrols have not been reverted is probably because not many patrollers are patrolling the patrollers - and that's not what we're supposed to be doing. --Kudpung กุดผึ้ง (talk) 12:34, 7 February 2013 (UTC)
    Note also "They are responsible for 484 patrols, which is...6 percent of patrols" - really, the number of patrollers in [tranche] is not useful for looking at 'are they doing it well/badly/causing more work'; the thing that counts is "how many patrols are they doing?". If we have one patroller doing 400 patrols, that makes a much bigger impact on the value of patrolling-as-a-way-of-triaging-junk than 10 patrollers doing 5 each. So, yes, they are 14 percent of patrollers: they are responsible for a much smaller chunk of the work. I certainly agree that patrollers do not exist to answer the quis custodiet problem - but either patrollers aren't seeing bad work, in which case your argument that there is a substantial problem involving poor-quality patrolling is...confusing, or patrollers are seeing bad work, and at no point deciding it's worth undoing. Okeyes (WMF) (talk) 13:03, 7 February 2013 (UTC)
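To illustrate the distinction drawn in this thread between share of patrollers and share of patrols, here is a rough Python sketch that uses only the figures quoted above (51 patrollers at 14 percent, 484 patrols at 6 percent). The totals are back-calculated from rounded percentages, so they are approximations and not results of the actual database query; the helper name `implied_total` is made up for this example.

```python
def implied_total(part: int, percent: float) -> float:
    """Approximate total implied by a count and its (rounded) percentage share."""
    return part / (percent / 100.0)


# Figures quoted in the thread for patrollers with fewer than 500 edits.
new_patrollers = 51   # stated as 14 percent of all patrollers
new_patrols = 484     # stated as 6 percent of all patrols

# Back-calculated totals: approximate, because the percentages are rounded.
total_patrollers = implied_total(new_patrollers, 14)
total_patrols = implied_total(new_patrols, 6)

# Average patrols per patroller in each group.
per_new = new_patrols / new_patrollers
per_experienced = (total_patrols - new_patrols) / (total_patrollers - new_patrollers)

print(f"implied totals: ~{total_patrollers:.0f} patrollers, ~{total_patrols:.0f} patrols")
print(f"patrols per newer patroller:       ~{per_new:.1f}")
print(f"patrols per experienced patroller: ~{per_experienced:.1f}")
```

Under these assumptions the newer group averages fewer than half as many patrols per person as the experienced group, which is the sense in which a 14 percent headcount can correspond to a 6 percent share of the work. It says nothing about the quality of those patrols.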
  • What percentage is useful? Regarding the claim that "Between 30 and 60 percent of all feedback was rated by editors as 'useful'", at Wikipedia:Article Feedback Tool/Version 5/Feedback evaluation#Is this useful? the instructions say "It is only the most entirely useless feedback that should be categorized as 'no' (not useful)." Several editors have worked together to post a random sample of 1000 feedback items (after the anti-abuse filters and excluding anything that an editor has marked as hidden) at User:Guy Macon/Workpage. I welcome the interested reader to look at it and make their own estimate of what percentage is useful. --Guy Macon (talk) 02:43, 6 February 2013 (UTC)
    Yeah; that's actually an outdated description :). Would you like me to pull the categories/descriptions for the most recent tests? Okeyes (WMF) (talk) 17:58, 6 February 2013 (UTC)
    My personal preference is that when WMF publishes the results of a study, it should have two prominent links to "methodology" and "raw data" on the main page of the study. In this particular case the methodology link should tell me, among other things, how the test subjects were selected, what instructions they were given, etc. The raw data should be such that if I want to I can replicate your work. This would bring a welcome level of scientific rigor to these studies. While I am waiting for that to happen, I would like to see a hatnote on anything that is outdated. --Guy Macon (talk) 18:44, 6 February 2013 (UTC)
    Obviously publishing our raw data is not necessarily possible (some of it might be oversighted) but I'll see what I can do. Okeyes (WMF) (talk) 23:04, 6 February 2013 (UTC)
    It might be best to start with the next one. If you know that you are eventually going to publish some raw data, it is pretty easy to make a version with [Name redacted] and [Email redacted] or [Redacted for privacy reasons] as you go along. If you try to go back and do that after the fact, you always have a doubt about whether you missed one. I care far less about this particular result than I do about instilling a mentality in the WMF where they wouldn't dream about not publishing full details about methodology or not publishing raw data. And we haven't even started talking about single-blind vs. double-blind...
If you really want to focus on this particular study, rather than gathering raw data, somebody should start asking why WMF got "Between 30 and 60 percent useful" and my preliminary results are about 10% useful. That's a huge red flag. Is it because only one person cared enough to look at my data and post an estimate? Was 200 a big enough sample? Is it because your study used 3 people? If you personally looked at the data would you come back and say that your estimate is 30%, not 10%? Is it because in both cases the person doing the evaluation was self-selected? If I saw results like that I would try to rip my own methodology to shreds and then I would try to rip the methodology of the other study to shreds. Somebody is doing something wrong. My attitude toward science: http://xkcd.com/242/ --Guy Macon (talk) 03:00, 7 February 2013 (UTC)
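On the question of whether 200 is a big enough sample, a back-of-the-envelope check is possible. The Python sketch below computes a 95% Wilson score interval for a proportion, assuming (as an illustration only) that "about 10% useful" means exactly 20 of 200 items, that the 200 were drawn at random, and that a single rater's judgement is taken at face value. The function name `wilson_interval` is chosen for this example and is not taken from any tool mentioned in the thread.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple:
    """Wilson score interval for a binomial proportion (z=1.96 gives roughly 95%)."""
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half_width, centre + half_width


# Assumption for illustration: 20 of 200 sampled items judged useful (10%).
low, high = wilson_interval(20, 200)
print(f"95% interval: {low:.1%} to {high:.1%}")
```

Under those assumptions the upper end of the interval stays well below 30%, so sample size alone would not seem to explain the gap between the two estimates. Differences between raters and between the instructions they were given are a separate matter that this calculation does not address.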
Frankly, I can't answer those questions; I'm not the researcher here ;p. I'll poke Aaron and see if he can comment. Okeyes (WMF) (talk) 11:29, 7 February 2013 (UTC)
Poke received. First of all, I want to direct you to the official report I wrote, which includes the strategy for drawing both a random and stratified sample and the details of my methodology. I'm sad to find that this report was clearly referenced. You're not the first to have missed it: meta:Research:Article_feedback/Final_quality_assessment. We had 18 Wikipedians evaluate at least 50 feedback items individually (though some evaluated more than 200). All feedback submissions were evaluated by two different people. The 30-60% number is a non-statistically founded, conservative minimization of these two evaluations per item. In the study, we found that 66% of feedback was marked *useful* by at least one evaluator ("best" in the report) and 39% of feedback was marked useful by both evaluators ("worst" in the report). Here's the breakdown of the four category classes we asked the evaluators to apply:
  • Useful - This comment is useful and suggests something to be done to the article.
  • Unusable - This comment does not suggest something useful to be done to the article, but it is not inappropriate enough to be hidden
  • Inappropriate - This comment should be hidden: examples would be obscenities or vandalism.
  • Oversight - Oversight should be requested. The comment contains one of the following: phone numbers, email addresses, pornographic links, or defamatory/libelous comments about a person.
Note that these exact descriptions appear as tooltips in multiple places in the feedback evaluation tool. If you'd like to personally replicate the study, I'd be happy to pull another random sample for you and load it up in the evaluation tool. --EpochFail (talk • work) 15:42, 7 February 2013 (UTC)
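The "best" and "worst" figures described in the comment above (useful according to at least one evaluator versus useful according to both) can be sketched in a few lines of Python. The ratings below are invented toy data, not items from the study, and `best_and_worst` is a name made up for this illustration; the report's actual computation may differ.

```python
from typing import List, Tuple


def best_and_worst(ratings: List[Tuple[str, str]]) -> Tuple[float, float]:
    """Return (share marked useful by at least one rater, share marked useful by both)."""
    n = len(ratings)
    best = sum(1 for a, b in ratings if a == "useful" or b == "useful") / n
    worst = sum(1 for a, b in ratings if a == "useful" and b == "useful") / n
    return best, worst


# Hypothetical toy data: one (evaluator 1, evaluator 2) pair per feedback item.
sample = [
    ("useful", "useful"),
    ("useful", "unusable"),
    ("unusable", "unusable"),
    ("inappropriate", "useful"),
    ("unusable", "unusable"),
]

best, worst = best_and_worst(sample)
print(f"best (at least one): {best:.0%}, worst (both): {worst:.0%}")
```

The gap between the two numbers reflects how often the two evaluators disagree about an item, which is one reason a single-rater estimate can land outside the range.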
Before I respond, let me reiterate that I think everyone at the WMF is doing a good job and has the right goals. This is a discussion about possible improvements, starting with some future study. Those who are looking for a club to beat WMF with should look elsewhere.
meta:Research:Article_feedback/Final_quality_assessment is a very useful overview of the methodology used, but in my opinion an additional detailed methodology would be a Good Thing. (I am about to write some questions, but please don't post the answers. They are examples of what should be in a detailed methodology -- I cannot explain what I am talking about without giving examples of questions that the overview does not answer.) For an example, the overview says "We assigned each sampled feedback submissions to at least two volunteer Wikipedians." A detailed methodology would have said something like this:
"Between 3AM and 4AM on December 24th, we posted a request for volunteers (in French) on Talk:Mojave phone booth and on the main page of xh.wikipedia.org. 43 people volunteered, and we rejected 20 of them for being confirmed sockpuppets of User:Messenger2010 (See Wikipedia:Long-term abuse/Messenger2010) and rejected 11 of them because Guy drank too much and decided he doesn't like editors with "e" in their username. That left us with Jimbo and a six-year-old girl (username redacted for privacy reasons). We then..."
Unlike "We assigned each sampled feedback submissions to at least two volunteer Wikipedians", the above details exactly how those volunteers were chosen. Again, I don't care how they were chosen. I just want future studies to contain a detailed methodology page that answers questions like this or questions about the RNG used. To pick another example, the post above this one says "We had 18 Wikipedians evaluate at least 50 feedback items individually (though some evaluated more than 200)." That detail is not found in the methodology overview. --Guy Macon (talk) 16:51, 7 February 2013 (UTC)
The specific 'how they were chosen' list, I can provide, actually. The purpose of the study was to compare the rating of feedback that did get rated to feedback that got missed out on, suspecting that people overwhelmingly checked feedback for high-profile articles. In order to get some consistency between the two sets of numbers, I pulled from the database a list of all users who had, in the 30 days before we started the recruitment process, monitored more than 10 pieces of feedback in some fashion. The users in question were then sent a talkpage invitation going 'would you like to participate in this?'. I appreciate that's more a specific example to highlight a general point than anything else - and I'm going to bear your general point in mind when writing up something I've been working on recently, actually - but I thought I'd address it :). Okeyes (WMF) (talk) 18:50, 7 February 2013 (UTC)
  • Thanks for writing about this RFC; I wouldn't have noticed it otherwise, and it's an important subject. -- phoebe / (talk to me) 22:42, 6 February 2013 (UTC)
  • The highest quality article feedback I've seen is almost always on the article talk pages. To be useful, feedback generally needs to be longer than the short tweets I generally see from the Feedback Tool. I often find useful comments on talk pages that sit unanswered for months, or even years, before I address the issues raised. So we already have a backlog on talk pages, without increasing it with more chatter from this tool. I don't feel that even 10% of the AFT comments are useful, but I've only looked at these comments on a very limited number of articles. Wbm1058 (talk) 23:37, 7 February 2013 (UTC)