
I'm building a gamification web app to help with Wikimedia's community health.

I want to find which editors have edited the same pages as 'Jake' the most, within the last week, or his last 100 edits, or something like that.

I know what my query should do, but I can't figure out which tables I need, because the Wikimedia DB layout is a mess.

So, I want to obtain something like

Username   Occurrences   Pages
Mikey      13            Obama, ...

So the query would be something like (I'm accepting suggestions):

  1. Get the pages that the user 'Jake' has edited in the last week.
  2. Get the contributors of those pages in the last week.
  3. For each of those contributors, get the pages they have edited in the last week, see which ones match the pages 'Jake' has edited, and count them.
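The counting logic of the three steps above can be sketched in plain Python. This is only a hypothetical helper: `jake_pages` and `editors_by_page` are assumed to be collected beforehand (e.g. from `user.contributions()` and `page.contributors()`), and the function name is made up for illustration.

```python
from collections import Counter

def co_editor_counts(jake_pages, editors_by_page, username='Jake'):
    """Count how often other editors appear on the pages Jake edited.

    jake_pages: set of page titles Jake edited in the window (step 1).
    editors_by_page: dict of page title -> set of its editors in the
    same window (step 2). Returns a Counter of co-editors (step 3).
    """
    counts = Counter()
    for page in jake_pages:
        for editor in editors_by_page.get(page, ()):
            if editor != username:  # skip Jake's own edits
                counts[editor] += 1
    return counts

print(co_editor_counts(
    {'Obama', 'Paris'},
    {'Obama': {'Jake', 'Mikey'}, 'Paris': {'Mikey', 'Ana'}}))
```

The same shape works whether the two inputs come from Pywikibot or from a query against the DB replicas; only the data collection differs.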

I've tried doing something simpler in Pywikibot, but it's very, very slow (20 seconds for the last 500 contributions of Jake).

There I only get the edited pages, get the contributors of each page, and count them, and it's still very slow.

My pywikibot code is:

from collections import Counter
from pywikibot import Site, User

site = Site(langcode, 'wikipedia')
user = User(site, username)
contributed_pages = set()
for page, oldid, ts, comment in user.contributions(total=100, namespaces=[0]):
    contributed_pages.add(page)

return get_contributor_ocurrences(contributed_pages, site, username)

And the function

def get_contributor_ocurrences(contributed_pages, site, username):
    contributors = []
    for page in contributed_pages:
        for editor in page.contributors():
            # Skip bots and the user's own edits
            if site.isBot(editor) or editor == username:
                continue
            contributors.append(editor)
    return Counter(contributors)

PS: I have access to the DB replicas, which I guess are way faster than the Wikimedia API or Pywikibot.

Destokado
    DB replicas are definitely much faster. See https://quarry.wmflabs.org/query/56404. Queries related to a user's latest 500 edits or edits in the past few weeks take less than 0.5 seconds each. – user15517071 Jul 01 '21 at 22:02

1 Answer


You can filter the data to be retrieved with timestamp parameters, which decreases the time needed a lot. Refer to the documentation for their usage. Here is a code snippet that gets the data with Pywikibot using timestamps:

from collections import Counter
from datetime import timedelta
import pywikibot
from pywikibot.tools import filter_unique
site = pywikibot.Site()
user = pywikibot.User(site, username)  # username must be a string

# Set up the generator for the last 7 days.
# The timestamp format doesn't matter when using pywikibot.Timestamp
stamp = pywikibot.Timestamp.now() - timedelta(days=7)
contribs = user.contributions(end=stamp)

contributors = []

# filter_unique is used to remove duplicates.
# The key uses the page title
for page, *_ in filter_unique(contribs, key=lambda x: str(x[0])):
    # note: editors is a Counter
    editors = page.contributors(endtime=stamp)
    print('{:<35}: {}'.format(page.title(), editors))
    contributors.extend(editors.elements())

total = Counter(contributors)

This prints a list of pages; for each page it shows the editors and their contribution counts within the given time range. Finally, total should have the same content as your get_contributor_ocurrences function above.

It requires some additional work to get the table you mentioned above.
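For instance, if you also record which shared pages each editor appeared on, rendering the Username/Occurrences/Pages table becomes a formatting exercise. This is only a sketch: `format_table` is a hypothetical name, and `pages_by_editor` is assumed to be built inside the loop above (e.g. appending `page.title()` for each editor found on that page).

```python
def format_table(pages_by_editor):
    """pages_by_editor: editor name -> list of shared page titles,
    one entry per page the editor has in common with the target user.
    Returns the table as a string, most frequent co-editor first."""
    rows = sorted(pages_by_editor.items(),
                  key=lambda item: len(item[1]), reverse=True)
    lines = ['{:<15}{:<13}{}'.format('Username', 'Occurrences', 'Pages')]
    for editor, pages in rows:
        lines.append('{:<15}{:<13}{}'.format(
            editor, len(pages), ','.join(sorted(set(pages)))))
    return '\n'.join(lines)

print(format_table({'Mikey': ['Obama', 'Paris'], 'Ana': ['Obama']}))
```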

xqt