
I am trying to fill in the page count of a book on its Wikisource Index page. The following code writes correctly to the page-count parameter ('|Number of pages='). If the parameter is empty, the result looks fine. But if I run the code a second time, the value is concatenated and 67 becomes 6767. How can I check whether the '|Number of pages=' parameter is empty? Or, if the parameter is already filled, how can I make the code skip the page?

The code that writes the value:

#!/usr/bin/env python
# -*- coding: utf-8 -*- 
import pywikibot

indexTitle = 'அட்டவணை:தமிழ் நாடகத் தலைமை ஆசிரியர்-2.pdf'
indexPages = '67'
site1 = pywikibot.Site('ta', 'wikisource')
page = pywikibot.Page(site1, indexTitle)
# Assign the result back to page.text; otherwise save() has nothing new to write
page.text = page.text.replace('|Number of pages=', '|Number of pages=' + indexPages)
page.save(summary='67')
info-farmer

2 Answers


You can use re, the regular expression library, to search for a pattern:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import pywikibot
import re

indexTitle = 'அட்டவணை:தமிழ் நாடகத் தலைமை ஆசிரியர்-2.pdf'
indexPages = '67'
site1 = pywikibot.Site('ta', 'wikisource')
page = pywikibot.Page(site1, indexTitle)
print(page.text)
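# Raw string so that \| and \d reach the regex engine unchanged
res = re.compile(r'\|Number of pages= *(\d+)').search(page.text)
if res:
    # The parameter already holds a number, so skip this page
    print("number of pages is already set to %s" % res.group(1))
else:
    # Assign the result back to page.text so that save() writes the change
    page.text = page.text.replace('|Number of pages=', '|Number of pages=' + indexPages)
    page.save(summary='67')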

Also, if you are processing UTF-8 text, it is better to move to Python 3, which has much better support for it.
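
The skip test above can also be folded into one helper that works on the wikitext string alone, so it can be tried without touching the wiki. This is only a minimal sketch: `fill_if_empty` is a hypothetical helper name (not part of pywikibot), and it assumes the parameter sits on its own line and that a value made only of spaces or tabs counts as empty.

```python
import re

def fill_if_empty(text, param, value):
    """Return (new_text, changed); fill `param` only when its value is blank."""
    # "|param=" followed by nothing but spaces/tabs up to the end of the line.
    pattern = re.compile(
        r'(\|\s*' + re.escape(param) + r'\s*=)[ \t]*$', re.MULTILINE)
    # A function replacement avoids backslash surprises inside `value`.
    new_text, count = pattern.subn(lambda m: m.group(1) + value, text, count=1)
    return new_text, count > 0

# Possible use with the `page` object from the question (assumption):
#   page.text, changed = fill_if_empty(page.text, 'Number of pages', '67')
#   if changed:
#       page.save(summary='67')
```

Running it a second time on the same text leaves the text unchanged, because the pattern only matches a blank value.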

dafnahaktana
  • I am using Python 3 now. The above code skips correctly. Thanks indeed. What regex should I use to replace an existing value? A few books need to be corrected with new data. – info-farmer Mar 26 '18 at 14:46
  • The code's regex works fine for numbers. My native language is Tamil. If the parameter value is non-Latin, what regex should I use to set the skip option? For example, '|Category= மின்னூல்கள்' – info-farmer Apr 12 '18 at 03:51
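
Both follow-up questions can probably be handled on the wikitext string itself. In Python 3, `str` patterns are Unicode-aware, so a Tamil value needs no special regex; matching "everything up to the end of the line" instead of `\d+` should be enough. A rough sketch, assuming one parameter per line and values that contain no `|` or `}` (so no piped links); `get_param` and `set_param` are hypothetical helper names:

```python
import re

def _param_pattern(param):
    # "|param=" and then the value: everything up to a newline, "|" or "}".
    return re.compile(r'(\|\s*' + re.escape(param) + r'\s*=)([^|}\n]*)')

def get_param(text, param):
    """Return the stripped value, '' if empty, or None if the parameter is absent."""
    m = _param_pattern(param).search(text)
    return m.group(2).strip() if m else None

def set_param(text, param, value):
    """Overwrite the first occurrence of `param` with `value`."""
    return _param_pattern(param).sub(lambda m: m.group(1) + value, text, count=1)
```

To skip filled pages, test `get_param(page.text, 'Category')` for a non-empty result; to correct a book, assign the result of `set_param(page.text, 'Number of pages', '70')` back to `page.text` before saving.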

I came across a similar situation. Parsing templates with pywikibot alone did not seem good enough to me (I tried 'extract_templates_and_params_regex_simple' and 'glue_template_and_params' from textlib).

The solution I finally used was mwparserfromhell. This library is more convenient for parsing and changing templates (and their arguments).

There is a potential problem in your code: you are not searching for any particular template, so if two templates on the page happen to use the same argument name, you will change both (you can still ignore that, but just FYI).

Using mwparserfromhell together with pywikibot would look like this (using 'page' from your code):

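import mwparserfromhell

parsed_mw = mwparserfromhell.parse(page.text)
# my_template_name is a placeholder; pass it as matches= and take the first hit
my_template = parsed_mw.filter_templates(matches=my_template_name)[0]
my_template.get('Number of pages').value = '67'

# Convert the parsed wikicode back to a string explicitly
page.text = str(parsed_mw)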
Kosho-b