I use tidytext, tm and quanteda for text mining.
I am trying to:
- filter a tibble of plain, preprocessed text by the presence of a legal citation
- count the occurrences of the same citation per text document
Unfortunately, I am not very experienced with regex.
Here is a paragraph that shows typical legal citations:
4.1. Die selbstständige Tätigkeit als Arzt oder Ärztin bedarf einer Bewilligung des Kantons, auf dessen Gebiet sie ausgeübt wird (Art. 34 MedBG). Die Bewilligung wird erteilt, wenn die gesuchstellende Person ein entsprechendes eidgenössisches Diplom besitzt (Art. 36 Abs. 1 lit. a MedBG) und vertrauenswürdig ist sowie physisch und psychisch Gewähr für eine einwandfreie Berufsausübung bietet (Art. 36 Abs. 1 lit. b MedBG). Die Bewilligung wird entzogen, wenn ihre Voraussetzungen nicht mehr erfüllt sind oder nachträglich Tatsachen festgestellt werden, auf Grund derer sie hätte verweigert werden müssen (Art. 36 MedBG).
Swiss law is generally structured as follows (https://www.admin.ch/opc/de/classified-compilation/20040265/index.html):
- Article (Art. x, where x is a number)
- Paragraph (Abs. x, where x is a number)
- Letter (lit. x, where x is a lowercase letter)
- depending on the Code of Law, even more subheadings may be present, such as "Satz 2"
- Code of Law (mixed-case letters)
Moreover, items 2-4 are optional or can be combined, e.g.:
Art 34. MedBG
Art. 42 Abs. 2 und 100 Abs. 1 MedBG
Even though it would be very nice to have a regex that captures every substructure of a given Article and Code of Law (there are hundreds of them, many with tiny differences in 2.-4.; if you have any idea how I could do that automatically, let me know ;D), I am only interested in the Article and the Code of Law (Articles 1-67 and MedBG in this case).
My approach
I cleaned the text by lowercasing it with tolower, removed punctuation with tm::removePunctuation, and got back to tidy data using tidytext::tidy().
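The cleaning step can be sketched roughly like this (a base-R approximation; the gsub call below mimics what tm::removePunctuation does, and docs_raw with id/text columns is an assumed input):

```r
# Base-R approximation of the cleaning step (assumption: input is a
# character vector of raw document text, as in the tibble shown below).
clean_text <- function(x) {
  x <- tolower(x)             # drop "tall" (upper-case) letters
  gsub("[[:punct:]]", "", x)  # roughly what tm::removePunctuation does
}

clean_text("Art. 36 Abs. 1 lit. b MedBG")
# "art 36 abs 1 lit b medbg"
```

One caveat worth noting: stripping all punctuation also removes the periods in "art." and "abs.", so the regex must target whichever form (dotted or undotted) the cleaned text actually contains.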
> head(docs_tidy %>% select(id,text))
# A tibble: 5 x 2
id text
<chr> <chr>
1 31.12.2015_9C_911_2015 "\n \n \nbundesgericht \ntribunal fédéral \ntribunale federale \ntribunal federal \n \n \n\n \n \n ~
2 31.12.2015_9C_910_2015 "\n \n \nbundesgericht \ntribunal fédéral \ntribunale federale \ntribunal federal \n \n \n\n \n \n ~
3 31.12.2015_9C_934_2015 "\n \n \nbundesgericht \ntribunal fédéral \ntribunale federale \ntribunal federal \n \n \n\n \n \n ~
4 31.12.2014_9C_904_2014 "\n \n \nbundesgericht \ntribunal fédéral \ntribunale federale \ntribunal federal \n \n \n\n \n \n ~
5 31.12.2014_9C_907_2014 "\n \n \nbundesgericht \ntribunal fédéral \ntribunale federale \ntribunal federal \n \n \n\n \n \n ~
What I miss: I thought about a regex that searches for 1 (Article) and skips the expressions from 2-4 until it reaches 5 (Code of Law):
art. 1 (code to skip the unnecessary items from 2-4, e.g. abs. 1 100){0,until medbg is reached} medbg
If it is not possible to stop at a specific point, it could stop after skipping about 8 items (that is the maximum I expect between 1. and 5.).
lawCitation <- c("art. 1 medbg","art. 42 abs. 2 und 100 abs. 1 medbg","art. 1 abs. 1 lit a medbg","art. 36 abs. 1 lit. b medbg","art. 1 satz 2 bgg und abs. 1 lit a lit b","art. 22 totally random number 21 medbg")
grepl("REGEX",lawCitation)
for art. 1 and medbg should return: TRUE FALSE TRUE FALSE FALSE FALSE
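One way to get that behaviour (a sketch, not battle-tested against the real corpus): instead of skipping a fixed number of items, allow only the known subheading tokens — abs., lit./lit, satz, und — plus bare numbers and single letters between the article number and the code, so any other word (or a different code) breaks the match:

```r
lawCitation <- c("art. 1 medbg",
                 "art. 42 abs. 2 und 100 abs. 1 medbg",
                 "art. 1 abs. 1 lit a medbg",
                 "art. 36 abs. 1 lit. b medbg",
                 "art. 1 satz 2 bgg und abs. 1 lit a lit b",
                 "art. 22 totally random number 21 medbg")

# Between "art. 1" and "medbg", permit only subheading tokens,
# numbers, and single letters. "totally" or "bgg" break the match.
filler  <- "(?: (?:abs\\.|lit\\.?|satz|und|[0-9]+|[a-z]))*"
pattern <- paste0("\\bart\\. 1", filler, " medbg\\b")

grepl(pattern, lawCitation, perl = TRUE)
# TRUE FALSE TRUE FALSE FALSE FALSE
```

If you prefer the "stop after about 8 items" idea, replace the `*` with `{0,8}` to cap the number of filler tokens instead of (or in addition to) whitelisting them.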
if I had a good regex, I would go on like this:
# search docs for a specific Article from a specific Code of Law
dplyr::filter(docs_tidy, grepl("REGEX",text)) -> filtered
# count n of citations per document
filtered %>% group_by(id) %>% mutate(citations = stringr::str_count(text, "REGEX"))
Or I would try to write a function that searches my tibble for all Articles (art. 1-67) of medbg and counts them.
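A sketch of such a function, under the same token-whitelist assumption as above (the helper name count_article is mine, and id/text column names are assumed from the tibble shown earlier):

```r
library(dplyr)

# Count citations of one article of a given code in every document.
# Assumes a tibble with columns id and text, as in head(docs_tidy).
count_article <- function(docs, n, code = "medbg") {
  filler  <- "(?: (?:abs\\.|lit\\.?|satz|und|[0-9]+|[a-z]))*"
  pattern <- sprintf("\\bart\\. %d%s %s\\b", n, filler, code)
  docs %>%
    mutate(article   = n,
           citations = vapply(gregexpr(pattern, text, perl = TRUE),
                              function(m) sum(m > 0), integer(1)))
}

# All articles 1-67 of medbg, one row per document and article:
# all_counts <- bind_rows(lapply(1:67, count_article, docs = docs_tidy))
```

gregexpr returns -1 when there is no match, so sum(m > 0) yields 0 for documents without the citation and the true match count otherwise.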