I use tidytext, tm and quanteda for text mining.
I am trying to:
- filter a tibble of plain, preprocessed text by the presence of a legal citation
- count the occurrences of the same citation per text document
Unfortunately, I am not very experienced with regex.
Here is a paragraph that shows typical legal citations:
4.1. Die selbstständige Tätigkeit als Arzt oder Ärztin bedarf einer Bewilligung des Kantons, auf dessen Gebiet sie ausgeübt wird (Art. 34 MedBG). Die Bewilligung wird erteilt, wenn die gesuchstellende Person ein entsprechendes eidgenössisches Diplom besitzt (Art. 36 Abs. 1 lit. a MedBG) und vertrauenswürdig ist sowie physisch und psychisch Gewähr für eine einwandfreie Berufsausübung bietet (Art. 36 Abs. 1 lit. b MedBG). Die Bewilligung wird entzogen, wenn ihre Voraussetzungen nicht mehr erfüllt sind oder nachträglich Tatsachen festgestellt werden, auf Grund derer sie hätte verweigert werden müssen (Art. 36 MedBG).
Swiss law is generally structured as follows (https://www.admin.ch/opc/de/classified-compilation/20040265/index.html):
- Article (Art. x, where x is a number)
- Paragraph (Abs. x, where x is a number)
- Letter (lit. x, where x is a lowercase letter)
- depending on the Code of Law, even more subheadings may be present, such as "Satz 2"
- Code of Law (mixed-case letters)
Moreover, items 2-4 are optional or can be combined, e.g.:
Art 34. MedBG
Art. 42 Abs. 2 und 100 Abs. 1 MedBG
Even though it would be very nice to have a regex that captures every substructure of a given Article and Code of Law (there are hundreds of them, many with tiny differences in 2.-4.; if you have any idea how I could do that automatically, let me know ;D), I am only interested in the Article and the Code of Law (Articles 1-67 and MedBG in this case).
My approach
I cleaned the text by lowercasing it with tolower, removed punctuation with tm::removePunctuation, and got back to tidy data using tidytext::tidy().
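The cleaning step can be sketched roughly like this (a base-R approximation; the gsub call below mimics what tm::removePunctuation does, and docs_raw with id/text columns is an assumed input):

```r
# Base-R approximation of the cleaning step (assumption: input is a
# character vector of raw document text, as in the tibble shown below).
clean_text <- function(x) {
  x <- tolower(x)             # drop "tall" (upper-case) letters
  gsub("[[:punct:]]", "", x)  # roughly what tm::removePunctuation does
}

clean_text("Art. 36 Abs. 1 lit. b MedBG")
# "art 36 abs 1 lit b medbg"
```

One caveat worth noting: stripping all punctuation also removes the periods in "art." and "abs.", so the regex must target whichever form (dotted or undotted) the cleaned text actually contains.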
> head(docs_tidy %>% select(id,text))
# A tibble: 5 x 2
id text
<chr> <chr>
1 31.12.2015_9C_911_2015 "\n \n \nbundesgericht \ntribunal fédéral \ntribunale federale \ntribunal federal \n \n \n\n \n \n ~
2 31.12.2015_9C_910_2015 "\n \n \nbundesgericht \ntribunal fédéral \ntribunale federale \ntribunal federal \n \n \n\n \n \n ~
3 31.12.2015_9C_934_2015 "\n \n \nbundesgericht \ntribunal fédéral \ntribunale federale \ntribunal federal \n \n \n\n \n \n ~
4 31.12.2014_9C_904_2014 "\n \n \nbundesgericht \ntribunal fédéral \ntribunale federale \ntribunal federal \n \n \n\n \n \n ~
5 31.12.2014_9C_907_2014 "\n \n \nbundesgericht \ntribunal fédéral \ntribunale federale \ntribunal federal \n \n \n\n \n \n ~
What I miss: I thought about a regex that searches for 1 (Article) and skips the expressions from 2-4 until it reaches 5 (Code of Law):
art. 1 (code to skip the unnecessary items from 2-4, e.g. abs. 1 100){0,until medbg is reached} medbg
If it is not possible to stop at a specific point, it could stop after skipping about 8 items (that is the maximum I expect between 1. and 5.).
lawCitation <- c("art. 1 medbg","art. 42 abs. 2 und 100 abs. 1 medbg","art. 1 abs. 1 lit a medbg","art. 36 abs. 1 lit. b medbg","art. 1 satz 2 bgg und abs. 1 lit a lit b","art. 22 totally random number 21 medbg")
grepl("REGEX",lawCitation)
for art. 1 and medbg should return: TRUE FALSE TRUE FALSE FALSE FALSE
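One way to get that behaviour (a sketch, not battle-tested against the real corpus): instead of skipping a fixed number of items, allow only the known subheading tokens — abs., lit./lit, satz, und — plus bare numbers and single letters between the article number and the code, so any other word (or a different code) breaks the match:

```r
lawCitation <- c("art. 1 medbg",
                 "art. 42 abs. 2 und 100 abs. 1 medbg",
                 "art. 1 abs. 1 lit a medbg",
                 "art. 36 abs. 1 lit. b medbg",
                 "art. 1 satz 2 bgg und abs. 1 lit a lit b",
                 "art. 22 totally random number 21 medbg")

# Between "art. 1" and "medbg", permit only subheading tokens,
# numbers, and single letters. "totally" or "bgg" break the match.
filler  <- "(?: (?:abs\\.|lit\\.?|satz|und|[0-9]+|[a-z]))*"
pattern <- paste0("\\bart\\. 1", filler, " medbg\\b")

grepl(pattern, lawCitation, perl = TRUE)
# TRUE FALSE TRUE FALSE FALSE FALSE
```

If you prefer the "stop after about 8 items" idea, replace the `*` with `{0,8}` to cap the number of filler tokens instead of (or in addition to) whitelisting them.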
if I had a good regex, I would go on like this:
# search docs for a specific Article from a specific Code of Law
dplyr::filter(docs_tidy, grepl("REGEX",text)) -> filtered
# count n of citations per document
filtered %>% group_by(id) %>% mutate(citations = stringr::str_count(text, "REGEX"))
Or I would try to write a function that searches my tibble for all Articles (art. 1-67) of medbg and counts them.
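A sketch of such a function, under the same token-whitelist assumption as above (the helper name count_article is mine, and id/text column names are assumed from the tibble shown earlier):

```r
library(dplyr)

# Count citations of one article of a given code in every document.
# Assumes a tibble with columns id and text, as in head(docs_tidy).
count_article <- function(docs, n, code = "medbg") {
  filler  <- "(?: (?:abs\\.|lit\\.?|satz|und|[0-9]+|[a-z]))*"
  pattern <- sprintf("\\bart\\. %d%s %s\\b", n, filler, code)
  docs %>%
    mutate(article   = n,
           citations = vapply(gregexpr(pattern, text, perl = TRUE),
                              function(m) sum(m > 0), integer(1)))
}

# All articles 1-67 of medbg, one row per document and article:
# all_counts <- bind_rows(lapply(1:67, count_article, docs = docs_tidy))
```

gregexpr returns -1 when there is no match, so sum(m > 0) yields 0 for documents without the citation and the true match count otherwise.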