
I need to find a way to tag references to publications in text. We've been doing this with regular expressions, but that won't work for these new patterns.

Some examples (the language is German):

Herzog (August 2012), Einkommensteuerskriptum Band 1, S 8

Achatz/Bieber in Achatz/Kirchmayr, Körperschaftsteuergesetz (2011)

Heinrich in Quantschnigg/Renner/Schellmann/Stöger, Die Körperschaftsteuer (2013) § 7 Rz 32

Raab/Renner in Quantschnigg/Renner/Schellmann/Stöger/Vock, Die Körperschaftsteuer, 24. Lfg., § 8 Tz 292,293

Quantschnigg/Renner/Schellmann/Stöger/Vock (Hrsg), KStG23 (2013) § 13 Rz 67

So a reference mostly starts with the author names and the title of the publication, but after that it becomes pretty diverse. It might not look that bad in these examples, but I could give a bunch more that look different again.

So I thought this might be a task for machine learning. However, having very little experience in that field, I find it hard to pick the right technique.

I found POS tagging, but that doesn't seem to be the way to go here. I also stumbled upon CRF (conditional random fields), but there is little material on it that would get a beginner like myself started.

I've done some classification and regression in sklearn but that's about it.

Could anyone point me in the right direction?

pypat
  • What you probably want to do is "named entity recognition". POS tagging probably won't help you. Conditional random fields are probably a good technique for this. – CAFEBABE Jan 22 '16 at 09:20
  • As I mentioned in my post, I have thought about CRF (noticed the typo in my post there), but there is hardly any information on how to get started with it. There are a few libraries but few tutorials on how to create my own models with them. – pypat Jan 22 '16 at 09:27
  • Read more on `machine-learning` and `natural-language-processing` and possibly you can get a clearer idea. Because the task sounds a little fuzzy for now. Once you know a little more of what's possible and what's available, you should be able to break the problem down into multiple sub-tasks and handle them =) See http://stackoverflow.com/questions/34791491/where-to-start-natural-language-processing-and-ai-using-python/34791965#34791965 – alvas Jan 22 '16 at 10:47
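
A minimal sketch of how the named entity recognition / CRF idea from the comments is usually set up as sequence labeling, using only the standard library. Everything here is an assumption for illustration and not something from the thread: the helper names (`tokenize`, `token_features`, `bio_labels`), the `REF` tag, the BIO labeling scheme, the chosen features, and the leading word "Siehe" added as surrounding context. Per-token feature dicts like these are the kind of input that CRF libraries such as sklearn-crfsuite typically take, but training is left out here.

```python
import re

# Tokens are runs of word characters (Unicode-aware, so umlauts work)
# or single punctuation marks such as "(", "/", "," and "§".
TOKEN_RE = re.compile(r"\w+|[^\w\s]")

# Hypothetical list of abbreviations that often appear in these citations.
MARKERS = {"S", "Rz", "Tz", "Lfg", "Hrsg", "§"}


def tokenize(text):
    return TOKEN_RE.findall(text)


def token_features(tokens, i):
    """Describe token i with simple hand-made features (an assumed feature set)."""
    tok = tokens[i]
    return {
        "lower": tok.lower(),
        "is_title": tok.istitle(),
        "is_digit": tok.isdigit(),
        "is_punct": not tok[0].isalnum(),
        "is_year": bool(re.fullmatch(r"(19|20)\d{2}", tok)),
        "is_marker": tok in MARKERS,
        # Neighbouring tokens give the model some context.
        "prev": tokens[i - 1].lower() if i > 0 else "<BOS>",
        "next": tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>",
    }


def sentence_features(tokens):
    return [token_features(tokens, i) for i in range(len(tokens))]


def bio_labels(tokens, start, end, tag="REF"):
    """Label tokens[start:end] as one reference span, everything else as O."""
    labels = ["O"] * len(tokens)
    for i in range(start, end):
        labels[i] = ("B-" if i == start else "I-") + tag
    return labels


# "Siehe" is an assumed context word; the rest is the first example above.
text = "Siehe Herzog (August 2012), Einkommensteuerskriptum Band 1, S 8"
tokens = tokenize(text)
X = sentence_features(tokens)            # one feature dict per token
y = bio_labels(tokens, 1, len(tokens))   # hand-annotated span for training

for tok, label in zip(tokens, y):
    print(f"{label:6} {tok}")
```

With a few hundred sentences annotated this way, the lists of `X` and `y` per sentence would be the training data for a sequence model; the model then predicts a B-/I-/O label per token, and contiguous B-/I- runs are read back as tagged references.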

0 Answers