Improving Automated Citations with Machine Learning

One of our main goals at RefME is ‘Automated, accurate citations’ [1]. Ideally, a good citation tool should let you enter any website address or identifier and quickly generate a correct citation. We want to enable you – the student, the researcher, the librarian – to create citations from any source as you read and write, in the style of your choice, without interrupting your workflow.

You shouldn’t need to worry about whether the source is a news article, a journal paper, a PDF report, a YouTube video or a PowerPoint presentation – the citation tool should handle that for you. As a step towards this, we have added a feature, unique to RefME, that automatically cites the author, title, year and source of a PDF document while you are reading it in the web browser or via our mobile app.

As shown in the screenshots below, you can now use our WebClipper for Chrome to reference a PDF while you are reading it in the browser, automatically determining the reference type and populating all the fields that you need to cite it.

[Screenshots: the WebClipper citing a PDF conference proceedings paper, a PDF government report, and a 700-page PDF government report]

Automatically generating a complete citation for a PDF document is a difficult problem for three main reasons:

  1. There is a huge variety in formatting, size and complexity. The PDF could be anything from a short proceedings paper [2] to a full colour government report [3], some of which can be very large indeed [4] – these three examples are shown above.
  2. Unlike a web page, which uses simple HTML tags to identify pieces of information, the semantics of the content in a PDF cannot easily be interpreted by a computer.
  3. Aside from journal articles, most PDFs do not have an identifier that can simply be looked up in a database to retrieve the citation data. This is particularly true for public sector reports.
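Where an identifier does exist, the lookup step itself is simple. As a rough sketch (the endpoint shown is Crossref’s public REST API, used here purely for illustration; it is not necessarily the service RefME queries):

```python
def crossref_metadata_url(doi: str) -> str:
    """Build the public Crossref REST API URL that returns citation metadata for a DOI."""
    return f"https://api.crossref.org/works/{doi}"

# "10.1000/xyz123" is the placeholder example DOI from the DOI Handbook.
print(crossref_metadata_url("10.1000/xyz123"))
```

A single GET request to that URL returns structured metadata (title, authors, year, journal) ready to drop into a citation – which is exactly what is missing for most reports and white papers.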

So, how do we address this?

At RefME, we are using machine learning to help. In particular, we use an algorithm called Conditional Random Fields (CRF) [5], which has been shown to be effective at solving this problem for research papers [6]. There are some excellent open-source tools that use CRF to do this, such as Grobid [7] and Cermine [8], which provide an out-of-the-box solution, but they don’t always transfer easily to a production environment for the following reasons:

  1. They tend to only work well with research papers, rather than PDFs in general
  2. They do far more than we need, such as extracting the complete PDF document structure and content, which means that –
    • They are slow, which is a problem if you don’t want to wait more than a couple of seconds for your reference, and/or
    • They require significant computing resources to run at scale.

Before discussing our solution, it’s worth saying a bit about how conditional random fields work their magic. CRF is a type of model known as a discriminative classifier [9]. The goal, given a set of inputs x, is to predict a set of output labels y by modelling the conditional probability p(y|x). In our case, the inputs are individual words and their properties: their positions on the page, font size, font weight, capitalisation, whether they exist in a dictionary of known words (such as first names), and so on. The output labels say whether each word is part of the title, part of the author sequence, or part of the source.

So in our case, p(y|x) means: “what is the probability distribution over these labels (title, authors, source) for a word in the PDF, given that the word is ‘Introduction’, appears near the top of the page, is bold and has font size 14?” In this model, both the inputs and outputs are interdependent, so the conditional probability that our word is part of the title also depends on the probability distribution of the output labels (title, author, source) for the previous and following words. This interdependence is shown in Figure 1.
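Ignoring the interdependence between neighbouring labels for a moment, the per-word part of this calculation can be sketched as a weighted sum of feature values followed by a softmax normalisation. All of the features, weights and numbers below are invented for illustration; they are not RefME’s actual model:

```python
import math

# Hypothetical binary features for the word "Introduction", and hand-picked
# weights for each (feature, label) pair, as a CRF would learn during training.
features = {"near_top_of_page": 1.0, "is_bold": 1.0, "font_size_14": 1.0}
weights = {
    ("near_top_of_page", "title"): 2.0, ("near_top_of_page", "author"): 0.5,
    ("near_top_of_page", "source"): 0.2,
    ("is_bold", "title"): 1.5, ("is_bold", "author"): 0.3,
    ("is_bold", "source"): 0.1,
    ("font_size_14", "title"): 1.0, ("font_size_14", "author"): 0.2,
    ("font_size_14", "source"): 0.2,
}
labels = ["title", "author", "source"]

def p_y_given_x(features, weights, labels):
    # Score each label as the weighted sum of active features, then normalise
    # the exponentiated scores so they form a probability distribution.
    scores = {y: sum(v * weights.get((f, y), 0.0) for f, v in features.items())
              for y in labels}
    z = sum(math.exp(s) for s in scores.values())
    return {y: math.exp(s) / z for y, s in scores.items()}

probs = p_y_given_x(features, weights, labels)
print(max(probs, key=probs.get))  # "title" gets the highest probability here
```

A real linear-chain CRF additionally scores transitions between the labels of adjacent words, which is what captures the interdependence described above.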

[Figure 1 image: the words of a paper’s source, title and authors, each with its interdependent CRF output label]

Figure 1.  Adapted from Sutton and McCallum [5]

Given the set of features (position, size, capitalisation, etc.) that we have determined to be important, the model is trained by learning a feature function for each feature: that is, learning the relative weight of each feature (e.g. a word’s position might matter more than its font) and its bias towards each label (e.g. bold may be more predictive of title than of author). These learned weights are then used to maximise the conditional probability, so that the model can say:

‘Given that “Fields” is bold, appears at the top of the page, and is followed by another word that is not bold, that appears lower down the page, and that has been determined to be part of the authors, I’m 90% confident that “Fields” belongs in the title.’

These feature functions are learned by providing the model with the expected output label for a given input. In Figure 1, this would be ‘source’ for each word of ‘Foundations and Trends in Machine Learning’, ‘title’ for each word of ‘An Introduction to Conditional Random Fields’ and ‘author’ for ‘Charles Sutton’ and ‘Andrew McCallum’, along with the features for each of these words. The good news is that there are open source libraries, such as Wapiti [10], that do all these calculations. We ‘just’ need to provide the features, in the form of a template, and their values for thousands of training examples.
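As a simplified sketch of what that training data looks like: sequence labellers in the Wapiti/CRF++ family conventionally take one token per line, with its feature values as whitespace-separated columns, the expected label in the last column, and a blank line between sequences. The feature columns below are invented for illustration and are far fewer than a real model would use:

```python
# Each token carries its feature values plus its expected label (last column).
tokens = [
    # (word, weight, vertical position, label)
    ("An",           "BOLD",  "TOP", "title"),
    ("Introduction", "BOLD",  "TOP", "title"),
    ("to",           "BOLD",  "TOP", "title"),
    ("Conditional",  "BOLD",  "TOP", "title"),
    ("Random",       "BOLD",  "TOP", "title"),
    ("Fields",       "BOLD",  "TOP", "title"),
    ("Charles",      "PLAIN", "TOP", "author"),
    ("Sutton",       "PLAIN", "TOP", "author"),
]

# One line per token; a trailing blank line marks the end of the sequence.
training_lines = [" ".join(t) for t in tokens] + [""]
print("\n".join(training_lines))
```

A template file then tells the trainer which columns (and which combinations of neighbouring tokens’ columns) to turn into feature functions.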

We ended up using the feature-sets and Wapiti wrapper classes developed by Grobid [7], but removed around 90% of the core code base and dependencies, to reduce the deployment footprint from over 1GB down to just over 100MB, and the memory footprint when running with 10 parallel threads from 4GB down to around 1.8GB. In addition, we created new models from a much wider range of open-access PDFs such as government reports, white papers, guidelines and legislation.

The result is a pretty fast service (the median response time is around half a second, most of which is spent loading the PDF from its host) which is getting more accurate all the time, as we learn from the PDFs that you reference to create further training data.

So, what’s next? Our goal is to apply similar machine-learning techniques to accurately automate the citing of any source in a format-agnostic way. In the meantime, as always, we welcome your feedback and comments, while we continue to improve how RefME enables you to create your bibliographies as quickly, accurately and painlessly as possible – just as the bibliography for this blog post was created!

Phil Gooch, Content and Innovation Researcher at RefME

Bibliography

  1. Wright, J. Future EdTech: Innovating to create automated, accurate citations. (2015). at <https://www.refme.com/blog/2015/05/28/future-edtech-conference-2015/>
  2. Lee, G., Lin, J., Liu, C., Lorek, A. & Ryaboy, D. The unified logging infrastructure for data analytics at Twitter. Proceedings of the VLDB Endowment 5, 1771–1780 (2012).
  3. Department of Health. Tackling demand together. (2009). at <http://webarchive.nationalarchives.gov.uk/20130107105354/http:/www.dh.gov.uk/prod_consum_dh/groups/dh_digitalassets/documents/digitalasset/dh_106924.pdf>
  4. The Scottish Government. Scotland’s future. (2013). at <http://www.gov.scot/resource/0043/00439021.pdf>
  5. Sutton, C. & McCallum, A. An introduction to conditional random fields. Foundations and Trends in Machine Learning 4, 267–373 (2011).
  6. Peng, F. & McCallum, A. Information extraction from research papers using conditional random fields. Information Processing & Management 42, 963–979 (2006).
  7. kermitt2. Grobid. GitHub (2015). at <https://github.com/kermitt2/grobid>
  8. CeON. CERMINE. GitHub (2016). at <https://github.com/CeON/CERMINE>
  9. Wikipedia. Discriminative model. Wikipedia (2015). at <https://en.wikipedia.org/wiki/Discriminative_model>
  10. Lavergne, T., Cappé, O. & Yvon, F. Practical Very Large Scale CRFs. in Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics 504–513 (Association for Computational Linguistics, 2010).