Introduction¶
The research program History of the Max Planck Society (MPG) aims at a comprehensive analysis of the MPG’s history embedded in the contexts of contemporary history and history of science.
Beyond the many facets of research across its almost 90 institutes, the corresponding “localized” histories of research in different fields and clusters, and its involvement with contemporary political and social events, the MPG can also be seen as one research institution among several of its kind worldwide.
To capture the MPG as a whole, one can look at the combined scientific output of the organization and compare its statistics and international reception to those of other research institutions. Additionally, it is desirable to include the topics of research in such an analysis, e.g. by using (self-)attributed research categories of publishers or by compiling lists of the vocabulary used and clustering these for similarities.
The main challenge is to find sources that are reliable and broad enough for this endeavor. The origin of the analysis presented here is a dataset published by Springer Nature, available from the Springer Nature Linked Data platform. This dataset contains metadata from Nature and Springer publications dating back to 1839. Beyond the sheer size of the dataset (~90 GB of text files), the main practical difficulty lies in consolidating and cleaning the data into a usable format. The outline of the task ahead is as follows.
Outline¶
Preprocess raw data¶
In a first set of Jupyter Notebooks, the downloaded data is preprocessed to have
information on every author and affiliation available in the dataset.
The datasets, as available online, are not ordered by publication year, so sorting them is another important step in the procedure.
To have more reliable information on the data quality, some statistics are generated, e.g. how many of the papers contain author or affiliation information.
Other general statistics include the number of published papers per year, the number of journals per year, or the size of author teams per publication.
Furthermore, with a view to later applying machine learning techniques, the distribution of languages in the corpus is analyzed.
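The statistics described in this section can be sketched in a few lines of Python. This is only a minimal illustration on toy records: the field names (`year`, `journal`, `authors`, `language`) and the function `corpus_statistics` are assumptions made for this example and do not reflect the actual schema of the Springer Nature dataset or the code in the notebooks.

```python
from collections import Counter
from statistics import mean

# Toy records; the field names are hypothetical, not the real dataset schema.
records = [
    {"year": 1990, "journal": "J1", "authors": ["A", "B"], "language": "en"},
    {"year": 1985, "journal": "J2", "authors": [], "language": "de"},
    {"year": 1990, "journal": "J2", "authors": ["C"], "language": "en"},
]


def corpus_statistics(records):
    """Compute simple quality and size statistics for a non-empty list of records."""
    if not records:
        raise ValueError("records must not be empty")

    # The raw data is not ordered by publication year, so sort first.
    ordered = sorted(records, key=lambda r: r["year"])

    # Collect the distinct journals seen in each year.
    journals_per_year = {}
    for r in ordered:
        journals_per_year.setdefault(r["year"], set()).add(r["journal"])

    # Only records that carry author information count towards team size.
    with_authors = [r for r in ordered if r["authors"]]

    return {
        "papers_per_year": dict(Counter(r["year"] for r in ordered)),
        "journals_per_year": {y: len(j) for y, j in journals_per_year.items()},
        "share_with_authors": len(with_authors) / len(ordered),
        "mean_team_size": mean(len(r["authors"]) for r in with_authors) if with_authors else 0,
        "languages": dict(Counter(r["language"] for r in ordered)),
    }


stats = corpus_statistics(records)
```

For the full dataset of roughly 90 GB, the same counts would presumably have to be accumulated file by file rather than held in memory at once.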
Focus on Max Planck Society¶
In a first step to capture the MPG as a whole, its research output in the corpus is compared to that of other institutions, e.g. the University of California System or Harvard University. Following this, the identified MPG publications are consolidated into a dataset that allows each institute to be attributed to one of three Sektionen, which are organizational units of the MPG. This allows first interpretations of the differences in output and collaboration strategies between the Sektionen.
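The attribution of institutes to Sektionen could, for instance, be done with a lookup table that is matched against the affiliation strings of each publication. The following is a rough sketch under that assumption; the institute names, the Sektion labels, and the helper `sektionen_for` are placeholders for illustration and not the mapping used in the notebooks.

```python
from collections import Counter

# Placeholder lookup: neither the institute names nor the labels are real data.
SEKTION_BY_INSTITUTE = {
    "institute a": "Sektion 1",
    "institute b": "Sektion 2",
    "institute c": "Sektion 3",
}


def sektionen_for(affiliations):
    """Return the set of Sektionen matched by a publication's affiliation strings."""
    found = set()
    for affiliation in affiliations:
        text = affiliation.lower()
        for institute, sektion in SEKTION_BY_INSTITUTE.items():
            # Naive substring match; real affiliation strings need more cleaning.
            if institute in text:
                found.add(sektion)
    return found


# Toy publications with hypothetical affiliation strings.
publications = [
    {"title": "P1", "affiliations": ["Max Planck Institute A, Munich"]},
    {"title": "P2", "affiliations": ["Institute B", "Institute C"]},
    {"title": "P3", "affiliations": ["Some University"]},
]

output_per_sektion = Counter()
cross_sektion = 0  # publications involving more than one Sektion
for pub in publications:
    sektionen = sektionen_for(pub["affiliations"])
    output_per_sektion.update(sektionen)
    if len(sektionen) > 1:
        cross_sektion += 1
```

Counting publications per Sektion and publications that span several Sektionen gives a first, coarse view of output and collaboration.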
Keywords and multilayer networks¶
To come closer to the final goal of finding overall research trends, the titles of publications contained in the MPG corpus are then analyzed using a technique which takes into account the surroundings of words and word groups (ngrams) in a weighted fashion. These scores for ngrams form the basis to construct a multilayer network which connects publications, authors and ngrams in titles for each year between 1945 and 2005. Using a multilayer clustering algorithm makes it possible to identify groups of important entities in the network and follow their evolution over the years.
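To make the network structure more concrete, the sketch below extracts ngrams from titles and builds one edge list per year that links publications to their authors and to the ngrams in their titles. It is a simplified stand-in: the ngrams are not weighted by their surroundings, no clustering is performed, and the names `ngrams` and `build_layers` as well as the record fields are assumptions made for this example.

```python
from collections import defaultdict


def ngrams(title, n_max=2):
    """Return all word ngrams of length 1..n_max from a lowercased title."""
    tokens = title.lower().split()
    return [
        " ".join(tokens[i:i + n])
        for n in range(1, n_max + 1)
        for i in range(len(tokens) - n + 1)
    ]


# Toy publications; identifiers and field names are hypothetical.
publications = [
    {"id": "p1", "year": 1950, "authors": ["a1"], "title": "Quantum field theory"},
    {"id": "p2", "year": 1950, "authors": ["a1", "a2"], "title": "Field theory"},
]


def build_layers(publications):
    """Build one edge list per year; each node is tagged with its layer name."""
    layers = defaultdict(list)
    for pub in publications:
        pub_node = ("publication", pub["id"])
        for author in pub["authors"]:
            layers[pub["year"]].append((pub_node, ("author", author)))
        # Each distinct ngram of the title is linked once to the publication.
        for gram in sorted(set(ngrams(pub["title"]))):
            layers[pub["year"]].append((pub_node, ("ngram", gram)))
    return dict(layers)


layers = build_layers(publications)
```

Because authors and ngrams recur across publications and years, nodes such as `("ngram", "field theory")` tie the yearly edge lists together, which is what a clustering step could then exploit.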
Interactivity¶
This documentation offers several ways to interact with the content.
Annotation and highlighting¶
Using the hypothes.is service, all pages of the document can be annotated or highlighted. This requires an account on hypothes.is. If you set the visibility of a comment to public, it will be visible to everyone.
Interactive Binder instance¶
Furthermore, the top menu offers access to run the notebooks on a Binder instance offered by GESIS. There you can change parameters and re-run parts of the analysis. Note that due to the file size, the first two notebooks cannot be run on Binder.
Source code¶
The full source code for this Jupyter Book is available online; see the corresponding GitHub button. If you find problems with the analysis or code, feel free to open an issue using the Issue button.
Download¶
Each page can be downloaded either in text format or as a Jupyter Notebook, depending on its source. All data is published under a CC BY license.