Light data analysis

Light data analysis#

To understand the nature of the data better, we first do some light data analysis.

data_dir = 'Physics_2020_Dimensions'
df = pd.read_pickle(os.path.join(data_dir, 'dimensions_data_clean.pkl')) 

print('Altogether we are dealing with a total of {} publication items in the category Physical Sciences published in the year 2020'
      .format(len(df)))


all_labels_full = [item for sublist in df.Labels_str.values.tolist() for item in sublist] 
all_labels = list(set(all_labels_full))

multilabel_combs = (set([' '.join(sublist) for sublist in df.Labels_str.values.tolist()]))

print('\nThese are the {} subfields we are dealing with:\n'.format(len(all_labels)))
print(', '.join(all_labels))
print('\nIn the dataset these 10 labels are distributed over {} different multilabel combinations.'.format(len(multilabel_combs)))

Altogether we are dealing with a total of 140717 publication items in the category Physical Sciences published in the year 2020

These are the 10 subfields we are dealing with:

Nuclear and Plasma Physics, Synchrotrons and Accelerators, Astronomical Sciences, Medical and Biological Physics, Quantum Physics, Space Sciences, Classical Physics, Particle and High Energy Physics, Atomic Molecular and Optical Physics, Condensed Matter Physics

In the dataset these 10 labels are distributed over 96 different multilabel combinations:

Visualize the distribution of the publication items over the different subfields. The distribution is quite heterogenous with the highest number of publication items labeled Condensed Matter Physics and the lowest number of items labeled Medical and Biological Physics.

labels, counts = np.unique(all_labels_full, return_counts=True)

fig, ax = plt.subplots()

ax.bar(labels, counts, align='center',color='#86bf91')
ax.set_xticks(ax.get_xticks())  # just get and reset whatever you already have to avoid a bug in matplotlib
ax.set_xticklabels(labels, rotation=90)

plt.show()

_images/249c5ebee8be4f3702d2f75eb0cb4e04267405a5bc84435a0284321b1cc51aec.png

Remember that we are dealing with a multilabel classification. To get a feel for the distribution of the number of different labels per item in the dataset we plot the distribution.

labels_nr = df.Labels_num.apply(lambda x: len(x))
df = df.assign(Labels_nr=labels_nr.values)

df.hist(column='Labels_nr', bins=4, grid=False, color='#86bf91', rwidth=0.9, figsize=(3,4))
plt.show()

_images/6a0238ac254e7e2483f929c1c12d0c6a462d3c21d64e8529e541d18481aa6d50.png

In the majority of cases the publications are assigned to only one subfield, about a quarter of the items has two labels, less then 5 percent have three labels, the number of items with four labels is insignificant, no publication item is assigned to more than 4 subfields.

Quickly inspect the co-occurence graph of the labels:

G = nx.from_edgelist((c for n_nodes in df.Labels_str.values.tolist() for c in combinations(n_nodes, r=2)),
                     create_using=nx.MultiGraph)

# visualize graph
fig, ax = plt.subplots(figsize=(20,20))

pos = nx.draw_spring(G, with_labels = True, ax=ax)

_images/22ed3079e40fbb7f9c0185f5889f3ca57653b55e04d7d7e724669e487a14bae7.png

The co-occurence graph is not too informative, as expected Space Sciences and Astronomical Sciences have very similar co-occurrences. Medical and Biological Physics and Classical Physics stand out somewhat as they never occur together, and likewise not with either Space Sciences or Astronomical Sciences.

Further analysis of the co-occurence of labels:

# pearson correlation for label combinations
single_label_corr = single_label_bool.corr(method = "pearson")

# frequency for label combinations
labels_int = single_label_bool.astype(int)
labels_freq_mat = np.dot(labels_int.T, labels_int)
labels_freq = pd.DataFrame(
    labels_freq_mat,
    columns = all_labels,
    index = all_labels
)

The frequency matrix of label co-occurences:

labels_freq

	Nuclear and Plasma Physics	Synchrotrons and Accelerators	Astronomical Sciences	Medical and Biological Physics	Quantum Physics	Space Sciences	Classical Physics	Particle and High Energy Physics	Atomic Molecular and Optical Physics	Condensed Matter Physics
Nuclear and Plasma Physics	28159	7605	603	73	743	289	307	11872	621	298
Synchrotrons and Accelerators	7605	9626	113	77	78	10	4	1388	161	29
Astronomical Sciences	603	113	15094	0	21	1981	4	1516	151	4
Medical and Biological Physics	73	77	0	2217	0	0	0	2	4	5
Quantum Physics	743	78	21	0	30787	0	1317	863	7981	3882
Space Sciences	289	10	1981	0	0	7170	0	435	5	0
Classical Physics	307	4	4	0	1317	0	12389	208	281	488
Particle and High Energy Physics	11872	1388	1516	2	863	435	208	23423	271	147
Atomic Molecular and Optical Physics	621	161	151	4	7981	5	281	271	21142	1671
Condensed Matter Physics	298	29	4	5	3882	0	488	147	1671	32562

Co-occurence of labels; frequency and Pearson matrices as a heatmaps:

import seaborn as sn
fig, (ax1, ax2) = plt.subplots(1,2,figsize = (10,4))
ax1.title.set_text('Label co-occurence frequency matrix')
ax2.title.set_text('Label co-occurence Pearson matrix')
sn.heatmap(labels_freq, cmap = "Blues", ax=ax1)
sn.heatmap(single_label_corr, cmap = "Blues", ax=ax2, yticklabels=False)
plt.show()

_images/c2454818a0d077aa4dc541ee203788ff50055945ad53311fb35d66a1c02ee484.png