Tim Evans' site on Complex Networks

Author: T.S.Evans

The Importance of Being Close

Finding the most important pieces of information in a mass of data has always been a key task in any job. In network science the tools used to do this are known as “centrality measures”. My most recent paper (with Bingsheng Chen) looks at the relationship between two of the oldest and most important examples, degree and closeness.

Networks (for example see Coscia 2021) are a powerful way to organise the large data sets faced by modern researchers in many fields, from physics through biology to social sciences. In their simplest form, the objects of interest are represented by the nodes of a network. So a node might represent a protein in biology or a person in social science. The edges of a network then represent relationships between pairs of nodes.

Zachary's Karate club network.
Zachary’s Karate club network. The size of the node is proportional to the degree while the darker the colour of the node, the higher the closeness. The shapes indicate the two clusters in the network.

One of the most common questions we ask about our data is: what are the most important pieces of information? In a network representation, this corresponds to finding the most important nodes. Not surprisingly, this was one of the earliest problems tackled in network science, where the importance of nodes is quantified by the centrality measures mentioned above. These have been developed in social network analysis since the 1950s. The name centrality measure comes from an analogy with a city, where the most important institutions are found at the geographical centre to give the easiest access to the largest number of people. The arrival of large electronic datasets prompted a flurry of new research on centrality measures. For instance, in 1998 the founders of Google developed their own centrality measure, PageRank (Brin & Page 1998). This uses a simple diffusion process on a network of webpages (nodes) connected by hyperlinks (edges), with the idea that relevant pages in a search can be placed in order of their importance, the web page with the highest PageRank going to the top of the list returned by Google.
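The diffusion idea behind PageRank is easy to play with. Below is a minimal sketch using networkx's built-in implementation on a tiny invented "web"; the page names are made up purely for illustration.

```python
# A minimal sketch of the PageRank idea using networkx; the tiny "web" below
# and its page names are invented purely for illustration.
import networkx as nx

web = nx.DiGraph()
web.add_edges_from([
    ("home", "about"), ("home", "blog"),
    ("about", "home"), ("blog", "home"),
    ("external", "blog"),
])

# pagerank() runs the diffusion (random-surfer) process to convergence.
scores = nx.pagerank(web, alpha=0.85)
for page, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{page:10s} {score:.3f}")
```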

Finding the most important nodes is such a critical job that this problem has been studied many times, producing hundreds (if not thousands) of centrality measures. These encode different interpretations of the “importance” of a node, and the context of the data and of the application may point a researcher to one definition over another. Nevertheless, one is left with the feeling that with so many centrality measures, many may well be producing similar information, so there is a large redundancy in the centrality measures in the literature. For instance, David Schoch produced a periodic table of centrality measures (Schoch 2016), a nice visualisation of a large number of centrality measures that places similar ones close together, imitating the periodic table of the elements. To be more quantitative, one can measure the correlations between the values of different centrality measures for the same nodes. What one finds from these studies is that there are clear correlations and yet no clear pattern has emerged. As Valente et al. (2008) put it, “The amount of correlation between degree, betweenness, closeness, and eigenvector indicates that these measures are distinct, yet conceptually related”. This leaves researchers uncertain what to do when choosing a centrality measure for their problem.

Our Work

In my work with Bingsheng Chen (Evans & Chen 2022) we looked at two of the oldest and most popular centrality measures: degree, and closeness, which was defined by Bavelas (1950). The degree k_v of a node v is simply the number of its nearest neighbours, the number of nodes connected by a single edge to the node v. Closeness is defined in terms of the length of the shortest path between two nodes. A path in a network is just a sequence of nodes where each node is connected by an edge to the previous node in the sequence. The length is the number of edges traversed as you move along the path. You can then calculate the average distance from a node v to all other nodes. The smaller this average distance is, the closer node v is to the other nodes in the network. So a low value of this average indicates which nodes are the most central and, using the analogy of a city mentioned above, we might define these to be the most important. Centrality measures are defined so that the larger the value of a centrality measure, the more important a node is. So the centrality measure closeness c_v for a node v is defined to be the inverse of the average distance to all other nodes.

Example network showing shortest path and degree
One of the two shortest paths from the red square to the red circle is shown by the green edges. The red square node has three nearest neighbours so the degree of this node is 3. The number beside each node shows the distance to the red square node. The average distance from the red square node to the other nodes is (2+2+1+1+1)/5 = 7/5, giving the closeness of the red square node as 5/7.
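Both measures take only a few lines of code to compute. The sketch below uses networkx on a six-node network that is an assumed reconstruction consistent with the numbers in the caption (three nearest neighbours, two nodes at distance two); the actual figure may contain extra edges.

```python
# A six-node network consistent with the caption's numbers; this is an assumed
# reconstruction for illustration, not necessarily the exact network drawn.
import networkx as nx

G = nx.Graph()
G.add_edges_from([("v", "a"), ("v", "b"), ("v", "c"),   # three nearest neighbours
                  ("a", "d"), ("b", "e")])              # two nodes at distance 2

print(G.degree("v"))                     # degree k_v = 3
print(nx.closeness_centrality(G)["v"])   # (n-1)/sum of distances = 5/7 ≈ 0.714
```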

What we suggested in our Communications Physics paper (Evans & Chen, 2022) is that you can do better for degree and closeness than the qualitative statement of Valente et al. (2008) that they are “conceptually related”.

In our work, we picture the shortest paths used to calculate closeness as a tree which is a network with no loops and where there is only one path between any two nodes. In our case, our shortest-path tree for node v has every node from the original network, the edges in the tree are a subset of the edges in the original network, and the path in the tree from the root node v to any other node is always a shortest path in the full network. Such trees can always be defined for each node though they are not in general unique. Shortest-path trees are easily created using standard breadth-first search algorithms.
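For example, here is a minimal sketch using networkx's breadth-first search to build one such tree, on Zachary's karate club network which ships with networkx; the choice of root here is arbitrary and need not match the node labelled 1 in the figure below.

```python
# Minimal sketch: build one shortest-path tree with a breadth-first search.
import networkx as nx

G = nx.karate_club_graph()
root = 0
tree = nx.bfs_tree(G, root)              # one shortest-path tree rooted at `root`

# The tree spans every node and its root-to-node distances match those in G.
d_tree = nx.shortest_path_length(tree, root)
d_full = nx.shortest_path_length(G, root)
assert tree.number_of_nodes() == G.number_of_nodes()
assert all(d_tree[n] == d_full[n] for n in G)
print(tree.number_of_edges(), "tree edges out of", G.number_of_edges(), "in the network")
```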

The key assumption in our algebra is that the branches of these trees will look statistically similar for most networks. The reason for this is that the number of nodes at distance L from the root grows exponentially in most networks. This is the reason why we see the phenomenon of six degrees of separation in most networks. That is, even in very large networks, the length of the shortest path from any node to any other node is usually small, typically of order log(N) (where N is the number of nodes in the network) rather than the N^(1/D) you would expect if the network were embedded in a D-dimensional Euclidean space (e.g. if our network was a regular cubic lattice, so D=3).

Example of a shortest-path tree
One example of a shortest-path tree for the red square node as the root node. The edges of the tree are shown in solid green lines while the two black dashed edges are in the original network but are left out of this tree.
Shortest-path tree for Zachary Karate club network.
One of the shortest-path trees rooted on node 1 for the Zachary Karate club network. Nodes of the same colour and shape are the same distance from the root node, 1.

Outline of Derivation

You can skip to the next section on results but I will give a brief outline of the analytical part of the work.

We used the simplest model for the branches of the tree. Namely, we assume that on average the degree of each node in the tree is z. That is, one edge links each node to its parent, the node one step closer to the root node v, while the remaining (z-1) edges lead to nodes one step further away. Given that this means the number of nodes which are L steps away from the root grows in proportion to (z-1)^L, i.e. exponentially fast, we have to stop the growth somewhere, and we do so by demanding that the total number of nodes in the tree is N, the number of nodes in the original network. Since we know that the distance in the tree from any node to the root node is along a shortest path in the full network, we can use the shortest-path tree to calculate the average distance from v to all other nodes and hence its inverse, the closeness centrality measure c_v for node v.

So far the degree of node v has not appeared in our calculation. However, the first step of the shortest-path tree must be to the nearest neighbours of v. So the shortest-path tree from node v starts with k_v branches, and after that we assume these branches are all statistically equivalent and the degree in the rest of the tree is on average z.
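One way to reconstruct the counting argument, as a sketch under the assumptions above and glossing over the careful treatment of the final, partially filled shell of nodes:

```latex
% Sketch of the counting argument for the idealised shortest-path tree.
% n_v(\ell) is the number of nodes at distance \ell from the root v.
\begin{align}
  n_v(\ell) &= k_v\,(z-1)^{\ell-1}, \qquad 1 \le \ell \le L,\\
  N - 1 &= \sum_{\ell=1}^{L} n_v(\ell)
          \;\approx\; \frac{k_v\,(z-1)^{L}}{z-2}
  \quad\Longrightarrow\quad
  L \;\approx\; \frac{\ln N - \ln k_v}{\ln(z-1)} + \text{const},\\
  \frac{1}{c_v} &= \frac{1}{N-1}\sum_{\ell=1}^{L} \ell\, n_v(\ell)
          \;\approx\; L - \text{const}
          \;=\; b - \frac{\ln k_v}{\ln(z-1)}.
\end{align}
```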

Analytical Result

Putting this all together gives us our central result: the inverse of closeness is simply b − ln(k_v)/ln(z−1), and we have an expression for b in terms of the number of nodes N and the average degree of the tree z. In general, it is hard to determine the average degree of the tree z in all but the simplest network models, so in practice we treat z as a free parameter which we fit to the data. Since we usually have large numbers of nodes, hundreds and often many more, we lose little by treating b as a free parameter too.
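In practice, testing the result amounts to a straight-line fit of the inverse of closeness against the logarithm of degree. Here is a minimal sketch, using networkx and a Barabási-Albert test network rather than the real data and the full fitting procedure of the paper.

```python
# Minimal sketch: test 1/c_v ≈ b - ln(k_v)/ln(z-1) on a Barabasi-Albert network
# with an ordinary least-squares fit of 1/closeness against ln(degree).
import numpy as np
import networkx as nx

G = nx.barabasi_albert_graph(2000, 5, seed=42)
c = nx.closeness_centrality(G)            # dict: node -> closeness
nodes = list(G.nodes())
k = np.array([G.degree(n) for n in nodes], dtype=float)
inv_c = np.array([1.0 / c[n] for n in nodes])

slope, b = np.polyfit(np.log(k), inv_c, 1)   # slope should be close to -1/ln(z-1)
z_fit = 1.0 + np.exp(-1.0 / slope)
print(f"intercept b = {b:.2f}, fitted effective tree degree z = {z_fit:.2f}")
```

The same two fitting lines apply unchanged to any network for which you can compute degree and closeness.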

The only reason we can do these calculations is that we have made some sweeping generalisations in order to reduce our networks to simple trees, which represent branching processes used widely in physics and science in general. Given the simple picture used here, it would not be unreasonable to assume that even if our relationship between degree and closeness works for some simple numerical network models, the relationship will fail badly on most real-world networks. And yet, if there is one thing our experience as physicists has taught us, it is that there are situations where we can ignore the messy microscopic details and still arrive at results that are broadly true on large scales across many different systems. The derivation of the ideal gas law using ideas from statistical physics is a classic example. So how good is our result on real examples?

Results from Real Networks

Plot of inverse closeness against degree for artificial networks.
Plot showing the inverse of closeness against degree on a log scale for networks produced using the Erdős-Rényi model (left) or the Barabási-Albert model (right). The networks have 1000 (red, bottom), 2000 (yellow, middle) and 4000 (blue, top) nodes and an average degree of 10.0.

The answer turns out to be that our relationship works far better than we expected. On three simple artificial models for networks, we get fits with reduced chi-square values between 1.0 and 1.3. The good fit is clear visually in the figure above. That is great, but the real surprise came when we tried it on over one hundred networks based on real data. One of our referees suggested that since we had the code working for an initial set of 18 real-world networks, it would be relatively easy to look at more networks. So for the published version we used the python package graph-tool to download all the networks kept in the Netzschleuder repository (both are maintained by Tiago Peixoto) and we processed all of the networks there, provided no further programming effort was required. A number of these networks were too big for our code or had some technical issue that broke our code (perhaps non-ASCII characters), but we ended up looking at a further 112 networks from real-world data sets without imposing any further bias. To our surprise, we still found our relationship held well (reduced chi-square of less than 2.0) for around two-thirds of these networks, while around 80% had a reduced chi-square below 3.0. Of those that failed, a quick look at their description would tell us that they were always unlikely to work well: they were very small, or very dense (that is, every node is connected to most of the other nodes), or had some other special structural feature not covered in our simple picture.

Conclusions

So what does our result do for network science? First, our relationship is a clear quantitative example of how two centrality measures are related, in this case degree and closeness, two of the oldest and best-known examples. That our relationship is non-linear highlights a limitation of most previous studies, which look at possible linear correlations between centrality measures, e.g. by using Pearson correlation coefficients. If non-linear relationships exist between centrality measures, these would not show up so clearly, and this might explain why the previous studies have not reached any very specific conclusions. Second, our results tell researchers that, to a first approximation, we can forget closeness as most of the information is already in the degree of a node, which is much quicker to calculate. However, the corollary of this is that if you do calculate closeness, you can now remove the general dependence of closeness on degree by comparing the measured value against the value predicted using our formula. The difference from our predicted value of closeness is genuine new information in the closeness measurement that is not given by the degree value.

The final message is that the success of our method reminds us that for most networks, the exponential growth in the number of nodes as a function of the distance L from any node is the fundamental property that makes complex networks distinct from systems constrained by our three-dimensional world. 

Note

This post is a revised version of one I wrote for the Nature Portfolio Physics Community entitled “Network Centrality“.

References

Bavelas, A. 1950.  Communication patterns in task-oriented groups. The Journal of the Acoustical Society of America 22, 725–730.

Brin, S. & Page,  L. 1998. The anatomy of a large-scale hypertextual Web search engine. Computer Networks and ISDN systems, 30, 1–7.

Coscia, M. 2021. The Atlas for the Aspiring Network Scientist.  https://doi.org/10.48550/arXiv.2101.00863

Evans, T. S. & Chen, B., 2022. Linking the Network Centrality Measures Closeness and Degree. Communications Physics 5, 172, https://doi.org/10.1038/s42005-022-00949-5 .

Schoch, D. 2016. Periodic table of network centrality. URL http://schochastics.net/sna/periodic.html

Valente, T. W., Coronges, K., Lakon, C. & Costenbader, E. 2008. How correlated are network centrality measures? Connections 28, 16.

Recommended Books on Complex Networks

Network Books
Some of my Recommended Books on Complex Networks

These are texts and books I generally recommend at the start of my introductory course on Complex Networks given to third-year physicists at Imperial. The students are not expected to read any of these texts and the list is as much a way for me to keep track of the lecture notes and books I have found on the subject.

I give a short note on each one, including my opinion of what each text offers. This often includes a price that will not be up to date but hopefully it still gives a sense of the relative price of texts. The pricing of texts pitched for a similar audience (as defined by my interests) does not appear to be proportional to size. Interestingly, Cambridge University Press has five books in my list with a large overlap in their material. As this list was aimed at my students, there is an emphasis here on anything that has introductory material and which is available as a free legal download.

I split my list into two categories. First, those which give a more general and a less technical introduction, most of which put the work on networks into a wider context. The second section contains the texts with much more technical detail, including those I turn to when teaching these topics or when doing my own research.

General Reading on Networks

This is background material of a less technical type. These recommended books put the ideas into a much broader context.

  • Guido Caldarelli and Michele Catanzaro, Networks: A Very Short Introduction (2012) Oxford University Press ISBN: 978-0199588077.
    [Pleasingly short. I have scanned this but not yet read it, though colleagues have recommended it to me. £5]
  • Philip Ball, Critical Mass (2004) ISBN: 9780099457862.
    [General popular introduction to ideas about Complexity in general, including networks. Readable background material. £7]
  • Duncan Watts, Six Degrees: The Science of a Connected Age (2004) Vintage Press, ISBN: 978-0099444961.
    [A non-technical overview focussing on networks. I found this very balanced even though it is written by one of the first authors to bring the field to the attention of physicists. £9]
  • Tim Evans, Complex Networks, Contemporary Physics 45 (2004) 455–474
    [cond-mat/0405123].
    [My own Complex Network review aimed as an introduction to general physics researchers written some time ago. It has only the very basic technical issues making it short. Download Complex Networks free as cond-mat/0405123 from arxiv.org. Free.]
  • Network Literacy: Essential Concepts And Core Ideas
    [Free. It is a very short ‘poster’ overview to the main concepts.]
  • H.Sayama et al., What are essential concepts about networks? (2016). Journal of Complex Networks, 4, 457–474.

More Technical and Detailed Reading

Technical and more detailed sources, with the most useful for my course given first. If you look up one of these books online and check the recommendations made by the seller, you will find many more, some with an emphasis on a particular area or on one programming package.

  • M. Coscia, The Atlas for the Aspiring Network Scientist (2021).
    [This is an excellent, comprehensive, well presented and readable book. It has the right amount of technical detail for physicists starting out in this topic. Far more material than covered in this course (over six hundred pages) so you will need to pick out what you need. Don’t let the first serious chapter (chapter 2) on probability put you off, I don’t think you will need it for most of this text. Best of all it is free to download the Atlas for the Aspiring Network Scientist and there is also a printed version which when shipped from the US cost me around £50.]
  • Aaron Clauset, Network Analysis and Modeling lecture notes (2013).
    [Longer course so greater depth than mine at Imperial but with similar topics covered in a different order. Free. Clauset’s blog, Structure and Strangeness, and twitter feed, (@aaronclauset), are also informative.]
  • Mark Newman, Networks, (2nd ed. 2018) Oxford University Press, ISBN: 9780198805090.
    [A very large book with 800 pages in the second edition which makes it physically less easy to use as a reference. A lot of technical detail in the style of a physicist rather than a mathematician. It has a lot of discussion so not a bad place to learn from but a beginner may get a bit lost by the comprehensive nature of the book. A good place to go when you want to go beyond the basic material in an introductory course and when you want more detailed discussions. £42.]
  • David Easley and Jon Kleinberg, Networks, Crowds, and Markets: Reasoning About a Highly Connected World (2010) Cambridge University Press, ISBN: 978-0521195331.
    [Well written book. Focus of later parts is rather different from my course but the first sections on networks in general should be very useful. Download a free copy of Networks, Crowds, and Markets or purchase a printed copy £32.]
  • Michael Gastner, Networks: Theory and Application (2011).
    [Notes from a longer and more mathematical course than mine, given in the Maths department at Imperial in Autumn 2011. The general approach is very similar to my course but with much more mathematical detail. Accessible to Physics students after some effort. A good place for more general proofs of topics. Unfortunately, these notes are not publicly available unless the author can be persuaded to put them out there (image copyright issues need to be sorted out). Free.]
  • Robert Hanneman and Mark Riddle, Introduction to social network methods, (2005).
    [Has a social science focus making the later parts less relevant to my course. However, since it is free you might still find the early sections useful. Free download of Introduction to social network methods. Free.]
  • Filippo Menczer, Santo Fortunato, and Clayton A. Davis, A First Course in Network Science, Cambridge University Press, (2020), ISBN: 9781108471138,
    [I saw a copy of this book on the publisher’s stand and it looks good and at the right level for students on my course and will take them beyond that too. £35]
  • Vito Latora, Vincenzo Nicosia, Giovanni Russo, Complex Networks: Principles, Methods and Applications, Cambridge University Press, (2017) ISBN: 9781107103184.
    [Well written with examples and exercises. Covers pretty standard ground. I did find a bit more useful mathematical detail on one or two issues that I did not find in other texts. A very solid textbook worth a look. £55]
  • Ted G. Lewis, Network Science: Theory and Applications, Wiley, (2009) ISBN: 0470331887.
    [Has exercises. Looks comprehensive. Table 1.1 p2 for a timeline of papers as a history of network science. £70].
  • Nino Boccara, Modeling Complex Systems, Springer, (2010) ISBN: 1441965610.
    [General coverage of complex systems including chaos, recurrence equations. In particular has chapter 6 on spatial models including cellular automata and sociophysics models, chapter 7 on networks, and chapter 8 on power laws e.g. power-law vs log-normal distributions. £63]
  • Guido Caldarelli and Alessandro Chessa, Data Science and Complex Networks, (2016), Oxford University Press, ISBN: 9780199639601,
    [Yet to read this in detail. Remarkably thin for the price (144 pages in total, closer to 120 for actual text) but looks quite useful. £40]
  • Maarten van Steen, Graph Theory and Complex Networks, (2010).
    [Free download from web site. Style is rather different from this course but again general introductory sections on networks could be useful. Free download of Graph Theory and Complex Networks or printed version £15.]
  • Reuven Cohen and Shlomo Havlin, Complex Networks: Structure, Robustness and Function, Cambridge University Press, (2010) ISBN: 0521841569.
    [The text often seems a bit too brief. Some chapters are very short and feel like the transcript of a lecture rather than more extensive notes expanded from an hour-long presentation. Good basic mathematics. See table 3.1 for a nice list of properties of standard networks. Contains exercises. Smaller and lighter (238 pages) than Newman’s book (771 pages but similar text density) yet no cheaper than Newman’s book and less comprehensive. £42.]
  • A.-L. Barabási, Network Science, Cambridge University Press, (2016), ISBN: 9781107076266.
    [Very nice layout and great graphics so perhaps the most beautiful of these Networks books. Interestingly, the preface acknowledges in more detail than normal a list of other people who contributed to the development of the book. Very readable but not as comprehensive as others. For instance, there is no section on centrality measures. The online version of Network Science is (and will remain) free and has additional resources. The printed version is available for £42 hardback.]
  • Stanley Wasserman and Katherine Faust, Social Network Analysis: Methods and Applications (Structural Analysis in the Social Sciences), Cambridge University Press, (1994), ISBN: 978-0521387071.
    [This is a standard text written from the viewpoint of social science, hence the title. It is, however, quite mathematical in places. It is a standard reference book for Social Network Analysis as well as a textbook. I’ve never found it very readable. I generally only use it to check to see the origin of some concept as many ideas in Networks go back beyond the 1990s. I don’t recommend this for beginners but it is a book I use as a reference in my work. I feel that my dislike of the text is my problem, not the book’s. £47]
  • Freeman, L. C., The Development Of Social Network Analysis: A Study In The Sociology Of Science, ΣP Empirical Press, Vancouver, (2004), ISBN: 1-59457-714-5.
    [A history of Social Network Analysis from one of the leading authors in the field. Useful background for those deeply interested in network science. Note a free copy of The Development Of Social Network Analysis is available on ResearchGate. ]

What can models tell us about history?

One of my interests has been trying to use ideas from complexity and network science to look at historical questions. In part, this is because historical data is often really challenging to a researcher like me who is used to modern data sets. At the same time, that gap in our knowledge means that there is a real opportunity for modelling to contribute something positive and substantial to historical debates. Part of the challenge is to think about the role of uncertainty in models: the effect of uncertainty in the data, and how to quantify uncertainty in the models. My physics background has taught me that a conclusion without some sense of the accuracy of that conclusion carries little practical meaning. My recent paper with Ray Rivers, Was Thebes Necessary? Contingency in Spatial Modelling (Frontiers in Digital Humanities, 2017, 4, 8), was all about finding ways to probe uncertainties.

So I was delighted to support a suggestion from Chiara Girotto that, along with Ray Rivers, we should organise a session on the challenges faced by modelling in an archaeological context at the next EAA (European Association of Archaeologists) meeting. The session has been accepted, so we are really looking forward to this session in Barcelona, 5-8 September 2018. We are looking for contributions from anyone with something interesting to say. All types of contribution will be considered: for instance, presenters might want to highlight the limitations of this approach, or show how some technique from the physical sciences might be adapted to this context. I’d love to hear about examples built upon good archaeological methods so I can learn more about the issues that archaeologists may take for granted but that I, with no formal training in archaeology, haven’t even thought about. So do think about contributing or attending.

I really enjoyed the EAA in Maastricht in 2017. A lot was outside my immediate research but still intriguing to me and I learnt a lot. There was also a solid core of modellers that made it both an exciting and relevant conference for me. I can see that our session, entitled “Challenging the models: Reflections of reality?”, fits in well with several other sessions so again there is a really good strand through the meeting to keep me entertained and busy. At the time of writing, the deadline for submissions was 15 February 2018.

Session 545 at EAA Barcelona 5-8 September 2018: Challenging the models: Reflections of reality?

Content:

Currently modelling is a central part of archaeological behavioural research. Many papers focus on the ability to extract the reflections of past social interactions and structures from a variety of archaeological and environmental sources. Especially in the light of highly theoretical archaeological modelling in pre- and protohistory, this often leads to environmentally driven, Darwinian-like models, devoid of cognitive human factors, fuzzy decision making, and the possibility of non-rational choice. Considering all the assumptions that have to be implemented in social interaction models, we have to question whether a model might be too complex to operate on the basis of our data. Has it entered the vicious circle of self-affirmation? Are our models questioning our own lack of knowledge? Where are we on an epistemic-ontic scale?
In our session we wish to address and discuss current problems in archaeological behavioural modelling. Questions tackled might include
• whether we are creating Processualism 2.0?
• how narratives are encoded in models, as discussed from a theoretical, methodological or practical viewpoint?
• how the inclusion of social theory and the fuzziness of human decision making alters the results from a model?
• what is the impact of assumptions on modelling results?
• what is the impact of archaeological data on a model’s outcome?
• how we can use inherent capabilities and inabilities of models to better interpret and narrate our approximations of reality?

Main organiser:

Chiara Girotto  (Goethe University Frankfurt, Germany)

Co-organisers:

Tim Evans (Imperial College London, U.K.), Ray Rivers (Imperial College London, U.K.)

 

Networks, Geometry and Clustering

Clustering is a vital tool when handling data, making it a central part of data science. By grouping similar objects together, it helps us find what we are looking for. I don’t go to a bakery to find a book. Clustering is part of a wider idea in science: we are always faced with thousands of potential or actual measurements but we need to focus on the few which are relevant to the process we are trying to understand. I do not need to know the nuclear properties of the constituents of a gas to understand its properties, while measurements of temperature, pressure and volume throw a lot of light on that problem. Whatever branch of science we are working in, we are always trying to reduce the dimensionality of our data, to use the language of statistics and data analysis.

Many of the techniques we use will need a measure of distance and it is most natural to call upon the everyday distance as defined by any ruler – formally the Euclidean distance d, where for example d^2 = x^2 + y^2 + z^2 for the distance between the origin and a point at (x, y, z) in three dimensions.

However, what if time is present? Time is very different from space. Mathematically it leads to new types of geometry for space-times, Lorentzian rather than Euclidean. The simplest example is the Minkowski space-time used for studying special relativity. James Clough and I have been using Minkowski space as part of our study of networks which have a sense of time built into them – Directed Acyclic Graphs (see my blog on Time Constrained Networks for instance). Essentially these networks have a time associated with each vertex and then any edges present always point in one direction in time, say from the more recent vertex to an older one. Typically the time is a real physical time but for these types of network one can always construct an effective if artificial time coordinate.
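For concreteness, and as standard special relativity rather than anything specific to our papers, in 1+1 dimensional Minkowski space-time the interval between two events and the causal condition that takes over the role played by Euclidean distance are:

```latex
% Interval between events p = (t_1, x_1) and q = (t_2, x_2), in units with c = 1.
\begin{equation}
  \Delta s^2 = -(t_2 - t_1)^2 + (x_2 - x_1)^2 .
\end{equation}
% q lies in the causal future of p, so a directed edge from q back to p is
% allowed in the DAG, if and only if
\begin{equation}
  t_2 > t_1 \quad\text{and}\quad \Delta s^2 \le 0 .
\end{equation}
```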

There are many types of data with a directed acyclic graph structure. Citation networks are excellent examples and we will use them to illustrate our ideas in the rest of this article. Each node in a citation network is a document. The edges represent the entries in the bibliography of one document which always reference older documents – our arrow of time. We have worked with several different types of citation network: academic paper networks based on sections of the arXiv paper repository, US Supreme court judgements, and patents. My blog on citation network modelling gives some more background and how I think about citation networks in general.

Combining these two concepts, James Clough and I have adapted a well-known clustering method, MDS (multidimensional scaling), so that it works for directed acyclic graphs (Clough and Evans 2017). Traditional MDS is usually applied to data sets where you have a matrix of distances between each pair of objects. For a network, this would usually be the length of the shortest path between each pair of nodes. MDS then assumes that these objects/nodes are embedded in a Euclidean space and suggests the best set of coordinates for the objects in that space. Clustering can then be performed by looking at which points are close together in this space. We found a way to take account of the fact that two papers on exactly the same topic can be published at the same time in different places. They are clearly ‘close’ together in any common-sense definition of close, yet there is no direct connection through their citation network. Our method will show that these papers are similar just from the pattern of their citations. Indeed the text could be fairly different (perhaps with two documents on networks, one uses the terms node, link, network while the second uses vertex, edge, graph for the same concepts) but the way these two documents are used by others later, or the way the two documents were based on the same material, indicates they are likely to be working on the same ideas.
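As a point of reference, ordinary Euclidean MDS on a matrix of shortest-path lengths takes only a few lines with standard libraries. The sketch below is purely the baseline being generalised, not our Lorentzian method, and the karate club graph stands in for a real citation network.

```python
# Baseline sketch: classical (Euclidean) MDS applied to shortest-path lengths.
# Our Lorentzian MDS replaces the Euclidean embedding step, so this is the
# standard method being generalised, not the method of Clough & Evans (2017).
import numpy as np
import networkx as nx
from sklearn.manifold import MDS

G = nx.karate_club_graph()                      # stand-in for a citation network
nodes = list(G.nodes())
spl = dict(nx.all_pairs_shortest_path_length(G))
D = np.array([[spl[u][v] for v in nodes] for u in nodes], dtype=float)

coords = MDS(n_components=2, dissimilarity="precomputed",
             random_state=0).fit_transform(D)
print(coords[:3])                               # 2-d coordinates of the first three nodes
```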

Once you have the coordinates of each document in the citation network there are many other standard geometric tools you can use to do other jobs. For instance, to recommend similar papers to one you are reading, you just look for other documents close in a geometric sense given the coordinates we have calculated. In the figure, we show the top two hundred papers from the first decade of the hep-th part of the arXiv paper repository (this is dominated by string theory). The visualisation uses coordinates found using our Lorentzian MDS technique.

Top 200 hep-th citation network

A two-dimensional embedding of the 200 most cited papers in the hep-th citation network where coordinates are found using our Lorentzian MDS algorithm. From Clough and Evans 2017.

Our work with Minkowski space fits into a broader programme of looking at networks in terms of the geometry of different types of space, what I call Netometry (Networks + Geometry, or perhaps Neteometry is better), as exemplified by Krioukov et al. 2010. For instance, a good indication that a low-dimensional Minkowski space might be a good representation of many citation networks came from our measurements of dimension (Clough and Evans 2016).

Bibliography

Clough, J.R. & Evans, T.S., 2016. What is the dimension of citation space? Physica A 448, 235-247. [DOI: 10.1016/j.physa.2015.12.053, arXiv:1408.1274]

Clough, J.R. & Evans, T.S., 2017. Embedding graphs in Lorentzian spacetime, PLoS ONE 12 e0187301 [DOI: 10.1371/journal.pone.0187301 , arXiv:1602.03103]

Krioukov, D., Papadopoulos, F., Kitsak, M., Vahdat, A. & Boguna, M., 2010. Hyperbolic geometry of complex networks. Phys. Rev. E 82. [arXiv:1006.5169]

 

Exploring Big Historical Data

I’ve really enjoyed reading my copy of Exploring Big Historical Data: The Historian’s Macroscope (Macroscope for short here) by Shawn Graham, Ian Milligan and Scott Weingart. As the authors suggest, the book will be ideal for students or researchers from the humanities asking if they can use big data ideas in their work. While history is the underlying context here, most of the questions and tools are relevant whenever you have text-based data, large or small. For physical scientists, many of whom are not used to text data, Macroscope prompts you to ask all the right questions. So this is a book which can really cross the disciplines. Even if some readers are like me and find some aspects of the book very familiar, they will still find some new stimulating ideas. Failing that, they will be able to draw on the simplicity of the explanations in Macroscope for their own work.

I know enough about text and network analysis to see that the details of the methods were skipped over, but enough of a broad overview was given for someone to start using the tools. PageRank and tf-idf (term frequency–inverse document frequency) are examples where that practical approach was followed. The humanities have a lot of experience of working with texts and a physical scientist like myself can learn a lot from that experience. I have heard this piecemeal in talks and articles over the last ten years or so but I enjoyed having it reinforced in a coherent way in one place. I worry a bit that the details in Macroscope of how to use one tool or another will rapidly date, but on the other hand it means a novice has a real chance to try these ideas out just from this book alone. It is also where the online resources will come into their own. So I am already planning to recommend this text to my final-year physics students tackling projects involving text. My students can handle the technical aspects without the book but even there they will find this book gives them a quick way in.

Imperial College Physics Staff

Staff in the Physics Department of Imperial College London clustered on the basis of the abstracts of their recent papers.

I can see that this book works, as I picked up some of the simpler suggestions and used them on a pet project which is to look at the way that the staff in my department are related through their research interests. I want to see if any bottom-up structure of staff research I can produce from texts written by staff matches up to existing and proposed top-down structures of faculties – departments – research groups. I started by using python to access the Scopus API. I’m not sure you can call Elsevier’s pages on this API documentation, and even Stack Overflow struggled to help me, but the blog post Getting data from the Scopus API helped a lot. A hand-collected list of Scopus author IDs enabled me to collect all the abstracts from recent papers coauthored by each staff member. I used python libraries to cluster and display the data, following a couple of useful blogs on this process, and got some very acceptable results.

However, I then realised that I could use the text modelling discussed in the book on the data I had produced. Sure enough, a quick and easy tool was suggested in Macroscope, one I didn’t know, Voyant Tools. I just needed a few more lines of code in order to produce text files, initially one per staff member containing all their recent abstracts in one document. With the Macroscope book in one hand, I soon had a first set of topics, something easy to look at and consider. This showed me that words like Physical and American were often keywords, the second of these being quite surprising initially. However, a quick look at the documents with a text editor (a tool that is rightly never far away in Macroscope) revealed that many abstracts start with a copyright statement such as “2015 American Physical Society”, something I might want to remove as this project progresses. I am very wary of such data clustering in general, but with proper thought, with checks and balances of the sort which are a key part of Macroscope, you can extract useful information which was otherwise hidden.
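For the record, the clustering step in my pet project amounted to something like the sketch below; the staff names and one-line "abstracts" are invented stand-ins, and the real pipeline fed in one file of Scopus abstracts per staff member.

```python
# Sketch of the abstract-clustering step: tf-idf vectors plus k-means.
# The staff names and one-line "abstracts" are invented stand-ins; the real
# input was one text file of recent Scopus abstracts per staff member.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

abstracts = {
    "staff_A": "quantum optics photon cavity entanglement",
    "staff_B": "galaxy survey dark matter cosmology",
    "staff_C": "photon detector entanglement quantum information",
    "staff_D": "dark energy supernova cosmology survey",
}

X = TfidfVectorizer(stop_words="english").fit_transform(abstracts.values())
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
for name, label in zip(abstracts, labels):
    print(name, "-> cluster", label)
```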

So even for someone like me who has used or knows about sophisticated tools in this area and is (over) confident that they can use such tools, the technical side of Macroscope should provide a very useful shortcut, despite my initial uncertainty. Beyond that, I found that having the basic issues and ideas behind these approaches reinforced and well laid out was really helpful for me. For someone starting out, like some of my own physical science masters and bachelors students working on my social science projects, this book will be invaluable. A blog or intro document will often show you how to run a tool but it will not always emphasise the wider principles and context for such studies, something you get with Macroscope.

I should make clear that I do have some formal connections with this book, one of my contributions to the pool of academic goodwill. I suggested the general topic of digital humanities, and Shawn Graham in particular as a potential author, at an annual meeting of the physics and maths advisory committee for ICP (Imperial College Press). For free sandwiches we pass on ideas for topics or book projects to the publisher. I also commented on the formal proposal from all three authors to ICP, for which I often get a free book. My copy of Macroscope was obtained for reviewing a recent book proposal for ICP. Beyond this I get no remuneration from ICP. It is nice to see a topic and an author I highlighted come together in a real book, but the idea is the easy bit and hardly novel in this case. Taking up the idea and making it into a practical publishing project is down to Alice Oven and her ICP colleagues, and to the authors Shawn Graham, Ian Milligan and Scott Weingart. That’s particularly true here as the book was produced in an unusual open-source way and ICP had the guts to go along with the authors to try this different type of approach to publishing.

References

Exploring Big Historical Data: The Historian’s Macroscope
Shawn Graham (Carleton University, Canada),
Ian Milligan (University of Waterloo, Canada),
Scott Weingart (Indiana University, USA)
ISBN: 978-1-78326-608-1 (hardback)
ISBN: 978-1-78326-637-1 (paperback)

Modelling the Footprints of Innovation: Citation Networks

When one document refers to another in its text, this is called a citation. The pattern of these citations is most naturally represented as a network where the nodes are the documents and the links are the citations between the documents. When documents were physical documents, printed or written on paper, then these citations must (almost always) point back in time to older documents. This arrow of time is imprinted on these citation networks and it leads to interesting mathematical properties.

One of the most interesting features of citations is that they have been carefully curated, sometimes for hundreds of years. The data I use on US Supreme Court judgments goes back to the founding of the USA. So citation data gives some of the oldest and most continuously maintained ‘big data’ sets to study.

The reason why records of citations have been maintained so carefully is that they record the process of innovation, be it in patents, law or academic study. When you try to claim a patent you must by law indicate the prior art, earlier patents with relevant (but presumably less advanced) ideas. When a judge makes a judgement in a case in the USA, they draw on earlier cases which have interpreted, and so created, the law needed to come to a conclusion in the current case. And of course, academics can’t discuss the whole of existing science when explaining their own new ideas so they have to refer back to papers where previous researchers have set out a key idea needed in the current work. Citations are therefore a vital part of knowledge transfer: new ideas build on all the work done in earlier judgments, previous patents or older papers. That is why citations have been so carefully recorded. The network formed by documents and their citations shows whose giant shoulders we are standing on when innovations are made, to paraphrase Newton.

From a theoretical point of view there are many interesting features in these networks. If you follow citations from one document to another to another and so on, at each step you will always reach an older paper. So you can never come back to the starting point along such paths: there are no cycles in a citation network (it is an example of a directed acyclic graph). If you look at the number of citations each document gets, that is how many newer documents refer back to it, then these follow a fat-tailed distribution – a few documents have most of the citations, while most documents have very few citations each. See Derek de Solla Price’s 1965 paper for an early discussion of this feature. Moreover, if you look at documents published in the same year, you get roughly the same shape for the number of documents with a given number of citations (see Radicchi et al. 2008, Evans et al. 2012), at least for well-cited documents. Since these networks are of such great interest, many other features have been noted too.

One way for a theorist to understand what is happening is to build a model from a few simple rules which captures as many of the features as possible. One of the first was that of Derek de Solla Price (1965), whose theory of “cumulative advantage” suggested that as new documents were created they would cite existing papers in proportion to the number of citations those papers already had, that is, the rich get richer. This follows a principle used in many other models of fat tails in other data, and indeed it was later rediscovered in the context of the number of links to modern web pages – Barabási and Albert’s preferential attachment (1999). One trouble with this simple model is that the oldest documents are always the ‘rich’ ones with the most links. In reality, in each year of publication there are a few documents with many citations (relative to the average number for that year) and most have very few. The Price model does not give this, as all documents published in the same year will have roughly the same number of citations. To address this problem, we (Sophia Goldberg, Hannah Anthony and myself) searched for a simple model which reproduced this behaviour – fat tails in the citation data of papers published in one field and one year (Goldberg et al. 2015).

The simplest model we found works as follows. At each step we add a new document, representing the evolution in time of our citation network.

A new document (red diamond) is added to an existing set of documents (blue circles) and their citations to earlier documents (arrows).

A new document first looks at recently published documents, as it is well known that citations tend to favour more recent documents. What we mean by recent is set by one of our parameters, a time scale τ.

First look at recent documents only.

We choose these recent documents partly at random (a fraction (1-p)) and partly with cumulative advantage (a fraction p), in which we pick recent papers to cite in proportion to the number of their current citations. This choice of papers is not realistic since it requires the authors to be able to choose from all recent documents, while in reality authors only have limited knowledge. However, this stage is meant to capture, statistically at least, the way authors learn about recent developments: a recommendation from a colleague, a talk at a conference, scanning new editions of certain journals and so forth. Sometimes it will be essentially random, sometimes this first choice will reflect the attention papers have or will receive.

Choose to cite a recent document, with probability p use cumulative advantage (preferential attachment) or simply random with probability (1-p).

Once we have chosen these primary documents to cite, our new document then looks at the references within these primary documents.

Look at the papers cited by the primary paper just selected.

Each paper cited in the primary paper is then cited, copied, by the new document with probability q, the third and last parameter of the model.

Each paper cited in the primary paper is selected with probability q.

The new paper cites the selected secondary papers, so on average a fraction q of the references in each primary paper are copied.

This ‘copying’ process is known to be a way of getting cumulative advantage with only local knowledge of the network (see for instance my paper with Jari Saramäki (Evans & Saramäki 2005) and references therein). That is, you only need to read the papers you have already cited, already found, to find these secondary documents to reference. There is no need for the model to know about the whole network at this point, reflecting the limited knowledge of actual authors.
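For anyone who wants to play with the model, here is a minimal simulation sketch of the three steps just described. The parameter values, the recency window, and the identification of a paper's integer id with its publication time are illustrative assumptions, not the fitted values from our paper.

```python
# Minimal simulation sketch of the three-step citation model described above.
import random

random.seed(1)
TAU = 50          # recency window: only the last TAU papers count as "recent"
P = 0.5           # probability a primary paper is chosen by cumulative advantage
Q = 0.17          # probability each reference of a primary paper is copied
N_PRIMARY = 4     # number of primary references per new paper

refs = {0: set()}   # paper id -> set of papers it cites
cites = {0: 0}      # paper id -> citations received so far

for new in range(1, 5000):
    recent = list(range(max(0, new - TAU), new))
    chosen = set()
    for _ in range(min(N_PRIMARY, len(recent))):
        if random.random() < P:
            # cumulative advantage: weight recent papers by citations received
            # (+1 so uncited papers can still be picked -- a sketch-level choice)
            weights = [cites[d] + 1 for d in recent]
            primary = random.choices(recent, weights=weights)[0]
        else:
            primary = random.choice(recent)          # uniformly at random
        chosen.add(primary)
        # copy each reference of this primary paper with probability Q
        chosen.update(d for d in refs[primary] if random.random() < Q)
    refs[new] = chosen
    cites[new] = 0
    for d in chosen:
        cites[d] += 1

print("mean references per paper:",
      sum(len(r) for r in refs.values()) / len(refs))
```

One design point worth noting: the mean length of a reference list only stays finite when roughly N_PRIMARY × Q < 1, which the illustrative values above respect.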

It was only when we added the last copying process that we found our model reproduced the fat-tails seen within the citations to documents published in the same year. Nothing else we tried gave a few well cited papers in every year. Comparing with one data set, taken from the hep-th section of the arXiv repository, we found that the best values for our parameters led to a typical paper in our model of hep-th made up as follows:

  • Two primary papers chosen at random from recent papers.
  • Two primary papers chosen in proportion to the number of their citations from recent papers.
  • Eight secondary papers chosen by copying a reference from one of the first four primary papers.

This may seem like a very high level of copying from the primary papers – on average we found 70% of a paper’s references were secondary citations, that is, papers already cited in other papers being cited. One has to ask if the more recent paper, the primary one, contained all the information from the earlier ones as well as the innovations being built on in the current paper. Did the new documents really derive useful information from the secondary papers cited? Often you see old ‘classic papers’ gathering citations as they are name-checked in the introduction to a new paper. It is not clear if the classic paper was even read while performing the current research. This feeling that some papers gain attention and acquire citations in a way that does not reflect any direct influence on the current work is supported by at least two other studies. One was a study by Simkin and Roychowdhury (2005) of the way errors in the bibliographies of papers are copied in later papers. They suggest that this means around 80% of citations come from such copying of references. In another approach, James Clough, Tamar Loach, Jamie Gollings and myself (Clough et al. 2015) exploited the special properties of citation networks, and this also suggested that 70%-80% of links were unnecessary for the logical structure of academic citation networks.

Of course constructing simple models will never capture the whole story. Models are, though, a good way to see if we have understood the key principles underlying a system.


Time Constrained Networks

One of the key features of complex networks is that they capture interactions which have no limitations.  In most electronic systems, be they Facebook, emails or web pages, we can make connections across the world with little if any cost.

However, what if there are constraints on the links made in a network? Surely we should change the way we study networks if space, time or some other constraint is having a significant effect on the formation or use of the network. This has been a major interest of mine over the last few years. Space is one obvious limitation, as in some cases long-distance connections are less likely to be made. There has been a lot of work in this area over many decades but I will leave this constraint for another blog.

It is only more recently that the role of time in networks has begun to receive more attention. A lot of this recent interest is in how to deal with networks where the connections are made at one specific time. That is because most communication networks, emails, phone calls and so forth, are of this type. The recent review by Holme and Saramäki (2012) covers such temporal edge networks.

Yet networks are made of two parts: vertices and edges. My recent work has focussed on the case where it is the vertices, not the edges, which are created at a definite time. In such temporal vertex networks, causality forces the interactions between nodes to always point in one direction. For example, consider a citation network formed by academic papers. The nodes in our network are the academic papers and the links are formed by their bibliographies. So if paper A refers to another paper B then we can be (almost) sure that A was written after B. Information can therefore flow only from B to A. In fact, since any set of documents can only refer to older ones, such networks are common. In law, judges refer to previous judgments to support their arguments. When registering a patent, prior art needs to be cited, that is other previously granted work which may have some relevance to the current claim.

The same types of structure occur in several other situations. Any situation where there is a logical flow has the same causal structure. If we have a project where the nodes represent individual tasks, then an edge from task S to task T could represent the fact that task T requires task S to have been completed before task T is started. This has been the focus of work on temporal vertex networks in computer science. The logical flow of a mathematical argument or an Excel spreadsheet shows the same properties. These networks define what is called a partially ordered set, or poset, and it is under this title that you find relevant work coming from mathematicians. A final example comes from the Causal Sets approach to quantum gravity (see Dowker 2006 for a review). Here space-time is discrete not continuous, and these discrete points are the nodes of the network. The nodes are connected by edges only if they are causally connected, and causality again gives these edges a common direction.

All of these temporal vertex networks have a key property: they contain no loops if you always follow the direction on the edges. You can not go backwards in time. Thus the traditional name for such a network is a directed acyclic graph, or DAG for short.

So the question is how can we adapt traditional network measures to deal with the fact that these networks, DAGs, are constrained by causality?  Are there new measures we should employ which give more insights to such networks?

I’ve been looking at these problems with several students (undergraduates in their final year projects and some MSc students), one of whom, James Clough, is now working for his PhD on this topic.

Paths in networks are always important. However, one feature of a DAG we have been exploiting is that if we always follow the direction of the arrows, the direction of time, then not all nodes are connected. If we like, we could add edges whenever there is such a path connecting a later node to an earlier one, a process known as transitive completion. On the other hand, we could remove as many edges as we can while leaving the causal relationships intact, a process known as transitive reduction. That is, if there is a path between two nodes in the network before transitive reduction, then there will still be a path afterwards.

Example of the transitive reduction (left) and the transitive completion (right) of the directed acyclic graph in the centre.
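Both operations are available off the shelf for directed acyclic graphs; here is a minimal sketch with networkx on a toy citation network (transitive completion is called transitive closure there).

```python
# Minimal sketch: transitive reduction and completion (networkx calls the
# latter transitive closure) of a toy citation DAG. Edges point from the
# citing, newer document to the cited, older one.
import networkx as nx

G = nx.DiGraph([("C", "B"), ("B", "A"), ("C", "A"),   # C->A is implied by C->B->A
                ("D", "C"), ("D", "A")])              # D->A is implied by D->C->B->A

TR = nx.transitive_reduction(G)
TC = nx.transitive_closure(G)

print(sorted(TR.edges()))   # the implied edges C->A and D->A are dropped
print(sorted(TC.edges()))   # every implied edge is added
```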

What we have done (in Transitive Reduction of Citation Networks) is look at how real data from citation networks behaves after transitive reduction. What we find is that different types of citation network behave very differently. The citation network formed from academic papers taken from the arXiv repository and the network of US Supreme Court judgments both show that about 80% of the edges are not needed to retain all the causal relationships. On the other hand, the patents network shows the opposite behaviour, with all but 15% of edges being essential. The edges removed tend to be the citations to older papers. One interpretation is that academics and judges may be citing well-known early papers and judgments even though their current work is only directly related to more recent documents. Perhaps some of these citations do not indicate the early work was needed but reflect other motivations, such as simple copying of popular papers or reviews in the field which at best have only general relevance. For academic papers this interpretation is supported by the work of Simkin and Roychowdhury; in this sense many of these citations are unnecessary.

The number of citations to a document after transitive reduction certainly gives us a different view of the importance of different documents. For instance, paper hep-th/9802109 on the arXiv (Gauge Theory Correlators from Non-Critical String Theory by Gubser et al.) was cited by 1641 papers in the network, but only three citations remained after transitive reduction! On the other hand, paper hep-th/9905111 (Large N Field Theories, String Theory and Gravity by Aharony et al.) also has a large number of citations in the raw data, 806, yet after transitive reduction it retains 77, a far larger fraction of its original citations. Perhaps the information in the second paper was used more diversely.

We can find similar examples in the US Supreme Court citation network. The case Schneider vs. New Jersey (1939) has 144 citations in the original data but this drops to just one after transitive reduction. Stromberg vs. California (1931) also falls from 132 citations to just one. Conversely, the case Heller vs. New York (1973) shows only a slight fall after transitive reduction, falling from 68 to 48 citations, and has the most citations in our reduced network. The second most cited case after transitive reduction is Hamling vs. United States, which drops from 68 to 38 citations. Wikipedia lists hundreds of Supreme Court cases but the last two are not famous enough to make the Wikipedia list. Our analysis suggests they may have more importance than a simple citation count would suggest. At the very least, it might be worth checking out documents that remain highly cited in the reduced network of essential links.

Another way to look at citation networks is to see if we can define a dimension for the network. That is, we can try to quantify how much variation there is in the citation process. A low dimension means that there are few directions, few distinct themes, relevant for citation in a document. A high dimension indicates that there is a wide range of relevant but distinct directions from which a document can draw inspiration. What James Clough and I found (in What is the dimension of citation space?) is that we were often able to assign an interesting value for the dimension of our citation data. For academic papers, we found that different fields of research have different dimensions. For papers in the hep-th arXiv section (largely string theory) we found a low dimension of around 2, while for theoretical papers closely linked to particle physics experiments (the hep-ph section) we found more variation, as indicated by a higher dimension of 3. The quant-ph section was also around 3, while the astro-ph section had a slightly higher dimension of around 3.5. So clearly, despite similarities in the main data using standard measures, our time-aware dimension measures show clear differences in the citation behaviour of different areas. String theory in particular seems to be a tightly knit collection of work, with each piece of work largely dependent on all the other work; few independent directions can be pursued.

The US Supreme Court judgments were more complicated. Small samples (usually from modern judgments) showed a dimension of around 2.5 to 3, but larger samples, typically ranging from modern back to the earliest judgments, had lower dimensions, closer to 2. We interpreted this as reflecting the way that there were few early judgments compared to the number produced today. So the further back we traced in time to find the influence of judgments on recent ones, the smaller the variation. Perhaps that is not so surprising, and we might expect a similar shape if we could follow scientific papers back to the 18th century! Patents, on the other hand, showed a much higher dimension, though again the results were more involved.

It is clear from just the few studies we have made that time makes a crucial difference to the structure of a network. We have tried a few new measures adapted to take account of time and in doing so we have thrown up some intriguing features in real data.  There is surely much more to find when networks are embedded in time.

References

Clough, J.R. & Evans, T.S. What is the dimension of citation space?, arXiv:1408.1274

Clough, J.R., Gollings, J., Loach, T.V. & Evans, T.S., Transitive Reduction of Citation Networks, J. Complex Networks, to appear 2014, arXiv:1310.8224, DOI: 10.1093/comnet/cnu039 (open access)

Dowker, F. Causal sets as discrete spacetime, 2006. Contemporary Physics, 47, 1-9

Holme, P. & Saramäki, J., 2012. Temporal Networks, Physics Reports, 519, 97-125

Simkin M.V. and Roychowdhury V.P., 2003. Read before you cite! Complex Systems 14 269-274

The Many Truths of Community Detection

You do not need to know the detailed properties of every small part making up a gas; it turns out the bulk properties of a gas can be derived from very general principles. In the same way, when looking at Facebook data, we might be able to identify groups of people who behave in a similar way. Searching for these groups or clusters in data is central in many areas of physical and social science. It is often easier to understand the behaviour of a large system by looking at these clusters, which are much fewer in number.

In terms of networks, the clustering is based on the structure (topology) of the network and the groups found are called Communities. In this case we might expect a coherent group to be one which has more links between members of the group than it has to nodes outside the group in other clusters. I have done some work on what is called Community Detection, particularly on methods which assign nodes to several clusters at once (e.g. my line graph and clique graph papers referenced below).  After all, my social connections are likely to show that I am part of several groups: work colleagues, family relationships, connections made through hobbies or sports.
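As a toy illustration of overlapping membership, here is a minimal sketch using clique percolation in networkx. Note this is a different overlapping method from the line graph and clique graph approaches in my own papers, and the five-node graph is invented purely for the example.

```python
import networkx as nx
from networkx.algorithms.community import k_clique_communities

# Two triangles sharing node 2: node 2 plausibly belongs to both groups.
G = nx.Graph([(0, 1), (1, 2), (2, 0), (2, 3), (3, 4), (4, 2)])

# Clique percolation with k=3 allows nodes to sit in several communities.
for community in k_clique_communities(G, 3):
    print(sorted(community))
# Prints [0, 1, 2] and [2, 3, 4]; node 2 appears in both.
```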

For some time I have been very wary about the meaning of the clusters found with such methods, and in particular about claims of one method being able to find “better” communities than another.  A recent paper prompted me to think about this again. In Community detection in networks: structural clusters versus ground truth, Hric, Darst, and Fortunato from Aalto University in Finland (a big centre for networks research) asked if the network methods were finding different sorts of clusters from those found using other aspects of the data. Typically when testing a community detection method, one sets up artificial networks in which each node is assigned to one community. The edges between nodes are then assigned at random but with a preference for edges between nodes from the same community.  I can do all the tests I like on artificial data but I am always worried that this approach has introduced some hidden bias. Perhaps we end up choosing the methods that ‘work’ on artificial data but which are perhaps not so good on real messy data? It all comes down to the fact that we have mathematical ways to quantify the difference between community assignments but defining what we mean by “the best” clustering is impossible. Even with artificial networks, the “ground truth” is not generally an absolute truth.  Typically the “truth” is a set of input parameters and the actual network generated is partly random. So while the resulting artificial network is correlated with the ground truth, it is not designed to be a perfect match, and the “actual truth” will, in almost all cases, be different from the ground truth. [Note added 5th January 2016. Another more recent paper, which includes some expert evaluation of the communities found as well as a comparison of many different methods, is Šubelj, van Eck and Waltman 2015, Clustering scientific publications based on citation relations: A systematic comparison of different methods.]
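The testing setup described above is easy to reproduce. The sketch below uses the planted partition model in networkx, with arbitrary parameter values and greedy modularity standing in for whichever detection method is being tested; the point is that the generated network is only statistically correlated with the planted groups.

```python
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

# Planted partition model: 4 groups of 32 nodes, with edges more likely
# inside a group (p_in) than between groups (p_out). Parameter values are
# arbitrary choices for illustration.
G = nx.planted_partition_graph(l=4, k=32, p_in=0.3, p_out=0.02, seed=42)
planted = G.graph["partition"]            # the "ground truth" fed to the model

found = greedy_modularity_communities(G)  # one automatic method among many

print(len(planted), "planted groups,", len(found), "groups found")
# Any disagreement reflects both the method and the randomness of the
# generated network relative to its own "ground truth".
```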

I also worry about what we do when we run network community detection methods on large real data sets where there is no simple ground truth.  When I have done this, I can find a variety of possible answers for communities in the data.  Many look reasonable but none correlate perfectly with each other or with what I know from other sources. This leaves me wondering if the automatic methods are finding one truth and my other information gives another. Alternatively the automatic methods might be rubbish, good on artificial cases, not so good in reality. There is no simple way of telling.

In any case, do real networks have a “ground truth”?  Quite often people have data from other sources about real networks and they use this to construct a “ground truth”.  The test is then to see if automatic methods can find this ground truth. However, what if the other data is wrong? People don’t always tell the truth: they can deliberately mislead or they can misunderstand the problem. Children surveyed about their friendships may tell you who they’d like to be friends with (the most popular person in the class) and not who they actually spend time with.

Zachary Karate Club network clustered using clique graph methods


Take the famous Zachary karate club data set used by many (including myself) as a simple test. This is a network of members of a karate club that split in two during the sociologist’s study. Let us accept that Zachary’s professionalism has produced data that is a true reflection of the situation, despite the difficulty of measuring associations in social science. If you look at the published paper it actually gives two truths: one based on which of the two factions the members actually joined, and one based on an automatic community detection method.  I suspect most people are (unwittingly) using the latter as the ground truth when testing their work.  Perhaps this is a further example supporting the claim that academics only read 20% of their references. Worse, the data given in the published karate club paper is not consistent – the unweighted adjacency matrix is not symmetric.  So which truth was used in all those papers using the karate club network?
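To see the two truths concretely, here is a minimal sketch comparing the factions recorded in networkx’s copy of the karate club data with what one automatic method finds. Greedy modularity is chosen purely for convenience; it is not the method used in Zachary’s paper.

```python
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

G = nx.karate_club_graph()

# One "truth": the faction each member actually joined, stored by networkx
# as the node attribute 'club' ('Mr. Hi' or 'Officer').
faction = nx.get_node_attributes(G, "club")

# Another "truth": what an automatic method finds.
found = greedy_modularity_communities(G)

for i, community in enumerate(found):
    labels = [faction[n] for n in community]
    print(f"community {i}: {len(community)} nodes,",
          f"{labels.count('Mr. Hi')} x Mr. Hi / {labels.count('Officer')} x Officer")
```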


American College Football network clustered using clique graph clustering methods

Another example comes from some work I did on overlapping community methods. Like many other people I downloaded a standard data set from Mark Newman’s web site, an extremely useful resource. The American College Football data was created by Girvan and Newman (in Community structure in social and biological networks) and represents the games played between American College Football teams in one season.  Also provided is the conference membership of each team.  Teams play more games against teams from their own conference than against teams from any one other conference. In fact, this data is so well clustered that surely no method should get anything wrong beyond a few independent teams, as my visualisations here illustrate (taken from my clique based clustering paper). So I looked at the “mistakes” made by my method. After about two afternoons of wading through interminable websites of stats on American College Football and Wikipedia pages on the college conference system, I realised that in fact most of the “mistakes” did not come from the automatic community detection but lay in “the ground truth”, that is in the conferences assigned to teams in the data file. It turns out that the conference assignments in the original football.gml file are for the 2001 season while the file records the games played in the 2000 season. For instance, the Big West conference existed for football until 2000 while the Sun Belt conference only started in 2001. There were 11 conferences and 5 independents in 2001 but 10 conferences and 8 independents in 2000.   Care is needed as American college athletic conferences cover many sports, with individual sports joining or being dropped from any one conference from time to time. Teams can switch conferences too. In fact, around 10% of the college teams playing American football at the top level changed conferences around 2000-2001. (Note added 5th January 2016.  These errors were also noted in Gschwind et al. 2015, Social Network Analysis and Community Detection by Decomposing a Graph into Relaxed Cliques, only the second paper I’ve seen which does this independently of my discussion.)
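If you want to repeat this kind of check yourself, here is a minimal sketch. It assumes you have downloaded football.gml from Mark Newman’s data page and that the conference assignment is stored in the node attribute ‘value’; both are assumptions worth verifying against your own copy of the file.

```python
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

# Assumes football.gml is in the working directory and that each node
# carries a 'value' attribute recording its conference.
G = nx.read_gml("football.gml")
conference = nx.get_node_attributes(G, "value")

found = greedy_modularity_communities(G)

# Flag teams whose detected community disagrees with the majority conference
# of that community: candidates for "mistakes" in either the method or the
# stated ground truth.
for community in found:
    values = [conference[n] for n in community]
    majority = max(set(values), key=values.count)
    odd = [n for n in community if conference[n] != majority]
    if odd:
        print("possible mismatches:", odd)
```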

So often the “ground truth” is just another truth, not some absolute truth! The errors in the Zachary karate club and American College Football data do not matter in one sense, as the data sets still provide valid and valuable tests for methods.   The conclusions of the hundreds of papers which use these data sets and their questionable ground truths would not change. Indeed, it highlights one role for automatic methods.   You can see that where Girvan and Newman’s methods get the “wrong” answer in their original paper (Community structure in social and biological networks), they are in fact highlighting problems with the conference data. Validation of data is a very useful, if boring, job. A final question will always be whether there is a single truth. For instance, I am in the theoretical physics group of the physics department of the Faculty of Natural Sciences at Imperial College London.  That top-down hierarchical truth is important when allocating desks and teaching.  However, another truth would emerge if you studied my research relationships.  Those are with staff and students based in other physics research groups and with colleagues from other departments and even other faculties.

So I was really pleased to see the authors of Community detection in networks: structural clusters versus ground truth questioning the meaning of truth in community detection from a quantitative point of view. Clustering data and finding communities in data is of tremendous value commercially and for research, but there is still a lot more work to do before we understand these different truths.

References

M. Girvan, M.E.J. Newman, Community structure in social and biological networks, PNAS (2002) 99, 7821-7826

W. Zachary, An Information Flow Model for Conflict and Fission in Small Groups, Journal of Anthropological Research (1977) 33, 452–473

D. Hric, R.K. Darst,and S.Fortunato,  Community detection in networks: structural clusters versus ground truth,  arXiv:1406.0146

T.S. Evans, American College Football Network Files. figshare. (2012).  http://dx.doi.org/10.6084/m9.figshare.93179

T.S. Evans and R. Lambiotte, Line Graphs, Link Partitions and Overlapping Communities, Phys. Rev. E (2009) 80, 016105 [arXiv:0903.2181]

T.S. Evans, Clique Graphs and Overlapping Communities, J. Stat. Mech. (2010) P12037, http://dx.doi.org/10.1088/1742-5468/2010/12/P12037 [arXiv:1009.0638]

L. Šubelj, N.J. van Eck, L. Waltman, Clustering scientific publications based on citation relations: A systematic comparison of different methods, arXiv:1512.09023

Why Google scholar is no worse than anything else

Just read an interesting couple of blogs by Stacy Konkiel from the ImpactStory team, one entitled “4 reasons why Google Scholar isn’t as great as you think it is” nicely followed up by “7 ways to make your Google Scholar Profile better” which keeps things constructive.  My immediate response to the first blog is that the problems are not unique to Google scholar and that you could highlight the same issues with most bibliometric sources.  Certainly the criticisms also apply to the two other commercial bibliometric data sources, Scopus and Web of Science. The four points were as follows.

1) Google Scholar Profiles include dirty data

What is “dirty data”? A site like Impactstory pulling in data from a wide range of non-traditional sources ought to be a little more careful about throwing this term about! One person’s dirty citation is another’s useful lead. It does seem Google scholar is more open to gaming but at least it is easy to spot this using Google scholar if you see some anomalous data. Scopus and Web of Science make their decisions behind closed doors about what to include and what not; how many ‘weak’ journals and obscure conference proceedings are included there, how many book citations are excluded? I’ve heard at least one story about the way bibliometric data was used as a pawn in a dispute between two companies over other commercial interests. I just have no idea how much manipulation of data goes on inside a commercial company. On the altmetrics side of the story, most departments still regard any social media counts as dirty.

2) Google Scholar Profiles may not last

Surely this is a problem with anything, commercial or not. Your institution may switch subscription and cut off your access even if the product is still out there.  Google certainly has a poor reputation on this front.  In so many ways, we always gamble when we invest time in a computer product – not sure my PL/1 programming knowledge is much use these days.

3) Google Scholar Profiles won’t allow itself to be improved upon

Scopus and Web of Science also carefully control what you can do with their data. In any case you need a subscription before you can start to do anything. So surely this is a criticism of all closed data systems.

4) Google Scholar Profiles only measure a narrow kind of scholarly impact

Again, I don’t see Scopus and Web of Science producing much more than bare citation counts and h-indices. The UK 2012 research assessment procedure (REF) only quoted bare citation counts from Scopus. This is a problem of education. Until more people understand how to use bibliometric data, nothing much will happen, and I know h-indices still get thrown about during promotion discussions at my institution (again see an Impactstory blog about why people should stop using the h-index).

My Approach

I tend to think of all sources as data, and your interpretation should vary as you take each into account.  Like all data and measurements, results derived from bibliometric information need to be checked and validated using several independent sources and alternative methods.

For instance, I have access to these three commercial sources, and they tend to give citation counts which differ.  Web of Science is generally the most conservative, Scopus is in the middle and Google scholar leads the counts.  So I can use all three to give a balanced view and to weed out any problems. They also have different strengths and weaknesses.  Google is ahead of the curve and shows where the other two will go a year or two later.  My work on archaeology has a large component in books, which Google scholar reflects but the other two fail to capture.  Scopus is very weak on my early Quantum Field Theory work, while both Web of Science and Google scholar are equally strong in this area and time period.

The tips discussed in “7 ways to make your Google Scholar Profile better” are very useful but many apply to all data sources. For instance, Scopus just added two papers by another T.S.Evans working near London to my profile, even though they are in a completely different research field, the address is completely different (not even in London) and there is basically zero overlap between those papers and my work.  It makes you worry about the quality of the automatic detection software used by commercial bibliometric firms. I can’t fix this myself; I have to email Scopus, while I can tidy up Google scholar myself whenever I want. Currently I also feel that the Google scholar recommendations are the most useful source of targeted information on papers I should look at, but I am always looking for improvements.

Overall, I feel you need to keep a very balanced approach.  Never trust a statistic until you’ve found at least two other ways to back it up independently.

CitNetExplorer – Citation Network Analyser and Visualisation

I have just come across an interesting citation network analyser and visualiser – CitNetExplorer. Looks to be a very professional package that I will certainly be using.

One of the interesting things about citation networks is that their vertices have an order given by their publication date. This is a very strong constraint on the system, so when you analyse or visualise such a network you should take the time ordering into account. The simplest example is that you should not just look at the vertex degree in these networks, but at the in-degree (citation count) and out-degree (length of bibliography) separately. Perhaps the most obvious aspect of the constraint comes when you try to visualise the network. You can just put such networks into a standard network package and treat them as directed networks, which they are. However, any standard visualisation will undoubtedly place the vertices all over the two-dimensional surface used for display. Standard visualisations pay no attention to the time ordering of the vertices, yet you almost certainly want to show that information when displaying a citation network as it is such a critical part of the definition; many of the properties will depend on the age of the publication, for instance. I have encountered this myself and played around with a few ad hoc solutions, but came to the conclusion that I needed to write something myself, adapting a standard layout method to set one dimension of the vertex coordinates while the second dimension is set by the vertex’s time. Since the same problem is encountered when making diagrams showing the critical paths in a set of tasks (such as Gantt charts), there are packages which will do this. However you will also want to do different types of analysis on a citation network, and they are likely to be much bigger than a normal Gantt chart.
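As a minimal sketch of that kind of hybrid layout, using networkx and matplotlib and an invented four-paper example, one can take the horizontal coordinate from a standard layout and overwrite the vertical coordinate with the publication year.

```python
import networkx as nx
import matplotlib.pyplot as plt

# Toy citation DAG with publication years (hypothetical data).
G = nx.DiGraph([("D", "B"), ("D", "C"), ("C", "A"), ("B", "A")])
year = {"A": 1998, "B": 2001, "C": 2002, "D": 2005}

# Use a standard layout for the horizontal coordinate only...
pos = nx.spring_layout(G, seed=1)
# ...and override the vertical coordinate with publication time.
pos = {v: (x, year[v]) for v, (x, _) in pos.items()}

nx.draw_networkx(G, pos, node_color="lightgrey")
plt.ylabel("publication year")
plt.show()
```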

This is where CitNetExplorer comes in. It comes from Nees Jan van Eck and Ludo Waltman at the CWTS (Centre for Science and Technology Studies) in Leiden, one of the leading institutes in bibliometric research. It’s very early days and I have only had a short play, but for me its good points are:

  • Free for noncommercial and teaching purposes.
  • Cross platform as it is written in Java.
  • Stable on my Windows 7 machines.
    As it is written in Java, it is likely to be stable on other platforms too.
  • Well presented with a reassuring professional feel.
  • Good graphical display.
    The publications are laid out using their publication date for the vertical coordinate and a layout algorithm to place the publications horizontally.
  • Good default options.
    I got an instantly readable figure every time I tried it.
  • Good range of graphical output options.
    Vector graphics, especially postscript (eps), is essential for me. Note these are all under the Screenshot menu option.
  • Two basic network format output options.
    A Pajek .net file and a simple text file format (see below).
  • Various basic analysis tools.
    This includes transitive reduction, which is something I have been very interested in and which can throw up some new insights into the citation counts of papers (see arXiv:1310.8224).

The forty most highly cited papers in hep-th (1992-2003) after transitive reduction, as an example of output from CitNetExplorer.

So this looks to be a really nice package. Of course, I am never satisfied so what would I like to see in future versions:

  • Open source.
    It would be nice to be able to learn from their computational work and to add to this myself. Maybe some type of plug-in could be added to solve the latter problem. I have a few more tricks for citation networks in the pipeline for instance.
  • More input options.
    There are only two and one is tied to Thomson Reuters’ WoS (Web of Science) database. In the example given by the authors you perform a search on WoS and then save the results in a text file (saverecs.txt).  Note you must select the “Web of Science Core Collection” not the “All Databases” option (the example clearly shows this but I didn’t read it carefully); otherwise the output file will not include the full citation information needed to construct the citation network.  This file is a simple text file so you should be able to combine several of them by hand if, like me, you are limited to 500 records per file.
    The alternative is a pair of relatively simple text files, not as yet explained in the documentation. The first, namepub.txt, lists the properties of the publications, and the order in this file assigns each publication an index (the publication on line 2 is vertex 1, line 3 defines vertex 2, and so on). The second file, namecite.txt, is an edge list written in terms of these vertex indices. Look at the first few lines of the example data James Clough made from the open source KDD cup arXiv citation network data that we have been using in our recent work, or see the sketch after this list for a toy example. Alternatively, if you can produce a file from WoS, open it in CitNetExplorer and then save it in what is called CitNetExplorer format. These CitNetExplorer files are easy to look at, edit and prepare in a spreadsheet or a basic text editor, and appear to be tab separated.
  • Visualisation editing.
    No layout is perfect so it is essential to be able to move the vertices by hand. One of my favourite visualisation packages, visone, shows what you can do in Java, and even my own ariadne package built on the jung library gave that functionality automatically.
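To make the text-file route above a little more concrete, here is a minimal sketch that writes a pair of files in the spirit of namepub.txt and namecite.txt from a toy citation network. The column layout is an assumption based only on the description given above (tab separated, publications indexed from 1 by their line order, citations as pairs of indices); check it against a file saved from CitNetExplorer itself before relying on it.

```python
import networkx as nx

# Toy citation DAG with a publication year per node (hypothetical data).
G = nx.DiGraph([("D", "B"), ("D", "C"), ("C", "A"), ("B", "A")])
year = {"A": 1998, "B": 2001, "C": 2002, "D": 2005}

nodes = list(G.nodes())
index = {v: i + 1 for i, v in enumerate(nodes)}  # vertex indices start at 1

# Publication list: line order defines the vertex index.
# The header and column names here are assumptions, not the documented format.
with open("namepub.txt", "w") as f:
    f.write("label\tyear\n")
    for v in nodes:
        f.write(f"{v}\t{year[v]}\n")

# Citation list: an edge list written in terms of the vertex indices.
with open("namecite.txt", "w") as f:
    for u, v in G.edges():
        f.write(f"{index[u]}\t{index[v]}\n")
```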

Rather less seriously, I am not sure about the name.  I would pronounce the “Cit” in “CitNetExplorer” as “sit” or perhaps “chit”, so I would have kept the “e” in “cite”, CiteNetExplorer, but it’s not my product. As I’m getting bored typing it, I’m sure it will become just CNE in any case.

Links

CitNetExplorer http://www.citnetexplorer.nl/

Van Eck, N.J., & Waltman, L. (2014). CitNetExplorer: A new software tool for analyzing and visualizing citation networks. [arXiv:1404.5322]

Van Eck, N.J., & Waltman, L. (2014). Systematic retrieval of scientific literature based on citation relations: Introducing the CitNetExplorer tool. In Proceedings of the First Workshop on Bibliometric-enhanced Information Retrieval (BIR 2014), pages 13-20.

James R. Clough, Jamie Gollings, Tamar V. Loach, Tim S. Evans (2013).
Transitive Reduction of Citation Networks. [arXiv:1310.8224]

Clough, James; Evans, Tim; Loach, Tamar (2013). Transitive Reduction of Citation Networks. (data set) figshare
http://dx.doi.org/10.6084/m9.figshare.834935

Clough, James; Evans, Tim (2014). KDD cup arXiv data for CitNetExplorer. figshare fileset.
http://dx.doi.org/10.6084/m9.figshare.1021647


