"Everything looks like a graph, but almost nothing should ever be drawn as one."

seb

I am bothered by this statement made by Ben Fry in the book ‘Visualizing Data’ (2008). Although I have great respect for Ben Fry’s work, and his position may have evolved since then, I want to moderate this statement so that data explorers like danbri can form their own opinion.

Ben Fry in ‘Visualizing Data’:

Graphs can be a powerful way to represent relationships between data, but they are also a very abstract concept, which means that they run the danger of meaning something only to the creator of the graph. Often, simply showing the structure of the data says very little about what it actually means, even though it’s a perfectly accurate means of representing the data. Everything looks like a graph, but almost nothing should ever be drawn as one.

There is a tendency when using graphs to become smitten with one’s own data. Even though a graph of a few hundred nodes quickly becomes unreadable, it is often satisfying for the creator because the resulting figure is elegant and complex and may be subjectively beautiful, and the notion that the creator’s data is “complex” fits just fine with the creator’s own interpretation of it. Graphs have a tendency of making a data set look sophisticated and important, without having solved the problem of enlightening the viewer.

I totally disagree. Look at this simple plot:

[Figure: pareto_convergence_r050a01]

Can anyone tell me how this plot alone enlightens the viewer if I don’t explain how it was made and what is interesting to look at? Yet it appears very simple: only one curve, something you have been used to seeing since you discovered this kind of drawing in primary school. And even if I give some insight into how I made it and the context of the work, I am still, as the creator, the only one able to deeply understand the information that can be extracted, because I know the process that built the underlying data. To criticize my conclusions, you would need to learn as much as I did, get the same data and apply the same manipulations. Curation, reformatting, filtering, or whatever algorithm is used to capture, extract and process the data: each action has an impact on the meaning carried by the data. Graph visualization is no exception. It is like any other plot, except that you can’t hide the structural complexity without explicit filtering.

Let’s enumerate the dimensions used in a graph visualization: x and y coordinates, size of nodes, color of nodes, thickness of edges. Admittedly, it is not easy to read five dimensions. But is the “simple” plot a better deal? It has x and y coordinates, so only 2 dimensions (we could also have used colors and dot sizes to get 4). So you might think that you and your readers can interpret it easily and reliably. You would be wrong, because of the hidden dimension: scaling.

Here you see a plot in log-lin scale, which means the y-axis is logarithmic while the x-axis is linear. I found this visual pattern interesting in these data because of my research question, because I understand the meaning of the process that produced them, and because I found it in this particular scale. Plotted in lin-lin scale, I can find less information. Or should I use a cumulative function to plot my data? Maybe an inverse cumulative? Etc. An exploration of both data and projection techniques is required.
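To make the effect of scaling concrete, here is a minimal sketch of two of the projections mentioned above: a log-lin transform and an inverse cumulative distribution. The function names and the sample data are illustrative assumptions, not the data behind the plot in this post.

```python
import math

def log_lin(points):
    """Project (x, y) pairs to log-lin space: x unchanged, y -> log10(y)."""
    return [(x, math.log10(y)) for x, y in points if y > 0]

def inverse_cumulative(values):
    """For each distinct value v, return (v, fraction of samples >= v)."""
    n = len(values)
    ordered = sorted(values)
    result = []
    for i, v in enumerate(ordered):
        if i == 0 or v != ordered[i - 1]:
            result.append((v, (n - i) / n))
    return result

# Hypothetical data: an exponential decay, y = 1000 * 10^(-x/2).
points = [(x, 1000 * 10 ** (-x / 2)) for x in range(5)]

# In log-lin space the same data falls on a straight line of slope -0.5,
# a pattern that is much harder to see in lin-lin scale.
projected = log_lin(points)
```

The same data set therefore yields different visual patterns depending on the projection, which is why the choice of scale deserves to be stated alongside the plot.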

By doing one projection, I focus on something very particular in the data, and I still need other plots and statistical tests a) to decide whether it supports a hypothesis I have in mind, or b) to see whether I can find something new, something unexpected. The distortion of vision is therefore at the same time an issue and a tool to dig deeper into the data. I could also draw very wrong conclusions, even from analyzing this simple drawing, so why should external readers be better protected this way? There is a balance to find between a drawing that looks simple to read, so that conclusions appear obvious (even if they are not and you might be wrong), and the opposite, one that looks too complex to read, so that few conclusions will be drawn, if any. Hence it is a fallacy to argue that graphs are meaningful only to their creator, because this is the case for any plot taken in isolation, and it is hard work to enter into somebody else’s work anyway.

So graph visualization is not inherently worse than any other data drawing: we just don’t teach how to read it in primary school. Do you remember the first time you saw a plot? I guess you found it really abstract. Most people don’t really know what to look at in a graph, and produce visualizations that don’t show anything in particular. I personally think that this is a good thing because, put in context, graph visualization is very young compared to other data drawings, and a language of networks that combines layout algorithms and visual variables is still in the making. Moreover, after meeting and talking with people who publish such visuals, it seems that they already use them in a pragmatic way: by showing their complexity, graphs tell the reader that a) the data might contain interesting information (“so please, read until the end!”), and b) the authors did the work and propose some findings, but it was hard and many other things could be done (“hey, try it yourself!”). It is useless to discover the secrets of the universe if nobody listens to you. Before enlightening the viewer, one should attract the viewer enough that he/she will take the time to read, and graphs are useful for that need.

But drawing graphs as graphs is not only useful for communication. Their primary use for researchers is exploratory analysis when the study is not focused solely on the structure of the data, but when elements in context matter because you have prior knowledge about them, and your questions relate to other perspectives (say, sociology). Take the example of our work at Sciences-Po, where we teach the mapping of controversies to students who will become the future decision makers of companies or public policies. Part of the controversies in the public space are expressed on the Web. The dynamics of the discussions and the hyperlink structure of the Web make this field particularly hard to investigate. We successfully use graph visualization of websites to help the students orient themselves in this space, to assist and justify the classification of websites, and to assess the position of the actors of a controversy. This is just one case among others where there is currently no viable alternative to graph drawing and its synoptic property (seeing the whole without reduction of data).

Finally, the different uses of graph drawing are growing as it becomes mainstream and more people become familiar with it. I trust people to innovate and progressively learn how to read graphs and extract information. Just practice.

GSoC mid-term: new Visualization API

My name is Vojtech Bardiovsky and I am working on the new Visualization API. This is done together with the new visualization engine based on shaders.

API design

The aim of the project was to design a clean and usable API for the new engine. It exposes only as much as necessary, but enough to make customization of visualization possible. The following four API classes are all services and can be retrieved through ‘Lookup’.

Visualization controller

This is the most important class in the API and can be used to retrieve the ‘Camera’, the ‘Canvas’ used for visualization display, and very importantly the instances of the active ‘VizModel’ and ‘VizConfig’ classes, which both contain many settings that help control the visualization. It will also allow making direct changes to the visualization, like setting the frame rate or centering the camera according to different parameters. The ‘Camera’ class can be used to get data about its position or to perform actions such as translation or zooming.

Event and Selection managers

The Event manager can be used to register listeners to user events exactly as in the old engine. This is very important for the tools. The selection manager provides methods to retrieve all currently selected nodes or edges, to select nodes or edges and to control the selection state of the UI (dragging, rectangle selection, etc).

Motion manager

Apart from listening to all user induced events and their most basic handling (selection, translation, zoom), this class provides information about current mouse position in both screen and world coordinates.
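As an illustration of what converting between screen and world coordinates involves, here is a minimal 2D sketch. The `Camera2D` class, its fields and its methods are hypothetical and are not the API's ‘Camera’ class; it assumes a camera defined by a world-space center and a zoom factor, with the screen y-axis pointing down.

```python
from dataclasses import dataclass

@dataclass
class Camera2D:
    """Hypothetical 2D camera: world-space center plus a zoom factor."""
    center_x: float
    center_y: float
    zoom: float      # pixels per world unit
    width: int       # viewport width in pixels
    height: int      # viewport height in pixels

    def screen_to_world(self, sx, sy):
        # Offset from the viewport center, scaled back to world units.
        wx = self.center_x + (sx - self.width / 2) / self.zoom
        wy = self.center_y - (sy - self.height / 2) / self.zoom
        return wx, wy

    def world_to_screen(self, wx, wy):
        # Inverse of screen_to_world.
        sx = (wx - self.center_x) * self.zoom + self.width / 2
        sy = (self.center_y - wy) * self.zoom + self.height / 2
        return sx, sy

cam = Camera2D(center_x=10, center_y=20, zoom=2.0, width=800, height=600)
mouse_world = cam.screen_to_world(500, 100)
```

A tool that receives a mouse event in pixels can use the world-space position to find the node under the cursor regardless of the current pan and zoom.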

New features

There are many changes the new engine will bring and although it is not finished yet, there already are some new user-side features.

Complex selection

In the old visualization engine, only rectangular and direct (one node) selection were possible. The new API will allow any reasonable shape to be implemented. At the moment it supports rectangles, ellipses and polygons.

Thanks to the variety of selection shapes and changes in the mouse event system, it is possible to make incremental/decremental selections using the Shift and Ctrl keys. Instead of only one node at a time, the whole selection can now be dragged and moved.
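The following sketch shows one common way to test whether a node falls inside each of the three shapes, and how an incremental or decremental selection could combine with the current one. It is an assumption-laden illustration: the function names and the `mode` values are hypothetical, and the engine's actual hit-testing code may differ.

```python
def in_rectangle(p, x0, y0, x1, y1):
    """True if point p lies in the rectangle spanned by two corners."""
    x, y = p
    return min(x0, x1) <= x <= max(x0, x1) and min(y0, y1) <= y <= max(y0, y1)

def in_ellipse(p, cx, cy, rx, ry):
    """True if p lies in the axis-aligned ellipse with radii rx, ry."""
    x, y = p
    return ((x - cx) / rx) ** 2 + ((y - cy) / ry) ** 2 <= 1.0

def in_polygon(p, vertices):
    """Ray casting: count edge crossings of a horizontal ray from p."""
    x, y = p
    inside = False
    n = len(vertices)
    for i in range(n):
        xa, ya = vertices[i]
        xb, yb = vertices[(i + 1) % n]
        if (ya > y) != (yb > y):  # edge straddles the ray's height
            x_cross = xa + (y - ya) * (xb - xa) / (yb - ya)
            if x < x_cross:
                inside = not inside
    return inside

def update_selection(current, hit, mode):
    """Combine the nodes hit by the shape with the current selection."""
    if mode == "add":       # incremental selection
        return current | hit
    if mode == "remove":    # decremental selection
        return current - hit
    return set(hit)         # plain selection replaces the old one

nodes = {"a": (1, 1), "b": (5, 5), "c": (2, 3)}
triangle = [(0, 0), (4, 0), (0, 4)]
hit = {name for name, pos in nodes.items() if in_polygon(pos, triangle)}
selection = update_selection({"b"}, hit, "add")
```

Which modifier key maps to which mode is left open here, since only the keys themselves are named above.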

Background image

It is now possible to change and configure the background image. Settings are similar to the CSS properties such as ‘background position’ or ‘background repeat’.

Node shapes

It is possible to have a different shape for every node in the graph. Basic shapes include ‘circle’, ‘triangle’, ‘square’, etc., as well as up to 8 custom images that can be imported by the user. Nodes can have their shapes defined in the import file or set directly through the context menu.

Better 3D

Work has been done on a better way to control the scene in 3D. Graphs are not naturally suited to 3D: for example, adding new nodes or moving them will never be perfectly intuitive. But for displaying the graph, some enhancements can be made.

Current status

The engine is still under development, but the API is slowly approaching its final state. The next step for the API will be to include as many configuration possibilities as the engine will allow. The underlying data structures will be optimized for performance.
As the project consists of two parts, API and engine, Antonio Patriarca, the mentor for this GSoC project and implementor of the engine, will write an article about rendering details in the near future.

(The rendering pipeline for edges is not fully finished, so the images shown are not the actual new look of Gephi.)

Explore the Marvel Universe Social Graph

This weekend at the Data In Sight hackathon in San Francisco, one of the winning teams worked with Gephi and the cool Marvel dataset provided by Infochimps.

From Friday evening to Sunday afternoon, Kai Chang, Tom Turner, and Jefferson Braswell were tuning their visualizations and had a lot of fun exploring the Spiderman or Captain America ego networks. They came up with these beautiful snapshots and created a zoomable web version using the Seadragon plugin. They won the “Most aesthetically pleasing visualization” category. Congratulations to Kai, Tom and Jefferson for their amazing work!

The datasets have been added to the wiki Datasets page, so you can play with them and maybe calculate some metrics like centrality on the network. The graph is pretty large, so be sure to increase your Gephi memory settings to more than 2 GB.

GSoC mid-term: Shader Engine

My name is Antonio Patriarca. The aim of my Shader Engine project is to implement a new visualization engine in Gephi based on modern graphics card capabilities. The current one is in fact designed to be very portable and only uses legacy features which are not optimal on modern hardware.

The OpenGL API, which Gephi uses, has changed a lot in recent years to follow the evolution of the hardware. Several parts of the first versions are now considered deprecated and are difficult to implement efficiently in the drivers. For this reason, it is often necessary to redesign graphics engines based on old versions to get the best from modern hardware.

In the old days, graphics primitives were rendered using a fixed pipeline implemented in hardware. Programmers had to set several states in order to control the rendering logic. Several tricks had to be invented to achieve unsupported effects. With each new OpenGL version, several new states were introduced to give more freedom and power to users, and the pipeline soon became very complex and difficult to manage. This is how the current Gephi visualization engine is still implemented.

Inspired by the success of the RenderMan Shading Language, graphics card manufacturers finally introduced some programmable stages into the graphics pipeline. These stages make it possible to define the rendering logic precisely and easily without taking care of a large number of states. At first, the only way to implement the programs running in these programmable stages, called shaders, was to use an assembly language. But the GLSL language was soon designed to simplify shader implementation in OpenGL, and similar high-level languages were introduced in the other APIs as well. The number of programmable stages in the graphics pipeline has increased since their first introduction, and modern GPUs are now often used as general-purpose stream processors. The new visualization engine will use shaders to render the nodes and edges of the graph in a more efficient way, while achieving a higher quality image.

Current Gephi Visualization Module issues

It is useful to discuss the problems of the current architecture to better understand the new engine design. The current Gephi Visualization Module was designed to be very portable and to use only features available in the first OpenGL version. It is therefore based on the fixed pipeline, and it sends the geometry to the GPU using immediate mode or display lists. The new engine will still support legacy hardware, but it will also include a more modern pipeline. Some issues will only be solved in the new pipeline.

Each renderable object (nodes, edges and labels) implements a common interface which takes care of how it is rendered and responds to changes in the graph. In the rendering loop, the engine iterates over each edge, node and label and calls the corresponding method to render it. This is not optimal on modern graphics cards for several reasons.
A modern GPU is composed of several cores which run the same program in parallel (they may execute different parts of it, though). Each time a state changes, the GPU may stall, waiting for every core to finish before updating the state. In the current design, several states change between consecutive renderable objects, causing the GPU to be idle most of the time. Moreover, each renderable object is composed of a small number of polygons and is unable to use all the available cores. The best way to render objects using a modern graphics card is to sort them by state and render them in batches. This is the strategy which will be implemented in the new visualization engine.
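The benefit of sorting by state can be illustrated without any graphics code. The sketch below models each draw call as a hypothetical `(state, object)` pair, counts how often the state would have to be switched when drawing in arrival order, and then groups the calls into one batch per state. It is only a model of the idea, not the engine's implementation.

```python
from itertools import groupby

def count_state_changes(draw_calls):
    """Count how often the state must be switched when drawing in order."""
    changes = 0
    previous = None
    for state, _ in draw_calls:
        if state != previous:
            changes += 1
            previous = state
    return changes

def batch_by_state(draw_calls):
    """Sort the calls by state and group them into one batch per state."""
    ordered = sorted(draw_calls, key=lambda call: call[0])
    return [(state, [obj for _, obj in group])
            for state, group in groupby(ordered, key=lambda call: call[0])]

# Hypothetical frame: objects arrive interleaved, as in a per-object loop.
draw_calls = [("node", "n1"), ("edge", "e1"), ("node", "n2"),
              ("edge", "e2"), ("label", "l1"), ("node", "n3")]

unsorted_changes = count_state_changes(draw_calls)   # one switch per call here
batches = batch_by_state(draw_calls)                 # one switch per batch
```

With batching, the number of state switches depends on the number of distinct states rather than on the number of objects.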

The OpenGL API is optimized to render 3D polygonal meshes. Therefore, the only ways to draw vector objects were to approximate them using polygons or textures. Both have issues when the objects are big enough, which can only be solved by using additional memory. The current engine draws circles and spheres using a polygonal approximation. Shaders give programmers a lot of additional freedom, and it is now possible to draw general objects by rendering a single polygon. The shapes generated using shaders look good regardless of the size of the objects. The engine will use this method where possible.

New Visualization Module design

[Figure: node_new]

The new visualization engine will be composed of three parts: the viewer, the controller and the data manager. The viewer controls the drawable objects and the main rendering loop. The controller updates both the camera and the selection area and handles all the window and mouse events. The data manager updates the graph information in the engine and decides what to draw each frame. Each part of the engine runs in a different thread. Therefore, it shouldn’t freeze when the graph is modified, as sometimes happens in the current engine.

The rendering system of the new visualization engine will use concrete and immutable representations of the renderable objects. In each data manager frame, after the graph has been updated, the data of each node, edge and label is inserted into a render batch, which is then inserted into the corresponding render queue. All the render queues (one for each renderable object type) are sent with the current camera to the viewer for rendering. The viewer then updates the current render queues and waits for the next display method call.

The rendering logic for each renderable object will be defined in the renderers. Each renderer will only be able to render a specific type of object, and it will render the entire render queue. There will be no way to directly render a single renderable object in the new visualization engine. The viewer will maintain a single renderer for each object type and will use them to render the current render queues. Two renderers will be implemented for each renderable object (one using OpenGL 2.0 and one using OpenGL 1.2), and the viewer will decide which renderer to use based on the installed graphics card and user preferences. A detailed description of all the renderers will soon be published in the specification.

New features

The majority of the changes in the new visualization engine will be invisible to the user, but there will also be some new features. The most important ones are the following (some of them will also require some work in other Gephi modules):

  • Different node shapes for each node. It will be possible to define a different shape for each node and therefore use node shapes to distinguish between the different groups of nodes.
  • Images as node shape. It will be possible to load images to use instead of the predefined node shapes.
  • Wider choice of 2D node shape. Several node shapes will be supported in addition to circles and rectangles.
  • Starting and ending color for edges. It will be possible to define a different color for the starting and ending part of the edges and use gradients to define edge directions.
  • Clusters implemented as halos or depth-aware borders around nodes. Clusters will be supported rendering additional halos or borders of increasing depth of the cluster color around each node in the cluster.
  • Per primitive anti aliasing (PPAA). Shaders allow the implementation of an anti aliasing strategy independent of multisampling. The new engine will implement these strategies to achieve a better image quality. In the figure above it is possible to observe the PPAA technique applied to the transition between the inner part and the border of a node in an old prototype without multisampling. The effect is slightly exaggerated.
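To give an idea of how a shape can be drawn on a single polygon with a smooth edge, here is a CPU-side sketch of the per-pixel computation a fragment shader could perform for a disc. It assumes a distance-based coverage with a `smoothstep` falloff (the standard Hermite interpolation also available in GLSL); the `feather` parameter and the function names are hypothetical, and the engine's actual shaders may use a different formula.

```python
import math

def smoothstep(edge0, edge1, x):
    """Hermite interpolation between 0 and 1, clamped outside the edges."""
    t = min(max((x - edge0) / (edge1 - edge0), 0.0), 1.0)
    return t * t * (3.0 - 2.0 * t)

def disc_coverage(px, py, cx, cy, radius, feather=1.0):
    """Opacity of a pixel for a disc: 1 inside, 0 outside, smooth at the rim.

    'feather' is the assumed width, in pixels, of the transition band.
    """
    d = math.hypot(px - cx, py - cy)
    return 1.0 - smoothstep(radius - feather / 2, radius + feather / 2, d)

def rasterize(size, radius):
    """Evaluate the coverage at every pixel center of a size x size quad."""
    c = size / 2
    return [[disc_coverage(x + 0.5, y + 0.5, c, c, radius)
             for x in range(size)] for y in range(size)]

image = rasterize(size=16, radius=6)
```

Because the coverage is computed from the distance to the center rather than from a polygonal outline, the rim stays smooth at any node size.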

Current status and future work

The engine is currently implemented as a standalone application independent of the rest of Gephi. It has been implemented in this way in order to focus on the most important parts of the engine from the start. Gephi was in fact not independent enough from the current visualization engine to immediately substitute the new one for it. When the project ends, the standalone application will not have all the desired features and will not be ready to be included in Gephi. The basic rendering system logic and the general engine architecture will probably be implemented in time, though.

After the GSoC project ends, the interaction layer with Gephi will be implemented and some parts of the code will be adapted to work with it. The project will be rewritten as a Gephi module and the required interfaces with the rest of the application will be implemented.

New Tutorial: Visualization in Gephi

A new tutorial is available about Visualization in Gephi. It will guide you through the basic and advanced visualization settings in Gephi and introduce selection, interaction and tools.

Gephi has a powerful and customizable visualization engine, but sometimes its capabilities are not obvious and the richness of some features remains hidden. For instance, text drawing is essential for visualization efficiency and needs to be controlled in the best way, in particular for large graphs.

This tutorial also explains in detail some of the important tools, including:

  • Shortest Path
  • Heatmap
  • Edit

Book store: Theory & Practice

Gephi now has its own book store!

It’s a great place for those who want to discover the key theories behind networks. It also has “Information Visualization” and “Programming” sections for those who want to master the subject and join the Gephi team. All these books give valuable information for understanding what guides the people who are developing Gephi and how concepts were put into practice.

The Network Science section refers to the science behind networks. It describes where networks are found in nature, society or organizations and helps understand their properties and patterns. Newcomers can start with Linked by Albert-László Barabási, the major reference, from 2001. You can also jump directly to Bursts, Barabási’s new book released a few days ago.

Social network theory views a network as actors who are connected by a set of relationships and is referred to as Social Network Analysis (SNA). As people increasingly use social networking websites (e.g. Facebook, YouTube, LinkedIn etc.), Social Network Analysis brings tools to study patterns of communication and communities. Social Network Analysis by Wasserman & Faust is a major reference.

These books are for all audiences, so researchers will find a clear state of the art of the domain in The Structure and Dynamics of Networks or Dynamical Processes on Complex Networks.

 

Data Visualization and Human-Computer Interaction (HCI) are at the base of Gephi. Learn about how visualization and interaction enhance understanding and knowledge discovery in complex data. Information Visualization and Visual Analytics refer to this domain as well.

One can easily find the roots of Visual Analytics in the book Readings in Information Visualization: Using Vision to Think by Stuart Card, Jock Mackinlay and Ben Shneiderman. Exploratory Data Analysis started with John Tukey, and has recently been extended by Andrienko.

The last stone is added with the knowledge of efficient programming, in particular how to design modular software based on services, with Practical API Design: Confessions of a Java Framework Architect. And as the human factor is central, take a look at The Mythical Man-Month: Essays on Software Engineering. Mathieu is a great fan 😉

 

We also encourage you to go further with more references at the Reader’s circle, and of course to send us book suggestions.

Nodes groups visualization

Loic Fricoteaux, a student in Computer Science, worked on a Gephi research project about the visual representation of node groups. The next big move for Gephi will be supporting clustering and hierarchy navigation, so this project helps keep the network readable during cluster exploration. We expect this functionality to be present in the first 0.7 release, later this year.

Implicit surface system on a sample graph

The purpose of this module is to enable users to highlight node groups in a graph, making them more visible. For instance, it can be useful to reveal patterns or just to highlight meaningful node aggregates in a network.

To visually represent node groups, each node of a group has a circular (by default) influence area whose strength decreases as the distance from the associated node increases. Thresholding the sum of all these influence areas defines a surface (not necessarily connected) that smoothly encloses every node. The final rendering can be extensively parameterized to match user preferences (size of influence areas, color, antialiasing, …).
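The summation and thresholding described above can be sketched as follows. The falloff function, the influence radius and the threshold value are illustrative assumptions; the module's actual influence function is not specified here.

```python
import math

def influence(distance, radius):
    """Assumed falloff: 1 at the node, decreasing smoothly to 0 at 'radius'."""
    if distance >= radius:
        return 0.0
    t = 1.0 - (distance / radius) ** 2
    return t * t

def field(point, nodes, radius):
    """Sum of the influences of all nodes of the group at a given point."""
    px, py = point
    return sum(influence(math.hypot(px - nx, py - ny), radius)
               for nx, ny in nodes)

def inside_surface(point, nodes, radius, threshold=0.25):
    """A point belongs to the group surface if the summed field is high enough."""
    return field(point, nodes, radius) >= threshold

group = [(0.0, 0.0), (4.0, 0.0)]
midpoint = (2.0, 0.0)

# Neither node alone reaches the threshold at the midpoint, but their
# summed influence does, so the surface bridges the two nodes smoothly.
alone = inside_surface(midpoint, group[:1], radius=2.5)
together = inside_surface(midpoint, group, radius=2.5)
```

Evaluating `inside_surface` over the pixels around a group yields a filled shape, and moving the two nodes further apart splits it into two separate blobs, which is why the surface is not necessarily connected.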

Computing surface

In order to allow real-time rendering of many node groups in big dynamic graphs, an optimized algorithm has been specifically developed to fill the corresponding enclosing surfaces while keeping a good-looking rendering.

Label Adjust

The Label Adjust functionality is a special type of algorithm. It is available through the Spatialization menu, but instead of working with node positions it works with labels. The aim is to automatically avoid label overlap.

Gephi is built to produce readable maps, which can be published or printed. By default, if a network has more than 1000 nodes it becomes hard to read, and even more so if labels are displayed. With the Label Adjust algorithm, the tedious work of manually moving each node of the network disappears.

While running, the algorithm slightly moves nodes whose labels overlap. For instance, with long labels like URLs this functionality is a real time-saver, and it is easy to use. Display labels as you want (font, size, color, …) and start the algorithm. It stops automatically when it detects no more label overlap, but you can also stop it by hand.
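The general idea of moving nodes until no labels overlap can be sketched as below. This is not Gephi's Label Adjust implementation: it assumes each label is an axis-aligned box centered on its node, pushes overlapping pairs apart by a fixed step, and uses hypothetical field names (`x`, `y`, `w`, `h`).

```python
def overlaps(a, b):
    """True if the label boxes of nodes a and b intersect."""
    return (abs(a["x"] - b["x"]) < (a["w"] + b["w"]) / 2
            and abs(a["y"] - b["y"]) < (a["h"] + b["h"]) / 2)

def adjust_labels(nodes, step=1.0, max_iterations=1000):
    """Push overlapping nodes apart; stop when no overlap remains.

    Returns the number of passes that moved at least one node.
    """
    for iteration in range(max_iterations):
        moved = False
        for i in range(len(nodes)):
            for j in range(i + 1, len(nodes)):
                a, b = nodes[i], nodes[j]
                if not overlaps(a, b):
                    continue
                dx, dy = b["x"] - a["x"], b["y"] - a["y"]
                if dx == 0 and dy == 0:
                    dx = 1.0  # arbitrary direction for coincident nodes
                norm = (dx * dx + dy * dy) ** 0.5
                ux, uy = dx / norm, dy / norm
                # Move both nodes slightly away from each other.
                a["x"] -= ux * step
                a["y"] -= uy * step
                b["x"] += ux * step
                b["y"] += uy * step
                moved = True
        if not moved:
            return iteration
    return max_iterations

nodes = [
    {"x": 0.0, "y": 0.0, "w": 40, "h": 10},
    {"x": 5.0, "y": 2.0, "w": 40, "h": 10},
    {"x": 100.0, "y": 100.0, "w": 40, "h": 10},
]
passes = adjust_labels(nodes)
```

Nodes whose labels do not overlap anything are left where the layout put them, so the overall shape of the map is disturbed as little as possible.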

Here is a small demo video of the feature running. Needless to say the algorithm is designed for larger networks.

http://vimeo.com/moogaloop.swf?clip_id=2242916&server=vimeo.com&show_title=1&show_byline=1&show_portrait=0&color=&fullscreen=1

This functionality is important when exporting map results from Gephi. The standard process for publishing network maps in Gephi would be something like this:
1. Spatialize the network, using for instance the Force Atlas algorithm.
2. Use filters to set node color and size depending on the network data.
3. Display labels and set text settings.
4. Use Label Adjust to make all labels readable.
5. Export.

Performance and scalability

In this article and some that will follow, I’ll focus on the application design and explain the technical points I think are relevant to understanding our approach.

Today’s subject is performance and scalability in the visualization. Although other modules need high performance too, the visualization of thousands of nodes and edges remains the major challenge. For visualization-centered software like Gephi, it is a key feature to which we attach great importance.

What you find in other network visualization software is either a poor visualization module or a stunning look without efficiency. For instance, Pajek has a very efficient core and you can achieve a lot with it, but problems start when you want to visualize your network. With GUESS you are able to produce nice maps, but the render engine starts suffering seriously above 2000 nodes. Gephi tries to combine an efficient render engine with good-looking results.

In 2007, when we started designing the current version of Gephi, we had in mind to create a new generation of network visualization software, and hence we made some choices that I will try to explain here.

Use multi-core
Already in 2007, and even more now, multi-core processors impose new rules on software development. They bring appealing features but also some risks. However, the technology is starting to mature: all current Top 10 video games have been designed as multi-threaded from the beginning. Multi-core brings performance, but does it bring scalability as well? I would say YES for Gephi because no matter how many processors you have, what can be parallelized will be parallelized. Graphics cards are not able to parallelize yet, but we expect this to be the case in the future.

Use GPU
You may notice we got some inspiration from video game development. By using the graphics card’s features, Gephi’s render engine leaves the processor free for other computing and uses GPU acceleration to speed up rendering. Apart from allowing 3D graphs, many drawings are sped up by the GPU in Gephi. I would say the only problem is compatibility, due to the high number of different graphics cards on the market.

Architecture
The visualization package architecture is a compromise between flexibility and performance. In 3D engine design it is nearly impossible to have both at the same time. Hence our engine has flexibility where it doesn’t harm efficiency.

These choices allow good visualization performance, and I would say it is only the beginning. Currently, up to 50,000 nodes or even more can be visualized, but this depends on the number of edges and how your graph is spatialized. Indeed, we use techniques to avoid computing the parts of the graph that are off screen:

[Figure: Octree cubes on a 3D graph]

The graph is cut into fixed volumes in a structure called an Octree. It is easy for the render engine to determine which cubes are hidden and which are visible. Only 3D objects in visible cubes are computed. As a consequence, performance doesn’t depend on how many nodes you have in your network but on how many you are currently visualizing. So even with huge graphs, zooming in and exploring parts of them remains fast.

Besides the current 3D engine, which is intended to work on all configurations, a new one will be developed in 2009. Using the latest graphics card features, a network size limit of around 200,000 nodes may be reached.
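Here is a minimal sketch of the idea, not Gephi's implementation. It assumes point-like nodes, a cube that splits into eight children when it holds too many points, and an axis-aligned box standing in for the visible region (a real engine would test cubes against the camera frustum). The class and parameter names are hypothetical.

```python
class Octree:
    """Cube that splits into 8 sub-cubes when it holds too many points."""

    def __init__(self, center, half, capacity=4, depth=0, max_depth=6):
        self.center, self.half = center, half
        self.capacity, self.depth, self.max_depth = capacity, depth, max_depth
        self.points = []
        self.children = None

    def _child_index(self, p):
        # One bit per axis: set when the point is on the positive side.
        cx, cy, cz = self.center
        return (p[0] >= cx) | ((p[1] >= cy) << 1) | ((p[2] >= cz) << 2)

    def _split(self):
        cx, cy, cz = self.center
        q = self.half / 2
        self.children = [
            Octree((cx + (q if i & 1 else -q),
                    cy + (q if i & 2 else -q),
                    cz + (q if i & 4 else -q)),
                   q, self.capacity, self.depth + 1, self.max_depth)
            for i in range(8)
        ]
        for p in self.points:
            self.children[self._child_index(p)].insert(p)
        self.points = []

    def insert(self, p):
        if self.children is None:
            if len(self.points) < self.capacity or self.depth == self.max_depth:
                self.points.append(p)
                return
            self._split()
        self.children[self._child_index(p)].insert(p)

    def _intersects(self, lo, hi):
        return all(self.center[k] - self.half <= hi[k]
                   and self.center[k] + self.half >= lo[k] for k in range(3))

    def query(self, lo, hi):
        """Return the points inside the box [lo, hi], skipping hidden cubes."""
        if not self._intersects(lo, hi):
            return []  # the whole cube is outside the visible region
        if self.children is None:
            return [p for p in self.points
                    if all(lo[k] <= p[k] <= hi[k] for k in range(3))]
        found = []
        for child in self.children:
            found.extend(child.query(lo, hi))
        return found

tree = Octree(center=(0, 0, 0), half=100)
for v in range(-90, 91, 10):
    tree.insert((v, v, v))
visible = sorted(tree.query((0, 0, 0), (50, 50, 50)))
```

Cubes that do not intersect the visible box are rejected with a single test, so the cost of a query follows the number of visible objects rather than the size of the whole graph.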