Gephi 0.9 released: Play with network data again


We’re proud to announce the release of the next major version of Gephi! This 0.9.0 version has been more than three years in the making, and today it brings new life to this project and to the graph/network analytics community at large. You can download it here for Windows, Mac OS X and Linux.

Gephi is the leading graph visualization software, known as the “Photoshop for networks”, and it is open-source and free. It has been downloaded more than a million times and is used by scholars and data scientists around the world. This new release brings new features for dynamic networks (i.e. networks over time) as well as major compatibility and performance improvements.

Since the last release in 2013, users have been facing compatibility issues with Java, which have been resolved with this release. Development slowed down three years ago but never stopped. In fact, in March 2013 the time had come to think about what Gephi 1.0 would look like, and we realized it needed a new core. This was by far the most complex project the team had to overcome, but the developers had a long-term vision and know that future developments will now rely on a robust and extensible core with world-class performance.

The world is increasingly complex and interconnected. Gephi’s purpose is to unfold this complex relational data in a way anyone can understand. It lets you visualize graph data as a map and create visualizations to support your narratives. State-of-the-art algorithms produce readable layouts, highlighting communities or influential nodes. Visual tools tweak colors and shapes to reveal hidden patterns in the data, helping to solve complex problems. More and more network maps appear in the online and offline press and in other communication media. They are spreading from science to business, art and activism. People are increasingly exposed to them and are learning how to read them. Gephi aims to accelerate this commoditization process by providing free and easy-to-use tools.

What’s new in Gephi 0.9?

The list is too long! The complete changelog for this version can be found on GitHub’s release page.

Next steps

There are a few immediate next steps coming up right after this release. Following up on the recent plugin development announcement, we’ll get in touch with plugin developers and start migrating plugins to this version. There are more than 80 plugins to update!

Then, we’ll identify and resolve new issues that appeared with this version. A future Gephi 0.9.1 release will come next year to address those.

A Gephi Toolkit release will also be made very soon so developers can update their applications built on top of Gephi’s modules. In the meantime, we’re interested in users’ feedback and want to hear from you on Twitter or Facebook. Issues can also be reported directly on GitHub, where the developers are.

Finally, thanks to all contributors and the community for supporting this project!

GSoC mid-term: Attributes Disk Store


My name is Ernesto Aneiros and during this Google Summer of Code I am working on the Attributes Disk Store.

The problem

In Gephi, Attributes are the data associated with nodes and edges. As graphs grow larger and larger, attributes occupy more memory, even though they are often not essential to the end-user who is only applying transformations or algorithms to the graph. These attributes can be of different types, from simple Java types (byte, char, int, String, etc.) to Gephi’s internal data types (lists of primitives or versioned data). The idea for the project was to have a combined memory/disk cache system to partially off-load these attributes to disk. The system should have a well-designed cache to handle heavy read access on the most-accessed elements.

The Solution (1st iteration)

Lucene is one of the most popular text search engines and a flagship Java open-source project. It is capable of handling and indexing millions of records while remaining performant. When the idea for the Attributes API cache was born, Lucene was first considered for the role of data store; as an added bonus, Gephi would get full-text search capabilities with almost no extra effort. When analyzing the problem, the following criteria were developed to judge a possible data store:

  1. Reliable (resistant to corrupted disk data, failed transactions, unexpected errors)
  2. Fast
  3. Transparent (minimum complexity exposed to the end-user)

While Lucene complies with items 2 and 3, its approach to dealing with corrupted indices is to rebuild them from scratch, therefore failing item 1. This doesn’t pose a problem for Lucene because in the context where it is supposed to be used (indexing of external information), the input data is always available separately from the index and can be accessed if needed. In Gephi, however, this is not the case. Once Attributes are loaded from disk they remain in memory until saved back to file. If an error occurs during a disk store transaction, the end-user can end up losing a day’s work, which is certainly not acceptable.

The Solution (2nd iteration)

After Lucene was ruled out as a contender for the data store, several options were considered, including embedded SQL databases and a combination of Ehcache plus BerkeleyDB. Both options bring a lot to the table, and embedded databases in Java have achieved impressive performance results compared to other mainstream database systems (see the H2 and HSQL projects, for example). Ehcache + BerkeleyDB wins when complexity is considered, however, since it introduces almost no translation layers between Gephi and the cache. Both solutions are good fits for the problem, but in the end the balance tilted in favor of Ehcache + BDB because of the complexity consideration.
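To make the layering concrete, here is a minimal, hypothetical sketch (not the actual Gephi code) of how an Ehcache front could sit on top of a BerkeleyDB-backed map: reads hit the in-memory cache first and fall back to the disk store on a miss, while writes go to both. The class name, cache name and key type are illustrative assumptions.

```java
import net.sf.ehcache.Cache;
import net.sf.ehcache.CacheManager;
import net.sf.ehcache.Element;

import java.util.Map;

// Hypothetical illustration of the cache-over-disk-store layering.
// 'diskMap' stands for a map backed by BerkeleyDB's collections API.
public class AttributeStore {

    private final Cache hotCache;                 // Ehcache: hot, in-memory entries
    private final Map<Integer, Object> diskMap;   // BerkeleyDB-backed map (persistent)

    public AttributeStore(CacheManager manager, Map<Integer, Object> diskMap) {
        this.hotCache = manager.getCache("attributes");
        this.diskMap = diskMap;
    }

    public Object get(int id) {
        Element e = hotCache.get(id);
        if (e != null) {
            return e.getObjectValue();            // hit: served from memory
        }
        Object value = diskMap.get(id);           // miss: read from the disk store
        if (value != null) {
            hotCache.put(new Element(id, value)); // promote to the hot cache
        }
        return value;
    }

    public void put(int id, Object value) {
        diskMap.put(id, value);                   // always persisted on disk
        hotCache.put(new Element(id, value));     // and kept hot in memory
    }
}
```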

Optimizing Ehcache and BerkeleyDB

Even though Ehcache provides a great deal of functionality and features, it was relatively easy to get up to speed with it. The documentation available online is very complete, with code samples and detailed explanations. In almost no time an in-memory cache was up, running and being tested. Traditionally, cache sizes have been specified as the maximum number of elements they can hold. The 2.5 BETA of Ehcache introduced a new feature that allows sizing caches by memory consumed instead of elements held. For our project this is a killer feature, since we can now expose a single option to the user, letting them specify how much memory the cache should consume. Even though using the new feature proved a little more complicated than expected, we obtained great feedback from the Ehcache community, especially from alexsnaps and Mike Allen, who helped us solve the issues we were having.
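As a rough sketch of what that single option could look like, assuming Ehcache 2.5’s byte-based sizing API (cache name and the 256 MB figure are illustrative, not Gephi’s actual settings), a cache can be sized by heap memory rather than element count:

```java
import net.sf.ehcache.Cache;
import net.sf.ehcache.CacheManager;
import net.sf.ehcache.config.CacheConfiguration;
import net.sf.ehcache.config.MemoryUnit;

public class CacheSizingExample {
    public static void main(String[] args) {
        // One user-facing knob: how much heap the attribute cache may use.
        CacheManager manager = CacheManager.create();
        CacheConfiguration config = new CacheConfiguration()
                .name("attributes")
                .maxBytesLocalHeap(256, MemoryUnit.MEGABYTES); // sized by memory, not element count
        Cache attributeCache = new Cache(config);
        manager.addCache(attributeCache);
    }
}
```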

BerkeleyDB, on the other hand, is a very complex piece of software. With years of development under its belt, BDB has evolved into a very robust and flexible database. In fact, it is so flexible that it can be used as a full-blown database supporting queries, as a simple key/value datastore, or with a front-end that exposes a Java collections map, which greatly simplifies its use. All of this flexibility does not come for free though: configuring and optimizing BerkeleyDB requires delving into details about transactions, buffers, log file sizes and BDB internals. However, the tools are there and the information provided is quite good, especially the FAQ and the optimization section.
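For illustration only, here is a hedged sketch of the kind of BerkeleyDB JE environment tuning the paragraph refers to (cache size, transactions, log file size); the concrete values are placeholders, not the settings used by the disk store:

```java
import com.sleepycat.je.Environment;
import com.sleepycat.je.EnvironmentConfig;

import java.io.File;

public class DiskStoreEnvironment {
    public static Environment open(File dir) {
        EnvironmentConfig config = new EnvironmentConfig();
        config.setAllowCreate(true);               // create the environment if it doesn't exist
        config.setTransactional(false);            // plain writes; skip transaction overhead
        config.setCacheSize(32 * 1024 * 1024);     // cap BDB's internal cache at ~32 MB
        // Smaller log files make cleaning cheaper, at the cost of more files on disk.
        config.setConfigParam(EnvironmentConfig.LOG_FILE_MAX,
                String.valueOf(16 * 1024 * 1024));
        return new Environment(dir, config);
    }
}
```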

Integration with Gephi

Since ease of use and transparency are important considerations for the end-user of Gephi, only minimal configuration options are exposed in the disk cache’s preference panel, but an Advanced tab provides more control for those who want it.

The General settings tab, where cache can be enabled or disabled and the memory usage configured.


The advanced settings tab allows a more advanced user to configure several of BerkeleyDB’s options.

The Disk Store in Action


Memory consumption without the disk store. It reaches 400MB.


Memory consumption with the store: after loading, it drops below 400 MB. Note how the load time increased due to disk operations, a trade-off to consider when using the store.

Known issues

The project is still in development. Memory saving is the main goal of the disk store project, but the results are not yet good enough, for several reasons.

While BerkeleyDB provides a very convenient way of storing bytes on disk, it is still database-oriented software and therefore not the most suitable solution for our project: it uses a large amount of memory for caching data and for building and maintaining its index (features desirable for a database but not for this project).

Trying to reduce BerkeleyDB’s memory usage through its settings produces quite different results on different systems, or even on the same system. The benchmark above shows decent results, but that is not always the case. Better control of maximum heap growth can be observed, but there are still memory usage peaks that prevent better savings.

The conclusion is that it is a priority to replace BerkeleyDB with another disk persistence system, or to create one specifically designed for the Gephi disk store.

It is also known that graphs with more complex data like strings or lists will always benefit more from a disk storage system than graphs with simple data like integers or booleans. An idea is to always keep simple data in memory, because indexing it on disk would need as much memory anyway, or even more.

On the other hand, Gephi works with and was designed around in-memory data structures. Adding a cache/disk store to the system is bound to create integration issues with other parts of the codebase. For example, the GEXF file importer tends to load large portions of the graph file into memory while parsing it, which is problematic in memory-constrained environments, and using the cache there will not make a difference. Another issue concerns the handling of data in files in the .gephi format: due to the way .gephi files are imported, some integration problems still need to be debugged for the disk store to work properly.

Looking to the future

This GSoC project is only scratching the surface of what a memory + disk cache system can achieve. In the future, BerkeleyDB could be replaced with another persistence provider, and it doesn’t necessarily have to persist locally to disk. BerkeleyDB could be replaced, for example, with a datastore like Cassandra, or maybe some RDBMS.

Conclusion

While the Data Store API introduced by this project is still taking its first steps and can evolve significantly, it has helped iron out many issues and has paved the way for bigger and better improvements. Working during this summer has been a great experience, and I have been able to work with great mentors like Eduardo Ramos, who knows the Gephi codebase inside and out. I hope the work of all of Gephi’s GSoC’ers becomes the starting point for many new features and enhancements that the community will surely appreciate. Happy coding and see you next summer!

Performance and scalability

With this article and the following ones, I’ll focus on the application design and explain the technical points I think are relevant to understanding our approach.

Today’s subject is performance and scalability in the visualization. Although other modules also need high performance, the visualization of thousands of nodes and edges remains the major challenge. For visualization-centered software like Gephi, it is a key feature to which we attach great importance.

What you find in other network visualization software is either a poor visualization module or a stunning look that is not efficient. For instance, Pajek has a very efficient core and you can achieve a lot with it, but problems start when you want to visualize your network. With GUESS you can produce nice maps, but the render engine starts suffering seriously above 2,000 nodes. Gephi tries to combine an efficient render engine with good-looking results.

In 2007, when we started designing the current version of Gephi, we had in mind that we wanted to create a new generation of network visualization software, and hence we made some choices I will try to explain here.

Use multi-core
Already in 2007, and even more so now, multi-core processors impose new rules on software development. They bring appealing features but also some risks. However, the technology is starting to mature: all of the current top 10 video games have been designed as multi-threaded from the beginning. Multi-core brings performance, but does it bring scalability as well? I would say YES for Gephi, because no matter how many processors you have, what can be parallelized will be parallelized. Graphics cards are not able to parallelize this way yet, but we expect this will be the case in the future.

Use GPU
You may notice that we take some inspiration from video game development. By using the graphics card’s features, Gephi’s render engine leaves the processor free for other computations and uses GPU acceleration to speed up rendering. Apart from allowing 3D graphs, many drawing operations are sped up by the GPU in Gephi. I would say the only problem is compatibility, due to the high number of different graphics cards on the market.

Architecture
The visualization package architecture is a compromise between flexibility and performance. In 3D engine design it is quite impossible to have both at the same time. Hence our engine has flexibility where it doesn’t harm efficiency.

These choices allow good visualization performance, and I would say it is only the beginning. Currently, up to 50,000 nodes can be visualized, and even more, but this depends on the number of edges and on how your graph is spatialized. Indeed, we use techniques to avoid computing the parts of the graph that are out of the screen:

Octree cube partition, shown on a 3D graph

The graph is cut into fixed volumes in a structure called an octree. It is easy for the render engine to determine which cubes are hidden and which are visible. Only 3D objects in visible cubes are computed. As a consequence, performance doesn’t depend on how many nodes you have in your network but on how many you are currently visualizing. So even with huge graphs, zooming in and exploring parts of them remains fast.
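As a simplified, hypothetical illustration of the idea (not Gephi’s actual engine code), visibility can be decided per octree cube, so only objects in visible cubes are handed to the renderer; the class and field names below are illustrative:

```java
import java.util.ArrayList;
import java.util.List;

// Simplified sketch of octree-based culling: the graph volume is split
// into fixed cubes, and only objects in cubes visible on screen are drawn.
class Octant {
    boolean visible;                                 // updated by the engine from the current viewport
    final List<Object> objects = new ArrayList<>();  // nodes/edges assigned to this cube
}

class Octree {
    private final List<Octant> octants = new ArrayList<>();

    List<Object> visibleObjects() {
        List<Object> result = new ArrayList<>();
        for (Octant octant : octants) {
            if (octant.visible) {                    // hidden cubes are skipped entirely
                result.addAll(octant.objects);
            }
        }
        return result;                               // only these objects get computed and drawn
    }
}
```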

Besides the current 3D engine, which is intended to work on all configurations, a new one will be developed in 2009. Using the latest graphics card features, a network size limit of around 200,000 nodes may be reached.