Jan 31 2013
 

The open data and open access movements converge at their best in web-based repositories where researchers can freely upload their datasets: the datasets can then be viewed, cited, and downloaded by anyone. Great!

Two major players are emerging: figshare and datahub.io (aka CKAN)

Contrary to many other repositories, they are not specific to a scientific field – anyone is welcome to upload. They are international, serving researchers based in any country. And they made a big effort in usability – for instance, it is not necessary to fill in long forms or get any form of pre-approval: you just create an account and upload your data. Again, great!

 
Yet, these 2 major players, figshare and datahub.io (aka CKAN), still lack some important features, and this slows down their adoption. In my case, this is to the point that I chose not to use figshare, and can use datahub.io only in a limited way. Let’s summarise the situation:

Strengths of figshare:

- attributes a DOI URL to your datasets. Your datasets can be referenced with a much respected URL, as in “http://dx.doi.org/10.6084/m9.figshare.154972”. Brilliant!

- great website: the user interface is engaging.

- strong community engagement: figshare has just announced a partnership with PLOS ONE to host all the data of PLOS ONE publications. Also, they launched an “advisor scheme”, which basically recruits evangelists / community supporters / you name them to spread the word about figshare. Great stuff: it is much needed to grow a community and a knowledge base around figshare.

Weaknesses of figshare:

- individual files inside the datasets don’t get unique identifiers. Let’s say you create a dataset on figshare, and upload two files in it. These two files won’t get unique, separate URLs (you don’t get http://doi.org/file1 and http://doi.org/file2). This is a major impairment, for three reasons.

  • First, you can’t refer to files individually, say, if you would like to cite them in publications. For instance, if your dataset is a collection of pictures, you won’t be able to refer to each of these pictures individually – they simply don’t exist as independent resources on figshare.
  • Second, you can’t add separate metadata to each file individually. That’s just wrong – each file may have an author, a date of creation, etc., that needs to be referenced.
  • Third, in the absence of a URL for each file, the manipulation of these files through programmatic means (= through an API) becomes much more problematic. You have to handle each file by performing operations on the dataset, which is the only entity that has a stable reference. I don’t have a specific use case, but it does not seem like a good model (it might slow things down or make them more complex than they need to be; also, what happens if you’d like to ascribe a file to 2 different datasets?).

[note: and is it a corner case I describe here? Certainly not. Pictures in a collection is just one example. The point is, you can’t limit from the start, in the design of your application, what is a sensible use case for datasets or not. Scientists come with data of all stripes. The need for the granularity, cite-ability and permanence of resources should be assumed, not ignored.]

Strengths of datahub.io:

- great website: I personally don’t like it as much as figshare, but by a tiny margin. Very intuitive to use (this is not a trivial question – this is where everything starts).

- attributes unique urls to each of your files! Yeah! (see the weaknesses of figshare to see why it is crucial)

Weaknesses of datahub.io:

- datahub.io is actually just the web interface to a core platform called CKAN. This platform is complex and its target users are people with programming skills. It can be downloaded, installed and run on private or institutional servers to service a repository. Great! Except that, in my honest opinion, having these two services and two audiences (the website datahub.io for users, and the CKAN platform for admins) creates confusion in key places. First, the online documentation provided is mainly for programmers using CKAN, and it is hard to find a specific doc that addresses only the needs of the users of datahub.io.

- Second, and here again these are just personal impressions, I got the feeling that users of the website version of CKAN were not that strong a community, compared to the institutional users of the CKAN platform. This impression comes from having posted a relatively simple question on the CKAN mailing list, followed by the same question on Stackoverflow, which has so far received no answer. To be clear, people at CKAN were immensely helpful and did try to help; it is just that the knowledge base seems stronger around CKAN than around datahub.io.

Which one did I choose?

Because I think that you should start with the right data model, since an incorrect one is the hardest thing to fix down the road, I have chosen datahub.io to host my datasets. I just hope that they will grow a stronger community around it (maybe by differentiating it better from CKAN, which is a different product after all). Or that figshare will reconsider its data model?

Want to help?

So, as I said, I have this unanswered, super easy question about using the API for datahub.io. Earn 100 reputation points easily on Stackoverflow (and be assured of my eternal gratitude): go answer it!

 

I am Clement Levallois, a social scientist and data visualization specialist currently based at Erasmus University Rotterdam, The Netherlands.

Did you like this post? Visit the website of the consultancy connected to this blog (www.exploreyourdata.com), and share it on Twitter!

Jan 29 2013
 

The background

- I am a user of MongoDB with Morphia, enjoying the simplicity of its query language as compared to SQL.
- Doing network analysis, I am thinking of moving from MongoDB to one of the graphDBs out there.
- Coding in Java, Neo4J is an obvious candidate for such a db.
- Then I look at Cypher, a query language used with Neo4J, and I find it looks like a bloated SQL syntax.

My argument against Cypher

If we move away from relational db and SQL, let’s take this opportunity to invent a query language which is intuitive to use!

I mean, MongoDB has completely dropped the “Select From Where” logic, and that’s such a relief! Why would Cypher go back to this?

So, I ranted about that on Twitter:

and:

I received a lot of polite and interesting feedback, basically asking that I elaborate and suggest something instead of just criticizing. As a first clarification, by “UX and UI constraints” I mean that a query language should be thought of just like a visual interface: the flow of information, the pleasure of use, the intuitiveness of manipulation should be as paramount as when designing a web page. You want people to adopt your product, and for that you make it easy for them to use it. You think I am exaggerating? Just look at how MongoDB’s super easy javascript client is rapidly making it the natural database for web apps. That’s not due to MongoDB’s technical backend, but to the fact that it speaks the language of its users – javascript. So, shouldn’t Neo4J speak the language of its users too?

A proposal

On its blog Neo4J presents this example of a db:

from this blog post on Neo4j Blog

1st query

Look for “a city where someone from Neo Technology lives that speaks English and has Three as his operator in the city that he lives in.”

Cypher version:

start
    neo=node:node_auto_index(name="Neo Technology"),
    english=node:node_auto_index(name="English"),
    three=node:node_auto_index(name="3")
match
    person-[:LIVES_IN]->city-[:LOCATED_IN]->country,
    person-[:HAS_AS_HOME_OPERATOR]->three-[:OPERATES_IN]->country,
    person-[:SPEAKS]->english,
    person-[:WORKS_FOR]->neo
return
    city.name, person.name

What about this alternative?

(note: this is directly inspired from the Morphia query syntax)

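// Hypothetical API, modelled on Morphia's fluent queries; "db" is an assumed handle to the graph.
Query<Person> query = db.find(Person.class);
query.field("speaks").contains("English");
query.field("worksFor").equal("Neo Technology");
query.field("cityLivesIn.operatorBrands").contains("Three");
List<Person> listPersons = query.asList();
// City of the first matching person (assumes at least one match).
City cityQueried = listPersons.get(0).livesIn();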

2nd query

Look for “two people in the same countries but on different home operators that call, mail or text each other”

Cypher version:

start
    country=node:node_auto_index(name="Country")
match
    samecountry-[:IS_A]->country,
    person-[:LIVES_IN]-()-[:LOCATED_IN]-samecountry,
    otherperson-[:LIVES_IN]-()-[:LOCATED_IN]-samecountry,
    person-[:HAS_AS_HOME_OPERATOR]->operator,
    otherperson-[:HAS_AS_HOME_OPERATOR]->otheroperator
where
    otherperson-[:CALLS|TEXTS|EMAILS]-person
    AND
    operator<>otheroperator
return
    distinct person.name, samecountry.name;

What about this alternative?

(note: this is directly inspired from the Morphia query syntax)

// Hypothetical API, modelled on Morphia; countryLivesIn and operatorBrands stand for
// the queried person's own country and home operator.
Query<Person> query = db.find(Person.class);
query.or(
    field("personsTexted.countryLivesIn").equal(countryLivesIn).field("personsTexted.operatorBrands").notEqual(operatorBrands),
    field("personsCalled.countryLivesIn").equal(countryLivesIn).field("personsCalled.operatorBrands").notEqual(operatorBrands),
    field("personsEmailed.countryLivesIn").equal(countryLivesIn).field("personsEmailed.operatorBrands").notEqual(operatorBrands)
);
Set<Person> setPersons = new HashSet<>();
setPersons.addAll(query.asList());

These 2 examples are not meant as perfect solutions; in particular, the second example contains repetitions that should be avoided. I guess the basic argument is: let’s not invent a new language; instead, let’s build on the languages already there to write intuitive queries.
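To give a feel for how such a fluent style reads in practice, here is a minimal, self-contained sketch in plain Java that filters an in-memory list instead of querying a graph database. Everything in it is an assumption made for illustration: the classes Query and Field, the fields of the sample Person class and the sample people are all hypothetical, and properties are read through lambdas rather than through string names as in the Morphia-inspired examples above. It is not Morphia's API, nor anything offered by Neo4J.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Objects;
import java.util.Set;
import java.util.function.Function;
import java.util.function.Predicate;

class FluentQuerySketch {

    /** Hypothetical fluent query over an in-memory collection. */
    static final class Query<T> {
        private final Collection<T> source;
        private Predicate<T> filter = item -> true; // no condition yet: everything matches

        Query(Collection<T> source) {
            this.source = source;
        }

        /** Starts a condition on one property, read through a getter. */
        <V> Field<T, V> field(Function<T, V> getter) {
            return new Field<>(this, getter);
        }

        /** Returns the items that satisfy all conditions added so far. */
        List<T> asList() {
            List<T> result = new ArrayList<>();
            for (T item : source) {
                if (filter.test(item)) {
                    result.add(item);
                }
            }
            return result;
        }
    }

    /** One property of the queried type, waiting for its condition. */
    static final class Field<T, V> {
        private final Query<T> query;
        private final Function<T, V> getter;

        Field(Query<T> query, Function<T, V> getter) {
            this.query = query;
            this.getter = getter;
        }

        /** Keeps items whose property equals the expected value. */
        Query<T> equal(V expected) {
            query.filter = query.filter.and(item -> Objects.equals(getter.apply(item), expected));
            return query;
        }

        /** Keeps items whose collection-valued property contains the element. */
        Query<T> contains(Object element) {
            query.filter = query.filter.and(item -> {
                Object value = getter.apply(item);
                return value instanceof Collection && ((Collection<?>) value).contains(element);
            });
            return query;
        }
    }

    /** Simplified, hypothetical stand-in for the Person class of the examples. */
    static final class Person {
        final String name;
        final Set<String> speaks;
        final String worksFor;

        Person(String name, Set<String> speaks, String worksFor) {
            this.name = name;
            this.speaks = speaks;
            this.worksFor = worksFor;
        }
    }

    public static void main(String[] args) {
        List<Person> people = Arrays.asList(
            new Person("Ann", new HashSet<>(Arrays.asList("English", "Swedish")), "Neo Technology"),
            new Person("Bob", Collections.singleton("Swedish"), "Neo Technology"));

        List<Person> found = new Query<>(people)
            .field(p -> p.speaks).contains("English")
            .field(p -> p.worksFor).equal("Neo Technology")
            .asList();

        for (Person p : found) {
            System.out.println(p.name);
        }
    }
}
```

The point of the sketch is only the shape of the calling code in main: each condition is one short line, and no new query language has to be learned.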



Note: the Java classes used in the 2 examples:

// Domain classes assumed by the 2 query examples above.
class Person {
    private Set<Language> languagesSpeaks;
    private City cityLivesIn;
    private Company companyWorksIn;

    public Set<Language> speaks() {
        return languagesSpeaks;
    }

    public City livesIn() {
        return cityLivesIn;
    }

    public Company worksIn() {
        return companyWorksIn;
    }
}

class City {

    // Operators present in this city.
    private Set<Operator> operatorBrands;

    public Set<Operator> operatorBrands() {
        return operatorBrands;
    }
}

 

I realize that this proposal for a Java / Javascript style syntax for queries probably opens lots of technical issues. Sure, but these issues should come second, not as the primary constraint on the design of the language.
Enough said, what’s your view on that? Let’s continue the conversation in the comments or on twitter (@seinecle)

Jan 18 2013
 

More and more frequently, I receive emails asking for some general advice on network analysis (sometimes called “linked data analysis”) and visualization.

Readers of this blog might be interested in my recommendations.

[disclaimer to network analysts and dataviz -ers reading: hey, there is so much more out there, agreed! That's just a personal and partial list...]

start

Tools for linked data analysis:

Excel for very basic statistical counts,
Pajek (http://pajek.imfm.si/doku.php) for manipulation of the network, if it is huge (like, if you would like to filter out some nodes from a network with 100,000 nodes)
R (http://www.r-project.org/) for more advanced statistics,
still R for modeling, with this package: http://statnet.csde.washington.edu/index.shtml
UCINET (https://sites.google.com/site/ucinetsoftware/home) for network-oriented statistics and modeling, though the network should remain small I think (20,000 nodes or less?)

Tools for viz of linked data:

NodeXL (http://nodexl.codeplex.com/) if you are more comfortable with Excel-based software (but less powerful than Gephi, IMHO – disclaimer: I am a member of the Gephi Community support team)
For other forms of viz, it all depends on your intentions. If your final product should be seen from an Internet browser, then look at Javascript libraries like D3 (http://d3js.org), Sigmajs (http://sigmajs.org), ProcessingJs (http://processingjs.org) and so many others.
The best flexibility would be afforded by having a data visualizer with programming skills on board: that would give you access to the wonders of Processing (www.processing.org) and other forms of data visualization. Processing is quite popular so this is not a stretch to think you could find a resource person for it.
Finally, but that’s actually a first step: how do you get from your data, probably in the form of a csv file or something containing thousands of lines of data, to a network?
Surprisingly, there is not much (or nothing, even) around to do it. For this reason I created Eonydis, a small program to transform transactional data (such as a financial transaction between A and B, happening on day x) into a dynamic network. A dynamic network is simply a network which contains information about time. Download it here:
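To make the idea of "transactional data to dynamic network" concrete, here is a minimal Java sketch. It is not the code of Eonydis and does not use its file format: the class names, the three-column "source,target,date" layout and the ISO date format (such as 2013-01-15) are all assumptions made for this illustration. Each line becomes an edge that carries its date, and edges are then grouped by month so that the network can be looked at slice by slice.

```java
import java.time.LocalDate;
import java.time.YearMonth;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class TransactionsToNetwork {

    /** One edge of the dynamic network: who, with whom, and on which day. */
    static final class TimedEdge {
        final String source;
        final String target;
        final LocalDate day;

        TimedEdge(String source, String target, LocalDate day) {
            this.source = source;
            this.target = target;
            this.day = day;
        }
    }

    /** Parses lines of the assumed form "source,target,yyyy-MM-dd". */
    static List<TimedEdge> parse(List<String> lines) {
        List<TimedEdge> edges = new ArrayList<>();
        for (String line : lines) {
            String[] parts = line.split(",");
            if (parts.length < 3) {
                continue; // skip rows that do not have the three expected columns
            }
            edges.add(new TimedEdge(parts[0].trim(), parts[1].trim(),
                    LocalDate.parse(parts[2].trim())));
        }
        return edges;
    }

    /** Groups edges into monthly slices, ordered chronologically. */
    static Map<YearMonth, List<TimedEdge>> byMonth(List<TimedEdge> edges) {
        Map<YearMonth, List<TimedEdge>> slices = new TreeMap<>();
        for (TimedEdge edge : edges) {
            slices.computeIfAbsent(YearMonth.from(edge.day), month -> new ArrayList<>()).add(edge);
        }
        return slices;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
            "A,B,2013-01-15",
            "A,C,2013-01-20",
            "B,C,2013-02-03");
        for (Map.Entry<YearMonth, List<TimedEdge>> slice : byMonth(parse(lines)).entrySet()) {
            System.out.println(slice.getKey() + ": " + slice.getValue().size() + " edge(s)");
        }
    }
}
```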

People who can help

=> If you are primarily interested in network analysis, not the viz:
subscribe to the mailing list specializing in social networks (anyone can) and post a question asking for help, it will surely return proposals, especially if you have a budget!
Try also this LinkedIn group on social network analysis:
=> If you are primarily interested in the viz:

- you can get help on specific issues on the forum of Gephi, which is generally quite active (I’m on it! ;-) ):http://forum.gephi.org

- you can contract professional dataviz specialists: a very focused place to post your request is this Google group:
- finally, I have a consultancy which can help you define the specs of your project:
* which technologies to use to maximize impact and lower barriers (cost, maintenance, compatibility on devices, …),
* which dataviz agency / freelancer would best fit your project,
* suggestions of possible extensions – maybe your datasets are even richer than you thought?

Books that could be on your shelves:

These books are not free – but let’s imagine you borrow them from your local library?
The reference book for network analysis is still the one by Wasserman and Faust:
But it’s a kind of technical reference book. To get you started you might prefer this textbook which takes NodeXL as a primary tool, and focuses on social networks, but that will still give you the essentials for linked data in general:
For visualization, Andy Kirk is a trusted person in the dataviz community and he has a book just out (on dataviz in general):
Another book, also by a reference in the community (not focused on linked data):

http://books.google.nl/books?id=CB9XRIv9oigC&dq=Visualize-This-FlowingData-Visualization-Statistics

Training

In the new trend of Massive Open Online Courses (MOOCs) you have two relevant (free) courses by the best scholars in their field:
Lada Adamic on Coursera: https://www.coursera.org/course/sna
and
Katy Börner from Indiana University: http://ivmooc.cns.iu.edu/

Other lists of tools:

I found these two lists on data analysis and visualization particularly useful:

Next

Was this post helpful? Follow me on Twitter for more frequent news on these topics and more! => @seinecle.

Clement
Jun 05 2012
 

Network visualizations are more engaging when they show in real time the graph expanding in space, progressively revealing the structure in the data.

The problem

For networks with more than a couple of thousand nodes and edges, rendering these real-time dynamics can be very slow, which makes the whole thing boring to watch rather than an insightful experience.

Developers are working hard at solving this problem, taking the direction of GPGPU computing to speed things up. GPGPU (general-purpose computing on graphics processing units) means that the graphics card of the computer performs the computations for the layout, and this indeed accelerates things immensely. That’s the solution chosen by Paul-Antoine Bittner, who developed Parallel Force Atlas for Gephi (test video here), by Andrei Kashcha, who developed VivaGraphJS, a javascript library doing blazingly fast visualizations of networks in web browsers (check here), and by Gephi, which has a project that made progress in this direction, too.

Yet, these solutions are still too slow at the moment for graphs above 10,000 nodes and edges – the visualization gets jerky, and we lose the smooth movements that should be part of the experience. Try this:

 

Solutions exist for a quicker layout, like OpenOrd, but I don’t find them attractive because they have a non-intuitive mechanism – how are you going to explain this whole “liquid phase” stuff to the user? Or we can wait for improvements in GPGPU algos. Some amazing ones seem to be already available (see here), but I have not seen them used yet in the open source community. But here is another road.

A proposal to speed up the layout

Basic idea:

- Force directed algorithms produce a layout where nodes sitting in the same region of the space tend to belong to the same community. Why not use this information to move nodes belonging to the same community to the same region in space directly?

 Detailed steps:

1. Run a community detection algo on the graph
   - this should be an algo which detects overlapping communities (a node can belong to several clusters).
   - an example of such an algo is the clique percolation method.
   - the higher the number of communities detected, the better.

2. Consider each community as one big super-node
   - set the radius of this super-node relative to the number of nodes in the community.
   - nodes belonging to several communities are considered as super-nodes by themselves.

3. Run a force-directed algo to spatialize these super-nodes.

4. Run a force-directed algo to spatialize nodes within super-nodes.

Steps 3 and 4 run in parallel.
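Step 2 can be sketched in a few lines of Java. This is only an illustration under stated assumptions, not an implementation of the full proposal: the community detection of step 1 is taken as given (a map from community id to member nodes), the class and method names are hypothetical, and the radius is set to the square root of the number of members, which is just one possible way of making it "relative to the number of nodes" (it keeps the area of the disc proportional to the node count).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

class SuperNodes {

    /** A group of original nodes that the first layout pass treats as one node. */
    static final class SuperNode {
        final String id;
        final Set<String> members;
        final double radius;

        SuperNode(String id, Set<String> members, double radius) {
            this.id = id;
            this.members = members;
            this.radius = radius;
        }
    }

    /** Assumption: radius grows with the square root of the member count. */
    static double radiusFor(int memberCount) {
        return Math.sqrt(memberCount);
    }

    /**
     * Builds super-nodes from possibly overlapping communities
     * (community id -> ids of the nodes it contains).
     */
    static List<SuperNode> build(Map<String, Set<String>> communities) {
        // Count in how many communities each node appears.
        Map<String, Integer> membershipCount = new TreeMap<>();
        for (Set<String> members : communities.values()) {
            for (String node : members) {
                membershipCount.merge(node, 1, Integer::sum);
            }
        }

        List<SuperNode> result = new ArrayList<>();

        // One super-node per community, holding the nodes that belong to it alone.
        Map<String, Set<String>> sorted = new TreeMap<>(communities);
        for (Map.Entry<String, Set<String>> community : sorted.entrySet()) {
            Set<String> exclusive = new TreeSet<>();
            for (String node : community.getValue()) {
                if (membershipCount.get(node) == 1) {
                    exclusive.add(node);
                }
            }
            if (!exclusive.isEmpty()) {
                result.add(new SuperNode("community:" + community.getKey(),
                        exclusive, radiusFor(exclusive.size())));
            }
        }

        // Nodes shared by several communities become super-nodes by themselves.
        for (Map.Entry<String, Integer> entry : membershipCount.entrySet()) {
            if (entry.getValue() > 1) {
                result.add(new SuperNode("node:" + entry.getKey(),
                        Collections.singleton(entry.getKey()), radiusFor(1)));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Set<String>> communities = new HashMap<>();
        communities.put("c1", new HashSet<>(Arrays.asList("a", "b", "c")));
        communities.put("c2", new HashSet<>(Arrays.asList("c", "d")));
        for (SuperNode s : build(communities)) {
            System.out.println(s.id + " " + s.members + " radius=" + s.radius);
        }
    }
}
```

The resulting list is what step 3 would spatialize, while step 4 would work on the members of each super-node separately.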

 

Advantages:

- Computations are reduced immensely: the position of each node does not need to be compared with the position of every other node, which is the real speed killer in traditional  force directed layouts. Here, the position of each node is only compared with the nodes from the same community.
- The layout still feels very natural, because at the macrolevel (communities) and microlevel (inside communities), the logic of force-directed algo still applies.
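A back-of-the-envelope count shows the size of the saving. The Java sketch below makes simplifying assumptions that are mine, not measurements: every pair of nodes is compared once per iteration in the traditional layout (no spatial optimisation), communities do not overlap, and the method names are hypothetical.

```java
class ComparisonCount {

    /** Number of distinct pairs among n items: n * (n - 1) / 2. */
    static long pairs(long n) {
        return n * (n - 1) / 2;
    }

    /** Traditional layout: every node is compared with every other node. */
    static long flat(long nodeCount) {
        return pairs(nodeCount);
    }

    /** Two-level layout: super-nodes among themselves, plus nodes inside each community. */
    static long twoLevel(long[] communitySizes) {
        long total = pairs(communitySizes.length);
        for (long size : communitySizes) {
            total += pairs(size);
        }
        return total;
    }

    public static void main(String[] args) {
        long[] sizes = new long[100];
        java.util.Arrays.fill(sizes, 100L); // 100 communities of 100 nodes each
        System.out.println("flat:      " + flat(10_000));
        System.out.println("two-level: " + twoLevel(sizes));
    }
}
```

For 10,000 nodes split into 100 communities of 100 nodes, this gives 49,995,000 pair comparisons per iteration for the flat layout against 499,950 for the two-level one, roughly a hundred times fewer.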

Drawback:

- The spatialization becomes dependent on the clustering algo. Indeed, but is it a problem?

I would be curious to see an implementation of this approach!

 

Mar 23 2012
 

There is an important problem in network visual analysis, and here is a tool to solve it.

The problem

The problem is, the visual exploration of networks is most insightful when the network shows some interesting structure, and often there is not much structure to be seen. By “structure”, I mean that the network shows different regions, each with different densities, maybe also key players not only in the center but also elsewhere, and basically anything which shows diversity or interesting irregularities in the network.

But often, representing linked data as a network does not exhibit much structure. Just as an example, here is the network of twitter users interested in “data visualization”, made by Moritz Stefaner:

 

 

As you see, this viz makes it hard to understand how this community is structured. We do find key players (big nodes), and there are vague sub-regions in the network that can be distinguished, but that’s very inconclusive.

(my point is not to criticize this particular viz. This problem occurs everywhere. If you are a biologist working on protein networks, or a consultant drawing the social network of an organization, or working with semantic networks, you will surely be familiar with this problem!)

 

 

 

 The solution

Let’s reconsider the linked data we have: people being connected to people, in this case people following or mentioning others on Twitter (but imagine any other scenario, like proteins being linked to genes, etc.). Instead of representing these connections, let’s represent only the connections between people who share many connections. Simply: 2 twitter users will be connected not if one follows the other, but if they both follow a high proportion of the same other twitter users.*

As an application, I used data provided by Jeff Clark:

person A, person B, 5000
person C, person B, 120
person B, person D,  234

(meaning, person A mentions person B with a frequency of 5000 (arbitrary scale), etc.)

I wrote a program called “Gaze” which takes these data and identifies which pairs of persons mention most frequently the same other persons. The resulting network looks like this:

 

(click here for a beautifully interactive version of this viz)

Sub regions of the network now clearly appear, and distinct communities can be spotted. There would be much more to say about the parameters which can be modified to achieve this, but I’ll mention just one. 2 persons are linked if they frequently mention the same persons in their tweets. But how “frequently” exactly? Well, that’s simply a parameter you can change, from “almost never the same persons” to “almost always the same persons”. This gives very interesting insights, since you can observe the consequences of your hypothesis on the structure of the network (with Gephi and its “filter” function, these changes in parameters can be observed instantaneously on the viz).
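The principle can be sketched in Java as follows. This is not the source code of Gaze: the class and method names are hypothetical, and the sketch assumes that each person is described by the vector of weights with which they mention others, that two persons are compared with the cosine similarity of these vectors, and that a pair is kept as an edge when its similarity reaches a threshold between 0 ("almost never the same persons") and 1 ("almost always the same persons").

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class MentionSimilarity {

    /** Builds, for each source, the sparse vector target -> mention weight. */
    static Map<String, Map<String, Double>> profiles(List<String[]> rows) {
        Map<String, Map<String, Double>> result = new TreeMap<>();
        for (String[] row : rows) {
            result.computeIfAbsent(row[0].trim(), source -> new HashMap<>())
                  .merge(row[1].trim(), Double.parseDouble(row[2].trim()), Double::sum);
        }
        return result;
    }

    /** Euclidean length of a sparse vector. */
    static double norm(Map<String, Double> vector) {
        double sum = 0.0;
        for (double weight : vector.values()) {
            sum += weight * weight;
        }
        return Math.sqrt(sum);
    }

    /** Cosine similarity of two sparse vectors; 0 if either vector is empty. */
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0.0;
        for (Map.Entry<String, Double> entry : a.entrySet()) {
            Double other = b.get(entry.getKey());
            if (other != null) {
                dot += entry.getValue() * other;
            }
        }
        double normA = norm(a);
        double normB = norm(b);
        return (normA == 0.0 || normB == 0.0) ? 0.0 : dot / (normA * normB);
    }

    /** Returns the pairs "x|y" whose similarity is at least the threshold. */
    static Map<String, Double> similarPairs(List<String[]> rows, double threshold) {
        Map<String, Map<String, Double>> byPerson = profiles(rows);
        List<String> names = new ArrayList<>(byPerson.keySet());
        Map<String, Double> edges = new LinkedHashMap<>();
        for (int i = 0; i < names.size(); i++) {
            for (int j = i + 1; j < names.size(); j++) {
                double similarity = cosine(byPerson.get(names.get(i)), byPerson.get(names.get(j)));
                if (similarity >= threshold) {
                    edges.put(names.get(i) + "|" + names.get(j), similarity);
                }
            }
        }
        return edges;
    }

    public static void main(String[] args) {
        List<String[]> rows = Arrays.asList(
            new String[] {"person A", "person B", "5000"},
            new String[] {"person C", "person B", "120"},
            new String[] {"person B", "person D", "234"});
        System.out.println(similarPairs(rows, 0.5));
    }
}
```

With the three sample rows above, only persons A and C end up linked: both mention person B and nobody else, whereas person B mentions someone neither of them mentions.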

 

 

 

 

The tool

- The software “Gaze” can be found on the software page of Clement Levallois (yours truly), here.

- A Youtube tutorial is available here (turn on the volume, and make it full screen and HD).

- The source code for Gaze can be checked on Github.

 

If you liked this post, you can follow me on Twitter, check my academic profile or suggest cool collaboration projects.

Clement Levallois

[EDIT March 25: the map has been updated, after a bug fix in the software. Previous version was incorrect]

*technically, this is simply a similarity measure, very common in the field of information retrieval. I use the cosine similarity. The basic idea of using a similarity measure was suggested by my work in scientometrics, where the viz of Rafols et al. relies heavily on it.

 

 

Feb 10 2012
 

Visualizations allow the naked eye to detect patterns and get insights about a network which could be missed by a purely statistical or numerical approach.

The most popular software packages use 2-D layouts to map the networks. In the following short video (3 minutes), I show how the extra degree of freedom offered by 3D layouts can make a true difference (turn up the volume for audio commentary).

 

 

 

 

Oct 18 2011
 

The question is simple. A typical email in my inbox looks like:

- From: George Clooney (GC) <g.clooney@como.it>

- To: Clement Levallois (CL) <info@exploreyourdata.com>

- Cc: Matt Damon (MD) <md@hollywood.com>, Angelina Jolie (AJ) <angelina@brangelina.com>

- Subject: Party

- Main text: “Hi Clement! What a great week-end. See you again soon in Leiden. George.”

This can be represented visually as a network:

Email sender and recipients as a network

Network extracted from one email.
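For readers who like to see the logic spelled out, the mapping from one email to a small network can be sketched in Java. The sketch rests on an assumption of mine: one directed edge goes from the sender to every recipient, whether in To or in Cc. The class and method names are hypothetical, and the header parsing is deliberately naive (it only looks for an address between angle brackets).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

class EmailEdges {

    /** Extracts the address between angle brackets; falls back to the whole text. */
    static String address(String header) {
        int open = header.indexOf('<');
        int close = header.indexOf('>');
        String raw = (open >= 0 && close > open) ? header.substring(open + 1, close) : header;
        return raw.trim().toLowerCase(Locale.ROOT);
    }

    /** One directed edge {sender, recipient} per To and Cc recipient. */
    static List<String[]> edges(String from, List<String> to, List<String> cc) {
        List<String> recipients = new ArrayList<>(to);
        recipients.addAll(cc);
        List<String[]> result = new ArrayList<>();
        for (String recipient : recipients) {
            result.add(new String[] {address(from), address(recipient)});
        }
        return result;
    }

    public static void main(String[] args) {
        List<String[]> result = edges(
            "George Clooney <g.clooney@como.it>",
            Arrays.asList("Clement Levallois <info@exploreyourdata.com>"),
            Arrays.asList("Matt Damon <md@hollywood.com>", "Angelina Jolie <angelina@brangelina.com>"));
        for (String[] edge : result) {
            System.out.println(edge[0] + " -> " + edge[1]);
        }
    }
}
```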

Say, you have 1000 emails in your inbox. If you could batch process them just like the example above, you would get a complex network of all persons involved in the email correspondence you receive. Pretty informative! How to do that?

The short version: here is the workflow I found most convenient to analyze emails from my Yahoo account:

1. Export emails from email client

- Thunderbird is the email client installed on my computer, where I manage the emails from my Yahoo account. Using a Thunderbird extension available to download (ImportExportTools), I could export the emails of my Thunderbird Inbox to a folder on my c: drive, choosing the “eml” format.
- note 1: at this step, I could very conveniently apply filters (“export just emails which contain an attachment”, or “just emails from last month”, and such).
- note 2: Outlook, Outlook Express, Lotus Notes, Entourage, Eudora… all have more or less simple procedures to export their emails into the eml format. Free solutions I found (not tested): a general tool here, another here specifically for Outlook and .pst files, and one for Lotus Notes here.

2. Download, install, open Gephi.

- Gephi is free, open source and available here: www.gephi.org
- In Gephi, select the menu “File”, then “Import Spigot”
- Select the folder where you exported the eml files in the previous step.

- Done! Here is the result:

Social network of my personal email communications (visualized with Gephi). The green node in the middle is my personal email address. Labels are available, but hidden here!

 

The long version: different alternatives available 

  • In the last few days I received many suggestions through SOCNET, a mailing list on social network analysis. Suggestions included (in no particular order): Inflow, Atlas.ti, TouchGraph, OrgNet and Trampoline Systems. I did *not* test these packages because they are commercial, often very expensive solutions, and I was after a free, open source solution.
  • A special note on ORA, developed by CASOS at Carnegie Mellon. It has a commercial license but is free for research purposes. This software has the capacity to treat emails to extract networks: http://www.springerlink.com/content/d3362152400w1570/. So I spent a lot of time trying to use it, and in the process I received generous support from Terrill Frantz from the CASOS team – I thank him for that! But I could not yet get things to work properly, so I won’t report on it yet.

In the end, I focused on two solutions:

1. Gephi

I am a member of the Gephi Community Support team, but to my great shame I had to be reminded by Gephi’s core architect  Mathieu Bastian that Gephi had an operational email import function!

Gephi is a free, open source software to explore networks visually: www.gephi.org. It works on PC, Mac and Linux.

+++ positive aspects +++
=> You can download your emails directly  from yahoo, gmail, or any email account for which you have pop3 or IMAP configurations (you know, the parameters you need to fill in to get your emails on Outlook, Lotus, Eudora,  Thunderbird, etc.)

=> You can import emails directly from Outlook, Lotus, Thunderbird…. as long as you export them in eml format.

=> Different versions of the same name can be merged. This is convenient since they are the same person! (Clement Levallois <clevallois@rrr.nl> and Clement Levallois <clementlevallois@fffffff.fr> can be merged into one single “Clement Levallois” node).

— negative aspects —
=> The import function has filters which can be applied to the emails being imported. Like, “import only emails from last month”. And many other interesting filters. The problem is, filters do not work! I filed a bug report, and people interested in this feature can follow progress on debugging here: https://bugs.launchpad.net/gephi/+bug/873588

2. NodeXL

NodeXL is a free, open source application which works inside Excel (2007 or 2010), to import, analyze and visualize networks: http://nodexl.codeplex.com/

I received excellent support from members of the developer team while exploring the import of emails – thanks to them! I think it works only on Windows, but somebody found a hack to make it work on Mac (I did not test it). NodeXL has import functions to create networks from all kinds of data, which I encourage you to try (including Youtube, Flickr, Twitter, and more). And NodeXL has import functions for emails too.

+++ Positive aspects +++

=> You can install a plugin to NodeXL which will allow you to import emails directly from an Exchange server (you know, the email solution that many companies use). I could not test it because the plugin works for Exchange 2007 and 2010 only, while my organization runs Exchange 2003. The plugin can be downloaded freely at: http://exchangespigot.codeplex.com/. Follow the installation instructions, then open NodeXL and do Import => “From Exchange User’s Email Network”. To my knowledge NodeXL is the only application with this nice feature.

=> You can import your emails simply by letting NodeXL search them for you on your computer, if you use a Windows email client (Outlook, Outlook Express or Windows Mail). If you use Thunderbird or another non-Windows application, you’ll have to export your emails in the eml format into a folder first, which Windows Search will then index and retrieve. However, this can be tricky, since you must make sure that 1. Windows Search is installed (it is by default on Windows 7, but you have to download it on Vista and XP) and 2. Windows Search is actually indexing the folder where you exported your emails (not always guaranteed!).

When you are sure that Windows Search indexed the folder with your eml emails, open NodeXL, do Import => “From Email network” and NodeXL will find these emails automatically.

=> You can also filter the import of these emails according to many criteria (more recent than… subject must include… etc).

— negative aspects —

=> You cannot import emails directly from the server where the emails are (except if you use Exchange). No direct access to Yahoo, Gmail, and such: this is quite inconvenient.

=> The quality of the visual representation of the network is not that great. I’ve found that many people use NodeXL for its neat import functions (especially Twitter imports), but then export the resulting network to a graphml file, which they open with Gephi.

Conclusion

Despite the diversity of tools advertising an “import emails to social networks” function, just two open source & free solutions turned out to work – after much work to surmount difficulties: Gephi and NodeXL. So I expect that this tutorial will serve many social network analysts who were looking for solutions.  And I am of course open to corrections and suggestions!


Aug 06 2011
 

Yesterday I sent a tweet on this topic, and I just want to elaborate a bit here.

I am currently working on a set of longitudinal monthly data, covering a couple of years. The data was provided to me “as is”, with the simple mission to draw a visualization of it. After 2 days of scripting I finally get the visualization to work: a network of relational data, fully dynamic (topology and attributes). I used Gephi, as always my favorite.

Not knowing what to expect, I looked at the data for the first time by simply scrolling the timeline to see the network animate through months and years. At some point, I see an enormous burst of activity: nodes multiply, edges flow in all directions. Did my script go wrong somewhere? Is the phenomenon described by the data that volatile? I make this observation to the client, who is also a bit puzzled. Then suddenly: “oh indeed, we had this major change in coding procedure at this point in time. We should correct for that and give you the updated dataset. Nice that you spotted it!”

Cool. But the point is the following: with data visualization, I was able to spot this coding error as easily as if it were an elephant in a corridor (as the French say!). Would it have been so easy and so quick to spot in the case of an econometric analysis, which is 99% of the analytical treatments that datasets receive today?

I’m sure the error would have gone unnoticed for a long time, with a lot of work and misguided advice being derived from the situation. Scary. How sensitive an econometric analysis is to coding errors in a dataset will depend on the type of coding error, of course.

If the error is in the number of entities in the data – say, 150 items in the database become 300 at some point in time because of a coding error duplicating them – then a careful econometric analysis should be able to spot it (because in regressions, some strange thing will happen to coefficients at this point in time). I am not even sure, though, since a statistical analysis relies on statistical reports, and never lets the analyst peer directly into the original data. Spotting coding errors therefore requires an attentive analyst who is knowledgeable about the dataset s/he is studying, and we know that this is far from always the case.

But if the error is in the relations between entities (say, double the number of transactions is recorded between all entities at one point in time), then it might get even trickier to spot in a statistical report. Again, that is child’s play with dataviz, since the sudden change in the structure of the data caused by this error is simply obvious – you literally see it.

So, it certainly sounds deadpan serious to credit a beautiful thing like dataviz for its good performance at detecting coding errors and improving the cleanliness of datasets. But analysts will know the world of difference it makes.

< Did you like this InSight? To discover our team, follow the link: ExploreYourData.com>