The question is simple. A typical email in my inbox looks like:
- From: George Clooney (GC) <email@example.com>
- To: Clement Levallois (CL) <firstname.lastname@example.org>
- Cc: Matt Damon (MD) <email@example.com>, Angelina Jolie (AJ) <firstname.lastname@example.org>
- Subject: Party
- Main text: “Hi Clement! What a great week-end. See you again soon in Leiden. George.”
This can be represented visually as a network:
Say you have 1,000 emails in your inbox. If you could batch-process them just like the example above, you would get a complex network of everyone involved in the email correspondence you receive. Pretty informative! How can you do that?
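To make the idea concrete, here is a minimal Python sketch of how a single email can be turned into network edges, using only the standard library. The addresses in `RAW` are made-up placeholders, and `edges_from_message` is a hypothetical helper name, not something taken from Gephi or NodeXL.

```python
from email import message_from_string
from email.utils import getaddresses

# Placeholder message: the addresses are invented for illustration only.
RAW = """\
From: George Clooney <george@example.com>
To: Clement Levallois <clement@example.org>
Cc: Matt Damon <matt@example.com>, Angelina Jolie <angelina@example.org>
Subject: Party

Hi Clement! What a great week-end. See you again soon in Leiden. George.
"""


def edges_from_message(msg):
    """Return one (sender, recipient) pair per To/Cc address."""
    senders = getaddresses(msg.get_all("From", []))
    recipients = getaddresses(msg.get_all("To", []) + msg.get_all("Cc", []))
    return [
        (s_addr.lower(), r_addr.lower())
        for _, s_addr in senders
        for _, r_addr in recipients
        if s_addr and r_addr
    ]


edges = edges_from_message(message_from_string(RAW))
for sender, recipient in edges:
    print(sender, "->", recipient)
```

Each email contributes one directed edge from the sender to every person in To and Cc; repeating this over a whole inbox gives the network.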
The short version: here is the workflow I found most convenient to analyze emails from my Yahoo account:
1. Export emails from email client
- Thunderbird is the email client installed on my computer, where I manage the emails from my Yahoo account. Using a freely downloadable Thunderbird extension (ImportExportTools), I could export the emails of my Thunderbird Inbox to a folder on my C: drive, choosing the “eml” format.
- note 1: at this step, I could very conveniently apply filters (“export just emails which contain an attachment”, or “just emails from last month”, and such).
- note 2: Outlook, Outlook Express, Lotus Notes, Entourage, Eudora… all have more or less simple procedures for exporting their emails to the eml format. Free solutions I found but did not test: a general tool here, one specifically for Outlook and .pst files here, and one for Lotus Notes here.
2. Download, install, open Gephi.
- Gephi is free, open source and available here: www.gephi.org
- In Gephi, select the menu “File”, then “Import Spigot”
- Select the folder where you exported the eml files in the previous step.
- Done! Here is the result:
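If you would rather script the folder-of-eml-files step yourself, the following Python sketch walks such a folder and writes a weighted edge list to CSV. It is an illustration under stated assumptions, not a description of what Gephi's importer does internally: `count_edges` and `write_edge_csv` are hypothetical names, and the `Source`, `Target`, `Weight` column headers are what I would expect Gephi's spreadsheet import to recognise, which you should check against your Gephi version.

```python
import csv
from collections import Counter
from email.parser import BytesParser
from email.utils import getaddresses
from pathlib import Path


def count_edges(eml_dir):
    """Count how often each (sender, recipient) pair occurs in a folder of .eml files."""
    weights = Counter()
    parser = BytesParser()
    for path in sorted(Path(eml_dir).glob("*.eml")):
        with path.open("rb") as fh:
            # Only the headers are needed to build the network.
            msg = parser.parse(fh, headersonly=True)
        senders = getaddresses(msg.get_all("From", []))
        recipients = getaddresses(msg.get_all("To", []) + msg.get_all("Cc", []))
        for _, s_addr in senders:
            for _, r_addr in recipients:
                if s_addr and r_addr:
                    weights[(s_addr.lower(), r_addr.lower())] += 1
    return weights


def write_edge_csv(weights, out_path):
    """Write the edge counts as a CSV edge list (one row per directed edge)."""
    with open(out_path, "w", newline="", encoding="utf-8") as fh:
        writer = csv.writer(fh)
        writer.writerow(["Source", "Target", "Weight"])  # assumed column names
        for (src, dst), weight in sorted(weights.items()):
            writer.writerow([src, dst, weight])
```

A typical call would be `write_edge_csv(count_edges("path/to/exported/emails"), "edges.csv")`, where both paths are placeholders for your own folders.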
The long version: different alternatives available
- In the last few days I received many suggestions through SOCNET, a mailing list on social network analysis. Suggestions included (in no particular order): Inflow, Atlas.ti, TouchGraph, OrgNet and Trampoline Systems. I did *not* test these programs because they are commercial, often very expensive solutions, and I was after a free, open source solution.
- A special note on ORA, developed by CASOS at Carnegie Mellon. It has a commercial license but is free for research purposes. This software can process emails to extract networks: http://www.springerlink.com/content/d3362152400w1570/. So I spent a lot of time trying to use it, and in the process I received generous support from Terrill Frantz of the CASOS team (I thank him for that!). But I could not get things to work properly yet, so I won’t report on it for now.
In the end, I focused on two solutions:
Gephi is a free, open source software to explore networks visually: www.gephi.org. It works on PC, Mac and Linux.
+++ positive aspects +++
=> You can download your emails directly from Yahoo, Gmail, or any email account for which you have POP3 or IMAP settings (you know, the parameters you need to fill in to get your emails in Outlook, Lotus, Eudora, Thunderbird, etc.)
=> You can import emails directly from Outlook, Lotus, Thunderbird… as long as you export them in eml format.
=> Different versions of the same name can be merged. This is convenient since they are the same person! (Clement Levallois <email@example.com> and Clement Levallois <firstname.lastname@example.org> can be merged into one single “Clement Levallois” node).
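To illustrate what merging means here, this is a minimal Python sketch that groups addresses by normalised display name. It is not Gephi's merging logic, which I have not inspected; `normalise_name` and `merge_map` are hypothetical helpers, and the addresses are placeholders.

```python
from email.utils import getaddresses


def normalise_name(display_name, address):
    """Lower-case the display name and collapse whitespace; fall back to the address."""
    name = " ".join(display_name.replace('"', "").split()).casefold()
    return name or address.lower()


def merge_map(header_values):
    """Map each normalised person name to the set of addresses seen for it."""
    merged = {}
    for name, addr in getaddresses(header_values):
        if not addr:
            continue  # skip entries that could not be parsed into an address
        merged.setdefault(normalise_name(name, addr), set()).add(addr.lower())
    return merged


people = merge_map([
    "Clement Levallois <clement@example.com>",
    "clement levallois <c.levallois@example.org>",
    "someone@example.net",
])
for person, addresses in sorted(people.items()):
    print(person, sorted(addresses))
```

Matching on names alone will also merge two different people who happen to share a name, so a real workflow would want a manual check.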
— negative aspects —
=> The import function has filters which can be applied to the emails being imported, such as “import only emails from last month”, and many other interesting ones. The problem is, the filters do not work! I filed a bug report, and people interested in this feature can follow progress on the debugging here: https://bugs.launchpad.net/gephi/+bug/873588
NodeXL is a free, open source application which works inside Excel (2007 or 2010), to import, analyze and visualize networks: http://nodexl.codeplex.com/
I received excellent support from members of the development team while exploring the import of emails. Thanks to them! I think it works only on Windows, but somebody found a hack to make it work on Mac (I did not test it). NodeXL has import functions to create networks from all kinds of data (including YouTube, Flickr, Twitter, and more), which I encourage you to try. And NodeXL has import functions for emails too.
+++ Positive aspects +++
=> You can install a plugin for NodeXL which will allow you to import emails directly from an Exchange server (you know, the email solution that many companies use). I could not test it because the plugin works for Exchange 2007 and 2010 only, while my organization runs Exchange 2003. The plugin can be downloaded freely at: http://exchangespigot.codeplex.com/. Follow the installation instructions, then open NodeXL and do Import => “From Exchange User’s Email Network”. To my knowledge NodeXL is the only application with this nice feature.
=> You can import your emails simply by letting NodeXL search for them on your computer, if you use a Windows email client (Outlook, Outlook Express or Windows Mail). If you use Thunderbird or another non-Windows application, you first have to export your emails in the eml format to a folder, which Windows Search will then index and retrieve. This can be tricky, however, since you must make sure that 1. Windows Search is installed (it is by default on Windows 7, but you have to download it on Vista and XP) and 2. Windows Search is actually indexing the folder where you exported your emails (not always guaranteed!).
When you are sure that Windows Search indexed the folder with your eml emails, open NodeXL, do Import => “From Email network” and NodeXL will find these emails automatically.
=> You can also filter the import of these emails according to many criteria (more recent than… subject must include… etc).
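For readers who want this kind of filtering on exported eml files outside NodeXL, here is a small Python sketch of a date and subject filter. It does not reproduce NodeXL's filters; `keep` is a hypothetical function, and treating dates without a timezone as UTC is an assumption made to keep comparisons simple.

```python
from datetime import datetime, timezone
from email import message_from_string
from email.utils import parsedate_to_datetime


def keep(msg, since=None, subject_contains=None):
    """Return True if the message passes the optional date and subject filters."""
    if since is not None:
        raw_date = msg.get("Date")
        if raw_date is None:
            return False
        try:
            sent = parsedate_to_datetime(str(raw_date))
        except (TypeError, ValueError):
            return False  # unparseable Date header: leave the message out
        if sent.tzinfo is None:
            sent = sent.replace(tzinfo=timezone.utc)  # assumption: treat as UTC
        if sent < since:
            return False
    if subject_contains is not None:
        subject = str(msg.get("Subject") or "")
        if subject_contains.casefold() not in subject.casefold():
            return False
    return True


sample = message_from_string(
    "Date: Mon, 10 Oct 2011 09:30:00 +0200\nSubject: Party\n\nHi Clement!"
)
cutoff = datetime(2011, 10, 1, tzinfo=timezone.utc)
print(keep(sample, since=cutoff, subject_contains="party"))
```

The `since` argument must be a timezone-aware `datetime`, otherwise Python refuses to compare it with the parsed date.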
— negative aspects —
=> You cannot import emails directly from the server where the emails are stored (unless you use Exchange). No direct access to Yahoo, Gmail, and such: this is quite inconvenient.
=> The quality of the visual representation of the network is not that great. I’ve found that many people use NodeXL for its neat import functions (especially Twitter imports), but then export the resulting network to a graphml file, which they open with Gephi.
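The GraphML hand-off mentioned above can also be produced by a script. The sketch below writes a minimal directed GraphML document with an edge weight, using only Python's standard library; `edges_to_graphml` is a hypothetical helper, the sample edges are invented, and this is not the file NodeXL itself generates.

```python
import xml.etree.ElementTree as ET

GRAPHML_NS = "http://graphml.graphdrawing.org/xmlns"


def _tag(name):
    """Qualify a tag name with the GraphML namespace."""
    return f"{{{GRAPHML_NS}}}{name}"


def edges_to_graphml(weights):
    """Serialise {(source, target): weight} as a directed GraphML string."""
    ET.register_namespace("", GRAPHML_NS)
    root = ET.Element(_tag("graphml"))
    # Declare the edge attribute that carries the number of emails.
    ET.SubElement(root, _tag("key"), {
        "id": "weight", "for": "edge",
        "attr.name": "weight", "attr.type": "double",
    })
    graph = ET.SubElement(root, _tag("graph"), {"edgedefault": "directed"})
    for node_id in sorted({n for edge in weights for n in edge}):
        ET.SubElement(graph, _tag("node"), {"id": node_id})
    for (src, dst), weight in sorted(weights.items()):
        edge = ET.SubElement(graph, _tag("edge"), {"source": src, "target": dst})
        data = ET.SubElement(edge, _tag("data"), {"key": "weight"})
        data.text = str(float(weight))
    return ET.tostring(root, encoding="unicode")


# Invented sample data: two senders, three people in total.
xml_text = edges_to_graphml({
    ("george@example.com", "clement@example.org"): 3,
    ("matt@example.com", "clement@example.org"): 1,
})
print(xml_text)
```

Saving `xml_text` to a file with a .graphml extension should give something Gephi can open, though I have only checked the structure against the GraphML format, not against every Gephi version.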
Despite the diversity of tools advertising an “import emails to social networks” function, only two free, open source solutions turned out to work, and only after much effort to overcome difficulties: Gephi and NodeXL. So I expect that this tutorial will be useful to many social network analysts who are looking for solutions. And I am of course open to corrections and suggestions!