New EDRM Enron Email Data Set

The EDRM Enron v1 Data Set Cleansed of Private, Health and Financial Information

The Enron v1 data set previously hosted by EDRM has served for many years as an industry-standard collection of email data for electronic discovery training and testing. Since this data set was originally made available by FERC, it has been an open secret that it contained many instances of private, health and financial data about the company’s former employees.

Cleansing the data

Nuix specialists cleansed the EDRM Enron data set of private information. We identified and removed more than 10,000 items of information including:

  • 60 containing credit card numbers, including departmental contact lists that each contained hundreds of individual credit cards
  • 572 containing Social Security or other national identity numbers—thousands of individuals’ identity numbers in total
  • 292 containing individuals’ dates of birth
  • 532 containing information of a highly personal nature such as medical or legal matters.

Many items contained multiple instances and types of information. This included departmental contact list spreadsheets with dates of birth, credit card numbers, Social Security numbers, home addresses and other private details of dozens of staff members.
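
As a rough illustration of what pattern-based screening for such items can look like, the following Python sketch flags candidate credit card and Social Security numbers in a block of text. It is not the Nuix methodology, which is described in the case study. The regular expressions, the Luhn checksum filter, and the function names are all assumptions chosen for illustration, and real data would need broader rules plus manual review.

```python
import re

# Hypothetical patterns for illustration only.
# US Social Security number written as 123-45-6789.
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
# 13 to 16 digits, optionally separated by single spaces or hyphens.
CARD_RE = re.compile(r"\b\d(?:[ -]?\d){12,15}\b")


def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum."""
    digits = [int(c) for c in number if c.isdigit()]
    if len(digits) < 13:
        return False
    total = 0
    # Walk from the rightmost digit, doubling every second one.
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def find_pii(text: str) -> dict:
    """Return candidate card numbers (Luhn-checked) and SSNs found in `text`."""
    cards = [m.group() for m in CARD_RE.finditer(text) if luhn_valid(m.group())]
    ssns = SSN_RE.findall(text)
    return {"credit_cards": cards, "ssns": ssns}
```

The Luhn check is there to cut false positives: a 16-digit phone or account number matches the card pattern but usually fails the checksum. Date-of-birth and medical or legal content are much harder to catch with patterns alone and are not attempted here.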

In removing these items and making the cleansed data set available to the community, we hope to protect the privacy of hundreds of individuals.

Nuix is also pleased to offer the legal and investigator community the methodology we used for identifying personal and financial data in corporate data sets.

  • Download our case study, “Removing PII from the EDRM Enron Data Set: Investigating the prevalence of unsecured financial, health and personally identifiable information in corporate data” for a detailed methodology. Download here

Download the cleansed EDRM Enron v1 data set

What risks lie in your data?

Although the EDRM Enron data set is more than 10 years old, most organizations still face significant risks relating to private information stored in their systems.

  • Using Nuix Investigator tools and the methodology outlined in our case study, you can identify inappropriately stored private, health and financial data and take immediate steps to remediate the risks involved.
  • Nuix also offers information governance products and solutions to locate and remediate these risks in email, file shares, and archives.


These files may contain personally identifiable information, in spite of efforts to remove that information. If you find PII that you think should be removed, please notify us at

34 comments to New EDRM Enron Email Data Set

  • TheOldHag

    The TREC legal track references the Enron collection found here as the reference collection corresponding to the legal track. The legal track includes a number of files containing what I believe are relevance judgements, along with poorly explained mappings from doc id to sdoc number. Apparently, these sdoc numbers are supposed to be in the email headers. However, there appear to be only message ids, and I cannot map any of those message ids to anything in these mapping files or relevance judgement files.

    It is great that EDRM makes this reference collection available. It would be extremely helpful if there was guidance in deciphering the scramble that is the TREC legal track with respect to this reference collection.

  • can we have some other way like ftp address to download the whole PST files at once?

  • i don’t see any of the file list or the download button.

  • Mark

    In the dataset “EDRM Enron Email Data Set v2”, for each custodian, there are subdirs like \text_000, \text_001, etc. They consist entirely of .txt files. Some of these files are the .eml files without the embedded attachments and the others are the actual attachments to those .eml files (but converted to text files which is really useful).
    However, there are some custodians that this does not appear to have been performed for (no \text_nnn subdirs for them). Do you plan to finish those custodians? These are the ones I ran into:

    Thank you.

  • Ryan

    First, thank you for posting this data, it is extremely helpful to use in testing.

    There seems to be an issue with many of the recipient fields in the PST version. In cases where the Exchange contact address format was used rather than the SMTP address (e.g., <Tag TagName="#To" TagDataType="Text" TagValue="Williams III, Bill ") the PST version separates this out into 2 different recipients split by the comma. So instead of getting back the correct recipient you end up with 2 incorrect recipients:
    1) Williams III
    2) Bill
    Which makes it very difficult to effectively map the recipients across the dataset.

    It seems to be correct in the XML version. Any chance of an update on these PSTs?

    Thanks again for making this data available.

    • Aaron

      This same problem was present in the v1 set and has never been corrected.

      This problem is present in the XML zip archive in the EML version of the messages. That is, there is no native-file version of these messages that is correct.

      In both the EML and the PST files, the headers (on many messages) look like this:
      To: ,"Anna"
      which is clearly incorrect.

      It almost appears as though these files were rendered to text and then incorrectly reinterpreted to create these native files, although this is not the case on every message.

      Because a significant portion of the messages have these incorrect headers, these files are effectively useless for testing processing through the EDRM. No processing of native files that uses meta-data will produce useful or meaningful test results.
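
The recipient-splitting problem described in this thread can be reproduced in a few lines of Python. This is a minimal sketch, not a description of how the PST or EML files were actually generated. It assumes a header in which the display name is quoted as RFC 5322 expects, which many headers in the data set reportedly are not, and the email address and function names are hypothetical.

```python
from email.utils import getaddresses


def naive_split(header_value: str) -> list:
    """Split a recipient header on every comma (the failure mode described above)."""
    return [part.strip() for part in header_value.split(",") if part.strip()]


def parse_recipients(header_value: str) -> list:
    """Parse a recipient header into (display name, address) pairs."""
    return getaddresses([header_value])


# Hypothetical address; only the display name comes from the comment above.
header = '"Williams III, Bill" <bill.williams@example.com>'

print(naive_split(header))       # two fragments, neither a valid recipient
print(parse_recipients(header))  # one recipient with the full display name
```

A quote-aware parser keeps "Williams III, Bill" together because the comma sits inside a quoted display name. Once the quoting has been lost, as in the `To: ,"Anna"` example, no parser can reliably recover the original recipient.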

  • Robert Lauriston

    Are there versions of these files with the formatting intact?

  • @Theresa: For the EDRM Enron Data Set v2, there’s no mapping needed between the EDRM Message-IDs and the original Message-IDs because the EDRM data set uses the original Message-IDs where available. In cases where no original Message-ID is present, a Message-ID was generated such as the @PMZL04 example you mentioned. The EDRM Message-IDs can be considered authoritative because of this.

    Additionally, it appears that none of the @thyme Message-IDs are original, but were created for the CALO Enron Email Data Set. In that data set (and its derivatives), it appears that all messages have an @thyme Message-ID.

    If you want to use the original Message-IDs, I recommend using the EDRM Message-IDs.

    If you want to correlate the EDRM and CALO data sets, a mapping file would be useful but I’m not aware of one yet. However, if a mapping was made available, we would be happy to link to it or host it.

    Hope this helps.

    • Theresa Wilson

      John – Thank you very much for your reply and the helpful information! I didn’t realize that the @thyme Message-IDs weren’t the original ones. I guess the next step for me would be to ask the creators of the CALO data set if they have a mapping. Thanks again!

  • Theresa Wilson

    I’ve just started looking at this dataset, and I noticed that the Message-IDs (Example: 00000000DC16F437217B604BB0C11906781A15A0046D2200@PMZL04) in the .eml file do not match at all the Message-IDs in other Enron data sets that have been made available (Example: 11972760.1075842947758.JavaMail.evans@thyme). Is there a mapping somewhere of EDRM Message-IDs to the original Message-IDs?
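
The two Message-ID styles quoted in these comments differ in the part after the `@`, so a quick way to see which kind a given .eml file carries is to read the header and inspect that suffix. The sketch below uses Python's standard `email` package. The labels it returns are assumptions based only on the comments in this thread (`thyme` for CALO-assigned IDs, `PMZL04` as one example of a generated ID), and the function names are hypothetical.

```python
from email import message_from_string


def message_id_suffix(raw_eml: str) -> str:
    """Return the part of the Message-ID after '@', or '' if there is none."""
    msg = message_from_string(raw_eml)
    message_id = (msg.get("Message-ID") or "").strip().strip("<>")
    if "@" not in message_id:
        return ""
    return message_id.rpartition("@")[2]


def describe_message_id(raw_eml: str) -> str:
    """Label a message by its Message-ID suffix (labels are assumptions)."""
    suffix = message_id_suffix(raw_eml)
    if suffix == "thyme":
        return "CALO-style ID"
    if suffix == "PMZL04":
        return "generated ID"
    if suffix:
        return "other ID, possibly original"
    return "no Message-ID"
```

This only sorts messages into groups. Correlating the EDRM and CALO sets message by message would still need a mapping file or a comparison on other fields, which is outside this sketch.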

  • The data set was prepared using ZL Unified Archive. A ZL system was set up to archive the Enron email files made available from FERC via Lockheed Martin. The email messages were archived into individual custodian accounts on the Unified Archive system. Once in the system, two sets of emails were created. First, a PST set was created by exporting each custodian mailbox using that format. Then the EDRM XML set was created by importing the email into ZL Discovery Manager and using the EDRM XML export capability to export the email for each custodian.

  • Chung

    Is there any documentation on how these files were prepared?

