New EDRM Enron Email Data Set

The EDRM Enron v1 Data Set Cleansed of Private, Health and Financial Information

The Enron v1 data set previously hosted by EDRM (www.edrm.net) has served for many years as an industry-standard collection of email data for electronic discovery training and testing. Since this data set was originally made available by FERC, it has been an open secret that it contained many instances of private, health and financial data about the company’s former employees.

Cleansing the data

Nuix specialists cleansed the EDRM Enron data set of private information. We identified and removed more than 10,000 items of information including:

  • 60 containing credit card numbers, including departmental contact lists that each contained hundreds of individual credit cards
  • 572 containing Social Security or other national identity numbers—thousands of individuals’ identity numbers in total
  • 292 containing individuals’ dates of birth
  • 532 containing information of a highly personal nature such as medical or legal matters.

Many items contained multiple instances and types of information. This included departmental contact list spreadsheets with dates of birth, credit card numbers, Social Security numbers, home addresses and other private details of dozens of staff members.

In removing these items and making the cleansed data set available to the community, we hope to protect the privacy of hundreds of individuals.

Nuix is also pleased to offer the legal and investigator community the methodology we used for identifying personal and financial data in corporate data sets.

  • Download our case study, “Removing PII from the EDRM Enron Data Set: Investigating the prevalence of unsecured financial, health and personally identifiable information in corporate data,” for a detailed methodology.
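
To give a rough sense of what identifying this kind of data can involve, here is a minimal Python sketch. It is not the methodology from the case study. It assumes a simple approach in which regular expressions find candidate numbers and a Luhn checksum filters out digit runs that cannot be payment card numbers. The names CARD_RE, SSN_RE, luhn_valid and find_pii are hypothetical and exist only for this illustration.

```python
import re

# Candidate patterns. These are assumptions for illustration only,
# not patterns taken from the case study.
CARD_RE = re.compile(r"\b\d(?:[ -]?\d){12,18}\b")  # 13 to 19 digits, optional separators
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")      # US Social Security number layout


def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum."""
    digits = [int(c) for c in number if c.isdigit()]
    if not 13 <= len(digits) <= 19:
        return False
    total = 0
    # Walk from the rightmost digit, doubling every second one.
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def find_pii(text: str) -> dict:
    """Collect Luhn-valid card candidates and SSN-shaped strings from text."""
    cards = [m.group() for m in CARD_RE.finditer(text) if luhn_valid(m.group())]
    ssns = SSN_RE.findall(text)
    return {"credit_cards": cards, "ssns": ssns}
```

A pattern match is only a candidate. Dates of birth, medical details and legal matters generally need context or human review, which a sketch like this does not attempt.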

Download the cleansed EDRM Enron v1 data set

What risks lie in your data?

Although the EDRM Enron data set is more than 10 years old, most organizations still face significant risks relating to private information stored in their systems.

  • Using Nuix Investigator tools and the methodology outlined in our case study, you can identify inappropriately stored private, health and financial data and take immediate steps to remediate the risks involved.
  • Nuix also offers information governance products and solutions to locate and remediate these risks in email, file shares, and archives.

PII

These files may contain personally identifiable information, in spite of efforts to remove that information. If you find PII that you think should be removed, please notify us at mail@edrm.net.

34 comments to New EDRM Enron Email Data Set

  • Bob

    Thank you for making these available!

    When attempting to unzip two of these I’m repeatedly encountering problems. The zips I’m having a problem with are:

    edrm-enron-v2_kaminski-v_xml_1of2.zip
    edrm-enron-v2_kaminski-v_xml_2of2.zip

    Are there any known issues with these? The error I receive is:

    “! \edrm-enron-v2_kaminski-v_xml_2of2.zip: The archive is corrupt” and
    “! \edrm-enron-v2_kaminski-v_xml_1of2.zip: The archive is corrupt”

    Thanks again!!

  • Olivier

    Thank you so much for your work.
    These datasets are very useful for my project.

  • B Jano

    it’s probably worth noting here that all of the v2 files together are 116GB (37GB for PST, 79GB for XML).

    if you use a downloading tool to limit your bandwidth to (say) 50KB/s (400Kb/s) so you don’t destroy your employer’s internet connection or network proxies, that translates into about a month. divide appropriately if you choose to use more bandwidth or download only PST or XML.
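
The estimate above can be checked with a few lines of Python. This is a small sketch that assumes decimal units (1 GB = 10^9 bytes, 1 KB = 10^3 bytes) and a constant transfer rate. The function name download_days is made up for the example.

```python
def download_days(size_gb: float, rate_kb_per_s: float) -> float:
    """Days needed to transfer size_gb at a steady rate_kb_per_s (decimal units)."""
    size_bytes = size_gb * 1e9
    seconds = size_bytes / (rate_kb_per_s * 1e3)
    return seconds / 86_400  # seconds per day


# Sizes quoted in the comment above: 37 GB of PST, 79 GB of XML, 116 GB in total.
for label, size_gb in [("PST", 37), ("XML", 79), ("all", 116)]:
    print(f"{label}: {download_days(size_gb, 50):.1f} days")
```

At 50 KB/s the full 116 GB works out to a little under 27 days, which matches the "about a month" figure.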

  • William Webber

    Thanks for making this data available. Would it be possible to calculate and publish md5sums of these files, so that users can verify download integrity?
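
If checksums were published, a download could be verified along these lines. This is a sketch using Python's standard hashlib module. The helper name file_md5 and the comparison value are placeholders, since no checksums appear on this page.

```python
import hashlib


def file_md5(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        # Read in chunks so multi-gigabyte archives do not need to fit in memory.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Calling file_md5 on a downloaded zip and comparing the result with the published value would reveal a truncated or corrupted download. MD5 is adequate for catching transfer errors but not for guarding against deliberate tampering.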

  • Please note that we have replaced two files and corrected the file name on the third. We replaced “edrm-enron-v2_reitmeyer-j_pst.zip” and “edrm-enron-v2_arnold-j_pst.zip”; neither was decompressing properly. We changed the name of “edrm-enron-V2-rodrique-r_pst.zip” to “edrm-enron-V2-rodrigue-r_pst.zip”.

  • Greg

    Thanks for making all of these files available for download. FYI, the file edrm-enron-v2_rodrique-r_pst.zip is missing and gives an error.

    • Greg,

      Earlier this week the folks preparing the files notified me that the file you cite was improperly named. It was supposed to be “rodrigue” with a “G” and not “rodrique” with a Q. I made the appropriate changes. That file is posted and you should be able to download it from this page.

      Thanks,

      George

  • Anyone have any advice for rewriting the domain enron.com to someother.com? We’d like to use the PST data in a testing environment, and would like to use routable email addresses. Thanks, Paul.
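
One possible starting point for the question above is a regular-expression rewrite. This sketch only covers text that has already been extracted from the messages (for example headers and bodies exported as plain text), because PST files are binary containers and cannot be edited safely with a text substitution. The names ENRON_ADDR and rewrite_domain are hypothetical, and someother.com is simply the placeholder from the question.

```python
import re

# Match "enron.com" only when it directly follows "@" and is not part of a
# longer domain such as "enron.com.au". Subdomains are deliberately left alone.
ENRON_ADDR = re.compile(r"(?<=@)enron\.com(?![\w-]|\.\w)", re.IGNORECASE)


def rewrite_domain(text: str, new_domain: str = "someother.com") -> str:
    """Replace the enron.com domain in email addresses found in text."""
    return ENRON_ADDR.sub(new_domain, text)
```

For example, rewrite_domain("jane.doe@enron.com") returns "jane.doe@someother.com". Message-ID headers and addresses inside attachments would need separate handling.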

  • gaultz

    Thanks for these. (So these Enron emails are now under Creative Commons license??)

    Membering now…

    • We have released this collection under a Creative Commons Attribution 3.0 United States License. This collection is a reworking of the previously available data, with substantial effort required to transform it to the released version. Thank you.

  • SGS

    These types of data sets are very useful in our environment. Thanks for the information.

  • jg

    Thanks so much for putting this data out there.
