Sentiment Analysis Research: Leveraging Email to Identify Insider Threats
The insider threat is a long-standing problem with challenges that evolve as quickly as technology. Today, Insider Threat/Risk Programs are adapting their tools to capture and analyze the computer use of every individual in their organization. To date, one of the least leveraged cyber sensors is email text. The reasons include privacy concerns, the massive amounts of additional data, and the lack of an efficient way to mine that data for potential risk indicators. Mature Insider Threat/Risk Programs are not interested in simple dirty word searches or general disgruntlement, as neither is specific enough for their needs. In response, sentiment analysis has been proposed as an alternative way to identify potential insider threats.
There is a vast research literature on sentiment analysis of text, including email, for purposes other than insider threat detection. Much of this research focuses on general positive and negative sentiment categories, which are too broad for insider threat analyses. At the time of the study, few tools, available or in development, claimed to identify insider threats via sentiment analysis of email text. The challenge for both vendors and consumers is determining the tools’ effectiveness when there is no ground truth of known bad data.
To help tackle this challenge, MITRE conducted a study examining two research questions. First, can we proactively identify, via email content analysis, employees who might become insider threats (i.e., does email contain sufficient meaningful content)? Second, can current tools for sentiment analysis of text efficiently identify these potential insider threats? The MITRE team identified a large, good-quality email corpus (1,086 individual users, 22,644 email threads), created an insider-risk-focused sentiment codebook, and built an easy-to-use annotation tool and database. MITRE behavioral scientists and insider threat SMEs hand-coded the email corpus and then analyzed the performance of existing software tools against the human-coded corpus. We found the precision and recall of five text analysis tools to be below the needs of the insider threat community. The psycholinguistics necessary to identify specific kinds of negative sentiment was not advanced enough for deployment across Insider Threat/Risk Programs.
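For readers unfamiliar with the evaluation metrics, the following is a minimal sketch of how precision and recall can be computed when a tool's flags are compared against human-coded labels. It is an illustration only, not the study's actual evaluation code: the function name `precision_recall` and the toy labels are hypothetical, and none of the numbers come from the MITRE corpus.

```python
def precision_recall(human_labels, tool_flags):
    """Return (precision, recall) for a tool's binary flags,
    treating the human-coded labels as ground truth."""
    if len(human_labels) != len(tool_flags):
        raise ValueError("label lists must have the same length")

    pairs = list(zip(human_labels, tool_flags))
    true_pos = sum(1 for h, t in pairs if h and t)       # flagged and truly relevant
    false_pos = sum(1 for h, t in pairs if not h and t)  # flagged but not relevant
    false_neg = sum(1 for h, t in pairs if h and not t)  # relevant but missed

    # Precision: share of the tool's flags that the human coders agreed with.
    precision = true_pos / (true_pos + false_pos) if (true_pos + false_pos) else 0.0
    # Recall: share of human-coded items that the tool managed to flag.
    recall = true_pos / (true_pos + false_neg) if (true_pos + false_neg) else 0.0
    return precision, recall


# Hypothetical toy data: one entry per email thread (True = risk-relevant sentiment).
human = [True, False, True, True, False, False, False, True]
tool = [True, True, False, True, False, False, True, False]

precision, recall = precision_recall(human, tool)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Low precision means analysts spend time on false alarms, while low recall means concerning emails go unnoticed; a tool has to do well on both to be useful to an Insider Threat/Risk Program.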