Description and Deliverables
The hypothetical HR department at the fictional Salifort Motors collected employee data to improve satisfaction. They requested data-driven suggestions based on an analysis of this data. The main question is: what factors are likely to make an employee leave the company?
The goal of this project is to analyze the data and build a model to predict employee attrition. By identifying which employees are likely to leave, it may be possible to determine the factors contributing to their departure. The model should be interpretable so HR can design targeted interventions to improve retention. Improving retention can reduce the costs associated with hiring and training new employees.
Stakeholders:
The primary stakeholder is the Human Resources (HR) department, as they will use the results to inform retention strategies. Secondary stakeholders include C-suite executives who oversee company direction, managers implementing day-to-day retention efforts, employees (whose experiences and outcomes are directly affected), and, indirectly, customers, since employee satisfaction can impact customer satisfaction.
Ethical Considerations:
- Ensure employee data privacy and confidentiality throughout the analysis.
- Avoid introducing or perpetuating bias in model predictions (e.g., not unfairly targeting specific groups).
- Maintain transparency in how predictions are generated and how they will be used in HR decision-making.
This page summarizes the first part of the project: exploratory data analysis.
Data Dictionary
The dataset contains 14,999 rows and 10 columns for the variables listed below.
Note: For more information about the data, refer to its source on Kaggle.
| Variable | Description |
|---|---|
| satisfaction_level | Employee-reported job satisfaction level [0–1] |
| last_evaluation | Score of employee's last performance review [0–1] |
| number_project | Number of projects employee contributes to |
| average_monthly_hours | Average number of hours employee worked per month |
| time_spend_company | How long the employee has been with the company (years) |
| Work_accident | Whether or not the employee experienced an accident while at work |
| left | Whether or not the employee left the company |
| promotion_last_5years | Whether or not the employee was promoted in the last 5 years |
| Department | The employee's department |
| salary | The employee's salary band (low, medium, or high) |
Descriptive Statistics
Initial Data Observations:
- The workforce displays moderate satisfaction and generally high performance reviews.
- Typical tenure is 3–4 years, with most employees (98%) not promoted recently.
- Workplace accidents are relatively rare (14%).
- Most employees are in lower salary bands and concentrated in sales, technical, and support roles.
- About 24% of employees have left the company.
- No extreme outliers, though a few employees have unusually long tenures or high monthly hours.
Data Wrangling
During initial data exploration, several basic data cleaning steps were taken. Columns were renamed to standardized snake_case format for consistency and easier coding. I confirmed there were no missing values, reducing the risk of bias or errors. Outliers were explored but not removed at this stage; they will be addressed as needed during modeling.
Most importantly, there were 3,008 duplicate rows in the dataset. Since it is highly improbable for two employees to have identical responses across all columns, these duplicate entries were removed from the analysis.
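The cleaning steps above can be sketched in pandas roughly as follows. This is a minimal illustration, not the notebook's actual code: the tiny frame is made up, and the target column names (such as `tenure`) are assumptions.

```python
import pandas as pd

# Tiny made-up frame standing in for the real dataset (values are illustrative).
raw = pd.DataFrame({
    "satisfaction_level": [0.38, 0.80, 0.38, 0.11],
    "average_montly_hours": [157, 262, 157, 272],
    "time_spend_company": [3, 6, 3, 4],
    "Work_accident": [0, 0, 0, 0],
    "Department": ["sales", "sales", "sales", "hr"],
    "left": [1, 1, 1, 1],
})

# Rename to consistent snake_case; the new names are assumptions.
renamed = raw.rename(columns={
    "average_montly_hours": "average_monthly_hours",
    "time_spend_company": "tenure",
    "Work_accident": "work_accident",
    "Department": "department",
})

# Count missing values per column.
missing_per_column = renamed.isna().sum()

# Count rows that exactly repeat an earlier row, then drop them.
n_duplicates = int(renamed.duplicated().sum())
deduped = renamed.drop_duplicates(keep="first").reset_index(drop=True)

print(int(missing_per_column.sum()), n_duplicates, len(deduped))
```

On the real data, `n_duplicates` would be the figure to compare against the 3,008 duplicates reported above.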
Start with a look at the first few rows of data:
| | satisfaction_level | last_evaluation | number_project | average_montly_hours | time_spend_company | Work_accident | left | promotion_last_5years | Department | salary |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.38 | 0.53 | 2 | 157 | 3 | 0 | 1 | 0 | sales | low |
| 1 | 0.80 | 0.86 | 5 | 262 | 6 | 0 | 1 | 0 | sales | medium |
| 2 | 0.11 | 0.88 | 7 | 272 | 4 | 0 | 1 | 0 | sales | medium |
| 3 | 0.72 | 0.87 | 5 | 223 | 5 | 0 | 1 | 0 | sales | low |
| 4 | 0.37 | 0.52 | 2 | 159 | 3 | 0 | 1 | 0 | sales | low |
Below are descriptive statistics for numerical data, followed by categorical data (department and salary) and summary observations.
| | satisfaction_level | last_evaluation | number_project | average_montly_hours | time_spend_company | Work_accident | left | promotion_last_5years |
|---|---|---|---|---|---|---|---|---|
| count | 14999.000000 | 14999.000000 | 14999.000000 | 14999.000000 | 14999.000000 | 14999.000000 | 14999.000000 | 14999.000000 |
| mean | 0.612834 | 0.716102 | 3.803054 | 201.050337 | 3.498233 | 0.144610 | 0.238083 | 0.021268 |
| std | 0.248631 | 0.171169 | 1.232592 | 49.943099 | 1.460136 | 0.351719 | 0.425924 | 0.144281 |
| min | 0.090000 | 0.360000 | 2.000000 | 96.000000 | 2.000000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 0.440000 | 0.560000 | 3.000000 | 156.000000 | 3.000000 | 0.000000 | 0.000000 | 0.000000 |
| 50% | 0.640000 | 0.720000 | 4.000000 | 200.000000 | 3.000000 | 0.000000 | 0.000000 | 0.000000 |
| 75% | 0.820000 | 0.870000 | 5.000000 | 245.000000 | 4.000000 | 0.000000 | 0.000000 | 0.000000 |
| max | 1.000000 | 1.000000 | 7.000000 | 310.000000 | 10.000000 | 1.000000 | 1.000000 | 1.000000 |
Department value counts and percentages:
| Department | Count | Percent |
|---|---|---|
| Sales | 4140 | 27.60 |
| Technical | 2720 | 18.13 |
| Support | 2229 | 14.86 |
| IT | 1227 | 8.18 |
| Product Mng | 902 | 6.01 |
| Marketing | 858 | 5.72 |
| R and D | 787 | 5.25 |
| Accounting | 767 | 5.11 |
| HR | 739 | 4.93 |
| Management | 630 | 4.20 |
Salary value counts and percentages:
| salary | Count | Percent |
|---|---|---|
| Low | 7316 | 48.78 |
| Medium | 6446 | 42.98 |
| High | 1237 | 8.25 |
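Count-and-percent tables like the two above can be produced with a small helper. The sketch below is one possible approach; the helper name `counts_with_percent` is hypothetical, and the salary column is rebuilt from the counts in the table rather than loaded from the dataset.

```python
import pandas as pd

def counts_with_percent(series: pd.Series) -> pd.DataFrame:
    """Return value counts alongside each value's share of the total, in percent."""
    counts = series.value_counts()
    percent = (counts / counts.sum() * 100).round(2)
    return pd.DataFrame({"Count": counts, "Percent": percent})

# Rebuild a salary column from the counts reported in the table above.
salary = pd.Series(["low"] * 7316 + ["medium"] * 6446 + ["high"] * 1237, name="salary")

salary_table = counts_with_percent(salary)
print(salary_table)
```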
Observations from descriptive statistics
- satisfaction_level: Employee job satisfaction scores range from 0.09 to 1.0, with an average of about 0.61. The distribution is fairly wide (std ≈ 0.25), suggesting a mix of satisfied and dissatisfied employees.
- last_evaluation: Performance review scores are generally high (mean ≈ 0.72), ranging from 0.36 to 1.0, with three-quarters of employees scoring 0.56 or higher.
- number_project: Employees work on between 2 and 7 projects, with a median of 4 projects.
- average_monthly_hours: The average employee works about 201 hours per month, with a range from 96 to 310 hours, indicating some employees work significantly more than others.
- time_spend_company: Tenure ranges from 2 to 10 years, with a median of 3 years. There are a few long-tenure employees (up to 10 years), but most are around 3–4 years.
- Work_accident: About 14% of employees have experienced a workplace accident.
- left: About 24% of employees have left the company (mean ≈ 0.24), so roughly one in four employees in the dataset is a leaver.
- promotion_last_5years: Very few employees (about 2%) have been promoted in the last five years.
- department: The largest departments are sales, technical, and support, which together account for about 61% of the workforce. Other departments are notably smaller.
- salary: Most employees are in the low (49%) or medium (43%) salary bands, with only a small proportion (8%) in the high salary band.
Summary:
The data shows a workforce with moderate satisfaction, generally high performance reviews, and a typical tenure of 3–4 years. Most employees have not been promoted recently, and workplace accidents are relatively uncommon. Most employees are in lower salary bands and concentrated in sales, technical, and support roles. There is a notable proportion of employees who have left. There are no extreme outliers, but a few employees have unusually long tenures or high monthly hours.
Outliers
Long-term employees (those with more than five years at the company) are statistical outliers in this dataset. These cases will be excluded from logistic regression models during Model Construction & Validation because logistic regression is particularly sensitive to outliers.
Employees with exceptionally high average monthly hours or an unusually high or low number of projects may also be considered outliers. However, these cases are not easily detected in overall summary statistics or aggregate plots, as their impact is masked by the larger population of typical employees. Aggregate statistics can hide important subgroup patterns, which will be explored further during EDA. These subgroups will remain in the data for model construction, as they may provide valuable insights into attrition risk.
Number of tenure outliers: 824
Outliers percentage of total: 6.87%
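The tenure outlier count is consistent with the common 1.5 × IQR rule; whether the original notebook used exactly this rule is an assumption. The sketch below rebuilds the tenure column from the counts in the tenure table later on this page (after duplicates were removed) and applies the rule.

```python
import numpy as np
import pandas as pd

# Tenure value -> number of employees, summed from the Stayed/Left rows
# of the tenure table in this write-up (de-duplicated data).
tenure_counts = {2: 2910, 3: 5190, 4: 2005, 5: 1062, 6: 542, 7: 94, 8: 81, 10: 107}
tenure = pd.Series(
    np.repeat(list(tenure_counts.keys()), list(tenure_counts.values())),
    name="tenure",
)

# Quartiles and the 1.5 * IQR fences.
q1, q3 = tenure.quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Flag values outside the fences.
is_outlier = (tenure < lower) | (tenure > upper)
n_outliers = int(is_outlier.sum())
pct_outliers = round(n_outliers / len(tenure) * 100, 2)

print(f"fences: [{lower}, {upper}]  outliers: {n_outliers} ({pct_outliers}%)")
```

With quartiles of 3 and 4 years, the upper fence is 5.5 years, so every employee with six or more years of tenure is flagged.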
Exploratory Data Analysis
- Overworked and Miserable: These employees had low satisfaction but were assigned a high number of projects (6–7) and worked 250–300 hours per month. Notably, 100% of employees with 7 projects left.
- Underworked and Dissatisfied: These employees had low satisfaction and worked fewer hours on fewer projects. They may have been fired. Alternatively, they may have given notice, or had already mentally checked out and were assigned less work.
Employees working on 3–4 projects generally stayed. Most groups worked more than a typical 40-hour workweek.
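Attrition is highest at the 4–5 year mark (about 25% of four-year and 45% of five-year employees left), with a sharp drop-off in departures after 5 years: about 20% at six years and none from seven years on. This suggests a critical window for retention efforts. Employees who make it past 5 years are much more likely to stay.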
Both leavers and stayers tend to have similar evaluation scores. Notably, some employees with high evaluations still leave, often those who are overworked. This suggests that strong performance alone does not guarantee retention if other factors (like satisfaction or workload) are problematic.
Relationships Between Variables:
- Satisfaction level is the strongest predictor of attrition. Employees who left had much lower satisfaction than those who stayed.
- Number of projects and average monthly hours show a non-linear relationship: both underworked and overworked employees are more likely to leave, while those with a moderate workload tend to stay.
- Employee evaluation (last performance review) has a weaker relationship with attrition compared to satisfaction or workload.
- Tenure shows a moderate relationship with attrition: employees are most likely to leave at the 4–5 year mark, with departures dropping sharply after 5 years.
- Promotion in the last 5 years is rare, and lack of promotion is associated with higher attrition.
- Department and salary have only minor effects on attrition compared to satisfaction and workload.
- Work accidents are slightly associated with lower attrition, possibly due to increased support after an incident.
Distributions in the Data:
- Most variables (satisfaction, evaluation, monthly hours) are broadly distributed, with some skewness.
- Tenure is concentrated around 3–4 years, with few employees beyond 5 years.
- Number of projects is typically 3–4, but a small group has 6–7 projects (most of whom left).
- Salary is heavily skewed toward low and medium bands.
- There are no extreme outliers, but a few employees have unusually high tenure or monthly hours.
Ethical Considerations:
- Ensure employee data privacy and confidentiality.
- Avoid introducing or perpetuating bias in analysis or modeling.
- Be transparent about how findings and predictions will be used.
- Consider the impact of recommendations on employee well-being and fairness.
Note:
This data is clearly synthetic. It's too clean, and the clusters in the charts are much neater than what you’d see in real-world HR data.
Before starting, look at the distribution of employees who left versus those who stayed. After duplicates are removed, about 17% of employees left, which is not an unusual attrition rate. This class imbalance still matters when training a predictive model.
| | Count | Percent |
|---|---|---|
| Stayed | 10000 | 83.40 |
| Left | 1991 | 16.60 |
Overview Plots
Visualizing Feature Distributions and Relationships by Employee Attrition
These are overview plots that provide a broad look at the data. After these, we’ll focus on individual features in more detail. The goal here is to give an initial sense of the dataset’s structure and key patterns.
The following pairplots show the relationships between features, with the diagonal displaying each feature’s distribution.
Boxplots summarize the overall distribution of each feature. However, as noted earlier, aggregate plots can sometimes hide important subgroups or outliers.
Violin plots are especially useful here, as they reveal the presence of distinct subgroups. For example, in satisfaction_level, you can see the extremely miserable and somewhat dissatisfied employees, along with those who left for more typical reasons. In last_evaluation and average_monthly_hours, employees who left cluster at both extremes, while those who stayed are more evenly distributed. For number_project, leavers are concentrated at both the low and high ends, and for tenure, there is a noticeable spike in departures around the 4–5 year mark.
Finally, we include histograms for each feature, first normalized, to compare proportions between leavers and stayers...
...and then as raw counts:
Satisfaction Levels
These visualizations dramatically illustrate the two main clusters of employees who left.
There are two prominent clusters among employees who left: one group with very low satisfaction who worked long hours, and another group who worked fewer than 40 hours per week and reported moderate dissatisfaction. All of the employees who worked the longest hours left.
Evaluation scores show a similar pattern among employees who left: those with very low satisfaction often received high evaluations, while those with moderate dissatisfaction tended to have relatively low evaluation scores.
| | Mean | Median |
|---|---|---|
| Stayed | 0.667 | 0.690 |
| Left | 0.440 | 0.410 |
Above: On the 0–1 satisfaction scale, employees who left scored 0.227 points lower on the mean and 0.28 points lower on the median than those who stayed. In relative terms, that is about 34% (mean) and 41% (median) lower.
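The gap between the two groups can be stated either in absolute points on the 0–1 scale or relative to the stayers' score, and the two readings differ noticeably. A quick check using the values from the table (the variable names are illustrative):

```python
# Mean and median satisfaction from the table above.
stayed = {"mean": 0.667, "median": 0.690}
left = {"mean": 0.440, "median": 0.410}

gaps = {}
for stat in ("mean", "median"):
    absolute = stayed[stat] - left[stat]        # points on the 0-1 scale
    relative_pct = absolute / stayed[stat] * 100  # percent of the stayers' score
    gaps[stat] = {"absolute": absolute, "relative_pct": relative_pct}
    print(f"{stat}: {absolute:.3f} points lower, {relative_pct:.1f}% lower in relative terms")
```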
Below: Working long hours does not guarantee a high evaluation, nor does a strong evaluation ensure a reasonable workload. Among leavers, the percentage of highly evaluated employees working long hours closely mirrors the pattern seen in satisfaction levels. Many top performers are being driven away.
| Tenure | Left | Count | Percent |
|---|---|---|---|
| 2 | Stayed | 2879 | 98.93 |
| 2 | Left | 31 | 1.07 |
| 3 | Stayed | 4316 | 83.16 |
| 3 | Left | 874 | 16.84 |
| 4 | Stayed | 1510 | 75.31 |
| 4 | Left | 495 | 24.69 |
| 5 | Stayed | 580 | 54.61 |
| 5 | Left | 482 | 45.39 |
| 6 | Stayed | 433 | 79.89 |
| 6 | Left | 109 | 20.11 |
| 7 | Stayed | 94 | 100.00 |
| 7 | Left | 0 | 0.00 |
| 8 | Stayed | 81 | 100.00 |
| 8 | Left | 0 | 0.00 |
| 10 | Stayed | 107 | 100.00 |
| 10 | Left | 0 | 0.00 |
A band of employees with low satisfaction is especially evident at four years of tenure.
There is a clear grouping of leavers who consistently worked long hours (i.e., many in excess of a 60-hour work week). In fact, most employees at this company work above a standard 40-hour work week.
Number of Projects
The number of projects is a strong predictor of attrition. Employees at both the low and high extremes are more likely to leave, and notably, all employees with 7 projects left the company.
| Number of Projects | Left | Count | Percent |
|---|---|---|---|
| 2 | Left | 857 | 54.17 |
| 2 | Stayed | 725 | 45.83 |
| 3 | Left | 38 | 1.08 |
| 3 | Stayed | 3482 | 98.92 |
| 4 | Left | 237 | 6.43 |
| 4 | Stayed | 3448 | 93.57 |
| 5 | Left | 343 | 15.36 |
| 5 | Stayed | 1890 | 84.64 |
| 6 | Left | 371 | 44.92 |
| 6 | Stayed | 455 | 55.08 |
| 7 | Left | 145 | 100.00 |
| 7 | Stayed | 0 | 0.00 |
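Breakdowns like this one (and the tenure, salary, promotion, accident, and department tables on this page) can be computed with a cross-tabulation normalized within each group. The sketch below is one way to do it, not necessarily how the notebook did: it rebuilds the two columns from the counts in the table above, with `left` coded as 1 for leavers and 0 for stayers.

```python
import pandas as pd

# (number of projects, leavers, stayers) copied from the table above.
rows = [(2, 857, 725), (3, 38, 3482), (4, 237, 3448),
        (5, 343, 1890), (6, 371, 455), (7, 145, 0)]

records = []
for projects, n_left, n_stayed in rows:
    records += [{"number_project": projects, "left": 1}] * n_left
    records += [{"number_project": projects, "left": 0}] * n_stayed
df = pd.DataFrame(records)

# Raw counts per group, then the percentage within each project count.
counts = pd.crosstab(df["number_project"], df["left"])
percent = (pd.crosstab(df["number_project"], df["left"], normalize="index") * 100).round(2)

print(counts)
print(percent)
```

Normalizing by `index` makes each row sum to 100%, which is what allows groups of very different sizes to be compared directly.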
There are no notable outliers among employees who stayed. Among those who left, both overworked and underworked patterns are evident, along with a group who appear to have left for more typical reasons. For leavers with many projects (6 or 7), the IQR for monthly hours was very high, entirely above 240 hours/month. In fact, the IQR for almost every group was above a typical 40-hour work week (~167 hours / month). Interestingly, a few employees with 7 projects reported unusually low monthly hours, which may reflect data anomalies or unique circumstances.
Among employees who left, dissatisfaction is most evident for those assigned a very high number of projects. Conversely, those with fewer projects also show signs of lower satisfaction, possibly indicating disengagement.
There are no clear patterns linking number of projects, salary, and attrition. However, the relatively small group of high-salaried employees tends to fall in the middle range for number of projects.
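Salary & Promotions
Salary shows only a modest relationship with attrition: the share of leavers falls from about 20% in the low band to about 15% in the medium band and about 5% in the high band. The 'high' salary group is roughly five to six times smaller than either of the others, limiting its impact on overall trends.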
Promotions were rare and, notably, all of the employees with the highest workload left.
| Salary | Left | Count | Percent |
|---|---|---|---|
| High | Left | 48 | 4.85 |
| High | Stayed | 942 | 95.15 |
| Low | Left | 1174 | 20.45 |
| Low | Stayed | 4566 | 79.55 |
| Medium | Left | 769 | 14.62 |
| Medium | Stayed | 4492 | 85.38 |
| Promotion Last 5 Years | Left | Count | Percent |
|---|---|---|---|
| No | Left | 1983 | 16.82 |
| No | Stayed | 9805 | 83.18 |
| Yes | Left | 8 | 3.94 |
| Yes | Stayed | 195 | 96.06 |
| No | Total | 11788 | 98.31 |
| Yes | Total | 203 | 1.69 |
Work Accident
Somewhat unexpectedly, having a work accident is associated with a lower likelihood of leaving. This could suggest that employees who experience an accident may receive increased support or attention from HR or the company, which encourages them to stay. However, this association could also be coincidental.
| Work Accident | Left | Count | Percent |
|---|---|---|---|
| No | Left | 1886 | 18.60 |
| No | Stayed | 8255 | 81.40 |
| Yes | Left | 105 | 5.68 |
| Yes | Stayed | 1745 | 94.32 |
Department
Department-level attrition closely matches the overall stay/leave split (83%/17%) and company-wide satisfaction levels, suggesting department itself is not a major factor. More granular data (e.g., by manager or team) might uncover specific problem areas, but nothing stands out in the current breakdown.
| Department | Left | Count | Percent |
|---|---|---|---|
| IT | Left | 158 | 16.19 |
| IT | Stayed | 818 | 83.81 |
| R and D | Left | 85 | 12.25 |
| R and D | Stayed | 609 | 87.75 |
| Accounting | Left | 109 | 17.55 |
| Accounting | Stayed | 512 | 82.45 |
| HR | Left | 113 | 18.80 |
| HR | Stayed | 488 | 81.20 |
| Management | Left | 52 | 11.93 |
| Management | Stayed | 384 | 88.07 |
| Marketing | Left | 112 | 16.64 |
| Marketing | Stayed | 561 | 83.36 |
| Product Mng | Left | 110 | 16.03 |
| Product Mng | Stayed | 576 | 83.97 |
| Sales | Left | 550 | 16.98 |
| Sales | Stayed | 2689 | 83.02 |
| Support | Left | 312 | 17.13 |
| Support | Stayed | 1509 | 82.87 |
| Technical | Left | 390 | 17.38 |
| Technical | Stayed | 1854 | 82.62 |
Correlation Matrix
The correlation matrix shows moderate correlation between some variables. Employee attrition (leaving) is most strongly and negatively correlated with satisfaction level, indicating that less satisfied employees are more likely to leave. There are moderate positive correlations between average monthly hours, last evaluation, and number of projects, as well as a moderate association between tenure and attrition.
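A correlation matrix of this kind can be computed with pandas by restricting the frame to numeric columns. The sketch below uses a handful of made-up rows purely to show the mechanics; the values are not from the dataset, and the snake_case column names are assumptions.

```python
import pandas as pd

# Made-up rows for illustration only; not taken from the real data.
df = pd.DataFrame({
    "satisfaction_level": [0.9, 0.8, 0.7, 0.2, 0.1, 0.4],
    "number_project": [3, 4, 4, 6, 7, 2],
    "average_monthly_hours": [160, 170, 180, 280, 300, 140],
    "left": [0, 0, 0, 1, 1, 1],
    "department": ["sales", "it", "hr", "sales", "it", "support"],
})

# Pearson correlation over numeric columns only (department is excluded).
corr = df.select_dtypes(include="number").corr()

# Rank features by their correlation with attrition.
corr_with_left = corr["left"].drop("left").sort_values()

print(corr.round(2))
print(corr_with_left.round(2))
```

Pearson correlation only captures linear association, so the U-shaped effect of workload described earlier will be understated in a matrix like this.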
Insights
The data suggests significant challenges with employee retention at this company. Two main groups of leavers emerge:
- Underutilized and Dissatisfied: Employees in this category worked on fewer projects and logged fewer hours than a typical full-time schedule, and reported lower satisfaction. These individuals may have been disengaged, assigned less work as they prepared to leave, or potentially subject to layoffs or terminations.
- Overworked and Burned Out: The second group managed a high number of projects (up to 7) and worked exceptionally long hours, sometimes exceeding 70 hours per week (the maximum of 310 hours per month works out to roughly 72 hours per week). These employees exhibited very low satisfaction and rarely received promotions, suggesting that high demands without recognition or advancement led to burnout and resignation.
A majority of the workforce greatly exceeds the typical 40-hour work week (160–184 hours per month), pointing to a workplace culture that expects long hours. The combination of high workload and limited opportunities for advancement likely fuels dissatisfaction and increases the risk of turnover.
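Converting between monthly and weekly hours makes these comparisons easier to check. The sketch below assumes a 52-week year spread evenly over 12 months; that convention is an assumption, and others (for example, counting working days per month) give slightly different figures, which is why a range is quoted above. The function names are illustrative.

```python
WEEKS_PER_YEAR = 52    # assumption: every week of the year is worked
MONTHS_PER_YEAR = 12

def monthly_to_weekly(hours_per_month: float) -> float:
    """Convert average monthly hours to average weekly hours."""
    return hours_per_month * MONTHS_PER_YEAR / WEEKS_PER_YEAR

def weekly_to_monthly(hours_per_week: float) -> float:
    """Convert average weekly hours to average monthly hours."""
    return hours_per_week * WEEKS_PER_YEAR / MONTHS_PER_YEAR

full_time_month = weekly_to_monthly(40)   # a 40-hour week expressed per month
median_week = monthly_to_weekly(200)      # median monthly hours from the statistics table
max_week = monthly_to_weekly(310)         # maximum monthly hours from the statistics table

print(f"40 h/week is about {full_time_month:.1f} h/month")
print(f"200 h/month is about {median_week:.1f} h/week")
print(f"310 h/month is about {max_week:.1f} h/week")
```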
Performance evaluations show only a weak link to attrition; both those who left and those who stayed received similar review scores. This indicates that strong performance alone does not guarantee retention, especially if employees are overworked or lack opportunities for growth.
Other variables, such as department, salary, and work accidents, do not show strong predictive value for employee churn compared to satisfaction and workload. Overall, the data points to issues with workload management and limited career progression as the main factors driving employee turnover at this company.