Exercise - Data exploration part 1 - Check for outliers
You still have some lingering missing values, but let's set those missing values aside for a moment. Otherwise, you might be tempted to impute replacement values, and possible outliers in the dataset might skew your replacements.
Outliers are data values so far outside the distribution of other values that they bring into question whether they even belong in the dataset. Outliers often arise from data errors or other undesirable noise. You'll always need to check for and deal with possible outliers before you analyze the data.
A quick way to identify outliers is to use the pandas describe()
function:
player_df.describe()
Output
Row | ID | points | possessions | team_pace | GP | MPG | TS% | AST | TO | USG | ORR | DRR | REBR | PER |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 42.000000 | 42.000000 | 42.000000 | 42.000000 | 36.000000 | 37.00000 | 42.000000 | 42.000000 | 42.000000 | 42.000000 | 42.000000 | 42.000000 | 42.000000 | 33.000000 |
mean | 23.738095 | 1592.380952 | 1346.323810 | 106.526190 | 59.861111 | 35.27027 | 0.596357 | 26.221429 | 13.083333 | 30.478571 | 4.280952 | 14.850000 | 8.497619 | 22.594242 |
std | 13.826555 | 351.876707 | 214.503887 | 6.676791 | 5.576239 | 1.91135 | 0.029284 | 5.199275 | 2.038282 | 4.919079 | 2.074025 | 7.277538 | 4.995876 | 7.255338 |
min | 1.000000 | 183.000000 | 147.300000 | 89.700000 | 48.000000 | 30.50000 | 0.511000 | 14.800000 | 9.200000 | 16.500000 | 0.000000 | 0.900000 | -2.900000 | 8.710000 |
25% | 12.250000 | 1390.750000 | 1283.650000 | 102.400000 | 55.750000 | 34.10000 | 0.572250 | 23.275000 | 11.350000 | 26.975000 | 2.925000 | 8.475000 | 5.225000 | 20.120000 |
50% | 23.500000 | 1680.000000 | 1369.800000 | 106.800000 | 60.500000 | 35.40000 | 0.604000 | 27.200000 | 13.500000 | 30.650000 | 4.650000 | 16.800000 | 9.150000 | 24.230000 |
75% | 35.750000 | 1826.250000 | 1444.050000 | 111.950000 | 64.000000 | 36.90000 | 0.618750 | 30.550000 | 14.775000 | 34.850000 | 5.900000 | 20.325000 | 12.875000 | 28.380000 |
max | 46.000000 | 2062.000000 | 1542.000000 | 118.800000 | 69.000000 | 38.80000 | 0.645000 | 33.200000 | 16.200000 | 36.700000 | 8.300000 | 29.400000 | 16.400000 | 34.260000 |
Here you see, for example, that the mean for all 42 players is 1592.38 points. But look at the numbers for min (183), 25% (1390.75), 50% (1680.0), 75% (1826.25), and max (2062). Here, the min points (183) might be an outlier. You can use box plots to visualize the values and determine possible outliers.
Create box plots for columns
The traditional tool for probing for outlying data values is the box plot. The box in box plot refers to a box drawn around the range of data from the 25th percentile to the 75th percentile. (These percentiles demarcate important quarters of the data. Their range is called the interquartile range.) This box is the middle 50% of the data values for a given variable (a column in a DataFrame). You use another line to mark the median of the data, which is the 50th percentile.
The box plot is also called a box-and-whisker plot because you draw a T shape above and below the box to encompass the maximum and minimum values of the data, excluding outliers. This last part is important for your purposes because it lets you graphically identify outliers.
Ideally, you would produce the box plots for your columns in a single matrix that you can easily scan. Unfortunately, no single function produces multiple box plots, so you'll write a for
loop instead.
Because of how the Seaborn library in Python works, you need to explicitly state the cell in the matrix where you want to render each box plot. Use the Python floor-division operator (//
) to divide the 13 columns of interest (you don't need to look at ID
) into rows. Use the modulo operator (%
) to derive the column.
First, import the Matplotlib and Seaborn libraries into your notebook:
import matplotlib.pyplot as plt
import seaborn as sns
Note
If you get an error when you import matplotlib
or seaborn
, you might need to install the libraries. For more information, see the Matplotlib installation guide and the Seaborn installation guide.
You should see no output, indicating that the libraries were successfully imported. Now you can:
- Create a list of the column names, excluding
ID
. Use the list to find specific values within each row. - Create a matrix of subplots so you have one figure that shows all 13 columns.
- Add padding around the subplots to make them easier to read.
- Create a box plot based on the data in each column, across all of the rows.
# Create a list of all column names, except for ID.
cols = list(player_df.iloc[:, 1:])
# Create a 3x5 matrix of subplots.
fig, axes = plt.subplots(3, 5, figsize=(18, 11))
# Create padding around subplots to make the axis labels readable.
fig.tight_layout(pad=2.0)
# Loop over the columns of the DataFrame and create a box plot for each one.
for i in range(len(cols)):
sns.boxplot(ax=axes[i//5, i%5], y=player_df[cols[i]])
Sure enough, you can see two outliers in points
and possessions
. These outliers are represented as diamonds near the bottom of the first two box plots that visualize the points
and possessions
columns. The values are so far outside the range of the other data that they likely represent garbage data and not a seriously underperforming player.
Here, subject matter expertise becomes important. In this case, that expertise is knowledge about basketball. The likelihood of a professional player scoring only one point in their career is extremely small. Similarly, the likelihood of a player having 1,400 fewer possessions than the average number of possessions is also extremely small. So you can be confident that this outlying data is incorrect.
In a different dataset, the factors might differ. So stopping to carefully evaluate outliers is critical to the data science journey.
Identify and remove the rows that contain the outlier values
You could dig further into the data if you needed to. Was some of the data mistyped or corrupted in some way? With enough subject matter expertise or familiarity with how the data was gathered, you could attempt to correct the erroneous values. But you could also risk introducing your own bias or preconceived notions about the data into the dataset. Those accidents could ultimately hurt your analysis. Because you have 42 records remaining in the data, you're probably safest removing the offending row or rows.
The box plot tells you only that the outlying values exist, however. You now need to figure out the rows in which the values occur.
Because both values occur below the minimum value of their respective columns, you can use the idxmin()
method on both columns. This method returns the index number of the row in which a minimum value occurs. (You'll want to check each value separately.)
# Identify the index number of the row that has the lowest value in 'points'.
points_outlier = player_df['points'].idxmin()
points_outlier
35
# Identify the index number of the row that has the lowest value in 'possession'.
possession_outlier = player_df['possessions'].idxmin()
possession_outlier
35
Fortunately, the outliers are both on the same row. You can now use the drop()
function again to manually remove the row.
You could manually enter the index number for the row. But this manual change would introduce the possibility of mistyping. Instead, use the variable you created to house the minimum value.
# Drop the row that has the outlying values for 'points' and 'possessions'.
player_df.drop(player_df.index[points_outlier], inplace=True)
# Check the end of the DataFrame to ensure that the correct row was dropped.
player_df.tail(10)
Output
Row | ID | points | possessions | team_pace | GP | MPG | TS% | AST | TO | USG | ORR | DRR | REBR | PER |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
31 | 36 | 2062.0 | 1505.7 | 111.5 | NaN | 37.0 | 0.620 | 29.8 | 15.6 | 36.2 | 4.9 | 23.9 | 14.7 | 27.86 |
32 | 37 | 1845.0 | 1435.7 | 113.1 | 69.0 | 36.9 | 0.634 | 33.2 | 14.0 | 36.5 | 4.1 | 21.5 | 16.4 | 34.26 |
33 | 38 | 1778.0 | 1526.4 | 109.3 | 66.0 | 34.9 | 0.612 | 30.6 | 15.9 | 35.9 | 5.5 | 18.8 | 13.7 | 28.65 |
34 | 39 | 1901.0 | 1444.1 | 109.7 | 67.0 | 36.5 | 0.609 | 27.2 | 14.8 | 35.5 | 5.0 | 21.8 | 8.9 | 20.12 |
36 | 41 | 2030.0 | 1431.0 | 112.3 | 68.0 | 37.0 | 0.618 | 32.5 | 15.3 | 34.5 | 5.7 | 15.7 | 13.2 | 30.07 |
37 | 42 | 1631.0 | 1465.7 | 110.1 | 66.0 | 37.5 | 0.613 | 28.4 | 14.4 | 35.7 | 6.5 | 20.7 | 14.0 | 28.40 |
38 | 43 | 1828.0 | 1507.2 | 112.7 | 64.0 | 36.5 | 0.618 | 31.3 | 14.0 | 34.9 | 5.9 | 21.3 | 14.5 | NaN |
39 | 44 | 1821.0 | 1443.7 | 118.8 | 66.0 | 36.6 | 0.609 | 27.3 | 13.5 | 35.8 | 7.0 | 23.8 | 11.5 | 22.96 |
40 | 45 | 1740.0 | 1443.9 | 114.1 | 68.0 | 37.1 | 0.611 | 26.6 | 15.2 | 29.3 | 8.3 | 17.7 | 11.1 | 21.22 |
41 | 46 | 1993.0 | 1459.0 | 112.5 | NaN | 36.9 | 0.627 | 30.4 | 15.0 | 33.7 | 6.3 | 19.3 | 14.1 | 28.76 |
Like the head()
function, the tail()
function shows the last five values of a DataFrame. And, sure enough, you've removed the row you needed to remove (for example, the row where the index is 35).
You can reset the index for the DataFrame again to ensure accuracy within the data:
# Renumber the DataFrame index to reflect the dropped rows.
player_df.reset_index(drop=True, inplace=True)
If you execute player_df.tail(10)
again, you'll see the indexes in order now until row 40.
© 2021 Warner Bros. Ent. All Rights Reserved.
Need help? See our troubleshooting guide or provide specific feedback by reporting an issue.