Exercise - Data cleansing part 1 - Find missing values
Basketball is full of statistics. You'll use some of them to make decisions about the hypothetical basketball team in this module.
To make good decisions, you need good data. Throughout this module, you'll use a dataset that's based on publicly available basketball stats. You'll also use stats we created based on Tune Squad characters. All of the data is anonymized.
As fans, we can't influence the methodology for gathering data. This detail introduces some questions:
- Were best practices used to gather the data?
- Were biases introduced in the data?
- Is there data that wasn't gathered that could influence our findings?
Because we're fans of the sport, and not necessarily the scientists who gather the data, you'll explore the data by looking for missing values, outlying values, and patterns.
Data scientists commonly find themselves in this sort of situation. Collecting data can take years. So when you're first starting out in an area, you might have to rely on previously collected data to guide initial findings. As you know from the data science lifecycle, the data science process is iterative. If you discover patterns and insights from the current data, you can perhaps influence the data that's collected in the future!
Explore the data
Before you jump into modifying the data, take a moment to explore it. You can open the CSV file in Visual Studio Code, or use common pandas functions:
# Print out the first five rows of the player_df DataFrame.
player_df.head()
Output
 | ID | points | possessions | team_pace | Unnamed: 4 | Unnamed: 5 | GP | MPG | TS% | AST | TO | USG | ORR | DRR | REBR | PER
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | 1 | 1893.0 | 1251.8 | 97.8 | NaN | NaN | 63.0 | 33.9 | 0.569 | 17.2 | 11.5 | 26.1 | 4.7 | 23.3 | 7.8 | 10.90
1 | 2 | 1386.0 | 1282.5 | 110.5 | NaN | NaN | 58.0 | 32.5 | 0.511 | 24.8 | 9.7 | 26.9 | 6.1 | 0.9 | 10.7 | 27.30
2 | 3 | 1405.0 | 1252.3 | 105.8 | NaN | NaN | 55.0 | 36.3 | 0.605 | 25.7 | 13.9 | 28.1 | 4.5 | 4.9 | 1.8 | NaN
3 | 4 | 1282.0 | 1235.9 | 100.7 | NaN | NaN | 54.0 | 37.6 | 0.636 | 29.5 | 11.0 | 22.3 | 4.8 | 4.6 | 5.6 | 22.35
4 | 5 | 1721.0 | 1254.0 | 105.7 | NaN | NaN | 59.0 | 30.5 | 0.589 | 22.8 | 9.9 | 24.6 | 1.2 | 8.4 | 12.1 | 28.38
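If you want to reproduce this step yourself, the DataFrame has to be loaded from the CSV file first. Here's a minimal sketch using `pandas.read_csv`; the inline CSV data is illustrative, and in the module you'd pass the file's path instead of an in-memory buffer:

```python
import io

import pandas as pd

# A few rows of CSV text shaped like the module's file (values are illustrative).
csv_text = """ID,points,possessions,team_pace
1,1893.0,1251.8,97.8
2,1386.0,1282.5,110.5
"""

# In the module, you'd pass the CSV file's path instead of a StringIO buffer.
player_df = pd.read_csv(io.StringIO(csv_text))

# Show the dimensions (rows, columns) of the DataFrame.
print(player_df.shape)
```

Empty cells in the CSV become `NaN` values in the resulting DataFrame, which is why the exploration steps that follow focus on finding them.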
The output shows several ambiguous column headings. Let's break down those headings:
Column name | Description |
---|---|
ID | Unique identifier for each player in the dataset |
points | Total points scored by a player in a season |
possessions | Total possessions by a player in a season |
team_pace | Average number of possessions a team uses per game |
Unnamed: 4 | Unknown data |
Unnamed: 5 | Unknown data |
GP | Games played by a player in a season |
MPG | Average minutes played by a player per game |
TS% | True shooting percentage, a shooting percentage that accounts for free throws and three-pointers |
AST | Assist ratio, the percentage of a player's possessions that end in an assist |
TO | Turnover ratio, the percentage of a player's possessions that end in a turnover |
USG | Usage rate, the number of possessions a player uses per 40 minutes |
ORR | Offensive rebound rate |
DRR | Defensive rebound rate |
REBR | Rebound rate, the percentage of missed shots that a player rebounds |
PER | Player efficiency rating, the measure of a player's per-minute productivity on the court |
Many of these data points make sense in the context of basketball. But even without a technical understanding of each of these columns, you can still start to cleanse the data. So if you're not well versed in these terms, don't worry! You can still gain many insights into the data.
The important thing to understand is that each of these columns lists data that can be counted during a game, except for the PER column. PER (player efficiency rating) is a calculation based on all of the other player stats. It determines "how good" a player is.
You can use the PER column to predict how effective a player is during a game. As of this writing, the basketball player who has the highest PER is Michael Jordan. His score is 27.91. The player who has the second-highest rating, and the highest rating of currently active professional basketball players, is LeBron James. His score is 27.49.
The calculation of PER isn't perfect. Some fans and data scientists might choose to evaluate players differently. But in this module, you'll use PER as the measurement to help make decisions about players before and during games.
Find missing values
Missing values are common in organically collected datasets. To look for missing values, use the built-in `isna()` function in pandas DataFrames. This function flags each occurrence of a `NaN` value in the DataFrame. Earlier, you saw at least two columns that have many `NaN` values, so that's a good place to start your cleansing.
`NaN` stands for "not a number." It's a special floating-point value that represents an undefined value. It's different from, say, using `''` or `0`, because `NaN` literally has no value. You imported this value from empty cells in the CSV file.
Because you're most interested in seeing how many `NaN` values are in each column of the DataFrame, pair the `isna()` function with the DataFrame `sum()` function. A nice thing about pandas is that you can chain functions one after another to create powerful, concise code. For example:
# Total up the number of NaN values in each column of the DataFrame.
player_df.isna().sum()
ID 0
points 3
possessions 3
team_pace 3
Unnamed: 4 46
Unnamed: 5 46
GP 7
MPG 6
TS% 1
AST 1
TO 1
USG 1
ORR 1
DRR 1
REBR 1
PER 10
dtype: int64
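To see why the counts come out per column, here's the same `isna().sum()` pattern on a small toy DataFrame (the data is made up for illustration, not taken from the module's dataset):

```python
import numpy as np
import pandas as pd

# Toy DataFrame with deliberate NaN values (illustrative data only).
df = pd.DataFrame({
    'points': [1893.0, np.nan, 1405.0],
    'GP': [63.0, 58.0, np.nan],
    'Unnamed: 4': [np.nan, np.nan, np.nan],
})

# isna() marks each NaN cell as True; sum() then counts the True
# values down each column, yielding one NaN count per column.
nan_counts = df.isna().sum()
print(nan_counts)
```

Because `True` counts as 1 and `False` as 0, summing the Boolean DataFrame gives the per-column tallies you see in the output above.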
You know that you have 46 rows of data, because you can see that in the CSV file and in the printout for the DataFrame:
# Print out the information about the DataFrame.
player_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 46 entries, 0 to 45
Data columns (total 16 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 ID 46 non-null int64
1 points 43 non-null float64
2 possessions 43 non-null float64
3 team_pace 43 non-null float64
4 Unnamed: 4 0 non-null float64
5 Unnamed: 5 0 non-null float64
6 GP 39 non-null float64
7 MPG 40 non-null float64
8 TS% 45 non-null float64
9 AST 45 non-null float64
10 TO 45 non-null float64
11 USG 45 non-null float64
12 ORR 45 non-null float64
13 DRR 45 non-null float64
14 REBR 45 non-null float64
15 PER 36 non-null float64
dtypes: float64(15), int64(1)
memory usage: 5.9 KB
There are 46 rows, and you see that the two unnamed columns have 0 non-null values (which means all 46 rows have `NaN` values for these two columns). So you can deduce that at least two columns can be eliminated from the DataFrame.
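One common way to eliminate such columns is `dropna` with `how='all'`, which drops only the columns where every value is `NaN`. This is a sketch on a toy stand-in for `player_df`; the module may remove the columns a different way:

```python
import numpy as np
import pandas as pd

# Toy stand-in for player_df: two fully empty "Unnamed" columns,
# plus a column with a single NaN that should survive.
player_df = pd.DataFrame({
    'ID': [1, 2, 3],
    'points': [1893.0, 1386.0, np.nan],
    'Unnamed: 4': [np.nan] * 3,
    'Unnamed: 5': [np.nan] * 3,
})

# Drop only the columns in which every value is NaN.
player_df = player_df.dropna(axis='columns', how='all')
print(list(player_df.columns))
```

Note that `how='all'` leaves the `points` column alone even though it contains a `NaN`: a column is dropped only when it has no real values at all.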
© 2021 Warner Bros. Ent. All Rights Reserved.