Exercise - Data cleansing part 2 - Drop columns and rows - Training

10 minutes

One of the quickest ways to cleanse data is to drop columns and rows that don't add value to your data-discovery goals. In the previous unit, you discovered two columns that have only NaN values for each row. They were unnamed columns, so they were probably included in the original dataset by mistake. Those columns are straightforward to drop.

Drop columns

To drop columns, you'll use the dropna() method. Like isna(), dropna() looks for NaN values. But it goes a step further and removes either the rows or the columns that contain NaN values. Start by setting some parameters to the method:

By default, dropna() removes rows, so specify that you want to remove columns by using the axis parameter.
The dropna() method usually returns a new DataFrame. Use the inplace parameter to tell it to drop these columns in the original player_df DataFrame.
You also want dropna() to remove only columns in which all of the values are missing. So set the how parameter to 'all'.

# Drop columns that have no values.
player_df.dropna(axis='columns', inplace=True, how='all')
player_df.isna().sum()

ID              0
points          3
possessions     3
team_pace       3
GP              7
MPG             6
TS%             1
AST             1
TO              1
USG             1
ORR             1
DRR             1
REBR            1
PER            10
dtype: int64

You successfully removed the empty columns. You see that if you call .info() again:

player_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 46 entries, 0 to 45
Data columns (total 14 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   ID           46 non-null     int64  
 1   points       43 non-null     float64
 2   possessions  43 non-null     float64
 3   team_pace    43 non-null     float64
 4   GP           39 non-null     float64
 5   MPG          40 non-null     float64
 6   TS%          45 non-null     float64
 7   AST          45 non-null     float64
 8   TO           45 non-null     float64
 9   USG          45 non-null     float64
 10  ORR          45 non-null     float64
 11  DRR          45 non-null     float64
 12  REBR         45 non-null     float64
 13  PER          36 non-null     float64
dtypes: float64(13), int64(1)
memory usage: 5.2 KB

You now have only 14 columns instead of 16. Most of the rows have data for each column.

Drop rows

You should also address the possibility that an entire row is missing values. That is, one of the players in the DataFrame might have no stats. You can try to address this missing row as you did last time, by using dropna(). But this time, use the method's default row-based behavior.

# Drop rows that have no values.
player_df.dropna(inplace=True, how='all')
player_df.isna().sum()

ID              0
points          3
possessions     3
team_pace       3
GP              7
MPG             6
TS%             1
AST             1
TO              1
USG             1
ORR             1
DRR             1
REBR            1
PER            10
dtype: int64

Hmmm. No change. So you know that no row has all NaN values. That's great!

Now look at the DataFrame itself. It's large enough that you wouldn't want to print all of the rows all the time. But it's not so large that it's unmanageable to print it here once.

# Show the entire DataFrame.
player_df

Output

	ID	points	possessions	team_pace	GP	MPG	TS%	AST	TO	USG	ORR	DRR	REBR	PER
0	1	1893.0	1251.8	97.8	63.0	33.9	0.569	17.2	11.5	26.1	4.7	23.3	7.8	10.90
1	2	1386.0	1282.5	110.5	58.0	32.5	0.511	24.8	9.7	26.9	6.1	0.9	10.7	27.30
2	3	1405.0	1252.3	105.8	55.0	36.3	0.605	25.7	13.9	28.1	4.5	4.9	1.8	NaN
3	4	1282.0	1235.9	100.7	54.0	37.6	0.636	29.5	11.0	22.3	4.8	4.6	5.6	22.35
4	5	1721.0	1254.0	105.7	59.0	30.5	0.589	22.8	9.9	24.6	1.2	8.4	12.1	28.38
5	6	1004.0	1322.4	102.1	57.0	36.3	0.574	20.3	13.5	31.0	1.2	20.5	3.4	9.83
6	7	1920.0	1207.6	109.8	55.0	37.0	0.573	26.0	13.1	30.7	0.0	10.3	-2.6	8.71
7	8	1353.0	1348.1	112.2	57.0	NaN	0.574	24.4	11.3	26.3	5.5	5.8	-2.2	NaN
8	9	NaN	NaN	NaN	54.0	31.7	0.577	24.6	14.4	35.7	2.5	14.1	-4.6	4.68
9	10	1468.0	1400.9	93.7	61.0	35.6	0.547	22.9	12.2	22.7	5.8	6.4	-2.9	NaN
10	11	1856.0	1303.8	93.4	55.0	33.6	0.563	28.9	14.7	26.7	1.4	17.3	4.3	9.67
11	12	1058.0	1231.5	99.4	63.0	32.8	0.550	25.1	12.9	27.2	2.1	12.1	11.4	22.97
12	13	1452.0	1246.1	103.6	NaN	31.7	0.584	32.4	14.7	16.5	3.2	19.0	4.1	NaN
13	14	1438.0	1499.6	108.1	60.0	35.0	0.569	28.6	12.2	36.2	4.9	6.0	7.4	27.57
14	15	1782.0	1329.0	104.6	51.0	NaN	0.590	26.5	9.9	29.7	3.1	29.4	5.5	9.38
15	16	1276.0	1257.5	89.7	58.0	32.8	0.572	15.5	16.2	23.8	3.4	15.4	6.0	9.49
16	17	1486.0	1343.6	100.6	NaN	34.9	0.603	26.2	11.1	36.7	3.0	14.3	5.3	NaN
17	18	1326.0	1542.0	107.8	52.0	31.8	0.583	25.3	11.7	30.6	3.3	9.9	5.3	19.78
18	19	1535.0	1336.2	104.0	56.0	35.4	0.636	27.2	10.8	29.9	0.7	5.6	7.9	30.44
19	20	NaN	NaN	NaN	57.0	34.0	0.526	28.3	11.7	25.6	3.1	4.7	1.0	11.57
20	21	1276.0	1354.9	101.2	50.0	NaN	0.560	21.4	12.4	28.6	2.6	9.3	10.3	24.23
21	22	1784.0	1287.1	101.3	59.0	33.8	0.601	27.2	12.4	21.8	2.9	5.2	9.9	25.15
22	23	1729.0	1205.0	103.8	61.0	NaN	0.645	20.6	13.1	31.5	4.0	12.0	5.2	NaN
23	24	1639.0	1287.4	105.2	52.0	35.4	0.570	32.7	15.0	26.6	1.6	5.0	9.4	25.12
24	25	NaN	NaN	NaN	59.0	34.9	0.578	16.9	11.8	28.7	2.8	14.1	3.0	12.27
25	26	1370.0	1384.7	103.3	NaN	34.2	0.571	14.8	9.9	33.1	2.4	16.3	3.7	15.04
26	27	1497.0	1410.9	98.3	57.0	34.1	0.613	15.4	10.0	29.3	3.8	5.7	6.7	25.35
27	28	1313.0	1420.5	105.0	48.0	35.1	0.569	19.0	10.9	29.2	4.6	17.7	5.2	NaN
28	29	1464.0	1353.4	103.3	63.0	34.4	0.595	17.9	9.2	29.3	1.0	8.7	4.1 20.49
29	30	1645.0	1415.7	109.8	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
30	31	2049.0	1434.0	110.0	64.0	38.8	0.619	31.5	14.9	35.5	8.3	17.6	12.8	28.44
31	32	1795.0	1481.8	112.1	62.0	35.4	0.608	31.9	14.5	32.0	6.5	22.5	12.9	23.34
32	33	1805.0	1509.9	108.6	64.0	35.4	0.622	27.9	13.9	36.0	5.9	27.7	12.2	22.41
33	34	1743.0	1422.4	112.9	64.0	36.3	0.619	30.9	15.6	34.5	5.9	18.9	14.8	NaN
34	35	1963.0	1539.1	117.4	NaN	NaN	0.633	32.3	16.2	34.0	5.9	19.8	13.1	27.16
35	36	2062.0	1505.7	111.5	NaN	37.0	0.620	29.8	15.6	36.2	4.9	23.9	14.7	27.86
36	37	1845.0	1435.7	113.1	69.0	36.9	0.634	33.2	14.0	36.5	4.1	21.5	16.4	34.26
37	38	1778.0	1526.4	109.3	66.0	34.9	0.612	30.6	15.9	35.9	5.5	18.8	13.7	28.65
38	39	1901.0	1444.1	109.7	67.0	36.5	0.609	27.2	14.8	35.5	5.0	21.8	8.9	20.12
39	40	183.0	147.3	118.1	63.0	37.5	0.623	31.2	13.5	34.7	6.3	18.7	13.6	29.81
40	41	2030.0	1431.0	112.3	68.0	37.0	0.618	32.5	15.3	34.5	5.7	15.7	13.2	30.07
41	42	1631.0	1465.7	110.1	66.0	37.5	0.613	28.4	14.4	35.7	6.5	20.7	14.0	28.40
42	43	1828.0	1507.2	112.7	64.0	36.5	0.618	31.3	14.0	34.9	5.9	21.3	14.5	NaN
43	44	1821.0	1443.7	118.8	66.0	36.6	0.609	27.3	13.5	35.8	7.0	23.8	11.5	22.96
44	45	1740.0	1443.9	114.1	68.0	37.1	0.611	26.6	15.2	29.3	8.3	17.7	11.1	21.22
45	46	1993.0	1459.0	112.5	NaN	36.9	0.627	30.4	15.0	33.7	6.3	19.3	14.1	28.76

Now you see that three rows are missing the same three values. Another row is missing 10 values. This information indicates two things:

The dataset likely comes from two datasets that were joined together, and both of these earlier datasets had missing rows.
You'll need to use critical thinking to determine how to remove these rows.

The dataset is small enough that you could manually drop the problem rows. But that shortcut wouldn't give you practice dealing with larger datasets, where manual removal isn't practical. So use the built-in pandas methods instead.

The how parameter in dropna() can be set to only 'any' or 'all'. Neither of those settings will get you what you need. Instead, use the thresh parameter.

The thresh parameter refers to threshold. This parameter lets you set the minimum number of non-NaN values a row or column needs to avoid being dropped by dropna(). To remove specific rows from the DataFrame, set thresh to 12.

# Drop all rows that don't have at least 12 non-NaN values.
player_df.dropna(inplace=True, thresh=12)
player_df.isna().sum()

ID             0
points         0
possessions    0
team_pace      0
GP             6
MPG            5
TS%            0
AST            0
TO             0
USG            0
ORR            0
DRR            0
REBR           0
PER            9
dtype: int64

Now only three columns are missing values. But the rows have so few missing values that it's best to keep the rows and fill in the missing values with likely numbers. You'll address these rows in another unit.

Reset the DataFrame index

Because you've dropped rows, the index in the DataFrame is compromised. You see this problem if you print out the first 10 rows of the DataFrame:

# Print the first 10 rows of the player_df DataFrame.
player_df.head(10)

Output

	ID	points	possessions	team_pace	GP	MPG	TS%	AST	TO	USG	ORR	DRR	REBR	PER
0	1	1893.0	1251.8	97.8	63.0	33.9	0.569	17.2	11.5	26.1	4.7	23.3	7.8	10.90
1	2	1386.0	1282.5	110.5	58.0	32.5	0.511	24.8	9.7	26.9	6.1	0.9	10.7	27.30
2	3	1405.0	1252.3	105.8	55.0	36.3	0.605	25.7	13.9	28.1	4.5	4.9	1.8	NaN
3	4	1282.0	1235.9	100.7	54.0	37.6	0.636	29.5	11.0	22.3	4.8	4.6	5.6	22.35
4	5	1721.0	1254.0	105.7	59.0	30.5	0.589	22.8	9.9	24.6	1.2	8.4	12.1	28.38
5	6	1004.0	1322.4	102.1	57.0	36.3	0.574	20.3	13.5	31.0	1.2	20.5	3.4	9.83
6	7	1920.0	1207.6	109.8	55.0	37.0	0.573	26.0	13.1	30.7	0.0	10.3	-2.6	8.71
7	8	1353.0	1348.1	112.2	57.0	NaN	0.574	24.4	11.3	26.3	5.5	5.8	-2.2	NaN
9	10	1468.0	1400.9	93.7	61.0	35.6	0.547	22.9	12.2	22.7	5.8	6.4	-2.9	NaN
10	11	1856.0	1303.8	93.4	55.0	33.6	0.563	28.9	14.7	26.7	1.4	17.3	4.3	9.67

You see that the index counts 0 through 10, skipping 8. The row that had the index of 8 was dropped because it had more than two NaN values. In the 14 columns, the rows that had three or more NaN values didn't meet the threshold of 12 you set when you dropped rows earlier.

To fix this problem, reset the index for the DataFrame. This fix saves you from problems down the road when you're working with the DataFrame. While you're at it, take a look at the now smaller DataFrame.

# Renumber the DataFrame index to account for the dropped rows.
player_df.reset_index(drop=True, inplace=True)
player_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 42 entries, 0 to 41
Data columns (total 14 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   ID           42 non-null     int64  
 1   points       42 non-null     float64
 2   possessions  42 non-null     float64
 3   team_pace    42 non-null     float64
 4   GP           36 non-null     float64
 5   MPG          37 non-null     float64
 6   TS%          42 non-null     float64
 7   AST          42 non-null     float64
 8   TO           42 non-null     float64
 9   USG          42 non-null     float64
 10  ORR          42 non-null     float64
 11  DRR          42 non-null     float64
 12  REBR         42 non-null     float64
 13  PER          33 non-null     float64
dtypes: float64(13), int64(1)
memory usage: 4.7 KB

You see that you now have only 42 players in the DataFrame. Only three columns are missing some values.

Continue

Exercise - Data cleansing part 2 - Drop columns and rows

Drop columns

Drop rows

Reset the DataFrame index

Feedback