Real-World Datasets
The machine learning course referenced at the bottom of this page uses a few real-world datasets, which are also handy for other machine learning projects.
import pandas as pd
import sklearn
print(' Pandas version ' + str(pd.__version__))
print(' Sklearn version ' + str(sklearn.__version__))
Pandas version 0.25.3
Sklearn version 0.22.1
Crimes Dataset
Used for regression.
Originally taken from: https://archive.ics.uci.edu/ml/datasets/Communities+and+Crime+Unnormalized
Processed as shown below, the dataset contains:
- 88 continuous features (columns 0 through 87),
- 1 continuous label (column 88, “ViolentCrimesPerPop”), and
- 1994 entries.
import pandas as pd
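crime_datafile = 'real-world-datasets/CommViolPredUnnormalizedData.txt'
# The raw file is comma-separated and marks missing values with '?'.
crime = pd.read_csv(crime_datafile,
                    sep=',',
                    na_values='?')
# Keep the 88 feature columns plus the label (raw column 145).
columns_to_keep = ([5, 6] + list(range(11, 26))
                   + list(range(32, 103)) + [145])
# Drop rows with any missing value, then renumber the index.
crime = (crime[crime.columns[columns_to_keep]]
         .dropna()
         .reset_index(drop=True))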
crime.head()
| | population | householdsize | agePct12t21 | agePct12t29 | agePct16t24 | agePct65up | numbUrban | pctUrban | medIncome | pctWWage | ... | MedOwnCostPctInc | MedOwnCostPctIncNoMtg | NumInShelters | NumStreet | PctForeignBorn | PctBornSameState | PctSameHouse85 | PctSameCity85 | PctSameState85 | ViolentCrimesPerPop |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 11980 | 3.10 | 12.47 | 21.44 | 10.93 | 11.33 | 11980 | 100.0 | 75122 | 89.24 | ... | 21.1 | 14.0 | 11 | 0 | 10.66 | 53.72 | 65.29 | 78.09 | 89.14 | 41.02 |
1 | 23123 | 2.82 | 11.01 | 21.30 | 10.48 | 17.18 | 23123 | 100.0 | 47917 | 78.99 | ... | 20.7 | 12.5 | 0 | 0 | 8.30 | 77.17 | 71.27 | 90.22 | 96.12 | 127.56 |
2 | 29344 | 2.43 | 11.36 | 25.88 | 11.01 | 10.28 | 29344 | 100.0 | 35669 | 82.00 | ... | 21.7 | 11.6 | 16 | 0 | 5.00 | 44.77 | 36.60 | 61.26 | 82.85 | 218.59 |
3 | 16656 | 2.40 | 12.55 | 25.20 | 12.19 | 17.57 | 0 | 0.0 | 20580 | 68.15 | ... | 20.6 | 14.5 | 0 | 0 | 2.04 | 88.71 | 56.70 | 90.17 | 96.24 | 306.64 |
4 | 140494 | 2.45 | 18.09 | 32.89 | 20.04 | 13.26 | 140494 | 100.0 | 21577 | 75.78 | ... | 17.3 | 11.7 | 327 | 4 | 1.49 | 64.35 | 42.29 | 70.61 | 85.66 | 442.95 |
5 rows × 89 columns
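The feature and label counts quoted for this dataset can be checked directly from the column selection, without loading the file. The sketch below reuses `columns_to_keep` from the snippet above; the names `n_columns`, `n_features`, and `label_position` are introduced here purely for illustration.

```python
# Column positions kept from the raw file, copied from the processing code.
columns_to_keep = ([5, 6] + list(range(11, 26))
                   + list(range(32, 103)) + [145])

n_columns = len(columns_to_keep)        # total columns retained
n_features = n_columns - 1              # everything except the final label column
label_position = columns_to_keep[-1]    # raw-file position of the label column

print(n_columns, n_features, label_position)  # prints: 89 88 145
```

The 2 + 15 + 71 feature positions plus the single label position give the 89 columns reported by `crime.head()`.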
Breast Cancer Dataset
Used for classification.
The breast cancer dataset is included with scikit-learn as a “Bunch” object. Process as shown below to convert to a Pandas DataFrame.
The dataset contains:
- 30 continuous features (columns 0 through 29),
- 1 discrete label (binary, column 30, “target”), where
  - 0 indicates “malignant” and
  - 1 indicates “benign”, and
- 569 entries.
import pandas as pd
from sklearn.datasets import load_breast_cancer
cancer_bunch = load_breast_cancer()
cancer = pd.DataFrame(cancer_bunch['data'],
                      columns=cancer_bunch['feature_names'])
cancer['target'] = cancer_bunch['target']
cancer.head()
| | mean radius | mean texture | mean perimeter | mean area | mean smoothness | mean compactness | mean concavity | mean concave points | mean symmetry | mean fractal dimension | ... | worst texture | worst perimeter | worst area | worst smoothness | worst compactness | worst concavity | worst concave points | worst symmetry | worst fractal dimension | target |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 17.99 | 10.38 | 122.80 | 1001.0 | 0.11840 | 0.27760 | 0.3001 | 0.14710 | 0.2419 | 0.07871 | ... | 17.33 | 184.60 | 2019.0 | 0.1622 | 0.6656 | 0.7119 | 0.2654 | 0.4601 | 0.11890 | 0 |
1 | 20.57 | 17.77 | 132.90 | 1326.0 | 0.08474 | 0.07864 | 0.0869 | 0.07017 | 0.1812 | 0.05667 | ... | 23.41 | 158.80 | 1956.0 | 0.1238 | 0.1866 | 0.2416 | 0.1860 | 0.2750 | 0.08902 | 0 |
2 | 19.69 | 21.25 | 130.00 | 1203.0 | 0.10960 | 0.15990 | 0.1974 | 0.12790 | 0.2069 | 0.05999 | ... | 25.53 | 152.50 | 1709.0 | 0.1444 | 0.4245 | 0.4504 | 0.2430 | 0.3613 | 0.08758 | 0 |
3 | 11.42 | 20.38 | 77.58 | 386.1 | 0.14250 | 0.28390 | 0.2414 | 0.10520 | 0.2597 | 0.09744 | ... | 26.50 | 98.87 | 567.7 | 0.2098 | 0.8663 | 0.6869 | 0.2575 | 0.6638 | 0.17300 | 0 |
4 | 20.29 | 14.34 | 135.10 | 1297.0 | 0.10030 | 0.13280 | 0.1980 | 0.10430 | 0.1809 | 0.05883 | ... | 16.67 | 152.20 | 1575.0 | 0.1374 | 0.2050 | 0.4000 | 0.1625 | 0.2364 | 0.07678 | 0 |
5 rows × 31 columns
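For classification it often helps to separate the features from the label and to attach the human-readable class names. The following is a minimal sketch, assuming scikit-learn and pandas are installed; the `diagnosis` column and the `label_lookup`, `X`, `y`, and `class_counts` names are hypothetical and not part of the original processing.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer

# Rebuild the DataFrame exactly as in the processing code.
cancer_bunch = load_breast_cancer()
cancer = pd.DataFrame(cancer_bunch['data'],
                      columns=cancer_bunch['feature_names'])
cancer['target'] = cancer_bunch['target']

# target_names is ordered by label code: 0 -> 'malignant', 1 -> 'benign'.
label_lookup = {code: str(name)
                for code, name in enumerate(cancer_bunch['target_names'])}

# 'diagnosis' is a hypothetical helper column holding the readable label.
cancer['diagnosis'] = cancer['target'].map(label_lookup)

# Split into a feature matrix and a label vector.
X = cancer.drop(columns=['target', 'diagnosis'])
y = cancer['target']

class_counts = cancer['diagnosis'].value_counts()
print(X.shape, y.shape)
print(class_counts)
```

Checking `class_counts` before training is a quick way to see how balanced the two classes are.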
Fruits Dataset
Used for classification.
The dataset was originally created by Dr. Iain Murray of the University of Edinburgh; the version used here was slightly modified and formatted by professors at the University of Michigan.
The dataset contains:
- 4 continuous features (columns 0 through 3),
- 1 computer-readable discrete label (column 4, 4 possible values),
- 2 human-readable discrete labels (columns 5 and 6), and
- 59 entries.
import pandas as pd
fruit_datafile = 'real-world-datasets/fruit_data_with_colors.txt'
fruit = (pd.read_csv(fruit_datafile,
                     sep='\t')
         [['mass', 'width', 'height', 'color_score',
           'fruit_label',
           'fruit_name', 'fruit_subtype']])
fruit.head()
| | mass | width | height | color_score | fruit_label | fruit_name | fruit_subtype |
---|---|---|---|---|---|---|---|
0 | 192 | 8.4 | 7.3 | 0.55 | 1 | apple | granny_smith |
1 | 180 | 8.0 | 6.8 | 0.59 | 1 | apple | granny_smith |
2 | 176 | 7.4 | 7.2 | 0.60 | 1 | apple | granny_smith |
3 | 86 | 6.2 | 4.7 | 0.80 | 2 | mandarin | mandarin |
4 | 84 | 6.0 | 4.6 | 0.79 | 2 | mandarin | mandarin |
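Because the dataset carries both a numeric label and human-readable labels, a lookup from one to the other is useful when reporting predictions. The sketch below is illustrative only: it re-types the five rows shown above as tab-separated text so that it runs without the data file, and the names `sample_text`, `fruit_sample`, and `label_to_name` are assumptions introduced here.

```python
import io
import pandas as pd

# The five rows displayed by fruit.head(), as tab-separated text.
sample_text = (
    "mass\twidth\theight\tcolor_score\tfruit_label\tfruit_name\tfruit_subtype\n"
    "192\t8.4\t7.3\t0.55\t1\tapple\tgranny_smith\n"
    "180\t8.0\t6.8\t0.59\t1\tapple\tgranny_smith\n"
    "176\t7.4\t7.2\t0.60\t1\tapple\tgranny_smith\n"
    "86\t6.2\t4.7\t0.80\t2\tmandarin\tmandarin\n"
    "84\t6.0\t4.6\t0.79\t2\tmandarin\tmandarin\n"
)
fruit_sample = pd.read_csv(io.StringIO(sample_text), sep='\t')

# Collect each distinct (label, name) pair and build the lookup.
pairs = fruit_sample[['fruit_label', 'fruit_name']].drop_duplicates()
label_to_name = {int(label): name
                 for label, name in zip(pairs['fruit_label'],
                                        pairs['fruit_name'])}

print(label_to_name)  # prints: {1: 'apple', 2: 'mandarin'}
```

Applied to the full `fruit` DataFrame, the same two lines would presumably yield one entry for each of the four label values; the sample here covers only the two that appear in the first five rows.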
Detailed DataFrame Info
Crime
crime.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1994 entries, 0 to 1993
Data columns (total 89 columns):
population 1994 non-null int64
householdsize 1994 non-null float64
agePct12t21 1994 non-null float64
agePct12t29 1994 non-null float64
agePct16t24 1994 non-null float64
agePct65up 1994 non-null float64
numbUrban 1994 non-null int64
pctUrban 1994 non-null float64
medIncome 1994 non-null int64
pctWWage 1994 non-null float64
pctWFarmSelf 1994 non-null float64
pctWInvInc 1994 non-null float64
pctWSocSec 1994 non-null float64
pctWPubAsst 1994 non-null float64
pctWRetire 1994 non-null float64
medFamInc 1994 non-null int64
perCapInc 1994 non-null int64
NumUnderPov 1994 non-null int64
PctPopUnderPov 1994 non-null float64
PctLess9thGrade 1994 non-null float64
PctNotHSGrad 1994 non-null float64
PctBSorMore 1994 non-null float64
PctUnemployed 1994 non-null float64
PctEmploy 1994 non-null float64
PctEmplManu 1994 non-null float64
PctEmplProfServ 1994 non-null float64
PctOccupManu 1994 non-null float64
PctOccupMgmtProf 1994 non-null float64
MalePctDivorce 1994 non-null float64
MalePctNevMarr 1994 non-null float64
FemalePctDiv 1994 non-null float64
TotalPctDiv 1994 non-null float64
PersPerFam 1994 non-null float64
PctFam2Par 1994 non-null float64
PctKids2Par 1994 non-null float64
PctYoungKids2Par 1994 non-null float64
PctTeen2Par 1994 non-null float64
PctWorkMomYoungKids 1994 non-null float64
PctWorkMom 1994 non-null float64
NumKidsBornNeverMar 1994 non-null int64
PctKidsBornNeverMar 1994 non-null float64
NumImmig 1994 non-null int64
PctImmigRecent 1994 non-null float64
PctImmigRec5 1994 non-null float64
PctImmigRec8 1994 non-null float64
PctImmigRec10 1994 non-null float64
PctRecentImmig 1994 non-null float64
PctRecImmig5 1994 non-null float64
PctRecImmig8 1994 non-null float64
PctRecImmig10 1994 non-null float64
PctSpeakEnglOnly 1994 non-null float64
PctNotSpeakEnglWell 1994 non-null float64
PctLargHouseFam 1994 non-null float64
PctLargHouseOccup 1994 non-null float64
PersPerOccupHous 1994 non-null float64
PersPerOwnOccHous 1994 non-null float64
PersPerRentOccHous 1994 non-null float64
PctPersOwnOccup 1994 non-null float64
PctPersDenseHous 1994 non-null float64
PctHousLess3BR 1994 non-null float64
MedNumBR 1994 non-null int64
HousVacant 1994 non-null int64
PctHousOccup 1994 non-null float64
PctHousOwnOcc 1994 non-null float64
PctVacantBoarded 1994 non-null float64
PctVacMore6Mos 1994 non-null float64
MedYrHousBuilt 1994 non-null int64
PctHousNoPhone 1994 non-null float64
PctWOFullPlumb 1994 non-null float64
OwnOccLowQuart 1994 non-null int64
OwnOccMedVal 1994 non-null int64
OwnOccHiQuart 1994 non-null int64
OwnOccQrange 1994 non-null int64
RentLowQ 1994 non-null int64
RentMedian 1994 non-null int64
RentHighQ 1994 non-null int64
RentQrange 1994 non-null int64
MedRent 1994 non-null int64
MedRentPctHousInc 1994 non-null float64
MedOwnCostPctInc 1994 non-null float64
MedOwnCostPctIncNoMtg 1994 non-null float64
NumInShelters 1994 non-null int64
NumStreet 1994 non-null int64
PctForeignBorn 1994 non-null float64
PctBornSameState 1994 non-null float64
PctSameHouse85 1994 non-null float64
PctSameCity85 1994 non-null float64
PctSameState85 1994 non-null float64
ViolentCrimesPerPop 1994 non-null float64
dtypes: float64(67), int64(22)
memory usage: 1.4 MB
Cancer
cancer.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 31 columns):
mean radius 569 non-null float64
mean texture 569 non-null float64
mean perimeter 569 non-null float64
mean area 569 non-null float64
mean smoothness 569 non-null float64
mean compactness 569 non-null float64
mean concavity 569 non-null float64
mean concave points 569 non-null float64
mean symmetry 569 non-null float64
mean fractal dimension 569 non-null float64
radius error 569 non-null float64
texture error 569 non-null float64
perimeter error 569 non-null float64
area error 569 non-null float64
smoothness error 569 non-null float64
compactness error 569 non-null float64
concavity error 569 non-null float64
concave points error 569 non-null float64
symmetry error 569 non-null float64
fractal dimension error 569 non-null float64
worst radius 569 non-null float64
worst texture 569 non-null float64
worst perimeter 569 non-null float64
worst area 569 non-null float64
worst smoothness 569 non-null float64
worst compactness 569 non-null float64
worst concavity 569 non-null float64
worst concave points 569 non-null float64
worst symmetry 569 non-null float64
worst fractal dimension 569 non-null float64
target 569 non-null int64
dtypes: float64(30), int64(1)
memory usage: 137.9 KB
Fruit
fruit.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 59 entries, 0 to 58
Data columns (total 7 columns):
mass 59 non-null int64
width 59 non-null float64
height 59 non-null float64
color_score 59 non-null float64
fruit_label 59 non-null int64
fruit_name 59 non-null object
fruit_subtype 59 non-null object
dtypes: float64(3), int64(2), object(2)
memory usage: 3.4+ KB
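The key facts in the `info()` listings (row count, column count, dtype tallies, and the absence of missing values) can also be collected programmatically. This is a rough sketch rather than part of the original notebook; `summarize` is a hypothetical helper, and the toy frame simply reuses three numeric values from the fruit table.

```python
import pandas as pd

def summarize(frame):
    """Return row count, column count, dtype tallies, and missing-value count."""
    dtype_counts = frame.dtypes.value_counts()
    return {
        'rows': len(frame),
        'columns': frame.shape[1],
        'dtypes': {str(dtype): int(count)
                   for dtype, count in dtype_counts.items()},
        'missing': int(frame.isna().sum().sum()),
    }

# Toy frame for illustration; pass crime, cancer, or fruit instead.
toy = pd.DataFrame({'mass': [192, 180, 176],
                    'width': [8.4, 8.0, 7.4],
                    'height': [7.3, 6.8, 7.2]})
summary = summarize(toy)
print(summary)
```

A `missing` value of 0 corresponds to every column in the listings above reporting the full non-null count.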