Real-World Datasets

20 Jan 2020

The machine learning course referenced at the bottom of this page uses a few real-world datasets. They are small, easy to load, and handy for other machine learning projects as well.

import pandas as pd
import sklearn

print('    Pandas version ' + str(pd.__version__))
print('   Sklearn version ' + str(sklearn.__version__))

    Pandas version 0.25.3
   Sklearn version 0.22.1

Crimes Dataset

Used for regression.

Originally taken from: https://archive.ics.uci.edu/ml/datasets/Communities+and+Crime+Unnormalized

Processed as shown below, the dataset contains:

88 continuous features (columns 0 through 87),
1 continuous label (column 88, “ViolentCrimesPerPop”), and
1994 entries.

import pandas as pd

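crime_datafile = 'real-world-datasets/CommViolPredUnnormalizedData.txt'
# The file is comma-separated; '?' marks a missing value.
crime = pd.read_csv(crime_datafile, na_values='?')
# Keep 88 feature columns plus the label (raw column 145, ViolentCrimesPerPop).
columns_to_keep = ([5, 6] + list(range(11, 26))
                   + list(range(32, 103)) + [145])
# Drop rows with any missing value, then renumber the index from 0.
crime = (crime[crime.columns[columns_to_keep]]
         .dropna()
         .reset_index(drop=True))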
crime.head()

	population	householdsize	agePct12t21	agePct12t29	agePct16t24	agePct65up	numbUrban	pctUrban	medIncome	pctWWage	...	MedOwnCostPctInc	MedOwnCostPctIncNoMtg	NumInShelters	NumStreet	PctForeignBorn	PctBornSameState	PctSameHouse85	PctSameCity85	PctSameState85	ViolentCrimesPerPop
0	11980	3.10	12.47	21.44	10.93	11.33	11980	100.0	75122	89.24	...	21.1	14.0	11	0	10.66	53.72	65.29	78.09	89.14	41.02
1	23123	2.82	11.01	21.30	10.48	17.18	23123	100.0	47917	78.99	...	20.7	12.5	0	0	8.30	77.17	71.27	90.22	96.12	127.56
2	29344	2.43	11.36	25.88	11.01	10.28	29344	100.0	35669	82.00	...	21.7	11.6	16	0	5.00	44.77	36.60	61.26	82.85	218.59
3	16656	2.40	12.55	25.20	12.19	17.57	0	0.0	20580	68.15	...	20.6	14.5	0	0	2.04	88.71	56.70	90.17	96.24	306.64
4	140494	2.45	18.09	32.89	20.04	13.26	140494	100.0	21577	75.78	...	17.3	11.7	327	4	1.49	64.35	42.29	70.61	85.66	442.95

5 rows × 89 columns
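
As a quick sanity check, the index arithmetic behind the column selection can be verified without the data file, and the processed frame can then be split into a feature matrix and a label vector. This is a minimal sketch: `split_features_label` is a hypothetical helper that is not part of the course materials, and it assumes the label is the last column, as it is in the processed `crime` frame.

```python
import pandas as pd

# Same index list as in the loading code above.
columns_to_keep = ([5, 6] + list(range(11, 26))
                   + list(range(32, 103)) + [145])
# 2 + 15 + 71 feature columns, plus 1 label column.
n_selected = len(columns_to_keep)
print(n_selected)  # 89


def split_features_label(df):
    """Split a frame into (features, label), assuming the label is last.

    Hypothetical helper, for illustration only.
    """
    X = df.iloc[:, :-1]   # every column except the last
    y = df.iloc[:, -1]    # the last column
    return X, y
```

With the `crime` frame loaded as above, `X_crime, y_crime = split_features_label(crime)` would give the 88 feature columns and the `ViolentCrimesPerPop` label.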

Breast Cancer Dataset

Used for classification.

The breast cancer dataset is included with scikit-learn as a “Bunch” object. Process as shown below to convert to a Pandas DataFrame.

The dataset contains:

30 continuous features (columns 0 through 29),
1 discrete label (binary, column 30, “target”; 0 indicates “malignant”, 1 indicates “benign”), and
569 entries.

import pandas as pd
from sklearn.datasets import load_breast_cancer

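# load_breast_cancer() returns a Bunch: a dict-like object with
# 'data', 'target', and 'feature_names' entries, among others.
cancer_bunch = load_breast_cancer()
cancer = pd.DataFrame(cancer_bunch['data'],
                      columns=cancer_bunch['feature_names'])
# Append the label as the last column (0 = malignant, 1 = benign).
cancer['target'] = cancer_bunch['target']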

cancer.head()

	mean radius	mean texture	mean perimeter	mean area	mean smoothness	mean compactness	mean concavity	mean concave points	mean symmetry	mean fractal dimension	...	worst texture	worst perimeter	worst area	worst smoothness	worst compactness	worst concavity	worst concave points	worst symmetry	worst fractal dimension
0	17.99	10.38	122.80	1001.0	0.11840	0.27760	0.3001	0.14710	0.2419	0.07871	...	17.33	184.60	2019.0	0.1622	0.6656	0.7119	0.2654	0.4601	0.11890
1	20.57	17.77	132.90	1326.0	0.08474	0.07864	0.0869	0.07017	0.1812	0.05667	...	23.41	158.80	1956.0	0.1238	0.1866	0.2416	0.1860	0.2750	0.08902
2	19.69	21.25	130.00	1203.0	0.10960	0.15990	0.1974	0.12790	0.2069	0.05999	...	25.53	152.50	1709.0	0.1444	0.4245	0.4504	0.2430	0.3613	0.08758
3	11.42	20.38	77.58	386.1	0.14250	0.28390	0.2414	0.10520	0.2597	0.09744	...	26.50	98.87	567.7	0.2098	0.8663	0.6869	0.2575	0.6638	0.17300
4	20.29	14.34	135.10	1297.0	0.10030	0.13280	0.1980	0.10430	0.1809	0.05883	...	16.67	152.20	1575.0	0.1374	0.2050	0.4000	0.1625	0.2364	0.07678

5 rows × 31 columns
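
A frame in this shape is usually split into features and label before a classifier is trained. The following is a sketch under assumptions: the use of `train_test_split`, the `random_state=0` seed, and the stratified split are illustrative choices, not something this page prescribes.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

# Rebuild the frame the same way as above.
cancer_bunch = load_breast_cancer()
cancer = pd.DataFrame(cancer_bunch['data'],
                      columns=cancer_bunch['feature_names'])
cancer['target'] = cancer_bunch['target']

# Features are every column except the label.
X_cancer = cancer.drop(columns='target')
y_cancer = cancer['target']

# Class balance: 0 = malignant, 1 = benign.
class_counts = {int(k): int(v)
                for k, v in y_cancer.value_counts().items()}
print(class_counts)

# Hold out 25% of the rows (the scikit-learn default) and keep the
# class proportions similar in both parts via stratify.
X_train, X_test, y_train, y_test = train_test_split(
    X_cancer, y_cancer, stratify=y_cancer, random_state=0)
print(X_train.shape, X_test.shape)
```

The dataset is moderately imbalanced (357 benign versus 212 malignant samples), which is why the sketch stratifies the split.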

Fruits Dataset

Used for classification.

The dataset was originally created by Dr. Iain Murray of the University of Edinburgh. The version used here was slightly modified and reformatted by professors at the University of Michigan.

The dataset contains:

4 continuous features (columns 0 through 3),
1 computer-readable discrete label (column 4, 4 possible values),
2 human-readable discrete labels (columns 5 and 6), and
59 entries.

import pandas as pd

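fruit_datafile = 'real-world-datasets/fruit_data_with_colors.txt'
# The file is tab-separated. Order the columns as features first,
# then the numeric label, then the human-readable labels.
fruit = (pd.read_csv(fruit_datafile, sep='\t')
         [['mass', 'width', 'height', 'color_score',
           'fruit_label',
           'fruit_name', 'fruit_subtype']])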
fruit.head()

	mass	width	height	color_score	fruit_label	fruit_name	fruit_subtype
0	192	8.4	7.3	0.55	1	apple	granny_smith
1	180	8.0	6.8	0.59	1	apple	granny_smith
2	176	7.4	7.2	0.60	1	apple	granny_smith
3	86	6.2	4.7	0.80	2	mandarin	mandarin
4	84	6.0	4.6	0.79	2	mandarin	mandarin
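
Because `fruit_label` and `fruit_name` encode the same information, a lookup from the numeric label to the readable name is handy when interpreting a classifier's predictions. A minimal sketch follows. It uses only the five rows printed above as stand-in data so that it runs without the data file, and `label_to_name` is a hypothetical helper.

```python
import pandas as pd

# Stand-in data: the five rows printed by fruit.head() above.
fruit_head = pd.DataFrame({
    'mass': [192, 180, 176, 86, 84],
    'width': [8.4, 8.0, 7.4, 6.2, 6.0],
    'height': [7.3, 6.8, 7.2, 4.7, 4.6],
    'color_score': [0.55, 0.59, 0.60, 0.80, 0.79],
    'fruit_label': [1, 1, 1, 2, 2],
    'fruit_name': ['apple', 'apple', 'apple', 'mandarin', 'mandarin'],
    'fruit_subtype': ['granny_smith'] * 3 + ['mandarin'] * 2,
})


def label_to_name(df):
    """Map each numeric fruit_label to its fruit_name (hypothetical helper)."""
    pairs = df[['fruit_label', 'fruit_name']].drop_duplicates()
    return {int(label): name
            for label, name in zip(pairs['fruit_label'], pairs['fruit_name'])}


lookup = label_to_name(fruit_head)
print(lookup)  # {1: 'apple', 2: 'mandarin'}

# Feature matrix and label vector, selected by column name.
X_fruit = fruit_head[['mass', 'width', 'height', 'color_score']]
y_fruit = fruit_head['fruit_label']
```

Applied to the full `fruit` frame, the same call should yield one entry per label value.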

Detailed DataFrame Info

Crime

crime.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1994 entries, 0 to 1993
Data columns (total 89 columns):
population               1994 non-null int64
householdsize            1994 non-null float64
agePct12t21              1994 non-null float64
agePct12t29              1994 non-null float64
agePct16t24              1994 non-null float64
agePct65up               1994 non-null float64
numbUrban                1994 non-null int64
pctUrban                 1994 non-null float64
medIncome                1994 non-null int64
pctWWage                 1994 non-null float64
pctWFarmSelf             1994 non-null float64
pctWInvInc               1994 non-null float64
pctWSocSec               1994 non-null float64
pctWPubAsst              1994 non-null float64
pctWRetire               1994 non-null float64
medFamInc                1994 non-null int64
perCapInc                1994 non-null int64
NumUnderPov              1994 non-null int64
PctPopUnderPov           1994 non-null float64
PctLess9thGrade          1994 non-null float64
PctNotHSGrad             1994 non-null float64
PctBSorMore              1994 non-null float64
PctUnemployed            1994 non-null float64
PctEmploy                1994 non-null float64
PctEmplManu              1994 non-null float64
PctEmplProfServ          1994 non-null float64
PctOccupManu             1994 non-null float64
PctOccupMgmtProf         1994 non-null float64
MalePctDivorce           1994 non-null float64
MalePctNevMarr           1994 non-null float64
FemalePctDiv             1994 non-null float64
TotalPctDiv              1994 non-null float64
PersPerFam               1994 non-null float64
PctFam2Par               1994 non-null float64
PctKids2Par              1994 non-null float64
PctYoungKids2Par         1994 non-null float64
PctTeen2Par              1994 non-null float64
PctWorkMomYoungKids      1994 non-null float64
PctWorkMom               1994 non-null float64
NumKidsBornNeverMar      1994 non-null int64
PctKidsBornNeverMar      1994 non-null float64
NumImmig                 1994 non-null int64
PctImmigRecent           1994 non-null float64
PctImmigRec5             1994 non-null float64
PctImmigRec8             1994 non-null float64
PctImmigRec10            1994 non-null float64
PctRecentImmig           1994 non-null float64
PctRecImmig5             1994 non-null float64
PctRecImmig8             1994 non-null float64
PctRecImmig10            1994 non-null float64
PctSpeakEnglOnly         1994 non-null float64
PctNotSpeakEnglWell      1994 non-null float64
PctLargHouseFam          1994 non-null float64
PctLargHouseOccup        1994 non-null float64
PersPerOccupHous         1994 non-null float64
PersPerOwnOccHous        1994 non-null float64
PersPerRentOccHous       1994 non-null float64
PctPersOwnOccup          1994 non-null float64
PctPersDenseHous         1994 non-null float64
PctHousLess3BR           1994 non-null float64
MedNumBR                 1994 non-null int64
HousVacant               1994 non-null int64
PctHousOccup             1994 non-null float64
PctHousOwnOcc            1994 non-null float64
PctVacantBoarded         1994 non-null float64
PctVacMore6Mos           1994 non-null float64
MedYrHousBuilt           1994 non-null int64
PctHousNoPhone           1994 non-null float64
PctWOFullPlumb           1994 non-null float64
OwnOccLowQuart           1994 non-null int64
OwnOccMedVal             1994 non-null int64
OwnOccHiQuart            1994 non-null int64
OwnOccQrange             1994 non-null int64
RentLowQ                 1994 non-null int64
RentMedian               1994 non-null int64
RentHighQ                1994 non-null int64
RentQrange               1994 non-null int64
MedRent                  1994 non-null int64
MedRentPctHousInc        1994 non-null float64
MedOwnCostPctInc         1994 non-null float64
MedOwnCostPctIncNoMtg    1994 non-null float64
NumInShelters            1994 non-null int64
NumStreet                1994 non-null int64
PctForeignBorn           1994 non-null float64
PctBornSameState         1994 non-null float64
PctSameHouse85           1994 non-null float64
PctSameCity85            1994 non-null float64
PctSameState85           1994 non-null float64
ViolentCrimesPerPop      1994 non-null float64
dtypes: float64(67), int64(22)
memory usage: 1.4 MB

Cancer

cancer.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 31 columns):
mean radius                569 non-null float64
mean texture               569 non-null float64
mean perimeter             569 non-null float64
mean area                  569 non-null float64
mean smoothness            569 non-null float64
mean compactness           569 non-null float64
mean concavity             569 non-null float64
mean concave points        569 non-null float64
mean symmetry              569 non-null float64
mean fractal dimension     569 non-null float64
radius error               569 non-null float64
texture error              569 non-null float64
perimeter error            569 non-null float64
area error                 569 non-null float64
smoothness error           569 non-null float64
compactness error          569 non-null float64
concavity error            569 non-null float64
concave points error       569 non-null float64
symmetry error             569 non-null float64
fractal dimension error    569 non-null float64
worst radius               569 non-null float64
worst texture              569 non-null float64
worst perimeter            569 non-null float64
worst area                 569 non-null float64
worst smoothness           569 non-null float64
worst compactness          569 non-null float64
worst concavity            569 non-null float64
worst concave points       569 non-null float64
worst symmetry             569 non-null float64
worst fractal dimension    569 non-null float64
target                     569 non-null int64
dtypes: float64(30), int64(1)
memory usage: 137.9 KB
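
The facts that `info()` prints can also be checked programmatically, which is convenient in a script or a test. A sketch, assuming the `cancer` frame is rebuilt as earlier on this page; `summarize` is a hypothetical helper.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer


def summarize(df):
    """Return row count, column count, and total missing values.

    Hypothetical helper, for illustration only.
    """
    return {
        'rows': len(df),
        'columns': df.shape[1],
        'missing': int(df.isna().sum().sum()),
    }


cancer_bunch = load_breast_cancer()
cancer = pd.DataFrame(cancer_bunch['data'],
                      columns=cancer_bunch['feature_names'])
cancer['target'] = cancer_bunch['target']

summary = summarize(cancer)
print(summary)  # {'rows': 569, 'columns': 31, 'missing': 0}

# Number of float64 columns, mirroring the "dtypes:" line of info().
n_float = cancer.select_dtypes(include='float64').shape[1]
print(n_float)  # 30
```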

Fruit

fruit.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 59 entries, 0 to 58
Data columns (total 7 columns):
mass             59 non-null int64
width            59 non-null float64
height           59 non-null float64
color_score      59 non-null float64
fruit_label      59 non-null int64
fruit_name       59 non-null object
fruit_subtype    59 non-null object
dtypes: float64(3), int64(2), object(2)
memory usage: 3.4+ KB

These notes were taken from the Coursera course Applied Machine Learning in Python. The information is presented by Kevyn Collins-Thompson, PhD, an associate professor of Information and Computer Science at the University of Michigan.