Real-World Datasets

The machine learning course referenced at the bottom of this page uses a few real-world datasets. They are small, well-documented, and handy for other machine learning projects as well.

import pandas as pd
import sklearn

print('    Pandas version ' + str(pd.__version__))
print('   Sklearn version ' + str(sklearn.__version__))
    Pandas version 0.25.3
   Sklearn version 0.22.1

Crimes Dataset

Used for regression.

Originally taken from: https://archive.ics.uci.edu/ml/datasets/Communities+and+Crime+Unnormalized

Processed as shown below, the dataset contains:

  • 88 continuous features (columns 0 through 87),
  • 1 continuous label (column 88, “ViolentCrimesPerPop”), and
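  • 1994 entries.

import pandas as pd

crime_datafile = 'real-world-datasets/CommViolPredUnnormalizedData.txt'
# Comma-separated file; '?' marks a missing value.
crime = pd.read_csv(crime_datafile,
                    sep=',',
                    na_values='?')
# Keep the selected feature columns plus the label (column 145).
columns_to_keep = ([5, 6] + list(range(11, 26))
                   + list(range(32, 103)) + [145])
# Drop rows with missing values and renumber the index from 0.
crime = (crime[crime.columns[columns_to_keep]]
         .dropna()
         .reset_index(drop=True))
crime.head()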
   population  householdsize  agePct12t21  agePct12t29  agePct16t24  agePct65up  numbUrban  pctUrban  medIncome  pctWWage  ...  MedOwnCostPctInc  MedOwnCostPctIncNoMtg  NumInShelters  NumStreet  PctForeignBorn  PctBornSameState  PctSameHouse85  PctSameCity85  PctSameState85  ViolentCrimesPerPop
0       11980           3.10        12.47        21.44        10.93       11.33      11980     100.0      75122     89.24  ...              21.1                   14.0             11          0           10.66             53.72           65.29          78.09           89.14                41.02
1       23123           2.82        11.01        21.30        10.48       17.18      23123     100.0      47917     78.99  ...              20.7                   12.5              0          0            8.30             77.17           71.27          90.22           96.12               127.56
2       29344           2.43        11.36        25.88        11.01       10.28      29344     100.0      35669     82.00  ...              21.7                   11.6             16          0            5.00             44.77           36.60          61.26           82.85               218.59
3       16656           2.40        12.55        25.20        12.19       17.57          0       0.0      20580     68.15  ...              20.6                   14.5              0          0            2.04             88.71           56.70          90.17           96.24               306.64
4      140494           2.45        18.09        32.89        20.04       13.26     140494     100.0      21577     75.78  ...              17.3                   11.7            327          4            1.49             64.35           42.29          70.61           85.66               442.95

5 rows × 89 columns
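
The 89-column count follows directly from the positional index list used in the loading code. As a quick sanity check that needs no data file, the sketch below rebuilds that list and confirms it yields 88 feature columns plus one label column. The names `n_columns` and `n_features` are introduced here purely for illustration.

```python
# Rebuild the positional column indices used when loading the crime data.
columns_to_keep = ([5, 6] + list(range(11, 26))
                   + list(range(32, 103)) + [145])

n_columns = len(columns_to_keep)   # 2 + 15 + 71 + 1
n_features = n_columns - 1         # the last index (145) selects the label

print(n_columns, n_features)
```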

Breast Cancer Dataset

Used for classification.

The breast cancer dataset is included with scikit-learn as a “Bunch” object. Process it as shown below to convert it to a pandas DataFrame.

The dataset contains:

  • 30 continuous features (columns 0 through 29),
  • 1 discrete label (binary, column 30, “target”), where
    • 0 indicates “malignant” and
    • 1 indicates “benign”, and
  • 569 entries.

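import pandas as pd
from sklearn.datasets import load_breast_cancer

# load_breast_cancer() returns a dict-like Bunch object.
cancer_bunch = load_breast_cancer()
# Build the feature columns, then append the label as the last column.
cancer = pd.DataFrame(cancer_bunch['data'],
                      columns=cancer_bunch['feature_names'])
cancer['target'] = cancer_bunch['target']

cancer.head()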
   mean radius  mean texture  mean perimeter  mean area  mean smoothness  mean compactness  mean concavity  mean concave points  mean symmetry  mean fractal dimension  ...  worst texture  worst perimeter  worst area  worst smoothness  worst compactness  worst concavity  worst concave points  worst symmetry  worst fractal dimension  target
0        17.99         10.38          122.80     1001.0          0.11840           0.27760          0.3001              0.14710         0.2419                 0.07871  ...          17.33           184.60      2019.0            0.1622             0.6656           0.7119                0.2654          0.4601                  0.11890       0
1        20.57         17.77          132.90     1326.0          0.08474           0.07864          0.0869              0.07017         0.1812                 0.05667  ...          23.41           158.80      1956.0            0.1238             0.1866           0.2416                0.1860          0.2750                  0.08902       0
2        19.69         21.25          130.00     1203.0          0.10960           0.15990          0.1974              0.12790         0.2069                 0.05999  ...          25.53           152.50      1709.0            0.1444             0.4245           0.4504                0.2430          0.3613                  0.08758       0
3        11.42         20.38           77.58      386.1          0.14250           0.28390          0.2414              0.10520         0.2597                 0.09744  ...          26.50            98.87       567.7            0.2098             0.8663           0.6869                0.2575          0.6638                  0.17300       0
4        20.29         14.34          135.10     1297.0          0.10030           0.13280          0.1980              0.10430         0.1809                 0.05883  ...          16.67           152.20      1575.0            0.1374             0.2050           0.4000                0.1625          0.2364                  0.07678       0

5 rows × 31 columns
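
For a classification experiment, the features and the label usually need to be separated and then split into training and test sets. The following is a minimal sketch of that step. The stratified 75/25 split and the `random_state` of 0 are assumptions chosen for illustration and do not come from the course material.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

# Rebuild the DataFrame the same way as in the loading code.
cancer_bunch = load_breast_cancer()
cancer = pd.DataFrame(cancer_bunch['data'],
                      columns=cancer_bunch['feature_names'])
cancer['target'] = cancer_bunch['target']

# Columns 0 through 29 are features; column 30 ("target") is the label.
X = cancer.drop(columns='target')
y = cancer['target']

# Illustrative split: test_size, stratify and random_state are assumptions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

print(X_train.shape, X_test.shape)
print(y.value_counts())
```

Stratifying on `y` keeps the malignant/benign ratio roughly the same in both subsets, which matters because the two classes are not equally frequent.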

Fruits Dataset

Used for classification.

The dataset was originally created by Dr. Iain Murray of the University of Edinburgh; the version used here was slightly modified and formatted by professors at the University of Michigan.

The dataset contains:

  • 4 continuous features (columns 0 through 3),
  • 1 computer-readable discrete label (column 4, 4 possible values),
  • 2 human-readable discrete labels (columns 5 and 6), and
  • 59 entries.

import pandas as pd

fruit_datafile = 'real-world-datasets/fruit_data_with_colors.txt'
# Tab-separated file; select the columns of interest in a fixed order.
fruit = (pd.read_csv(fruit_datafile,
                     sep='\t')
         [['mass', 'width', 'height', 'color_score',
           'fruit_label',
           'fruit_name', 'fruit_subtype']])
fruit.head()
   mass  width  height  color_score  fruit_label  fruit_name  fruit_subtype
0   192    8.4     7.3         0.55            1       apple   granny_smith
1   180    8.0     6.8         0.59            1       apple   granny_smith
2   176    7.4     7.2         0.60            1       apple   granny_smith
3    86    6.2     4.7         0.80            2    mandarin       mandarin
4    84    6.0     4.6         0.79            2    mandarin       mandarin
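
In the rows above, `fruit_label` appears to be a numeric code for `fruit_name`, so a lookup table between the two is handy for turning model predictions back into readable names. The sketch below builds one from the five rows shown, typed inline so that it runs without the data file. The names `fruit_sample` and `label_to_name` are hypothetical.

```python
import pandas as pd

# The five rows from fruit.head(), typed inline as a stand-in for the file.
fruit_sample = pd.DataFrame({
    'mass': [192, 180, 176, 86, 84],
    'width': [8.4, 8.0, 7.4, 6.2, 6.0],
    'height': [7.3, 6.8, 7.2, 4.7, 4.6],
    'color_score': [0.55, 0.59, 0.60, 0.80, 0.79],
    'fruit_label': [1, 1, 1, 2, 2],
    'fruit_name': ['apple', 'apple', 'apple', 'mandarin', 'mandarin'],
})

# Map each numeric label to its human-readable name.
label_to_name = dict(zip(fruit_sample['fruit_label'],
                         fruit_sample['fruit_name']))

print(label_to_name)
```

Applying the same expression to the full `fruit` DataFrame should cover all four labels rather than only the two present in this sample.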

Detailed DataFrame Info

Crime

crime.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1994 entries, 0 to 1993
Data columns (total 89 columns):
population               1994 non-null int64
householdsize            1994 non-null float64
agePct12t21              1994 non-null float64
agePct12t29              1994 non-null float64
agePct16t24              1994 non-null float64
agePct65up               1994 non-null float64
numbUrban                1994 non-null int64
pctUrban                 1994 non-null float64
medIncome                1994 non-null int64
pctWWage                 1994 non-null float64
pctWFarmSelf             1994 non-null float64
pctWInvInc               1994 non-null float64
pctWSocSec               1994 non-null float64
pctWPubAsst              1994 non-null float64
pctWRetire               1994 non-null float64
medFamInc                1994 non-null int64
perCapInc                1994 non-null int64
NumUnderPov              1994 non-null int64
PctPopUnderPov           1994 non-null float64
PctLess9thGrade          1994 non-null float64
PctNotHSGrad             1994 non-null float64
PctBSorMore              1994 non-null float64
PctUnemployed            1994 non-null float64
PctEmploy                1994 non-null float64
PctEmplManu              1994 non-null float64
PctEmplProfServ          1994 non-null float64
PctOccupManu             1994 non-null float64
PctOccupMgmtProf         1994 non-null float64
MalePctDivorce           1994 non-null float64
MalePctNevMarr           1994 non-null float64
FemalePctDiv             1994 non-null float64
TotalPctDiv              1994 non-null float64
PersPerFam               1994 non-null float64
PctFam2Par               1994 non-null float64
PctKids2Par              1994 non-null float64
PctYoungKids2Par         1994 non-null float64
PctTeen2Par              1994 non-null float64
PctWorkMomYoungKids      1994 non-null float64
PctWorkMom               1994 non-null float64
NumKidsBornNeverMar      1994 non-null int64
PctKidsBornNeverMar      1994 non-null float64
NumImmig                 1994 non-null int64
PctImmigRecent           1994 non-null float64
PctImmigRec5             1994 non-null float64
PctImmigRec8             1994 non-null float64
PctImmigRec10            1994 non-null float64
PctRecentImmig           1994 non-null float64
PctRecImmig5             1994 non-null float64
PctRecImmig8             1994 non-null float64
PctRecImmig10            1994 non-null float64
PctSpeakEnglOnly         1994 non-null float64
PctNotSpeakEnglWell      1994 non-null float64
PctLargHouseFam          1994 non-null float64
PctLargHouseOccup        1994 non-null float64
PersPerOccupHous         1994 non-null float64
PersPerOwnOccHous        1994 non-null float64
PersPerRentOccHous       1994 non-null float64
PctPersOwnOccup          1994 non-null float64
PctPersDenseHous         1994 non-null float64
PctHousLess3BR           1994 non-null float64
MedNumBR                 1994 non-null int64
HousVacant               1994 non-null int64
PctHousOccup             1994 non-null float64
PctHousOwnOcc            1994 non-null float64
PctVacantBoarded         1994 non-null float64
PctVacMore6Mos           1994 non-null float64
MedYrHousBuilt           1994 non-null int64
PctHousNoPhone           1994 non-null float64
PctWOFullPlumb           1994 non-null float64
OwnOccLowQuart           1994 non-null int64
OwnOccMedVal             1994 non-null int64
OwnOccHiQuart            1994 non-null int64
OwnOccQrange             1994 non-null int64
RentLowQ                 1994 non-null int64
RentMedian               1994 non-null int64
RentHighQ                1994 non-null int64
RentQrange               1994 non-null int64
MedRent                  1994 non-null int64
MedRentPctHousInc        1994 non-null float64
MedOwnCostPctInc         1994 non-null float64
MedOwnCostPctIncNoMtg    1994 non-null float64
NumInShelters            1994 non-null int64
NumStreet                1994 non-null int64
PctForeignBorn           1994 non-null float64
PctBornSameState         1994 non-null float64
PctSameHouse85           1994 non-null float64
PctSameCity85            1994 non-null float64
PctSameState85           1994 non-null float64
ViolentCrimesPerPop      1994 non-null float64
dtypes: float64(67), int64(22)
memory usage: 1.4 MB

Cancer

cancer.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 31 columns):
mean radius                569 non-null float64
mean texture               569 non-null float64
mean perimeter             569 non-null float64
mean area                  569 non-null float64
mean smoothness            569 non-null float64
mean compactness           569 non-null float64
mean concavity             569 non-null float64
mean concave points        569 non-null float64
mean symmetry              569 non-null float64
mean fractal dimension     569 non-null float64
radius error               569 non-null float64
texture error              569 non-null float64
perimeter error            569 non-null float64
area error                 569 non-null float64
smoothness error           569 non-null float64
compactness error          569 non-null float64
concavity error            569 non-null float64
concave points error       569 non-null float64
symmetry error             569 non-null float64
fractal dimension error    569 non-null float64
worst radius               569 non-null float64
worst texture              569 non-null float64
worst perimeter            569 non-null float64
worst area                 569 non-null float64
worst smoothness           569 non-null float64
worst compactness          569 non-null float64
worst concavity            569 non-null float64
worst concave points       569 non-null float64
worst symmetry             569 non-null float64
worst fractal dimension    569 non-null float64
target                     569 non-null int64
dtypes: float64(30), int64(1)
memory usage: 137.9 KB

Fruit

fruit.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 59 entries, 0 to 58
Data columns (total 7 columns):
mass             59 non-null int64
width            59 non-null float64
height           59 non-null float64
color_score      59 non-null float64
fruit_label      59 non-null int64
fruit_name       59 non-null object
fruit_subtype    59 non-null object
dtypes: float64(3), int64(2), object(2)
memory usage: 3.4+ KB