Basic Tableau

This note contains terminology and important basics for working with Tableau. These basics will be explored using two datasets. One of these is private company data that will not be shared in this section, and the other is publicly available data described below.

Dataset

When a company in the US wants to hire someone from outside the United States for a technical job, it is required to file an application with the US government to get a visa or green card for the foreign applicant. The purpose of these applications is to allow the US government to keep track of people entering and leaving the country for work. It also ensures that immigrants are not being taken advantage of and are not adversely affecting US workers. Companies are also required to state the amount an employee with similar skills and background is typically paid for that position, as well as how much the company plans to pay that particular applicant.

The US Department of Labor Office of Foreign Labor Certification hosts a publicly available dataset of these applications, linked here. The team at Duke University pulled and cleaned a subset of this data to be used in this project. That cleaned data will be used in this post to explore various Tableau functionality with the purpose of analyzing salaries for data-related roles.

The table consists of 167,361 applications (rows) with 26 fields (columns). The file size is 23.8 MB. The 26 fields included in the dataset are the following:

CASE_NUMBER, CASE_STATUS, CASE_RECEIVED_DATE, DECISION_DATE, EMPLOYER_NAME, PREVAILING_WAGE_SUBMITTED, PREVAILING_WAGE_SUBMITTED_UNIT, PAID_WAGE_SUBMITTED, PAID_WAGE_SUBMITTED_UNIT, JOB_TITLE, WORK_CITY, EDUCATION_LEVEL_REQUIRED, COLLEGE_MAJOR_REQUIRED, EXPERIENCE_REQUIRED_Y_N, EXPERIENCE_REQUIRED_NUM_MONTHS, COUNTRY_OF_CITIZENSHIP, PREVAILING_WAGE_SOC_CODE, PREVAILING_WAGE_SOC_TITLE, WORK_STATE, WORK_POSTAL_CODE, FULL_TIME_POSITION_Y_N, VISA_CLASS, PREVAILING_WAGE_PER_YEAR, PAID_WAGE_PER_YEAR, JOB_TITLE_SUBGROUP, order

Analysis Plan

A breakdown of the overall analysis plan for this data is listed below. Recall that the structure for these plans is a hierarchy of increasingly complex analyses organized around a SMART goal.

SMART Goal: Visualize the salary data for data analyst roles based in California for the years 2010 through 2015.
Dependent Variable: “Salary per Year.”
Questions:

  1. Do specific sub-types of data-related jobs have higher or lower salaries than others?
    IV: “Job Title Subgroup” in salary_data
  2. Do salaries change based on visa type?
    IV: “Visa Class” in salary_data
  3. What states have the highest paying data-related salaries?
    IV: “Work State” in salary_data
  4. Are there some data-related jobs whose salaries are likely to increase over time more than others?
    IV: “Case Received Date” and “Job Title Subgroup” in salary_data
  5. How do offered salaries compare to the prevailing wage?
    DV: “Paid Wage Per Year,” “Prevailing Wage Per Year” in salary_data
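
As a rough illustration of question 5, the comparison between the offered salary and the prevailing wage can be expressed as a relative premium. This is only a sketch in Python, outside of Tableau. The field names follow the dataset's column names, but the rows and wage values below are made up for illustration.

```python
# Hypothetical rows; the wage values are invented, not taken from the dataset.
applications = [
    {"JOB_TITLE_SUBGROUP": "data analyst",
     "PAID_WAGE_PER_YEAR": 70000.0, "PREVAILING_WAGE_PER_YEAR": 65000.0},
    {"JOB_TITLE_SUBGROUP": "data scientist",
     "PAID_WAGE_PER_YEAR": 120000.0, "PREVAILING_WAGE_PER_YEAR": 100000.0},
    {"JOB_TITLE_SUBGROUP": "business analyst",
     "PAID_WAGE_PER_YEAR": 90000.0, "PREVAILING_WAGE_PER_YEAR": 90000.0},
]


def wage_premium(row):
    """Return (paid - prevailing) as a fraction of the prevailing wage."""
    prevailing = row["PREVAILING_WAGE_PER_YEAR"]
    return (row["PAID_WAGE_PER_YEAR"] - prevailing) / prevailing


premiums = [round(wage_premium(row), 3) for row in applications]
print(premiums)
```

A positive value means the company plans to pay more than the prevailing wage, and zero means the two match.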

Visualizations

This first, simple visualization shows the median salary for various job title classifications, as determined by Duke University staff based upon the job title submitted with each application. This visualization addresses the question: Do specific sub-types of data-related jobs have higher or lower salaries than others?

[Image: median salary by job title subgroup]
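
For readers who want to check the numbers outside of Tableau, the aggregation behind this first chart (median salary per job title subgroup) could be sketched as follows. The rows are hypothetical and the helper name `median_by_group` is my own. Tableau performs this aggregation internally when the measure is set to Median.

```python
from collections import defaultdict
from statistics import median

# Hypothetical (job title subgroup, salary per year) pairs.
rows = [
    ("data analyst", 60000), ("data analyst", 70000), ("data analyst", 95000),
    ("data scientist", 110000), ("data scientist", 130000),
]


def median_by_group(pairs):
    """Collect salaries per group, then take the median of each group."""
    groups = defaultdict(list)
    for group, salary in pairs:
        groups[group].append(salary)
    return {group: median(values) for group, values in groups.items()}


medians = median_by_group(rows)
print(medians)
```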


This second visualization, below, tests question 2, namely: Do salaries change based on visa type? The answer appears to be no, as attorneys generally are the most highly-compensated, and teachers the least, regardless of visa class.

It is not visible in the static image below, but the number of records and sample standard deviation have been added to the tooltip in Tableau. Mousing over each data point reveals additional information including outliers and standard deviations. A more thorough analysis would take outlier effects into account; attorneys appear to have a few high-side outliers and some of the sample sizes for these groups are very small.

[Image: salary by job title subgroup and visa class]
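
The two tooltip values mentioned above, number of records and sample standard deviation, can be reproduced per group with a short Python sketch. The rows and the visa class label are hypothetical. A group with a single record has no sample standard deviation, which is one reason the very small groups deserve caution.

```python
from collections import defaultdict
from statistics import stdev

# Hypothetical (job title subgroup, visa class, salary per year) rows.
rows = [
    ("attorney", "H-1B", 140000),
    ("attorney", "H-1B", 160000),
    ("teacher", "H-1B", 50000),
]


def tooltip_stats(records):
    """Return record count and sample standard deviation per (subgroup, visa)."""
    groups = defaultdict(list)
    for subgroup, visa, salary in records:
        groups[(subgroup, visa)].append(salary)
    stats = {}
    for key, salaries in groups.items():
        # The sample standard deviation is undefined for fewer than two points.
        spread = stdev(salaries) if len(salaries) > 1 else None
        stats[key] = {"records": len(salaries), "sample_std": spread}
    return stats


stats = tooltip_stats(rows)
print(stats)
```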


This third visualization does not follow best practices, for the reasons outlined in the caption. Tableau was originally designed to only allow analysts to create visuals that followed best practices. Over time, however, as Tableau has grown in popularity, it has allowed users more freedom to design their visuals as they see fit. As a result, visuals like the one that follows can be created, but Tableau tends to push the user towards using best practices. In this case, Tableau does this by requiring that users hold the shift key while adding the second grouping, Visa Class, to the color property of the marks card, which is effectively a forced add.

[Image: salary by job title subgroup, with Visa Class added to color]

The following visualization was created with the intention of finding outlier salaries in the dataset. Under “Analysis,” “Aggregate Measures” was unchecked. Then, the mark shape was changed to a filled circle from a bar. The outliers are added to their own group, resulting in the data point coloration shown.

[Image: disaggregated salary data points, with outliers colored as their own group]
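
In the visualization above, the outliers were picked out visually and added to a group by hand. A programmatic stand-in, assuming the common 1.5 × IQR rule (my assumption, not something the course prescribes), might look like the sketch below. The case numbers and salaries are hypothetical.

```python
from statistics import quantiles

# Hypothetical (case number, salary per year) pairs; not from the dataset.
records = [
    ("C-001", 60000), ("C-002", 62000), ("C-003", 65000), ("C-004", 68000),
    ("C-005", 70000), ("C-006", 72000), ("C-007", 75000), ("C-008", 300000),
]


def find_outliers(pairs, k=1.5):
    """Return case numbers whose salary falls outside the k * IQR fences."""
    salaries = [salary for _, salary in pairs]
    q1, _, q3 = quantiles(salaries, n=4)  # quartile cut points
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [case for case, salary in pairs if salary < low or salary > high]


outlier_cases = find_outliers(records)
print(outlier_cases)
```

Returning case numbers rather than salaries mirrors the Tableau approach, where a unique identifier is what makes it possible to put individual points into a group.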

The following visualization considers question 3: What states have the highest paying data-related salaries? For this visualization, I consider the following states, which have relatively large tech industries: California, Washington, North Carolina, Colorado, Texas, New York, Massachusetts, Alabama. Maine does not have a particularly large tech presence, but I include it as a control. Note that columns appear only where the dataset has data; for example, there do not appear to be any Data Scientist visa applications for the state of Maine in the dataset.

To create a more readable chart, I filtered out the less relevant professions, including Teacher, Assistant Professor, and Attorney. The appropriate conclusion to draw from this data is that certain parts of the country pay a premium for higher-tech positions such as data scientist and software engineer, but the salaries are relatively more consistent for more entry-level positions like data analyst and business analyst.

[Image: salary by work state and job title subgroup]

work-in-progress

Tableau Notes

  • Tableau automatically infers the data types of columns in an Excel sheet, but the inference is sometimes incorrect
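  • “Measures” - numeric fields that Tableau aggregates. They are treated as continuous by default, and continuous fields are indicated by green pills.
  • “Dimensions” - categorical fields used to group the data. They are treated as discrete by default, and discrete fields are indicated by blue pills.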
  • “Shelf” - row and column names are placed here
  • “Pills” - labels for data types that go onto shelves or onto the marks card
  • “Number of Records” - an automatically-generated measure that indicates the number of rows that correspond to a particular category
  • “Tooltip” - information that appears when hovering over a data point
  • “Marks Card” - two functions: 1. Defines anything that isn’t defined in the rows or the columns. 2. Used to define what variable corresponds to area in area charts, such as pie charts or bubble charts.
  • “Detail Property” - part of the marks card where details can be added to the tooltip
  • Outliers - Extreme values in a dataset
  • “Filter” - an easy-to-use means of removing outliers
  • Tableau can only “Group” items based upon parameters that are in the data model, so it is helpful to include a unique identifier for each data point (case number, in the images below). Grouped variables appear under Dimensions and can be used like any other variable, including as part of a Filter. Groups are a means of performing more fine-grained filtering than is possible by filtering on a single variable.
  • Groups can also be used in situations where the underlying data is messy. For example, a “State” field that includes entries such as “CA” and “CALIFORNIA” could appropriately be grouped as “California.”
  • Dates in Tableau can be treated as both discrete and continuous variables.
  • Tableau automatically puts dates into “hierarchies,” which have the plus signs next to them in the pill. The standard hierarchy is Year, Quarter, Month, and sometimes week and day.
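
The note above about using Groups to tidy messy values (“CA” and “CALIFORNIA” both becoming “California”) amounts to a lookup table from raw entries to a group label. A minimal Python sketch of the same idea follows. The mapping and function name are hypothetical, and in Tableau this is done through the Group dialog rather than in code.

```python
# Hypothetical mapping from messy raw entries to a single group label.
STATE_GROUPS = {
    "CA": "California",
    "CALIFORNIA": "California",
}


def group_state(raw):
    """Map a raw state entry to its group; ungrouped values pass through."""
    cleaned = raw.strip()
    return STATE_GROUPS.get(cleaned.upper(), cleaned)


print([group_state(value) for value in ["CA", "California", " ca ", "WA"]])
```

Values that are not in the mapping are left as they are, much like ungrouped members in a Tableau group.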

Statistics & Visualization Recommendations

Median versus Average

  • Median should be preferred over average if the distribution of the data is unknown
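
A small made-up example shows why: a single extreme salary pulls the average far away from what a typical applicant earns, while the median barely moves. The numbers below are hypothetical.

```python
from statistics import mean, median

# Hypothetical salaries with one extreme high-side value.
salaries = [60000, 65000, 70000, 75000, 500000]

average_salary = mean(salaries)
median_salary = median(salaries)
print(average_salary, median_salary)
```

Here the average lands above four of the five salaries, while the median stays in the middle of the typical values.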

Standard Deviation Types

                                                                 Use Population Standard Deviation   Use Sample Standard Deviation
Does population represent the general population well?           Yes                                 No
Do you care whether it represents the general population well?   No!                                 Yes!
  • Sample Standard Deviation is often referred to as simply the “Standard Deviation”
  • Use the Population standard deviation if your purpose is to describe your data set only, and you don’t care how your data relates to other datasets
  • Use the Sample standard deviation if your purpose is to interpret the data as if it does represent the general population
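
The two differ only in the divisor: the population version divides the summed squared deviations by n, and the sample version divides by n - 1, so the sample value is always slightly larger. A quick sketch using Python's standard library, with made-up numbers:

```python
from statistics import pstdev, stdev

# Hypothetical values; mean is 5 and the squared deviations sum to 32.
values = [2, 4, 4, 4, 5, 5, 7, 9]

population_sd = pstdev(values)  # divides by n = 8
sample_sd = stdev(values)       # divides by n - 1 = 7
print(population_sd, sample_sd)
```

The gap between the two shrinks as the number of records grows, so the choice matters most for small groups.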

Bar Charts versus Line Charts

Data analysts need to be able to determine the appropriate visual for different situations. One tradeoff that must be carefully considered is bar charts versus line charts. Line charts are best in two situations:

  1. Examining how things change over time. Our natural tendency is to think of time as a line or arrow, so visualizations that are consistent with that idea, like line graphs, are easier for our brains to process.
  2. Displaying how well two continuous variables relate to one another. When we want to see how two variables relate, the more data we see the better.

Dates as Dimensions or Continuous Variables

Tableau can treat dates as measures (continuous variables) or dimensions (discrete variables).

                Measures (continuous)                      Dimensions (discrete)
                [Values on Continuous Line]                [Date Parts]
Advantages      Best format for regression                 Easy analysis and labeling of date parts
                Years are connected on line graphs         Can see details in a box and whisker graph
Disadvantages   Can’t see all details in a single graph    Years are not connected on line graphs
                                                           Some unintuitive effects when computing best fit lines
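
To make the distinction concrete, the same date can be represented either as a set of discrete date parts or as a single position on a continuous number line. The Python sketch below is only an analogy for the two treatments, and the function names are my own. Tableau handles this through the pill's discrete or continuous setting.

```python
from datetime import date


def date_parts(d):
    """Discrete view: break a date into categorical parts."""
    return {"year": d.year, "quarter": (d.month - 1) // 3 + 1, "month": d.month}


def continuous_value(d):
    """Continuous view: a single number that increases by 1 per day."""
    return d.toordinal()


received = date(2014, 8, 15)  # hypothetical case received date
print(date_parts(received))
print(continuous_value(received))
```

The continuous form is the one that can serve as the x-value in a regression, while the discrete parts are what allow grouping by, say, quarter across several years.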

Work-in-progress

This content is taken from my notes on the Coursera course “Data Visualization and Communication with Tableau.” It is part of the “Excel to MySQL: Analytic Techniques for Business” specialization.

The specialization is sponsored by Duke University and this particular course is presented by Professor Jana Schaich Borg.