Estimating White-Collar Workers Using Census Data

Photo by Israel Andrade on Unsplash

Are you curious about the number of white-collar workers in your area? Well, I recently embarked on a journey to find white-collar worker categories from the Census Bureau, and let me tell you, it was quite the adventure! In this blog post, I’ll take you through my process of estimating white-collar workers using the American Community Survey, including the key variables you’ll need.

Not sure what the American Community Survey is? No problem! You can check out this handy FAQ on our website here: What is the American Community Survey?

Does the ACS Estimate White Collar Workers?

Not exactly. My search began on the official Census Bureau website, census.gov. The Census Bureau’s American Community Survey collects data on the industry and occupation of workers in the labor force. However, they do not include a specific table or variable to identify white-collar workers. It seemed like my quest for white-collar worker categories had hit a roadblock right out of the gate.

Identifying Key Variables

While I couldn’t find exactly what I needed on the Census website, I did explore the alternative avenue of the American Community Survey’s Users Group, the perfect place to connect with fellow data enthusiasts who might have the answers I was looking for.
Here I found a promising reply listing table C24010 and the variables that could be used to estimate a “working class,” which in turn could help me identify the variables needed for a “white collar” estimate.

After I downloaded the full 2005 documentation for table C24010 to review the actual variable descriptions, it turned out that many of the variables did not align with what the reply described. So my search for white-collar categories wasn’t over yet.

The Answer

Moving on, I instead looked through the most recent 2021 documentation. Now (using the most generous interpretation of what a white-collar job is), I decided to use these variables to estimate white-collar workers:

  • Management, business, science, and arts occupations
  • Sales and office occupations

If you wanted to estimate blue-collar workers, you could then use the variables for:

  • Service occupations
  • Natural resources, construction, and maintenance occupations
  • Production, transportation, and material moving occupations

Using these categories, you can now estimate “white-collar” workers for your geography of choice. (Remember to sum both the male and female variables in the ACS table to get the total.)

As an example, let’s look at Williamson County, TX. Williamson County had about 222,454 white-collar workers in 2021, making up about 72% of the employed population. Below, you can check out the highlighted variables used to get this total:
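If you’d rather pull these numbers programmatically, here’s a minimal sketch using the Census Bureau’s public API. The C24010 variable codes below are my reading of the table shell, not something confirmed in this post – verify them against the official documentation before relying on them.

```python
# Minimal sketch: estimate white-collar workers from ACS table C24010.
# The variable codes are my reading of the table shell -- verify them
# against the official documentation before using. Light use of this
# API generally works without an API key.
import requests

URL = "https://api.census.gov/data/2021/acs/acs5"
WHITE_COLLAR_VARS = [
    "C24010_003E",  # Male: Management, business, science, and arts occupations
    "C24010_027E",  # Male: Sales and office occupations
    "C24010_039E",  # Female: Management, business, science, and arts occupations
    "C24010_063E",  # Female: Sales and office occupations
]

params = {
    "get": "NAME,C24010_001E," + ",".join(WHITE_COLLAR_VARS),
    "for": "county:491",  # Williamson County
    "in": "state:48",     # Texas
}
header, row = requests.get(URL, params=params).json()
data = dict(zip(header, row))

white_collar = sum(int(data[v]) for v in WHITE_COLLAR_VARS)
total_employed = int(data["C24010_001E"])  # civilian employed population 16+
print(f"{data['NAME']}: {white_collar:,} white-collar workers "
      f"({white_collar / total_employed:.0%} of the employed population)")
```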

Where do these occupation categories come from?

For the occupation data, the Census Bureau uses the Standard Occupational Classification (SOC).

“The SOC is the federal government’s own regularly-updated system for classifying occupations, which are grouped according to the nature of the work performed. This system provides a mechanism for cross-referencing and aggregating occupation-related data collected by social and economic statistical reporting programs.”


Want to learn more about Census demographics, occupation data or anything else data-related?
We’re here to help. You can fill out the Custom Data Request form, or call us at 1-800-939-2130.

Using Code Interpreter to Analyze US Census Data: The Good, the Impressive & the Ugly

Photo by Headway on Unsplash.


Let’s kick the tires of ChatGPT’s Code Interpreter using the US Census Bureau’s latest American Community Survey data. I’ll share my favorite prompt, what impressed me most, and what Code Interpreter got flat wrong.

tl;dr

  • The Good: Code Interpreter can open data files and make pretty darn good guesses about what’s inside.
  • The Impressive: It can also produce simple weighted scoring models and adjust the weights.
  • The Ugly: But sometimes, it produces obviously wrong calculations.

My favorite prompt:


What’s Code Interpreter?

Code Interpreter is a (terribly named) beta feature of ChatGPT that lets you load data files and analyze the data.

If you want to follow along with me, you need a $20-a-month ChatGPT account. Then you need to turn on Code Interpreter under your Account, then Settings, then Beta.

Once Code Interpreter is on, you can upload data files using the + button.

The Good – Code Interpreter makes good guesses of what’s in a file.

I accidentally uploaded the entire zip file for our DemographicsByCitiesForTexas dataset, which has both a data file and a notes-and-citations file. Code Interpreter effortlessly unzipped the file and identified the data file versus the citations-and-notes file. It also cut off the human-readable headers and started working with the machine-readable headers – without me having to tell it to.

Furthermore, Code Interpreter successfully described what key columns were included in the file.  

That said, it’s not all sparkles and unicorns. In the above example, Code Interpreter says that hhi_total is the total number of households. And this is correct. But when I was working with a different dataset, Code Interpreter said that hhi_total was the total household income – which is incorrect.

Lessons Learned

  1. You can load data files that you aren’t familiar with into Code Interpreter and see if it can make heads or tails of them.
  2. I may need to update the database headers in Cubit’s files to make it easier for AI tools to “understand” the fields.
  3. Don’t assume that Code Interpreter will always “understand” the data fields even if it correctly “understood” the fields in a previous analysis.

Identifying the Highest Income Cities in Texas

Now let’s dig in! Can Code Interpreter figure out the highest-income cities in Texas using the most recent American Community Survey data? Yes, it produced a top-ten list of cities based on the correct median household income column in the file. It even called out that the median income doesn’t go higher than $250,001.
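Under the hood, this is roughly a one-liner in pandas. Here’s a minimal sketch of the equivalent code – the file and column names are my guesses, not the file’s actual headers:

```python
# Minimal sketch of the top-ten sort Code Interpreter performed
# (file and column names are assumed, not confirmed).
import pandas as pd

df = pd.read_csv("DemographicsByCitiesForTexas.csv")
top10 = df.sort_values("median_household_income", ascending=False).head(10)
print(top10[["city", "median_household_income"]])
```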

But I’m not impressed yet as I can do the same thing with a simple sort in Excel. So now I want to see something that I can’t do out of the box in Excel, and that’s build a map of these high-income cities so I can see where they are clustered in Texas.

Visualizing the High-Income Cities on a Map

But Code Interpreter can’t build maps directly.

It did, however, suggest some tools to help visualize this data, such as Python libraries like Folium – which doesn’t help me directly, as I don’t know Python. Also, Code Interpreter clarifies that it needs coordinates for map building.
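For reference, here’s a minimal sketch of the kind of Folium code it suggests you run elsewhere – assuming hypothetical latitude/longitude columns, which my file doesn’t actually have:

```python
# Minimal sketch of a Folium map of the top-ten cities
# (the file and lat/lon columns are hypothetical).
import folium
import pandas as pd

df = pd.read_csv("top10_cities_with_coords.csv")  # hypothetical file
m = folium.Map(location=[31.0, -99.0], zoom_start=6)  # roughly centered on Texas
for _, row in df.iterrows():
    folium.CircleMarker(
        location=[row["lat"], row["lon"]],
        radius=6,
        popup=f'{row["city"]}: ${row["median_household_income"]:,.0f}',
    ).add_to(m)
m.save("high_income_cities.html")  # open in a browser to view
```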

Lessons Learned

  1. Code Interpreter can’t produce maps – bummer! But it can write code for other technologies to produce maps.
  2. I need to think if we should add latitude/longitude data to our data files.

Locating the Top 10 Cities in Texas

So I still want to know where these high-income cities are in Texas. Can Code Interpreter help me do this without a map?

Code Interpreter uses its own data to locate each city and ignores the county data in the file that I provided. But this is only problematic for “Redfield CDP,” as Code Interpreter doesn’t have data for this geography whereas the file that I provided does.

Could a different prompt give us what we need? Maybe.

I asked Code Interpreter to provide a graph of the counts of cities with the max median income by county, and it provided a description of the graph and what data was considered. Tada! OK, I now roughly know where these high-income cities in Texas are located.
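If you wanted to reproduce a similar graph yourself, a minimal sketch might look like this – the column names are assumptions about the file’s headers, and the $250,001 cutoff is the top-coded max mentioned earlier:

```python
# Minimal sketch: count top-coded-income cities by county (columns assumed).
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("DemographicsByCitiesForTexas.csv")
maxed_out = df[df["median_household_income"] >= 250001]  # ACS top-code
maxed_out["county"].value_counts().plot(kind="bar")
plt.ylabel("Cities at the top-coded median income")
plt.tight_layout()
plt.show()
```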

Show Me Something I Don’t Know.

I’m done exploring high-income cities in Texas, and I’m ready to be impressed. And what could be more impressive than Code Interpreter figuring out something about this dataset that I don’t already know? Here’s the prompt I used.

But the results were not as impressive as I hoped: a distribution of median household income across the Texas cities, the top 10 counties by total population (even though the total populations in the file are only for cities?), and the distribution of population densities across the cities. Honestly, I’m underwhelmed.

I’m going to skip a bunch of stuff that didn’t work to get you straight into the good stuff.

The Impressive: Weighted Scoring Model

Sometimes, I need to identify geographies that have large populations AND large income AND {insert other variable here}. Let’s see if Code Interpreter can do this.

And it completely fails. I tried a bunch of different prompts and they all failed.

But…

I was explaining what I was trying to do to Sara of FromThePage, and she asked me how I’d solve this problem without Code Interpreter. I told her that I’d build a simple model and apply weights. And she brilliantly asked, “I wonder what Code Interpreter would do if you told it that?” Good point! So I did, but this time using our Texas county dataset.

And that’s just what I wanted – a simple weighted model. But I don’t want Harris County to ALWAYS be at the top with its outlier population of 4 million people. So let’s see if Code Interpreter will tweak the weights.
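To give you the flavor without the full chat transcript, here’s a minimal sketch of the kind of weighted model it wrote – the file and column names are my stand-ins, not Cubit’s actual headers – with weights you can tweak to dampen the Harris County outlier:

```python
# Minimal sketch of a simple weighted scoring model (columns assumed).
import pandas as pd

df = pd.read_csv("TexasCounties.csv")  # hypothetical file name

def weighted_score(df, weights):
    """Min-max normalize each column, then combine with the given weights."""
    score = pd.Series(0.0, index=df.index)
    for col, weight in weights.items():
        normalized = (df[col] - df[col].min()) / (df[col].max() - df[col].min())
        score += weight * normalized
    return score

# Equal weights let Harris County's ~4 million population dominate;
# shifting weight toward income (or log-scaling population) dampens that.
df["score"] = weighted_score(df, {
    "population_total": 0.3,         # assumed column name
    "median_household_income": 0.7,  # assumed column name
})
print(df.sort_values("score", ascending=False).head(10)[["county", "score"]])
```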

This simple weighted model was the most interesting thing that I got Code Interpreter to do. I’ve been playing around with projections and change over time data, and I’m hopeful that I’ll get something even more impressive soon.

Lesson Learned

  1. Code Interpreter can’t solve data problems for you beyond simple sorts and graphs. To get it to do something impressive, you must already know the solution to your problem AND figure out exactly how to tell it to produce what you want. Alternatively, I may just need more practice at prompt writing.

The Ugly: Obvious Calculation Errors

I was on the phone with a client who wanted to identify zips where many Hispanic residents live. And since I had already loaded demographics for Texas cities into Code Interpreter, I thought I’d see how well it would do.

First off, Code Interpreter had problems locating a “hispanic” column in the dataset even though there’s a clearly named column: “race_and_ethnicity_hispanic”. It thinks it fixes the problem but ends up using the wrong universe, which results in Hispanic percentages over 100% – which is impossible.

So this is dumb, but to be fair, Code Interpreter points out the error.

I tried to get Code Interpreter to fix the problem on its own, but it couldn’t.

When I pointed Code Interpreter to the right columns, it corrected the calculation. But if I’m going to have to spell out columns, then I’ll probably just stick with a database or Tableau or {insert other data tool that I know better}.
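For the record, the corrected calculation is a single line of pandas. A minimal sketch – the denominator (universe) column name is my assumption about the file:

```python
# Minimal sketch of the corrected percentage calculation;
# "population_total" as the universe column is an assumption.
import pandas as pd

df = pd.read_csv("DemographicsByCitiesForTexas.csv")
df["hispanic_pct"] = df["race_and_ethnicity_hispanic"] / df["population_total"] * 100
# Sanity check: a percentage over 100 means the wrong universe was used.
assert df["hispanic_pct"].max() <= 100
```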

Lessons Learned

  1. Double-check all Code Interpreter calculations.
  2. When you start getting results that are obviously wrong, reload the file and start over rather than trying to get Code Interpreter to find and fix the error.

And One Bonus Lesson Learned that Doesn’t Fit Anywhere Else

  1. You could use Code Interpreter like a flow in Tableau Prep. You drop in standardized data, run a series of prompts, and get a standardized output in text or data visualizations.

Conclusion

I’ve never incorporated a tool into my daily workflow as quickly as I have ChatGPT. Every day, I use it to do something a little different – writing email subject lines, rewriting this wordy blog post, or producing Google Sheets formulas that I just copy, paste, and (mostly) watch work.

As you can see from the above post, I’m still a novice in terms of using Code Interpreter to analyze Census data. In fact, my favorite use cases for Code Interpreter aren’t when I’ve asked it to analyze Census data, but when I’ve asked it to analyze data for my business, Cubit.

For example, I wanted to know which days of the week were most popular for purchases of one of our products. I was able to load product data into Code Interpreter, and it spit out the graph slightly faster than I could have built the same thing in Excel. And I didn’t have to spend my time fixing date-format issues – Code Interpreter did this for me.

Also, I wanted to know what hours of the day I receive the most phone calls. Code Interpreter was able to clean up different time formats and produce the following graph – again slightly faster than I could have done it AND saving me the brainpower of fixing data-format issues.
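Both of these analyses boil down to the same pandas pattern: parse messy timestamps, then count by day or hour. A minimal sketch, with hypothetical file and column names:

```python
# Minimal sketch of the day-of-week and hour-of-day counts
# (file and column names are hypothetical).
import pandas as pd
import matplotlib.pyplot as plt

orders = pd.read_csv("orders.csv")  # hypothetical file
orders["purchased_at"] = pd.to_datetime(orders["purchased_at"], errors="coerce")
orders["purchased_at"].dt.day_name().value_counts().plot(kind="bar")
plt.title("Purchases by day of week")
plt.show()

calls = pd.read_csv("calls.csv")  # hypothetical file
calls["received_at"] = pd.to_datetime(calls["received_at"], errors="coerce")
calls["received_at"].dt.hour.value_counts().sort_index().plot(kind="bar")
plt.title("Calls by hour of day")
plt.show()
```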

So my final lessons learned are:

  1. Code Interpreter is fun to use with internal business data as it makes simple graphs that I can use to answer simple questions.
  2. I need to keep using Code Interpreter daily with Census data or internal data to improve my prompt writing and learn what it can and can’t do.

Wow! You’ve read to the end. Color me impressed. You, my friend, are EXACTLY the type of person that I want to hear from, and here’s where you can send me a message.  

Population Growth by State 2020

On Monday, April 26, 2021, the Census Bureau released the Census 2020 population by state data, also known as apportionment data. These counts are used to divide up the seats in the U.S. House of Representatives among the 50 states. We can use this first Census 2020 data release to calculate population growth by state for 2020.  

My partner, Anthony, built the data viz below so you can see how your state(s) of interest grew. The idea behind this visualization is that you can tell at a glance that “this state is growing [faster than | about the same as | slower than] the US or other states as well as itself.”
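If you want to reproduce the underlying math yourself, the percent-change calculation is straightforward. A minimal sketch, assuming a hypothetical CSV with one row per state and pop_2010 and pop_2020 columns:

```python
# Minimal sketch of the percent-change math behind the viz
# (file and column names are hypothetical).
import pandas as pd

df = pd.read_csv("state_populations.csv")
df["pct_change_2010_2020"] = (df["pop_2020"] - df["pop_2010"]) / df["pop_2010"] * 100
fastest = df.sort_values("pct_change_2010_2020", ascending=False)
print(fastest[["state", "pct_change_2010_2020"]].head(10))
```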

Population Growth by State 2020 Data Visualization

State and 2020 resident population (the interactive visualization also shows percent population change for 1990–2000, 2000–2010, and 2010–2020):
United States 331,449,281
Alabama 5,024,279
Alaska 733,391
Arizona 7,151,502
Arkansas 3,011,524
California 39,538,223
Colorado 5,773,714
Connecticut 3,605,944
Delaware 989,948
District of Columbia 689,545
Florida 21,538,187
Georgia 10,711,908
Hawaii 1,455,271
Idaho 1,839,106
Illinois 12,812,508
Indiana 6,785,528
Iowa 3,190,369
Kansas 2,937,880
Kentucky 4,505,836
Louisiana 4,657,757
Maine 1,362,359
Maryland 6,177,224
Massachusetts 7,029,917
Michigan 10,077,331
Minnesota 5,706,494
Mississippi 2,961,279
Missouri 6,154,913
Montana 1,084,225
Nebraska 1,961,504
Nevada 3,104,614
New Hampshire 1,377,529
New Jersey 9,288,994
New Mexico 2,117,522
New York 20,201,249
North Carolina 10,439,388
North Dakota 779,094
Ohio 11,799,448
Oklahoma 3,959,353
Oregon 4,237,256
Pennsylvania 13,002,700
Puerto Rico 3,285,874
Rhode Island 1,097,379
South Carolina 5,118,425
South Dakota 886,667
Tennessee 6,910,840
Texas 29,145,505
Utah 3,271,616
Vermont 643,077
Virginia 8,631,393
Washington 7,705,281
West Virginia 1,793,716
Wisconsin 5,893,718
Wyoming 576,851
Sources for the above visualization
  • 1990 All Geographies Except Puerto Rico – https://www.census.gov/data/tables/1990/dec/1990-apportionment-data.html
  • 1990 Puerto Rico – https://www2.census.gov/programs-surveys/popest/tables/1990-2000/municipios/totals/pr-99-1.txt
  • 2000 – https://www.census.gov/data/tables/2000/dec/2000-apportionment-data.html
  • 2010 – https://www.census.gov/data/tables/2010/dec/2010-apportionment-data.html
  • 2020 – https://www.census.gov/data/tables/2020/dec/2020-apportionment-data.html

You can get the same visualization above when you purchase Radius Reports for counties. Using the population projection data in the report, you can say “this county, where I’m interested in opening my new business, is growing [faster than | about the same as | slower than] the state.”

Where’s the rest of the Census 2020 data?

It’s coming. I still haven’t heard a release date for the Demographic and Housing Characteristics File (aka the good stuff that we all want — data for small geographies). I will be updating the 2020 Census Data Release Update blog post as I hear more.

Other Highlights from the Census Bureau’s 2020 Apportionment Data Release

Yesterday, the Census Bureau released apportionment data, which includes the total U.S. population. As of April 1, 2020, we were 331,449,281 people strong.

Census 2020 Residential Population

As a country, our population is increasing, but the growth rate slowed a bit over the past 10 years. As you can see below, the percent change dropped from 9.7% in 2000 – 2010 to 7.4% in 2010 – 2020. In fact, the Census Bureau staff mentioned that this is one of the slowest population growth periods we’ve had in our nation’s comparatively short history.

Census 2020 Percent Change

The South and West regions are growing faster than the other regions. Must be all of that sunshine and warmth!

US Census Regions Population Change 2010 to 2020

California and Wyoming are not that different if you are looking at land area but look at the huge difference in the population.

States with the Largest and Smallest Population in 2020

Most states grew in population during the 2010 – 2020 time period, with Utah being the fastest-growing state. Only 3 states had a decrease in population, with West Virginia declining the most.

States with the Largest and Smallest Population Increase and Decrease in 2020

The above images are from the US Census Bureau’s Apportionment News Conference.

2020 Census Data Release Update

Photo by Enayet Raheem on Unsplash.

Post updated February 2022.

When can you get your hands on 2020 Census data?

It’s complicated. Here’s a table with important dates.

Date | Data Type | Description
February 2021 | Geography data | The boundaries (think outlines)
April 30, 2021 | Apportionment count | Population by state
Aug 16 & Sep 30, 2021 | Redistricting count | Limited demographics by state. Aug = legacy format; Sept = “easy to use” format.
May & Aug 2023, TBD (details) | Demographic & Housing Characteristics File AND Detailed File | **The GOOD stuff** Staggered release by state for all demographics for small geographies.
March 17, 2022 (details) | American Community Survey 5-year estimates | Not the 2020 Census, but has important data that’s not in the 2020 Census, like income.
2020 Census Data Release Table

The Long Answer (that only my mom will read)

February 2021 – Spatial data

The US Census Bureau has already released the 2020 geographies, and we’re still waiting on the demographics. It’s like getting the tender, flaky cannoli shells first and then having to wait for the creamy ricotta filling.

source: flickr
2020 Census geographies are like cannoli shells in that they are not nearly as exciting without the demographic data filling!

April 30, 2021 – Apportionment count

We’ve received state-level resident population (+ overseas federal employees) in this data release, which is used for determining seats in Congress. Here’s a new blog post all about the apportionment data called Population Growth by State 2020.

August 16 & September 30, 2021 – Redistricting data

This data release will be the first file that includes demographic and housing characteristics. The good news is that the redistricting dataset will be available for small geographies like Census blocks. The bad news is that this dataset will only include:

  • Population
  • Voting age
  • Race & ethnicity
  • Housing units & occupancy status
  • Group quarters << don’t worry if you don’t know what this is
source: census.gov

Update: August 2021

Here’s how you get your hands on the 2020 Redistricting data.

If you’re skilled in working with large datasets…

  • You can access the data directly from the Census Bureau here. It’s in a legacy format (aka hard to work with), but an easier-to-work-with version is coming in September.
     

If you need State by State visualizations…

  • You can explore the data for your state here.

If you need 2020 Redistricting data for cities and counties and don’t want to DIY…

  • You can purchase a demographics report that includes the new 2020 data. Details.
     

If you need 2020 Redistricting data for radiuses around a location…

  • Your radius reports now include a tab in the Excel file with the 2020 Redistricting data so you can compare it to the 2019 American Community Survey data. Details.
     

If you need data for zips/ZCTAs…

  • There is no 2020 Census redistricting data for zips/ZCTAs yet. It will be released later, but the Census Bureau hasn’t said when. We could use the small Census geographies to create 2020 estimates for zips/ZCTAs if we hear from enough of our clients that this would be helpful. Or we could just wait until the Census Bureau releases the official data for zips/ZCTAs. Contact me if you have a preference.
     

If you need 2020 Redistricting data and don’t fit into any of the above categories…

  • You can get 2020 Redistricting data as a custom data pull. Let me know what you need here and I’ll provide you with a quote and turnaround time. I do love weird data requests, so send me something good!

May 2023 – Demographic Profile

This new dataset will include demographic & housing data for cities only (technically: places/minor civil divisions — but will it be all cities or just big cities?) and is supposed to be released “as soon after the release of the Redistricting product as possible.”

May 2023 – Demographic and Housing Characteristics File (DHC) & Detailed Demographic Housing Characteristics File

This is the dataset that you and I and everyone who isn’t doing redistricting really wants — the luscious filling for our cannoli – with all of the available 2020 Census demographics for large & small geographies.

source: wikimedia.org
Yes! I want the geography data cannoli shells stuffed full of the demographic data filling.

Rumor has it that this dataset will be released on a state-by-state basis and won’t be fully available until December 2021. The Detailed Demographic and Housing Characteristics File will be released in 3 separate data products:

  • Detailed DHC-A – planned release August 2023
  • Detailed DHC-B – release TBD
  • Supplemental-DHC (S-DHC) – release TBD

Data nerd aside: The DHC will include many of the demographic and housing tables previously included in the Summary Files. DHC subjects include:

  • Population:
    • Age
    • Sex
    • Race & ethnicity
    • Household & family type & relationship
  • Housing Units
    • Occupancy status (occupied vs vacant)
    • Tenure (owner vs renter)

But don’t forget that we’ll still have to use the American Community Survey for important data like income.

source: census.gov

March 17, 2022 – American Community Survey 2020 5 Year Estimates

The American Community Survey (ACS) is the ongoing, annual survey of 3.5 million addresses that collects the social, economic, housing, and demographic characteristics of the nation’s population. The US Census Bureau will use ACS surveys collected in 2016, 2017, 2018, 2019, and 2020 to produce demographic estimates for small geographies like zip codes/ZCTAs for 2020. Historically, the ACS is released in December.

Update March 2021. The Census Bureau has communicated that the 2020 American Community Survey (ACS) will use the 2010 ZCTA boundaries rather than the 2020 ZCTA boundaries. This means that whenever the 2020 Census demographics are released for zips/ZCTAs, there won’t be matching 2020 income data for those same zips/ZCTAs. The Census Bureau is planning on using the updated 2020 ZCTA boundaries in the 2021 ACS release.

Update February 2022. The Census Bureau has communicated that the 2020 American Community Survey will be released on March 17, 2022 (details).

But I’m curious to learn if the 2020 ACS (to be released in 2021) or the 2021 ACS (to be released in 2022) will use the same geographies as the 2020 Census. I asked the Census Bureau this question and their reply is below:

“The 2020 ACS Data Release schedule will be posted within the next week or two. We are planning on updating the Geography Boundaries by Year page at the same time, which will tell you which boundaries will be used for each level of geography in the 2020 ACS data products. This geography boundaries by year page is usually posted at the same time as the data release in September, but we are posting it early this year because we have gotten a lot of questions about which boundaries will be used due to the 2020 Census.”

Below are some lovely graphics explaining how the ACS and the Decennial Census fit together and are different.

Now if you’ll excuse me, I’m hungry for cannoli – I can’t imagine why. Send me an email if you have any more burning questions about the 2020 Census, and I’ll reply after my cannoli run.

How to make money on Google Ads after all your campaigns start losing money

Summary: You can use Google Ads data to estimate demand for your business — which is cool because it’s harder to get data about demand than it is to get data about who lives where.

One of the most popular demographic data pulls that we do each day is a radius report which provides demographics for a radius (or a ring) around a location. The US Census Bureau doesn’t provide radius reports, so our clients who need them – small business owners, real estate folks and health care companies – purchase them from us as part of their marketing research for opening new locations or exploring real estate development projects.

Starting in 2014, I made $200-ish a month in profit selling radius reports via Google Ads. Although $200 is nothing to brag about, it was a nice source of new clients who often bought future reports – for minimal marketing effort on my part.

Unfortunately, all easy things in business come to an end. (Or is this just for my business?)

Since February 2019, I’ve consistently lost money each month on my Google Ad campaigns except for 1 month when I turned them off in disgust. I’ve run experiments to improve my ads, testing different settings & geographic restrictions and even hired an expert — who lost way more money than I did (ha!). So it’s time to try something new.

Google Ads has DEMAND DATA

The good news is that Google Ads has something more valuable to me than $200 a month. It has data on the exact search terms used by visitors who then went on to buy a radius report from me (conversions). I call this demand data.

So based on my analysis of this demand data, my partner, Anthony, built the website Demographics By Radius for 5-mile radiuses around all US city centers. We provide these demographics for free, hoping to attract customers who need demographics for a location other than the city center.

While he wasn’t the first, Adam Grant said “The more I help out, the more successful I become” in his book Give and Take: A Revolutionary Approach to Success. Hopefully, the free data on Demographics By Radius is a valuable tool, and if you find any bugs (gulp!) or have suggestions on how we could make it even more valuable, I’d love to know.

Well, that’s great for you, Kristen, but how do I use Google Ads data when I don’t have 6 years of demand data ready to analyze or a developer partner to build me a website?

I’m so glad you asked. Let’s pretend that you want to open a new Pilates studio in Austin, Texas. First, you build a custom map to identify areas with lots of wealthy women – just keeping things simple. By filtering on high median incomes ($84,000+) and population density (522+) by zip code, you identify 3 initial areas of interest.

  1. Cedar Park
  2. West Lake
  3. Circle C

You overlay other Pilates businesses (blue dots) over your wealthy women areas. And you decide to exclude the wealthiest area of West Lake, because there are already so many competitors.

Next up, you need to decide between Circle C and Cedar Park. And before you dive into cost data like price per square foot for commercial real estate, it’s a good idea to explore the demand for Pilates in Circle C versus Cedar Park.

Here’s where Google Ads comes in. Open Google Ads and open Tools / Keyword Planner / Get search volumes and forecasts.

Then you’d enter your search terms like Pilates Cedar Park (90 monthly searches) versus Pilates Circle C (no data – there are too few searches). Click on Historical Metrics to see the following data:

Be curious here. You might compare pilates Austin (720 searches) with yoga Austin (1,600 searches) – more than double the interest in yoga versus pilates in Austin. You might also check other Texas cities like pilates corpus christi (70 searches) or pilates san marcos (40 searches) that might not have as many competitors.

Probably because my personal business is so driven by Google searches, I would lean on this data to help me pick a name for my studio. For example, you would want to use My Brand Name – Pilates South Austin Studio versus My Brand Name – Pilates Circle C Studio, because more people search for “pilates south austin.”

That said, I wouldn’t hang up a shingle using only Google Ad data for demand estimation. Ideally, you’d use this data to give yourself permission to run a small Google Ad campaign or similar in both Circle C and Cedar Park. Maybe you’d set up a Coming soon to Circle C | Cedar Park landing page and ask for email opt-in. You could also give away free online classes in exchange for a telephone survey. Better yet, you’d go talk to active people in parks in Circle C and Cedar Park about their interest in Pilates. Maybe start a class in the park and see how many people stop by to chat. These are just a few low-cost examples of additional experiments that you can run to estimate demand for your new location.

Still with me? Wow! You’re my type of data-driven business owner for actually finishing a long-ish data blog post. Get more tips by signing up in the Monthly Email Newsletter section. I normally write about what’s available in different datasets rather than deeper dives into how to use one dataset. Was this deeper dive too far into the weeds, and would you prefer the normal “what’s out there” data overviews? Or did you find this how-to a helpful walk-through of pulling specific data to solve a specific problem? Let me know what you think.