Stata for Researchers: Working with Groups

Stata for Researchers: Working with Groups

This is part six of the Stata for Researchers series. For a list of topics covered by this series, see theIntroduction. If you're new to Stata we highly recommend reading the articles in order.

Tasks that require working with groups are common and can range from the very simple ("Calculate the mean mpg of the domestic cars and the foreign cars separately") to the very complex ("Model student performance on a standardized test, taking into account that the students are grouped into classes which are grouped into schools which are grouped into districts."). Fortunately, working with groups is one of Stata's greatest strengths. In this article we'll discuss tools for working with groups, and at the same time try to give you more experience using Stata's syntax to get useful results. In the next section (Hierarchical Data) we'll introduce a theoretical framework for thinking about grouped data, but it will make more sense if you've had some experience working with groups first.

We'll start by going through some basic tools that are used for working with groups, and some tricks for using them. In doing so we'll use one of the example data sets for this series,households.dta. Make sure you copied those files fromX:\SSCC Tutorials\StataResearchordownloaded them, and put them in a convenient location likeU:\StataResearch. Make sure your current working directory is set to that location:

cd U:\StataResearch

(or wherever you put them). Then start a do file:

clear all

capture log close

set more off

log using groups1.log, replace

use households

log close

You'll want to run your do file frequently in this section. Consider keeping a data browser window open so you can easily see what the do file does to your data.

This data set contains information on twenty fictional people who live in six different households. This data structure is one of the most common encountered at the SSCC. One variable that may require explanation isrel2head, or "relationship to the head of household." It is a categorical variable that takes on three values, with value labels applied. Typelabel listto see them. This is typical of real-world data (except real data usually have many more kinds of relationships).

The most important tool for working with groups isby. Recall that if you putbyvarlist:before a command, Stata will first break up the data set up into one group for each value of thebyvariable (or each unique combination of thebyvariables if there's more than one), and then run the command separately for each group. For further review, see the section onbyinUsage and Syntax. Here are some examples of things you can do withby.

Calculating Summary Statistics Over Groups

Find the average age of the adults in each household:

by household: sum age if age>=18

(You could get the same results more compactly withtab household if age>=18, sum(age))

Store the total household income in a new variable:

by household: egen householdIncome=total(income)

Note thathouseholdIncomeis the same for all the individuals living in a given household. That's because it's a characteristic of the household, not the individual. We'll talk more about this distinction inHierarchical Data.

Identifying Characteristics of a Group

Create an indicator for whether a household has children or not, regardless of number:

gen child=(age<18)

by household: egen hasChildren=max(child)

If a household has no children, the maximum value ofchildwill be zero. If it has any at all, the maximum will be one.

In this case,childis likely to be a useful variable in its own right. But if you didn't need it, you could do the whole process in one line with:

by household: egen hasChildren=max(age<18)

Now instead of finding the max of a variable, you're finding the max of an expression, but the result is the same: the maximum will be one for the entire household if the household has any children in it and zero otherwise.

Counting Observations that Meet a Condition

Find the number of children in each household:

by household:egen numChildren=total(child)

Here we take advantage of the fact that the total of an indicator variable is the number of observations for which the indicator variable is true. Again,total(child)could have beentotal(age<18).

Result Spreading

Suppose we need to store the mean age of the adults in each household as a variable. The obvious starting point would be:

by household: egen meanAdultAge=mean(age) if age>=18

However,meanAdultAgereceives a missing for all the children in the data set. That's because theifcondition does two things in this command: it controls which which observations are used in calculating the mean to be stored inmeanAdultAge, but also which observations that mean is stored in. If we need the household'smeanAdultAgeto be available in all the observations for that household (and we usually do), then we need to "spread" the result to the other observations.

by household: egen temp=mean(meanAdultAge)

drop meanAdultAge

rename temp meanAdultAge

All the observations in each household that have a value formeanAdultAgehave the same value. Thus themean()function returns that value—but it does so for all the observations in the household. (Recall that whenmean()encounters missing values it essentially ignores them and calculates the mean of the non-missing values.) Thus thetempvariable contains the proper value ofmeanAdultAgefor all observations, adults and children. We then drop the oldmeanAdultAgevariable and renametempmeanAdultAge. If we plan ahead we can save one line of code compared to the above:

by household: egen temp=mean(age) if age>=18

by household: egen meanAdultAge=mean(temp)

drop temp

This is sometimes called "spreading" a result: if you can find the right answer for some of the observation in a group, you can then spread it out to the others. You could do spreading with any of severalegenfunctions:min(),max(), etc., butmean()is perhaps the most intuitive.

Exercises

Create an indicator variable for childless households using thenumChildrenvariable you created earlier. Defend your choice whether or not to usebyin the process. (Solution)

Find the age of the youngest adult in each household at the time their first child was born. (Hint: this is a characteristic of the household, not an individual.) (Solution)

Find the mean household income of people in single-parent households and two-parent households. (Solution)

_n and _N

Most Stata commands are actually loops: do something to observation one, then do it to observation two and so forth. As Stata works through this loop, it tracks which observation it is working on with an internal variable called_n. You are welcome to use this variable in your commands:

l if _n==5

will only list observation five, because the condition_n==5is only true when Stata is working with observation five.

_nbecomes even more useful when combined withby.Suppose you wanted to list the first observation in each household:

by household: l if _n==1

It just so happens that the first observation is the head of household in every case, which is not unusual. But what if instead of havingrel2headyou only knew the head of household by their location in the household? Then you'd have to be very careful about sorting. Stata's default sort algorithm is not "stable," meaning that if you sort by household it may change the order of observations within the household. If the order of observations matters, you should add thestableoptionto anysortcommands. That way Stata will use a different sort algorithm that is slower but will not change the order of observations within a group. But having done that you can always identify the head of household with a combination ofby household:andif _n==1.

Another internal variable,_N, contains the number of observations in the data set. It is also the observation number of the last observation. You can use it in commands just like_n:

by household: l if _n==_N

This lists the last observation in each household.

Creating Within-Group Identifiers

Often you'll want to have a within-group identifier so you can always tell which observation is which, even after a mistaken sort. In this case the within-group identifier could logically be calledperson:

by household: gen person=_n

Thepersonvariable will correspond to the observation number of the person within their household in the current sort order. If you wanted a globally unique identifier, run the above command withoutby household:.

Finding the Size of a Group

Like_n,_Nhonors by groups. Thus_Ncontains the number of observations in thebygroup currently being worked on. You can easily find household size with:

by household: gen size=_N

Subscripts

Consider the command:

gen newIncome=income

In carrying it out, Stata looks at one observation at a time, and setsnewIncomefor that observation equal toincomefor the same observation. Subscripts allow you to look at the value of a variable for any observation you want. Try:

gen newIncome2=income[1]

income[1]means "the value of income for observation 1." ThusnewIncome2will be 60,000 for all observations (not that that is a useful result).

Spreading Characteristics of a Special Observation

Consider trying to identify the female-headed households:

by household: gen femaleHead=female[1]

Since the first person in each household is the head, the household has a female head if and only if the first person is female.

What if the head of household were last instead of first? Just change it to:

by household: gen femaleHead=female[_N]

What if the heads of household weren't in any particular place within the household? Usesortto make them the first person in the household:

sort household rel2head

by household: gen femaleHead=female[1]

What if the code for "head of household" weren't the lowest value ofrel2head? The following will always work:

gen isHead=(rel2head==1)

sort household isHead

by household: gen femaleHead=female[_N]

What if some households don't have a head, and you needfemaleHeadto be missing for those households? Do the above, but add anifcondition to the last line:

by household: gen femaleHead=female[_N] if isHead[_N]

This general method will work any time you need to pick out the characteristics of a special row within a group (the respondent to a survey, the month in which a subject graduated, etc.):

Create an indicator variable that is one for the special row and zero for all other rows

Sort by the group ID and the new indicator variable

The special row will be last and can be accessed with[_N]as long as you start withby

If you want the special observation to be first rather than last, you can usegsort(generalized sort):

gsort household -isHead

Withgsortyou can put a minus sign in front of a variable name and the observations will be sorted in descending order by that variable rather than ascending.

Checking Whether a Variable Varies within a Group

ThehouseholdIncomevariable should have the same value for all the individuals within a given household. You can check that with:

sort household householdIncome

by household: assert householdIncome[1]==householdIncome[_N]

Because the observations within a household are sorted byhouseholdIncome, the smallest value will be first and the largest value will be last. If the first and last values are the same, then you know all the values are the same.

Exercises

How could you check that every household has one and only one head of household? (Solution)

Create an indicator variable for whether a household's value ofagevaries. Use it tobrowsejust those households whoseagedoes vary. (Solution)

Calculations Based on an Observation's Neighbors

Subscripts can contain mathematical expressions, including_nand_N.

Start a new do file that loads the data set calledschools. This contains enrollment numbers for ten fictional (and not terribly plausible) schools. We'll define a student's peer group as everyone in her grade, the grade above her, and the grade below her. To find the size of each grade's peer group, type the following:

by school: gen peerGroup=students+students[_n+1]+students[_n-1]

The result is missing for grade one because it doesn't have a grade before it, and for grade twelve because it doesn't have a grade after it. Thusstudents[_n-1]orstudents[_n+1]give missing values for them. Fortunately Stata just returns a missing value in such cases rather than giving an "index out of bounds" error or something similar.

Exercises

What would happen if you left outby school:in:

by school: gen peerGroup=students+students[_n+1]+students[_n-1]

What would happen if some schools didn't have an observation for some grades? (Solution)

Implement an extended definition ofpeerGroupwhere the the peers of the first graders and the first and second graders, and the peers of the twelfth graders are the eleventh and twelfth graders (i.e. fill in the missing values). (Solution)

Panel Data

Panel data, or data with multiple individuals observed multiple times, can be treated like grouped data even though a "group" in this case is an individual. (This is why we introduce more general terminology inHierarchical Data.) Start another do file that loads a data set calledemployment. This consists of five people observed for twenty months, with each person's employment status recorded each month.

Identifying Spells

A typical person in this panel is employed for a while, then unemployed for a while, etc. Each period of continuous employment or unemployment is called a "spell" and a common first task with such data is to identify the spells.

Begin by identifying the months which start a new spell, i.e. the months where a person's employment status is different from what it was the previous month:

by person: gen start=(employed!=employed[_n-1])

For the first month in which a person is observed, the quantityemployed[_n-1]does not exist and is thus missing. Sinceemployedis never missing (how would you check that?) this guarantees that the first month a person is observed is marked as the start of a new spell.

Next comes something you should add to your bag of tricks:

by person: gen spell=sum(start)

Thesum()function finds running sums, i.e. the sum of a variable for all observations up to and including the current observation. Sincestartis one whenever a spell starts and zero otherwise,sum(start)for an observation is the number of spells which have started up to that point—and that serves as a splendid spell ID.

Once you've identified the spells, you can treat them as groups. However, these spell IDs only make sense within the context of a person (each person has their own spell number one). Thus the properbyisby person spell:, and the first time you use it you'll have to saybysort. But everything you've learned still applies. For example, finding the duration of a spell is exactly like finding the size of a household:

bysort person spell: gen duration=_N

Exercises

Think back to the command:

by person: gen start=(employed!=employed[_n-1])

What would happen if you omitted theby? (Solution)

Create variables containing the start month, start year, end month and end year for each spell. (Solution)

Find the mean spell length for each person. Make sure the mean is calculated over spells, not months. (Solution)

Next:Hierarchical Data

Previous:Statistics

Last Revised:1/4/2016

|Contact Us|RSS|

Stata for Researchers: Working with Groups

推荐阅读更多精彩内容