Stata for Researchers: Working with Groups

This is part six of the Stata for Researchers series. For a list of topics covered by this series, see the Introduction. If you're new to Stata, we highly recommend reading the articles in order.

Tasks that require working with groups are common and can range from the very simple ("Calculate the mean mpg of the domestic cars and the foreign cars separately") to the very complex ("Model student performance on a standardized test, taking into account that the students are grouped into classes which are grouped into schools which are grouped into districts."). Fortunately, working with groups is one of Stata's greatest strengths. In this article we'll discuss tools for working with groups, and at the same time try to give you more experience using Stata's syntax to get useful results. In the next section (Hierarchical Data) we'll introduce a theoretical framework for thinking about grouped data, but it will make more sense if you've had some experience working with groups first.

We'll start by going through some basic tools for working with groups, and some tricks for using them. In doing so we'll use one of the example data sets for this series, households.dta. Make sure you copied those files from X:\SSCC Tutorials\StataResearch or downloaded them, and put them in a convenient location like U:\StataResearch. Make sure your current working directory is set to that location:

cd U:\StataResearch

(or wherever you put them). Then start a do file:

clear all

capture log close

set more off

log using groups1.log, replace

use households

log close

You'll want to run your do file frequently in this section. Consider keeping a data browser window open so you can easily see what the do file does to your data.

This data set contains information on twenty fictional people who live in six different households. This data structure is one of the most common encountered at the SSCC. One variable that may require explanation is rel2head, or "relationship to the head of household." It is a categorical variable that takes on three values, with value labels applied. Type label list to see them. This is typical of real-world data (except real data usually have many more kinds of relationships).

By

The most important tool for working with groups is by. Recall that if you put by varlist: before a command, Stata will first break up the data set into one group for each value of the by variable (or each unique combination of the by variables if there's more than one), and then run the command separately for each group. For further review, see the section on by in Usage and Syntax. Here are some examples of things you can do with by.

Calculating Summary Statistics Over Groups

Find the average age of the adults in each household:

by household: sum age if age>=18

(You could get the same results more compactly with tab household if age>=18, sum(age).)

Store the total household income in a new variable:

by household: egen householdIncome=total(income)

Note that householdIncome is the same for all the individuals living in a given household. That's because it's a characteristic of the household, not the individual. We'll talk more about this distinction in Hierarchical Data.
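If you'd like to see this on data small enough to check by hand, here is a minimal sketch. The toy data set below is made up for illustration (it is not the contents of households.dta), and it assumes nothing beyond by and egen as described above:

```stata
* Toy data: hypothetical values, not the contents of households.dta
clear
input household age income
1 40 50000
1 38 30000
1 10 0
2 65 20000
2 15 0
3 30 45000
end

* by requires the data to be sorted by the grouping variable
sort household

* Total income within each household, stored on every row of that household
by household: egen householdIncome = total(income)

list, sepby(household)
```

Every row in household 1 should show the same householdIncome, which is the point: the total belongs to the household, and each member simply carries a copy of it.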

Identifying Characteristics of a Group

Create an indicator for whether a household has children or not, regardless of number:

gen child=(age<18)

by household: egen hasChildren=max(child)

If a household has no children, the maximum value of child will be zero. If it has any at all, the maximum will be one.

In this case, child is likely to be a useful variable in its own right. But if you didn't need it, you could do the whole process in one line with:

by household: egen hasChildren=max(age<18)

Now instead of finding the max of a variable, you're finding the max of an expression, but the result is the same: the maximum will be one for the entire household if the household has any children in it and zero otherwise.

Counting Observations that Meet a Condition

Find the number of children in each household:

by household: egen numChildren=total(child)

Here we take advantage of the fact that the total of an indicator variable is the number of observations for which the indicator variable is true. Again, total(child) could have been total(age<18).
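The indicator and the count are closely related, and a small sketch makes that visible. The data below are hypothetical, and both variables are built directly from the expression age<18 rather than from a separate child variable:

```stata
* Toy data: hypothetical values
clear
input household age
1 40
1 38
1 10
2 65
2 70
3 30
3 5
3 2
end
sort household

* 1 if any member is under 18, 0 otherwise
by household: egen hasChildren = max(age < 18)

* Number of members under 18
by household: egen numChildren = total(age < 18)

* The two variables must agree about which households have children
assert hasChildren == (numChildren > 0)

list, sepby(household)
```

One thing to keep in mind with real data: Stata treats a missing value as larger than any number, so a person with a missing age is not counted as a child by age<18.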

Result Spreading

Suppose we need to store the mean age of the adults in each household as a variable. The obvious starting point would be:

by household: egen meanAdultAge=mean(age) if age>=18

However, meanAdultAge receives a missing value for all the children in the data set. That's because the if condition does two things in this command: it controls which observations are used in calculating the mean to be stored in meanAdultAge, but also which observations that mean is stored in. If we need the household's meanAdultAge to be available in all the observations for that household (and we usually do), then we need to "spread" the result to the other observations.

by household: egen temp=mean(meanAdultAge)

drop meanAdultAge

rename temp meanAdultAge

All the observations in each household that have a value for meanAdultAge have the same value. Thus the mean() function returns that value, but it does so for all the observations in the household. (Recall that when mean() encounters missing values it essentially ignores them and calculates the mean of the non-missing values.) Thus the temp variable contains the proper value of meanAdultAge for all observations, adults and children. We then drop the old meanAdultAge variable and rename temp to meanAdultAge. If we plan ahead we can save one line of code compared to the above:

by household: egen temp=mean(age) if age>=18

by household: egen meanAdultAge=mean(temp)

drop temp

This is sometimes called "spreading" a result: if you can find the right answer for some of the observations in a group, you can then spread it out to the others. You could do spreading with any of several egen functions: min(), max(), etc., but mean() is perhaps the most intuitive.
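Here is the two-step version run on a toy data set so you can watch the spreading happen. The values are hypothetical, and the list in the middle is only there to show the intermediate state:

```stata
* Toy data: hypothetical values
clear
input household age
1 40
1 38
1 10
2 65
2 15
3 30
end
sort household

* Step 1: the mean is computed from adults only, and stored on adults only
by household: egen temp = mean(age) if age >= 18
list, sepby(household)

* Step 2: spread it to every row of the household (mean() skips the missings)
by household: egen meanAdultAge = mean(temp)
drop temp
list, sepby(household)
```

In the first listing the children have missing values for temp; in the second, every member of household 1 has a meanAdultAge of 39.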

Exercises

Create an indicator variable for childless households using the numChildren variable you created earlier. Defend your choice whether or not to use by in the process. (Solution)

Find the age of the youngest adult in each household at the time their first child was born. (Hint: this is a characteristic of the household, not an individual.) (Solution)

Find the mean household income of people in single-parent households and two-parent households. (Solution)

_n and _N

Most Stata commands are actually loops: do something to observation one, then do it to observation two, and so forth. As Stata works through this loop, it tracks which observation it is working on with an internal variable called _n. You are welcome to use this variable in your commands:

l if _n==5

will only list observation five, because the condition _n==5 is only true when Stata is working with observation five.

_n becomes even more useful when combined with by. Suppose you wanted to list the first observation in each household:

by household: l if _n==1

It just so happens that the first observation is the head of household in every case, which is not unusual. But what if instead of having rel2head you only knew the head of household by their location in the household? Then you'd have to be very careful about sorting. Stata's default sort algorithm is not "stable," meaning that if you sort by household it may change the order of observations within the household. If the order of observations matters, you should add the stable option to any sort commands. That way Stata will use a different sort algorithm that is slower but will not change the order of observations within a group. But having done that you can always identify the head of household with a combination of by household: and if _n==1.
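To see what the stable option promises, consider this small sketch. The data and the origOrder variable are hypothetical additions for illustration: origOrder records where each row started so that the order within each household can be checked after sorting.

```stata
* Toy data, deliberately not sorted by household (hypothetical values)
clear
input household str6 name
2 "Ann"
1 "Bob"
2 "Cal"
1 "Dee"
end

* Remember the original position of each row
gen long origOrder = _n

* stable keeps rows with the same household in their original relative order
sort household, stable

* Within each household, original positions must still be increasing
by household: assert origOrder > origOrder[_n-1] if _n > 1

* The first person in each household is whoever came first originally
by household: list name if _n == 1
```

Without stable, the assert could fail on some runs and pass on others, which is exactly what makes unstable sorts hard to debug.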

Another internal variable, _N, contains the number of observations in the data set. It is also the observation number of the last observation. You can use it in commands just like _n:

by household: l if _n==_N

This lists the last observation in each household.

Creating Within-Group Identifiers

Often you'll want to have a within-group identifier so you can always tell which observation is which, even after a mistaken sort. In this case the within-group identifier could logically be called person:

by household: gen person=_n

The person variable will correspond to the observation number of the person within their household in the current sort order. If you wanted a globally unique identifier, run the above command without by household:.

Finding the Size of a Group

Like _n, _N honors by groups. Thus _N contains the number of observations in the by group currently being worked on. You can easily find household size with:

by household: gen size=_N
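The following sketch puts the within-group identifier, the global identifier, and the group size side by side on toy data. The values are hypothetical, and id is a name chosen here for the global identifier:

```stata
* Toy data: hypothetical values
clear
input household age
1 40
1 10
2 65
3 30
3 28
3 4
end
sort household, stable

* Position within the household, and number of people in the household
by household: gen person = _n
by household: gen size = _N

* Without by, _n is the position in the whole data set
gen id = _n

list, sepby(household)
```

Note that the last person in each household always has person equal to size, since both come from the same by group.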

Subscripts

Consider the command:

gen newIncome=income

In carrying it out, Stata looks at one observation at a time, and sets newIncome for that observation equal to income for the same observation. Subscripts allow you to look at the value of a variable for any observation you want. Try:

gen newIncome2=income[1]

income[1] means "the value of income for observation 1." Thus newIncome2 will be 60,000 for all observations (not that that is a useful result).

Spreading Characteristics of a Special Observation

Consider trying to identify the female-headed households:

by household: gen femaleHead=female[1]

Since the first person in each household is the head, the household has a female head if and only if the first person is female.

What if the head of household were last instead of first? Just change it to:

by household: gen femaleHead=female[_N]

What if the heads of household weren't in any particular place within the household? Use sort to make them the first person in the household:

sort household rel2head

by household: gen femaleHead=female[1]

What if the code for "head of household" weren't the lowest value of rel2head? The following will always work:

gen isHead=(rel2head==1)

sort household isHead

by household: gen femaleHead=female[_N]

What if some households don't have a head, and you need femaleHead to be missing for those households? Do the above, but add an if condition to the last line:

by household: gen femaleHead=female[_N] if isHead[_N]

This general method will work any time you need to pick out the characteristics of a special row within a group (the respondent to a survey, the month in which a subject graduated, etc.):

Create an indicator variable that is one for the special row and zero for all other rows

Sort by the group ID and the new indicator variable

The special row will be last and can be accessed with [_N] as long as you start with by
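The three steps above can be sketched on a toy data set that includes a household with no head. The values are hypothetical; the only thing carried over from the text is that rel2head==1 marks the head:

```stata
* Toy data: hypothetical values; household 2 has no head
clear
input household rel2head female
1 2 0
1 1 1
1 3 0
2 3 1
2 2 0
3 1 0
3 3 1
end

* Step 1: indicator for the special row
gen isHead = (rel2head == 1)

* Step 2: sort so the special row is last within each group
sort household isHead

* Step 3: pick up its value with [_N], but only if the last row really is a head
by household: gen femaleHead = female[_N] if isHead[_N]

list, sepby(household)
```

Household 1 has a female head, household 3 has a male head, and household 2 gets missing values because its last row has isHead equal to zero.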

If you want the special observation to be first rather than last, you can use gsort (generalized sort):

gsort household -isHead

With gsort you can put a minus sign in front of a variable name and the observations will be sorted in descending order by that variable rather than ascending.

Checking Whether a Variable Varies within a Group

The householdIncome variable should have the same value for all the individuals within a given household. You can check that with:

sort household householdIncome

by household: assert householdIncome[1]==householdIncome[_N]

Because the observations within a household are sorted by householdIncome, the smallest value will be first and the largest value will be last. If the first and last values are the same, then you know all the values are the same.

Exercises

How could you check that every household has one and only one head of household? (Solution)

Create an indicator variable for whether a household's value of age varies. Use it to browse just those households whose age does vary. (Solution)

Calculations Based on an Observation's Neighbors

Subscripts can contain mathematical expressions, including _n and _N.

Start a new do file that loads the data set called schools. This contains enrollment numbers for ten fictional (and not terribly plausible) schools. We'll define a student's peer group as everyone in her grade, the grade above her, and the grade below her. To find the size of each grade's peer group, type the following:

by school: gen peerGroup=students+students[_n+1]+students[_n-1]

The result is missing for grade one because it doesn't have a grade before it, and for grade twelve because it doesn't have a grade after it. Thus students[_n-1] or students[_n+1] gives a missing value for those observations, and adding a missing value to anything yields missing. Fortunately Stata just returns a missing value in such cases rather than giving an "index out of bounds" error or something similar.
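A scaled-down version makes the edge behavior easy to check by hand. The sketch below uses hypothetical data with two schools and only three grades each, so the first and last grade of each school play the roles of grades one and twelve:

```stata
* Toy data: hypothetical values, three grades per school
clear
input school grade students
1 1 30
1 2 25
1 3 35
2 1 40
2 2 45
2 3 50
end

* The neighbors only make sense if grades are in order within each school
sort school grade

by school: gen peerGroup = students + students[_n+1] + students[_n-1]

list, sepby(school)
```

Only the middle grade of each school gets a value (90 and 135); the first and last grades are missing because one of their neighbors does not exist within the school.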

Exercises

What would happen if you left out by school: in:

by school: gen peerGroup=students+students[_n+1]+students[_n-1]

What would happen if some schools didn't have an observation for some grades? (Solution)

Implement an extended definition of peerGroup where the peers of the first graders are the first and second graders, and the peers of the twelfth graders are the eleventh and twelfth graders (i.e. fill in the missing values). (Solution)

Panel Data

Panel data, or data with multiple individuals observed multiple times, can be treated like grouped data even though a "group" in this case is an individual. (This is why we introduce more general terminology in Hierarchical Data.) Start another do file that loads a data set called employment. This consists of five people observed for twenty months, with each person's employment status recorded each month.

Identifying Spells

A typical person in this panel is employed for a while, then unemployed for a while, etc. Each period of continuous employment or unemployment is called a "spell" and a common first task with such data is to identify the spells.

Begin by identifying the months which start a new spell, i.e. the months where a person's employment status is different from what it was the previous month:

by person: gen start=(employed!=employed[_n-1])

For the first month in which a person is observed, the quantity employed[_n-1] does not exist and is thus missing. Since employed is never missing (how would you check that?), this guarantees that the first month a person is observed is marked as the start of a new spell.

Next comes something you should add to your bag of tricks:

by person: gen spell=sum(start)

The sum() function finds running sums, i.e. the sum of a variable for all observations up to and including the current observation. Since start is one whenever a spell starts and zero otherwise, sum(start) for an observation is the number of spells which have started up to that point, and that serves as a splendid spell ID.

Once you've identified the spells, you can treat them as groups. However, these spell IDs only make sense within the context of a person (each person has their own spell number one). Thus the proper by is by person spell:, and the first time you use it you'll have to say bysort. But everything you've learned still applies. For example, finding the duration of a spell is exactly like finding the size of a household:

bysort person spell: gen duration=_N
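Here is the whole spell-identification sequence on a toy panel. The data are hypothetical, and the month variable name is an assumption made for this sketch. It also writes (month) in the bysort prefix, which is not in the command above: it asks Stata to keep the months in order within each spell without making month part of the group definition.

```stata
* Toy panel: hypothetical values, two people
clear
input person month employed
1 1 1
1 2 1
1 3 0
1 4 0
1 5 0
1 6 1
2 1 0
2 2 1
2 3 1
end
sort person month

* 1 whenever employment status differs from the previous month
* (the first month compares against a missing value, so it is always 1)
by person: gen start = (employed != employed[_n-1])

* Running count of spell starts within each person = spell ID
by person: gen spell = sum(start)

* Number of months in each spell
bysort person spell (month): gen duration = _N

list, sepby(person)
```

Person 1 ends up with three spells lasting two, three, and one months; person 2 has two spells lasting one and two months.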

Exercises

Think back to the command:

by person: gen start=(employed!=employed[_n-1])

What would happen if you omitted the by? (Solution)

Create variables containing the start month, start year, end month and end year for each spell. (Solution)

Find the mean spell length for each person. Make sure the mean is calculated over spells, not months. (Solution)

Next: Hierarchical Data

Previous: Statistics

Last Revised: 1/4/2016

©2009-2015 UW Board of Regents, University of Wisconsin - Madison
