Make Service Faults Transparent

This article is in English because I really need to practice the language. Sorry if it is not easy to understand.

A Summary of What Has Happened Recently

Recently, IT services on my campus have been very unstable.

  • In March, many people posted on forums that when they topped up their campus Internet accounts through WeChat, far more money (maybe 100x) than they had paid was credited.
    • Later the WeChat top-up service was disabled. Because most people were not aware of the existing offline top-up-by-card service, many of them fell into arrears.
    • Several days later, the campus Internet billing system was disabled, which meant the network was free to use. The billing system was later resumed, but it charged only the monthly fee (not the traffic fee).
    • An easily overlooked statement was then published, saying that the problem was caused by a bug from the software company.
  • On March 20th, campus card users who used their cards to get hot water or buy breakfast found their cards locked. (Those lazy guys who did neither were not affected at all.)
    • In the morning nobody knew whether the issue was being solved, until around 11 (lunchtime), when my school's instructor sent an announcement: "there will be unlock service in canteens, please keep order and don't panic at the scene". Notices from the canteen administrators were also put up in the canteens. Unlocking was quick and easy, but most people still went to the canteens where Alipay is accepted.
    • Later that afternoon the card administrator's public statement came out: it was a service fault (on BITUnion some said the bug had been hidden for 14 years). IT staff explained on BITUnion that they had tried to work out solutions and mitigate the issue before drafting public statements.
  • In recent months the campus Internet has been unstable: during peak hours it became very slow or even unavailable. The downtime is maybe around 2% (measured over 24 hours), which does not look like much, but users could surely feel it.
    • The causes seem very complex. In my view, new DNS servers, old cache servers, new firewall systems, new upstream link providers and upstream link issues can all cause problems. And of course the new facilities all need to be fine-tuned, which takes time.
    • Currently no official statement has been published. But the IT service monthly report (which most people are not aware of) said "Issue fully fixed, during peak hours upstream links can work in full bandwidth now". One of the reasons it mentioned was "DDoS attack causing network core server CPU instant usage up to 99% (usually ~20%)".
    • However, a student representatives' meeting will be held soon, and many representatives will raise the heated Internet issue at the meeting. But I believe most of them will never understand why this is happening.
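To see why "around 2%" can still hurt, here is some rough arithmetic as a small sketch. The 2% figure is the estimate from the list above; the 4-hour peak window is purely my assumption for illustration, not a measured value.

```python
# Rough arithmetic only. DAILY_DOWNTIME_RATIO is the "around 2%" estimate from
# the text; PEAK_HOURS_PER_DAY is an assumed value, not a measurement.
DAILY_DOWNTIME_RATIO = 0.02
PEAK_HOURS_PER_DAY = 4


def downtime_minutes_per_day(ratio: float) -> float:
    """Convert a daily downtime ratio into minutes per day."""
    return ratio * 24 * 60


def share_of_peak(ratio: float, peak_hours: float) -> float:
    """Fraction of the peak window lost if all downtime falls inside it."""
    return downtime_minutes_per_day(ratio) / (peak_hours * 60)


minutes = downtime_minutes_per_day(DAILY_DOWNTIME_RATIO)
peak_share = share_of_peak(DAILY_DOWNTIME_RATIO, PEAK_HOURS_PER_DAY)
print(f"{minutes:.1f} minutes of downtime per day")
print(f"{peak_share:.0%} of the assumed peak window")
```

Under these assumptions, 2% of a day is about half an hour, and if it all lands in the evening peak it takes a noticeable share of the time when people actually want to be online.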
"Totally nailed the fix"

Why Faults Need to Be Transparent

As you can see, all these issues appeared suddenly, but they did not happen for no reason. Apart from solving the issues, making the solving process transparent is also important. Why?

Because information technology is becoming essential to our lives, just like water and electricity supplies. At this point it is no longer anything "advanced", so people have high expectations of it. What's more, IT is developing fast (on a scale of years, not decades), so people's expectations grow fast with it.

It's quite a challenge for the campus IT service to keep up with that. But first of all, they are working on it. If they don't speak, people who consider the service essential will think "It's just messing up my life, and they aren't even trying hard to solve it". This is surely a gap in understanding between the two sides.

"Why did you leave the escalator unfixed for ONE MONTH!"

P.S. Some kind person has reminded me that in the "old system" there are sometimes staff who do not work at all. But I guess on my campus they work hard.

Another problem when IT service is not transparent "in time" is that users don't know whether they should report the issue or wait. Of course most of us will silently wait for the fix - most of us are busy, right? But what if the staff don't know about the issue at all? We don't know whether they know, and most people won't keep believing "they must be fixing it now" forever. This uncertainty may be even more harmful, and it causes user dissatisfaction.

I can't think of any disadvantage of being actively transparent about faults for a hard-working public service, so I strongly believe in this idea.

Ah, yes, I have to highlight that the transparency I mean here is "instant transparency". Sometimes this brings one problem: when you realize that a cause you published earlier was wrong, you have to retract the previous statement, which brings confusion. If everybody is wise and accepts that people can make mistakes, this is not a problem at all, and you can just leave your previous "wrong" statement there.

In Staytus's demo, an issue became red again from `Monitoring` status

Tools and Platforms Are Not That Important

People may argue, "we might not have the right tool to do that for now". Perhaps the tool doesn't fit, but once you have the will to do the right thing, tools and platforms are not a problem.

A good example on my campus is the student financial service. They always use forums to answer students' scholarship questions. The forum they chose is not that popular, and I guess some information about the scholarship process could be presented better on a single page, but first and foremost they chose to be transparent.

IT service, by contrast, is:

  • Essential, so users need feedback more instantly;
  • Wide, so physical service and on-site announcements in all areas are expensive;
  • Complex, where hardware, software and configuration all matter.

Thus a digital way might be a better way to provide transparency.

But what if "the digital way" is faulty? We can put the solution on a school server that hardly ever fails (probably standalone) and is connected to both the Intranet and the Internet. A better solution might be to prepare for the worst: choose a third party (a VPS outside the Intranet) or a public service (Weibo or WeChat), and hope that it won't fail when our infrastructure fails. Unreliable as it seems, you are winning a lottery if everything fails at once (maybe once in a lifetime?), and then you won't hesitate to make physical announcements.

Yeah, maybe your physical announcement is not enough...

A Blueprint Specifically for IT Service

When everybody is busy, this kind of customer service cannot depend only on "I contact you and you talk to me". Some self-service thinking can be incorporated here: make status updates available to everyone. When users need help, they can check the updates, rest assured, and calmly wait.

I heard that a support ticket system for IT services is being considered now, but for now the "status page" is more important.

We have talked about the platforms, right? Let's look into them one by one.

  • A webpage, which is very customizable, seems good. But whether it's in the browser or inside the WeChat WebView, it can't push notifications by itself.
    • However, when users meet an issue that matters to them, they will check the status themselves. So pushing doesn't matter that much.
    • When we have had a disaster and need to "push" some apologies, that doesn't need to be instant or frequent. That's outside the scope of what we are talking about.
  • Weibo seems good and can be a choice. But there are two problems: it is so public that sometimes that's not a good thing, and, last but not least, when everybody uses WeChat, who cares about Weibo?
  • The problem with a WeChat official account is that it can't push messages that frequently. When you have limits, you might not want to be that transparent. And yes, users don't want to receive messages that frequently.
  • A WeChat enterprise account doesn't seem to have these problems. It doesn't limit your push frequency. But if you choose this, remember that it is not a long-term solution (it has been superseded by the Enterprise WeChat app), and it is not supported on PC and Windows Phone. It hardly deserves to be called "transparent" unless you provide a webpage alternative.
  • As for other push methods, people hardly use email, SMS is expensive, and you are probably thinking of a mobile app? Nobody likes something that heavy.

As I said above, when a fault happens, users are motivated to "check status". Thus frequent, up-to-date status updates that don't need to be pushed to everybody look good.

The conclusion is that it's best to have

  • a self-hosted standalone status webpage,
  • linked from major IT platforms (in my campus's case, the WeChat enterprise account and the IT department website),
  • which can be quickly deployed to an external VPS and keep working if the self-hosted one crashes,
  • whose data can be consumed via webhooks or an API by other official platforms, like Weibo or WeChat.
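The last two points can be sketched as a plain JSON feed: if the status data is just a file, it is easy to copy to an external mirror and easy for other platforms to read. This is only a minimal sketch under my own assumptions. `StatusUpdate`, `build_feed`, `short_text`, the field names, the 140-character limit and the sample timestamps are all hypothetical and do not come from any of the tools mentioned in this article.

```python
from __future__ import annotations

import json
from dataclasses import asdict, dataclass


@dataclass
class StatusUpdate:
    service: str       # hypothetical service name, e.g. "campus-network"
    state: str         # investigating / identified / monitoring / resolved
    message: str
    published_at: str  # ISO 8601 timestamp; all entries must share one UTC offset


def build_feed(updates: list[StatusUpdate]) -> str:
    """Serialize updates, newest first, into a JSON document.

    The result can be written to a static file on the status page and
    copied unchanged to an external mirror. Sorting the timestamps as
    strings only works because they share the same format and offset.
    """
    ordered = sorted(updates, key=lambda u: u.published_at, reverse=True)
    return json.dumps(
        {"updates": [asdict(u) for u in ordered]},
        ensure_ascii=False,
        indent=2,
    )


def short_text(update: StatusUpdate, limit: int = 140) -> str:
    """Shorten one update for a platform with a length limit (assumed value)."""
    text = f"[{update.state}] {update.service}: {update.message}"
    return text if len(text) <= limit else text[: limit - 1] + "…"


# Example values are made up for illustration.
updates = [
    StatusUpdate("campus-network", "investigating",
                 "Slow access during peak hours.", "2017-04-01T19:05:00+08:00"),
    StatusUpdate("campus-network", "identified",
                 "An upstream link is saturated.", "2017-04-01T19:40:00+08:00"),
]
feed = build_feed(updates)
print(feed)
print(short_text(updates[1]))
```

The point of the design is that the status page stays the single source, and Weibo or WeChat posts are derived from it instead of being written separately.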

Of course, building such a page has some technical cost, so choosing an existing public service (in the short term) is fine, too.

"How to publish" is easy: we can formulate some statement templates (like the well-known investigating/identified/monitoring/resolved model) and add details to them when they are used. We can also set rules for updates to keep things transparent, like publishing at least one update every X hours.

Pre-translated templates in Google's statusboard; notice the "we have additional English explanation" sentences
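The template idea and the "one update every X hours" rule can be sketched in a few lines. This is a minimal sketch, not how any of the tools in this article work: the template wording, `render`, `update_overdue` and the parameter names are all hypothetical, and a real team would write its own statements.

```python
from datetime import datetime, timedelta

# The four states named in the text. No order is enforced here, because an
# incident can move back (for example from monitoring to identified).
STATES = ("investigating", "identified", "monitoring", "resolved")

# Hypothetical wording, only to show the shape of a template.
TEMPLATES = {
    "investigating": "We are investigating reports of problems with {service}. {details}",
    "identified": "We have identified the cause of the problem with {service}. {details}",
    "monitoring": "A fix for {service} has been applied and we are monitoring the result. {details}",
    "resolved": "The problem with {service} has been resolved. {details}",
}


def render(state: str, service: str, details: str = "") -> str:
    """Fill the template for one state with the service name and details."""
    if state not in TEMPLATES:
        raise ValueError(f"unknown state: {state}")
    return TEMPLATES[state].format(service=service, details=details).strip()


def update_overdue(state: str, last_update: datetime, now: datetime,
                   max_gap: timedelta) -> bool:
    """True if an open incident has gone longer than max_gap without an update."""
    return state != "resolved" and now - last_update > max_gap


statement = render("identified", "campus card", "A billing bug was found.")
print(statement)
```

A reminder job could call `update_overdue` with `max_gap=timedelta(hours=X)` for whatever X the team agrees on, and nudge whoever is on duty to publish something, even if it is only "still working on it".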

We also need someone to publish messages (I know in China this is a bigger problem). A good technical writer should be recruited. But I think it can be done by students as a part-time job: they sign a confidentiality agreement and join the working discussion group, and if any fault happens, they are responsible for publishing the situation according to the templates and the discussion group's conversations. Yeah, I bet these conversations sometimes contain passwords or something else, so confidentiality is important.

Or if the tech staff can do the updates themselves, that's fine (but they are really too busy for that).

"The well-known model" in Staytus

Choosing an Open-Source Solution

As a student who doesn't have that much money, I like open source a lot. For this status page thing, of course I would like to solve it with open-source software.

Actually, as far as I know, there is no such "status page service" in China. For example, Leancloud built its status page itself. The "international" cloud versions don't seem good here, because they might be very slow. So we have to count on self-hosted, open-source ones.

In my opinion, for the status page of a school IT service, the most important thing is the "update". The overall "status indicator" is not that important.

Yes, this Apple style doesn't fit

After some research, I found that there are not many open-source status solutions that are dynamic, usable and still maintained.

  • Cachet, the most popular one, but not perfect for now; built with PHP and MySQL/PostgreSQL (as a reminder, try the dev version, since the current stable version doesn't have status updates)
  • Staytus, already elegant and ready to use (but simple); built with Ruby and MySQL, and the demo is really pretty
  • statuspage, not that popular, and I haven't checked it thoroughly yet, but as a Python alternative it says "Cachet is a great product, I simply despise PHP"
Cachet (dev version)

I know some of you hate databases. Using a static page generator is a good idea. Such solutions exist, but they don't seem that polished, and setting up the workflow is hard work.

  • Netlify StatusKit, though it's "a template to deploy your own Status pages on Netlify", it seems to be a generator
  • or we can make it with Jekyll and customized themes and plugins

I hope these solutions can be helpful. Still, the most important thing is what you are trying to achieve.
