Recently in Definitions Category

I originally wrote this for my online "'zine" called Buzz, hosted on commons.somewhere.com. That was in November of 2000, right after the Florida election fiasco. Although the specifics of that election are dated, the general information is not. This is about polls, how newspapers report them, and how our need for instant results just makes things worse.

"Margin of error" is a measurement of how confident you can be of a statistical result.  My stats books have long since been relegated to a box in the basement, but if you want more specifics, try this link (http://whyfiles.org/009poll/math_primer.html).

The primary influence on the margin of error is the size of the sample, but methodology can also have an impact, although it generally isn't measurable until you discover just how wrong you were after the fact.  That's what happened in the famous Dewey vs. Truman example (they polled via telephone, and people who had telephones at the time turned out to be a rather different group of people than the general population).  Methodology problems also explain what went wrong with the predictions of who was going to win the 2000 presidential election in Florida, and why web-based polls (self-selecting sample) are completely bogus.  Of course, if enough people start refusing to answer pollsters, all polls will be self-selecting and invalid.

The problem is that, although pollsters report margin of error, newspapers seldom give it more than a footnote.  So this year we've seen a lot of headlines like "XXX Takes the Lead" followed by an article that tells how the candidate is now two percentage points ahead, followed by something (buried deep inside) mentioning that the margin of error is plus or minus three percentage points.  Let's be perfectly clear.  If the difference in the polls is within the margin of error, then there is no difference.  None.  Any headline to the contrary is making up something that doesn't exist.  If you did the poll again it might well show XXX losing by two points--and that wouldn't mean anything either.
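For a rough sense of where a figure like "plus or minus three" comes from, here is the usual textbook approximation for a simple random sample at 95% confidence, taking the worst case of an evenly split electorate (p = 0.5). The sample size of 1,000 is an assumption chosen for illustration; it isn't taken from any particular poll.

```latex
\[
\text{margin of error} \approx 1.96\sqrt{\frac{p(1-p)}{n}}
= 1.96\sqrt{\frac{0.5 \times 0.5}{1000}}
\approx 1.96 \times 0.0158
\approx 0.031
\]
```

That is about 3.1 percentage points. Because the sample size sits under a square root, quadrupling the sample to 4,000 only halves the margin, to roughly 1.5 points. And none of this accounts for methodology problems, which the formula can't see.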

But of course, margin of error in polls is a pretty trivial problem compared to margin of error in vote taking machinery.  The reports I've seen (sorry, I've lost the reference) say that these lovely punch card machines are, when working absolutely optimally with absolutely wonderfully punched cards, 99.9% accurate.  That sounds nice, until we consider that the machines in use in polling places throughout this country are almost certainly not in optimal condition, and we know for a fact that the cards weren't well punched either.  On top of it all, 99.9% sounds nice, but the results of the Florida elections were so close that the difference between the candidates was less than .1%.  Or in other words, the difference was within the margin of error of the counting machines.  Which means, just like in polling, the difference is statistically insignificant.  It doesn't mean anything.  Do it again and you might get a different result. However, in this case the errors tend towards a particular type.  Punch card voting machines have trouble with partially poked holes--they think nobody voted.  This isn't necessarily because the voter screwed up either--you can poke the things and the chad still might not separate properly. So that biases the results in a particular direction.
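To put a number on that, assume for illustration a round six million ballots (my assumption for the arithmetic, not an official count). A machine that is 99.9% accurate misreads one ballot in a thousand:

```latex
\[
0.001 \times 6{,}000{,}000 = 6{,}000 \text{ ballots}
\]
```

Under that assumption, a lead of less than .1% is a lead of fewer than 6,000 votes, which is no bigger than the number of ballots the machines would be expected to misread even on their best day.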

So, we have a vote that is so close that it is statistically a tie.  You can either treat it that way, or you can try and decrease the margin of error.  That's what hand counting is all about.  The real good news is you don't have to hand count every single vote.  All you need to count are the ones that the voting machine rejected as being possibly in error.  Voting machines don't tend to count unpunched holes as punched (and if they did, you'd get multiple votes--so those will be flagged as errors anyway).  So by counting just a few (relatively speaking) votes, you can greatly increase the accuracy of the count, and thus decrease the margin of error.  Meaning that even if the results are still off by a few hundred votes--now you have confidence that those votes actually represent the will of the people.

BTW.  If punch cards are so inaccurate, how come computers used them so easily?  Two reasons.  One, you didn't punch computer punch cards by hand--you used a special machine.  You'd enter the data for a card, and then it punches out the holes.  Secondly, computer punch cards, in recognition of the fact that you can have errors, tended to use a checksum to help ensure that if something did go wrong with a card, you'd notice.  (An example checksum might be a parity bit--where a hole would be punched at the end if the result of adding the numbers on the card was even, and not if it was odd.  If a hole became clogged, or the chad fell out, the checksum would not be likely to match, and the card would be rejected.)
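Here's a minimal sketch of that parity idea in C. It assumes a card can be modeled as an array of digits, and it follows the convention described above (punch the hole when the sum is even). The names parity_hole and card_is_valid are made up for the example; real card formats varied.

```c
#include <assert.h>
#include <stddef.h>

/* Returns 1 if the check hole should be punched, i.e. the sum of the
   digits on the card is even; returns 0 otherwise. */
int parity_hole(const int *digits, size_t count)
{
    int sum = 0;
    for (size_t i = 0; i < count; i++)
        sum += digits[i];
    return (sum % 2 == 0) ? 1 : 0;
}

/* A card is accepted only if the hole actually read from the card
   matches the hole recomputed from the digits that were read. */
int card_is_valid(const int *digits, size_t count, int hole_read)
{
    assert(hole_read == 0 || hole_read == 1);
    return parity_hole(digits, count) == hole_read;
}
```

A single check hole like this catches any misread that changes whether the sum is odd or even. Two errors that cancel each other out would slip through, which is why the match is only "likely" to fail.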

So, what was that about instant gratification?

Some people (primarily pundits and candidates) have been making much of the urgency to bring the election to resolution as soon as possible.  This sounds reasonable because we are used to our election results being announced nearly as soon as the polls close (and sometimes before).  But in fact, the whole concept is silly.  The elections are not over until the electoral college has made its decision.  But even before that, as people are finally seeing in the Florida situation, there are three levels of "results" from an election.  There are the polls, there are the unofficial tallies (often made by some organization jointly formed by the two major parties), and then there are the official tallies that the state certifies.  You virtually never see the official tallies.  That's particularly frustrating if you are trying to form a viable third party, since the unofficial tallies often don't even bother to count minor candidates.  (Quick, how many votes did the Socialist Party get nationwide?)  A few years ago I voted for a third party candidate for governor and I was never able to find out how many votes he received. By the time the official results were available, nobody considered it news--so the press never reported it.  So far this election has taken no longer than any other election has ever taken.  It's just that this time people actually want to know what the real results are, not just the fast food version.

The other evening I added a few new people to the list of folks I was following on Twitter. It was one of those typical social networking things; checking your "friends" to see who they were tracking, and then adding the ones that looked interesting.

The result, oddly enough, was a late night conversation on the pros and cons of Welfare. It felt very much like those late night conversations I used to have in college; when everyone was full of ideas and eager to explore them. It was really quite enjoyable. And it certainly created a bump in my Twitter usage.

People who don't use Twitter often ask just what it's for. Why would you want to broadcast what you're doing to the world? That's partially the fault of the Twitter folks themselves, for that "What are you doing?" prompt. A better prompt might be "What do you want to do?" (although perhaps a bit too reminiscent of Babylon 5 :-). Adam Engst once called Twitter "iChat on shuffle," and it certainly can feel that way when you're carrying on several conversations at once. But Twitter isn't so much a piece of software that does something, as a medium through which software can do things. When the Telegraph and Phone were introduced, people certainly wondered why you'd want one, but people found interesting things to do with them, many of which had never been anticipated by their inventors. That's Twitter.

What's interesting to me is the relative pluses and minuses of having this type of discussion in Twitter. Andy Ihnatko recently pointed out the rather obvious (sorry Andy) fact that trying to express complete thoughts to their conclusion in 140 characters is rather difficult. You can of course just post a second message to finish the thought, but the delays in Twitter make it less natural to do so than it might be in a chat program. The performance of Twitter reinforces that 140 character limit, and I'm not sure that's a bad thing, because keeping points brief and concise may have the effect of equalizing the conversation. Nobody can dominate with a long exposition on a particular topic. Ratholes on side topics tend to be limited to pointers to URLs, not long conversations. You can use Twitter to espouse and justify an idea, but not to explain it in detail. But that's fine. We have other media that are better suited to that. That's not to say Twitter is perfect; separating incoming messages into categories (a New York Times news blast in the middle of a conversation is a bit distracting) and threading conversations (particularly when you aren't following all the participants) would be big pluses. But those are things that Twitter clients can do; they aren't necessarily drawbacks of Twitter itself.

Twitter's limitations might make it seem superficial and trivial. But that's like saying chatting around the water cooler is superficial. It can be, but it can also be a catalyst for new ideas that are followed up elsewhere. Andy's Twitter posting was the catalyst for my writing this blog posting. The discussion on Welfare was the catalyst for making new connections on other networks. Social interaction takes place on many different levels, all of which are necessary. What we are seeing online is people taking new tools and adapting them, consciously or unconsciously, to fit the interactions they feel they need in a virtual world. The companies that are succeeding in the social networking sphere are those that either identify those needs, or more likely, have the flexibility to be molded by their users. Flickr, Facebook, Twitter... none of those companies are necessarily doing what they started out to do, but they were able to adapt to the way people used them. In Social Networking, you achieve success when you stop being an application and become a transparent part of people's interactions.

But enough of that. I need to go fix the Welfare System!

P.S. I've suddenly become very self-conscious about the fact that I seem to be very fond of semi-colons.

All I really wanted to do was find the most recent email address of a friend. It was a mere matter of checking for the most recent email message from him, but he has one of those random .signature generators, and it had this interesting little poem. An hour (at least) later, here we are.

My Spill Chequer
Eye halve a spelling chequer
It came with my pea sea
It plainly marques four my revue
Miss steaks eye kin knot sea.
Eye strike a key and type a word
And weight four it two say
Weather eye am wrong oar write
It shows me strait a weigh.
As soon as a mist ache is maid
It nose bee fore two long
And eye can put the error rite
Its rarely ever wrong.
Eye have run this poem threw it
I am shore your pleased two no
Its letter perfect in it's weigh
My chequer tolled me sew.
(Sauce unknown)

So I started searching to see who wrote it. I didn't find that, but I did come across a lovely word; "oronym". It isn't in my online dictionary (it's a relatively recent neologism (another lovely word)), but the Wikipedia (of course) has it. It says:

This term was coined by Gyles Brandreth and first published in his book The Joy of Lex (1980), and it was used in the BBC programme Never Mind the Full Stops, which also featured Brandreth as a guest.

Oronyms are basically homophones which span words. They work in spoken English (and often depend on dialects) because we run all our words together. The above poem uses them of course, but there's a more famous example. (This version taken from Fun With Words.) I've heard this one before, although I'd forgotten it. Once upon a time :-) I had a friend who could recite the entire piece.

An Oronym Story – Ladle Rat Rotten Hut

Even more impressive in length is the following oronym story. It is the tale of Little Red Riding Hood... but not the famous version; this one is constructed entirely from homophones: Ladle Rat Rotten Hut. This curious version was written in 1940 by a professor of French named H. L. Chace. He wanted to show his students that intonation is an integral part of the meaning of language. Try reading it out loud (best in the accent of Southern/Central USA)!

Wants pawn term, dare worsted ladle gull hoe lift wetter murder inner ladle cordage, honor itch offer lodge, dock, florist. Disk ladle gull orphan worry putty ladle rat cluck wetter ladle rat hut, an fur disk raisin pimple colder Ladle Rat Rotten Hut.

Wan moaning, Ladle Rat Rotten Hut's murder colder inset. "Ladle Rat Rotten Hut, heresy ladle basking winsome burden barter an shirker cockles. Tick disk ladle basking tutor cordage offer groinmurder hoe lifts honor udder site offer florist. Shaker lake! Dun stopper laundry wrote! Dun stopper peck floors! Dun daily-doily inner florist, an yonder nor sorghum-stenches, dun stopper torque wet strainers!"

"Hoe-cake, murder," resplendent Ladle Rat Rotten Hut, an tickle ladle basking an stuttered oft. Honor wrote tutor cordage offer groin-murder, Ladle Rat Rotten Hut mitten anomalous woof. "Wail, wail, wail!" set disk wicket woof, "Evanescent Ladle Rat Rotten Hut! Wares are putty ladle gull goring wizard ladle basking?"

"Armor goring tumor groin-murder's," reprisal ladle gull. "Grammar's seeking bet. Armor ticking arson burden barter an shirker cockles."

"O hoe! Heifer gnats woke," setter wicket woof, butter taught tomb shelf, "Oil tickle shirt court tutor cordage offer groin-murder. Oil ketchup wetter letter, an den - O bore!"

Soda wicket woof tucker shirt court, an whinney retched a cordage offer groin-murder, picked inner windrow, an sore debtor pore oil worming worse lion inner bet. En inner flesh, disk abdominal woof lipped honor bet, paunched honor pore oil worming, an garbled erupt. Den disk ratchet ammonol pot honor groin-murder's nut cup an gnat-gun, any curdled ope inner bet.

Inner ladle wile, Ladle Rat Rotten Hut a raft attar cordage, an ranker dough ball. "Comb ink, sweat hard," setter wicket woof, disgracing is verse. Ladle Rat Rotten Hut entity betrum an stud buyer groin-murder's bet.

"O Grammar!" crater ladle gull historically, "Water bag icer gut! A nervous sausage bag ice!"

"Battered lucky chew whiff, sweat hard," setter bloat-Thursday woof, wetter wicket small honors phase.

"O Grammar, water bag noise! A nervous sore suture anomolous prognosis!"

"Battered small your whiff, doling," whiskered dole woof, ants mouse worse waddling.

"O Grammar, water bag mouser gut! A nervous sore suture bag mouse!"

Daze worry on-forger-nut ladle gull's lest warts. Oil offer sodden, caking offer carvers an sprinkling otter bet, disk hoard hoarded woof lipped own pore Ladle Rat Rotten Hut an garbled erupt.

Mural: Yonder nor sorghum stenches shut ladle gulls stopper torque wet strainers.

The same Fun With Words page also then references "mondegreens" (another new word!), which are misheard lyrics.

The term mondegreen was originally coined by author Sylvia Wright, and has come to be quite widely used. As a child, Wright heard the lyrics of The Bonny Earl of Murray (a Scottish ballad) as:

Ye highlands and ye lowlands
Oh where hae you been?
Thou hae slay the Earl of Murray
And Lady Mondegreen

It eventually transpired that Lady Mondegreen existed only in the mind of Sylvia Wright, for the actual lyrics said that they "slay the Earl of Murray and laid him on the green." And to this day Lady Mondegreen's name has been used to describe all mishearings of this type!

You see these a lot on the web, when people are writing down the lyrics to their favorite songs. I remember stumbling across this one. The song is Natasha Bedingfield's "These Words". The verse goes:

Read some Byron, Shelley and Keats,
recited it over a hip-hop beat
I'm havin trouble sayin what i mean,
with dead poets and a drum machine

But the first version I found online (on some poor girl's journal) was:

Written by Ricelli and Keys
Resided in over a heartbeat
I'm having trouble saying what I mean
With dead poets and drum machines

And now I think I better get back to sending my friend that email message!

It's early morning and the buzz on Google's Knol is already building fast. I'm not going to rehash what others are saying; you can go read them for yourself.

Google calls a "knol" a unit of knowledge (this from the people who misspelled "googol"). Google says the goal is to "find a way to help people share their knowledge", and Google Knol is the place where they can do that; as authors, contributors and commenters. Everyone has jumped on this and said it's a Wikipedia competitor, and maybe in the long run that is true, but that ignores an important distinction; Knol is focused on highlighting authors. Google calls this the "key idea", and I think they are absolutely right.

Wikipedia leverages the wisdom of the crowd to build collaborative articles. It relies on multiple authors, many eyes, consensus and majority rule to get accuracy. In a lot of cases that works well. However, it suffers from all the usual problems of a democratic system. Backroom deals can skew the results. Controversial subjects can require special protection, which gives more control to the editors. And majority rule can stifle new ideas or legitimate criticism. Again, those in control of the overall system can exercise a great deal of power that isn't especially visible to the outside world. And of course, sometimes the things which "everybody knows" aren't always correct.

If Wikipedia is a communist democracy (and I mean that in a completely positive sense, you can't truly have the former without the latter), then Google Knol is a meritocracy. The key is something that has been talked about in social networking circles for several years. Knol depends on reputation. The author of the article is prominent. You see everything else they have written. You see what their peers think of them (and who their peers are). You see what commenters have said about them. Knol is blogging with a focus, an attempt to move beyond general topic pundits and bring in the specialists. The author of the article is a known identity which can be tied to other articles in the past and future. (Note that I don't say "person". It could be a group, and of course we don't necessarily have to know the physical identity. The key behind combining reputation and anonymity is the concept of a long term identity. In the ideal system, nobody knows that you're a dog, but they know that you're the same dog.) Knol attempts to ensure accuracy by assuming that a persistent identity (e.g. your Google account) will encourage you to try and maintain a good reputation. Your reputation in turn depends on how much support you can garner from your peers, contributors and commenters.

The usual problems with online reputation systems apply here of course. Online identities can be discarded when they become tarnished. In some cases that's a feature--there are certainly aspects of my past I'd love to discard that easily--but if identities are easy to come by it weakens the power of reputation. That is countered, however, by the fact that it takes time to build a reputation, and discarded articles don't drive traffic. More worrisome is the degree to which people can game the system by creating multiple identities that work together to build a buzz and the appearance of consensus. But then, Wikipedia has the same weakness. Even real world systems are susceptible to fake groundswells.

In general, I think the idea has a lot of merit, and it's likely to result in a lot of in-depth and well organized articles (the current Knol screen shot shows a very professional looking page--much nicer than your typical wiki). The big question is whether it will gain the breadth that Wikipedia has, and how it will evolve over time. Who maintains articles when the author loses interest (or dies)? People can make contributions and comments, but they aren't directly editing (or will it allow edits, but with publication under control of the author?). I keep coming back to the first Wikipedia edit my daughter made. She was writing an article on the Oregon Trail (a mock tourism brochure, actually) and in the course of her research she discovered the Wikipedia had the length of the trail wrong--so she fixed it. How easy (and immediate) would that process be using Knol? And what happens when over time there are thirty different articles on the same subject? Have we just recreated the web? (Well, at least we know it won't do away with the need for Google's search engine :-).

Like most Google projects, Knol is starting out on an invitation basis, although in this case I suspect invitations will be a bit harder to get than usual. The initial focus will probably be more on quality than quantity. I think the idea of a reputation-based system, and the appeal of an author-centric system, will make it successful, but I don't see it replacing the Wikipedia. If anything, I think merging the two concepts would make more sense: combining both authored and crowd-sourced systems into a single repository. It seems unlikely that Wikipedia would do anything so drastically different, and starting a "new" Wikipedia would be hard for anyone to do, so unfortunately it's not likely to happen. I guess we'll all have to get used to searching two locations and sending our edits to two different sites.


Definition: Buffer Overflow


If you read any press on computer security problems, at some point you are likely to come across the phrase "Buffer Overflow"--it's by far the most common security error that programmers make. It's common for several reasons.

  • It has nothing to do (by itself) with security.
  • It's an easy error to make, and a hard one to detect.
  • It's human nature not to expect the unexpected.

So what is a buffer overflow? I'll start off extremely non-technical here, and gradually bump up the level until the final section, at which point if you don't understand programming and call stacks you may want to stop reading, and if you do understand them, you may decide to start reading.

First, here's the non-technical explanation.

You need to tell a co-worker something important, so you go to their office, expecting a conversation something like this:

"Hello."
"Hi."
"I thought you should know about this new thing."
"Oh? What is it?"
You tell them the important thing.

Instead the conversation goes like this:

"Hello."
"Hey! Just the person I wanted to see! Did you hear about this crazy election thing,"...followed by five minutes of political diatribe. By the end of the conversation, not only have you forgotten what you came in to say, you're on the way out the door with a poster to protest something.

Your buffer just overflowed, and you were hijacked for a purpose other than your original intent. You had an expectation of how the conversation would go (the protocol) and it was violated, with the result that you ended up doing something different. That's exactly what happens to a program when someone exploits a buffer-overflow problem.

Now a slightly more technical explanation.

When a program is designed, it is designed with an interface to the outside world. That interface is not just what you see on the screen, but also how it communicates with other programs and the operating system. The interface is typically defined in terms of either an API (a set of programming conventions for direct communication with another piece of code) or a protocol (a definition of a set of data and commands to be passed between programs). Think of the API as how your brain tells your arm to pick something up, the protocol as how you ask someone to pass the salt. Of course the protocols are not always executed directly. Your brain tends to use the mouth API to tell someone to pass the salt, rather than using telepathy directly, and many programs use standard sets of code provided by the operating system when they want to use a protocol.

Now, these APIs and protocols specify the form of the information to be passed back and forth. For instance, a specification might say that the correct response to an initial communication is no more than five letters long (e.g. "Hello"). In the days before people had to worry about hostile programs, code was written assuming that the program you were talking to was going to be following the rules of the protocol. If the protocol said "five letters" then there wasn't a lot of point in leaving room for six. Sure, your program might crash if there were six, but it wasn't your bug, it was a bug in the program talking to you--it should have sent five letters.

So that's a buffer overflow. You expect one thing, and somebody sends you something much bigger. The "buffer" that you had set aside to store that information doesn't have room for what you get, and you end up writing those six (or six hundred) letters on top of other things that you were trying to remember. Obviously that's not going to be a good thing for the continued functioning of your program, but it turns out it's also a major security problem.
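Here is the "five letters" example as a small C sketch. It shows the check that the trusting code left out: measure what arrived, and refuse it if it is longer than the protocol allows. GREETING_MAX and store_greeting are names invented for this illustration, and the sketch assumes the received text is a normal NUL-terminated C string.

```c
#include <assert.h>
#include <string.h>

#define GREETING_MAX 5   /* the protocol says "no more than five letters" */

/* Copies a greeting into a fixed-size buffer, but only if it fits.
   Returns 0 on success, -1 if the sender broke the protocol. */
int store_greeting(char dest[GREETING_MAX + 1], const char *received)
{
    assert(dest != NULL && received != NULL);
    size_t len = strlen(received);
    if (len > GREETING_MAX)
        return -1;                    /* too long: refuse, don't overflow */
    memcpy(dest, received, len + 1);  /* the letters plus the closing NUL */
    return 0;
}
```

With this check, "Hello" is stored and "Hello!" is rejected with the buffer left alone. Take the length test out and the sixth letter lands on whatever happens to sit next to the buffer in memory.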

And still a bit more technical.

Computers tend to think in terms of two things--code and data. Code consists of the instructions for the computer, telling it what to do. Data is what it does it to and with. When you run a program, it loads into memory both the code and the data that code needs. When that program communicates with some other program, it is receiving data, and it will then use the code that it already has to figure out what to do next. This makes remote communication relatively safe. The remote program can only tell the local program to do things within the constraints of the original code. Assuming nobody has done anything stupid (which is not generally a good assumption), the remote program cannot tell the local program to do anything that wasn't originally intended.

Modern computer architectures have an unfortunate design, however. They don't really know the difference between data and code. If somebody can convince your program to try running the data that it has in memory, it will do so quite happily. So a malicious program has two goals. First it wants to get some code to your machine, and then it wants to persuade somebody to run it. This is, of course, no different than an email virus writer's goal. In that case, they expect you to run it; in the case of a buffer overflow, they expect the broken program to run it. Email viruses are so successful because users often don't know the difference between data and code either (and some operating systems helpfully try to hide the difference so as not to confuse them).

It turns out that if a malicious programmer can find a target program that didn't check for a buffer overflow, it can be very trivial to get that program to execute code provided by the remote program. So easy, in fact, that there are standard packages out there that provide the entire payload for the overflow--all the script kiddie (we'll define that sometime, but suffice to say it isn't a compliment of someone's hacking prowess) has to do is find the right length for the buffer overflow and bang--they have control of your computer.

Before you panic, remember that doing this requires that they have remote access to a program on your computer already, and that that program have a buffer overflow problem. That means (for an internet exploit) that your computer has to have some program that is listening to external connections (e.g. print server, file sharing...) or that you have a malicious user at your computer (or you helpfully downloaded and ran their software).

Now let's get completely technical.

How does a buffer overflow exploit work from a programmer's perspective?

First you find some place in that program where it's reading data and assuming that it's going to be reading something rational. E.g.

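        char    buf[4];         /* Room for only 4 bytes */
        /* Deliberately unsafe: gets() cannot be told how big buf is
           (it was removed from the C language in C11 for that reason) */
        gets(buf);              /* Read any number of characters from the input
                                   and put them in buf */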

where the input turns out to be more than 4 characters long.
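For contrast, here is a sketch of the same read done with a length limit, using the standard fgets call. The wrapper read_field is a hypothetical helper written for this example, not part of any library.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Reads at most size-1 characters of one line from `in` into buf.
   fgets never writes more than `size` bytes, so extra input is left
   unread instead of spilling past the end of buf.
   Returns the number of characters stored, or -1 on EOF or error. */
int read_field(char *buf, int size, FILE *in)
{
    assert(buf != NULL && size > 0 && in != NULL);
    if (fgets(buf, size, in) == NULL)
        return -1;
    buf[strcspn(buf, "\n")] = '\0';   /* drop the newline, if one was read */
    return (int)strlen(buf);
}
```

Called with the 4-byte buf above, this stores at most three characters plus the terminating NUL, no matter how much the other side sends. The rest of the input simply stays in the stream.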

Now the question is, where is the data stored in "buf" located?

If "buf" is a global variable, then that data is probably allocated in a data segment somewhere, and you're going to try and overwrite some other piece of data which will result in something useful (e.g. a place where the program was going to execute one program, now executes another). That's tricky and hard to do without source code.

However "buf" is probably a local variable, allocated on the stack. So instead of overwriting data, your goal is to overwrite the stack itself. So you are going to put in buf some amount of padding (that will overwrite the rest of the data stored on the stack), followed by a new value for the function's saved return address, one that points at some machine code you also supply. You'll set things up so that your code will be executed (possibly when this particular function returns) instead of the code that normally would have been executed. Now you're home free. Since there are plenty of examples of sample exploit machine code, all you need to do when you find a new buffer overflow is figure out the appropriate offset--the rest of the work has been done already. You don't need to transfer very much data, just enough to run something that connects you to the remote machine--from there you can transfer the rest of the software you want to install remotely.
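To make the layout concrete, here is a toy model in C. It simulates a stack frame inside an ordinary byte array, so the program stays well-defined and nothing real gets smashed. The frame size, the offsets and every name here are invented for the illustration; real frames depend on the compiler, the processor and the calling convention.

```c
#include <assert.h>
#include <string.h>

#define FRAME_SIZE  16
#define BUF_OFFSET   0   /* where the 4-byte "buf" lives in the frame      */
#define BUF_SIZE     4   /* what the programmer expected to receive        */
#define RET_OFFSET   8   /* where the pretend return address is stored     */

/* A toy stack frame: a local buffer, some other locals, then the
   saved return address, all in one contiguous block of memory. */
struct toy_frame {
    unsigned char bytes[FRAME_SIZE];
};

/* Stands in for gets(): copies the input into the buffer slot without
   comparing its length against BUF_SIZE. */
void unchecked_copy(struct toy_frame *f, const unsigned char *input, size_t len)
{
    assert(BUF_OFFSET + len <= FRAME_SIZE);  /* keeps the demo itself in bounds */
    memcpy(f->bytes + BUF_OFFSET, input, len);
}

/* Reads back the one-byte pretend return address. */
unsigned char saved_return(const struct toy_frame *f)
{
    return f->bytes[RET_OFFSET];
}
```

Feed unchecked_copy three bytes and saved_return is untouched. Feed it eight bytes of padding followed by one more byte, and that ninth byte is now the "return address". The number eight is the offset the attacker has to figure out.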

This is where security-by-obscurity comes in handy. Want to lessen the chance of buffer-overflow attacks? Just run some obscure piece of hardware. Run a Mac, or even Linux on the PowerPC. (Of course with Apple switching to an Intel platform, some of that obscurity goes away, but exploits still have to vary from operating system to operating system, even if the underlying processor is the same.) It's not that there aren't buffer-overflow problems, but there are fewer handy examples of how to exploit them running around. Fewer examples, fewer successful attacks. It's not a solution of course (especially if everyone does it :-), but it is one way to slightly increase your odds of remaining secure.

There are machine/OS architectures that would make buffer overflows much harder to exploit. Disable dynamic creation and execution of code on the stack for one. Or keep a separate data stack. And there are tools out there which will put watchdog data on the stack, and then watch it to make sure it doesn't get overwritten (effective, but rather painful from a performance standpoint). But fundamentally, where there are bugs, there are exploits. And modern software, with its layers and layers of abstraction that no one person can fully grok, has a hell of a lot of bugs.
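The "watchdog data" idea can be sketched with another toy frame in C. As before, the frame is simulated in a plain byte array, and the layout, the marker value and the function names are all assumptions made up for the example rather than how any particular tool does it.

```c
#include <assert.h>
#include <string.h>

#define FRAME_SIZE    8
#define BUF_SIZE      4     /* bytes 0..3: the local buffer              */
#define CANARY_OFFSET 4     /* byte 4: the watchdog value                */
#define RET_OFFSET    5     /* byte 5: the pretend return address        */
#define CANARY        0xA5  /* arbitrary marker chosen for this example  */

/* Sets up the frame the way a protected function would on entry. */
void frame_init(unsigned char frame[FRAME_SIZE], unsigned char ret)
{
    memset(frame, 0, FRAME_SIZE);
    frame[CANARY_OFFSET] = CANARY;
    frame[RET_OFFSET] = ret;
}

/* The buggy code: writes into the buffer without checking BUF_SIZE. */
void buggy_write(unsigned char frame[FRAME_SIZE],
                 const unsigned char *in, size_t len)
{
    assert(len <= FRAME_SIZE);   /* keeps the demo itself in bounds */
    memcpy(frame, in, len);
}

/* Checked just before "returning": 1 if the watchdog is untouched. */
int canary_intact(const unsigned char frame[FRAME_SIZE])
{
    return frame[CANARY_OFFSET] == CANARY;
}
```

An overflow that reaches the return address has to run over the watchdog byte on the way, so the check fails and the program can stop instead of jumping to the attacker's address. A fixed marker like the one here could simply be written back by the attacker, which is why real tools typically pick the value at run time.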

About this Archive

This page is an archive of recent entries in the Definitions category.


About Me

I'm the CEO/CTO of Somewhere, Inc., a company building a unified social networking layer that gives people the means to track their friends across multiple social networks.
Creative Commons License
This blog is licensed under a Creative Commons License.
