Friday, 6 August 2010
I'm working towards a PhD in computational linguistics. Many of you have asked me for more detail about my studies... and now seems like a good time, because I'm having some problems that you can help with.
I'm a linguist at heart, with a background in pragmatics: the study of what people really mean by what they say. It was almost by accident that I ended up in a computer science department for my doctorate, but one major advantage of a computational approach is that it's comparatively easy to study a vast amount of data in a short time. And there's a nice dataset of email to work with, in the form of the Enron corpus (hundreds of thousands of emails subpoenaed when the company was investigated, and subsequently released into the public domain for research).
However, it's really hard to find good-quality, human-annotated language data. For what I'm interested in looking at, the data simply doesn't exist yet. And yet without hand-crafted examples of data, it's hard to "teach" a computer how to process something, and even harder to assess how well it performs.
Hence, a new kind of experiment (for me, at least).
I'd like to harness the power of the internet to help me pull together some sample answers. If you're a fluent English speaker with a few minutes to spare, please help me to categorise some questions. I'll then compare the answers given by different people, and try to find some kind of underlying "truth". If I get enough good answers, they will form the basis for training a computer model to perform the same task.
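For the curious, one simple way to compare answers from different people is a majority vote: for each question, take the category most people chose, and note how strongly they agreed. The sketch below is only an illustration of that idea, not necessarily the method I'll end up using; the function name and the category labels ("request", "information", "rhetorical") are made up for the example.

```python
from collections import Counter


def majority_vote(annotations):
    """Pick the most common label for each question.

    `annotations` maps each question to the list of labels that
    different people gave it. Returns a dict mapping each question to
    (label, agreement), where agreement is the fraction of people who
    chose the winning label. The label is None when the top labels tie.
    """
    results = {}
    for question, labels in annotations.items():
        if not labels:
            continue  # nobody answered this question, so skip it
        ranked = Counter(labels).most_common()
        top_label, top_count = ranked[0]
        # A tie means there is no clear majority for this question.
        tied = len(ranked) > 1 and ranked[1][1] == top_count
        results[question] = (None if tied else top_label,
                             top_count / len(labels))
    return results


# Hypothetical answers from three people (labels are invented).
sample = {
    "Could you send me the report?": ["request", "request", "information"],
    "Who knows?": ["rhetorical", "information"],
}

for question, (label, agreement) in majority_vote(sample).items():
    print(question, label, round(agreement, 2))
```

Questions where people disagree or tie are interesting in their own right: they may be the ones where the categories themselves need rethinking.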
If you can spare even ten minutes to help out, I'd really appreciate it. Please pass the link to your friends, too. And I promise to let you know how it goes.