Thursday 23 January 2014

Removing Annotations from the Brown Corpus with Sed

The Brown corpus is a collection of texts on various topics amounting to around 1 million words, in which each word is annotated with its part of speech (POS). The following is an excerpt from it:

 The/at Fulton/np-tl County/nn-tl Grand/jj-tl Jury/nn-tl said/vbd Friday/nr an/at 
investigation/nn of/in Atlanta's/np$ recent/jj primary/nn election/nn produced/vbd
``/`` no/at evidence/nn ''/'' that/cs any/dti irregularities/nns took/vbd place/nn ./.

You can find a copy of the Brown corpus on this page. To get a copy of the Brown corpus with tags removed, you can get it here.

Each word is followed by a "/" character and then its part-of-speech tag, e.g. "The" is an article, which is abbreviated to "at". The full list of abbreviations can be seen here. I would like to strip off all the POS tags though, so I can use the text on its own. My main motivation is to get a wide selection of texts on several topics to use as a test set for computing the perplexity of language models. The Brown corpus seems to fit the bill.

Stripping the Tags

I am going to use sed to do all the work here, based on the observation that each word has a slash ("/") in the middle, followed by some more characters. To express this as a regular expression, we'll use the following:

[^ ]*/[^ ]*

This regular expression (RE) will match a sequence of characters (not including space), followed by a slash, followed by another sequence of characters (not including space). Each time we match this, we want to keep only the first part of the match, i.e. the part before the slash. In sed this is done like so:
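To see what the RE picks up before handing it to sed, here is a small sketch using grep. It assumes a grep that supports the -o option (GNU and BSD grep both do), and the sample line is simply cut from the excerpt at the top of the post.

```shell
# Sample line taken from the excerpt above (shortened for illustration).
line='The/at Fulton/np-tl County/nn-tl Grand/jj-tl Jury/nn-tl said/vbd'

# -o prints each match on its own line, so every word/tag pair
# that the RE matches shows up separately.
matches=$(printf '%s\n' "$line" | grep -o '[^ ]*/[^ ]*')
printf '%s\n' "$matches"
```

Each word/tag pair should come out on its own line, which confirms that the RE matches a whole token and nothing more.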

s_\([^ ]*\)/[^ ]*_\1_g

We have put the RE into sed's "s/a/b/g" construct, which tells sed to match "a" and replace every match with "b" (the trailing "g" applies the substitution to every match on a line, not just the first). Note that the delimiter does not have to be a forward slash: almost any other character can be used in that spot, and the example above uses "_" so that the literal slash in the RE needs no escaping. The next sed feature we have used is grouping with "\( \)", which tells sed to remember the part of the match that happened inside the brackets so that it can be referred to as "\1" in the replacement. This tells sed to match the whole RE, but replace it with only the first bit.
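As a quick illustration of the delimiter and the "\1" group, the sketch below runs the same substitution both ways on a short made-up line. The last command, which keeps the tag instead of the word by using a second group "\2", is my own variation and not something needed for the corpus.

```shell
sample='The/at jury/nn said/vbd'

# With "/" as the delimiter, the literal slash in the RE must be escaped as "\/".
with_slash=$(echo "$sample" | sed 's/\([^ ]*\)\/[^ ]*/\1/g')

# With "_" as the delimiter, the slash needs no escaping.
with_underscore=$(echo "$sample" | sed 's_\([^ ]*\)/[^ ]*_\1_g')

# Variation: capture both halves and keep the second group (the tag) instead.
tags_only=$(echo "$sample" | sed 's_\([^ ]*\)/\([^ ]*\)_\2_g')

echo "$with_slash"
echo "$with_underscore"
echo "$tags_only"
```

The first two commands should print the same thing; only the readability of the expression differs.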

If we have a file called "temp.txt" with the following text in it:

The/at jury/nn further/rbr said/vbd in/in term-end/nn presentments/nns that/cs the/at 
City/nn-tl Executive/jj-tl Committee/nn-tl ,/, which/wdt had/hvd over-all/jj charge/nn of/in
the/at election/nn ,/, ``/`` deserves/vbz the/at praise/nn and/cc thanks/nns of/in the/at
City/nn-tl of/in-tl Atlanta/np-tl ''/'' for/in the/at manner/nn in/in which/wdt the/at
election/nn was/bedz conducted/vbn ./.

we'll want to strip out the tags using sed like this:

sed 's_\([^ ]*\)/[^ ]*_\1_g' temp.txt > newtemp.txt

and we are left with:

The jury further said in term-end presentments that the City Executive Committee 
, which had over-all charge of the election , `` deserves the praise and thanks of the
City of Atlanta '' for the manner in which the election was conducted .
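A quick sanity check can be done along these lines. This is only a sketch and not part of the steps above: it works in a scratch directory created with mktemp, uses a one-line sample instead of the real file, and assumes that stripping the tags should leave the number of whitespace-separated tokens unchanged.

```shell
# Work in a scratch directory so no real file is touched.
workdir=$(mktemp -d)
printf '%s\n' 'The/at jury/nn further/rbr said/vbd in/in term-end/nn presentments/nns' > "$workdir/temp.txt"

sed 's_\([^ ]*\)/[^ ]*_\1_g' "$workdir/temp.txt" > "$workdir/newtemp.txt"

# Stripping the tags should not change the number of tokens.
before=$(wc -w < "$workdir/temp.txt" | tr -d ' ')
after=$(wc -w < "$workdir/newtemp.txt" | tr -d ' ')

# Count the lines that still contain a slash (grep exits non-zero when
# nothing matches, hence the "|| true").
leftover=$(grep -c '/' "$workdir/newtemp.txt" || true)

echo "tokens before: $before, after: $after, lines with a slash left: $leftover"
```

On the real corpus a few lines may still contain a slash after stripping, for example where the word itself includes one, so a non-zero count there is not necessarily a problem.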

Applying it to Multiple Files

The Brown corpus contains 500 files, and we want to apply our sed expression to all of them. To do this we'll use bash.

# Strip the POS tags from every file whose name starts with "c"
for fname in c*; do
    sed -i 's_\([^ ]*\)/[^ ]*_\1_g' "$fname"
done

The script above loops over all the files in the Brown corpus whose names begin with the letter c, then uses sed to apply the substitution to each file. The '-i' flag tells sed to edit the file in-place, whereas usually it would print the results to standard output. Note that this form of '-i' is the GNU sed one; the BSD sed shipped with macOS expects a backup suffix argument after '-i'.
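Since an in-place edit overwrites the tagged files, it may be worth keeping backups. The sketch below is one way to do that, assuming a sed that accepts a suffix attached to the flag as in '-i.bak' (both GNU and BSD sed do). The scratch directory and the two sample files are made up for the demonstration.

```shell
# Build a scratch directory with two small stand-in files.
workdir=$(mktemp -d)
printf '%s\n' 'The/at jury/nn said/vbd' > "$workdir/ca01"
printf '%s\n' 'election/nn was/bedz conducted/vbn ./.' > "$workdir/ca02"

for fname in "$workdir"/c*; do
    # -i.bak edits in place and keeps the original as "$fname.bak".
    sed -i.bak 's_\([^ ]*\)/[^ ]*_\1_g' "$fname"
done

# Show the stripped files and the directory contents, backups included.
cat "$workdir/ca01" "$workdir/ca02"
ls "$workdir"
```

The glob is expanded once before the loop starts, so the freshly created .bak files are not processed in the same run. They would be picked up by "c*" on a second run, so move or delete them before running the loop again.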

