Using sed to filter a file, outputting matching patterns only

I’m parsing a dictionary file from Princeton University’s WordNet, but the only part I’m interested in extracting from the file is the word itself. grep alone won’t do for this: it prints whole matching lines (or, with -o, the entire matched text), but it cannot print just one captured group from a match.

The file structure looks like:

{ Bigfoot, Sasquatch, legendary_creature,@i (large hairy humanoid creature said to live in wilderness areas of the United States and Canada) }
{ Demogorgon, deity,@i noun.group:Greek_mythology,;c ((Greek mythology) a mysterious and terrifying deity of the underworld) }
{ doppelganger, legendary_creature,@ (a ghostly double of a living person that haunts its living counterpart) }
{ Loch_Ness_monster, Nessie, legendary_creature,@i noun.object:Loch_Ness,#p (a large aquatic animal supposed to resemble a serpent or plesiosaur of Loch Ness in Scotland) }
{ sea_serpent, legendary_creature,@ (huge creature of the sea resembling a snake or dragon) }

The first word on each line is what I’m looking for, so the output should be:

Bigfoot
Demogorgon
doppelganger
Loch_Ness_monster
sea_serpent

Let’s start with a simpler example first. Given the string “apple potato tomato”, let’s print only the word starting with “po”:

$ echo "apple potato tomato" | sed -E -n "s/.*(po[[:alpha:]]+).*/\1/p"
potato

Breaking it down:

-E : use extended regular expressions
-n : suppress the automatic printing of each input line
.*(po[[:alpha:]]+).*
- match any characters, then a capture group () containing po followed by one or more alphabetic characters, then any remaining characters
\1 - replace the whole match with capture group 1
/p - print the line only if a substitution was made
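
To see what -n and the p flag contribute together, here is a small sketch. The three-line input is made up for illustration; only its first and third lines contain a word starting with "po".

```shell
# Hypothetical input: the middle line has no word starting with "po".
input='apple potato tomato
banana cherry
pear polenta'

# With -n and the p flag, only lines where the substitution succeeded are printed.
matches=$(printf '%s\n' "$input" | sed -E -n 's/.*(po[[:alpha:]]+).*/\1/p')
printf '%s\n' "$matches"

# Without -n (and without p), every line is printed once, substituted or not.
all=$(printf '%s\n' "$input" | sed -E 's/.*(po[[:alpha:]]+).*/\1/')
printf '%s\n' "$all"
```

The first command prints two lines and the second prints three, because the unmatched "banana cherry" line passes through unchanged when -n is absent.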

Now let’s use the same approach to match just the first word on each line of the dictionary file:

sed -E -n 's/^\{[[:space:]]*([[:alpha:]_]+).*/\1/p' ./dict/dbfiles/noun.person >> noun.person.parsed.txt
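
A quick sanity check against the sample lines shown earlier, as a sketch rather than a full WordNet parser. The pattern here uses the character class [[:alpha:]_]+ so that words with more than one underscore, such as Loch_Ness_monster, are captured whole. The glosses are shortened, and it assumes words contain only letters and underscores, which may not hold for every entry in the real files.

```shell
# Sample lines adapted from the excerpt above (glosses shortened for brevity).
sample='{ Bigfoot, Sasquatch, legendary_creature,@i (large hairy humanoid creature) }
{ doppelganger, legendary_creature,@ (a ghostly double of a living person) }
{ Loch_Ness_monster, Nessie, legendary_creature,@i noun.object:Loch_Ness,#p (a large aquatic animal) }
{ sea_serpent, legendary_creature,@ (huge creature of the sea) }'

# Capture the first run of letters and underscores after the opening brace.
words=$(printf '%s\n' "$sample" | sed -E -n 's/^\{[[:space:]]*([[:alpha:]_]+).*/\1/p')
printf '%s\n' "$words"
```

Escaping the brace as \{ keeps it a literal character in extended regex mode, and anchoring with ^ avoids matching a brace later in the line.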

Combining find and grep

Quick note to remember this syntax as every few months I find a need to do a grep within a number of files:

find . -name "pattern" -exec grep "pattern" {} \;

Grep options:
-H print filename in results
-n print line number where match was found
-l print only the names of files that contain a match, stopping at the first match in each file (useful if you want to find files containing a match but don’t care how many matches are in each file)
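
A small sketch of why -H matters here: with -exec ... \; grep is handed one file at a time, so it would not print the filename by default. The temporary directory and the file name notes.txt are made up for the demonstration.

```shell
# Throwaway directory with one hypothetical file.
demo=$(mktemp -d)
printf 'first line\nneedle here\n' > "$demo/notes.txt"

# -H adds the file name and -n the line number to each match.
out=$(find "$demo" -name "*.txt" -exec grep -Hn "needle" {} \;)
printf '%s\n' "$out"

rm -rf "$demo"
```

Each result has the form path:line-number:matching-line, so the match above is reported on line 2 of notes.txt.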

Pipe to wc -l to count the matching files, e.g.:

find . -name "pattern" -exec grep -l "pattern" {} \; | wc -l
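
The counting pipeline can be tried safely in a scratch directory. This is a minimal sketch; the file names and the TODO search string are invented for the example.

```shell
# Throwaway directory with three hypothetical files.
demo=$(mktemp -d)
printf 'TODO: fix this\nTODO: and this\n' > "$demo/a.txt"
printf 'nothing to see here\n' > "$demo/b.txt"
printf 'TODO: ignored, wrong extension\n' > "$demo/c.log"

# -l prints each matching file name once, even though a.txt has two matching lines.
files=$(find "$demo" -name "*.txt" -exec grep -l "TODO" {} \;)
printf '%s\n' "$files"

# Count the matching files; tr strips the padding some wc implementations add.
count=$(find "$demo" -name "*.txt" -exec grep -l "TODO" {} \; | wc -l | tr -d ' ')
printf '%s\n' "$count"

rm -rf "$demo"
```

Only a.txt is counted: b.txt has no match and c.log is excluded by the -name filter. Ending -exec with + instead of \; passes many files to a single grep invocation, which is usually faster on large trees.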

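grep already accepts basic regular expressions in the pattern. Use grep -E (the modern spelling of egrep) if you need extended regex features such as alternation.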
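
As a small illustration with made-up input, extended syntax lets alternation and grouping be written without backslashes:

```shell
# -E enables extended regular expressions, so ( | ) work unescaped.
hits=$(printf 'cat\ndog\nbird\n' | grep -E '^(cat|dog)$')
printf '%s\n' "$hits"
```

The same -E flag can be added to the grep inside a find -exec command.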