I’m parsing a dictionary file from Princeton University’s WordNet, but the only part I want to extract from each entry is the word itself. grep on its own isn’t a good fit for this: by default it prints the whole matching line, and even with -o it prints the entire match rather than a single capture group inside the pattern.
The file structure looks like:
{ Bigfoot, Sasquatch, legendary_creature,@i (large hairy humanoid creature said to live in wilderness areas of the United States and Canada) }
{ Demogorgon, deity,@i noun.group:Greek_mythology,;c ((Greek mythology) a mysterious and terrifying deity of the underworld) }
{ doppelganger, legendary_creature,@ (a ghostly double of a living person that haunts its living counterpart) }
{ Loch_Ness_monster, Nessie, legendary_creature,@i noun.object:Loch_Ness,#p (a large aquatic animal supposed to resemble a serpent or plesiosaur of Loch Ness in Scotland) }
{ sea_serpent, legendary_creature,@ (huge creature of the sea resembling a snake or dragon) }
The first word after the opening brace on each line is what I’m looking for, so the output should be:
Bigfoot
Demogorgon
doppelganger
Loch_Ness_monster
sea_serpent
Let’s start with a simpler example. Given the string “apple potato tomato”, let’s print only the word starting with “po”:
$ echo "apple potato tomato" | sed -E -n "s/.*(po[[:alpha:]]+).*/\1/p"
potato
Breaking it down:
-E : use extended regular expressions
-n : suppress the automatic printing of each input line
.*(po[[:alpha:]]+).* : match any characters, then a capture group () holding "po" followed by one or more alphabetic characters, then any remaining characters
\1 : replace the whole match with capture group 1
/p : print only the lines where a substitution was made
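To see how those pieces interact, here is a small sketch that wraps the same expression in a shell function and runs it on three inputs. The function name extract_po is made up for illustration; the sed expression is the one from the example.

```shell
# Hypothetical helper: print the captured "po..." word from one line of text.
extract_po() {
  printf '%s\n' "$1" | sed -E -n 's/.*(po[[:alpha:]]+).*/\1/p'
}

extract_po "apple potato tomato"   # prints: potato
extract_po "apple tomato"          # prints nothing: no match, and -n hides the line
extract_po "potato pomelo"         # prints: pomelo (the leading .* is greedy)
```

The last call is worth noting: because the leading .* grabs as much as it can, this approach returns only the last candidate on a line, which is fine when each line holds one word of interest.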
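Now let’s use the same approach to match just the first word on each line of the dictionary file. The capture group is [[:alpha:]_]+ so that names containing several underscores, such as Loch_Ness_monster, are captured whole, and the opening brace is escaped so it is treated as a literal character:
sed -E -n "s/^\{[[:space:]]*([[:alpha:]_]+).*/\1/p" ./dict/dbfiles/noun.person >> noun.person.parsed.txt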
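Before running this over the whole file, it can help to try the expression on the sample entries shown earlier. The sketch below is one way to do that; first_words is a hypothetical helper name, and the expression assumes each entry starts at the beginning of a line with an opening brace and that words consist only of letters and underscores.

```shell
# Hypothetical helper: read WordNet-style entries on stdin, print each first word.
first_words() {
  sed -E -n 's/^\{[[:space:]]*([[:alpha:]_]+).*/\1/p'
}

# Feed two of the sample entries through the helper via a here-document.
first_words <<'EOF'
{ Bigfoot, Sasquatch, legendary_creature,@i (large hairy humanoid creature said to live in wilderness areas of the United States and Canada) }
{ Loch_Ness_monster, Nessie, legendary_creature,@i noun.object:Loch_Ness,#p (a large aquatic animal supposed to resemble a serpent or plesiosaur of Loch Ness in Scotland) }
EOF
# prints:
# Bigfoot
# Loch_Ness_monster
```

Entries whose first word contains digits, hyphens, or apostrophes would be cut short by this character class, so the class may need widening for other WordNet files.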