I’m parsing a dictionary file from Princeton University’s WordNet, but the only part I’m interested in extracting from the file is the word itself. grep won’t work for this since you can only output a matching line, but not a single matching word or pattern from a file.
The file structure looks like:
{ Bigfoot, Sasquatch, legendary_creature,@i (large hairy humanoid creature said to live in wilderness areas of the United States and Canada) }
{ Demogorgon, deity,@i noun.group:Greek_mythology,;c ((Greek mythology) a mysterious and terrifying deity of the underworld) }
{ doppelganger, legendary_creature,@ (a ghostly double of a living person that haunts its living counterpart) }
{ Loch_Ness_monster, Nessie, legendary_creature,@i noun.object:Loch_Ness,#p (a large aquatic animal supposed to resemble a serpent or plesiosaur of Loch Ness in Scotland) }
{ sea_serpent, legendary_creature,@ (huge creature of the sea resembling a snake or dragon) }
The first word after the opening brace of each entry is what I’m looking for, so the output should be:
Bigfoot
Demogorgon
doppelganger
Loch_Ness_monster
sea_serpent
Let’s start with a simpler example first. Given a string “apple potato tomato”, let’s filter to print only words starting with “po”
$ echo "apple potato tomato" | sed -E -n "s/.*(po[[:alpha:]]+).*/\1/p"
potato
Breaking it down:
-E : use extended regular expressions
-n : suppress the default output of each input line
.*(po[[:alpha:]]+).* : match any characters, then a capture group () that starts with po followed by one or more alphabetic characters, then any remaining characters
\1 : replace the whole match with capture group 1
/p : print only the lines where a substitution was made
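One caveat is worth checking by hand: the leading .* is greedy, so when a line has more than one candidate, only the last one survives, and the pattern is not tied to the start of a word. The sketch below uses a made-up input string ("apple potato pony") and the hypothetical variable names last_only and all_words to show this, along with one possible workaround: split the words onto separate lines first, then anchor the pattern.

```shell
# The greedy leading .* swallows "potato", so only the last "po..." run is kept.
last_only=$(echo "apple potato pony" | sed -E -n 's/.*(po[[:alpha:]]+).*/\1/p')
echo "$last_only"   # prints: pony

# Workaround sketch: one word per line, then anchor the pattern with ^ and $.
all_words=$(echo "apple potato pony" | tr ' ' '\n' | sed -E -n 's/^(po[[:alpha:]]+)$/\1/p')
echo "$all_words"   # prints potato and pony on separate lines
```

For the dictionary file this does not matter much, since each line holds exactly one entry and the word of interest comes first.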
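Now let’s use the same approach to match just the first word on each of the dictionary file lines:
sed -E -n 's/\{[[:space:]]([[:alpha:]_]+).*/\1/p' ./dict/dbfiles/noun.person >> noun.person.parsed.txt
The character class [[:alpha:]_]+ accepts letters and any number of underscores, so multi-part words such as Loch_Ness_monster are captured whole. The opening brace is escaped because an unescaped { can be read as an interval operator in extended regex.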
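A quick way to sanity-check a pattern like this is to run it against a few sample entries before pointing it at the full file. The sketch below writes three of the entries shown earlier into a temporary file (the variable names sample and heads are just for illustration) and uses a character class that allows any number of underscores. A pattern that allows only a single underscore would stop at Loch_Ness.

```shell
# Temporary sample file with three entries copied from the excerpt above.
sample=$(mktemp)
cat > "$sample" <<'EOF'
{ Bigfoot, Sasquatch, legendary_creature,@i (large hairy humanoid creature said to live in wilderness areas of the United States and Canada) }
{ Loch_Ness_monster, Nessie, legendary_creature,@i noun.object:Loch_Ness,#p (a large aquatic animal supposed to resemble a serpent or plesiosaur of Loch Ness in Scotland) }
{ sea_serpent, legendary_creature,@ (huge creature of the sea resembling a snake or dragon) }
EOF

# \{ matches the literal opening brace; [[:alpha:]_]+ matches letters and underscores.
heads=$(sed -E -n 's/^\{[[:space:]]*([[:alpha:]_]+).*/\1/p' "$sample")
echo "$heads"
rm -f "$sample"
```

This should print Bigfoot, Loch_Ness_monster and sea_serpent, one per line. Words containing digits, hyphens or apostrophes would need a wider character class.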
GNU grep has -o, which outputs only the matching part. Also be aware that the sed back-referencing approach only sort of works: even though sed is the Stream EDitor, it still processes its input line by line and expects line breaks (LF/CR). And the behavior is not portable across all seds.
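As a rough sketch of the -o approach, assuming GNU grep (the variable names matches and first_word are only illustrative): each match is printed on its own line, so a single input line can yield several results. grep has no capture groups in its output, so for the dictionary entries the leading "{ " has to be trimmed afterwards, here with cut.

```shell
# -o prints every match on its own line, not just one per input line.
matches=$(echo "apple potato pony" | grep -o -E 'po[[:alpha:]]+')
echo "$matches"      # prints potato and pony on separate lines

# For a dictionary entry: match "{ word", then drop the first two characters.
entry='{ sea_serpent, legendary_creature,@ (huge creature of the sea resembling a snake or dragon) }'
first_word=$(echo "$entry" | grep -o -E '^\{ [[:alpha:]_]+' | cut -c 3-)
echo "$first_word"   # prints: sea_serpent
```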
If you’re working with one-line or zero-whitespace one-line JSON (both are legitimate, legal formats), you can have ONE LINE of text that’s dozens or hundreds of megabytes. A “normal” text editor could show your cursor position as Ln 1, Col 853213 and so forth. (If you don’t work with it: it’s basically a text database format.)
Take the following oneline JSON from the Cataclysm:DDA datafile I’m currently working on:
---
{ "typeid": "rifle_case_soft_2", "owner": "your_followers", "damaged": -1000, "last_temp_check": 0, "contents": { "contents": [ { "pocket_type": 0, "contents": [ { "typeid": "boots_rubber", "owner": "your_followers", "damaged": 3092, "last_temp_check": 0, "item_tags": [ "FILTHY" ] }, { "typeid": "bottle_plastic_small", "owner": "your_followers", "last_temp_check": 0, "contents": { "contents": [ { "pocket_type": 0, "contents": [ { "typeid": "vitamins", "charges": 36, "owner": "your_followers", "active": true, "last_temp_check": 5602214, "specific_energy": 97646064, "temperature": 28931350 } ], "_sealed": false, "allowed": true }, { "pocket_type": 7, "contents": [ ], "_sealed": false, "allowed": true } ], "additional_pockets": [ ] } }, { "typeid": "762_51", "charges": 635, "bday": 5391464, "owner": "your_followers", "last_temp_check": 0 },
---
grep --basic-regexp -o -e "\"damaged\":\ [0-9]\{1,4\}" example-oneline-format.json
Will output:
"damaged": 3092
(but not the '"damaged": -1000' since there’s no -\? in the regex.)
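To pick up the negative value as well, the minus sign can be made optional. This is a minimal sketch, assuming GNU grep and using a shortened, made-up one-line sample in the same shape as the data above (the variable names json and damaged are just for illustration). It switches to -E so the optional marker and the interval need no backslashes.

```shell
# Shortened, hypothetical one-line sample in the same shape as the data above.
json='{ "typeid": "rifle_case_soft_2", "damaged": -1000, "contents": [ { "typeid": "boots_rubber", "damaged": 3092 } ] }'

# -? makes the minus sign optional; {1,4} allows one to four digits.
damaged=$(printf '%s\n' "$json" | grep -o -E '"damaged": -?[0-9]{1,4}')
echo "$damaged"
```

With this sample, both "damaged": -1000 and "damaged": 3092 should be printed, each on its own line.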