In a previous post, I downloaded the Yelp dataset: 5.79GB of JSON data. Before experimenting with Apache Spark, my first thought was how to extract some basic stats from this dataset: how many data items there are, and what the records look like in each of the data files.
To get at those stats, I imported the files into MongoDB with mongoimport. Referring to the docs here, the syntax for the import is:
mongoimport -d database -c collection importfile.json
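Since the same command is repeated for every file, it can be generated in a loop. This is a minimal sketch, not what I ran: the function name print_import_commands is made up for illustration, it assumes each collection should be named after its file (tip.json becomes tip), and it only prints the commands so they can be checked before running anything.

```shell
# Print (do not run) one mongoimport command per .json file in a directory.
# Assumption: the collection name is the file name without its .json suffix.
print_import_commands() {
  local db="$1" dir="$2" file
  for file in "$dir"/*.json; do
    # Skip the literal pattern when the directory has no .json files.
    [ -e "$file" ] || continue
    echo "mongoimport -d $db -c $(basename "$file" .json) $file"
  done
}
```

Calling print_import_commands yelp ~/data/yelp lists one command per file; the lines can then be pasted into the shell one at a time.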
Here are the Yelp dataset JSON files to import, listed largest first to give an idea of the size of each file:
kev@esxi-ubuntu-mongodb1:~/data/yelp$ ls -lS
total 5657960
-rwxrwxr-x 1 kev kev 3819730722 Oct 14 16:16 review.json
-rwxrwxr-x 1 kev kev 1572537048 Oct 14 16:22 user.json
-rwxrwxr-x 1 kev kev 184892583 Oct 14 16:16 tip.json
-rwxrwxr-x 1 kev kev 132272455 Oct 14 16:03 business.json
-rwxrwxr-x 1 kev kev 60098185 Oct 14 16:03 checkin.json
-rwxrwxr-x 1 kev kev 24195971 Oct 14 16:03 photos.json
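File sizes say little about how many records to expect. A rough sketch for counting them up front, assuming each file holds one JSON document per line (I have not verified that for every file here); count_records is a hypothetical helper name:

```shell
# Count the non-empty lines in a file.
# Assumption: one JSON document per line, so this is the expected
# number of documents for the import of that file.
count_records() {
  grep -c . "$1"
}
```

Running count_records checkin.json before the import gives a number to compare against the "imported N documents" line that mongoimport prints at the end.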
So importing each of the datasets, one at a time:
kev@esxi-ubuntu-mongodb1:~/data/yelp$ mongoimport -d yelp -c checkin checkin.json
2017-10-14T16:49:35.566-0700 connected to: localhost
2017-10-14T16:49:38.564-0700 [#########...............] yelp.checkin 22.6MB/57.3MB (39.5%)
2017-10-14T16:49:44.474-0700 [########################] yelp.checkin 57.3MB/57.3MB (100.0%)
2017-10-14T16:49:44.475-0700 imported 135148 documents
kev@esxi-ubuntu-mongodb1:~/data/yelp$ mongoimport -d yelp -c business business.json
2017-10-14T16:49:59.593-0700 connected to: localhost
2017-10-14T16:50:02.592-0700 [#####...................] yelp.business 27.9MB/126MB (22.1%)
2017-10-14T16:50:12.873-0700 [########################] yelp.business 126MB/126MB (100.0%)
2017-10-14T16:50:12.873-0700 imported 156639 documents
kev@esxi-ubuntu-mongodb1:~/data/yelp$ mongoimport -d yelp -c tip tip.json
2017-10-14T16:50:38.061-0700 connected to: localhost
2017-10-14T16:50:41.058-0700 [##......................] yelp.tip 17.5MB/176MB (9.9%)
2017-10-14T16:51:07.381-0700 [########################] yelp.tip 176MB/176MB (100.0%)
2017-10-14T16:51:07.381-0700 imported 1028802 documents
kev@esxi-ubuntu-mongodb1:~/data/yelp$ mongoimport -d yelp -c user user.json
2017-10-14T16:51:28.648-0700 connected to: localhost
2017-10-14T16:51:31.648-0700 [........................] yelp.user 36.9MB/1.46GB (2.5%)
2017-10-14T16:54:15.907-0700 [########################] yelp.user 1.46GB/1.46GB (100.0%)
2017-10-14T16:54:15.907-0700 imported 1183362 documents
kev@esxi-ubuntu-mongodb1:~/data/yelp$ mongoimport -d yelp -c review review.json
2017-10-14T16:57:01.018-0700 connected to: localhost
2017-10-14T16:57:04.016-0700 [........................] yelp.review 34.9MB/3.56GB (1.0%)
2017-10-14T17:02:31.967-0700 [########################] yelp.review 3.56GB/3.56GB (100.0%)
2017-10-14T17:02:31.967-0700 imported 4736897 documents
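The "imported N documents" lines already answer the first question, how many data items there are. A small sketch for totaling them, assuming the mongoimport output was saved to a file (import.log is a made-up name, as is the helper sum_imported):

```shell
# Sum the counts from mongoimport's "imported N documents" lines on stdin.
# The count is the second-to-last field on each matching line.
sum_imported() {
  awk '/imported [0-9]+ documents/ { total += $(NF-1) } END { print total + 0 }'
}
```

Used as sum_imported < import.log. Fed the five "imported" lines above, it prints 7240848.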
Done! Almost 6GB of data imported to MongoDB. Now, time for some queries!