Loading the Yelp dataset into MongoDB

In a previous post, I downloaded the Yelp dataset: 5.79GB of JSON data. My first thought, before I get to experimenting with Apache Spark, was how to extract some basic stats from this dataset, such as how many data items there are and what the records look like in each of the data files. Loading the files into MongoDB is one way to find out.

Using mongoimport and referring to the docs here, the syntax for the import is:

mongoimport -d database -c collection importfile.json
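To make the argument order concrete, here is a minimal Python sketch that builds that command line for a given file. The helper names `mongoimport_args` and `run_import` are made up for illustration, and naming the collection after the file (so `checkin.json` goes into `checkin`) is an assumption, although it matches the imports further down.

```python
from pathlib import Path
import subprocess


def mongoimport_args(json_path, database):
    """Build the argv list for: mongoimport -d database -c collection importfile.json"""
    path = Path(json_path)
    # Assumption: the collection is named after the file, e.g. tip.json -> tip
    collection = path.stem
    return ["mongoimport", "-d", database, "-c", collection, str(path)]


def run_import(json_path, database):
    """Run the import; needs mongoimport on the PATH and a reachable mongod."""
    subprocess.run(mongoimport_args(json_path, database), check=True)


# Only print the command here, so the sketch runs without MongoDB installed.
print(" ".join(mongoimport_args("checkin.json", "yelp")))
```

Calling `run_import("checkin.json", "yelp")` would execute the same command, provided the MongoDB tools are installed and the server is running.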

Here are the Yelp dataset JSON files to import, listed by size to give an idea of how big each one is:

kev@esxi-ubuntu-mongodb1:~/data/yelp$ ls -lS

total 5657960

-rwxrwxr-x 1 kev kev 3819730722 Oct 14 16:16 review.json

-rwxrwxr-x 1 kev kev 1572537048 Oct 14 16:22 user.json

-rwxrwxr-x 1 kev kev  184892583 Oct 14 16:16 tip.json

-rwxrwxr-x 1 kev kev  132272455 Oct 14 16:03 business.json

-rwxrwxr-x 1 kev kev   60098185 Oct 14 16:03 checkin.json

-rwxrwxr-x 1 kev kev   24195971 Oct 14 16:03 photos.json
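The basic stats I was after (how many records, and what a record looks like) can also be checked straight from a file before any import. The sketch below is a rough Python version. It assumes each file holds one JSON object per line, and the sample records it writes are invented for the demo, not the real Yelp fields.

```python
import json
import os
import tempfile


def summarize_json_lines(path):
    """Return (record_count, sorted keys of the first record).

    Assumes one JSON object per line; blank lines are skipped.
    """
    count = 0
    first_keys = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            if not line.strip():
                continue
            if count == 0:
                first_keys = sorted(json.loads(line).keys())
            count += 1
    return count, first_keys


# Demo on a tiny throwaway file with invented records (not the Yelp schema).
sample_path = os.path.join(tempfile.mkdtemp(), "sample.json")
with open(sample_path, "w", encoding="utf-8") as out:
    out.write('{"id": 1, "text": "first"}\n')
    out.write('{"id": 2, "text": "second"}\n')

print(summarize_json_lines(sample_path))
```

Because it reads one line at a time, the same function should cope with the multi-gigabyte files without loading them into memory.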

 

So importing each of the datasets, one at a time:

kev@esxi-ubuntu-mongodb1:~/data/yelp$ mongoimport -d yelp -c checkin checkin.json

2017-10-14T16:49:35.566-0700 connected to: localhost

2017-10-14T16:49:38.564-0700 [#########...............] yelp.checkin 22.6MB/57.3MB (39.5%)

2017-10-14T16:49:44.474-0700 [########################] yelp.checkin 57.3MB/57.3MB (100.0%)

2017-10-14T16:49:44.475-0700 imported 135148 documents

 

kev@esxi-ubuntu-mongodb1:~/data/yelp$ mongoimport -d yelp -c business business.json

2017-10-14T16:49:59.593-0700 connected to: localhost

2017-10-14T16:50:02.592-0700 [#####...................] yelp.business 27.9MB/126MB (22.1%)

2017-10-14T16:50:12.873-0700 [########################] yelp.business 126MB/126MB (100.0%)

2017-10-14T16:50:12.873-0700 imported 156639 documents

 

kev@esxi-ubuntu-mongodb1:~/data/yelp$ mongoimport -d yelp -c tip tip.json

2017-10-14T16:50:38.061-0700 connected to: localhost

2017-10-14T16:50:41.058-0700 [##......................] yelp.tip 17.5MB/176MB (9.9%)

2017-10-14T16:51:07.381-0700 [########################] yelp.tip 176MB/176MB (100.0%)

2017-10-14T16:51:07.381-0700 imported 1028802 documents

 

kev@esxi-ubuntu-mongodb1:~/data/yelp$ mongoimport -d yelp -c user user.json

2017-10-14T16:51:28.648-0700 connected to: localhost

2017-10-14T16:51:31.648-0700 [........................] yelp.user 36.9MB/1.46GB (2.5%)

2017-10-14T16:54:15.907-0700 [########################] yelp.user 1.46GB/1.46GB (100.0%)

2017-10-14T16:54:15.907-0700 imported 1183362 documents

 

kev@esxi-ubuntu-mongodb1:~/data/yelp$ mongoimport -d yelp -c review review.json

2017-10-14T16:57:01.018-0700 connected to: localhost

2017-10-14T16:57:04.016-0700 [........................] yelp.review 34.9MB/3.56GB (1.0%)

2017-10-14T17:02:31.967-0700 [########################] yelp.review 3.56GB/3.56GB (100.0%)

2017-10-14T17:02:31.967-0700 imported 4736897 documents
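The "imported N documents" lines already answer the "how many data items" question. As a small sketch, the Python below pulls those counts out of the log lines shown above and totals them. The regular expression only targets the message format seen in this output; other mongoimport versions may word it differently.

```python
import re

# Matches the summary line format shown in the output above.
IMPORTED = re.compile(r"imported (\d+) documents")


def imported_counts(log_text):
    """Return the document counts from every 'imported N documents' line."""
    return [int(match.group(1)) for match in IMPORTED.finditer(log_text)]


# The five summary lines copied from the imports above.
log = """\
2017-10-14T16:49:44.475-0700 imported 135148 documents
2017-10-14T16:50:12.873-0700 imported 156639 documents
2017-10-14T16:51:07.381-0700 imported 1028802 documents
2017-10-14T16:54:15.907-0700 imported 1183362 documents
2017-10-14T17:02:31.967-0700 imported 4736897 documents
"""

counts = imported_counts(log)
print(counts)
print(sum(counts))  # 7240848 documents across the five collections
```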

 

Done! Almost 6GB of data imported to MongoDB. Now, time for some queries!
