Big data analysis, machine learning, and publicly available data sets

I’ve been meaning to take a look at some Big Data analysis tools for a while, particularly Apache Spark and deeplearning4j. Before using Spark to ingest a large dataset, I thought it would be worthwhile to write a regular Java app to crunch some numbers on a dataset first, as a benchmark. Looking around for publicly available datasets, I’ve known for a while that Project Gutenberg offers the texts of many classic novels. I wondered what it would take to do a simple word count across all the words in a typical novel.

It turns out a typical novel, say Alice in Wonderland, is actually pretty small, at around 150KB. That's not ‘big’ at all in today’s sense of ‘big data’; in fact it's trivial. Anyway, I wrote a simple Java app to count word occurrences and then order the words by number of occurrences; you can see my code here. I didn’t attempt to optimize the code at all, since this was my first attempt at writing a word count app. The surprising thing is how quickly it executes: on my i7 MacBook Pro with an SSD, it completes the count and sort in 100ms. I was hoping for something with more significant number crunching than this, so clearly I need to set my sights on larger data sets.
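For readers who want something to run, here is a minimal sketch of a count-and-sort along those lines. It is not the code linked above: the class name `WordCount`, the method `countWords`, the tokenization rule (lowercase, split on anything that isn't a letter or apostrophe), and taking the file path as a command-line argument are all my assumptions for illustration. It also assumes Java 11 or later for `Files.readString`.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

public class WordCount {

    // Counts word occurrences and returns them ordered by count (highest first),
    // with ties broken alphabetically. The tokenization rule is an assumption.
    static Map<String, Long> countWords(String text) {
        Map<String, Long> counts = Arrays
                .stream(text.toLowerCase(Locale.ROOT).split("[^a-z']+"))
                .filter(word -> !word.isEmpty()) // split can yield a leading empty token
                .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));

        // A LinkedHashMap preserves the sorted insertion order.
        return counts.entrySet().stream()
                .sorted(Map.Entry.<String, Long>comparingByValue(Comparator.reverseOrder())
                        .thenComparing(Map.Entry.<String, Long>comparingByKey()))
                .collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue,
                        (a, b) -> a, LinkedHashMap::new));
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java WordCount <path-to-text-file>");
            return;
        }
        long start = System.nanoTime();
        String text = Files.readString(Path.of(args[0])); // whole file in memory
        Map<String, Long> sorted = countWords(text);
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;

        // Print the ten most frequent words and the elapsed time.
        sorted.entrySet().stream().limit(10)
                .forEach(e -> System.out.println(e.getKey() + " " + e.getValue()));
        System.out.println("Counted and sorted in " + elapsedMs + "ms");
    }
}
```

Reading the whole file into memory is fine for a 150KB novel, but it would not be a sensible approach for a multi-gigabyte dataset, which is part of why a tool like Spark becomes interesting at that scale.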

If you Google ‘public big data sets’ you’ll find many collections, for example this list. Some of these are collections of publicly available data; others are data shared by organizations asking the community for input on analyzing it. The Yelp data set is an interesting example of the latter: they offer 5.79GB of JSON data for researchers to analyze and provide feedback on, as part of a ‘Dataset Challenge’. Almost 6GB of data is significantly larger than my 150KB, so if I’m going to do anything interesting with Spark, this might be a good place to start.

Data set downloaded, off I go 🙂
