Blog

Big Data Is A Matter Of Speed, Not Size

Finally the market is getting over its initial BIG Data fixation. Unfortunately, in the process we may be inclined to throw away the Big Data signal in an attempt to rid ourselves of all the noise.

The Guardian's John Burn-Murdoch highlights this today, asserting that "'small data' - or data of the volumes most regular analysts, researchers and statisticians are used to dealing with - is actually both more relevant and more useful to the vast majority of organisations than its big cousin." He concludes, "[I]t is speed, not size that is increasingly driving desire for software and hardware improvements at data-processing organisations."

While we talk about Big Data, the reality is that there is a much more important trend going on in data, generally, as Rufus Pollock, Founder and Co-Director of the Open Knowledge Foundation, captures:

[W]e risk overlooking the much more important story here, the real revolution, which is the mass democratisation of the means of access, storage and processing of data. This story isn't about large organisations running parallel software on tens of thousand of servers, but about more people than ever being able to collaborate effectively around a distributed ecosystem of information, an ecosystem of small data.

Now if only we could get everyone else to recognize this essential truth, so we could stop admiring how very big all our data is, and instead focus on actually putting it to work in time for it to be useful to us.

This Week in MongoDB May 20-28

Here's what's going on in the MongoDB community this week:

Learn More

Upcoming MongoDB Days

Upcoming Webinars:

User Groups and Events this Week

May 20

May 21

May 22

Have something you'd like to share? Let us know

Data Scientist Shortage? There's An App For That

Big Data is all the rage, but apparently will come to a crashing halt due to a shortage of data scientists. As I've argued elsewhere, this is mostly a sham. Context is critical for making use of a company's data, and the people with context already work for the enterprise. So it becomes a matter of training the people one has, rather than going off on a scouting trip for the mythical data scientist.

Nor will the "science" of Big Data remain such for long, according to IBM's James Kobielus. As he notes, "core data scientist aptitudes -- curiosity, intellectual agility, statistical fluency, research stamina, scientific rigor, skeptical nature -- are widely distributed throughout workforces everywhere." He then points to a few key trends that will make data science less of a science:

  • As more data discovery, acquisition, preparation, and modeling functions are automated through better tools, today's data scientists will have more time for the core of their jobs: statistical analysis, modeling, and interaction exploration.
  • Data scientists are developing fewer models from scratch. That's because more and more big data projects run on application-embedded analytic models integrated into commercial solutions....
  • Open source communities and tools will greatly expand the pool of knowledgeable, empowered data scientists at your disposal, either as employees or partners.

This jibes with Cloudera CEO Mike Olson's contention that "There will be enormous Hadoop adoption, but you'll get it by virtue of the applications you run."

But whether an organization interprets its data through applications or directly using open-source technologies, one thing remains true in all this: people are critical to making sense of Big Data. The data won't speak for itself. It's therefore critical to find people inside one's organization who can help make sense of the organization's data. The good news? They're already available and on the payroll.

MongoDB March Madness Recap

There are 76 MongoDB User Groups (MUGs) around the globe that all share the same mission: to grow their MongoDB skills by networking with other like-minded developers, architects and engineers. This February, the MUG organizers around the world wanted to compete in a Global MongoDB User Group Hackathon. We called this global challenge MongoDB March Madness. Local user groups hacked together and built tools to support the growing MongoDB community, and their final products were judged by 10gen engineers.

We challenged the groups to build tools to help the MongoDB community learn from one another.

After 6 weeks of hacking, the MongoDB User Group community submitted some awesome projects. Here are the highlights:

  • First place, Leafblower: An open source, easily extensible, flexible and live dashboarding platform
  • Second place, NoSQL Test Results Repository: A tool to collect and search YCSB (Yahoo! Cloud Serving Benchmark) test results.
  • Third place, Mongowebstat: A MongoDB monitoring tool written in Go!
  • Runner Up, Selene: A simple CMS built with Motor, the asynchronous MongoDB driver for Python.
  • Runner Up, Backup: A Ruby Gem for easily performing backups on local and remote environments.

Thanks to the Bangalore, Sydney, Orange County, Omsk, Lima, Los Angeles and Mallorca MongoDB User Groups for participating in this Hackathon!

Sponsored Post: From Cloud to Clear Strategy - GigaOm Structure

Structure • June 19 & 20, 2013 • San Francisco

FROM CLOUD TO CLEAR STRATEGY

If your business depends on the cloud, then you need to be at Structure as we explore the best approaches for implementing the cloud today. You'll meet the people getting their hands dirty with deployments.

You'll also have a chance to meet GigaOM's Top 10 Cloud Catalysts. Hand selected by our editors, these up and comers are rethinking infrastructure for the next generation of computing. We'll announce the selected participants soon, so stay tuned!

Plus, you'll get a first look at our Structure LaunchPad finalists, chosen for their groundbreaking technologies and business models that are driving the future of the cloud industry. See them compete live.

Three ways to attend:

  • Single registration: $1,195 (less than 35 tickets left at this rate!)
  • Group rate (3 or more): $995 each
  • Startup package: Special offer for early stage startups that includes discounted tickets, GigaOM Pro subscription and more...

Less than 35 Supersaver tickets left - save an additional 25% by registering now using the discount code MONGODB.

This Week in MongoDB: May 13-19

Provisioned IOPS On AWS Marketplace Significantly Boosts MongoDB Performance, Ease Of Use

One of the largest factors affecting the performance of MongoDB is the choice of storage configuration. As data sets exceed the size of memory, the random IOPS rate of your storage will begin to dominate database performance. How you split your logs, journal and data files across drives will impact performance and the maintainability of your database. Even choice of filesystem and read-ahead settings can have a major impact. A large number of performance issues we encounter in the field are related to misconfigured or under-provisioned storage. Storage configuration is often more important than instance size in determining the expected performance of a MongoDB server.

MongoDB With Provisioned IOPS: Better Performance, Less Guesswork

That’s why we’re excited to announce the availability of MongoDB with bundled storage configurations on the Amazon Web Services (AWS) Marketplace. Working closely with the Marketplace and EBS teams, we’ve made available a new set of MongoDB AMIs that not only include the world’s leading document database software installed and configured according to our best practices, but also include high-performance storage configurations that leverage Amazon’s Provisioned IOPS (pIOPS) storage volumes, including Amazon’s new 4000-IOPS pIOPS drives. These options take a lot of the guesswork out of running MongoDB on EC2 and help ensure a great out-of-the-box experience without needing to do any additional setup yourself.

These configurations offer radically improved performance for MongoDB, even on datasets much larger than RAM. If you want to take MongoDB for a spin, or set up your first production cluster, we recommend starting with these images. We plan to keep extending this set of configurations to give you more choices to address different workloads and use cases.

The MongoDB with Bundled Storage AMI is available today in 3 configurations:

The choice of configuration will depend on how much storage capacity you want to put behind your MongoDB instance. For comparison, we have found that ephemeral storage and regular (non-pIOPS) EBS volumes can reliably deliver about 100 IOPS on a sustained basis. That means that these configurations can deliver 10x to 40x higher out-of-memory throughput than non-pIOPS based setups.
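The 10x and 40x figures follow from dividing the provisioned IOPS rates by the roughly 100 IOPS baseline. Here is a minimal Python sketch of that arithmetic. The 1000, 2000 and 4000 rates are the data-volume rates quoted elsewhere in this post, and the names used are purely illustrative:

```python
# Rough arithmetic behind the 10x-to-40x claim (names are illustrative only).
BASELINE_IOPS = 100  # approximate sustained rate for ephemeral / non-pIOPS EBS

# Provisioned data-volume rates mentioned in this post.
PROVISIONED_IOPS = [1000, 2000, 4000]


def speedup(provisioned, baseline=BASELINE_IOPS):
    """Return how many times more random IOPS a provisioned volume offers."""
    return provisioned / baseline


speedups = {rate: speedup(rate) for rate in PROVISIONED_IOPS}
for rate, factor in speedups.items():
    print(f"{rate} IOPS -> about {factor:.0f}x the baseline")
```

This is an upper bound on the improvement for workloads that are genuinely limited by random disk IO; workloads that fit in memory will see less of a difference.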

There’s no charge from 10gen for using these AMIs. You pay only the EC2 usage charges for the instances and disk volumes used by your setup. Take them for a test-drive and please let us know what you think.

Implications Of Using MongoDB With pIOPS

Here’s what you get when you use these instances:

Separate Volumes For Data, Journal And Logs

When you launch the AMI on an EC2 instance, there will be three EBS volumes attached: one each for data, journal and logs. By placing these on separate volumes, we help decrease contention for disk access during high-load scenarios and avoid the head-of-line blocking that can occur.

The data volume is provisioned at 200GB or 400GB, with IOPS rates at 1000, 2000 and 4000. For write-heavy workloads, this helps ensure that the background flush can get synced quickly to disk. For read-heavy workloads, the IOPS rate of the drive determines the rate at which a random document or b-tree bucket can be loaded from disk into memory.

The journal gets its own 25GB drive provisioned at 250 IOPS. While 25GB is large for the journal, we wanted to make sure we had enough IOPS to handle the journal load and to provide sufficient capacity for reading the journal during a recovery. To stay within the 10:1 ratio of IOPS to size (in GB) imposed by EBS, we made the volume a little bigger than the journal strictly needs. Putting the journal on its own volume ensures that a journal flush is never queued behind the big IOs that happen when data files are synced.

The log volumes are provisioned at 10GB, 15GB and 20GB sizes with 100, 150, and 200 IOPS. This gives you plenty of room for storage of logs as well as predictable storage performance for collection of log data.
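The journal and log sizes above all sit at 10 IOPS per GB, which is why the journal volume ends up at 25GB. The sketch below checks that arithmetic in Python. It assumes the three log sizes pair with the three IOPS rates in the order listed, and the labels and helper names are hypothetical:

```python
# Volume sizes (GB) and provisioned IOPS quoted above.
# Assumption: log sizes and log IOPS rates pair up in the order listed.
MAX_IOPS_PER_GB = 10  # the 10:1 limit described in the post

volumes = {
    "journal": (25, 250),
    "logs-small": (10, 100),
    "logs-medium": (15, 150),
    "logs-large": (20, 200),
}


def iops_per_gb(size_gb, iops):
    """Provisioned IOPS divided by volume size in GB."""
    return iops / size_gb


def min_size_gb(iops, max_ratio=MAX_IOPS_PER_GB):
    """Smallest volume (GB) that can carry the requested IOPS under the limit."""
    return iops / max_ratio


for name, (size_gb, iops) in volumes.items():
    print(name, iops_per_gb(size_gb, iops), min_size_gb(iops))
```

Under that limit, asking for 250 IOPS on the journal means the volume cannot be smaller than 25GB, regardless of how little space the journal itself uses.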

Pre-tuned Filesystem And OS Configuration

We’ve pre-configured an ext4 filesystem, sensible mount options, read-ahead and ulimit settings in the AMI. pIOPS EBS volumes are rated for 16KB IOs, so using a read-ahead larger than this actually leads to decreased throughput. We’ve set this up out of the box.
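If you are tuning your own volumes rather than using the AMI, it helps to know that Linux's `blockdev --setra` expresses read-ahead in 512-byte sectors, so a 16KB read-ahead corresponds to 32 sectors. The Python sketch below only builds the command and does not run it. The device path is hypothetical, and this is not necessarily how the AMI itself applies the setting:

```python
SECTOR_BYTES = 512  # blockdev --setra counts in 512-byte sectors


def readahead_sectors(io_size_kb):
    """Convert a read-ahead size in KB to 512-byte sectors."""
    return (io_size_kb * 1024) // SECTOR_BYTES


def setra_command(device, io_size_kb=16):
    """Build (but do not run) the blockdev command for the given device."""
    return ["blockdev", "--setra", str(readahead_sectors(io_size_kb)), device]


# "/dev/xvdf" is a placeholder; substitute the device backing your data volume.
cmd = setra_command("/dev/xvdf")
print(" ".join(cmd))
```

Running the resulting command requires root privileges, and the setting does not persist across reboots unless you add it to your boot-time configuration.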

Amazon Linux With Pre-installed Software And Repositories

We started with Amazon’s latest and greatest Linux AMI, and then added in 10gen’s RPM repo. No more adding a repo to get access to the latest software version. We’ve also pre-installed MongoDB, the 10gen MMS Agent and various useful software utilities like sysstat (which contains the useful iostat utility) and munin-node (which MMS can use to access host statistics).

The MMS agent is deactivated by default, but can be activated simply by adding your MMS account ID and then starting the agent.

A New Wave Of MongoDB Adoption In The Cloud

A significant percentage of MongoDB applications are currently deployed in the cloud. We expect this percentage to continue to grow as enterprises discover the cost and agility benefits of running their applications on clouds like AWS. As such, it's critically important that MongoDB run exceptionally well on Amazon, and with the addition of pIOPS to the MongoDB AMIs on Marketplace, MongoDB performance in the cloud just got a big boost. We look forward to continuing to work closely with Amazon to facilitate MongoDB performance improvements on AWS.

Why It's the Right Time to Learn MongoDB

There are a number of technical considerations involved in choosing a database for a new project, but if you’re looking to learn a new technology, you need the reassurance that there is traction in the field and resources available to grow as a developer or ops professional.

Here’s why it’s the right time to learn MongoDB.

The Technology Has Matured

Product maturity grows with increased usage and familiarity. MongoDB is open source and has grown along with the community, thanks to code contributors, community testers and even those who vote on new features. If you’re learning MongoDB now, you will be learning to use a solid product that has industry validation and functionality similar to many relational databases you’ve encountered before. You will also have the support of a community of experts who have been using MongoDB in different environments for three years or more.

You Need to Stay Relevant

Interest in MongoDB spiked in 2010, according to Google Search Insights, and the momentum has only continued to grow. This is because the technology has matured, 10gen’s development on MongoDB has increased and adoption has grown. MongoDB has enabled developers to build new types of applications for cloud, mobile and social, making MongoDB developers an invaluable resource for companies looking to innovate in each of these areas.

In May 2012, James Governor posted Indeed Job Trends for various NoSQL products, all heading uphill since 2010, and MongoDB came out on top. Additionally, MongoDB is the most widely adopted NoSQL technology according to 451 Group's monthly LinkedIn Skills Index, with 45% of LinkedIn profile mentions in the NoSQL category. MongoDB skills are in high demand from businesses, and your peers are learning the skills to stay relevant.

You Need to Get Ahead

Employers are looking for talented engineers who stay up-to-speed on new technologies. But even if you’re not looking for a new position, learning MongoDB can place you in line to lead a new project or oversee a large database migration.

Developers at companies like eBay, Disney, Carfax, Edmunds and Cisco are running large production deployments of MongoDB. Companies like The Guardian have committed to prototype all new projects on MongoDB--calling it the “MongoDB First” philosophy. If you work at a large engineering company, it’s likely that some new projects for social communications, advanced analytics products, content management or archiving could use a MongoDB backend. With the right expertise, you can position yourself to lead the project.

The Resources are there for you!

MongoDB has matured, and so have the resources for learning how to use the database. The docs, mailing lists and user forums are all at least three years old and are available in a number of languages. Additionally, there are community-developed resources for getting started, including The Little MongoDB Book. Here are some more materials for getting started with MongoDB:

  • Online Education Courses: 10gen launched online education classes in November 2012 and has been adding new courses every few months. 10gen’s seven-week classes will help you learn the basics of data modeling, application design and operations with MongoDB. The next set of courses for MongoDB and Java will begin on May 13, and MongoDB for Developers will begin June 17.
  • Training: 10gen provides 2-3 day training for Developers and DBAs. These courses offer a deep dive into MongoDB. 10gen offers training regularly in New York, Palo Alto and London, and offers training in other cities in the United States and Europe. This is ideal for those interested in getting started on a new MongoDB project right away.
  • Webinars: If you’re chained to your desk all day, try attending an introductory webinar. At 10gen we host at least one webinar a week. These offer an in-depth, technical overview of a specific topic, and you’ll always get slides and video afterward.
  • Conferences: Full-day conferences are an excellent way to get a good overview of a particular technology and its ecosystem. Not only will you leave with practical knowledge on how to get started, but you’ll also get to hear from production users who have valuable experience in onboarding development teams and in designing and scaling applications. Check out 10gen’s conference schedule for the rest of 2013.

This Week in MongoDB: May 6-12

Why Open Source Is Essential To Big Data

Gartner analyst Merv Adrian recently highlighted some of the recent movements in Hadoop Land, with several companies introducing products "intended to improve Hadoop speed." 

This seems odd, as that wouldn't be my top pick for how to improve Hadoop or, really, most of the Big Data technologies out there. By many accounts, the biggest need in Hadoop is improved ease of use, not improved performance, something Adrian himself confirms:

Hadoop already delivers exceptional performance on commodity hardware, compared to its stodgy proprietary competition. Where it's still lacking is in ease of use.

Not that Hadoop is alone in this. 

As Mare Lucas asserts,

Today, despite the information deluge, enterprise decision makers are often unable to access the data in a useful way. The tools are designed for those who speak the language of algorithms and statistical analysis. It’s simply too hard for the everyday user to “ask” the data any questions – from the routine to the insightful. The end result? The speed of big data moves at a slower pace … and the power is locked in the hands of the few.

Lucas goes on to argue that the solution to the data scientist shortage is to take the science out of data science; that is, consumerize Big Data technology such that non-PhD-wielding business people can query their data and get back meaningful results. 

The Value Of Open Source To Deciphering Big Data

Perhaps. But there's actually an intermediate step before we reach the Promised Land of full consumerization of Big Data. It's called open source.

Even with technology like Hadoop that is open source yet still too complex, the benefits of using Hadoop far outweigh the costs (financial and productivity-wise) associated with licensing an expensive data warehousing or analytics platform. As Alex Popescu writes, Hadoop "allows experimenting and trying out new ideas, while continuing to accumulate and storing your data. It removes the pressure from the developers. That’s agility."

But these benefits aren't unique to Hadoop. They're inherent in any open-source project. Now imagine we could get open-source software that fits our Big Data needs, is exceptionally easy to use and is almost certainly already being used within our enterprises. That is the promise of MongoDB, consistently cited as one of the industry's top-two Big Data technologies. MongoDB makes it easy to get started with a Big Data project.

Using MongoDB To Innovate

Consider the City of Chicago. The Economist wrote recently about the City of Chicago's predictive analytics platform, WindyGrid. What The Economist didn't mention is that WindyGrid started as a pet project on chief data officer Brett Goldstein's laptop. Goldstein started with a single MongoDB node, and iterated from there, turning it into one of the most exciting data-driven applications in the industry today.

Given that we often don't know exactly which data to query, or how to query, or how to put data to work in our applications, this is precisely how a Big Data project should work. Start small, then iterate toward something big. This kind of tinkering is difficult, if not impossible, with a relational database, as The Economist's Kenneth Cukier points out in his book, Big Data: A Revolution That Will Transform How We Live, Work, and Think:

Conventional, so-called relational, databases are designed for a world in which data is sparse, and thus can be and will be curated carefully. It is a world in which the questions one wants to answer using the data have to be clear at the outset, so that the database is designed to answer them - and only them - efficiently.

But with a flexible document database like MongoDB, it suddenly becomes much easier to iterate toward Big Data insights. We don't need to go out and hire data scientists. Rather, we simply need to apply existing, open-source technology like MongoDB to our Big Data problems, which jibes perfectly with Gartner analyst Svetlana Sicular's mantra that it's easier to train existing employees on Big Data technologies than it is to train data scientists on one's business.

Except, in the case of MongoDB, odds are that enterprises are already filled with people who understand MongoDB, as 451 Research's LinkedIn analysis suggests.

In sum, Big Data needn't be daunting or difficult. It's a download away.
