Leaf in the Wild posts highlight real world MongoDB deployments. Read other stories about how companies are using MongoDB for their mission-critical projects.
Multi-Data Center Data Aggregation System Accessed by over 3,000 Physicists from Nearly 200 Research Institutions Across the Globe
The European Organisation for Nuclear Research, known as CERN, plays a leading role in the fundamental studies of physics. It has been instrumental in many key global innovations and breakthroughs, and today operates the world's largest particle physics laboratory. The Large Hadron Collider (LHC) nestled under the mountains on the Swiss - Franco border is central to its research into origins of the universe.
A key part of CERN’s experiments is the Compact Muon Solenoid (CMS), a particle detector designed to observe a wide range of particles and phenomena produced in high-energy collisions in the LHC. Scientists use data collected from these collisions to search for new phenomena that will help to answer questions such as:
- “What is the Universe really made of, and what forces act within it?”
- “What gives everything substance?”
The CMS experiment has used MongoDB for over five years to aid the discovery and analytics of data generated from the LHC. I met up with Valentin Kuznetsov, part of the team responsible for data management in the CMS experiment, to learn more.
Can you start by telling us a little bit about what you are doing at CERN?
I am a data scientist and research associate at Cornell University where I specialize in the development of data management software for high energy physics experiments. I am also actively involved in data management for the Compact Muon Solenoid (CMS) experiment.
CMS is one of the two general purpose particle physics detectors operated at the LHC. The LHC smashes groups of protons together at close to the speed of light: 40 million times per second and with seven times the energy of the most powerful accelerators built up to now. It is designed to explore the fundamental building blocks of the universe, used by more than 3,000 physicists from 183 institutions representing 38 countries. This team drives the design, construction and maintenance of the experiments across the 20PBs of data generated by the LHC each year.
How is data managed by the CMS?
Experiments of this magnitude require a vast distributed computing and storage infrastructure. The CMS spans more than a hundred data centres, handling raw data from the detector, as well as multiple simulations and associated meta-data. Data is stored in a variety of backend repositories that we call “data-services”, including relational databases, filesystems, message queues, wikis, customized applications and more.
At this scale, efficient information discovery within a heterogeneous, distributed environment becomes an important ingredient of successful data analysis. Scientists want to be able to query and combine data from all of the different data-services. The challenge for our users is that this vast and complex collection of data means they don’t necessarily know where to find the right information, or have the domain knowledge necessary to extract the data.
What role does MongoDB play in the CMS?
MongoDB powers our Data Aggregation System (DAS), providing the ability for researchers to search and aggregate information distributed across all of the CMS backend data-services, and bring that data into a single view.
The DAS is implemented as a layer on top of the data-services, allowing researchers to query data via free text-based queries, and then aggregate the results from distributed providers into a single view – while preserving data integrity, security policy and formats. When a user submits a query, it checks if MongoDB has the aggregation the user is asking for and, if it does, returns it. Otherwise the DAS performs the aggregation and saves it to MongoDB.
Why was MongoDB selected to power the DAS?
MongoDB’s flexible data model was key to our selection. We can’t know the structure of the all the different queries researchers want to run, so a dynamic schema is essential when storing results in a single view. This requirement eliminated relational databases from our evaluation process.
We could get similar schema flexibility from other non-relational databases, but what is unique about MongoDB is that it also offers a rich query language and extensive secondary indexes. This gives our users fast and flexible access to data by any query pattern – from simple key-value look-ups, through to complex search, traversals and aggregations across rich data structures, including embedded sub-documents and arrays. We also use CouchDB in other parts of our infrastructure for data replication between different endpoints.
Can you describe how MongoDB is deployed in the DAS?
MongoDB is used as an intelligent cache on top of distributed data-services. It runs on our storage backends and currently talks to about a dozen of our CMS data-services. The data-services are backed by traditional RDBMS systems based on the Oracle CERN IT cluster. The beauty of this architecture is that it allows us transparently change our data-services without impacting user access into the system. In fact, we have already changed the implementation of a few data-services without affecting our end-users.
MongoDB handles the ingestion and expiration of millions of documents managed by the DAS every day. A single query can return up to 10,000 different documents extracted from multiple data-services, which then have to be processed, aggregated and stored in MongoDB, all in real time.
The DAS helps scientists easily discover and extract information they need in their research, and it represents one of the many tools which physicists use on a daily basis towards great discoveries. Without the DAS information retrieval would take orders of magnitude longer.
MongoDB is deployed on commodity hardware configured with SSDs and running Scientific Linux. Our applications are written in Python, and we are experimenting with Go.
Do you have other projects that are using MongoDB?
We are in the beta phase of our new Workflow and Data Management (WM) archive system. Agents from systems running data processing pipelines persist log data to MongoDB, which can then be used by our administrators and data scientists to monitor job status, system throughput, error conditions and more. MongoDB provides short-term storage for the archive, providing real-time analytics to staff using MongoDB 3.2’s aggregation pipeline across the past two months of machine data. We collect the throughput of agents at specific sites and aggregate statistics such as total CPU time of the running jobs, their success/failure rates, total size of produced data, and other metrics
Data is replicated from MongoDB to Hadoop where it is converted from JSON into an AVRO format, and persisted for long-term storage. Spark jobs are run against the historic archive, with result sets loaded back to MongoDB where they update our real time analytics views, and served to monitoring and visualization apps.
Figure 1: Architecture of the CMS WMArchive system
Can you share any best practices for getting started with MongoDB?
MongoDB’s dynamic schema is great for rapidly prototyping new applications – it gives you the freedom to try out new ideas, and you can be productive in hours. It doesn’t replace the need for proper schema design, but its flexibility means you can quickly iterate to identify the optimum data model for your application.
MongoDB also has a vibrant and active community. Never be fearful of reaching out and asking questions, even if you are afraid they may seem pretty basic! MongoDB engineers and community masters are active on the mailing lists, and so you can always get help and guidance.
About the Author - Mat Keep
Mat is a director within the MongoDB product marketing team, responsible for building the vision, positioning and content for MongoDB’s products and services, including the analysis of market trends and customer requirements. Prior to MongoDB, Mat was director of product management at Oracle Corp. with responsibility for the MySQL database in web, telecoms, cloud and big data workloads. This followed a series of sales, business development and analyst / programmer positions with both technology vendors and end-user companies.