Leaf in the Wild posts highlight real world MongoDB deployments. Read other stories about how companies are using MongoDB for their mission-critical projects.
There is no question that Apache Spark is on fire. It’s the most active big data project in the Apache Software Foundation, and was recently “blessed” by IBM who committed 3,500 engineers to advancing it. While some are still confused by what it is, or claiming it will kill Hadoop (which it won’t, or at least not the non-MapReduce parts of it), there are already companies today harnessing its power to build next generation analytics applications.
Stratio are one such company. With an impressive client list including BBVA, Just Eat, Santander, SAP, Sony and Telefonica, Stratio claims more projects and clients with its Apache Spark-certified Big Data (BD) platform than pretty much anyone else.
MongoDB is often used as the database within the Stratio BD platform, so I jumped at the chance to sit down with their Connectors Development Team, along with Steve Galache, Stratio Deputy Officer. We discussed some of their recent projects, and best practices they have learned for leveraging Spark in new generations of analytics workloads.
Can you start by telling us a little bit about your company? What are you trying to accomplish? How do you see yourself growing in the next few years?
We at Stratio have been working with Big Data for over a decade. Years ago we foresaw that the future of big data was fast data, and started working with Apache Spark even before it was in incubation. We brought the first platform to market in early 2014, and it is currently in production in over a dozen large enterprises in banking, e-commerce, energy and telecom, to name a few. The platform is enabling a range of use cases, from internet scale, consumer applications all the way to cybersecurity. We are not only a certified Spark distribution – we are focused on simplifying application development using fast data. On top of Spark we have built applications for data ingestion, processing and visualization without needing to write a line of code, and provide full integration with leading non-relational databases such as MongoDB.
Please describe your projects using MongoDB and Stratio. What problems were your customers trying to solve? How do MongoDB and Spark help them solve those problems?
A very common scenario is a large enterprise that has implemented sophisticated Business Intelligence (BI) processes to understand its business. But they now want to take the next step by exploiting new data sources to further improve their understanding of customers. The company is sitting on large amounts of untapped raw data from service logs, from browsing behavior, from social data and more. Being able to analyze that data helps the enterprise deepen their understanding of customers in terms of what the customer does or doesn’t do, or indeed what they are trying to do. Ultimately this helps them improve the customer experience, as well as predict new business opportunities. This is where Stratio and our BD platform can help.
As an example of this scenario, we recently implemented a unified real-time monitoring platform for a multinational banking group operating in 31 countries with 51 million clients all over the world. The bank wanted to ensure a high quality of service across its online channels, and so needed to continuously monitor client activity to check service response times and identify potential issues. To build the application, we used the following technology:
- Apache Flume to aggregate log data
- Apache Spark to process log events in real time
- MongoDB to persist log data, processed events and Key Performance Indicators (KPIs).
The aggregated KPIs enables the bank to meet its objective of analyzing client and systems behavior in order to improve the customer experience. Collecting raw log data allows the bank to immediately rebuild user sessions if a service fails, with the analysis providing complete traceability to identify the root cause of the issue.
It isn’t just financial services companies that see the opportunity of analyzing logs and social streams. We have implemented similar projects for our clients in the insurance and retail sectors as well.
Why was MongoDB a good fit for this project?
We needed a database that gave us always-on availability, high performance and linear scalability. In addition, we needed a fully dynamic schema to support high volumes of rapidly changing semi-structured and unstructured JSON data being ingested from the variety of logs, clickstreams and social networks. When we evaluated the project’s requirements, MongoDB was the best fit.
Beyond projects of this type, we also see data lakes – central repositories of data from multiple applications – growing in use. MongoDB’s dynamic schema is a great fit for this type of use case as we cannot predict what type of data structures we need to manage.
Do you use other databases in your projects?
In order to collect and blend data from multiple applications, our BD platform can support a variety of databases and search engines. Each of these databases has its own strengths, and some can also give the availability and performance of MongoDB.
However, if an application is rapidly evolving with the addition of new features, and is generating multi-structured data, then MongoDB’s dynamic schema gives us a highly agile and flexible database layer to build upon.
In addition, deploying an application into production requires a comprehensive enterprise-class operational platform for orchestrating the application flow and proactive monitoring. MongoDB Cloud Manager provides us this tooling.
Can you describe a typical MongoDB and Spark deployment?
We have deployed MongoDB with the Stratio BD platform in many projects with different requirements. In some cases, MongoDB is used alongside other databases, and so maybe deployed on a single replica set.
In other projects, MongoDB is the sole data source, and is often running across a sharded cluster.
Can you share any best practices on customers scaling their MongoDB and Stratio Spark infrastructure?
There are several factors that have a major impact on system performance:
- Select an appropriate partitioning policy based on the access patterns of the Spark analytics workloads. MongoDB gives us a lot of flexibility here, and we can support both hash and range-based sharding to accommodate a wide variety of application schemas and query patterns.
- Make sure you provision each node with sufficient memory. As a best practice, the MongoDB working set – indexes and most frequently accessed data – should be stored in RAM. Similarly, giving Spark enough memory is also critical to performance as it avoids serializing data to disk during iterative computation.
- Design your architecture to ensure you maintain data locality between the MongoDB and Spark processes, even as you scale the cluster.
Figure 1: Using MongoDB replica sets for data locality and isolation of analytics from operational workloads
What types of queries and analysis are your customers driving with Spark and MongoDB? What APIs are they using?
It varies by use case. Everything from Spark Streaming to Machine Learning algorithms and SparkSQL. Stratio's SparkSQL MongoDB connector implements the PrunedFilteredScan API instead of the TableScan API. This means that, in addition to MongoDB’s query projections, we can avoid scanning the entire collection, which is not the case with other databases.
For SQL integration between MongoDB and Spark, the Stratio BD platform provides a SparkSQL data source that implements the Spark Dataframe API. This connector, fully implemented in Scala, currently supports Spark 1.3 (and is currently being updated for Spark 1.4, released a couple of weeks ago) and MongoDB 3.0. Why choose this implementation? Because it allows integration among a different set of data sources that implement the same API for Spark, in addition to using the SQL syntax to provide higher-level abstractions for complex analytics.
Once data is stored in MongoDB, we provide an ODBC/JDBC connector for integrating with any BI tool, such as Tableau and QlikView. We also use the connector to integrate with our own custom Datavis tool for creating dashboards and reports.
Figure 2: Stratio Datavis - creates reports, dashboards and microsites
How do your customers measure the impact of using MongoDB and the Stratio BD Platform?
Time-to-insight and time-to-market are critical dimensions. Both MongoDB and the Stratio BD platform are integrated and enterprise-ready. As a result customers avoid wasting time on installing, configuring and monitoring the system. They can just get on and start analyzing data in new and interesting ways.
Built around open source software and running on commodity hardware, the platform also delivers significant cost savings over proprietary technologies.
What advice would you give to those who are considering using Spark for their next project?
- Learn Scala. Scala is Spark’s mother tongue. Even though there are APIs for Java and Python, most of Spark’s developments use Scala, a functional and object oriented language.
- Beware of data shuffling! A good partitioning policy and use of RDD caching, coupled with avoiding unnecessary action operations, should be more than enough for achieving good performance.
- Consider serialization policies. If you know (or suspect) your data is not going to fit in RAM, you can experiment with the available serialization policies (i.e., MEMORY, MEMORY_SER, DISK) to achieve the best performance.
- A good assembly for your application. A tricky sort of problem is related to JAR assembly and classpath management when deploying your application. A good plugin for generating an assembly with transitive dependencies is fully recommended.
Steve and team – we’d like to thank you for taking the time to share your insights with the MongoDB and Apache Spark communities.