Just published on the Bright blog is my follow-up post regarding my poster presentation (poster, video) at the 2015 Rice Oil and Gas HPC Workshop. In addition to summarizing the disruptive potential for Apache Spark in energy exploration and other industries, this new post also captures my shift in emphasis from Apache Hadoop to Spark. Because the scientific details of my investigation are more-than-a-little OT for the Bright blog, I thought I’d share them here.
Reverse Time Migration (RTM) has a storied history of being performance-challenged. Although the method was originally conceived by geophysicists in the 1980s, it was almost two decades before it became computationally tractable. Considered table stakes in terms of seismic processing by today’s standards, algorithms research for RTM remains highly topical – not just at Rice, York and other universities, but also at the multinational corporations whose very livelihood depends upon the effective and efficient processing of seismic-reflection data. And of particular note are the consistent gains being made since the introduction of GPU programmability via CUDA, as innovative algorithms for RTM can exploit this platform for double-digit speedups.
Why does RTM remain performance-challenged? Dr. G. Liu and colleagues in the School of Geophysics and Information Technology at China University of Geosciences in Beijing identify the two key challenges:
- RTM modelling is inherently compute intensive. In RTM, propagating seismic waves are modeled using the three-dimensional wave equation. This classic equation of mathematical physics needs to be applied twice. First in the forward problem, assumptions are made about the characteristics of the seismic source as well as variations in subsurface velocity, so that seismic waves can be propagated forward in time from their point of origin into the subsurface (i.e., an area of geological interest from a petroleum exploration perspective); this results in the forward or source wavefields in the upper-branch of the diagram below. Using seismic traces recorded at arrays of geophones (receivers sensitive to various types of seismic waves) as well as an assumed subsurface-velocity model, these observations are reversed-in-time (hence the name RTM), and then backwards propagated using the same 3D wave equation; this results in the receivers’ wavefields in the lower-branch of the diagram below. It is standard practice to make use of the Finite Difference Method (FDM) to numerically propagate all wavefields in space and time. In order to ensure meaningful results (stable and non-dispersive from the perspective of numerical analysis) from application of FDM to the 3D wave equation, however, both time and 3D space need to be discretized into small steps and grid intervals, respectively. Because the wave equation is a Partial Differential Equation in time and space, the FDM estimates future values using approximations for all derivatives. And in practice, it has been determined that RTM requires high-order approximations for all spatial derivatives if reliable results are to be optimally obtained. In short, there are valid reasons why the RTM modeling kernel is inherently and unavoidably compute intensive.
- RTM data exceeds memory capacity. From the earliest days of computational tractability around the late 1990s, standard practice was to write the forward/source wavefields to disk. Then, in a subsequent step, cross-correlate this stored data of forward wavefields with the receivers’ wavefields. Using cross-correlation as the basis for an imaging condition, coherence (in the time-series analysis sense) between the two wavefields is interpreted as being of geological interest – i.e., the identification in space and time of geological reflectors like (steeply dipping) interfaces between different sedimentary lithologies, folds, faults, salt domes as well as reservoirs of even more complex geometrical structure. Although the method consistently delivered the ‘truest images’ of the subsurface, it was literally being crushed by its own success, as multiple-TB data volumes are typical for the forward wavefields. The need to write the forward wavefields to disk, and then re-read them piecemeal from disk during cross-correlation with the receivers’ wavefields, results in disk I/O emerging as the significant bottleneck.
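To make the two challenges above a little more concrete, here is a minimal sketch in plain Python. It is a toy of my own, not the implementation of Liu et al.: it is one-dimensional rather than 3D, uses second-order differences (production RTM uses high-order spatial stencils), assumes a Ricker wavelet as the source, and reuses a modeled trace as a stand-in for recorded data. The grid sizes, velocity and function names are all illustrative assumptions. It shows the FDM time step, the zero-lag cross-correlation imaging condition, and the arithmetic behind the multiple-TB wavefield volumes.

```python
import math

def ricker(t, f0=15.0, t0=0.1):
    """Ricker wavelet: an assumed source signature (peak frequency f0 in Hz)."""
    a = (math.pi * f0 * (t - t0)) ** 2
    return (1.0 - 2.0 * a) * math.exp(-a)

def cfl_number(vmax, dt, dx):
    """Courant number; the 1D scheme below is stable when this is <= 1."""
    return vmax * dt / dx

def propagate(velocity, dx, dt, nt, inject):
    """Propagate a 1D acoustic wavefield with second-order finite differences.

    inject(it) returns (grid_index, amplitude) pairs to add at time step it.
    Every snapshot is kept, which is exactly the volume RTM has to store.
    """
    nx = len(velocity)
    prev = [0.0] * nx
    curr = [0.0] * nx
    snapshots = []
    for it in range(nt):
        nxt = [0.0] * nx  # end points stay at zero (fixed boundaries)
        for i in range(1, nx - 1):
            lap = (curr[i - 1] - 2.0 * curr[i] + curr[i + 1]) / dx ** 2
            nxt[i] = 2.0 * curr[i] - prev[i] + (velocity[i] * dt) ** 2 * lap
        for i, amp in inject(it):
            nxt[i] += amp * dt ** 2
        prev, curr = curr, nxt
        snapshots.append(curr)
    return snapshots

def cross_correlate(source_snaps, receiver_snaps):
    """Zero-lag cross-correlation: sum over time of source * receiver."""
    nx = len(source_snaps[0])
    image = [0.0] * nx
    for s, r in zip(source_snaps, receiver_snaps):
        for i in range(nx):
            image[i] += s[i] * r[i]
    return image

def wavefield_bytes(nx, ny, nz, nt, bytes_per_sample=4):
    """Storage needed to keep every snapshot of a 3D wavefield."""
    return nx * ny * nz * nt * bytes_per_sample

# Toy run: constant velocity, source and receiver at the same grid point.
nx, dx, dt, nt = 101, 10.0, 0.002, 200
velocity = [2000.0] * nx
src, rec = 5, 5
forward = propagate(velocity, dx, dt, nt, lambda it: [(src, ricker(it * dt))])
trace = [snap[rec] for snap in forward]  # stand-in for a recorded trace
# Inject the trace reversed in time, then flip the snapshots so that both
# wavefields share the same time axis before they are cross-correlated.
backward = propagate(velocity, dx, dt, nt, lambda it: [(rec, trace[nt - 1 - it])])
backward.reverse()
image = cross_correlate(forward, backward)
```

With a hypothetical 1000 x 1000 x 1000 grid, 1000 stored time steps and 4-byte samples, `wavefield_bytes` gives 4 x 10^12 bytes, or roughly 4 TB for the forward wavefield of a single shot, which is why the `forward` list in this sketch ends up on disk in a real workflow.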
Not surprisingly then, researchers like Liu et al. have programmed GPUs using CUDA for maximum performance impact when it comes to implementing RTM’s modeling kernel. However, they, along with a number of other researchers, have introduced novel algorithms to address the challenge of disk I/O. As you might anticipate, the novel aspect of their algorithms is in how they make use of the memory hierarchy presented by hybrid-architecture systems based on CPUs and accelerators. (Although CUDA 6 introduced a kernel module to allow for shared memory between CPUs and GPUs in the first quarter of 2014, I am unaware of the resulting contiguous memory being exploited in the case of RTM.) Programming GPUs via CUDA is not for the faint of coding heart. However, the double-digit performance gains achieved using this platform have served only to validate an ongoing investment.
Spark’ing Possibilities for RTM
Inspired by the in-memory applications of GPUs, and informed about the meteoric rise of interest in Apache Spark, the inevitable (and refactored) question for the Rice workshop became: “RTM using Spark? Is there a case for migration?” In other words, rather than work with HDFS and YARN in a Hadoop context, might Spark have more to offer to RTM?
With the caveat that my investigation is at its earliest stages, and that details need to be fleshed out by me and (hopefully!) many others, Spark appears to present the following possibilities for RTM:
- Replace/reduce disk I/O with RDDs. The key innovation implemented in Spark is RDDs – Resilient Distributed Datasets. This in-memory abstraction (please see the 6th reason here for more) has the potential to replace disks in RTM workflows. More specifically, in making use of RDDs via Spark:
- Forward wavefields could reside in memory and be rendered available without the need for disk I/O during the application of the imaging condition – i.e., as forward and receivers’ wavefields are cross-correlated. This is illustrated in a modified version of RTM’s computational workflow above. You should be skeptical about the multiple-TBs of data involved here – as you’re unlikely to have a single system with such memory capacity in isolation. This is where the Distributed aspect of RDDs factors in. In a fashion that mimics Hadoop’s use of distributed, yet distinct disks to provide the abstraction of a contiguous file system, RDDs do the same only with memory. Because RDDs are inherently Resilient, they are intended for clustered environments where various types of failures (e.g., a kernel panic followed by a system crash) are inevitable and can be tolerated. Even more enticing in this use case involving RTM wavefields, the ability to functionally transform datasets using Spark’s built-in capability for partitioning RDDs means that more sophisticated algorithms for imaging RTM’s two wavefields can be crafted – i.e., algorithms that exploit topological awareness of the wavefields’ locality in memory. In confronting the second challenge identified above by Liu et al., an early win for in-memory RTM via RDDs would certainly demonstrate the value of the approach.
- Gathers of seismic data could reside in memory, and be optimally partitioned using Spark for wavefield calculations. Once acquired, reflection-seismic data is written to an industry-standard format (SEG Y rev 1) established by the Society of Exploration Geophysicists (SEG). Gathers are collections of data for pairs of sources and receivers that have depth (typically) in common. (This is referred to as a Common Depth Point or CDP gather by the industry.) RTM is systematically applied to each gather. Although this has not been a point of focus from an algorithms-research perspective, even in the innovative cases involving GPUs, the in-memory possibilities afforded by Spark may be cause for reconsideration. In fact Professor Huang and his students, in the Department of Computer Science at Prairie View A&M University (PVAMU) in the Houston area, have already applied Spark to SEG Y rev 1 format seismic data. In a poster presented at the Rice workshop, not only did Prof. Huang demonstrate the feasibility of introducing RDDs via Spark, he indicated how this use is crucial to a cloud-based platform for seismic analytics currently under development at PVAMU.
- Apply alternate imaging conditions. For each (CDP) gather, coherence between RTM’s two wavefields comprises the basis for establishing the presence of subsurface reflectors of geological origin. Using cross-correlation, artifacts introduced by complex reflector geometries, for example, are de-emphasized as the gather is migrated as-a-whole. Whereas it represents the canonical imaging condition envisaged by the originators of RTM in the 1980s, cross-correlation is by no means the only mechanism for establishing coherence between wavefields. Because Spark includes support for machine learning (MLlib), graph analytics (GraphX) and even statistics (SparkR), alternate possibilities for rapidly establishing imaging conditions have never been more accessible to the petroleum industry. Spark’s analytics upside for imaging conditions is much more about introducing new capabilities than computational performance. For example, parameter studies based upon varying gathers and/or velocity models might serve to reduce the levels of uncertainty inherently present in inverse problems that seek to image the subsurface in areas of potential interest for the exploitation of petroleum resources. Using Spark’ified Genetic Algorithms (e.g., derivative of Spark-complementary ones already written in Scala), for example, criteria could be established for evaluating the imaging conditions resulting from parameter studies – i.e., naturally selecting the most-appropriate velocity model.
- Alternate implementation of the modeling kernel. Is it possible to Spark’ify the RTM modeling kernel? In other words, make programmatic use of Spark to propagate wavefields via the FDM implementation of the 3D wave equation. And even if this is possible, does it make sense? Clearly, this is the most speculative of the suggestions here. Though it is the most speculative, in asking more questions than it presently answers, it is also the most intriguing. How so? At its core, speculation of this kind speaks to the generality of RDDs as a paradigm for parallel computing that reaches well beyond just RTM using FDM, and consequently of Spark as a representative implementation. Without speculating further at this time, I’ll take the 5th, and close conservatively here with: Further research is required.
- Real-time streaming. Spark includes support for streamed data. Whereas streaming seismic data upon acquisition in real time through an RTM workflow appears problematical even to blue-skying me at this point, the notion might find application in related contexts. For example, perhaps a stream-based implementation involving Spark might aid in ensuring the quality of seismic data in near real time as it is acquired, or be used to assess the resolution adequacy in an area of heightened interest within a broader campaign.
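As a thought experiment for the first possibility above, the imaging condition can be phrased as operations on keyed, partitioned records. The sketch below is plain Python and deliberately does not use Spark; the function names and the partitioning by time step are my own assumptions. Its only purpose is to show that the cross-correlation decomposes into a partition, a join on time step, a per-partition partial image, and a final reduction, which is the general shape of computation that pair RDDs are designed to express.

```python
from functools import reduce

def partition_by_key(records, num_partitions):
    """Split (time_step, snapshot) records into partitions by key."""
    parts = [[] for _ in range(num_partitions)]
    for key, value in records:
        parts[key % num_partitions].append((key, value))
    return parts

def join_partition(left, right):
    """Join two co-partitioned lists of (key, value) records on their keys."""
    lookup = dict(right)
    return [(k, (v, lookup[k])) for k, v in left if k in lookup]

def partial_image(joined, nx):
    """Cross-correlate the joined snapshots held by one partition."""
    image = [0.0] * nx
    for _, (s, r) in joined:
        for i in range(nx):
            image[i] += s[i] * r[i]
    return image

def add_images(a, b):
    """Combine two partial images point by point."""
    return [x + y for x, y in zip(a, b)]

def distributed_image(source, receiver, nx, num_partitions):
    """Partition both wavefields the same way, join, image, then reduce."""
    src_parts = partition_by_key(source, num_partitions)
    rec_parts = partition_by_key(receiver, num_partitions)
    partials = [
        partial_image(join_partition(s, r), nx)
        for s, r in zip(src_parts, rec_parts)
    ]
    return reduce(add_images, partials, [0.0] * nx)

# Made-up wavefields: six time steps on a four-point grid.
source = [(t, [float(t + i) for i in range(4)]) for t in range(6)]
receiver = [(t, [float(t * i) for i in range(4)]) for t in range(6)]
serial = distributed_image(source, receiver, 4, 1)
parallel = distributed_image(source, receiver, 4, 3)
```

Because both wavefields are partitioned by the same key, each partial image needs only the snapshots already held by its partition; no snapshot has to move before the final reduction. Whether Spark can hold real, multiple-TB wavefields this way is precisely the open question.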
Incorporating Spark into Your IT Environment
Whether you’re a boutique outfit, a multinational corporation, or something in between, you have an incumbent legacy to consider in upstream-processing workflows for petroleum exploration. Therefore, introducing technologies from Big Data Analytics into your existing HPC environments is likely to be deemed unwelcome at the very least. However, based on a number of discussions at the Rice workshop, and elsewhere in the Houston oil patch, there are a number of reasons why Spark presents as more appealing than Hadoop in complementing existing IT infrastructures:
- Spark can likely make use of your existing file systems;
- Spark will integrate with your HPC workload manager;
- Spark can be deployed alongside your HPC cluster;
- You can likely use your existing code with Spark;
- You could run Spark in a public or private cloud, or even a (Docker) container;
- Spark is not a transient phenomenon – despite the name; and finally
- Spark continues to improve.
Briefly, in conclusion:
- RTM has a past, present and future of being inherently performance-challenged. This means that algorithms research remains topical. Noteworthy gains are being made through the use of GPU programmability involving CUDA.
- Using some ‘novel exploitation’ of HDFS and YARN, Hadoop might afford some performance-related benefits – especially if diskless HDFS is employed. Performance aside, the analytics upside for Hadoop is arguably comparable to that of Spark, even though there would be a need to make use of a number of separate and distinct applications in the Hadoop case.
- Spark is much easier to integrate with an existing HPC IT infrastructure – mostly because Spark is quite flexible when it comes to file systems. Anecdotal evidence suggests that this is a key consideration for organizations involved in petroleum exploration, as they have incumbent storage solutions in which they have made significant and repeated investments. Spark has eclipsed Hadoop in many respects, and the risk of adoption can be mitigated on many fronts.
- From in-memory data distributed in a fault-tolerant fashion across a cluster, to analytics-based imaging conditions, to refactored modeling kernels, to possibilities involving data streaming, Spark introduces a number of possibilities that are already demanding the attention of those involved in processing seismic data.
In making use of Spark in the RTM context, there is the potential for significant depth and breadth. Of course, the application of Spark beyond RTM serves only to deepen and broaden the possibilities. Spark is based on sound research in computer science. It has developed into what it is today on the heels of collaboration. That same spirit of collaboration is now required to determine how and when Spark will be applied in the exploration for petroleum, in other areas of the geosciences, as well as in other industries – possibilities for which have been enumerated elsewhere.
Shameless plug: Interested in taking Spark for a test drive? With Bright Cluster Manager for Apache Hadoop all you need is a minimal amount of hardware on the ground or in the cloud. Starting with bare metal, Bright provides you with the entire system stack from Linux through HDFS (or alternative) all the way up to Spark. In other words, you can have your test environment for Spark in minutes, and get cracking on possibilities for Spark-enabling RTM or almost any other application.
Recently I shared an a-ha! moment on the use of virtual environments for confronting the fear of public speaking.
The more I think about it, the more I’m inclined to claim that the real value of such technology is in targeted skills development.
Once again, I’ll use myself as an example here to make my point.
If I think back to my earliest attempts at public speaking as a graduate student, I’d claim that I did a reasonable job of delivering my presentation. And given that the content of my presentation was likely vetted with my research peers (fellow graduate students) and supervisor ahead of time, this left me with a targeted opportunity for improvement: The Q&A session.
Countless times I can recall having a brilliant answer to a question long after my presentation was finished – e.g., on my way home from the event. Not very useful … and exceedingly frustrating.
I would also assert that this lag, between question and appropriate answer, had a whole lot less to do with my expertise in a particular discipline, and a whole lot more to do with my degree of nervousness – how else can I explain the ability to fashion perfect answers on the way home!
Over time, I like to think that I’ve improved my ability to deliver better-quality answers in real time. How have I improved? Experience. I would credit my experience teaching science to non-scientists at York, as well as my public-sector experience as a vendor representative at industry events, as particularly edifying in this regard.
Rather than submit to such baptisms of fire, and because hindsight is 20/20, I would’ve definitely appreciated the opportunity to develop my Q&A skills in virtual environments such as Nortel web.alive. Why? Such environments can easily facilitate the focused effort I required to target the development of my Q&A skills. And, of course, as my skills improve, so can the challenges brought to bear via the virtual environment.
All speculation at this point … Reasonable speculation that needs to be validated …
If you were to embrace such a virtual environment for the development of your public-speaking skills, which skills would you target? And how might you make use of the virtual environment to do so?
Confession: In the past, I’ve been extremely quick to dismiss the value of Second Life in the context of teaching and learning.
Even worse, my dismissal was not fact-based … and, if truth be told, I’ve gone out of my way to avoid opportunities to ‘gather the facts’ by attending presentations at conferences, conducting my own research online, speaking with my colleagues, etc.
So I, dear reader, am as surprised as any of you to have had an egg-on-my-face epiphany this morning …
Please allow me to elaborate:
- Yesterday, I witnessed a demonstration of Nortel web.alive (dubbed by some as ‘Second Life for business’)
- This morning I was brainstorming content with a colleague for an upcoming presentation on computing resources available for researchers at York
It was at some point during this morning’s brainstorming session that the egg hit me squarely in the face:
Why not use Nortel web.alive to prepare graduate students for presenting their research?
Often feared more than death and taxes, public speaking is an essential aspect of academic research – regardless of the discipline.
As a former graduate student, I could easily ‘see’ myself in this environment with increasingly realistic audiences comprised of friends, family and/or pets, fellow graduate students, my research supervisor, my supervisory committee, etc. Because Nortel web.alive only requires a Web browser, my audience isn’t geographically constrained. This geographical freedom is important as it allows for participation – e.g., between graduate students at York in Toronto and their supervisor who just happens to be on sabbatical in the UK. (Trust me, this happens!)
As the manager of Network Operations at York, I’m always keen to encourage novel use of our campus network. The public-speaking use case I’ve described here has the potential to make innovative use of our campus network, regional network (GTAnet), provincial network (ORION), and even national network (CANARIE) that would ultimately allow for global connectivity.
While I busy myself scraping the egg off my face, please chime in with your feedback. Does this sound useful? Are you aware of other efforts to use virtual environments to confront the fear of public speaking? Are there related applications that come to mind for you? (As someone who’s taught classes of about 300 students in large lecture halls, a little bit of a priori experimentation in a virtual environment would’ve been greatly appreciated!)
Update (November 13, 2009): I just Google’d the title of this article and came up with a few, relevant hits; further research is required.
I bumped into a professional acquaintance last week. After describing briefly a presentation I was about to give, he offered to broker introductions to others who might have an interest in the work I’ve been doing. To initiate the introductions, I crafted a brief description of what I’ve been up to for the past 5 years in this area. I’ve also decided to share it here as follows:
As always, [name deleted], I enjoyed our conversation at the recent AGU meeting in Toronto. Below, I’ve tried to provide some context for the work I’ve been doing in the area of knowledge representations over the past few years. I’m deeply interested in any introductions you might be able to broker with others at York who might have an interest in applications of the same.
Since 2004, I’ve been interested in expressive representations of data. My investigations started with a representation of geophysical data in the eXtensible Markup Language (XML). Although this was successful, use of the approach underlined the importance of metadata (data about data) as an oversight. To address this oversight, a subsequent effort introduced a relationship-centric representation via the Resource Description Framework (RDF). RDF, by the way, forms the underpinnings of the next-generation Web – variously known as the Semantic Web, Web 3.0, etc. In addition to taking care of issues around metadata, use of RDF paved the way for increasingly expressive representations of the same geophysical data. For example, to represent features in and of the geophysical data, an RDF-based scheme for annotation was introduced using XML Pointer Language (XPointer). Somewhere around this point in my research, I placed all of this into a framework.
In addition to applying my Semantic Framework to use cases in Internet Protocol (IP) networking, I’ve continued to tease out increasingly expressive representations of data. Most recently, these representations have been articulated in RDFS – i.e., RDF Schema. And although I have not reached the final objective of an ontological representation in the Web Ontology Language (OWL), I am indeed progressing in this direction. (Whereas schemas capture the vocabulary of an application domain in geophysics or IT, for example, ontologies allow for knowledge-centric conceptualizations of the same.)
From niche areas of geophysics to IP networking, the Semantic Framework is broadly applicable. As a workflow for systematically enhancing the expressivity of data, the Framework is based on open standards emerging largely from the World Wide Web Consortium (W3C). Because there is significant interest in this next-generation Web from numerous parties and angles, implementation platforms allow for increasingly expressive representations of data today. In making data actionable, the ultimate value of the Semantic Framework is in providing a means for integrating data from seemingly incongruous disciplines. For example, such representations are actually responsible for providing new results – derived by querying the representation through a ‘semantified’ version of the Structured Query Language (SQL) known as SPARQL.
I’ve spoken formally and informally about this research to audiences in the sciences, IT, and elsewhere. With York co-authors spanning academic and non-academic staff, I’ve also published four refereed journal papers on aspects of the Framework, and have an invited book chapter currently under review – interestingly, this chapter has been contributed to a book focusing on data management in the Semantic Web. Of course, I’d be pleased to share any of my publications and discuss aspects of this work with those finding it of interest.
With thanks in advance for any connections you’re able to facilitate, Ian.
If anything comes of this, I’m sure I’ll write about it here – eventually!
In the meantime, feedback is welcome.
Just in case you haven’t heard:
… join us for an exciting national summit on innovation and technology, hosted by ORION and CANARIE, at the Metro Toronto Convention Centre, Nov. 3 and 4, 2008.
“Powering Innovation – a National Summit” brings over 55 keynotes, speakers and panelists from across Canada and the US, including best-selling author of Innovation Nation, Dr. John Kao; President/CEO of Internet2 Dr. Doug Van Houweling; chancellor of the University of California at Berkeley Dr. Robert J. Birgeneau; advanced visualization guru Dr. Chaomei Chen of Philadelphia’s Drexel University; and many more. The President of the Ontario College of Art & Design’s Sara Diamond chairs “A Boom with View”, a session on visualization technologies. Dr. Gail Anderson presents on forensic science research. Other speakers include the host of CBC Radio’s Spark Nora Young; Delvinia Interactive’s Adam Froman and the President and CEO of Zerofootprint, Ron Dembo.
This is an excellent opportunity to meet and network with up to 250 researchers, scientists, educators, and technologists from across Ontario and Canada and the international community. Attend sessions on the very latest on e-science; network-enabled platforms, cloud computing, the greening of IT; applications in the “cloud”; innovative visualization technologies; teaching and learning in a web 2.0 universe and more. Don’t miss exhibitors and showcases from holographic 3D imaging, to IP-based television platforms, to advanced networking.
For more information, visit http://www.orioncanariesummit.ca.
How do scientists actually use computers in their day-to-day work?
A Canadian team is conducting a survey to find out:
Computers are as important to modern scientists as test tubes, but we know surprisingly little about how scientists develop and use software in their research. To find out, the University of Toronto, Simula Research Laboratory, and the National Research Council of Canada have launched an online survey in conjunction with “American Scientist” magazine. If you have 20 minutes to take part, please go to:
Thanks in advance for your help!
Jo Hannay (Simula Research Laboratory)
Hans Petter Langtangen (Simula Research Laboratory)
Dietmar Pfahl (Simula Research Laboratory)
Janice Singer (National Research Council of Canada)
Greg Wilson (University of Toronto)
The results of the survey will be shared via American Scientist.
RDF-ization is a term used by the Semantic Web community to describe the process of generating RDF from non RDF Data Sources such as (X)HTML, Weblogs, Shared Bookmark Collections, Photo Galleries, Calendars, Contact Managers, Feed Subscriptions, Wikis, and other information resource collections.
Although Idehen identifies a number of data sources, he does not explicitly identify two data sources I’ve been spending a fair amount of time with over the past few years:
- One source of data is that generated by scientific instruments. With various colleagues, the semantic framework I’ve built around this data source allows for RDF-ization of scientific data from semi-structured ASCII to XML (specifically ESML) to RDF via GRDDL. (Please see the illustration.) In principle, it should be possible to further transform the RDF representation into OWL thus resulting in what I’ve referred to elsewhere as an informal ontology. (According to Morville as well as Shadbolt et al., the RDF-ization of the data sources Idehen identifies results in folksonomies, rather than informal ontologies.) Again with various colleagues, I’ve also made use of RDF to annotate features inherent in the scientific data via XML Pointer Language (XPointer).
- Even more recently, with members of my Network Operations team at York University, I’ve been working with a relational database as a source of data on the topology of IP networks. (Please see the illustration.)
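For readers who have not seen RDF-ization in action, the first pipeline above (semi-structured ASCII to XML to RDF) can be caricatured in a few lines of Python. This is only a sketch under loud assumptions: the sample record is made up, the XML produced is generic rather than ESML, the `http://example.org/` namespace is a placeholder, and the hand-written `xml_to_triples` function merely stands in for the XSLT transformation that GRDDL would actually associate with the XML document.

```python
import xml.etree.ElementTree as ET

# A made-up, semi-structured ASCII record (placeholder values only).
ASCII_RECORD = """\
station: DEMO01
latitude: 43.0
reading: 1.5
"""

def ascii_to_xml(text):
    """Turn 'key: value' lines into child elements of one XML element."""
    root = ET.Element("observation")
    for line in text.splitlines():
        if ":" not in line:
            continue  # skip anything that is not a key/value pair
        key, value = line.split(":", 1)
        ET.SubElement(root, key.strip()).text = value.strip()
    return root

def xml_to_triples(root, base="http://example.org/"):
    """Emit one N-Triples statement per child element.

    Stand-in for a GRDDL transformation: the subject identifies the
    observation, each element name becomes a predicate, and the element
    text becomes a plain literal.
    """
    subject = "<%sobservation/%s>" % (base, root.findtext("station"))
    triples = []
    for child in root:
        predicate = "<%svocab#%s>" % (base, child.tag)
        triples.append('%s %s "%s" .' % (subject, predicate, child.text))
    return triples

xml_root = ascii_to_xml(ASCII_RECORD)
triples = xml_to_triples(xml_root)
```

Each resulting statement is a subject, predicate and object, which is what makes the representation relationship-centric and, in turn, queryable.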
Of course, whether the motivation is personal/social-networking or scientific/IT related, the attention to RDF-ization is win-win for all stakeholders. Why? Anything that accelerates the RDF-ization of non-RDF data sources brings us that much closer to realizing the true value of the Semantic Web.