[Activity] Stream Live Tweets with Spark Streaming!
November 6, 2019

Comments

Ahmed Khalil — July 15, 2020 10:24 pm

Hello Frank,

First, thank you so much for the deep and insightful information throughout all the courses. They are really great, especially compared to other courses out there.

Second, I have a question about the solution architecture for a system my team and I are building. We are collecting financial data from various sources (mostly REST APIs), caching it in Elasticsearch, and then building visualizations and dashboards with some modern UI components. My question concerns the tool we will use to ingest the data from the different sources, since we need to perform some operations on the data while importing it into Elastic, for example:

- Transformations: splitting one record from the source system into multiple documents in Elastic based on some logic.
- Creating aggregated documents: a separate scheduler we plan to build that computes some aggregations and saves them as new documents in Elastic. I know Elastic's own aggregations are great and we will use them; these documents are just for a special purpose.

We tried Logstash, and it was too limiting for the transformations we want to do. Now we are comparing Spark Streaming and Kafka Streams in order to have the most flexibility possible. We have a Java background and are open to learning any language as well.

Looking forward to your feedback, and sorry for the long question.

Best,
Ahmed

Frank Kane — July 16, 2020 8:38 am

My gut reaction would be to start with Spark Streaming if you're looking for the most flexibility in transforming and aggregating the data as it is ingested. Tools such as Logstash and Kafka tend to be better suited to ingesting data produced by large numbers of individual hosts and funneling that data somewhere. Your use case is a bit different: you're just hitting REST APIs for your data, not trying to solve the problem of reliably transmitting data from a large number of systems to a single data repository.

Ahmed Khalil — July 17, 2020 3:33 am

Thanks a lot, Frank. That is roughly the direction we chose: Spark Streaming. In the future, as the system grows, we might add Spark itself between Spark Streaming and Elastic to carry the datasets as-is from the source and keep only the aggregated data in Elastic for optimal performance. Later we might also add Kafka to handle real-time data streams from sources like Google Analytics. This is getting interesting 😀
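For illustration, here is a minimal Spark Structured Streaming sketch in Scala of the "one record in, many documents out" transformation discussed above, writing the result to Elasticsearch via the elasticsearch-hadoop connector. The socket source, JSON schema, field names, and index name are all hypothetical stand-ins for the financial feed described in the question, not the poster's actual pipeline.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

object SplitAndIndex {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .appName("SplitAndIndex")
      .master("local[*]")
      .getOrCreate()

    import spark.implicits._

    // Hypothetical source record, one JSON object per line, e.g.:
    //   {"account":"A1","positions":[{"symbol":"AAPL","qty":10},{"symbol":"MSFT","qty":5}]}
    // A socket source stands in here for whatever layer polls the REST APIs.
    val raw = spark.readStream
      .format("socket")
      .option("host", "localhost")
      .option("port", 9999)
      .load()

    val schema = new StructType()
      .add("account", StringType)
      .add("positions", ArrayType(new StructType()
        .add("symbol", StringType)
        .add("qty", IntegerType)))

    // One record in, many documents out: explode the nested array so that
    // each position becomes its own Elasticsearch document.
    val docs = raw
      .select(from_json($"value", schema).as("rec"))
      .select($"rec.account", explode($"rec.positions").as("pos"))
      .select($"account", $"pos.symbol", $"pos.qty")

    // With elasticsearch-hadoop on the classpath, the stream can be written
    // straight to an index (the index name is a hypothetical placeholder).
    val query = docs.writeStream
      .format("es")
      .option("checkpointLocation", "/tmp/checkpoints/split-and-index")
      .start("financial-positions")

    query.awaitTermination()
  }
}
```

The explode on the nested array is what turns a single source record into multiple documents; for splitting logic that is harder to express in SQL functions, a flatMap over a typed Dataset would serve the same purpose.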