Big Data Hadoop + Spark


1. Python & PySpark Programming Gautam Verma
2. Agenda: Day 2
Hands-on with Pandas: 1. Pandas overview 2. DataFrames & Series 3. Object creation 4. Indexing and selection operations in pandas 5. Plotting & visualization
Introduction to Big Data: 1. Distributed processing 2. Hadoop vs Spark
3. Some facts about Big Data Analytics. Before getting started, some underappreciated facts about analytics:
1) Data is never clean.
2) You will spend most of your time cleaning and preparing data (a small pandas sketch follows below).
3) There is no fully automated data science; you need to get your hands dirty.
4) 95% of tasks do not require deep learning.
5) Analytics can be classified into three broad categories: descriptive, predictive, and prescriptive.
6) Not all data generated online is crucial.
7) Data science is closely associated with IoT: IoT is all about generating data, and data science is about analyzing it.
8) Most models are wrong, but some are useful.
9) High-quality data beats a fancy model every time.
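To make facts 1 and 2 concrete, here is a minimal pandas cleaning sketch using a tiny made-up table of song play counts (all column names and values are hypothetical):

```python
import pandas as pd

# A hypothetical, messy table of song play counts
df = pd.DataFrame({
    "song": ["Track A", "Track B", None, "Track D"],
    "plays": ["1200", "n/a", "560", "980"],
})

df = df.dropna(subset=["song"])                            # drop rows missing a key field
df["plays"] = pd.to_numeric(df["plays"], errors="coerce")  # "n/a" becomes NaN
df["plays"] = df["plays"].fillna(df["plays"].median())     # impute missing values
print(df)
```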
4. What is Big Data? Larger, more complex data sets, especially from new data sources.
The 4 Vs of Big Data:
Volume: high volumes of low-density, unstructured data. For some organizations this might be tens of terabytes of data; for others, hundreds of petabytes.
Variety: unstructured and semi-structured data types, such as events, logs, text, audio, video, and images, each with its own metadata.
Velocity: the fast rate at which data is received and (perhaps) acted on; the concept of real-time and near-real-time processing and action.
Veracity: the correctness of the data being processed.
Discussion: what could be Big Data in the music industry?
5. Data Analytics Architectures. What are the limitations of existing data architectures?
Scalability: monolithic data pipelines; vertical scaling only (buy better hardware); a recurring need to add CPUs and/or memory; no support for a variety of data sources; not extensible enough.
Costs: expensive hardware; expensive licenses; resource-intensive maintenance and support teams.
Performance: low throughput; network congestion; query times increase as the database grows.
6. How does Hadoop solve the problem? Hadoop enables an interconnection of commodity hardware instead of a single high-end machine.
Key characteristics:
Distributed system: several independent nodes participate in processing large volumes and varieties of structured, semi-structured, and unstructured data; files are made up of blocks (64 or 128 MB).
Scalable: scales from a single server to thousands of machines, each offering local computation and storage.
Discussion: Who uses Hadoop? Who owns the biggest Hadoop distribution in the world?
7. How does Hadoop solve the problem? (diagram; credits: ownself.me)
8. What is Hadoop? A "framework that allows distributed processing of large data sets across clusters of computers" (Apache Hadoop project).
Key characteristics: scalability, high availability, fault tolerance, economical, distributed storage, replication.
9. Hadoop Ecosystem (ecosystem diagram)
10. Hadoop Ecosystem: the different components of the Hadoop ecosystem
File system: Hadoop Distributed File System (HDFS), Apache Cassandra, Apache HBase
Data ingestion: Flume, Sqoop
Data processing: MapReduce, Pig & Hive, Mahout, Spark
Support: scheduling with YARN (Yet Another Resource Negotiator) and Oozie; management & monitoring with Ambari; data visualization with Zeppelin
11. Hadoop Distributed File System (HDFS): the storage layer of the "framework that allows distributed processing of large data sets across clusters of computers."
Key characteristics:
• Scales from a single server to thousands of machines, each offering local computation and storage
• Suitable for applications with large data sets
• Best suited for batch processing of data
• Write-once-read-many: a file once created, written, and closed need not be changed
• Each file is a sequence of blocks (generally 64 MB each); see the sizing sketch below
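A back-of-the-envelope sketch of the block model. The values are assumptions for illustration: a hypothetical 1 GB file, a 128 MB block size, and HDFS's common default replication factor of 3:

```python
import math

file_size_mb = 1024   # a hypothetical 1 GB file
block_size_mb = 128   # 64 MB on older clusters; 128 MB is a common default
replication = 3       # HDFS's usual default replication factor

num_blocks = math.ceil(file_size_mb / block_size_mb)  # blocks the file splits into
raw_storage_mb = file_size_mb * replication           # total raw storage consumed

print(f"{num_blocks} blocks, {raw_storage_mb} MB of raw storage")  # 8 blocks, 3072 MB
```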
12. (overview diagram: Analytics, Big Data, Spark, MapReduce)
13. Why MapReduce? A Big Data processing framework for distributed clusters of machines:
 Massive data processing on commodity clusters
 Simple programming model
 Scalable
 Parallel computation
 Fault tolerant
 Ease of querying big data
14. MapReduce: key-value pairs. Inputs and outputs are sequences of key-value pairs, e.g. the word count problem (a sketch follows below).
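A minimal word-count sketch in the MapReduce style: the mapper emits (word, 1) pairs, a sort simulates the framework's shuffle phase, and the reducer sums the counts per key. On a real cluster the mapper and reducer would be separate programs run by the framework; the input lines here are made up:

```python
from itertools import groupby
from operator import itemgetter

def mapper(line):
    # map: emit a (word, 1) pair for every word in the line
    for word in line.lower().split():
        yield (word, 1)

def reducer(word, counts):
    # reduce: sum all counts seen for one word
    return (word, sum(counts))

lines = ["the quick brown fox", "the lazy dog", "the fox"]

pairs = [kv for line in lines for kv in mapper(line)]  # map phase
pairs.sort(key=itemgetter(0))                          # shuffle & sort by key
for word, group in groupby(pairs, key=itemgetter(0)):  # reduce phase
    print(reducer(word, (count for _, count in group)))
# ('brown', 1) ('dog', 1) ('fox', 2) ('lazy', 1) ('quick', 1) ('the', 3)
```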
15. MapReduce: end-to-end workflow of a MapReduce job (workflow diagram)
16. Traditional RDBMS way vs Hadoop MR way
Data size: RDBMS handles GBs of data; Hadoop handles petabytes.
Updates: RDBMS reads and writes as many times as needed; Hadoop reads many times, but writes are limited (write-once-read-many).
Data types: RDBMS works with structured data; Hadoop works with structured, semi-structured, and unstructured data.
Scalability: RDBMS scales vertically; Hadoop scales horizontally.
Processing: RDBMS supports OLTP; Hadoop supports batch processing (OLAP).
Cost: RDBMS requires licensed software; Hadoop is open source (free).
17. MapReduce vs Spark
MapReduce: limited applicability for real-time data processing.
 Multiple MR steps are needed to compute the final metric, because the data doesn't fit in memory
 Data imbalance: some machines finish sooner, others later, so resources are used inefficiently and processing doesn't scale linearly
 The next MR step can't start until the current one has fully completed
Spark: speed, caching, polyglot APIs, real-time processing.
 Lazy evaluation and in-memory computation
18. Spark uses memory instead of disk.
Hadoop uses disk for data sharing: HDFS read → Iteration 1 → HDFS write → HDFS read → Iteration 2 → HDFS write.
Spark uses in-memory data sharing: HDFS read → Iteration 1 → Iteration 2, with intermediate results kept in memory (see the caching sketch below).
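A PySpark sketch of the same idea, assuming a local Spark installation; the HDFS path is hypothetical. The first action reads from storage and fills the cache; the second is served from memory instead of re-reading HDFS:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cache-demo").getOrCreate()
sc = spark.sparkContext

# hypothetical input path; .cache() marks the RDD for in-memory reuse
data = sc.textFile("hdfs:///data/events.log").cache()

total = data.count()                                   # iteration 1: reads storage, fills cache
errors = data.filter(lambda l: "ERROR" in l).count()   # iteration 2: served from memory
print(total, errors)
```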
19. MapReduce vs Spark
MapReduce:
 Difficult to program; requires external abstractions
 Used for descriptive/retrospective analysis of data
 No built-in interactive mode, except through tools like Pig and Hive
 Does not leverage the memory of the Hadoop cluster to the maximum
 Can only process a batch of stored data
Spark:
 Easy to program; does not require extra abstractions
 Performs streaming, batch processing, and machine learning, all in the same cluster
 Has a built-in interactive mode
 Executes jobs 10 to 100 times faster than Hadoop MapReduce
 Lets programmers work on data in near real time through Spark Streaming
20. Spark. Why Spark? Because disk operations are far slower than memory operations. Spark is a fast, in-memory data processing engine with elegant and expressive development APIs that let data workers efficiently execute streaming, machine learning, or SQL workloads.
Data structures: RDDs / DataFrames / Datasets
RDD operations: transformations (e.g. map) and actions (e.g. reduce); a sketch follows below.
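A minimal PySpark sketch of the transformation/action split: map is a lazy transformation that only builds up the lineage, while reduce is an action that actually triggers the job (assumes a local Spark installation):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-ops").getOrCreate()
sc = spark.sparkContext

rdd = sc.parallelize(range(1, 5))
squared = rdd.map(lambda x: x * x)           # transformation: nothing runs yet
total = squared.reduce(lambda a, b: a + b)   # action: the job executes now
print(total)                                 # 1 + 4 + 9 + 16 = 30
```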
21. RDD vs DataFrame vs Dataset
Safety: RDD is type safe (and immutable); DataFrame is not type safe; Dataset is type safe.
Optimization: with RDDs the developer has to take care of optimization; DataFrames and Datasets are auto-optimized.
Performance: RDDs and DataFrames are not the best; Datasets give the best performance.
Memory efficiency: RDDs and DataFrames are not memory efficient; Datasets are more memory efficient.
A PySpark sketch follows below.
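A small PySpark sketch of the first two columns. Note that the typed Dataset API exists only in Scala and Java; in PySpark the DataFrame is the untyped, auto-optimized structure from the comparison above (the sample rows are made up):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-vs-df").getOrCreate()
sc = spark.sparkContext

rows = [("alice", 34), ("bob", 29)]

# RDD: opaque Python objects; the developer optimizes by hand
rdd = sc.parallelize(rows)
print(rdd.filter(lambda r: r[1] > 30).collect())

# DataFrame: named, typed columns; Spark's optimizer plans the query
df = spark.createDataFrame(rows, ["name", "age"])
df.filter(df.age > 30).show()
```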
22. Spark use cases. A brief set of categories of Spark use cases:
Interactive query: enterprise-scale data volumes accessible to interactive query for business intelligence (BI); faster time to job completion (see the Spark SQL sketch below).
Large-scale batch: data cleaning to improve data quality (missing data, entity resolution, unit mismatches, etc.); nightly ETL processing from production systems.
Complex analytics: forecasting vs "nowcasting" (e.g. Google Search queries analyzed en masse for Google Flu Trends to predict outbreaks); data mining across various types of data.
Event processing: web server log file analysis (human-readable file formats that are rarely read by humans) in near real time; responsive monitoring of RFID-tagged devices.
Model building: predictive modeling to answer "what will happen?"; self-tuning machine learning with continually updating algorithms.
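A hedged sketch of the interactive-query category: register a DataFrame as a temporary view and query it with SQL (the table and column names are invented for illustration):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("bi-query").getOrCreate()

# hypothetical sales data
sales = spark.createDataFrame(
    [("north", 120.0), ("south", 80.0), ("north", 45.0)],
    ["region", "amount"],
)
sales.createOrReplaceTempView("sales")

# ad-hoc SQL over the in-memory view, as a BI tool or analyst would issue it
spark.sql("SELECT region, SUM(amount) AS total FROM sales GROUP BY region").show()
```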
23. Spark applications. Spark is embedded in many types of products, across a multitude of industries and organizations. (Source: 2015 Databricks Spark Survey)
24. Spark ecosystem: the different components and the interplay among them (ecosystem diagram)
26. "From a little spark may burn a flame." (Dante Alighieri)
27. Q&A: Questions & Answers
28. Thank you!