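The book is divided into three parts. Part I discusses the fundamental principles that underpin the design of data-intensive applications. Chapter 1 sets out the goals: reliability, scalability, and maintainability, how to think about them, and how to achieve them. Chapter 2 compares several data models and query languages and discusses the situations each is suited to. Chapter 3 turns to storage engines, that is, how databases arrange data on disk so that it can be retrieved efficiently. Chapter 4 covers data encoding (serialization), including how schemas evolve over time. Part II moves from data stored on a single machine to distributed systems spanning multiple machines. This is an important step for scalability, but it brings a variety of challenges. It discusses in turn replication (Chapter 5), partitioning (Chapter 6), and transactions (Chapter 7). Chapter 8 goes into more detail on the problems of distributed systems, and Chapter 9 covers how to achieve consistency and consensus in a distributed environment. Part III deals with systems that produce derived data. Derived data arises in heterogeneous systems: when no single data store can solve every problem, a natural approach is to integrate several different databases, caches, indexes, and so on. Chapter 10 begins with batch processing of derived data, followed by stream processing in Chapter 11. Chapter 12 summarizes the techniques introduced earlier and discusses possible new directions and approaches for building reliable, scalable, and maintainable applications in the future.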
2023-03-13 17:44:46 12.76MB design, data-intensive, applications
Designing Data-Intensive Applications (数据密集型应用系统设计)
2023-02-19 16:58:10 293.36MB backend
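Designing Data-Intensive Applications: a good book. I learned a great deal from it, and it is well worth reading for anyone who works with databases.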
2022-11-17 09:31:55 19.89MB data, databases
Complete text edition (English) with a bookmarked table of contents. It introduces the principles of distributed systems and is a very, very good book. Author: Martin Kleppmann. The table of contents is as follows:

Part I. Foundations of Data Systems

1. Reliable, Scalable, and Maintainable Applications
   Thinking About Data Systems; Reliability; Hardware Faults; Software Errors; Human Errors; How Important Is Reliability?; Scalability; Describing Load; Describing Performance; Approaches for Coping with Load; Maintainability; Operability: Making Life Easy for Operations; Simplicity: Managing Complexity; Evolvability: Making Change Easy; Summary

2. Data Models and Query Languages
   Relational Model Versus Document Model; The Birth of NoSQL; The Object-Relational Mismatch; Many-to-One and Many-to-Many Relationships; Are Document Databases Repeating History?; Relational Versus Document Databases Today; Query Languages for Data; Declarative Queries on the Web; MapReduce Querying; Graph-Like Data Models; Property Graphs; The Cypher Query Language; Graph Queries in SQL; Triple-Stores and SPARQL; The Foundation: Datalog; Summary

3. Storage and Retrieval
   Data Structures That Power Your Database; Hash Indexes; SSTables and LSM-Trees; B-Trees; Comparing B-Trees and LSM-Trees; Other Indexing Structures; Transaction Processing or Analytics?; Data Warehousing; Stars and Snowflakes: Schemas for Analytics; Column-Oriented Storage; Column Compression; Sort Order in Column Storage; Writing to Column-Oriented Storage; Aggregation: Data Cubes and Materialized Views; Summary

4. Encoding and Evolution
   Formats for Encoding Data; Language-Specific Formats; JSON, XML, and Binary Variants; Thrift and Protocol Buffers; Avro; The Merits of Schemas; Modes of Dataflow; Dataflow Through Databases; Dataflow Through Services: REST and RPC; Message-Passing Dataflow; Summary

Part II. Distributed Data

5. Replication
   Leaders and Followers; Synchronous Versus Asynchronous Replication; Setting Up New Followers; Handling Node Outages; Implementation of Replication Logs; Problems with Replication Lag; Reading Your Own Writes; Monotonic Reads; Consistent Prefix Reads; Solutions for Replication Lag; Multi-Leader Replication; Use Cases for Multi-Leader Replication; Handling Write Conflicts; Multi-Leader Replication Topologies; Leaderless Replication; Writing to the Database When a Node Is Down; Limitations of Quorum Consistency; Sloppy Quorums and Hinted Handoff; Detecting Concurrent Writes; Summary

6. Partitioning
   Partitioning and Replication; Partitioning of Key-Value Data; Partitioning by Key Range; Partitioning by Hash of Key; Skewed Workloads and Relieving Hot Spots; Partitioning and Secondary Indexes; Partitioning Secondary Indexes by Document; Partitioning Secondary Indexes by Term; Rebalancing Partitions; Strategies for Rebalancing; Operations: Automatic or Manual Rebalancing; Request Routing; Parallel Query Execution; Summary

7. Transactions
   The Slippery Concept of a Transaction; The Meaning of ACID; Single-Object and Multi-Object Operations; Weak Isolation Levels; Read Committed; Snapshot Isolation and Repeatable Read; Preventing Lost Updates; Write Skew and Phantoms; Serializability; Actual Serial Execution; Two-Phase Locking (2PL); Serializable Snapshot Isolation (SSI); Summary

8. The Trouble with Distributed Systems
   Faults and Partial Failures; Cloud Computing and Supercomputing; Unreliable Networks; Network Faults in Practice; Detecting Faults; Timeouts and Unbounded Delays; Synchronous Versus Asynchronous Networks; Unreliable Clocks; Monotonic Versus Time-of-Day Clocks; Clock Synchronization and Accuracy; Relying on Synchronized Clocks; Process Pauses; Knowledge, Truth, and Lies; The Truth Is Defined by the Majority; Byzantine Faults; System Model and Reality; Summary

9. Consistency and Consensus
   Consistency Guarantees; Linearizability; What Makes a System Linearizable?; Relying on Linearizability; Implementing Linearizable Systems; The Cost of Linearizability; Ordering Guarantees; Ordering and Causality; Sequence Number Ordering; Total Order Broadcast; Distributed Transactions and Consensus; Atomic Commit and Two-Phase Commit (2PC); Distributed Transactions in Practice; Fault-Tolerant Consensus; Membership and Coordination Services; Summary

Part III. Derived Data

10. Batch Processing
   Batch Processing with Unix Tools; Simple Log Analysis; The Unix Philosophy; MapReduce and Distributed Filesystems; MapReduce Job Execution; Reduce-Side Joins and Grouping; Map-Side Joins; The Output of Batch Workflows; Comparing Hadoop to Distributed Databases; Beyond MapReduce; Materialization of Intermediate State; Graphs and Iterative Processing; High-Level APIs and Languages; Summary

11. Stream Processing
   Transmitting Event Streams; Messaging Systems; Partitioned Logs; Databases and Streams; Keeping Systems in Sync; Change Data Capture; Event Sourcing; State, Streams, and Immutability; Processing Streams; Uses of Stream Processing; Reasoning About Time; Stream Joins; Fault Tolerance; Summary

12. The Future of Data Systems
   Data Integration; Combining Specialized Tools by Deriving Data; Batch and Stream Processing; Unbundling Databases; Composing Data Storage Technologies; Designing Applications Around Dataflow; Observing Derived State; Aiming for Correctness; The End-to-End Argument for Databases; Enforcing Constraints; Timeliness and Integrity; Trust, but Verify; Doing the Right Thing; Predictive Analytics; Privacy and Tracking; Summary

Glossary
Index
2022-06-09 10:02:14 21.55MB distributed systems, big data, technical concepts
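Wireless sensor network (WSN) technology has been used to monitor natural disasters for more than a decade. Disasters can be monitored closely by deploying a variety of sensors, and WSNs offer the advantages of (1) low cost, (2) fast response, and (3) scalability and flexibility. Natural disaster monitoring with WSNs is a well-known data-intensive application with high bandwidth requirements and strict latency constraints. It is a typical example of a data-intensive application running on a low-cost, scalable system. In this study, we first review representative works, classifying them by the application domain of WSNs in disaster monitoring and by the optimization techniques that distinguish them from general-purpose WSNs. We then describe the design of an early warning system for geological hazards in reservoir regions. The system relies on WSN technology and, inspired by existing work, focuses on three problems: (1) supporting reliable data transmission, (2) handling massive data from heterogeneous sources and of heterogeneous types, and (3) reducing energy consumption. The study proposes a dynamic routing protocol, a method for network recovery, and a method for managing mobile nodes, so as to achieve real-time and reliable data transmission. The system incorporates data fusion and reconstruction methods that bring all data together into a single view of the geological hazard being monitored. A distributed algorithm for the joint optimal control of power and rate has been developed; it improves network utility (> 95%) and minimizes energy consumption (a reduction of more than 20% compared with LEACH). Experimental results demonstrate the potential of the proposed methods to meet the demands of geological hazard early warning.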
2022-05-02 15:08:31 768KB research paper
The Lustre File System for Data-Intensive Applications
2022-04-27 12:04:07 1.32MB documents, source code and software
Designing Data-Intensive Applications (设计数据密集型应用): Chinese text edition, complete.
2022-04-05 01:48:07 15.55MB databases, data-intensive
A good book: it analyzes problems, solves them, and connects theory with practice. Douban rating: 10.
2022-01-22 23:10:00 20.62MB data systems, system design
Designing Data-Intensive Applications, high-resolution edition (数据密集型应用系统设计高清版.zip)
2021-12-18 17:13:21 240.72MB PDF
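Data-intensive applications are pushing the boundaries of what is possible by making use of these technological advances. An application is called data-intensive if data is its primary challenge (the volume of data, the complexity of data, or the speed at which it changes), as opposed to compute-intensive, where processor speed is the bottleneck.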
2021-11-30 18:27:24 24.46MB data-intensive