HappyFather さんのプロフィールHappy Fatherブログリスト ツール ヘルプ

HappyFather

フィード

このモジュールでは RSS フィードが指定されていません。
4月15日

Google BigTable

Big Table
  *) Definition
     a sparse, distributed, persistent multiple dimensional sorted map.The map is indexed by row key, column key, and a timestamp;
    each value in the map is an uninterpreted array of bytes.
  *) Goal
     o) petabytes data
     o) thousands of machines
     o) reliability
     o) scalability
     o) high performance
  *) Data Model
     o) (row:string, column:string, time:int64) -> string

     o) Rows
        .) key is arbitrary string, up to 64 KB, typical used is 10-100 bytes
        .) Bigtable maintains data in lexicographic order by row key
        .) row range is dynamically partitioned, which is call tablet
      o) Columns
        .) family:qualifier
        .) column keys are grouped into sets call column famlies
        .) collumn family is basic unit for access control
           >) add data
           >) read data
           >) create derived column family
      o) timestamp
        .) 64 bit
        .) real time in microseconds in typically
        .) unique
  *) APIs
      o) create/delete/change tables
      o) create/delete/change families
      o) change access control rights
      o) write/delete values
      o) lookup value from individual rows
      o) iterate over a subset of data
      o) single row atomic transaction
      o) supports the execution of client-supplied scripts in the address
          spaces of the servers, which is developed in Sawzall
      o) examples
         {{{ language = C++
          example 1:
// Open the table
Table *T = OpenOrDie("/bigtable/web/webtable");
// Write a new anchor and delete an old anchor
RowMutation r1(T, "com.cnn.www");
r1.Set("anchor:www.c-span.org", "CNN");
r1.Delete("anchor:www.abc.com");
Operation op;
Apply(&op, &r1);
}}}

{{{ language = C++
Scanner scanner(T);
ScanStream *stream;
stream = scanner.FetchColumnFamily("anchor");
stream->SetReturnAllVersions();
scanner.Lookup("com.cnn.www");
for (; !stream->Done(); stream->Next()) {
printf("%s %s %lld %s\n",
scanner.RowName(),
stream->ColumnName(),
stream->MicroTimestamp(),
stream->Value());
}
}}}
  *) building blocks
    o) Scheduler (Google WorkQueue)
    o) GFS
    o) Chubby Lock service
        .) highl-available and persistent distributed lock service
        .) five active replicas, one of which is master and actively serve requests
        .) Paxos algorithm to keep its replicas consistent in face of failure
        .) client needs renew its session lease before it expires
        .) big table use it for
           >) ensure at most one active master at any time
           >) store bootstap location of big table data
           >) discover tablet servers and finalize tablet server deaths
           >) store bigtable schema information(column family for each table)
           >) store access control lists
           >) single cluster most affected by Chubby unavailability was 0.0326%.
    o) SSTable file format
        .) persistent, ordered immuatable map from keys to values
           where both keys and values are arbitrary byte strings
        .) support look up value with a key, or iterate in a specified key range.
        .) each SSTable contains a sequence of blocks(typically 64 KB)
        .) A block index is used to locate blocks(stored at the end of SSTable)
        .) binary search to locate     
    O) related project
        .) MapReduce
        .) Sawzall
  *) implemetation
    o) components
       .) library link into client
          >) client directly communicate tablet server for read/write data
          >) cache tablet server location
       .) master server
          >) assign tablet to tablet server
          >) detect addition/expiration of tablet server
          >) balance tablet server load
          >) garbage collection of GFS
          >) schema change such as table/column family creation
       .) tablet servers
          >) support dynamically add/remove
          >) manage ten - thousand tablet each server
          >) handle read/write request for tablet
          >) split tablet if it's too large, each tablet is 100MB-200MB
     o) tablet location
       

      .) 3 level B+ tree
      .) Root tablet never split, and saved in a Chubby file
      .) each METADATA row stores 1KB data
      .) 128MB limitation for tablets, so three-level can stores 2^17 * 2^17 = 2^34 tablets, and totally 2^34 * 2^27=2^61 bytes data
      .) client locate the tablet
         >) move up tablet if not found or incorrect
         >) client needs 3 roundtrip if cache is empty(includes one Chubby read)
         >) use preread to reduce the cost of client roundtrip
      .) secondary informantion in the METADATA table includes a log of all events pertaining to each tablet for debug/performance purpose
   o) tablet assignment
      .) each tablet assigned to one tablet server at a time
      .) master keeps track each tablet server, each tablet assignment and each unassigned tablet
      .) master send tablet load instruction to tablet for assignment
      .) use Chubby to keep track of tablet server
         >) once tablet sever starts, creates and requires an exclusive lock on a uniquely-named file in a specific Chubby directory
         >) master monitor the directory to discover tablet servers
         >) tablet server needs kill itself once lose lock of the Chubby file
         >) tablet server needs release the lock once it quit for more quickly tablet reassignment
     .) Master monintor tablet server
        >) though Chubby lock once the tablet server quit
        >) periodly query status of tablet server, once found the tablet server lost, will try to aquire the lock of the server and deleting the file.
            Once succeed, reassign the tablets
        >) Master kills itself once Chubby fail
      .) Master start work
        >) started by cluster management system
        >) grabs a unique master lock in Chubby
        >) scans the servers directory in Chubby to find the live tablet servers
        >) communicates with every live tablet server to discover tablets assignment
        >) scans the  METADATA table  to learn the set of tablets
      .) tablet split
    o) tablet serving

       .) updates
          >) persistent state stored in GFS
          >) updates are commited to commit log that stores the redo command
          >) recent commit ones store in memtable which is sorted buffer in memory, while older updates are stored in sequence of SSTables
       .) recovery
          >) tablet server reads its metadata from METADATA table, where contains the list of SSTables that comprise a tablet and a set of a redo points.
          >) reads the indices of SSTables into memory and reconstruct the memtable by apply all updateds have commmit since the redo points.
       .) write
          >) Authorization is performed by reading the list of permmited writers from a Chubby file
          >) mutation is written to commit log
          >) group commit to improve the throughput of logs of small mutations
          >) update memtable after all done
       .) read
          >) check Authorization first
          >) merge the view of the sequence of SSTables and memtable
   o) Compactions
       .) once memtable size reach the threshold, freeze the memtable, create a new memtable, and convert the frozen memtable to SSTable and write into GFS.
       .) minor compaction
          >) two goals, shrink the memory usage of tablet server; reduces the amount of data that has to read from the commit log
          >) incoming read/write still continue on the new memtable.
          >) merge compaction once compaction happened exceed the given boundry
       .) major compaction
          >) rewrite SSTable into exactly one SSTable
          >) really delete data.
 *) refinement
    o) locality group
        >) clients can group multiple column families together into a local group
        >) a seperate SSTable is generated for each locality group in each tablet
        >) segregate column families which is not accessed typically into a seperate locality group enable more efficient read
        >) locality group can declared in-memory
    o) compression
        >) client can control whether compress the SSTable for a local group, and also control which format used
        >) client supply format is used for each SSTable block(size is controllable via local group parameter)
        >) many client use two-pass custom compression format.
            ?> first pass uses Bentley and McIlroy's scheme, which compresses long commonn strings across a large window
            ?> second pass uses fast compression algorithm that looks for repetitions in a small 16 KB window of the data
            ?> both pass very fast, encode at 100-200MB/s, and decode at 400-1000MB/s
       >) the scheme works very well, even acheived 10-to-1 reduction in space comparing with typical gzip reductions of 3-to-1 or 4-to-1.
           And it works even better for multiple versions of the same value in Bigtable.
    o) cache for read performance
       >) two levels of caching in tablet server
           ?> Scan cache, high level cache for SSTable, for most readed values
           ?> Block cache, low level cache for GFS, for close location values
    o) Bloom filters
        >) Bloom filters allow us to ask whether an SSTable might contain any data for a specified row/column pair
        >) drasitcally reduce the number of disk seeks required for read operations
        >) most lookups for non-existent rows or columns do not need to touch disk
    o) commit-log implemetation
        >) append mutations to a single commit log per tablet server.
        >) once need recover mutations from the file, partition the log into 64MB segments, then parallelly sorting the log file
            in order of the keys <table, row name, log sequence number>.
        >) to protect mutations from GFS latency spikes
            ?> each tablet server actually has two log writing threads: each writing to its
                 own log file
            ?> only one of these two threads is actually used at a time.
            ?> swich once the active thread performing poorly, and mutations
            ?> sequence numbers to elide duplicated entries during the switch process
     o) speeding up tablet recovery
         >) two pass compress for the tablet
     o) exploiting immutability
         >) SSTables are immutable
         >) each memtable row copy-on-write
         >) permanently removing deleted data is transformed to garbage collecting obsolete SSTables
         >) child tablets share the SSTables of the parent tablet to make to split tablet quickly
  *) performance evaluation
     o) configuration
        >) N tablet servers (N is varied), each has 1GB memory
        >) GFS cell, 1786 machines with two 400GB IDE hard drivers
        >) N client machines
        >) each machine has two dual-core Opteron 2GHz chips, single gigabit Ethernet link
        >) two-level tree-shaped switched network 100-200 Gbps of aggregate bandwidth available at the root
        >) round-trip time between any pair of machines was less than a millisecond
        >) each machine could run multiple jobs: tablet server, client server, processes from other jobs
        >) R, distinct number of Bigtable row keys. 1GB data per tablet server
        >) R is split to 10 clients.
        >) sequential write random data to uniformly across entire row space
        >) sequential read similar with write, random read
        >) sequential scan
        >) random reads(mem), make sure the read doesn't need require GFS.
    o) single tablet-server performance
        >) random reads 1212/seconds
            ?> slowest
            ?> each needs 64KB SSTable over the network from GFS to a tablet server.
            ?> 75MB/s data rate
            ?> typical Bigtable application uses block size 8KB.
        >) random mem read
            ?> quite fast, since needn't GFS
        >) random/seqnential write
            ?> no difference
            ?> efficient commit log
        >) sequential read
            ?> better than random read
            ?> due to block cache
        >) scan
            ?> even faster
            ?> one RPC return a large number of values
    o) scaling
        ?> aggregate throughput increases dramatically by over a factor of a hundred
        ?> 300 X read mem improvement for tablet server number increase from 1 to 500,
              since the bottleneck is cpu.
        ?> not increase linearly could because:
        ?> imbalance of load in multiple server configuration
             ?) rebalancing is throttled to reduce the number of tablet movements
             ?) load shifts
        ?> the worst case still is 100X, for 500X server number increase
  *) Real applications
     o) 388 non-test Bigtable clusters as of Auguest 2006, with total of 24,500 tablet servers
     o) 14 busy clusters with 8069 total tablet servers, aggregate volume > 1.2 M requests/seconds
         with incoming RPC traffic of about 741 MB/s and outgoing RPC traffic of about 16GB/s
      o) Google Analytics
          ?> ~200TB      
          ?> row name: <site name, time created>
          ?> 14% compaction
          ?> summary table ~ 20TB, 29% compaction, throughput is limited by GFS throughput.
      o) Google Earth
          ?> ~70 TB
          ?> needn't compression, since the pictures are already efficiently compressed
          ?> row name: geographic information
          ?> column family: source of data
          ?> MapReduce over BT
          ?> 1MB/sec
          ?> ~500GB index table
          ?> server ~10K/s/datacenter queries with low latency.
          ?> contained in hundreds of tablet servers and in-memory column family.
      o) Personalized Search
          ?> row name: <userid>
          ?> MapReduce over BT


  *) references
     google labs: big table
     http://andrewhitchcock.org/?post=214
     wiki: big table


--
------------------------------------
Stay hungry, stay foolish
2月22日

read notes for mapreduce.

. Key Point
*) A programming framework
*) User only needs implement two functions: Map & Reduce, where:
*) Input & Output: each a set of key/value pairs [[BR]]
*) Programmer specifies two functions: [[BR]]
map (in_key, in_value) -> list(out_key, intermediate_value) [[BR]]
*) Processes input key/value pair [[BR]]
*) Produces set of intermediate pairs [[BR]]
reduce (out_key, list(intermediate_value)) -> list(out_value) [[BR]]
*) Combines all intermediate values for a particular key [[BR]]
*) Produces a set of merged output values (usually just one) [[BR]]
*) Inspired by similar primitives in LISP and other languages [[BR]]
*) Refinement
*) Fault tolerance: Handled via re-execution
*) Redundant Execution
*) Locality Optimization
*) Skipping Bad Records
*) Other could be useful refinement
*) Sorting guarantees within each reduce partition
*) Compression of intermediate data
*) Combiner: useful for saving network bandwidth
*) Local execution for debugging/testing
*) User-defined counters
*) Case study
*) configuration
*) 1800 machines
*) 2x 2GHz Intel Xeon processors with Hyper-Threading enabled
*) 4GB of memory
*) 160GB IDE
*) two-level tree-shaped switched network
*) 100-200 Gbps of aggregate bandwidth
*) 1-1.5GB was reserved by other tasks running on the cluster
*) evaluation in weekend afternoon
*) distributed grep
*) 10^10 100-byte records
*) relatively rare three-character pattern (the pattern
occurs in 92,337 records).
*) M = 15000, R = 1
*) 150 seconds (one minute delay due to setup)
*) data processing at 30 GB/s when 1764 workers
*) distributed sort
*) 10^10 100-byte records (TeraSort benchmark)
*) M = 15000, R = 4000
*) no pre-pass MapReduce operation to compute split-points
*) 200s(Map) + 300s(shuffle) + 350s(reduce) + setup time =
891s ~ 1057s (TeraSort benchmark)
*) 13 GB/s
*) [http://googleblog.blogspot.com/2008/11/sorting-1pb-with-mapreduce.html
Now, they can sort 1PB in six hours and two minutes!!!]
*) Example uses
*) distributed grep
*) distributed sort
*) web link-graph reversal
*) term-vector per host
*) web access log stats
*) inverted index construction
*) document clustering
*) machine learning
*) statistical machine translation
*) Latest result
[http://googleblog.blogspot.com/2008/11/sorting-1pb-with-mapreduce.html
process 20PB data per day by mapreduce ] [[BR]]
[http://googleblog.blogspot.com/2008/11/sorting-1pb-with-mapreduce.html
68s for sort 1TB ]
. Reference link

[http://labs.google.com/papers/mapreduce.html Google Map Reduce ] [[BR]]
[http://en.wikipedia.org/wiki/MapReduce Wiki Map Reduce ] [[BR]]
[http://wiki.apache.org/hadoop/HadoopMapReduce Hadoop Map Reduce ] [[BR]]
[http://www.hpl.hp.com/hosted/sortbenchmark/ Sort Benchmark ] [[BR]]
[http://research.google.com/people/jeff/index.html Jeffrey Dean] [[BR]]
[http://research.google.com/people/sanjay/index.html Sanjay Ghemawat] [[BR]]

. Example code
{{{
#!cplusplus
#include "mapreduce/mapreduce.h"
// User's map function
class WordCounter : public Mapper {
public:
virtual void Map(const MapInput& input) {
const string& text = input.value();
const int n = text.size();
for (int i = 0; i < n; ) {
// Skip past leading whitespace
while ((i < n) && isspace(text[i]))
i++;
// Find word end
int start = i;
while ((i < n) && !isspace(text[i]))
i++;
if (start < i)
Emit(text.substr(start,i-start),"1");
}
}
};

REGISTER_MAPPER(WordCounter);

// User's reduce function
class Adder : public Reducer {
virtual void Reduce(ReduceInput* input) {
// Iterate over all entries with the
// same key and add the values
int64 value = 0;
while (!input->done()) {
value += StringToInt(input->value());
input->NextValue();
}
// Emit sum for input->key()
Emit(IntToString(value));
}
};

REGISTER_REDUCER(Adder);

int main(int argc, char** argv) {
ParseCommandLineFlags(argc, argv);
MapReduceSpecification spec;
// Store list of input files into "spec"
for (int i = 1; i < argc; i++) {
MapReduceInput* input = spec.add_input();
input->set_format("text");
input->set_filepattern(argv[i]);
input->set_mapper_class("WordCounter");
}
// Specify the output files:
// /gfs/test/freq-00000-of-00100
// /gfs/test/freq-00001-of-00100
// ...
MapReduceOutput* out = spec.output();
out->set_filebase("/gfs/test/freq");
out->set_num_tasks(100);
out->set_format("text");
out->set_reducer_class("Adder");
// Optional: do partial sums within map
// tasks to save network bandwidth
out->set_combiner_class("Adder");
// Tuning parameters: use at most 2000
// machines and 100 MB of memory per task
spec.set_machines(2000);
spec.set_map_megabytes(100);
spec.set_reduce_megabytes(100);
// Now run it
MapReduceResult result;
if (!MapReduce(spec, &result)) abort();
// Done: 'result' structure contains info
// about counters, time taken, number of
// machines used, etc.
return 0;
}
}}}


--
------------------------------------
Stay hungry, stay foolish

post from gmail

Well, since using gmail could post blog, I seldom need log in the space.
Here are the steps if you still don't know yet:
1) first log in your spaces, and set in 选项=>电子邮件发布
as it said input a email which you want to write the blog, also
give a number, for example 11111, and confirm your change.
2) log in your email, and write your post as normal email. Subject
will be used as the title of your blog. Send the mail to
<youraccount>.<number>@spaces.live.com.

Enjoy it.

--
------------------------------------
Stay hungry, stay foolish

reading notes for <<Above the Clouds: A Berkeley View of Cloud Computing>>

不知道在那里存这些技术笔记,索性先放在这里。将来再找更合适的地方。
今天读了《云之上》,还是有不少启发。看到Berkeley 和David Patterson的名头就进去了。顺便提一下,Randy的那本讲体系结构的书一直没有找到时间看,看到Patterson,又想了起来。
主要的东西原来也陆陆续续看到了一下,所以启发还是来自于细节。
1) 首先看到的一个规模经济的比较,可以看出来大的DC比小的DC还是有显著的优势(7X),可以预见,如果云计算普及,势必会是相当集中的行业,而现在
大型DC具有不可忽视的先发优势,这恐怕也是大公司拼命忽悠的主要原因。
Table 2: Economies of scale in 2006 for medium-sized datacenter (~=1000 servers) vs. very large datacenter (~=50,000
servers). [24]
|Technology     | Cost in Medium-sized DC     | Cost in Very Large DC        |Ratio |
|Network        | $95 per Mbit/sec/month      | $13 per Mbit/sec/month       |7.1 |
|Storage        | $2.20 per GByte / month     | $0.40 per GByte / month      |5.7 |
|Administration | ~=140 Servers/Administrator | >1000 Servers/Administrator  | 7.1 |

2)电力价格差距还是很大的,这样对于中国西部倒是一个机会,不知道电力占成本的比例。粗略的算一下,一台普通台式机,负载100%,一天大致3度电,一度电大约0.45元(上海峰谷平均),这样一年1000元,好像跟我们公司大体一致。这样,大致与一台计算机的一年折旧费用大体相当。这样,电费是运营成本中不可忽视的因素。
Table 3: Price of kilowatt-hours of electricity by region [7].
|Price per KWH | Where | Possible Reasons Why |
|3.6¢          | Idaho | Hydroelectric power; not sent long distance |
|10.0¢         | California | Electricity transmitted long distance over the grid;limited transmission lines in Bay Area; no coal
fired electricity allowed in California. |
|18.0¢         | Hawaii | Must ship fuel to generate electricity |

3) 毫无疑问,最为稀缺的资源是带宽。这跟我们平常计算一致。就我这次对我们cluster的性能简单评估的经验来看,瓶颈也是网络速度。可是这也是云计算所不能解决的问题。当然,如果数据也产生于云里,这个问题倒还好办。所以,云计算首先也解决的是云存储,一旦数据进了云里,很多问题就好说了。同时意味着数据的迁移成本也是巨大的。换句话说,还是巨大的先发优势。

Table 5: We update Gray’s costs of computing resources from 2003 to 2008, normalize to what $1 could buy in 2003
vs. 2008, and compare to the cost of paying per use of $1 worth of resources on AWS at 2008 prices.
|              | WAN bandwidth/mo. | CPU hours (all cores) | disk storage |
| Item in 2003 | 1 Mbps WAN link   | 2 GHz CPU, 2 GB DRAM  | 200 GB disk, 50 Mb/s
transfer rate |
| Cost in 2003 | $100/mo.          | $2000                 | $200         |
| $1 buys in 2003 | 1 GB | 8 CPU hours  | 1 GB |
| Item in 2008 | 100 Mbps WAN link | 2 GHz, 2 sockets, 4
cores/socket, 4 GB DRAM |
1 TB disk, 115 MB/s sustained
transfer |
| Cost in 2008 | $3600/mo. | $1000 | $100 |
| $1 buys in 2008 | 2.7 GB | 128 CPU hours | 10 GB |
| cost/performance improvement | 2.7x | 16x |10x
| Cost to rent | $1 | $0.27–$0.40 | $2.56 | $1.20–$1.50 |

4)值得注意的是,Amazon EC2并没有停运过,我想这会不会意味着EC2用户相比S3还是少? 大多数付费用户还是用云存储? EC2用的认证服务与S3不同? 相比较公司内部IT的故障,这个故障时间还是可以忍受的。
Table 7: Outages in AWS, AppEngine, and Gmail
Service and Outage Duration Date
S3 outage: authentication service overload leading to unavailability [39] 2 hours 2/15/08
S3 outage: Single bit error leading to gossip protocol blowup. [41] 6-8 hours 7/20/08
AppEngine partial outage: programming error [43] 5 hours 6/17/08
Gmail: site unavailable due to outage in contacts system [29] 1.5 hours 8/11/08
5)文中还讨论个老问题,皮鞋传输和网络传输的比较。就是用快递,可以达到1500 Mbit/sec的速度,而用网络,却是20 Mbit/sec。因此如果发展云计算,估计数据中心可以养活一大批快递。

BTW, how to create a table here?

Reference links:
http://www.hpcwire.com/features/Berkeley-Releases-Cloud-Computing-Study-39502692.html?page=1
http://www.amazon.com/Computer-Systems-Programmers-Randal-Bryant/dp/013034074X/ref=sr_1_1?ie=UTF8&s=books&qid=1235294059&sr=1-1

 
リスト項目が追加されていません。