[Featured] A distributed file system project on Unix for mass storage behind mail, search, network disks, photos, podcasts, and more

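Google is currently the most influential web search engine. It built a huge Linux cluster out of more than ten thousand inexpensive PCs, one that is high-performance, very large in storage capacity, stable, and practical.
http://bbs.chinaunix.net/forum/viewtopic.php?t=390949&show_type=old
Its distributed file system, which achieves a highly available, high-performance cluster at low cost, is a successful model of parallel machine design and development, and this strict pursuit of price/performance is worth learning from.
Everyone is welcome to join this work :)
From: Eric Anderson
To: FreeBSD Clustering List
Subject: FreeBSD Clustering wishlist - Was: Introduction & RE: Clustering with Freebsd
Date: Wed, 11 May 2005 22:45:55 -0500  (Thursday, 11:45 CST)
Mailer: Mozilla/5.0 (X11; U; FreeBSD i386; en-US; rv:1.7.7) Gecko/20050504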
Ok - I'm changing the subject here in an attempt to gather information.
Here's my wishlist:
FreeBSD have a 'native' clustered filesystem.  This is different than
shared media (we already can do that over fiber channel, ggated, soon
iscsi and AOE).  This would allow multiple servers to access the same
data read/write - highly important for load balancing applications like
web servers, mail servers, and NFS servers.
Online growable filesystem.  I know I can growfs a filesystem now, but
doing online while data is being used is *insanely* useful.  Reiserfs
and Polyserve's FS (a clustered filesystem, not open-source) do this well.
FreeBSD's UFS2 made to do journaling.  There's already someone working
on this.
I believe the above mean that we need a distributed lock manager too, so
might as well add that to my wishlist.
Single filesystem limits set very high - 16TB would be a good minimum.
Vinum/geom (?) made to allow added a couple more 'disks' - be it a real
scsi device, or another vinum device - to existing vinum's, so I can
extend my vinum stripe, raid, concat, etc to a larger volume size,
without worrying about which disk is where.  I want to stripe mirrors of
raids, and raid striped mirrors of stripes.  I know it sounds crazy, but
I really *do* have uses for all this. :)
We currently pay lots of money every year (enough to pay an engineers
salary) for support and maintenance with Polyserve.  They make a good
product (we need it for the clustered filesystem and NFS distributed
lock manager stuff) - I'd much rather see that go to FreeBSD.
Eric
[Last edited by yftty on 2008-4-12 13:03]
yftty replied on 2005-05-13 14:46:32
On Wed, 2005-05-11 at 22:45 -0500, Eric Anderson wrote:
> Ok - I'm changing the subject here in an attempt to gather information.
>
> Here's my wishlist:
>
> FreeBSD have a 'native' clustered filesystem.  This is different than
> shared media (we already can do that over fiber channel, ggated, soon
Yes, our clustered filesystem will not run on a SAN, since that would
cost too much.
> iscsi and AOE).  This would allow multiple servers to access the same
> data read/write - highly important for load balancing applications like
> web servers, mail servers, and NFS servers.
http://www.netapp.com/tech_library/3022.html <-- this article gives some
info about small-file operations in web, mail, IM, netdisk, blog, etc.
services, and that is what our DFS targets ;)
>
> Online growable filesystem.  I know I can growfs a filesystem now, but
> doing online while data is being used is *insanely* useful.  Reiserfs
> and Polyserve's FS (a clustered filesystem, not open-source) do this well.
Yes, we also support that with our own mechanism.
As you know, current clustered filesystems such as GoogleFS and Lustre
are built to grow online. That is also our way to do it.
>
> FreeBSD's UFS2 made to do journaling.  There's already someone working
> on this.
Good news.
>
> I believe the above mean that we need a distributed lock manager too, so
> might as well add that to my wishlist.
For specific applications and services, we can remove the distributed
lock manager by handling it in an upper layer. You can read the
GoogleFS paper for further info.
>
> Single filesystem limits set very high - 16TB would be a good minimum.
The limits can be removed.
>
> Vinum/geom (?) made to allow added a couple more 'disks' - be it a real
> scsi device, or another vinum device - to existing vinum's, so I can
> extend my vinum stripe, raid, concat, etc to a larger volume size,
> without worrying about which disk is where.  I want to stripe mirrors of
> raids, and raid striped mirrors of stripes.  I know it sounds crazy, but
> I really *do* have uses for all this. :)
Yes, that's Lustre's way, and we also add a logical disk layer to
support it.
>
> We currently pay lots of money every year (enough to pay an engineers
> salary) for support and maintenance with Polyserve.  They make a good
Would you like to persuade your company to sponsor the development ;)
>
> product (we need it for the clustered filesystem and NFS distributed
> lock manager stuff) - I'd much rather see that go to FreeBSD.
Finally, any help, donations, and contributions across the requirements
and technical domains are greatly appreciated!
>
> Eric
>
>
>
--
yf-263
Unix-driver.org
chifeng replied on 2005-05-13 15:00:52
yftty is willing to contribute to BSD... haha.
And gets paid for it too....
riverfor replied on 2005-05-13 15:05:58
I want to write a filesystem too!
thzjy replied on 2005-05-13 15:19:51
A large project
dtest replied on 2005-05-13 16:33:32
Though I can't understand it completely, I think it's a good idea. :)
kofwang replied on 2005-05-14 11:22:43
I need to learn more advanced tech to understand this article.
yftty replied on 2005-05-15 21:54:46
On Wed, 2005-05-11 at 22:45 -0500, Eric Anderson wrote:
> Ok - I'm changing the subject here in an attempt to gather information.
>
> Here's my wishlist:
As for your wishlist, how about MogileFS at
http://www.danga.com/mogilefs/
And what do you think about our cluster FS, with its GoogleFS-like
and MogileFS-like features?
Any comments are welcome :)
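yftty replied on 2005-05-16 13:11:12
It looks like people aren't too keen on English, so here is some Chinese from our chief to enjoy.
I have long been envisioning a distributed network storage system built on an open protocol modeled on SMPP. As you may have seen, Google published a white paper on Google FS. In essence it turns filesystem operations into network protocol operations. Recently I have been helping a friend think through the design of a related system. I wonder whether any of you would be interested in completing such a project together and maintaining it over the long term. In the future it may be more than a Python implementation; there may be C and Java implementations too. But I believe the Python implementation will be the best, just like BitTorrent today.
Distributed network storage like this has many uses, such as the large-capacity network disks everyone uses today, large-capacity mail systems like Gmail, large-volume message exchange systems like NNTP, and large-capacity information stores like blogs.
Its characteristics are that the stored content is diverse, the stored data cannot be centralized, and data is stored around users, groups, systems, and so on.
For background, take a look at Google FS. If you can't find it, I can provide the PDF white paper.
Also: the project will be open source (GPL or BSD), and it will have a real-world use to prove that our ideas are right (I will take care of the test environment).
----HD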
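wheel replied on 2005-05-16 13:47:49
Why base it on an SMPP-like protocol rather than on BitTorrent?
yftty replied on 2005-05-16 13:53:24
Quote: originally posted by "wheel":
Why base it on an SMPP-like protocol rather than on BitTorrent?
The concrete network abstraction layer (NAL) is still being selected. Last quarter I built a demo with CURL.
Later we may use a network layer architecture similar to PVFS2's.
File access supports TFTP, FTP, HTTP, NFS, etc.
Also: for now it looks like we should go with something like Lustre's Portals after all :(
dtest replied on 2005-05-16 13:53:38
OK, I can take part in this project. How do we start? If Python is used for development, I think most of us will have to learn it first.
yftty replied on 2005-05-16 23:10:45
A good write-up on Spotlight in Tiger (Mac OS X):
http://www.kernelthread.com/software/fslogger/
This is also the goal our design pursues:
Presentation layer (search-based directories, user files)
Retrieval/search layer (search engine)
Storage layer (distributed file system)
sttty replied on 2005-05-17 00:15:15
Good idea. I support it. Too bad I'm not capable enough, otherwise I would definitely sign up.
Big bump
ly_1979425 replied on 2005-05-17 09:18:18
Using optical discs as near-line storage media would exploit the cost advantage even more effectively.
If the file systems used in today's optical disc libraries, such as ISO9660, UDF, and Joliet, were presented to users as one unified network file system, it would greatly increase the use of optical discs on the network, for example in disc library devices.
This kind of storage holds very large amounts of data, yet costs little. Optical discs cost far less than hard disks.
I can cooperate with yftty on this.
xuediao replied on 2005-05-17 09:28:56
I've read through it and get the general picture. But could the original poster describe the future application scenarios for the DFS, and the reasoning behind basing it on the SMPP protocol? That part I don't quite understand.
Still, it would be my pleasure to join in! :D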
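yftty replied on 2005-05-17 10:45:22
Quote: originally posted by "xuediao":
I've read through it and get the general picture. But could the original poster describe the future application scenarios for the DFS, and the reasoning behind basing it on the SMPP protocol? That part I don't quite understand.
Still, it would be my pleasure to join in! :D
Sorry, please see the English part ;) As far as I know, none of the existing cluster filesystems is based on SMPP.
The application scenario is mass storage, e.g. web, mail, VOD/IPTV, broadcasting, libraries. Familiar deployments include Google's Linux cluster and Yahoo's BSD server cluster.
[Last edited by yftty on 2006-3-8 12:02]
yftty replied on 2005-05-17 10:49:49
Quote: originally posted by "ly_1979425":
Using optical discs as near-line storage media would exploit the cost advantage even more effectively.
If the file systems used in today's optical disc libraries, such as ISO9660, UDF, and Joliet, were presented to users as one unified network file system, it would greatly increase the use of optical discs on the network..........
Yes, the design takes this into account, as you said above. The metadata of each disc's file system is stored centrally in the MDS, which handles namespace resolution, so the only commands that reach a disc are Seek and Read/Write Stripe operations. That greatly improves usability.
Optical discs also greatly reduce operating costs such as floor space and electricity.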
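The split described above, where a metadata server (MDS) owns the namespace and the discs only see seek and read commands, might look roughly like the sketch below. Everything in it is an assumption made for illustration: `ToyMDS`, `Extent`, `add_disc_catalog`, and `read_file` are hypothetical names, and an `io.BytesIO` object stands in for a loaded disc.

```python
import io
from dataclasses import dataclass


@dataclass(frozen=True)
class Extent:
    disc_id: str  # which disc in the library holds the file
    offset: int   # byte offset of the file on that disc
    length: int   # file length in bytes


class ToyMDS:
    """Hypothetical metadata server: one merged namespace for all discs."""

    def __init__(self):
        self._namespace = {}

    def add_disc_catalog(self, disc_id, catalog):
        # catalog maps a path on the disc to (offset, length), as it would
        # be read once from the disc's own ISO9660/UDF directory structure.
        for path, (offset, length) in catalog.items():
            self._namespace[f"/{disc_id}{path}"] = Extent(disc_id, offset, length)

    def resolve(self, path):
        return self._namespace[path]


def read_file(mds, discs, path):
    """Resolve the name on the MDS, then issue only seek + read to the disc."""
    extent = mds.resolve(path)
    disc = discs[extent.disc_id]
    disc.seek(extent.offset)
    return disc.read(extent.length)


discs = {"disc1": io.BytesIO(b"....hello...."), "disc2": io.BytesIO(b"world!!")}
mds = ToyMDS()
mds.add_disc_catalog("disc1", {"/greeting.txt": (4, 5)})
mds.add_disc_catalog("disc2", {"/noun.txt": (0, 5)})
data = read_file(mds, discs, "/disc1/greeting.txt") + b" " + read_file(mds, discs, "/disc2/noun.txt")
```

The point of the design is that no directory walk ever touches the slow medium: the disc is asked for bytes at a known offset and nothing else.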
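zhuwas replied on 2005-05-17 13:10:57
I can do it in my spare time. Support, support!!!
yftty replied on 2005-05-17 13:23:24
Or you can analyze it by following this progression :)
Ext3/UFS/ReiserFS ;
NFS ;
GlobalFS ;
OpenAFS (Arla), Coda, Inter-mezzo, Lustre, PVFS2, GoogleFS.
Because our team is growing, I keep thinking about how to make this as ordinary as cabbage at the roadside, rather than making people feel that a tower whose top they cannot see has suddenly risen in front of them.
javawinter replied on 2005-05-17 16:20:00
Friendly support :D
zl_vim replied on 2005-05-17 17:02:36
What kind of thing is this?
How do I take part?
潇湘夜雨 replied on 2005-05-17 18:17:27
Showing some support... please post this in the IT Career board too
nemoliu replied on 2005-05-17 23:00:19
hehe, with Google's success, filesystems look even more attractive. If I had the ability I would love to take part too
javawinter replied on 2005-05-18 02:55:46
Everyone with the ability, come join :)
citybugzzzz replied on 2005-05-18 08:45:05
UpUp!
Still following... my own project keeps me busy, but I'd be glad to take part!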
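hdcola replied on 2005-05-18 09:04:40
I haven't been back here for a long time. Let me explain why an SMPP-like protocol was originally considered as a protocol for a distributed file system for message storage.
1. SMPP is a fully asynchronous protocol. In theory the number of outstanding requests can be very large, but in typical applications it processes them concurrently through a window of 16 to 32, so the next command can be handled on the same connection even when the server has not finished the previous one. This greatly reduces the number of concurrent connections on the server side.
2. Message-type storage is rarely modified heavily after it is written. So on save we can consider a store-and-forward mechanism to deal with messages when the server is slow to respond or has failed.
This is just a suggestion. One more idea, that's all.
^_^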
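hdcola's first point, an asynchronous protocol that keeps a bounded window of outstanding requests on one connection, can be sketched as follows. This is only an illustration using Python's asyncio, not SMPP itself: `fake_server_handle`, `send_all`, and the `WINDOW` value of 16 are names and values assumed here, and the server is simulated in-process.

```python
import asyncio

WINDOW = 16  # assumed window size; the post mentions 16 to 32


async def fake_server_handle(request_id):
    """Stand-in for a server that may answer requests out of order."""
    await asyncio.sleep(0.001 * (request_id % 3))
    return request_id, "ok"


async def send_all(request_ids, window=WINDOW):
    """Issue every request, keeping at most `window` of them in flight."""
    slots = asyncio.Semaphore(window)
    responses = {}
    in_flight = 0
    peak = 0

    async def one(request_id):
        nonlocal in_flight, peak
        async with slots:  # wait for a free slot in the window
            in_flight += 1
            peak = max(peak, in_flight)
            rid, status = await fake_server_handle(request_id)
            responses[rid] = status  # match the reply to its request by id
            in_flight -= 1

    await asyncio.gather(*(one(r) for r in request_ids))
    return responses, peak


responses, peak = asyncio.run(send_all(range(100)))
```

Because replies are matched by id rather than by arrival order, a slow request does not block the ones behind it, which is why one connection can replace many.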
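yftty replied on 2005-05-18 10:11:29
Quote: originally posted by "hdcola":
I haven't been back here for a long time. Let me explain why an SMPP-like protocol was originally considered as a protocol for a distributed file system for message storage.
1. SMPP is a fully asynchronous protocol. In theory the number of outstanding requests can be very large, but in typical applications it processes them concurrently through a window of 16 to 32, so that ..........
Comments and suggestions are welcome >_> We will evaluate and test them all during selection :)
The work will be divided into the following parts:
client, data server, metadata server,  namespace, datapath, log, recovery, networking (or on wire protocol), migration/replication, utilities, etc. Everyone is welcome to join the work on whichever part interests them.
Or we could discuss the relevant technical areas under several separate topics. Call it our experiment in distributed collaboration ;)
Discussion of the open source collaboration model is welcome too.
mozilla121 replied on 2005-05-18 15:15:28
Bump
nizvoo replied on 2005-05-18 15:58:58
I wanna do some part!
yftty replied on 2005-05-18 16:13:51
Quote: originally posted by "nizvoo":
I wanna do some part!
If you say the golden words "I wanna do some part!", please describe your technical background or areas of interest so that I can give you more information to help you get into the work smoothly.
Put another way, does it make sense to you if I say: just do it! ;)
uplooking replied on 2005-05-18 16:47:21
The great yftty's work deserves a bump, and besides, learning this stuff will be very good for one's career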
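yftty replied on 2005-05-18 16:55:37
http://tech.sina.com.cn/it/2005-05-08/0920600573.shtml
Xinhua News Agency, Beijing, May 7 (reporter Li Bin): A survey of more than 4,400 people on "the living conditions of Chinese software talent", recently carried out by the China Youth Software Revitalization Plan Working Committee and other organizations, shows that Chinese software talent not only "lacks successors" but, because of missing training, the education model, and other factors, also "lacks staying power".
Knowledge in the software industry turns over quickly, yet the survey found that 60% of domestic software companies provide no necessary career planning for employees, which shows that domestic software companies pay too little attention to staff training.
The survey shows that although most software practitioners hope to improve their abilities through training, society rarely offers the opportunity: on the one hand their employers do not support it, and on the other hand very few institutions can provide timely training in new technologies.
77% of software practitioners work more than 8 hours a day. Mid-level programmers have no time to absorb new technologies and new ideas, and no time to improve their own abilities. Most graduates with a bachelor's degree in software earn around 2,000 yuan a month, and those with an annual salary of 100,000 yuan are estimated at fewer than 5% of all software practitioners. The survey found that the backward education system leaves software graduates without practical programming ability, unable to meet companies' actual needs. Software companies themselves are unwilling to provide the corresponding training, so the number of programmers is almost in a state of "net decrease".
At the same time, China lacks institutions dedicated to training software development managers. Only software engineers or programmers who happen to have good management talent are lucky enough to become development managers, which has produced the odd phenomenon of "software talent struggling to find jobs" alongside "software companies unable to hire suitable staff".
-------------
I hope Uplooking.com can train more system-level developers for this industry :)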
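nizvoo replied on 2005-05-18 17:24:42
3 years c++/windows/opengl/dx
yftty replied on 2005-05-18 17:38:33
Quote: originally posted by "nizvoo":
3 years c++/windows/opengl/dx
This quarter is the incubation stage. At the end of the quarter I will report to the company or discuss possible ways of operating. Please also offer opinions and suggestions on this, on how a project like this can survive and develop.
The aim is to make this a successful industry-grade piece of software with strong vitality.
Also, starting with this thread, to explore how a thing can keep its vitality going ;)
Young, beautiful, forever!
yftty replied on 2005-05-18 17:39:26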
http://lists.danga.com/pipermail/mogilefs/2004-December/000018.html
On Dec 20, 2004, at 11:50, Brad Fitzpatrick wrote:
Excellent! I did a project implementing exactly
same idea two years ago for a project related
to storage of mail messages for GSM carrier and
can appreciate the beauty of the solution! It is
great to have such product in open source.
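uplooking replied on 2005-05-18 17:47:59
Are there many people in China working on this?
yftty replied on 2005-05-18 18:29:21
Not many. But think about it: when Huawei started making telecom equipment, or even now, there weren't many people either, which is why it has to train so many every year ;)
People like to call business rules "game rules". A game can also be a gamble, so for a company, to some extent, it is betting on mass psychology. Those who bet right live a bit more comfortably. Where do you think the industry trend and mass psychology are heading?
Does that appeal to you ;)
http://www.blogchina.com/new/display/72595.html
The biggest shortcoming of the "figures of regret" is their lack of ability to use resources and integrate the industry, along with mediocre business management.
sttty replied on 2005-05-18 23:55:59
I'll support this project to the end. When I get the chance, I'll study it properly.
Speaking of the uplooking courses: a few days ago I attended an open class. It felt good, and the course was very practical. I noticed that the people attending were all at quite a high level.
I felt rather ashamed at the time. :oops:
yftty replied on 2005-05-19 09:36:25
Quote: originally posted by "sttty":
I'll support this project to the end. When I get the chance, I'll study it properly.
Speaking of the uplooking courses: a few days ago I attended an open class. It felt good, and the course was very practical. I noticed that the people attending were all at quite a high level.
I felt rather ashamed at the time. :oops:
For a community, the value of its existence lies in this:
first, it can help everyone grow;
second, it can bring everyone more opportunities.
Please take this as the starting point when posting promotional stuff like the above ;) hehe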
nizvoo replied on 2005-05-19 09:46:41
OK, I see. I need to learn more about filesystems. Let's keep in touch. My mail: nizvoo"AT"gmail.com.
deltali replied on 2005-05-19 10:11:26
What's the role of locks in a distributed filesystem?
thanks!
yftty replied on 2005-05-19 11:03:55
Quote: originally posted by "deltali":
What's the role of locks in a distributed filesystem?
thanks!
The locks in a distributed filesystem are managed by a Distributed Lock Manager (DLM).
A distributed filesystem needs to address the problem of delivering aggregate performance to a large number of clients.
DLM is the basis of scalable clusters. In a DLM-based cluster all nodes can write to all shared resources and coordinate their actions using the DLM.
This sort of technology is mainly intended for CPU- and/or RAM-intensive processing, not for disk-intensive operations nor for reliability.
Digital > Compaq > HP...  HP owns the Digital DLM technology, available in Tru64 Unix (formerly Digital Unix) and OpenVMS 8.
Compaq/HP licensed the DLM technology to Oracle, who have based their cluster/grid software on the DLM.
Sun Solaris also has a DLM based cluster technology.
Now Sun and HP are fighting blog wars...
http://blogs.zdnet.com/index.php?p=661&tag=nl.e539
http://www.chillingeffects.org/responses/notice.cgi?NoticeID=1460
Where I see DLM being good is for rendering and scientific calculation. These processes could really benefit from having a central data store but will not put a huge load on the DLM hardware.
Some deeper background:
http://kerneltrap.org/mailarchive/1/message/56956/thread
http://kerneltrap.org/mailarchive/1/message/66678/thread
http://lwn.net/Articles/135686/
Clusters and distributed lock management
The creation of tightly-connected clusters requires a great deal of supporting infrastructure. One of the necessary pieces is a lock manager - a system which can arbitrate access to resources which are shared across the cluster. The lock manager provides functions similar to those found in the locking calls on a single-user system - it can give a process read-only or write access to parts of files. The lock management task is complicated by the cluster environment, though; a lock manager must operate correctly regardless of network latencies, cope with the addition and removal of nodes, recover from the failure of nodes which hold locks, etc. It is a non-trivial problem, and Linux does not currently have a working, distributed lock manager in the mainline kernel.
David Teigland (of Red Hat) recently posted a set of distributed lock manager patches (called "dlm"), with a request for inclusion into the mainline. This code, which was originally developed at Sistina, is said to be influenced primarily by the venerable VMS lock manager. An initial look at the code confirms this statement: callbacks are called "ASTs" (asynchronous system traps, in VMS-speak), and the core locking call is an eleven-parameter monster:
    int dlm_lock(dlm_lockspace_t *lockspace,
                 int mode,
                 struct dlm_lksb *lksb,
                 uint32_t flags,
                 void *name,
                 unsigned int namelen,
                 uint32_t parent_lkid,
                 void (*lockast) (void *astarg),
                 void *astarg,
                 void (*bast) (void *astarg, int mode),
                 struct dlm_range *range);
Most of the discussion has not been concerned with the technical issues, however. There are some disagreements over issues like how nodes should be identified, but most of the developers who are interested in this area seem to think that this implementation is at least a reasonable starting point. The harder issue is figuring out just how a general infrastructure for cluster support can be created for the Linux kernel. At least two other projects have their own distributed lock managers and are likely to want to be a part of this discussion; an Oracle developer recently described the posting of dlm as "a preemptive strike." Lock management is a function needed by most tightly-coupled clustering and clustered filesystem projects; wouldn't it be nice if they could all use the same implementation?
The fact is that the clustering community still needs to work these issues out; Andrew Morton doesn't want to have to make these decisions for them:
Not only do I not know whether this stuff should be merged: I don't even know how to find that out. Unless I'm prepared to become a full-on cluster/dlm person, which isn't looking likely.
The usual fallback is to identify all the stakeholders and get them to say "yes Andrew, this code is cool and we can use it", but I don't think the clustering teams have sufficent act-togetherness to be able to do that.
Clustering will be discussed at the kernel summit in July. A month prior to that, there will also be a clustering workshop held in Germany. In the hopes that these two events will help bring some clarity to this issue, Andrew has said that he will hold off on any decisions for now.
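To make the `mode` argument above a little more concrete: VMS-style lock managers grant a request only if its mode is compatible with every mode already held on the resource by other owners. The sketch below is a toy, single-process illustration of that rule using the six classic VMS mode names (NL, CR, CW, PR, PW, EX). `ToyLockManager` is a hypothetical class, not the interface of the posted dlm patches, and it ignores everything that makes a real DLM hard: network latency, node failure, queuing, and ASTs.

```python
# For each held mode, the set of requested modes that can be granted alongside it.
COMPATIBLE = {
    "NL": {"NL", "CR", "CW", "PR", "PW", "EX"},  # null lock blocks nothing
    "CR": {"NL", "CR", "CW", "PR", "PW"},        # concurrent read
    "CW": {"NL", "CR", "CW"},                    # concurrent write
    "PR": {"NL", "CR", "PR"},                    # protected read (shared)
    "PW": {"NL", "CR"},                          # protected write (update)
    "EX": {"NL"},                                # exclusive
}


class ToyLockManager:
    """Hypothetical in-memory lock table: resource name -> [(owner, mode)]."""

    def __init__(self):
        self._granted = {}

    def try_lock(self, resource, owner, mode):
        holders = self._granted.setdefault(resource, [])
        for other, held in holders:
            if other != owner and mode not in COMPATIBLE[held]:
                return False  # a real DLM would queue the request instead
        holders.append((owner, mode))
        return True

    def unlock(self, resource, owner):
        holders = self._granted.get(resource, [])
        self._granted[resource] = [(o, m) for o, m in holders if o != owner]


lm = ToyLockManager()
two_readers = lm.try_lock("inode:42", "node1", "PR") and lm.try_lock("inode:42", "node2", "PR")
writer_blocked = not lm.try_lock("inode:42", "node3", "EX")
```

Two nodes can hold protected-read locks on the same resource, while an exclusive request has to wait until both are released.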
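wolfg replied on 2005-05-19 14:36:08
Following
ufoor replied on 2005-05-19 23:38:16
My head is spinning a bit; I still have a lot to learn
It's better to read the Chinese material on this first, since that is more efficient. Read the English only if there is nothing in Chinese
Zer4tul replied on 2005-05-20 03:08:11
This seems to be HD's idea, right? Not bad... too bad I'm not skilled enough... I'll just cheer from the sidelines... In a couple of days I'll read the Google FS document carefully.
yftty replied on 2005-05-20 08:15:56
Quote: originally posted by "ufoor":
My head is spinning a bit; I still have a lot to learn
It's better to read the Chinese material on this first, since that is more efficient. Read the English only if there is nothing in Chinese
Reading Chinese material helps build the relevant concepts quickly, but once a few concepts are in place, stop reading the Chinese, otherwise the more you read the more confused you get.
yftty replied on 2005-05-20 08:19:46
Quote: originally posted by "Zer4tul":
This seems to be HD's idea, right? Not bad... too bad I'm not skilled enough... I'll just cheer from the sidelines... In a couple of days I'll read the Google FS document carefully.
hehe, HD can be considered the godfather of the project!
Also, a great project needs great people. Would you like to share your brilliant ideas about what to do or how to do it, so I can merge them in?
Let's inspire each other ;-) !
akadoc replied on 2005-05-20 13:17:40
up, up, up. Following...
yftty replied on 2005-05-20 17:03:08
Quote: originally posted by "akadoc":
up, up, up. Following...
Which point or which part would you like to follow? The organization or the technology, and which part of the technology? :)
Please look at the features offered by MogileFS, which is similar to ours.
http://www.danga.com/mogilefs/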
MogileFS is our open source distributed filesystem. Its properties and features include:
* Application level -- no special kernel modules required.
* No single point of failure -- all three components of a MogileFS setup (storage nodes, trackers, and the tracker's database(s)) can be run on multiple machines, so there's no single point of failure. (you can run trackers on the same machines as storage nodes, too, so you don't need 4 machines...) A minimum of 2 machines is recommended.
* Automatic file replication -- files, based on their "class", are automatically replicated between enough different storage nodes as to satisfy the minimum replica count as requested by their class. For instance, for a photo hosting site you can make original JPEGs have a minimum replica count of 3, but thumbnails and scaled versions only have a replica count of 1 or 2. If you lose the only copy of a thumbnail, the application can just rebuild it. In this way, MogileFS (without RAID) can save money on disks that would otherwise be storing multiple copies of data unnecessarily.
* "Better than RAID" -- in a non-SAN RAID setup, the disks are redundant, but the host isn't. If you lose the entire machine, the files are inaccessible. MogileFS replicates the files between devices which are on different hosts, so files are always available.
* Transport Neutral -- MogileFS clients can communicate with MogileFS storage nodes (after talking to a tracker) via either NFS or HTTP, but we strongly recommend HTTP.
* Flat Namespace -- Files are identified by named keys in a flat, global namespace. You can create as many namespaces as you'd like, so multiple applications with potentially conflicting keys can run on the same MogileFS installation.
* Shared-Nothing -- MogileFS doesn't depend on a pricey SAN with shared disks. Every machine maintains its own local disks.
* No RAID required -- Local disks on MogileFS storage nodes can be in a RAID, or not. It's cheaper not to, as RAID doesn't buy you any safety that MogileFS doesn't already provide.
* Local filesystem agnostic -- Local disks on MogileFS storage nodes can be formatted with your filesystem of choice (ext3, ReiserFS, etc..). MogileFS does its own internal directory hashing so it doesn't hit filesystem limits such as "max files per directory" or "max directories per directory". Use what you're comfortable with.
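Two of the features listed above, per-class replica counts placed on different hosts and internal directory hashing, can be illustrated with a short sketch. This is not MogileFS code: `CLASS_MIN_REPLICAS`, `place_replicas`, and `hashed_path` are hypothetical names, and the MD5-based path layout is just one assumed way to keep any single directory small.

```python
import hashlib

# Hypothetical classes with their minimum replica counts.
CLASS_MIN_REPLICAS = {"original": 3, "thumbnail": 1}


def place_replicas(devices, file_class):
    """Pick devices for a file so that no two replicas share a host.

    devices is a list of (device_id, host) pairs.
    """
    wanted = CLASS_MIN_REPLICAS[file_class]
    chosen, used_hosts = [], set()
    for device_id, host in devices:
        if host not in used_hosts:
            chosen.append(device_id)
            used_hosts.add(host)
        if len(chosen) == wanted:
            return chosen
    raise RuntimeError("not enough distinct hosts for class %r" % file_class)


def hashed_path(key):
    """Spread files over nested directories so no directory grows too large."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return f"{digest[0:2]}/{digest[2:4]}/{digest}"


devices = [("dev1", "hostA"), ("dev2", "hostA"), ("dev3", "hostB"), ("dev4", "hostC")]
original_copies = place_replicas(devices, "original")
thumbnail_copies = place_replicas(devices, "thumbnail")
```

Skipping `dev2` is the "better than RAID" idea in miniature: two disks in the same machine do not count as two replicas, because losing the host loses both.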
MogileFS is not:
* POSIX Compliant -- you don't run regular Unix applications or databases against MogileFS. It's meant for archiving write-once files and doing only sequential reads. (though you can modify a file by way of overwriting it with a new version) Notes:
o Yes, this means your application has to specifically use a MogileFS client library to store and retrieve files. The steps in general are 1) talk to a tracker about what you want to put or get, 2) read/write to the NFS path for that storage node (the tracker will tell you where) or do an HTTP GET/PUT to the storage node, if you're running with an HTTP transport instead of NFS (which is highly recommended)
o We've briefly tinkered with using FUSE, which lets Linux filesystems be implemented in userspace, to provide a Linux filesystem interface to MogileFS, but we haven't worked on it much.
* Completely portable ... yet -- we have some Linux-isms in our code, at least in the HTTP transport code. Our plan is to scrap that and make it portable, though.
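The two-step client flow described in the notes above (ask a tracker where to go, then read or write on the storage node) might be sketched like this. It is a toy with made-up names: `ToyTracker`, `plan_put`, `plan_get`, `put`, and `get` are not the MogileFS client API, dictionaries stand in for storage nodes, and a dictionary assignment stands in for an HTTP PUT.

```python
class ToyTracker:
    """Hypothetical tracker: remembers where each (namespace, key) lives."""

    def __init__(self, node_names, replicas=2):
        self.node_names = list(node_names)
        self.replicas = replicas
        self.locations = {}
        self.next_fid = 1

    def plan_put(self, namespace, key):
        fid = self.next_fid
        self.next_fid += 1
        start = fid % len(self.node_names)
        targets = [self.node_names[(start + i) % len(self.node_names)]
                   for i in range(self.replicas)]
        self.locations[(namespace, key)] = [(n, f"/{fid}.fid") for n in targets]
        return self.locations[(namespace, key)]

    def plan_get(self, namespace, key):
        return self.locations[(namespace, key)]


def put(tracker, nodes, namespace, key, data):
    # Step 1: ask the tracker. Step 2: write to each node it names.
    for node_name, path in tracker.plan_put(namespace, key):
        nodes[node_name][path] = data  # stands in for an HTTP PUT


def get(tracker, nodes, namespace, key):
    # Try each known location until one of the nodes answers.
    for node_name, path in tracker.plan_get(namespace, key):
        store = nodes.get(node_name)
        if store is not None and path in store:
            return store[path]  # stands in for an HTTP GET
    raise KeyError((namespace, key))


nodes = {"a": {}, "b": {}, "c": {}}
tracker = ToyTracker(nodes)
put(tracker, nodes, "photos", "cat.jpg", b"photo bytes")
put(tracker, nodes, "mail", "cat.jpg", b"mail bytes")  # same key, other namespace
```

The same key in two namespaces maps to two different files, which is how several applications can share one installation without key collisions.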
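scrazy77 replied on 2005-05-20 20:50:59
Quote: originally posted by "yftty":
Which point or which part would you like to follow? The organization or the technology, and which part of the technology? :)
Please look at the features offered by MogileFS, which is similar to ours.
http://www.danga.com/mogilefs/
MogileFS is our open source distributed filesystem..........
MogileFS can be seen as a simplified implementation of Google's GFS;
conceptually they are very close,
except that its smallest unit is the file, whereas Google GFS's smallest unit is a chunk (64MB).
But for now MogileFS has to be accessed through an application client,
so it is still less convenient to use than distributed shared storage such as Red Hat GFS,
or a NetApp Filer...
Of course, MogileFS may be the cheapest solution.
I'm currently testing it on my internal cluster,
using the PHP client, for a blog & album system accessed by multiple servers.
If it were to be implemented as a POSIX filesystem, it could probably be done quickly with FUSE;
danga seem to have such a plan as well.
Eric Chang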
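yftty replied on 2005-05-21 00:30:32
> MogileFS can be seen as a simplified implementation of Google's GFS;
> conceptually they are very close,
Yes, both belong to a subset of user-space implementations of asymmetric cluster filesystems.
At the same time, they can be regarded as file management libraries rather than filesystems.
> except that its smallest unit is the file, whereas Google GFS's smallest unit is a chunk (64MB).
MogileFS takes the file as its smallest unit of management, so it only has to handle the file namespace, not disk block space.
GoogleFS lifts the traditional disk block operations up to file-based chunk (64MB) operations, giving storage management a suitable minimum granularity and lowering the management overhead.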
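The effect of a 64MB chunk granularity is easy to see with a little arithmetic. The sketch below maps a byte range of a file onto chunk indexes; the 64MB figure comes from the discussion above, while `chunks_for_range` is a hypothetical helper written for this illustration.

```python
CHUNK_SIZE = 64 * 1024 * 1024  # 64MB, the chunk size quoted above


def chunks_for_range(offset, length):
    """Return [(chunk_index, offset_in_chunk, nbytes)] covering a byte range."""
    spans = []
    end = offset + length
    while offset < end:
        index, within = divmod(offset, CHUNK_SIZE)
        nbytes = min(CHUNK_SIZE - within, end - offset)
        spans.append((index, within, nbytes))
        offset += nbytes
    return spans


# A read that straddles the boundary between chunk 0 and chunk 1.
straddling = chunks_for_range(CHUNK_SIZE - 10, 20)
# Number of chunk records needed to describe a 1 GiB file.
one_gib = len(chunks_for_range(0, 1 << 30))
```

A 1 GiB file is described by only 16 chunk records, whereas tracking it in 4KB disk blocks would take 262,144 entries. That difference is the reduction in management overhead the post refers to.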
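> But for now MogileFS has to be accessed through an application client,
> so it is still less convenient to use than distributed shared storage such as Red Hat GFS,
GFS is a SAN-based symmetric distributed file system.
> or a NetApp Filer...
A NetApp Filer is an optimized NFS server.
> Of course, MogileFS may be the cheapest solution.
> I'm currently testing it on my internal cluster,
Good job !
> using the PHP client, for a blog & album system accessed by multiple servers.
> If it were to be implemented as a POSIX filesystem, it could probably be done quickly with FUSE;
That is probably a statement about the development process ;) We started with the same idea, but it greatly increased the workload, so we stopped experimenting in FUSE.
> danga seem to have such a plan as well.
> Eric Chang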
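我菜我怕谁 replied on 2005-05-21 09:09:00
Well, I haven't even figured out Unix itself yet, so I'll keep lurking!!
yftty replied on 2005-05-21 10:36:30
Quote: originally posted by "我菜我怕谁":
Well, I haven't even figured out Unix itself yet, so I'll keep lurking!!
HOHO, to some extent this has nothing to do with Unix ;) I don't really understand Unix either, hehe;
The IT industry, as a "consumption economy" led by the United States and launched by Silicon Valley elites, has always made its profits with dazzling concepts as the gimmick, pushing the public's spending far beyond its means. At the same time, it has built not only technical barriers and market barriers but this kind of psychological barrier too. Don't let it scare you.
Big projects are all paper tigers. Despise them strategically, and only then can you handle them tactically.
However big the project, each person takes part in only a small piece. But because of that small piece, can I say that "I" am taking part in the progress of this field, or of this society? ;) The journey is long only because the goal is unclear :)
Appendix: the three stages of accomplishing something, as described by Wang Guowei --
1. Last night the west wind withered the green trees; alone I climbed the high tower and gazed down the road to the horizon's end!
2. My belt grows ever looser, yet I have no regrets; for her I am content to waste away.
3. I searched for her in the crowd a thousand times; suddenly turning my head, there she was, where the lantern light was dim. (Is it you?)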
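kofwang replied on 2005-05-21 10:45:58
That makes sense, but then you have found the right direction. For ordinary people it goes:
1. Last night the liquor left my heart cold; I wanted to climb the tower but found no road to the horizon
2. Strength overdrawn and finally spent, the wallet still empty as a dry pond
3. Three hundred years fighting on the battlefield; I took off my armor to go home, only to find there was no home to return to
sttty replied on 2005-05-21 10:47:34
Well said:
1. Last night the west wind withered the green trees; alone I climbed the high tower and gazed down the road to the horizon's end!
2. My belt grows ever looser, yet I have no regrets; for her I am content to waste away.
3. I searched for her in the crowd a thousand times; suddenly turning my head, there she was, where the lantern light was dim
One sentence wakes the dreamer
kofwang replied on 2005-05-21 10:53:52
"Making profits with dazzling concepts as the gimmick"
Right now the concept economy is in full swing. For Chinese people, "home theater", "self-drive travel", "the Three Represents": how many eyeballs have they drawn!
yftty replied on 2005-05-21 10:58:34
Quote: originally posted by "kofwang":
That makes sense, but then you have found the right direction. For ordinary people it goes:
1. Last night the liquor left my heart cold; I wanted to climb the tower but found no road to the horizon
2. Strength overdrawn and finally spent, the wallet still empty as a dry pond
3. Three hundred years fighting on the battlefield; I took off my armor to go home, only to find there was no home to return to
Two men look out from a prison cell: one sees the mud, the other sees the stars  :wink:
People mostly look toward the smooth road after the twists and turns; that is also why tragedies such as The Butterfly Lovers (梁祝) are more easily handed down
Behind a morbid persistence, do you ever have this feeling: you are always startled awake in the morning, yet don't know what you are, or should be, worrying about?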
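akadoc replied on 2005-05-21 14:23:17
Quote: originally posted by "yftty":
For a community, the value of its existence lies in this:
first, it can help everyone grow;
second, it can bring everyone more opportunities.
Please take this as the starting point when posting promotional stuff like the above ;) hehe
Hoping to see a team like the one you describe in this project!
chifeng replied on 2005-05-21 22:37:24
I wonder whether a newbie like me can help?
By doing something concrete.....:)
tclwp replied on 2005-05-22 17:25:14
If it integrates new, pioneering technology, the future is bright
yftty replied on 2005-05-22 20:56:29
Quote: originally posted by "akadoc":
Hoping to see a team like the one you describe in this project!
The team has been set up. There are currently two members, and a third will join in July ;) All have experience with successful distributed file system products   :idea:
Of course we hope more people will join our work and explore with us the technology and the related management and engineering experience.
yftty replied on 2005-05-22 20:59:56
Quote: originally posted by "chifeng":
I wonder whether a newbie like me can help?
By doing something concrete.....:)
Hehe, people reach a level because of the work they do, rather than doing the work only after reaching that level. Growth should be a lifelong pursuit, so we are always using the known to explore the unknown ;)
We keep working at it!
sttty replied on 2005-05-22 22:45:32
Successful people all got there step by step like this. I hope that a few years from now I too will still be walking this road.
yftty replied on 2005-05-22 23:46:03
Quote: originally posted by "tclwp":
If it integrates new, pioneering technology, the future is bright
For a project like this or similar, the risk in R&D (new technology) is relatively small; the bigger risk is on the engineering side. Hehe, by doing this I have gradually come to understand why, of Google.com's two founders, one is in charge of technology and the other of engineering (of course my understanding may be off).
In a system like this, any single part taken on its own is fairly simple, and you can see its shadow in many other places. But when everything is integrated, when it forms what we usually call a system, the technical complexity rises. This is especially evident in business-critical systems. For example, a large concurrent system has a great many corner cases, and so many parts are optimized that the specific cause of a problem is hard to pin down. And performance is often the only goal the engineering pursues. Support and discussion from everyone are welcome :)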
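whoto replied on 2005-05-23 10:29:19
I don't understand Google FS. My understanding of yfttyFS (let's call it that for now) is:
Under a virtual yfttyFS root file system, it provides the ability to attach (mount) many kinds of storage devices, many file systems, storage space offered by many operating systems, many protocols, and even other yfttyFS instances, forming one unified storage system that provides storage services.
Comments from the experts are welcome.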
yfttyFS/
  |
  +--yfttyFS/X1
  |
  +--yfttyFS/X2
  |
  +--yfttyFS/X...
  |
  +--/Xdev/
  |     |--HD
  |     |--SCSI
  |     |--CD
  |     |--DVD
  |     |--etc.
  |
  +--/Xfs/
  |     |--UFS
  |     |--UFS2
  |     |--Ext2
  |     |--NTFS
  |     |--ISO9660
  |     |--etc.
  |
  +--/Xsys/
  |     |--BSD(s)
  |     |--Linux(s)
  |     |--Windows(s)
  |     |--UNIX(s)
  |     |--etc.
  |
  +--/Xprotocol/
  |     |--TFTP
  |     |--FTP
  |     |--HTTP
  |     |--NFS
  |     |--etc.
  |
  +--/etc.

WEB       --|
MAIL      --|
VOD/IPTV  --|---based on--yfttyFS
Library   --|
etc.      --|
yftty replied on 2005-05-23 11:28:33
hehe, I never thought about naming it xxxFS as you did, and I wouldn't dare to. Most of the ideas are borrowed from various sources, and there are members of our team much more intelligent than I am. I disclose it here only to get more insight into our project, to benefit the project and the people who contribute.
Yes, it seems you really know what we want to do ;) Yes, the storage is a pool, and is always on demand! Like the air around you.
And the trick behind my nickname:
I can see your masterpiece here because 'yf' is now in front of a 'tty' ;)
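Solaris12 replied on 2005-05-25 18:43:30
Quote: originally posted by "yftty":
The team has been set up. There are currently two members, and a third will join in July ;) All have experience with successful distributed file system products   :idea:
Of course we hope more people will join our work and explore with us the technology and the related management and ..........
How can I contact you? I'm very interested in this project,
and we could exchange a lot on technology and engineering management.
yftty replied on 2005-05-27 00:24:38
Quote: originally posted by "Solaris12":
How can I contact you? I'm very interested in this project,
and we could exchange a lot on technology and engineering management.
For engineering management we plan to use PSP/TSPi and XP. Discussion on this is welcome.
Also: I've bought the books but haven't had time to read them yet.
javawinter replied on 2005-05-27 01:15:51
UP
Solaris12 replied on 2005-05-27 13:03:10
Quote: originally posted by "yftty":
For engineering management we plan to use PSP/TSPi and XP. Discussion on this is welcome.
Forgive my ignorance, what is PSP/TSPi?
Does XP mean Extreme Programming?
As I understand it, XP suits projects with few developers that are driven by customer requirements. An FS product does not need to copy XP wholesale.
Of course, software development does have many best practices, and we can adjust them to our actual situation to find the balance between efficiency and process:
1. On SCM:
To build a good product, a set of SCM policies and standards must be defined, mainly in the following areas:
version control management
change tracking management
2. On process:
Some standards for code integration need to be defined.
Development: concept document --> development --> code review --> code integration
Testing: test plan --> test development --> testing --> test report
For a small development team with limited resources, SCM and process should not be made complicated. Keep development documents to a minimum, and strengthen configuration management and code review.
For testing, it is best to find open source test tools, but that requires that the FS programming interface not be proprietary; it should conform to some standard as far as possible.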
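yftty replied on 2005-05-27 13:48:07
(13:43:29) j-fox: Whatever management model is used, good planning (all kinds of plans, especially risk response plans) and status monitoring matter most. Start by taking a small task and trying to find a method that fits
(13:45:45) j-fox: Get the development documents ready first
(13:46:04) yftty -- A dream makes a team, and the team builds the dream !: OK, I'll post yours first
xuediao replied on 2005-05-27 14:10:04
Quote: originally posted by "Solaris12":
XP suits projects with few developers that are driven by customer requirements.
As Solaris12 said, XP emphasizes speed and flexibility, while PSP and TSPi are an extension of CMMI that emphasizes planning and process control.
Although this is a large engineering project, mainly developed in a distributed way, applying both methods at once is very hard.
Finding the balance between these two methods might just start a new school of software engineering, hehe  :D
xuediao replied on 2005-05-27 14:16:26
Quote: originally posted by "yftty":
(13:43:29) j-fox: Whatever management model is used, good planning (all kinds of plans, especially risk response plans) and status monitoring matter most. Start by taking a small task and trying to find a method that fits
(13:45:45) j-fox: Get the development documents ready first
(13:46:04) yftty -- ..........
I tend to agree with j-fox: monitoring development status and responding to risk matter most. If this were purely in-house development, TSP would probably be much easier to apply; for distributed development in China, this counts as an experiment and a learning process.
mozilla121 replied on 2005-05-27 14:29:27
Strictly following this process will be hard in practice. Only a team that really believes in the process can keep it up.
yftty replied on 2005-05-27 14:51:16
Quote: originally posted by "xuediao":
As Solaris12 said, XP emphasizes speed and flexibility, while PSP and TSPi are an extension of CMMI that emphasizes planning and process control.
Although this is a large engineering project, mainly developed in a distributed way, applying both methods at once is very hard.
Between these two methods ..........
Yes: Western learning for application, Chinese learning as the foundation ;)
For now we are only imitating a little, and our understanding of the thing itself keeps deepening. We use XP's incremental model and TSP's monitoring and evaluation. I'm learning as I go, and may have ended up with something neither fish nor fowl, but hopefully it can be called innovation ;)
We are now moving the current user-space implementation into the FreeBSD kernel. I'm truly grateful for the "sudden enlightenment" of the Zen school founded by the Sixth Patriarch Huineng.
yftty replied on 2005-05-27 14:54:38
Quote: originally posted by "mozilla121":
Strictly following this process will be hard in practice. Only a team that really believes in the process can keep it up.
"Know yourself", "conquer yourself"; "know contentment", "act with perseverance". -- Tao Te Ching
xuediao replied on 2005-05-27 14:54:46
Hehe, this is also the doctrine of the mean, or perhaps a new Self-Strengthening Movement
As Deng Xiaoping put it so well: black cat or white cat, the one that catches mice is a good cat!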
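Solaris12 replied on 2005-05-28 21:03:45
Quote: originally posted by "xuediao":
As Solaris12 said, XP emphasizes speed and flexibility, while PSP and TSPi are an extension of CMMI that emphasizes planning and process control.
Although this is a large engineering project, mainly developed in a distributed way, applying both methods at once is very hard.
Between these two methods .........
Actually, things like CMM are very well suited to outsourcing companies.
The development team I'm in uses neither XP nor CMM, yet it is very effective.
And you can find traces of other software engineering methods in it.
So no particular process matters in itself; what matters most is that it matches the resources you have.
As I see it, the biggest problems of many Chinese software companies are mainly these:
1. SCM (software configuration management)
No competent release engineer.
No real version management.
No change tracking system, so not every change to the system can be captured.
No daily build, and no automatic sanity test or system test.
More importantly, many companies have no unified SCM policy from the start of a project, for example code integration criteria.
2. Development process
No democratic, authoritative body to control changes to market and software architecture requirements and features.
No code review.
No automatic regression test for every daily build.
Every software engineering practice and method takes extra resources, though;
the key is that each software company recognizes this and invests in it.
In fact, if you look closely at the development model of many well-known open source projects,
all of the above are well satisfied. For example:
you can get the daily build or snapshot at any time
and see whether that build passed its tests. There is also a bug tracking system
that records every change, including bug fixes and new features.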
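yftty replied on 2005-06-01 12:37:15
To Solaris12,
We are also implementing things step by step along the lines you describe, but it is not all set up yet.
1. SCM: for now just simple commit rules (modeled on Lustre's process), also to match the resources we have.
2. Development process: for now there is only design review. The rest needs people to set it up.
Also: I suddenly feel I have lost something that used to be familiar.
james.liu replied on 2005-06-01 13:42:18
After reading this thread, my first impression is not of the project or the technology involved, but of this guy yftty:
he can really talk.
I don't understand it, but I'd like to watch... how can I follow this project as an onlooker?
yftty replied on 2005-06-01 14:05:20
Quote: originally posted by "james.liu":
After reading this thread, my first impression is not of the project or the technology involved, but of this guy yftty:
he can really talk.
I don't understand it, but I'd like to watch... how can I follow this project as an onlooker?
On the technical side, I do answer clear questions, such as the earlier one about distributed locks (the DLM).
How to watch or take part is exactly the purpose of this thread. For system and kernel development, in China or abroad, my feeling is that there is no really good way in. The Kernel-mentors mailing list is one attempt in this direction and has shown initial results.
Of course, I'm sorry if what I said caused you or anyone else to misunderstand or read something else into it. But I believe everyone wants to give themselves and others a chance to grow.
I also feel that a person's way of doing things has a lot to do with their character. When I haven't thought something through, and the risk is bearable, I throw it out first and then improvise according to how things go. It's like football: when you can't attack, first pass the ball to the opposing forward.
风暴一族 replied on 2005-06-03 09:26:48
Nice~
yftty replied on 2005-06-07 09:41:22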
here is the current sanity testing & results
[yf@yftty xxxfs]$ tests/xxxfs_sanity -v
000010:000001:1118108292.377965:4560:(socket.c:63:xxxfs_net_connect()) Process entered
config finished, ready to do the sanity testing !
xxxFS file creation testing succeeded !
xxxFS file read testing succeeded !
xxxFS file deletion testing succeeded !
xxxFS Sanity testing pid (4560) succeeded 1 !
[yf@yftty xxxfs]$
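The run above exercises three operations in order: create, read, delete. Since the real xxxfs client is not shown in this thread, the sketch below reproduces only the shape of such a sanity test against a hypothetical `ToyStore` backed by a temporary local directory. The class, its methods, and `run_sanity` are all assumptions made for illustration.

```python
import os
import tempfile


class ToyStore:
    """Hypothetical stand-in for a file system client, backed by a directory."""

    def __init__(self, root):
        self.root = root

    def _path(self, name):
        return os.path.join(self.root, name)

    def create(self, name, data):
        with open(self._path(name), "xb") as f:  # "x" fails if the file exists
            f.write(data)

    def read(self, name):
        with open(self._path(name), "rb") as f:
            return f.read()

    def delete(self, name):
        os.remove(self._path(name))

    def exists(self, name):
        return os.path.exists(self._path(name))


def run_sanity(store):
    """Run create, read, delete in order; return the steps that passed."""
    passed = []
    payload = b"sanity payload"
    store.create("sanity.dat", payload)
    if store.exists("sanity.dat"):
        passed.append("creation")
    if store.read("sanity.dat") == payload:
        passed.append("read")
    store.delete("sanity.dat")
    if not store.exists("sanity.dat"):
        passed.append("deletion")
    return passed


with tempfile.TemporaryDirectory() as root:
    results = run_sanity(ToyStore(root))
```

Keeping the test independent of the backend means the same three checks could later be pointed at a networked client without changing the test itself.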
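yftty replied on 2005-06-07 11:38:57
The project has now been running for almost two quarters. After this period of practice and reflection,
my humble view is that, in terms of process, what the posts above discuss is already fairly complete.
In terms of division of labor and organization, does the following look right to you?
 ----------------------         ----------------------
| Theoretical guidance |  <->  | Development guidance |
 ----------------------         ----------------------
        |            \           /            |
        |             \         /             |
     -------        -------------        -----------
    |  R&D  |  <-> | Development | <->  |  Testing  |
     -------        -------------        -----------
Also:
This still has problems. Sigh.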
yftty replied on 2005-06-08 09:30:30
Dan Stromberg wrote:
> The lecturer at the recent NG storage talk at Usenix in Anaheim,
> indicated that it was best to avoid "active/active" and get
> "active/passive" instead.
>
> Does anyone:
>
> 1) Know what these things mean?
In the clustering world, active/active means 2 or more servers are
active at a time, either operating on separate data (and thus acting as
passive failover partners to each other), or operating on the same data
(which requires the use of a cluster filesystem or other similar
mechanism to allow coherent simultaneous access to the data).
> 2) Know why active/passive might be preferred over active/active?
Well, if you're talking about active/passive vs. active/active with a
cluster filesystem or such, the active/passive is tons easier to
implement and get right. Plus, depending on your application, the added
complexity of a cluster filesystem might not actually buy you much more
than you could get with, say, NFS or Samba (CIFS).
--
Paul
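yftty replied on 2005-06-08 11:00:35
http://tech.blogchina.com/53/2005-06-07/372338.html
To understand Google's corporate culture, start with an episode from Google's founding: when Sergey Brin and Larry Page wanted to turn their web dream into reality, the biggest obstacle was that they did not have enough money to buy expensive equipment. So the two spent a few hundred dollars on some personal computers to stand in for servers costing millions of dollars.
In practice, the failure rate of these ordinary computers was naturally higher than that of professional servers. They had to make sure that the failure of any one ordinary computer would not stop users from getting search results, so Google decided to develop its own software tools to solve these problems. One example is the Google File System. This file system not only handles large data efficiently, it can also cope with storage failures that occur at any time. Combined with Google's triple-replication scheme, a system built from these personal computers can do the work of those servers.
This idea of going all out to solve whatever problem comes up greatly influenced Google's later culture. To this day Google keeps the look and feel of an Internet company. Of the 2,700 employees at company headquarters, 900 are technical staff, and there are hardly any private offices. Downstairs from Schmidt's closet-sized office, Brin and Page share an office. It looks like a college dorm room, with hockey gear, skateboards, remote-controlled model planes, beanbag chairs, and so on.
...
No one doubts that Google has magical technology and innovation, but no great company has become world-class on outstanding technology alone. A great company needs great management to take it to the next level. Who is the soul of Google? The trio of Brin, Page, and Schmidt, of course. But when it comes to management, the 49-year-old Schmidt has indeed played a crucial role.
Schmidt, 49, was formerly CTO of Sun and CEO of Novell. He still clearly remembers what the board told him when he first arrived: "Don't mess up the company, Eric. It has had a very, very good start, so don't make too many changes." He fully understood the investors' worry: they did not want this highly creative company to become rigid.
When Schmidt first came to the company in 1999 there was hardly any management to speak of, but he did not want to copy the management methods of traditional large companies either; he hoped to develop Google's own management model based on the actual situation. In most cases Schmidt and the two founders act together and make decisions together. Usually Schmidt chairs management meetings while the two founders chair staff meetings. When a major issue needs to be resolved, the Google trio decides by the basic rule of majority vote, and many decisions are reached in front of employees. Management deliberately preserves the frank, free engineering culture, which they see as a powerful weapon against large companies such as Yahoo and Microsoft.
Harvard Business School professor David Yoffie, however, is not optimistic about this management model: "If many people make decisions at the same time, that amounts to deciding nothing. At Google thousands of plans are made at once every day, and one person is needed to make the final call."
Schmidt says that the role he actually plays is closer to that of a COO. He cites Yahoo and eBay as examples: at those companies the founders set the long-term strategy even though they do not hold the CEO title. But Schmidt's supporters argue that the CEO's personal style masks his real standing in the company. Page, who once served as CEO, is now president of products. Former chairman Brin is president of technology. And over the past four years Schmidt has built a sound structure for Google.
Brin and Page's management philosophy comes straight from the Stanford computer science lab where they started. Google's managers rarely tell engineers which projects to complete. Instead, the company publishes a list of its 100 top-priority projects, and engineers join different fluid working groups according to their own preferences, completing work in units of weeks or months.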
liuzhentaosoft replied at: 2005-06-10 23:49:57
openMosix:
5.1 What Is openMosix?
Basically, the openMosix software includes both a set of kernel patches and support tools. The patches extend the kernel to provide support for moving processes among machines in the cluster. Typically, process migration is totally transparent to the user. However, by using the tools provided with openMosix, as well as third-party tools, you can control the migration of processes among machines.
Let's look at how openMosix might be used to speed up a set of computationally expensive tasks. Suppose, for example, you have a dozen files to compress using a CPU-intensive program on a machine that isn't part of an openMosix cluster. You could compress each file one at a time, waiting for one to finish before starting the next. Or you could run all the compressions simultaneously by starting each compression in a separate window or by running each compression in the background (ending each command line with an &). Of course, either way will take about the same amount of time and will load down your computer while the programs are running.
However, if your computer is part of an openMosix cluster, here's what will happen: First, you will start all of the processes running on your computer. With an openMosix cluster, after a few seconds, processes will start to migrate from your heavily loaded computer to other idle or less loaded computers in the cluster. (As explained later, because some jobs may finish quickly, it can be counterproductive to migrate too quickly.) If you have a dozen idle machines in the cluster, each compression should run on a different machine. Your machine will have only one compression running on it (along with a little added overhead) so you still may be able to use it. And the dozen compressions will take only a little longer than it would normally take to do a single compression.
If you don't have a dozen computers, or some of your computers are slower than others, or some are otherwise loaded, openMosix will move the jobs around as best it can to balance the load. Once the cluster is set up, this is all done transparently by the system. Normally, you just start your jobs. openMosix does the rest. On the other hand, if you want to control the migration of jobs from one computer to the next, openMosix supplies you with the tools to do just that.
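The "start everything at once" workflow described above can be sketched in Python. This is only an illustration of launching a dozen independent compressor processes; it does not use any openMosix API, and the helper name `compress_all`, the file names and the choice of gzip are assumptions. On an openMosix cluster, ordinary processes like these are what the kernel could then migrate to other nodes.

```python
import os
import subprocess
import sys
import tempfile

# Child program: compress the file named in argv[1] to argv[1] + ".gz".
CHILD = (
    "import gzip, shutil, sys\n"
    "with open(sys.argv[1], 'rb') as src, gzip.open(sys.argv[1] + '.gz', 'wb') as dst:\n"
    "    shutil.copyfileobj(src, dst)\n"
)


def compress_all(paths):
    """Start one compressor process per file, then wait for all of them."""
    procs = [subprocess.Popen([sys.executable, "-c", CHILD, p]) for p in paths]
    return [proc.wait() for proc in procs]


# Create a dozen sample files (hypothetical data) and compress them concurrently.
workdir = tempfile.mkdtemp()
paths = []
for i in range(12):
    path = os.path.join(workdir, f"file{i}.dat")
    with open(path, "wb") as f:
        f.write(b"openMosix " * 1000)
    paths.append(path)

exit_codes = compress_all(paths)
```

Each compression is a separate operating-system process, so nothing in the program itself decides where a job runs; that placement is left to the system.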
OSCAR:
Setting up a cluster can involve the installation and configuration of a lot of software as well as reconfiguration of the system and previously installed software. OSCAR (Open Source Cluster Application Resources) is a software package that is designed to simplify cluster installation. A collection of open source cluster software, OSCAR includes everything that you are likely to need for a dedicated, high-performance cluster. OSCAR takes you completely through the installation of your cluster. If you download, install, and run OSCAR, you will have a completely functioning cluster when you are done.
The design goals for OSCAR include using the best-of-class software, eliminating the downloading, installation, and configuration of individual components, and moving toward the standardization of clusters. OSCAR, it is said, reduces the need for expertise in setting up a cluster. In practice, it might be more fitting to say that OSCAR delays the need for expertise and allows you to create a fully functional cluster before mastering all the skills you will eventually need. In the long run, you will want to master those packages in OSCAR that you come to rely on. OSCAR makes it very easy to experiment with packages and dramatically lowers the barrier to getting started.
OSCAR was created and is maintained by the Open Cluster Group (http://www.openclustergroup.org), an informal group dedicated to simplifying the installation and use of clusters and broadening their use. Over the years, a number of organizations and companies have supported the Open Cluster Group, including Dell, IBM, Intel, NCSA, and ORNL, to mention only a few.
Because OSCAR is an extensive collection of software, it is beyond the scope of this book to cover every package in detail. Most of the software in OSCAR is available as standalone versions, and many of the key packages included by OSCAR are described in later chapters in this book. Consequently, this chapter focuses on setting up OSCAR and on software unique to OSCAR. By the time you have finished this chapter, you should be able to judge whether OSCAR is appropriate for your needs and know how to get started.
Rocks:
NPACI Rocks is a collection of open source software for building a high-performance cluster. The primary design goal for Rocks is to make cluster installation as easy as possible. Unquestionably, they have gone a long way toward meeting this goal. To accomplish this, the default installation makes a number of reasonable assumptions about what software should be included and how the cluster should be configured. Nonetheless, with a little more work, it is possible to customize many aspects of Rocks.
When you install Rocks, you will install both the clustering software and a current version of Red Hat Linux updated to include security patches. The Rocks installation will correctly configure various services, so this is one less thing to worry about. Installing Rocks installs Red Hat Linux, so you won't be able to add Rocks to an existing server or use it with some other Linux distribution.
Default installations tend to go very quickly and very smoothly. In fact, Rocks' management strategy assumes that you will deal with software problems on a node by reinstalling the system on that node rather than trying to diagnose and fix the problem. Depending on hardware, it may be possible to reinstall a node in under 10 minutes. Even if your systems take longer, after you start the reinstall, everything is automatic, so you don't need to hang around.
In this chapter, we'll look briefly at how to build and use a Rocks cluster. This coverage should provide you with enough information to decide whether Rocks is right for you. If you decide to install Rocks, be sure you download and read the current documentation. You might also want to visit Steven Baum's site.
yftty replied at: 2005-06-14 15:49:41
This is a short taxonomy of the kinds of distributed filesystems you can find today (February 2004). This was assembled with some help from Garth Gibson and Larry Jones.
Distributed filesystem - the generic term for a client/server or "network" filesystem where the data isn't locally attached to a host. There are lots of different kinds of distributed filesystems, the first ones coming out of research in the 1980s. NFS and CIFS are the most common distributed filesystems today.
Global filesystem - this refers to the namespace, so that all files have the same name and path name when viewed from all hosts. This obviously makes it easy to share data across machines and users in different parts of the organization. For example, the WWW is a global namespace because a URL works everywhere. But, filesystems don't always have that property because your share definitions may not match mine, we may not see the same file servers or the same portions of those file servers.
AFS was an early provider of a global namespace - all files were organized under /afs/cellname/... and you could assemble AFS cells even from different organizations (e.g., different universities) into one shared filesystem. The Panasas filesystem (PanFS) supports a similar structure, if desired.
SAN filesystem - these provide a way for hosts to share Fibre Channel storage, which is traditionally carved into private chunks bound to different hosts. To provide sharing, a block-level metadata manager controls access to different SAN devices. A SAN Filesystem mounts storage natively in only one node, but connects all nodes to that storage and distributes block addresses to other nodes. Scalability is often an issue because blocks are a low-level way to share data placing a big burden on the metadata managers and requiring large network transactions in order to access data.
Examples include SGI cXFS, IBM GPFS, Red Hat Sistina, IBM SanFS, EMC Highroad and others.
Symmetric filesystems - A symmetric filesystem is one in which the clients also run the metadata manager code; that is, all nodes understand the disk structures. A concern with these systems is the burden that metadata management places on the client node, serving both itself and other nodes, which may impact the ability of the client to perform its intended compute jobs. Examples include Sistina GFS, GPFS, Compaq CFS, Veritas CFS, and Polyserve Matrix.
Asymmetric filesystems - An asymmetric filesystem is one in which there are one or more dedicated metadata managers that maintain the filesystem and its associated disk structures. Examples include Panasas ActiveScale, IBM SanFS, and Lustre. Traditional client/server filesystems like NFS and CIFS are also asymmetric.
Cluster filesystem - a distributed filesystem that is not a single server with a set of clients, but instead a cluster of servers that all work together to provide high performance service to their clients. To the clients the cluster is transparent - it is just "the filesystem", but the filesystem software deals with distributing requests to elements of the storage cluster.
Examples include the HP (DEC) Tru64 cluster and Spinnaker, a clustered NAS (NFS) service. Panasas ActiveScale is also a cluster filesystem.
Parallel filesystem - a filesystem with support for parallel applications, in which all nodes may access the same files at the same time for concurrent reads and writes. Examples of this include Panasas ActiveScale, Lustre, GPFS and Sistina.
Finally, these definitions overlap. A SAN filesystem can be symmetric or asymmetric. Its servers can be clustered or single. And it can support parallel apps or not.
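raidcracker replied at: 2005-06-14 18:49:36
raidcracker wrote:
Having one person do both development and testing is asking for trouble. :>
http://bbs.chinaunix.net/forum/viewtopic.php?t=544517&show_type=&postdays=0&postorder=asc&start=80
How about giving us some suggestions on our division of development work and our process?
------------------------------------------------------------
I don't have much experience in project management; I simply detest development being placed above testing.
I feel your division of labor leans in that direction. Testing should be independent and run in parallel with development. Testing is the test engineer's job and debugging is the developer's job; the two must not be merged into one.
Finally, don't get me wrong: I am not a tester speaking up for testing. I do R&D on RAID and SAN.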
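yftty replied at: 2005-06-16 00:13:43
Drawn as a Gantt chart, it should look something like this :)
|
|  R&D ->
|    ^ Development ->
|       ^ QA ->
|          ^ Operations (maintenance) ->
----------------------------
So you must be quite familiar with FC, SCSI, iSCSI, NFS, Samba, caching, RAID and so on.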
BigMonkey replied at: 2005-06-16 11:36:13
"Total posts  Posted: 2005-06-16 00:13    Subject: To: raidcracker"
The thread starter is still up this late.
BigMonkey replied at: 2005-06-16 11:41:06
By the way, is yf your real name's acronym?
yftty replied at: 2005-06-16 14:49:06
Quote: originally posted by "BigMonkey":
> By the way, is yf your real name's acronym?
Hehe, the secret answer is 'yes'. And yftty is short for 'yf before a tty' ;)
By the way, the FS also needs to provide a PHP interface, so please share your SWIG experience if you have some or want to :P
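raidcracker replied at: 2005-06-16 16:49:27
Quote: originally posted by "yftty":
Drawn as a Gantt chart, it should look something like this :)
|
|  R&D ->
|    ^ Development ->
|       ^ QA ->
|          ^ Operations (maintenance) ->
----------------------------
So you must be quite familiar with FC  SCSI  iSCSI NFS ..........
It's how I earn my living, so I have no choice but to know it well. The thread starter's breadth of knowledge is enviable.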
yftty replied at: 2005-06-17 10:58:20
http://lwn.net/Articles/136579/
The second version of Oracle's cluster filesystem has been in the works for some time. There has been a recent increase in cluster-related code proposed for inclusion into the mainline, so it was not entirely surprising to see an OCFS2 patch set join the crowd. These patches have found their way directly into the -mm tree for those wishing to try them out.
As a cluster filesystem, OCFS2 carries rather more baggage than a single-node filesystem like ext3. It does have, at its core, an on-disk filesystem implementation which is heavily inspired by ext3. There are some differences, though: it is an extent-based filesystem, meaning that files are represented on-disk in large, contiguous chunks. Inode numbers are 64 bits. OCFS2 does use the Linux JBD layer for journaling, however, so it does not need to bring along much of its own journaling code.
To actually function in a clustered mode, OCFS2 must have information about the cluster in which it is operating. To that end, it includes a simple node information layer which holds a description of the systems which make up the cluster. This data structure is managed from user space via configfs; the user-space tools, in turn, take the relevant information from a single configuration file (/etc/ocfs2/cluster.conf). It is not enough to know which nodes should be part of a cluster, however: these nodes can come and go, and the filesystem must be able to respond to these events. So OCFS2 also includes a simple heartbeat implementation for monitoring which nodes are actually alive. This code works by setting aside a special file; each node must write a block to that file (with an updated time stamp) every so often. If a particular block stops changing, its associated node is deemed to have left the cluster.
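The heartbeat scheme described above can be sketched as a toy in-memory model. This is not OCFS2's code: the class name `HeartbeatRegion`, the `dead_after` threshold and the use of a simple counter as the time stamp are all assumptions made for illustration.

```python
class HeartbeatRegion:
    """Toy model of a shared heartbeat file with one slot ("block") per node.

    Hypothetical illustration only; names and timeout policy are assumptions.
    """

    def __init__(self, node_ids, dead_after=3):
        self.slots = {n: 0 for n in node_ids}        # last stamp written by each node
        self.last_seen = dict(self.slots)            # stamp observed at the previous scan
        self.stale_scans = {n: 0 for n in node_ids}  # consecutive scans without a change
        self.dead_after = dead_after                 # unchanged scans before a node is deemed gone

    def beat(self, node_id, stamp):
        """A live node overwrites its own block with a fresh time stamp."""
        self.slots[node_id] = stamp

    def scan(self):
        """Compare every block with the previous scan; return the set of live nodes."""
        alive = set()
        for node, stamp in self.slots.items():
            if stamp != self.last_seen[node]:
                self.stale_scans[node] = 0
                self.last_seen[node] = stamp
            else:
                self.stale_scans[node] += 1
            if self.stale_scans[node] < self.dead_after:
                alive.add(node)
        return alive


region = HeartbeatRegion(["a", "b"], dead_after=2)
for tick in range(1, 4):
    region.beat("a", tick)        # node "a" keeps writing its block
    if tick == 1:
        region.beat("b", tick)    # node "b" writes once, then goes silent
    live = region.scan()
```

After the third scan node "b" has gone two scans without changing its block, so only "a" is still counted as a cluster member.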
Another important component is the distributed lock manager. OCFS2 includes a lock manager which, like the implementation covered last week, is called "dlm" and implements a VMS-like interface. Oracle's implementation is simpler, however (its core locking function only has eight parameters...), and it lacks many of the fancier lock types and functions of Red Hat's implementation. There is also a virtual filesystem interface ("dlmfs") making locking functionality available to user space.
There is a simple, TCP-based messaging system which is used by OCFS2 to talk between nodes in a cluster.
The remaining code is the filesystem implementation itself. It has all of the complications that one would expect of a high-performance filesystem implementation. OCFS2, however, is meant to operate with a disk which is, itself, shared across the cluster (perhaps via some sort of storage-area network or multipath scheme). So each node on the cluster manipulates the filesystem directly, but they must do so in a way which avoids creating chaos. The lock manager code handles much of this - nodes must take out locks on on-disk data structures before working with them.
There is more to it than that, however. There is, for example, a separate "allocation area" set aside for each node in the cluster; when a node needs to add an extent to a file, it can take it directly from its own allocation area and avoid contending with the other nodes for a global lock. There are also certain operations (deleting and renaming files, for example) which cannot be done by a node in isolation. It would not do for one node to delete a file and recycle its blocks if that file remains open on another node. So there is a voting mechanism for operations of this type; a node wanting to delete a file first requests a vote. If another node vetoes the operation, the file will remain for the time being. Either way, all nodes in the cluster can note that the file is being deleted and adjust their local data structures accordingly.
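The delete vote described above might look roughly like the following sketch. It models only the voting step, not the per-node allocation areas or the real OCFS2 messaging, and the names `Node`, `vote_on_delete` and `request_delete` are hypothetical.

```python
class Node:
    """One cluster member, tracking which files it currently has open."""

    def __init__(self, name):
        self.name = name
        self.open_files = set()
        self.pending_delete = set()   # files this node knows are being deleted

    def vote_on_delete(self, path):
        """Note the delete request, then approve (True) or veto (False)."""
        self.pending_delete.add(path)
        return path not in self.open_files


def request_delete(requester, cluster, path):
    """Ask every other node to vote; the delete proceeds only if nobody vetoes."""
    # Build the full list first so that every node sees the request,
    # even when an earlier node has already vetoed.
    votes = [n.vote_on_delete(path) for n in cluster if n is not requester]
    return all(votes)


n1, n2, n3 = Node("n1"), Node("n2"), Node("n3")
cluster = [n1, n2, n3]
n2.open_files.add("/data/log")

first_try = request_delete(n1, cluster, "/data/log")    # vetoed: still open on n2
n2.open_files.discard("/data/log")
second_try = request_delete(n1, cluster, "/data/log")   # nobody objects now
```

Collecting all votes before deciding mirrors the point in the text that every node can note the pending delete whether or not the operation goes ahead.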
The code base as a whole was clearly written with an eye toward easing the path into the mainline kernel. It adheres to the kernel's coding standards and avoids the use of glue layers between the core filesystem code and the kernel. There are no changes to the kernel's VFS layer. Oracle's developers also appear to understand the current level of sensitivity about the merging of cluster support code (node and lock managers, heartbeat code) into the kernel. So they have kept their implementation of these functionalities small and separate from the filesystem itself. OCFS2 needs a lock manager now, for example, so it provides one. But, should a different implementation be chosen for merging at some future point, making the switch should not be too hard.
One assumes that OCFS2 will be merged at some point; adding a filesystem is not usually controversial if it is implemented properly and does not drag along intrusive VFS-layer changes. It is only one of many cluster filesystems, however, so it is unlikely to be alone. The competition in the cluster area, it seems, is just beginning.
yftty replied at: 2005-06-17 11:05:20
http://lwn.net/Articles/136579/
Plan 9 started as Ken Thompson and Rob Pike's attempt to address a number of perceived shortcomings in the Unix model. Among other things, Plan 9 takes the "everything is a file" approach rather further than Unix does, and tries to do so in a distributed manner. Plan 9 never took off the way Unix did, but it remains an interesting project; it has been free software since 2003.
One of the core components of Plan 9 is the 9P filesystem. 9P is a networked filesystem, somewhat equivalent to NFS or CIFS, but with its own particular approach. 9P is not as much a way of sharing files as a protocol definition aimed at the sharing of resources in a networked environment. There is a draft RFC available which describes this protocol in detail.
The protocol is intentionally simple. It works in a connection-oriented, single-user mode, much like CIFS; each user on a Plan 9 system is expected to make one or more connections to the server(s) of interest. Plan 9 operates with per-user namespaces by design, so each user ends up with a unique view of the network. There is a small set of operations supported by 9P servers; a client can create file descriptors, use them to navigate around the filesystem, read and write files, create, rename and delete files, and close things down; that's about it.
The protocol is intentionally independent of the underlying transport mechanism. Typically, a TCP connection is used, but that is not required. A 9P client can, with a proper implementation, communicate with a server over named pipes, zero-copy memory transports, RDMA, RFC1149 avian links, etc. The protocol also puts most of the intelligence on the server side; clients, for example, perform no caching of data. An implication of all these choices is that there is no real reason why 9P servers have to be exporting filesystems at all. A server can just as easily offer a virtual filesystem (along the lines of /proc or sysfs), transparent remote access to devices, connections to remote processes, or just about anything else. The 9P protocol is the implementation of the "everything really is a file" concept. It could thus be used in a similar way as the filesystems in user space (FUSE) mechanism currently being considered for merging. 9P also holds potential as a way of sharing resources between virtualized systems running on the same host.
There is a 9P implementation for Linux, called "v9fs"; Eric Van Hensbergen has recently posted a v9fs patch set for review with an eye toward eventual inclusion. v9fs is a full 9P client implementation; there is also a user-space server available via the v9fs web site.
Linux and Plan 9 have different ideas of how a filesystem should work, so a fair amount of impedance matching is required. Unix-like systems prefer filesystems to be mounted in a global namespace for all users, while Plan 9 filesystems are a per-user resource. A v9fs filesystem can be used in either mode, though the most natural way is to use Linux namespaces to allow each user to set up independently authenticated connections. The lack of client-side caching does not mix well with the Linux VFS, which wants to cache heavily. The current v9fs implementation disables all of this caching. In some areas, especially write performance, this lack of caching makes itself felt. In others, however, v9fs claims better performance than NFS as a result of its simpler protocol. Plan 9 also lacks certain Unix concepts - such as symbolic links. To ease interoperability with Unix systems, a set of protocol extensions has been provided; v9fs uses those extensions where indicated.
The current release is described as "reasonably stable." The basic set of file operations has been implemented, with the exception of mmap(), which is hard to do in a way which does not pose the risk of system deadlocks. Future plans include "a more complete security model" and some thought toward implementing limited client-side caching, perhaps by using the CacheFS layer. See the patch introduction for pointers to more information, mailing lists, etc.
(Posted Jun 6, 2005 16:53 UTC (Mon) by guest stfn)
The design philosophy shares something with the recently popular "REpresentational State Transfer" style of web services. They each chose one unifying metaphor and a minimal interface: either everything is a file and accessed through file system calls, or everything is a resource and accessed through HTTP methods on a URL.
That might be a naive simplification* but others have observed the same:
http://www.xent.com/pipermail/fork/2001-August/002801.html
http://rest.blueoxen.net/cgi-bin/wiki.pl?RestArchitectura...
* It's only one aspect of the design and, on the other hand, there's all kinds of caching in the web and URIs if not URLs are meant to form a global namespace that all users share.
yftty replied at: 2005-06-17 11:15:04
http://lwn.net/Articles/100321/
Many filesystems operate with a relatively slow backing store. Network filesystems are dependent on a network link and a remote server; obtaining a file from such a filesystem can be significantly slower than getting the file locally. Filesystems using slow local media (such as CDROMs) also tend to be slower than those using fast disks. For this reason, it can be desirable to cache data from these filesystems on a local disk.
Linux, however, has no mechanism which allows filesystems to perform local disk caching. Or, at least, it didn't have such a mechanism; David Howells's CacheFS patch changes that.
With CacheFS, the system administrator can set aside a partition on a block device for file caching. CacheFS will then present an interface which may be used by other filesystems. There is a basic registration interface, and a fairly elaborate mechanism for assigning an index to each file. Different filesystems will have different ways of creating identifiers for files, so CacheFS tries to impose as little policy as possible and let the filesystem code do what it wants. Finally, of course, there is an interface for caching a page from a file, noting changes, removing pages from the cache, etc.
CacheFS does not attempt to cache entire files; it must be able to deal with the possibility that somebody will try to work with a file which is bigger than the entire cache. It also does not actually guarantee to cache anything; it must be able to perform its own space management, and things must still function even in the absence of an actual cache device. This should not be an obstacle for most filesystems which, by their nature, must be prepared to deal with the real source for their files in the first place.
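The two properties in the paragraph above (caching individual pages rather than whole files, and still working when nothing can be cached) can be illustrated with a small sketch. This is not the CacheFS interface: the class `PageCache`, the `backing_read` callback and the least-recently-used eviction policy are assumptions chosen for the example.

```python
from collections import OrderedDict


class PageCache:
    """Bounded cache of (file, page) pairs in front of a slow backing store."""

    def __init__(self, backing_read, capacity_pages):
        self.backing_read = backing_read   # callable(file_key, page_no) -> bytes
        self.capacity = capacity_pages     # 0 models "no cache device present"
        self.pages = OrderedDict()         # oldest entry first (LRU order, an assumption)
        self.hits = 0
        self.misses = 0

    def read_page(self, file_key, page_no):
        key = (file_key, page_no)
        if key in self.pages:
            self.pages.move_to_end(key)    # mark as most recently used
            self.hits += 1
            return self.pages[key]
        self.misses += 1
        data = self.backing_read(file_key, page_no)   # always fall back to the real source
        if self.capacity > 0:
            self.pages[key] = data
            if len(self.pages) > self.capacity:
                self.pages.popitem(last=False)        # evict the least recently used page
        return data


def slow_read(file_key, page_no):
    """Stand-in for a network filesystem read (hypothetical)."""
    return f"{file_key}:{page_no}".encode()


cache = PageCache(slow_read, capacity_pages=2)
for page in (0, 1, 0, 2, 1):
    cache.read_page("bigfile", page)

uncached = PageCache(slow_read, capacity_pages=0)
data = uncached.read_page("bigfile", 7)
```

The file being read can be far larger than the cache, since only the pages actually touched are stored, and with a capacity of zero every read simply goes to the backing store.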
CacheFS is meant to work with other filesystems, rather than being used as a standalone filesystem in its own right. Its partitions must be mounted before use, however, and CacheFS uses the mount point to provide a view into the cached filesystem(s). The administrator can even manually force files out of the cache by simply deleting them from the mounted filesystem.
Interposing a cache between the user and the real filesystem clearly adds another failure point which could result in lost data. CacheFS addresses this issue by performing journaling on the cache contents. If things come to an abrupt halt, CacheFS will be able to replay any lost operations once everything is up and functioning again.
The current CacheFS patch is used only by the AFS filesystem, but work is in progress to adapt others as well. NFS, in particular, should benefit greatly from CacheFS, especially when NFSv4 (which is designed to allow local caching) is used. Expect this patch to have a relatively easy journey into the mainstream kernel. For those wanting more information, see the documentation file included with the patch.
CacheFS & Security
(Posted Sep 2, 2004 16:41 UTC (Thu) by subscriber scripter)
I wonder what the security implications of CacheFS are. Does each file inherit the permissions of the original? Is confidentiality a problem? What if you want to securely erase a file?
CacheFS & Security
(Posted Sep 3, 2004 19:49 UTC (Fri) by subscriber hppnq)
Not knowing anything about CacheFS internals, I would say these are cases of "don't do it, then". ;-)
CacheFS & Security
(Posted Sep 13, 2004 18:49 UTC (Mon) by guest AnswerGuy)
The only difference between accessing a filesystem directly and through CacheFS should be that the CacheFS can store copies of the accessed data on a local block device. In other words that there's a (potentially persistent) footprint of all accesses.
Other than that CacheFS should preserve the same permissions semantics as if a given user/host were accessing the backend filesystem/service directly.
A general caching filesystem
(Posted Sep 14, 2004 2:13 UTC (Tue) by subscriber xoddam)
This seems to me like a really complicated reimplementation of virtual
memory.
All filesystems already use VM pages for caching, don't they?
I'd have thought that attaching backing store to those pages would have
been a much simpler task than writing a whole new cache interface.
But then I'm not really a filesystem hacker.
A general caching filesystem
(Posted Oct 25, 2004 0:55 UTC (Mon) by subscriber jcm)
xoddam writes:
> This seems to me like a really complicated reimplementation of
> virtual memory.
No it's really not. By virtual memory you are referring to an aspect of VM implementations known as paging, and that in itself only really impacts upon so-called "anonymous memory". There is a page cache for certain regular filesystems, but it's not possible for all filesystems to exploit the page cache to full effect. In any case, this patch adds the ability to use a local disk as additional cache storage for even slower stuff like network-mounted filesystems, so the page cache can always sit between this disk and the user processes which use it.
Jon.
Improve "Laptop mode"
(Posted Oct 7, 2004 18:57 UTC (Thu) by subscriber BrucePerens)
I haven't looked at the CacheFS code yet, but this is what I would like to do with it, or something like it.
Put a cache filesystem on a FLASH disk plugged into my laptop. My laptop has a 512M MagicGate card, which looks like a USB disk. Use it to cache all recently read and written blocks from the hard disk, and allow the hard disk to remain spun down most of the time. Anytime the disk has to be spun up, flush any pending write blocks to it.
This would be an improvement over "laptop mode" in that it would not require system RAM and could thus be larger, and would not be as volatile as a RAM write cache.
Bruce
yftty replied at: 2005-06-20 14:07:11
1. Introduction to the BeOS and BFS
1.1 History leading up to BFS
The Solution
Starting in September 1996, Cyril Meurillon and I set about to define a new I/O architecture and file system for BeOS. We knew that the existing split of file system and database would no longer work. We wanted a new, high-performance file system that supported the database functionality the BeOS was known for as well as a mechanism to support multiple file systems. We also took the opportunity to clean out some of the accumulated cruft that had worked its way into the system over the course of the previous five years of development.
The task we had to solve had two very clear components. First there was the higher-level file system and device interface. This half of the project involved defining an API for file system and device drivers, managing the name space, connecting program requests for files into file descriptors, and managing all the associated state. The second half of the project involved writing a file system that would provide the functionality required by the rest of the BeOS. Cyril, being the primary kernel architect at Be, took on the first portion of the task. The most difficult portion of Cyril's project involved defining the file system API in such a way that it was as multithreaded as possible, correct, deadlock-free, and efficient. That task involved many major iterations as we battled over what a file system had to do and what the kernel layer would manage. There is some discussion of this level of the file system in Chapter 10, but it is not the primary focus of this book.
My half of the project involved defining the on-disk data structures, managing all the nitty-gritty physical details of the raw disk blocks, and performing the I/O requests made by programs. Because the disk block cache is intimately intertwined with the file system (especially a journaled file system), I also took on the task of rewriting the block cache.
1.2 Design Goals
...
In addition to the above design goals, we had the long-standing goals of making the system as multithreaded and as efficient as possible, which meant fine-grained locking everywhere and paying close attention to the overhead introduced by the file system. Memory usage was also a big concern. ...
1.3 Design Constraints
There were also several design constraints that the project had to contend with. The first and foremost was the lack of engineering resources. The Be engineering staff is quite small, at the time only 13 engineers. Cyril and I had to work alone because everyone else was busy with other projects. We also did not have very much time to complete the project. Be, Inc., tries to have regular software releases, once every four to six months. The initial target was for the project to take six months. The short amount of time to complete the project and the lack of engineering resources meant that there was little time to explore different designs and to experiment with completely untested ideas. In the end it took nine months for the first beta release of BFS. The final version of BFS shipped the following month.
2. What is a File System ?
2.1 The Fundamentals
It is important to keep in mind the abstract goal of what a file system must achieve: to store, retrieve, locate, and manipulate information. Keeping the goal stated in general terms frees us to think of alternative implementations and possibilities that might not otherwise occur if we were to only think of a file system as a typical, strictly hierarchical, disk-based structure.
...
2.3 The Abstractions
...
Extents
Another technique to manage mapping from logical positions in a byte stream to data blocks on disk is to use extent lists. An extent list is similar to the simple block list described previously except that each block address is not just for a single block but rather for a range of blocks. That is, every block address is given as a starting block and a length (expressed as the number of successive blocks following the starting block). The size of an extent is usually larger than a simple block address but is potentially able to map a much larger region of disk space.
...
Although extent lists are a more compact way to refer to large amounts of data, they may still require use of indirect or double-indirect blocks. If a file system becomes highly fragmented and each extent can only map a few blocks of data, then the use of indirect and double-indirect blocks becomes a necessity. One disadvantage to using extent lists is that locating a specific file position may require scanning a large number of extents. Because the length of an extent is variable, when locating a specific position the file system must start at the first extent and scan through all of them until it finds the extent that covers the position of interest. ...
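The linear scan through a variable-length extent list described above can be sketched as follows. This is an illustrative model, not BFS code: the `Extent` fields, the `find_block` helper and the 1024-byte block size are assumptions.

```python
from dataclasses import dataclass


@dataclass
class Extent:
    start_block: int   # first disk block of the run
    length: int        # number of contiguous blocks in the run


def find_block(extents, file_pos, block_size=1024):
    """Return the disk block holding byte offset file_pos.

    Because extents vary in length, the search starts at the first extent
    and walks forward until it reaches the one covering the position.
    """
    logical_block = file_pos // block_size
    covered = 0   # logical blocks mapped by the extents scanned so far
    for ext in extents:
        if logical_block < covered + ext.length:
            return ext.start_block + (logical_block - covered)
        covered += ext.length
    raise ValueError("position is past the end of the file")


# A 16-block file stored in three runs on disk (hypothetical layout).
extents = [Extent(100, 4), Extent(500, 2), Extent(40, 10)]
first = find_block(extents, 0)
middle = find_block(extents, 5 * 1024)
last_run = find_block(extents, 6 * 1024 + 10)
```

Three extents map sixteen blocks here, where a simple block list would need sixteen entries; the cost is that a lookup near the end of a fragmented file has to walk every extent before it.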
Storing Directory Entries
...
Another method of organizing directory entries is to use a sorted data structure suitable for on-disk storage. One such data structure is a B-tree (or its variants, B+tree and B*tree). A B-tree keeps the keys sorted by their name and is efficient at looking up whether a key exists in the directory. B-trees also scale well and are able to deal efficiently with directories that contain many tens of thousands of files.
2.5 Extended file system Operations
...
Indexing
File attributes allow users to associate additional information with files, but there is even more that a file system can do with extended file attributes to aid users in managing and locating their information if the file system also indexes the attributes. For example, if we added a *keyword* attribute to a set of files and the *keyword* attribute was indexed, the user could then issue queries asking which files contained various keywords regardless of their location in the hierarchy.
When coupled with a good query language, indexing offers a powerful alternative interface to the file system. With queries, users are not restricted to navigating a fixed hierarchy of files; instead they can issue queries to find the working set of files they would like to see, regardless of the location of the files.
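As a rough sketch of the idea, an attribute index can be modeled as a map from (attribute, value) pairs to file paths. The `AttributeIndex` class and the example paths below are hypothetical; a real file system would keep the index on disk, for instance in a B-tree, rather than in a Python dictionary.

```python
from collections import defaultdict
from typing import Dict, Set


class AttributeIndex:
    """Maps attribute name -> value -> paths of the files carrying that value."""

    def __init__(self) -> None:
        self._index: Dict[str, Dict[str, Set[str]]] = defaultdict(
            lambda: defaultdict(set)
        )

    def add(self, path: str, attribute: str, value: str) -> None:
        self._index[attribute][value].add(path)

    def query(self, attribute: str, value: str) -> Set[str]:
        # .get() avoids creating empty entries for values never indexed.
        return set(self._index.get(attribute, {}).get(value, set()))


index = AttributeIndex()
index.add("/home/a/report.txt", "keyword", "budget")
index.add("/archive/2004/plan.doc", "keyword", "budget")
index.add("/home/a/notes.txt", "keyword", "travel")

# The query result ignores where the files live in the hierarchy.
print(sorted(index.query("keyword", "budget")))
```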
Journaling/Logging
Avoiding corruption in a file system is a difficult task. Some file systems go to great lengths to avoid corruption problems. They may attempt to order disk writes in such a way that corruption is recoverable, or they may force operations that can cause corruption to be synchronous so that the file system is always in a known state. Still other systems simply avoid the issue and depend on a very sophisticated file system check program to recover in the event of failures. All of these approaches must check the disk at boot time, a potentially lengthy operation (especially as disk sizes increase). Further, should a crash happen at an inopportune time, the file system may still be corrupt.
A more modern approach to avoiding corruption is *journaling*. Journaling, a technique borrowed from the database world, avoids corruption by batching groups of changes and committing them all at once to a transaction log. The batched changes guarantee the atomicity of multiple changes. That atomicity guarantee allows the file system to guarantee that operations either happen completely or not at all. Further, if a crash does happen, the system need only replay the transaction log to recover the system to a known state. Replaying the log is an operation that takes at most a few seconds, which is considerably faster than the file system check that nonjournaled file systems must make.
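The all-or-nothing behavior can be illustrated with a small in-memory model. This is a sketch under simplifying assumptions: the `Journal` class, the `("write", ...)`/`("commit", ...)` record format, and the `crash_before_commit` flag are invented for the example, and a Python list stands in for the on-disk log.

```python
from typing import Dict, List, Tuple

Write = Tuple[str, str]  # (block name, new contents); a hypothetical representation


class Journal:
    def __init__(self) -> None:
        self.log: List[Tuple[str, object]] = []  # stands in for the on-disk log

    def log_transaction(self, writes: List[Write],
                        crash_before_commit: bool = False) -> None:
        """Append a group of writes, then a commit record unless we 'crash'."""
        for write in writes:
            self.log.append(("write", write))
        if not crash_before_commit:
            self.log.append(("commit", None))

    def replay(self, disk: Dict[str, str]) -> None:
        """Apply every committed transaction; drop any trailing uncommitted one."""
        pending: List[Write] = []
        for kind, payload in self.log:
            if kind == "write":
                pending.append(payload)
            else:  # commit record: the whole batch becomes visible at once
                for name, contents in pending:
                    disk[name] = contents
                pending = []
        # Writes still in `pending` were never committed and are ignored.


disk: Dict[str, str] = {}
journal = Journal()
journal.log_transaction([("inode 7", "size=10"), ("bitmap", "block 42 used")])
journal.log_transaction([("inode 8", "size=3"), ("bitmap", "block 43 used")],
                        crash_before_commit=True)
journal.replay(disk)
print(disk)
```

After replay the disk holds both changes of the first transaction and neither change of the second, so the inode and the bitmap never disagree.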
Guaranteed bandwidth/Bandwidth Reservation
The desire to guarantee high-bandwidth I/O for multimedia applications drives some file system designers to provide special hooks that allow applications to guarantee that they will receive a certain amount of I/O bandwidth (within the limits of the hardware). To accomplish this the file system needs a great deal of knowledge about the capabilities of the underlying hardware it uses and must schedule I/O requests. This problem is nontrivial and still an area of research.
Access Control Lists
Access Control Lists (ACLs) provide an extended mechanism for specifying who may access a file and how they may access it. The traditional POSIX approach of three sets of permissions - for the owner of a file, the group that the owner is in, and everyone else - is not sufficient in some settings. An access control list specifies the exact level of access that any person may have to a file. This allows for fine-grained control over the access to a file in comparison to the broad divisions defined in the POSIX security model.
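A minimal way to picture an ACL is a per-file table from user to permitted operations. The `Acl` alias, the `acl_allows` helper, and the user names below are assumptions for illustration only and do not follow any particular ACL standard.

```python
from typing import Dict, Set

# Hypothetical ACL for one file: user name -> operations that user may perform.
Acl = Dict[str, Set[str]]


def acl_allows(acl: Acl, user: str, operation: str) -> bool:
    """Users who are not listed get no access at all."""
    return operation in acl.get(user, set())


report_acl: Acl = {
    "alice": {"read", "write"},
    "bob": {"read"},
}

print(acl_allows(report_acl, "alice", "write"))
print(acl_allows(report_acl, "bob", "write"))
print(acl_allows(report_acl, "carol", "read"))
```

With owner/group/other bits alone, giving alice write access and bob read-only access while excluding everyone else would require creating a dedicated group; the per-user table expresses it directly.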
uplooking replied at: 2005-06-21 02:38:48
Honestly, if what yftty is working on succeeds, that would be really impressive! If I had the time I might join in. Bump.
yftty replied at: 2005-06-22 10:50:12
1) http://zgp.org/linux-elitists/20040101205016.E5998@shaitan.lightconsulting.com.html
2) http://zgp.org/linux-elitists/20040101205016.E5998@shaitan.lightconsulting.com.html
3. Elastic Quota File System (EQFS) Proposal
23 Jun 2004 - 30 Jun 2004 (46 posts) Archive Link: "Elastic Quota File System
(EQFS)"
People: Amit Gud, Olaf Dabrunz, Mark Cooke
Amit Gud said:
Recently I'm into developing an Elastic Quota File System (EQFS). This file
system works on a simple concept ... give it to others if you're not using
it, let others use it, but on the guarantee that you get it back when you
need it!!
Here I'm talking about disk quotas. In any typical network, e.g.
sourceforge, each user is given a fixed amount of quota. 100 Mb in case of
sourceforge. 100 Mb is way over some project requirements and too small for
some projects. EQFS tries to solve this problem by exploiting the users'
usage behavior at runtime. That is the user's quota which he doesn't need
is given to the users who need it, but on 100% assurance that the original
user can any time reclaim his/her quota.
Before getting into implementation details I want to have public opinion
about this system. All EQFS tries to do is it maximizes the disk space
usage, which otherwise is wasted if the user doesn't really need the
allocated user..on the other hand it helps avoid the starvation of the user
who needs more space. It also helps administrator to get away with the
problem of variable quota needs..as EQFS itself adjusts according to the
user needs.
Mark Watts asked how it would be possible to "guarantee" that the user would
get the space back when they wanted it. Amit expanded:
Ok, this is what I propose:
Lets say there are just 2 users with 100 megs of individual quota, user A
is using 20 megs and user B is running out of his quota. Now what B could
do is delete some files himself and make some free space for storing other
files. Now what I say is instead of deleting the files, he declares those
files as elastic.
Now, moment he makes that files elastic, that much amount of space is added
to his quota. Here Mark Cooke's equation applies with some modifications: N
no. of users, Qi allocated quota of ith user Ui individual disk usage of
ith user ( should be <= allocated quota of ith user ), D disk threshold;
thats the amount of disk space admin wants to allow the users to use
(should be >= sum of all users' allocated quota, i.e. summation Qi ; for i
= 0 to N - 1).
Total usage of all the users (here A & B) should be at _anytime_ less than
D. i.e. summation Ui <= D; for i = 0 to N - 1.
The point to note here is that we are not bothering how much quota has been
allocated to an individual user by the admin, but we are more interested in
the usage pattern followed by the users. E.g. if user B wants additional
space of say 25 megs, he picks up 25 megs of his files and 'marks' them
elastic. Now his quota is increased to 125 megs and he can now add more 25
megs of files; at the same time allocated quota for user A is left
unaffected. Applying the above equation total usage now is A: 20 megs, B:
125 megs, now total 145 <= D, say 200 megs. Thus this should be ok for the
system, since the usage is within bounds.
Now what happens if Ui > D? This can happen when user A tries to reclaim
his space. i.e. if user A adds say more 70 megs of files, so the total
usage is now - A: 90 megs, B: 125 megs; 215 ! <= D. The moment the total
usage crosses the value, 'action' will be taken on the elastic files. Here
elastic files are of user B so only those will be affected and users A's
data will be untouched, so in a way this will be completely transparent to
user A. What action should be taken can be specified by the user while
making the files elastic. He can either opt to delete the file, compress it
or move it to some place (backup) where he know he has write access. The
corresponding action will be taken until the threshold is met.
Will this work?? We are relying on the 'free' space ( i.e. D - Ui ) for the
users to benefit. The chances of having a greater value for D - Ui
increases with the increase in the number of users, i.e. N. Here we are
talking about 2 users but think of 10000+ users where all the users will
probably never use up _all_ the allocated disk space. This user behavior
can be well exploited.
EQFS can be best fitted in the mail servers. Here e.g. I make whole
linux-kernel mailing list elastic. As long as Ui <= D I get to keep all the
messages, whenever Ui > D, messages with latest dates will be 'acted' upon.
For variable quota needs, admin can allocate different quotas for different
users, but this can get tiresome when N is large. With EQFS, he can
allocate fixed quota for each user ( old and new ) , set up a value for D
and relax. The users will automatically get the quota they need. One may
ask that this can be done by just setting up value of D, checking it
against summation Ui and not allocating individual quotas at all. But when
summation Ui crosses D value, whose file to act on? Moreover with both
individual quotas and D, we give users 'controlled' flexibility just like
elastic - it can be stretched but not beyond a certain range.
What happens when an user tries to eat up all the free ( D - Ui ) space?
This answer is implementation dependent because you need to make a
decision: should an user be allowed to make a file elastic when Ui == D . I
think by saying 'yes' we eliminate some users' mischief of eating up all
free space.
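Amit's two-user example can be traced with a short Python sketch. Everything here is an assumption layered on the proposal: the `User` class, the choice of deletion as the 'action', and the order in which elastic files are picked are illustrative, since the mail leaves those details to the implementation.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class User:
    quota: int                    # Qi: quota allocated by the admin, in megs
    regular: int = 0              # megs held in ordinary (non-elastic) files
    elastic: List[int] = field(default_factory=list)  # sizes of elastic files

    @property
    def usage(self) -> int:       # Ui: everything the user currently stores
        return self.regular + sum(self.elastic)

    @property
    def effective_quota(self) -> int:
        # Marking files elastic adds their size to the user's quota.
        return self.quota + sum(self.elastic)


def total_usage(users: Dict[str, User]) -> int:
    return sum(user.usage for user in users.values())


def enforce_threshold(users: Dict[str, User], threshold: int) -> List[str]:
    """Delete elastic files until summation Ui <= D; return the actions taken."""
    actions: List[str] = []
    for name, user in users.items():
        while user.elastic and total_usage(users) > threshold:
            size = user.elastic.pop()
            actions.append(f"deleted {size} megs of elastic files owned by {name}")
    return actions


D = 200  # disk threshold chosen by the admin
users = {
    "A": User(quota=100, regular=20),
    # B marked 25 megs elastic (two files here) and then stored 25 megs more.
    "B": User(quota=100, regular=100, elastic=[10, 15]),
}
print(users["B"].effective_quota, total_usage(users))  # quota 125, usage 145

users["A"].regular += 70          # A reclaims space: total usage becomes 215 > D
actions = enforce_threshold(users, D)
print(actions, total_usage(users))
```

Only B's elastic files are touched, and only as many as are needed to bring the total back to D, so user A never notices the reclamation.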
Olaf Dabrunz replied:
+ having files disappear at the discretion of the filesystem seems to be
bad behaviour: either I need this file, then I do not want it to just
disappear, or I do not need it, and then I can delete it myself.
Since my idea of which files I need and which I do not need changes
over time, I believe it is far better that I can control which files I
need and which I do not need whenever other constraints (e.g. quota
filled up) make this decision necessary. Also, then I can opt to try to
convince someone to increase my quota.
+ moving the file to some other place (backup) does not seem to be a
viable option:
o If the backup media is always accessible, then why can't the user
store the "elastic" files there immediately?
-> advantages:
# the user knows where his file is
# applications that remember the path to a file will be able to
access it
o If the backup media will only be accessible after manually
inserting it into some drive, this amounts to sending an E-Mail to
the backup admin and then pass a list of backup files to the backup
software.
But now getting the file back involves a considerable amount of
manual and administrative work. And it involves bugging the backup
admin, who now becomes the bottleneck of your EQFS.
So this narrows down to the effective handling of backup procedures and the
effective administration of fixed quotas and centralization of data.
If you have many users it is also likely that there are more people
interested in big data-files. So you need to help these people organize
themselves e.g. by helping them to create mailing-list, web-pages or
letting them install servers that makes the data centrally available with
some interface that they can use to select parts of the data.
I would rather suggest that if the file does not fit within a given quota,
the user should apply for more quota and give reasons for that.
I believe that flexible or "elastic" allocation of resources is a good
idea in general, but it only works if you have cheap and easy ways to
control both allocation and deallocation. So in the case of CBQ in networks
this works, since bandwidth can easily and quickly be allocated and
deallocated.
But for filesystem space this requires something like a "slower (= less
expensive), bigger, always accessible" third level of storage in the "RAM,
disk, ..." hierarchy. And then you would need an easy or even transparent
way to access files on this third level storage. And you need to make sure
that, although you obviously *need* the data for something, you still can
afford to increase retrieval times by several orders of magnitude at the
discretion of the filesystem.
But usually all this can be done by scripts as well.
Still, there is a scenario and a combination of features for such a
filesystem that IMHO would make it useful:
+ Provide allocation of overquota as you described it.
+ Let the filesystem move (parts of) the "elastic" files to some
third-level backing-store on an as-needed basis. This provides you with
a not-so-cheap (but cheaper than manual handling) resource management
facility.
Now you can use the third-level storage as a backing store for hard-drive
space, analogous to what swap-space provides for RAM. And you can "swap
in" parts of files from there and cache them on the hard drive. So
"elastic" files are actually files that are "swappable" to backing store.
This assumes that the "elastic" files meet the requirements for a "working
set" in a similar fashion as for RAM-based data. I.e. the swap operations
need only be invoked relatively seldom.
If this is not the case, your site/customer needs to consider buying more
hard drive space (and maybe also RAM).
The tradeoff for the user now is:
+ do not have the big file(s) OR
+ have them and be able to use them in a random-access fashion from any
application, but maybe only with a (quite) slow access time, but
without additional administrative/manual hassle
Maybe this is a good tradeoff for a significant amount of users. Maybe
there are sites/customers that have the required backing store (or would
consider buying into this). I do not know. Find a sponsor, do some field
research and give it a try.
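yftty replied at: 2005-06-22 11:11:22
I see the word "scalability" in many computer books.
Cluster computing is also said to have this property,
but I do not really understand what it means.
Could someone who knows please explain? Thanks.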
If your computer or cluster has a bottleneck,
you want to solve the bottleneck.
Scalability is the ability to do that.
Usually there are three types of bottlenecks:
1. CPU
You add more CPUs to your computer, add more
compute nodes (horizontal scalability) or
buy a bigger computer (vertical scalability).
2. Network - add more switches, network cards or buy
Myrinet, etc.
3. I/O - disk I/O bottlenecks can be solved by
better I/O (IDE>SCSI, for example) or I/O clustering
(http://www.erexi.com.tw/solutions/NFS_fileserve_solution.pdf)
and similar.
Some applications are scalable (Oracle DB: you can
scale horizontally by moving from 9i to 9i RAC), some
are not (Linux VI Editor).
Then you have scalability limits (e.g. "x scales to up to
16 nodes"), etc.
Sean
The official definition according to webopedia (www.webopedia.com)
is:
(1) A popular buzzword that refers to how well a hardware or software system can adapt to increased demands. For example,
a scalable network system would be one that can start with just a few nodes but can easily expand to thousands of nodes.
Scalability can be a very important feature because it means that you can invest in a system with confidence you won't outgrow it.
(2) Refers to anything whose size can be changed.
For example, a font is said to be scalable if it can be represented in different sizes.
(3) When used to describe a computer system, the ability to run more than one processor.
Normally, scalability means that a system can adapt to changes you've made to it,
for example, by supporting more CPUs.
Hope I make myself clear.
yftty replied at: 2005-06-23 09:32:34
Allan Fields wrote:
> On Tue, Jun 21, 2005 at 09:35:56AM -0500, Eric Anderson wrote:
>
>>This is something I've brought up before on other lists, but I'm curious
>>if anyone is interested in developing a BSD licensed clustered
>>filesystem for FreeBSD (and anyone else)?
>
>
> A few questions:
>
> Could this be done as a stackable file system (vnode layer distributed
> file system) or did you have something else in mind (i.e. specifically
> a full implementation of a network filesystem including storage
> layer)?
Hmm.  I'm not sure if it can or not.  I'll try to explain what I'm
dreaming of.  I currently have about 1000 clients needing access to the
same pools of data (read/write) all the time.  The data changes
constantly.  There is a lot of this data.  We use NFS currently.
FreeBSD is *very* fast and stable at serving NFS data.  The problem is,
that even though it is very fast and stable, I still cannot pump out
enough bits fast enough with one machine, and if that one machine fails
(hardware problems, etc), then all my machines are hung waiting for me
to bring it back online.
So, what I would love to have, is this kind of setup: shared media
storage (fibre channel SAN, iscsi, or something like ggated possibly),
connected up to a cluster of hosts running FreeBSD.  Each FreeBSD server
has access to the logical disks, same partitions, and can mount them all
r/w.  Now, I can kind of do this now, however there are obviously some
issues with this currently.  I want all machines in this cluster to be
able to serve the data via NFS (or http, or anything else for that
matter really - if you can make NFS work, anything will pretty much
work) simultaneously from the same partitions, and see writes
immediately as the other hosts in the cluster commit them.
I currently have a solution just like this for Linux - Polyserve
(http://www.polyserve.com) has a clustered filesystem for linux, that
works very well.  I've even tried to convince them to port it to
FreeBSD, but it falls on deaf ears, so it's time to make our own.
> Why not a port of an existing network filesystem say from Linux?
> (A BSD rewrite could be done, if the code was GPLed.)  Would
> cross-platform capabilities make sense?
That would work fine I'm sure - but I have found some similar threads in
the past that claim it would be just as hard and time consuming to port
one as it would be to create one from scratch.   Cross platform
capabilities would be great, but I'm mostly interested in getting
FreeBSD into this arena (as it will soon be an extremely important one
to be in).
> How do you see this comparing to device-level solutions?  I know
> the argument can be made to implement file systems/storage
> abstractions at multiple layers, but I thought I might ask.
I'm not sure of a device level solution that does this.  I think the OS
has to know to commit the meta-data to a journal, or otherwise let the
other machines know about locking, etc, in order for this to work.
> The other thing is there a wealth of filesystem papers out there,
> any in specific caught your eye?
No - can you point me to some?
I'll be honest here - I'm not a code developer.  I would love to learn
some C here, and 'just do it', but filesystems aren't exactly simple, so
I'm looking for a group of people that would love to code up something
amazing like this - I'll support the developers and hopefully learn
something in the process.  My goal personally would be to do anything I
could to make the developers work most productively, and do testing.  I
can probably provide equipment, and a good testbed for it.
Eric
yftty replied at: 2005-06-23 09:33:39
> Hmm.  I'm not sure if it can or not.  I'll try to explain what I'm
> dreaming of.  I currently have about 1000 clients needing access to the
> same pools of data (read/write) all the time.  The data changes
> constantly.  There is a lot of this data.  We use NFS currently.
Sounds like you want SGI's clustered xfs....
> I'll be honest here - I'm not a code developer.  I would love to learn
> some C here, and 'just do it', but filesystems aren't exactly simple, so
> I'm looking for a group of people that would love to code up something
> amazing like this - I'll support the developers and hopefully learn
> something in the process.  My goal personally would be to do anything I
> could to make the developers work most productively, and do testing.  I
> can probably provide equipment, and a good testbed for it.
If you are not a seasoned programmer in _some_ language, this
will not be easy at all.
One suggestion is to develop an abstract model of what a CFS
is.  Coming up with a clear detailed precise specification is
not an easy task either but it has to be done and if you can
do it, it will be immensely helpful all around.  You will
truly understand what you are doing, you have a basis for
evaluating design choices, you will have made choices before
writing any code, you can write test cases, writing code is
far easier etc.  etc.  Google for clustered filesystems.
The citeseer site has some papers as well.
A couple FS specific suggestions:
- perhaps clustering can be built on top of existing
filesystems.  Each machine's local filesystem is considered
a cache and you use some sort of cache coherency protocol.
That way you don't have to deal with filesystem allocation
and layout issues.
- a network wide stable storage `disk' may be easier to do
given GEOM.  There are at least N copies of each data block.
Data may be cached locally at any site but writing data is
done as a distributed transaction.  So again cache
coherency is needed.  A network RAID if you will!
But again, let me stress that one must have a clear *model*
of the problem being solved.  Getting distributed programs
right is very hard even at an abstract model level.
Debugging a distributed program that doesn't have a clear
model is, well, for masochists (nothing against them -- I
bet even they'd rather get their pain some other way:-)
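BigMonkey replied at: 2005-06-23 09:42:06
Quote: originally posted by "yftty":
I see the word "scalability" in many computer books.
Cluster computing is also said to have this property,
but I do not really understand what it means.
Could someone who knows please explain? Thanks.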
Scalability
the ease with which a system or component can be modified to fit the problem area.
The SEI definition:
http://www.sei.cmu.edu/str/indexes/glossary/scalability.html
yftty replied at: 2005-06-26 11:41:51
http://www.onlamp.com/pub/a/onlamp/2005/06/23/whatdevswant.html
"Irrespective of the language programmers choose for expressing solutions, their wants and needs are similar. They need to be productive and efficient, with technologies that do not get in the way but rather help them produce high-quality software. In this article, we share our top ten list of programmers' common wants and needs."
yftty replied at: 2005-06-27 18:52:26
http://bbs.chinaunix.net/forum/viewtopic.php?t=568208&show_type=
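In our working lives, however much salary and stock we take home, that is only carrying water;
yet we forget to use the time after work to dig a well of our own,
building up our strength in another area.
Later, when you are older and can no longer outwork the young,
you will still have water to drink,
and you can drink it at your leisure.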
yftty replied at: 2005-06-28 12:04:04
Amenable to extensive parallelization, Google’s Web search application lets different queries run on different processors and, by partitioning the overall index, also lets a single query use multiple processors. To handle this workload, Google’s architecture features clusters of more than 15,000 commodity-class PCs with fault-tolerant software. This architecture achieves superior performance at a fraction of the cost of a system built from fewer, but more expensive, high-end servers.
Luiz André Barroso
Jeffrey Dean
Urs Hölzle
Few Web services require as much computation per request as search engines. On average, a single query on Google reads hundreds of megabytes of data and consumes tens of billions of CPU cycles. Supporting a peak request stream of thousands of queries per second requires an infrastructure comparable in size to that of the largest supercomputer installations. Combining more than 15,000 commodity-class PCs with fault-tolerant software creates a solution that is more cost-effective than a comparable system built out of a smaller number of high-end servers.
Here we present the architecture of the Google cluster, and discuss the most important factors that influence its design: energy efficiency and price-performance ratio. Energy efficiency is key at our scale of operation, as power consumption and cooling issues become significant operational factors, taxing the limits of available data center power densities.
Our application affords easy parallelization:
Different queries can run on different processors, and the overall index is partitioned so that a single query can use multiple processors. Consequently, peak processor performance is less important than its price/performance. As such, Google is an example of a throughput-oriented workload, and should benefit from processor architectures that offer more on-chip parallelism, such as simultaneous multithreading or on-chip multiprocessors.
Google architecture overview
Google’s software architecture arises from two basic insights. First, we provide reliability in software rather than in server-class hardware, so we can use commodity PCs to build a high-end computing cluster at a low-end price. Second, we tailor the design for best aggregate request throughput, not peak server response time, since we can manage response times by parallelizing individual requests.
We believe that the best price/performance tradeoff for our applications comes from fashioning a reliable computing infrastructure from clusters of unreliable commodity PCs. We provide reliability in our environment at the software level, by replicating services across many different machines and automatically detecting and handling failures. This software-based reliability encompasses many different areas and involves all parts of our system design. Examining the control flow in handling a query provides insight into the high-level structure of the query-serving system, as well as insight into reliability considerations.
Serving a Google query
When a user enters a query to Google (such as www.google.com/search?q=ieee+society), the user’s browser first performs a domain name system (DNS) lookup to map www.google.com to a particular IP address. To provide sufficient capacity to handle query traffic, our service consists of multiple clusters distributed worldwide. Each cluster has around a few thousand machines, and the geographically distributed setup protects us against catastrophic data center failures (like those arising from earthquakes and large-scale power failures). A DNS-based load-balancing system selects a cluster by accounting for the user’s geographic proximity to each physical cluster. The load-balancing system minimizes round-trip time for the user’s request, while also considering the available capacity at the various clusters.
The user’s browser then sends a hypertext transport protocol (HTTP) request to one of these clusters, and thereafter, the processing of that query is entirely local to that cluster. A hardware-based load balancer in each cluster monitors the available set of Google Web servers (GWSs) and performs local load balancing of requests across a set of them. After receiving a query, a GWS machine coordinates the query execution and formats the results into a Hypertext Markup Language (HTML) response to the user’s browser. Figure 1 illustrates these steps.
Query execution consists of two major phases.1 In the first phase, the index servers consult an inverted index that maps each query word to a matching list of documents (the hit list). The index servers then determine a set of relevant documents by intersecting the hit lists of the individual query words, and they compute a relevance score for each document. This relevance score determines the order of results on the output page.
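The first phase can be sketched with a toy inverted index. The functions `build_inverted_index` and `search`, the sample documents, and above all the scoring rule (a plain count of query-word occurrences) are assumptions for illustration; the article does not describe Google's actual relevance computation.

```python
from typing import Dict, List, Set


def build_inverted_index(documents: Dict[int, str]) -> Dict[str, Set[int]]:
    """Map each word to the set of docids containing it (its hit list)."""
    index: Dict[str, Set[int]] = {}
    for docid, text in documents.items():
        for word in text.lower().split():
            index.setdefault(word, set()).add(docid)
    return index


def search(index: Dict[str, Set[int]], documents: Dict[int, str],
           query: str) -> List[int]:
    """Intersect the hit lists of all query words, then order by a toy score."""
    words = query.lower().split()
    if not words:
        return []
    hit_lists = [index.get(word, set()) for word in words]
    matches = set.intersection(*hit_lists)

    def score(docid: int) -> int:
        # Toy relevance: total occurrences of the query words in the document.
        tokens = documents[docid].lower().split()
        return sum(tokens.count(word) for word in words)

    # Highest score first; ties broken by docid to keep the order stable.
    return sorted(matches, key=lambda docid: (-score(docid), docid))


documents = {
    1: "ieee computer society",
    2: "ieee society society news",
    3: "computer news",
}
index = build_inverted_index(documents)
print(search(index, documents, "ieee society"))
```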
The search process is challenging because of the large amount of data: The raw documents comprise several tens of terabytes of uncompressed data, and the inverted index resulting from this raw data is itself many terabytes of data. Fortunately, the search is highly parallelizable by dividing the index into pieces (index shards), each having a randomly chosen subset of documents from the full index. A pool of machines serves requests for each shard, and the overall index cluster contains one pool for each shard. Each request chooses a machine within a pool using an intermediate load balancer; in other words, each query goes to one machine (or a subset of machines) assigned to each shard. If a shard’s replica goes down, the load balancer will avoid using it for queries, and other components of our cluster-management system will try to revive it or eventually replace it with another machine. During the downtime, the system capacity is reduced in proportion to the total fraction of capacity that this machine represented. However, service remains uninterrupted, and all parts of the index remain available.
The final result of this first phase of query execution is an ordered list of document identifiers (docids). As Figure 1 shows, the second phase involves taking this list of docids and computing the actual title and uniform resource locator of these documents, along with a query-specific document summary. Document servers (docservers) handle this job, fetching each document from disk to extract the title and the keyword-in-context snippet. As with the index lookup phase, the strategy is to partition the processing of all documents by randomly distributing documents into smaller shards having multiple server replicas responsible for handling each shard, and routing requests through a load balancer. The docserver cluster must have access to an online, low-latency copy of the entire Web. In fact, because of the replication required for performance and availability, Google stores dozens of copies of the Web across its clusters.
In addition to the indexing and document-serving phases, a GWS also initiates several other ancillary tasks upon receiving a query, such as sending the query to a spell-checking system and to an ad-serving system to generate relevant advertisements (if any). When all phases are complete, a GWS generates the appropriate HTML for the output page and returns it to the user’s browser.
Using replication for capacity and fault-tolerance
We have structured our system so that most accesses to the index and other data structures involved in answering a query are read-only: Updates are relatively infrequent, and we can often perform them safely by diverting queries away from a service replica during an update. This principle sidesteps many of the consistency issues that typically arise in using a general-purpose database.
We also aggressively exploit the very large amounts of inherent parallelism in the application: For example, we transform the lookup of matching documents in a large index into many lookups for matching documents in a set of smaller indices, followed by a relatively inexpensive merging step. Similarly, we divide the query stream into multiple streams, each handled by a cluster. Adding machines to each pool increases serving capacity, and adding shards accommodates index growth. By parallelizing the search over many machines, we reduce the average latency necessary to answer a query, dividing the total computation across more CPUs and disks. Because individual shards don’t need to communicate with each other, the resulting speedup is nearly linear. In other words, the CPU speed of the individual index servers does not directly influence the search’s overall performance, because we can increase the number of shards to accommodate slower CPUs, and vice versa. Consequently, our hardware selection process focuses on machines that offer an excellent request throughput for our application, rather than machines that offer the highest single-thread performance.
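The shard-and-merge pattern, including the behavior when one replica is down, can be modeled as follows. This is a single-process sketch with several stated assumptions: `Replica`, `build_shards`, and `query` are hypothetical names, documents are assigned to shards by `docid % num_shards` instead of randomly so the example is reproducible, the "load balancer" simply takes the first healthy replica, and the score is a plain word count.

```python
import heapq
from typing import Dict, List, Tuple


class Replica:
    """One machine serving a copy of a single index shard."""

    def __init__(self, docs: Dict[int, str]) -> None:
        self.docs = docs
        self.healthy = True

    def search(self, word: str) -> List[Tuple[int, int]]:
        """Return (score, docid) pairs for the shard's documents containing word."""
        results = []
        for docid, text in self.docs.items():
            count = text.lower().split().count(word)
            if count:
                results.append((count, docid))
        return results


def build_shards(documents: Dict[int, str], num_shards: int,
                 replicas_per_shard: int) -> List[List[Replica]]:
    shards = []
    for shard_no in range(num_shards):
        subset = {d: t for d, t in documents.items() if d % num_shards == shard_no}
        shards.append([Replica(subset) for _ in range(replicas_per_shard)])
    return shards


def query(shards: List[List[Replica]], word: str, top_k: int = 10) -> List[int]:
    partial: List[Tuple[int, int]] = []
    for pool in shards:
        # Stand-in for the load balancer; raises StopIteration if a whole pool is down.
        replica = next(r for r in pool if r.healthy)
        partial.extend(replica.search(word))
    # Inexpensive merge step: keep the best results across all shards.
    best = heapq.nlargest(top_k, partial, key=lambda p: (p[0], -p[1]))
    return [docid for _, docid in best]


documents = {0: "web search", 1: "search search engine",
             2: "cluster", 3: "search cluster search search"}
shards = build_shards(documents, num_shards=2, replicas_per_shard=2)
before = query(shards, "search")
shards[1][0].healthy = False      # one replica of shard 1 fails
after = query(shards, "search")
print(before, after)
```

Losing a replica removes serving capacity but not index coverage, because another replica of the same shard answers instead; adding shards or replicas changes only the arguments to `build_shards`.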
In summary, Google clusters follow three key design principles:
- Software reliability. We eschew fault-tolerant hardware features such as redundant power supplies, a redundant array of inexpensive disks (RAID), and high-quality components, instead focusing on tolerating failures in software.
- Use replication for better request throughput and availability. Because machines are inherently unreliable, we replicate each of our internal services across many machines. Because we already replicate services across multiple machines to obtain sufficient capacity, this type of fault tolerance almost comes for free.
- Price/performance beats peak performance. We purchase the CPU generation that currently gives the best performance per unit price, not the CPUs that give the best absolute performance.
- Using commodity PCs reduces the cost of computation. As a result, we can afford to use more computational resources per query, employ more expensive techniques in our ranking algorithm, or search a larger index of documents.
Leveraging commodity parts
Google’s racks consist of 40 to 80 x86-based servers mounted on either side of a custom-made rack (each side of the rack contains twenty 2u or forty 1u servers). Our focus on price/performance favors servers that resemble mid-range desktop PCs in terms of their components, except for the choice of large disk drives. Several CPU generations are in active service, ranging from single-processor 533MHz Intel-Celeron-based servers to dual 1.4GHz Intel Pentium III servers. Each server contains one or more integrated drive electronics (IDE) drives, each holding 80 Gbytes. Index servers typically have less disk space than document servers because the former have a more CPU-intensive workload. The servers on each side of a rack interconnect via a 100-Mbps Ethernet switch that has one or two gigabit uplinks to a core gigabit switch that connects all racks together.
Our ultimate selection criterion is cost per query, expressed as the sum of capital expense (with depreciation) and operating costs (hosting, system administration, and repairs) divided by performance. Realistically, a server will not last beyond two or three years, because of its disparity in performance when compared to newer machines. Machines older than three years are so much slower than current-generation machines that it is difficult to achieve proper load distribution and configuration in clusters containing both types. Given the relatively short amortization period, the equipment cost figures prominently in the overall cost equation.
Because Google servers are custom made, we’ll use pricing information for comparable PC-based server racks for illustration. For example, in late 2002 a rack of 88 dual-CPU 2-GHz Intel Xeon servers with 2 Gbytes of RAM and an 80-Gbyte hard disk was offered on RackSaver.com for around $278,000. This figure translates into a monthly capital cost of $7,700 per rack over three years. Personnel and hosting costs are the remaining major contributors to overall cost.
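The monthly figure can be checked with a few lines of arithmetic. The sketch below assumes plain straight-line depreciation with no interest or resale value, which the article does not spell out; the constant and function names are made up for illustration.

```python
RACK_PRICE_USD = 278_000       # late-2002 RackSaver.com price quoted above
AMORTIZATION_MONTHS = 3 * 12   # three-year service life


def monthly_capital_cost(price_usd, months):
    """Straight-line depreciation: spread the purchase price evenly over the period."""
    return price_usd / months


monthly = monthly_capital_cost(RACK_PRICE_USD, AMORTIZATION_MONTHS)
print(f"${monthly:,.0f} per rack per month")
```

Under that assumption the result is about $7,722, which the article rounds to $7,700.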
The relative importance of equipment cost makes traditional server solutions less appealing for our problem because they increase performance but decrease the price/performance. For example, four-processor motherboards are expensive, and because our application parallelizes very well, such a motherboard doesn’t recoup its additional cost with better performance. Similarly, although SCSI disks are faster and more reliable, they typically cost two or three times as much as an equal-capacity IDE drive.
The cost advantages of using inexpensive, PC-based clusters over high-end multiprocessor servers can be quite substantial, at least for a highly parallelizable application like ours. The example $278,000 rack contains 176 2-GHz Xeon CPUs, 176 Gbytes of RAM, and 7 Tbytes of disk space. In comparison, a typical x86-based server contains eight 2-GHz Xeon CPUs, 64 Gbytes of RAM, and 8 Tbytes of disk space; it costs about $758,000 [2]. In other words, the multiprocessor server is about three times more expensive but has 22 times fewer CPUs, three times less RAM, and slightly more disk space. Much of the cost difference derives from the much higher interconnect bandwidth and reliability of a high-end server, but again, Google’s highly redundant architecture does not rely on either of these attributes.
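The ratios quoted in this comparison follow directly from the numbers in the paragraph. The snippet below only redoes that arithmetic; the dollars-per-CPU figure is an extra illustration added here, not a metric the article uses.

```python
# Figures quoted in the paragraph above.
rack = {"price_usd": 278_000, "cpus": 176, "ram_gb": 176, "disk_tb": 7}
server = {"price_usd": 758_000, "cpus": 8, "ram_gb": 64, "disk_tb": 8}

price_ratio = server["price_usd"] / rack["price_usd"]   # about 2.7, "about three times"
cpu_ratio = rack["cpus"] / server["cpus"]               # 22.0
ram_ratio = rack["ram_gb"] / server["ram_gb"]           # 2.75, "three times less RAM"

# Dollars per CPU as one crude stand-in for price/performance (illustrative only).
usd_per_cpu_rack = rack["price_usd"] / rack["cpus"]
usd_per_cpu_server = server["price_usd"] / server["cpus"]

print(f"price x{price_ratio:.1f}, CPUs x{cpu_ratio:.0f}, RAM x{ram_ratio:.2f}")
print(f"${usd_per_cpu_rack:,.0f} vs ${usd_per_cpu_server:,.0f} per CPU")
```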
Operating thousands of mid-range PCs instead of a few high-end multiprocessor servers incurs significant system administration and repair costs. However, for a relatively homogenous application like Google, where most servers run one of very few applications, these costs are manageable. Assuming tools to install and upgrade software on groups of machines are available, the time and cost to maintain 1,000 servers isn’t much more than the cost of maintaining 100 servers because all machines have identical configurations. Similarly, the cost of monitoring a cluster using a scalable application-monitoring system does not increase greatly with cluster size. Furthermore, we can keep repair costs reasonably low by batching repairs and ensuring that we can easily swap out components with the highest failure rates, such as disks and power supplies.
The power problem
Even without special, high-density packaging, power consumption and cooling issues can become challenging. A mid-range server with dual 1.4-GHz Pentium III processors draws about 90 W of DC power under load: roughly 55 W for the two CPUs, 10 W for a disk drive, and 25 W to power DRAM and the motherboard. With a typical efficiency of about 75 percent for an ATX power supply, this translates into 120 W of AC power per server, or roughly 10 kW per rack. A rack comfortably fits in 25 ft² of space, resulting in a power density of 400 W/ft². With higher-end processors, the power density of a rack can exceed 700 W/ft².
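The power-density estimate above can be reproduced step by step. The article does not say how many servers the 10 kW rack holds; the sketch assumes 80, the upper end of the 40-to-80 range given earlier, because that is the count that lands near 10 kW.

```python
DC_WATTS_PER_SERVER = 55 + 10 + 25   # CPUs + disk + DRAM/motherboard = 90 W
PSU_EFFICIENCY = 0.75                # typical ATX supply, per the text
SERVERS_PER_RACK = 80                # assumption: a fully populated rack
RACK_FOOTPRINT_FT2 = 25

ac_watts_per_server = DC_WATTS_PER_SERVER / PSU_EFFICIENCY   # 120 W at the wall
rack_watts = ac_watts_per_server * SERVERS_PER_RACK          # 9,600 W, "roughly 10 kW"
density = rack_watts / RACK_FOOTPRINT_FT2                    # 384 W per square foot

print(f"{ac_watts_per_server:.0f} W/server, {rack_watts / 1000:.1f} kW/rack, "
      f"{density:.0f} W/ft^2")
```

The unrounded result is 384 W per square foot; rounding the rack to 10 kW first gives the article's 400.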
Unfortunately, the typical power density for commercial data centers lies between 70 and 150 W/ft², much lower than that required for PC clusters. As a result, even low-tech PC clusters using relatively straightforward packaging need special cooling or additional space to bring down power density to that which is tolerable in typical data centers. Thus, packing even more servers into a rack could be of limited practical use for large-scale deployment as long as such racks reside in standard data centers. This situation leads to the question of whether it is possible to reduce the power usage per server.
Reduced-power servers are attractive for large-scale clusters, but you must keep some caveats in mind. First, reduced power is desirable, but, for our application, it must come without a corresponding performance penalty: What counts is watts per unit of performance, not watts alone. Second, the lower-power server must not be considerably more expensive, because the cost of depreciation typically outweighs the cost of power. The earlier-mentioned 10 kW rack consumes about 10 MW-h of power per month (including cooling overhead). Even at a generous 15 cents per kilowatt-hour (half for the actual power, half to amortize uninterruptible power supply [UPS] and power distribution equipment), power and cooling cost only $1,500 per month. Such a cost is small in comparison to the depreciation cost of $7,700 per month. Thus, low-power servers must not be more expensive than regular servers to have an overall cost advantage in our setup.
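A short calculation makes the power-versus-depreciation comparison concrete. It uses only the figures quoted above; the 30-day month used for the raw energy figure is an assumption, and the variable names are illustrative.

```python
RACK_KW = 10
ENERGY_MWH_PER_MONTH = 10          # includes cooling overhead, per the text
PRICE_USD_PER_KWH = 0.15
DEPRECIATION_USD_PER_MONTH = 7_700

# Energy drawn by the servers alone over an assumed 30-day month.
raw_mwh = RACK_KW * 24 * 30 / 1000

power_cost = ENERGY_MWH_PER_MONTH * 1_000 * PRICE_USD_PER_KWH
power_share = power_cost / (power_cost + DEPRECIATION_USD_PER_MONTH)

print(f"servers alone: {raw_mwh:.1f} MWh/month")
print(f"power and cooling: ${power_cost:,.0f}/month, "
      f"{power_share:.0%} of power plus depreciation")
```

Power and cooling come to roughly a sixth of the combined monthly cost, which is why a price premium on low-power hardware is hard to justify in this setup.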
Hardware-level application characteristics
Examining various architectural characteristics of our application helps illustrate which hardware platforms will provide the best price/performance for our query-serving system. We’ll concentrate on the characteristics of the index server, the component of our infrastructure whose price/performance most heavily impacts overall price/performance. The main activity in the index server consists of decoding compressed information in the inverted index and finding matches against a set of documents that could satisfy a query. Table 1 shows some basic instruction-level measurements of the index server program running on a 1-GHz dual-processor Pentium III system.
The application has a moderately high CPI, considering that the Pentium III is capable of issuing three instructions per cycle. We expect such behavior, considering that the application traverses dynamic data structures and that control flow is data dependent, creating a significant number of difficult-to-predict branches. In fact, the same workload running on the newer Pentium 4 processor exhibits nearly twice the CPI and approximately the same branch prediction performance, even though the Pentium 4 can issue more instructions concurrently and has superior branch prediction logic. In essence, there isn’t that much exploitable instruction-level parallelism (ILP) in the workload. Our measurements suggest that the level of aggressive out-of-order, speculative execution present in modern processors is already beyond the point of diminishing performance returns for such programs.
A more profitable way to exploit parallelism for applications such as the index server is to leverage the trivially parallelizable computation. Processing each query shares mostly read-only data with the rest of the system, and constitutes a work unit that requires little communication. We already take advantage of that at the cluster level by deploying large numbers of inexpensive nodes, rather than fewer high-end ones. Exploiting such abundant thread-level parallelism at the microarchitecture level appears equally promising. Both simultaneous multithreading (SMT) and chip multiprocessor (CMP) architectures target thread-level parallelism and should improve the performance of many of our servers. Some early experiments with a dual-context (SMT) Intel Xeon processor show more than a 30 percent performance improvement over a single-context setup. This speedup is at the upper bound of improvements reported by Intel for their SMT implementation.
We believe that the potential for CMP systems is even greater. CMP designs, such as Hydra [4] and Piranha [5], seem especially promising. In these designs, multiple (four to eight) simpler, in-order, short-pipeline cores replace a complex high-performance core. The penalties of in-order execution should be minor given how little ILP our application yields, and shorter pipelines would reduce or eliminate branch mispredict penalties. The available thread-level parallelism should allow near-linear speedup with the number of cores, and a shared L2 cache of reasonable size would speed up interprocessor communication.
Memory system
Table 1 also outlines the main memory system performance parameters. We observe good performance for the instruction cache and instruction translation look-aside buffer, a result of the relatively small inner-loop code size. Index data blocks have no temporal locality, due to the sheer size of the index data and the unpredictability in access patterns for the index’s data block. However, accesses within an index data block do benefit from spatial locality, which hardware prefetching (or possibly larger cache lines) can exploit. The net effect is good overall cache hit ratios, even for relatively modest cache sizes.
Memory bandwidth does not appear to be a bottleneck. We estimate the memory bus utilization of a Pentium-class processor system to be well under 20 percent. This is mainly due to the amount of computation required (on average) for every cache line of index data brought into the processor caches, and to the data-dependent nature of the data fetch stream. In many ways, the index server’s memory system behavior resembles the behavior reported for the Transaction Processing Performance Council’s benchmark D (TPC-D) [6]. For such workloads, a memory system with a relatively modest sized L2 cache, short L2 cache and memory latencies, and longer (perhaps 128 byte) cache lines is likely to be the most effective.
Large-scale multiprocessing
As mentioned earlier, our infrastructure consists of a massively large cluster of inexpensive desktop-class machines, as opposed to a smaller number of large-scale shared-memory machines. Large shared-memory machines are most useful when the computation-to-communication ratio is low; communication patterns or data partitioning are dynamic or hard to predict; or when total cost of ownership dwarfs hardware costs (due to management overhead and software licensing prices). In those situations they justify their high price tags.
At Google, none of these requirements apply, because we partition index data and computation to minimize communication and evenly balance the load across servers. We also produce all our software in-house, and minimize system management overhead through extensive automation and monitoring, which makes hardware costs a significant fraction of the total system operating expenses. Moreover, large-scale shared-memory machines still do not handle individual hardware component or software failures gracefully, with most fault types causing a full system crash. By deploying many small multiprocessors, we contain the effect of faults to smaller pieces of the system. Overall, a cluster solution fits the performance and availability requirements of our service at significantly lower costs.
At first sight, it might appear that there are few applications that share Google’s characteristics, because there are few services that require many thousands of servers and petabytes of storage. However, many applications share the essential traits that allow for a PC-based cluster architecture. As long as an application focuses on price/performance and can run on servers that have no private state (so servers can be replicated), it might benefit from using a similar architecture. Common examples include high-volume Web servers or application servers that are computationally intensive but essentially stateless. All of these applications have plenty of request-level parallelism, a characteristic exploitable by running individual requests on separate servers. In fact, larger Web sites already commonly use such architectures.
At Google’s scale, some limits of massive server parallelism do become apparent, such as the limited cooling capacity of commercial data centers and the less-than-optimal fit of current CPUs for throughput-oriented applications. Nevertheless, using inexpensive PCs to handle Google’s large-scale computations has drastically increased the amount of computation we can afford to spend per query, thus helping to improve the Internet search experience of tens of millions of users.
Acknowledgments
Over the years, many others have made contributions to Google’s hardware architecture that are at least as significant as ours. In particular, we acknowledge the work of Gerald Aigner, Ross Biro, Bogdan Cocosel, and Larry Page.
References
1. S. Brin and L. Page, “The Anatomy of a Large-Scale Hypertextual Web Search Engine,” Proc. Seventh World Wide Web Conf. (WWW7), International World Wide Web Conference Committee (IW3C2), 1998, pp. 107-117.
2. “TPC Benchmark C Full Disclosure Report for IBM eserver xSeries 440 using Microsoft SQL Server 2000 Enterprise Edition and Microsoft Windows .NET Datacenter Server 2003, TPC-C Version 5.0,” http://www.tpc.org/results/FDR/TPCC/ibm.x4408way.c5.fdr.02110801.pdf.
3. D. Marr et al., “Hyper-Threading Technology Architecture and Microarchitecture: A Hypertext History,” Intel Technology J., vol. 6, issue 1, Feb. 2002.
4. L. Hammond, B. Nayfeh, and K. Olukotun, “A Single-Chip Multiprocessor,” Computer, vol. 30, no. 9, Sept. 1997, pp. 79-85.
5. L.A. Barroso et al., “Piranha: A Scalable Architecture Based on Single-Chip Multiprocessing,” Proc. 27th ACM Int’l Symp. Computer Architecture, ACM Press, 2000, pp. 282-293.
6. L.A. Barroso, K. Gharachorloo, and E. Bugnion, “Memory System Characterization of Commercial Workloads,” Proc. 25th ACM Int’l Symp. Computer Architecture, ACM Press, 1998, pp. 3-14.
Luiz André Barroso is a member of the Systems Lab at Google, where he has focused on improving the efficiency of Google’s Web search and on Google’s hardware architecture. Barroso has a BS and an MS in electrical engineering from Pontifícia Universidade Católica, Brazil, and a PhD in computer engineering from the University of Southern California. He is a member of the ACM.
Jeffrey Dean is a distinguished engineer in the Systems Lab at Google and has worked on the crawling, indexing, and query serving systems, with a focus on scalability and improving relevance. Dean has a BS in computer science and economics from the University of Minnesota and a PhD in computer science from the University of Washington. He is a member of the ACM.
Urs Hölzle is a Google Fellow and in his previous role as vice president of engineering was responsible for managing the development and operation of the Google search engine during its first two years. Hölzle has a diploma from the Eidgenössische Technische Hochschule Zürich and a PhD from Stanford University, both in computer science. He is a member of IEEE and the ACM.
Direct questions and comments about this article to Urs Hölzle, 2400 Bayshore Parkway, Mountain View, CA 94043; urs@google.com.
yftty replied on 2005-06-30 17:54:25
http://www-128.ibm.com/developerworks/opensource/library/os-openafs/
Next-generation NFS-like file system might be the answer to data headaches
Level: Introductory
Frank Pohlmann (frank@linuxuser.co.uk)
U.K. Technical Editor, Linuxuser and Developer
17 May 2005
Distributed file systems haven't had much press lately because it's mostly corporate and educational networks that use them, adding up to only thousands of users. Conceptually, it isn't always clear how such systems fit into the open source file system puzzle. The Open Andrew File System (OpenAFS) is a mature alternative to the Network File System (NFS), which scales only to large numbers of users and doesn't relieve management pain.
Users understand the concept of a file system in two ways. The first is a way to organize files, the directories that contain them, and the partitions holding a directory structure. And second, a file system is the way in which files are organized and mapped to the raw metal. Naturally, further layers exist in between, like the virtual file system (VFS) layer and the actual memory management routines, but regarding managing structured information accessible to users, it makes sense for power users to peer into file system internals and get just a sulfurous whiff of the kernel's infernal recesses.
The metal might consist of RAM or hard disks, but in either case, file system data structures organize the sectors and bytes formatted by the hardware manufacturer. Although rather crude, users can sustain this conceptual split fairly comfortably in their working lives. Tools are available that increase, for example, the speed with which users can access files greater than a certain size. Tools are also available to help reorganize directories and files, but these tools keep us safe from bits, bytes, and sectors.
File system metaconcepts
A classic case of this conceptual distinction is the way that FreeBSD -- harking back to the BSD UNIX world -- uses UNIX File System V2 (UFS2) to organize data on the disk and the Fast File System (FFS) to organize files into directories and optimize directory access. Linux systems work a bit differently because Linux permits much more than just one or two file systems natively. Thus, the VFS layer makes it possible for Linux users to add new file system support without worrying too much about the way in which Linux manages memory.
When I talk about further distinctions like static and journal file systems, I'm emphasizing the consistency and, to some extent, security of file system contents. Again, in terms that the BSD UNIX world used to view things, static and journal file systems relate to the way in which the UNIX File System (UFS) organizes and secures files. Although Linux file systems have encompassed journal file systems since the Journal File System (JFS), the next-generation file system (XFS), and the early ReiserFS were made available, another area in which neither technical journalism nor corporate publicity sheds much light is distributed file systems.
What we learned from NFS
This state of affairs is related to the fact that today, it would be judged imprudent to make networkwide file system layers available via TCP or User Datagram Protocol (UDP) to a large number of users. Horror scenarios surrounding pre-V3 NFS put off many administrators managing networks with less than a few dozen users. In addition, the appearance of multiple-processor architectures supported by extremely fast motherboard architectures seems to make distributed file system issues a lesser priority. Speed seems guaranteed by hardware, rather than by intelligently implemented distributed systems. Given that distributed file systems tend to rely on underlying file system implementations -- for example, the existing ext2, ext3, and ReiserFS file system drivers -- distributed file systems appear to be confined to the realms of large university networks and the occasional scientific or corporate network.
So, are distributed file systems a third layer on top of the two we have mentioned? One large issue in modern networking is getting heterogeneous networks to cooperate. (Samba is a prominent example.) But you need to understand that today, we have three major players in the file system puzzle: the group of Microsoft Windows file systems (FAT16, FAT32, and NTFS file system); Apple Mac OS X (HFS+); and native Linux journal file systems (mostly ReiserFS and ext3). Samba helps get Windows and Linux file systems to cooperate, but it is not meant to make access to files on all major file systems uniformly quick and easy to administer.
One could cite NFS V4 as an attempt to resolve this problem, but given that Request for Comments (RFC) 3530 dealing with NFS V4 is only two years old and NFS4 for kernel V2.6 is fairly new, I'd hesitate to recommend it for production servers. Fedora cores 2 and 3 provide NFS4 patches and NFS4 utilities that demonstrate the rather impressive progress developers have made since NFS forced suffering network administrators to open more ports and configure separate clients for each namespace exported to nervous users. RFC 3530 addresses most security concerns. Still, NFS directories have to be mounted individually. You can make things secure using unified sign-ons and Kerberos, but it all needs work.
OpenAFS rationale
OpenAFS tries to take the pain out of installing and administering software that makes differing file systems cooperate. OpenAFS also works to make differing file systems cooperate efficiently. Although the original metaphor for UNIX and its fascinating successor, Plan 9, was the file, commercial realities dictated that rather than rearchitect modern networked file systems completely, another distributed file system layer had to be added.
Carnegie Mellon University programmers developed AFS in 1983. Soon after, the university set up a company called Transarc to sell services based on AFS. IBM acquired Transarc in 1998 and made AFS available as an open source product under the name OpenAFS. The saga does not end there, however, because OpenAFS spawned other distributed file systems like Coda and Arla, which I cover later. Clients exist for all major operating systems, and documentation is plentiful, if somewhat dated. Gentoo.org made a special effort for OpenAFS to be accessible to Linux users, even though other organizations still seem to refer to NFS when they need distributed file systems.
OpenAFS architecture
OpenAFS is organized around a group of file servers, known as a cell. Each server's identity is usually hidden under the file system itself. Users logging in from an AFS client would not be able to tell which server they were working on because from the users' point of view, they would work on a single system with recognizable UNIX file system semantics. File system content is usually replicated across the cell so that failure of one hard disk would not impair working at the OpenAFS client. OpenAFS requires large client-caching facilities of up to 1 GB to enable accessing frequently used files. It also works as a fully secure Kerberos-based system that uses access control lists (ACLs) to make fine-grained access possible that is not based on the usual Linux and UNIX security models.
Except for the cache manager, which happens to be part of OpenAFS -- curiously only running with ext2 as an underlying file system -- the basic superficial structure of OpenAFS resembles modern NFS implementations. The basic architectures do not look alike at all, though, and you must view any parallels with a large dose of skepticism. For those of us who still prefer to use NFS, but would like to take advantage of OpenAFS facilities, it is possible to use a so-called NFS/AFS translator. As long as an OpenAFS client machine is configured as an NFS server machine, you should be able to enjoy the advantages of both file systems.
How OpenAFS manages its world
NFS is location-dependent, mapping local directories to remote file system locations. OpenAFS hides file locations from users. Because all source files are likely to be saved in read-write copies at various replicated file server locations, you must keep the replicated copies in sync. You do so through a technology known as Ubik, a play on the word ubiquitous in an Eastern European spelling. Ubik processes keep the files, directories, and volumes on the AFS file system in sync, but usually systems with more than three file server processes running benefit the most. A system administrator can group several AFS cells -- the old AFS abbreviation has been retained within OpenAFS file system semantics -- into an AFS site. The administrator would decide on the number of AFS cells and the extent to which the cells can make storage and files available to other AFS cells within the site.
Partitions, volumes, and directories
AFS administrators divide cells into so-called volumes. Although volumes can be co-extensive with hard-disk partitions, most administrators would not fill a complete partition with a single volume. AFS volumes are actually managed by a separate UNIX-type process called the Volume Manager. You can mount a volume in a manner familiar from a UNIX file system directory. However, you can move an AFS volume from file server to file server -- again, a UNIX-type process -- but a UNIX directory cannot be physically moved from partition to partition. AFS automatically tracks the location of volumes and directories via the Volume Location Manager and keeps track of replicated volumes and files. Therefore, the user never needs to worry whenever a file server ceases operation unexpectedly because AFS would just switch the user to a replicated volume on a different file server machine without the user likely noticing.
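The idea of tracking volume locations and switching a user to a replica can be illustrated with a toy lookup table. This is a conceptual sketch only: `VolumeLocationDB`, its methods, and the server names are invented for the example and are not the OpenAFS API or its data structures.

```python
class VolumeLocationDB:
    """Toy stand-in for a volume location service: volume name -> replica servers."""

    def __init__(self):
        self._replicas = {}

    def register(self, volume, servers):
        """Record which file servers hold a copy of the volume."""
        self._replicas[volume] = list(servers)

    def move(self, volume, source, target):
        """Relocate one copy of a volume; clients keep using the same volume name."""
        servers = self._replicas[volume]
        servers[servers.index(source)] = target

    def locate(self, volume, is_up):
        """Return the first reachable server holding the volume."""
        for server in self._replicas[volume]:
            if is_up(server):
                return server
        raise LookupError(f"no reachable replica for volume {volume!r}")


vldb = VolumeLocationDB()
vldb.register("user.alice", ["fs1", "fs2"])
down = {"fs1"}                                   # pretend fs1 has just failed
server = vldb.locate("user.alice", lambda s: s not in down)
print(server)                                    # the client is pointed at fs2
```

Because clients address the volume by name, moving it or losing one server changes only the table, not the path the user sees.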
Users never work on files located on AFS servers. They work on files that have been fetched from file servers by the client-side cache managers. The Cache Manager is a rather interesting beast that lives in the client's operating system kernel. In the case of Linux, a patch would be added to the kernel. (You can run the Cache Manager on any kernel from 2.4 onward.)
Cache Manager
The Cache Manager can respond to requests from a local application to fetch a file from across the AFS file system. Of course, if the file is a source file you change often, it might not be ideal that the file is likely to exist in several replicated versions. Because users are likely to change an often-requested source file frequently, you have two sets of problems: First, the file is likely to be kept in the client cache, as well as on several replicated volumes on several file server machines; and second, the Cache Manager has to update all volumes. The file server process sends the file to the client cache with a callback attached to it so that the system can deal with any changes happening somewhere else. If a user adds changes to a replicated file cached somewhere else, the original file server will activate the callback and remind the original cached version that it needs to be updated.
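The callback mechanism described above can be modeled in a few lines. The sketch is a simplification under stated assumptions: a single in-memory `FileServer`, whole-file caching, and invented class and method names. It is not how OpenAFS implements callbacks internally.

```python
class FileServer:
    """Holds the authoritative copy and remembers which clients cached each file."""

    def __init__(self):
        self.files = {}
        self.callbacks = {}   # path -> set of clients promised a notification

    def fetch(self, path, client):
        # Handing out the file registers a callback for this client.
        self.callbacks.setdefault(path, set()).add(client)
        return self.files[path]

    def store(self, path, data, writer):
        self.files[path] = data
        # Break the callback for every other client that cached the old version.
        for client in self.callbacks.pop(path, set()):
            if client is not writer:
                client.break_callback(path)
        self.callbacks[path] = {writer}


class CacheManager:
    """Client-side cache that trusts its copy until the server breaks the callback."""

    def __init__(self, server):
        self.server = server
        self.cache = {}

    def read(self, path):
        if path not in self.cache:
            self.cache[path] = self.server.fetch(path, self)
        return self.cache[path]

    def write(self, path, data):
        self.cache[path] = data
        self.server.store(path, data, self)

    def break_callback(self, path):
        self.cache.pop(path, None)   # next read goes back to the server


server = FileServer()
server.files["/afs/cell/notes.txt"] = "v1"
alice, bob = CacheManager(server), CacheManager(server)
alice.read("/afs/cell/notes.txt")
bob.write("/afs/cell/notes.txt", "v2")
print(alice.read("/afs/cell/notes.txt"))   # refetched after the callback broke
```

The design choice worth noting is that clients never poll: a cached copy is assumed valid until the server says otherwise, which keeps read traffic low for files that rarely change.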
Distributed version control systems face this classic problem, but with an important difference: Distributed version control systems work perfectly well when disconnected, while AFS cannot have part of its file system cut off. The separated AFS section would not be able to reconnect with the original file system. File server processes that fail have to resynchronize with the still-running AFS file servers, but cannot add new changes that might have been preserved locally after they were cut off.
AFS descendants
AFS has provided an obvious point of departure for several attempts at new file systems. Two such systems incorporate lessons developers learned from the original distributed file system architecture: Coda and the Swedish open source volunteer effort, Arla.
The Coda file system was the first attempt at improving the original AFS. Starting in 1987 at Carnegie Mellon University, developers meant for Coda to be a conscious improvement on AFS, which had reached V2.0 by that time. In the late 1980s and early '90s, the Coda file system premiered a different cache manager: Venus. Although the basic feature set of Coda resembles that of AFS, Venus enables continued operation for the Coda-enabled client even if the client has been disconnected from the distributed file system. Venus has exactly the same function as the AFS Cache Manager, which takes its file system jobs from the VFS layer inside the kernel.
Connection breakdowns between Coda servers and the Venus cache manager are not always detrimental to network function: A laptop client must be able to work away from the central servers. Thus, Venus stores all updates in the client modification log. When the cache manager reconnects to the central servers, the system reintegrates the client modification log, making all file system updates available to the client.
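Disconnected operation with a client modification log can be sketched as follows. This is a toy model: the `DisconnectedClient` class is invented for illustration, a plain dictionary stands in for the servers, and conflict detection during reintegration (which a real system has to handle) is left out entirely.

```python
class DisconnectedClient:
    """Toy cache manager that logs writes while offline and replays them later."""

    def __init__(self, server_store):
        self.server = server_store      # a dict standing in for the file servers
        self.cache = {}
        self.connected = True
        self.modification_log = []      # (path, data) entries made while offline

    def write(self, path, data):
        self.cache[path] = data
        if self.connected:
            self.server[path] = data
        else:
            self.modification_log.append((path, data))

    def disconnect(self):
        self.connected = False

    def reconnect(self):
        """Reintegrate: replay logged updates in order, then clear the log."""
        self.connected = True
        for path, data in self.modification_log:
            self.server[path] = data
        self.modification_log.clear()


servers = {}
laptop = DisconnectedClient(servers)
laptop.disconnect()
laptop.write("/coda/report.txt", "draft 1")
laptop.write("/coda/report.txt", "draft 2")
laptop.reconnect()
print(servers["/coda/report.txt"])
```

Replaying the log in order means the servers end up with the last version written on the laptop, while the local cache stayed usable the whole time.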
Disconnected operation can create other problems, but the Venus cache manager illustrates that distributed file systems can be extended to encompass much more than complex networks that are always running in a connected fashion.
Programmers have been developing Arla, a Swedish project that provides a GPLed implementation of OpenAFS, since 1993, even though most of the development and ports have taken place since 1997. Arla imitates OpenAFS fairly well, except that the XFS file system must function on all operating systems that Arla runs on. Arla has reached V0.39 and, just like OpenAFS, runs on all BSD flavors, a good number of Linux kernels since kernel V2.0x, and Sun Solaris. Arla does partly implement a feature for AFS that was not originally in the AFS code: disconnected operation. Mileage may vary, however, and developers have not completed testing.
Other AFS-type file systems are available, like the GPLed InterMezzo, but they do not replicate AFS command-line semantics or its architecture. The world of open source distributed file systems is very much alive, and other distributed file systems have found applications in the mobile computing world.
Resources
* Check out OpenAFS for sources, binaries, and documentation.
* NFS has progressed, and you can find the RFC and other documentation on the NFS Version 4 Web site.
* Find information about the original Andrew File System, although many commands are identical to the OpenAFS version.
* Carnegie Mellon University still maintains the Coda file system.
* Find Coda file system documentation, even though this version is somewhat dated.
* Arla provides an entry point. Documentation tends to be between terse and nonexistent.
* A fairly popular attempt at writing a new distributed file system is the InterMezzo distributed file system.
* Gentoo offers downloads, documentation, and news about this compile-it-from-scratch version of Linux.
About the author
Frank Pohlmann dabbled in the history of Middle Eastern religions before various funding committees decided that research in the history of religious polemics was quite irrelevant to the modern world. He has focused on his hobby -- free software -- ever since. He admits to being the technical editor of the U.K.-based Linuxuser and Developer and has had an interest in scripts and character sets since the days when he was trying to learn Old and Middle Persian.
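Randome replied on 2005-07-02 22:21:34
I am developing a Linux-based GFS prototype from Google's published material. It is implemented in C++, has more than 20,000 lines of source code, and basically runs. However, I have not found much related material, nor detailed material on similar file systems. If anyone has some, I would appreciate pointers. Anyone interested is welcome to co-develop; it is non-commercial.
Randome replied on 2005-07-02 22:35:03
I am very interested in this project and hope to join. I work on the Master part of GFS, i.e. the metadata server. Would the moderator please tell me the specific steps? I also have some questions for anyone interested to discuss:
1. The Google File System does bring some innovations, but it mainly serves Google's own applications, so its design was heavily optimized for them. Taking it as a general-purpose file system is not necessarily a win (for example, its user interface design and the way it splits files into chunks).
2. GFS uses a single metadata server, which has drawn a lot of criticism. The first version of PVFS also used a single metadata server, and the second version switched to multiple distributed metadata servers, but that brought many problems of its own. What do you all think?
3. Cluster (parallel) file systems currently mostly appear as middleware. As a next step, could a standard be defined and integrated into the OS?
yftty replied on 2005-07-02 22:53:30
Quote: originally posted by "Randome":
I am developing a Linux-based GFS prototype from Google's published material. It is implemented in C++, has more than 20,000 lines of source code, and basically runs. However, I have not found much related material, nor detailed material on similar file systems. If anyone has some, I would appreciate pointers. Anyone interested is welcome to co-develop; non-comm..........
There are no docs about GoogleFS except http://www.cs.rochester.edu/sosp2003/papers/p125-ghemawat.pdf . And GoogleFS can be considered as coming from OpenAFS, Coda, Lustre, and V9FS.
Randome replied on 2005-07-02 22:58:34
I already have that paper, thanks a lot :D
So the original poster is online. Can we chat? My QQ: 382502660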
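yftty replied on 2005-07-02 23:09:20
> I am very interested in this project and hope to join. I work on the Master part of GFS, i.e. the metadata
I have sent you a private message on the site; please take a look.
> server. Would the moderator please tell me the specific steps? I also have some questions for anyone interested to discuss:
> 1. The Google File System does bring some innovations, but it mainly serves Google's own
> applications, so its design was heavily optimized for them.
Yes. It can be seen as a data management library for one specific application. It applies mature industry techniques to reach the best performance/cost trade-off.
> Taking it as a general-purpose file system is not necessarily a win (for example, its user interface design and the way it splits files into chunks).
It cannot serve as a general-purpose cluster file system; it does not support the POSIX standard.
> 2. GFS uses a single metadata server, which has drawn a lot of criticism. The first version of PVFS also used a single
Because it is a potential system bottleneck. From an engineering point of view it is still a reasonable choice, and the MDS load can be balanced by controlling the lookup granularity of the MDS.
> metadata server, and the second version switched to multiple distributed metadata servers, but that brought many
> problems of its own. What do you all think?
The problem of synchronizing data between the MDSs.
> 3. Cluster (parallel) file systems currently mostly appear as middleware. As a next step, could a standard be defined and
> integrated into the OS?
For that, look at ReiserFS v4 and at Lustre's efforts to get into the mainline kernel ;)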
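beyondsky replied on 2005-07-03 17:17:02
I am not a developer,
but I am quite interested in this. I have worked with GFS, OGFS, PVFS and the like,
and my job is also HA related. What could I do to help?~
yftty replied on 2005-07-05 12:46:35
Quote: originally posted by "beyondsky":
I am not a developer,
but I am quite interested in this. I have worked with GFS, OGFS, PVFS and the like,
and my job is also HA related. What could I do to help?~
Would you be interested in collecting requirements? For example: as a management solution for a storage system, how to make it easy to use and manage, or how to need no human management at all; and how the file systems you mentioned differ in user experience ;)
For example, this requirements document, <Tri-Labs/DOD SGS File System RFP>: http://www.lustre.org/docs/SGSRFP.pdf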
yftty replied on 2005-07-07 09:50:46
Here is my suggestion
A client connects to the cluster through an Access Point (AP)
File Data Servers (each one called a node)
- files/copies are identified by a unique ID and DELTA (version of the copy)
- a node stores only whole files (no idea how to implement striping yet)
- don't bother with any directory structure
Database Servers (DBS)
- implement directories and the directory tree, hold the file ID and
DELTA information
Metadata servers (MDS)
- gather each node's traffic/free space/resource load
- hold each node's file list (node, file ID, file DELTA, popularity), so we
know where each copy is and its version
- all read/write requests go through them
Unit - an MDS with all nodes that belong to it
Control servers (CS)
- move/replicate/delete copies across the cluster, implement
traffic/free space/load balancing across the cluster
- take care of the number of copies present and their DELTA. If a file is
very popular, make more copies and spread them across nodes with less
traffic. If it is not very popular, remove some copies from
nodes with heavy traffic
- make sure that there are N copies on different units (we do not want
to have several copies in one unit, because when the unit's MDS crashes
we may end up in a situation where no copy is available
in the cluster)
- hold the list of MDSs and their load
- decide which MDS a node will use (a node may change MDS on the fly?)
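The control-server policy for copies could be sketched roughly as below. This is only an illustration under assumptions that are not in the post: the names (Node, target_copies, plan), the thresholds (min_copies, max_copies, hits_per_copy) and the single popularity counter are all made up for the example.

```python
from dataclasses import dataclass

@dataclass
class Node:
    name: str
    unit: str       # the unit (MDS plus its nodes) this node belongs to
    traffic: float  # current traffic load, 0.0 (idle) to 1.0 (saturated)

def target_copies(popularity, min_copies=2, max_copies=6, hits_per_copy=100):
    """Hypothetical rule: one extra copy per hits_per_copy hits, clamped."""
    wanted = min_copies + popularity // hits_per_copy
    return max(min_copies, min(max_copies, wanted))

def plan(copies, popularity, candidates):
    """Decide one CS action for a file: ("add", node), ("remove", node) or ("keep", None).

    copies: nodes that currently hold a copy; candidates: nodes that could take one.
    """
    want = target_copies(popularity)
    if len(copies) < want and candidates:
        used_units = {c.unit for c in copies}
        # Prefer a unit that holds no copy yet, then the node with the least traffic.
        best = min(candidates, key=lambda n: (n.unit in used_units, n.traffic))
        return ("add", best.name)
    if len(copies) > want:
        # Drop the copy that sits on the most heavily loaded node.
        worst = max(copies, key=lambda c: c.traffic)
        return ("remove", worst.name)
    return ("keep", None)
```

Note that the removal branch here does not re-check the "copies on different units" rule; a real CS would have to.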
On client write:
1) AP asks DBS for the file ID and DELTA
2) AP puts a write request to MDS
3) MDS looks up where copies of the file (ID and DELTA) are and returns
the address of a node with appropriate load (and marks that copy
with a write flag)
4) client (or AP - don't know yet which one is appropriate) connects
to the node and performs the write
5) after the write the node informs MDS of the successful write
6) MDS removes the write flag from the copy, MDS updates the DELTA, MDS
informs DBS of the DELTA change
7) inform CS of the event (CS initiates the update of copies to the new DELTA)
On client read:
1) AP asks DBS for the file ID and DELTA
2) AP puts a read request to MDS
3) MDS looks up where copies of the file (ID and DELTA) are and returns the
address of a node with appropriate load
4) client (or AP) connects to the node and performs the read
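The MDS side of the write and read steps above could be sketched like this. It is a toy in-memory model, not a design decision: the class and method names are invented, "appropriate load" is taken to mean "least loaded", and the choice to hide a copy from readers while its write flag is set is an assumption the post does not state.

```python
class MDS:
    """Toy metadata server: tracks which node holds which copy (file ID + DELTA)."""

    def __init__(self):
        # file_id -> {node: {"delta": int, "writing": bool, "load": float}}
        self.copies = {}

    def register(self, file_id, node, delta, load):
        self.copies.setdefault(file_id, {})[node] = {
            "delta": delta, "writing": False, "load": load}

    def _usable(self, file_id, delta):
        # Copies at the requested DELTA that are not being written right now.
        return {n: c for n, c in self.copies.get(file_id, {}).items()
                if c["delta"] == delta and not c["writing"]}

    def begin_write(self, file_id, delta):
        """Write step 3: pick the least loaded copy and set its write flag."""
        usable = self._usable(file_id, delta)
        if not usable:
            return None
        node = min(usable, key=lambda n: usable[n]["load"])
        usable[node]["writing"] = True
        return node

    def end_write(self, file_id, node):
        """Write steps 5-6: clear the flag and bump the DELTA; the new DELTA
        is what would be reported to DBS and CS."""
        copy = self.copies[file_id][node]
        copy["writing"] = False
        copy["delta"] += 1
        return copy["delta"]

    def pick_for_read(self, file_id, delta):
        """Read step 3: least loaded node holding the requested DELTA."""
        usable = self._usable(file_id, delta)
        return min(usable, key=lambda n: usable[n]["load"]) if usable else None
```

After end_write, the other copies still carry the old DELTA until the CS runs the "Copies update" procedure, so a reader asking for the new DELTA is sent only to the freshly written node.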
On client read/write dir:
1) AP informs DBS
2) if file removal: DBS puts a removal request to CS
Copies update:
1) CS puts a write request to MDS
2) MDS looks up where copies of the file (ID and DELTA) are and returns
the address of a node with appropriate load (and marks that copy
with a write flag)
3) CS tells the node to update the file (gives the file ID and the
address of the node from which to update)
Each node is connected to one MDS (on MDS crash go to another MDS).
On new node/MDS down:
1) the node sends a broadcast
2) CS tells it which MDS to use
MDSs are interconnected (maybe some sort of tree, no idea yet)
DBS - no idea for the connection yet
CS - some sort of communication, so we can avoid:
* several CSs initiating the same action on one node
MDS checks its child nodes' availability
On node down, mark its file list as inaccessible. After N minutes of
unavailability remove the file list
On new node/node up, the node sends its file list to the MDS
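The node-availability bookkeeping in the last three lines could look roughly like this on the MDS. It is a sketch only: NodeTable, expire_after and the explicit "now" timestamps are invented for the example, and how the MDS actually detects that a node is down is left out.

```python
class NodeTable:
    """Per-MDS table of child nodes and their file lists (illustrative only)."""

    def __init__(self, expire_after=600.0):
        self.expire_after = expire_after  # the "N minutes", in seconds
        self.nodes = {}  # node -> {"files": set, "down_since": float or None}

    def node_up(self, node, file_list):
        # A new or returning node sends its full file list.
        self.nodes[node] = {"files": set(file_list), "down_since": None}

    def node_down(self, node, now):
        # Keep the file list but remember when the node disappeared.
        entry = self.nodes.get(node)
        if entry is not None and entry["down_since"] is None:
            entry["down_since"] = now

    def accessible(self, node):
        # Files of a down (or unknown) node are treated as inaccessible.
        entry = self.nodes.get(node)
        if entry is None or entry["down_since"] is not None:
            return set()
        return entry["files"]

    def sweep(self, now):
        # Forget file lists of nodes that have been down for too long.
        expired = [n for n, e in self.nodes.items()
                   if e["down_since"] is not None
                   and now - e["down_since"] >= self.expire_after]
        for n in expired:
            del self.nodes[n]
        return expired
```

Keeping the list for a while means a node that only rebooted does not trigger a wave of re-replication by the CS.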
Well that is all from me, thank you for reading all this. I hope you
will find something useful in my idea.
Best regards
Momchil
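Randome replied on 2005-07-07 10:09:48
Last time we discussed that distributing the namespace across different MDSs can solve the single-point bottleneck, but it breaks the unity of the namespace and creates consistency problems after updates. Every client should still see one consistent namespace. I think that however it is distributed there should still be a unified namespace, even if it does not live on a single machine. What do you think?
beyondsky replied on 2005-07-07 14:39:53
I suggest creating a QQ group for communication!
okay?~
yftty replied on 2005-07-07 14:59:05
Quote: originally posted by "beyondsky":
I suggest creating a QQ group for communication!
okay?~
I think a forum is better, since it can be divided into many channels such as client, fds, mds, namespace, datapath, log, recovery, networking, migration/replication, utilities, etc.,
which together would fill a book of about four hundred pages.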
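yftty replied on 2005-07-08 22:16:59
Quote: originally posted by "Randome":
Last time we discussed that distributing the namespace across different MDSs can solve the single-point bottleneck, but it breaks the unity of the namespace and creates consistency problems after updates. Every client should still see one consistent namespace. I think that however it is distributed there should still be a unified..........
With the global namespace, all clients always see the same single directory tree, even though the subdirectories under it come from many servers. AFS/Coda and Autofs show two ways of building a global namespace. In a modern distributed filesystem, client-side caches and servers built on local filesystems need more attention with respect to how they relate to the global namespace.
Randome replied on 2005-07-10 21:49:46
Besides IBM's GPFS, CFS's Lustre, Red Hat's Global FS and Google's GoogleFS, what other commercial cluster file systems are there today? It seems no mature system dominates this field yet, and everyone's development approach and overall structure feel much the same to me. For developing such a system, I think the architecture matters most, since it determines the system's performance and room to grow. The currently popular architectures are mostly three-part: the client, the metadata server, and the data server (iod), where the MDS may be single or distributed. I wonder whether there are other architectures.
yftty replied on 2005-07-10 23:30:47
Quote: originally posted by "Randome":
Besides IBM's GPFS, CFS's Lustre, Red Hat's Global FS and Google's GoogleFS, what other commercial cluster file systems are there today? It seems no mature system dominates this field yet, and everyone's development approach and overall structure feel much the same to me. For developing such a system, I thi..........
Please read the post titled "What is a Cluster Filesystem?" on page 9. And on page 10: "The competition in the cluster area, it seems, is just beginning." ;)
yftty replied on 2005-07-14 09:48:15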
OpenBSD-NFSv4 -- openbsd-nfsv4
http://mailman.theapt.org/listinfo/openbsd-nfsv4
Ocfs2-devel --
http://oss.oracle.com/mailman/listinfo/ocfs2-devel
PVFS2-developers -- A public mailing list for Parallel Virtual File System v2 developers
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers
V9fs-developer --
https://lists.sourceforge.net/lists/listinfo/v9fs-developer
Lustre
https://lists.clusterfs.com/mailman/listinfo
OpenAFS
https://lists.openafs.org/mailman/listinfo
ReiserFS
http://www.namesys.com/mailinglist.html
Linux
http://vger.kernel.org/vger-lists.html
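CU_runner replied on 2005-07-15 16:33:57
I have been following yftty's work in my spare time. Although I still know very little about file systems, my interest and my confidence in the field have made me very eager to learn.
As yf said, GFS is a successful Linux cluster FS today. Learning about GFS should help everyone, especially those of us who are just starting with file systems. The GFS white paper published by Google is no doubt a very good reference for us. Unfortunately I have not found a Chinese translation online. Could we translate it together to give more beginners a reference? If people are willing, we can split up the work, collect the pieces into one document, and ask yf to review it, as a small contribution to file system work in China. Of course, if anyone has a Chinese version of the GFS white paper, or has already translated it, please share it with us :)
I have just translated the abstract, as follows:
Abstract
We have designed and implemented Google's scalable distributed file system, which is suited to large, distributed, data-intensive applications. Although it runs on inexpensive commodity hardware, the file system provides good fault tolerance and delivers good performance under heavy access load.
Although our design shares the goals of earlier distributed file systems, in view of our application workloads and technical environment, both current and anticipated, our design differs markedly from earlier file systems. We have therefore re-examined traditional design ideas and begun to explore a completely new design approach.
At present the file system meets our storage needs well and is widely deployed within Google as the storage platform for generating and processing the large amounts of data needed by our services and by research and development. The largest cluster to date provides storage on the order of hundreds of terabytes across the disks of thousands of machines, and it can serve requests from several hundred clients concurrently.
In this paper we present the file system interface extensions that support distributed applications, discuss many aspects of the file system's design, and report results from performance benchmarks and from real-world use.
I have thrown out a brick to attract jade; better translations are very welcome.
chifeng replied on 2005-07-16 16:06:25
I propose that yftty start this GFS translation project, and then we study file systems together.
We could use cosoft.org.cn as the platform, haha.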
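yftty replied on 2005-07-16 20:58:01
http://bbs.chinaunix.net/forum/viewtopic.php?t=578270
潇湘夜雨 wrote:
2005, this year, marks the 10th anniversary of the commercialization of the Internet in China.
......
I can responsibly tell my friends at the .com companies: Google and Yahoo need massive data storage just as you do. At Google, storing 1 GB of data costs about 1 US dollar. CXOs of Sina, Sohu, NetEase, Baidu and Tencent, tell us yourselves: what does 1 GB of data storage cost you?
I can responsibly tell you: more than 7 US dollars.
Your cost is more than seven times theirs. How can you have the face to compare yourselves with Google?
If technology does not come first, then tell me where it ranks?
If you are so capable, bring your own costs down! Seven years ago you were a grassroots company; seven years later you have plenty of dollars in hand, so bring your costs down!
..........
yftty replied on 2005-07-17 03:16:26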
http://storage.ittoolbox.com/
http://www.snia.org/home
http://www.osta.org/
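CU_runner replied on 2005-07-17 17:38:22
Although CXOs at operations-driven companies find it hard to accept that technology comes first, in fields with high technical content, scarce practitioners and high returns, such as file systems, I think it is only a matter of time before CXOs come begging for experienced, enthusiastic professionals.
I hope yftty can give us more pointers; perhaps we too can contribute something to file system work in China.
soway replied on 2005-07-18 17:11:27
I have recently started following GFS (Red Hat's GFS) too, because my environment uses cluster computing.
Previously we used an NFS-centered file server and exported the data over NFS, which keeps files consistent. But a problem appeared, and the bottleneck was not network speed. Reading large numbers of small files is very slow, because every computation has to read its files from the NFS server, over the network and through the NFS service, so speed drops to about one tenth of local reads.
I suggest starting with work that lets people get a real system running, for example a step-by-step guide to setting up GFS.
File systems will certainly be an important part of future computing environments and storage, because backup, speed, security and many other issues have to be considered, and this is currently the slowest part of a computer.
yftty replied on 2005-07-18 19:22:35
Quote: originally posted by "soway":
I have recently started following GFS (Red Hat's GFS) too, because my environment uses cluster computing.
...... Reading large numbers of small files is very slow, because every computation has to read its files from the NFS server, over the network and through the NFS service, so speed drops to about one tenth of local reads. ..........
NFS only solves file sharing; it does not solve concurrent access to files. Also, as you say, it is a remote I/O model, with no local cache and no support for disconnected operation.
Quote: originally posted by "soway":
...... I suggest starting with work that lets people get a real system running, for example a step-by-step guide to setting up GFS. ..........
How about you write it? Last year a colleague and I ported GFS to Darwin and got it running there, so I know it reasonably well and can help as much as possible :)
Quote: originally posted by "soway":
...... File systems will certainly be an important part of future computing environments and storage, because backup, speed, security and many other issues have to be considered, and this is currently the slowest part of a computer. ..........
The trend now is toward centralized storage. For example, there could be one very large data pool, perhaps per province, holding the data of everyone's PCs, mobile phones, handhelds and so on, and the core of it all is the FS that manages that data pool ;)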
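liuzhentaosoft replied on 2005-07-18 19:56:54
Introduction to PVFS
PVFS description
http://parlweb.parl.clemson.edu/pvfs/desc.html
As PC clusters become more popular as a parallel platform, demand for software on this platform is growing. In today's cluster
and parallel computing environments we find many effective software components, such as reliable operating systems, local storage systems and
message-passing systems. Parallel I/O, however, has been a limiting factor for software on clusters.
The Parallel Virtual File System (PVFS) project provides a high-performance, scalable parallel file system for Linux clusters. PVFS is
open source and released under the GNU General Public License. It requires no special hardware and no kernel modifications. PVFS provides
four important capabilities:
* a consistent file name space across the cluster
* support for existing system access methods
* data distributed across the disks of different cluster nodes
* high-performance data access for applications
For PVFS to be easy to install and use, it must provide a name space that is the same across the cluster, and it must be accessible in
the ways we are used to. The PVFS file system must be mounted at the same directory on all nodes at the same time, so that all nodes see and
access all files on the PVFS file system through the same configuration. Once mounted, PVFS files and directories can be operated on with
familiar tools such as ls, cp and rm.
To give many clients high-performance access to the file system's data, PVFS spreads data across many cluster nodes,
and applications can fetch data over the network along different paths. This removes the single I/O path bottleneck and increases the total
potential bandwidth available to many clients, i.e. the aggregate bandwidth.
The traditional system call mechanism gives applications convenient access to data files on different file systems, but it does so
by going through the kernel. With PVFS, applications can instead access the file system by linking against the native PVFS API. This library
talks to the PVFS servers directly using Unix operations rather than passing messages through the kernel. The library can be used by applications and
by other libraries.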
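Spreading one file's data across several I/O nodes is commonly done by round-robin striping. The sketch below shows the arithmetic only, assuming a fixed stripe size and servers numbered from 0; the function names and parameters are illustrative and are not PVFS's actual API or on-disk layout.

```python
def locate(offset, stripe_size, num_servers):
    """Map a byte offset in the logical file to (server index, offset on that server)."""
    stripe = offset // stripe_size            # which stripe the byte falls in
    server = stripe % num_servers             # stripes are dealt out round-robin
    local = (stripe // num_servers) * stripe_size + offset % stripe_size
    return server, local

def split_request(offset, length, stripe_size, num_servers):
    """Break one read/write into per-server pieces: (server, local offset, length)."""
    pieces = []
    end = offset + length
    while offset < end:
        server, local = locate(offset, stripe_size, num_servers)
        # A piece never crosses a stripe boundary.
        chunk = min(stripe_size - offset % stripe_size, end - offset)
        pieces.append((server, local, chunk))
        offset += chunk
    return pieces
```

With a 64-byte stripe and 4 servers, a 100-byte request at offset 48 touches three servers, which is where the aggregate bandwidth comes from: the three pieces can be fetched in parallel.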
yftty replied on 2005-07-20 10:39:27
On Tue, 2005-07-19 at 21:16 -0500, Eric Anderson wrote:
> Bakul Shah wrote:
> [..snip..]
> >>:) I understand.  Any nudging in the right direction here would be
> >>appreciated.
> >
> >
> > I'd probably start with modelling a single filesystem and how
> > it maps to a sequence of disk blocks (*without* using any
> > code or worrying about details of formats but capturing the
> > essential elements).  I'd describe various operations in
> > terms of preconditions and postconditions.  Then, I'd extend
> > the model to deal with redundancy and so on.  Then I'd model
> > various failure modes. etc.  If you are interested _enough_
> > we can take this offline and try to work something out.  You
> > may even be able to use perl to create an `executable'
> > specification:-)
>
> I've done some research, and read some books/articles/white papers since
> I started this thread.
>
> First, porting GFS might be a more universal effort, and might be
> 'easier'.  However, that doesn't get us a clustered filesystem with BSD
> license (something that sounds good to me).
It has been said that this would be a seven man-month effort for an FS expert.
>
> Clustering UFS2 would be cool.  Here's what I'm looking for:
That is exactly how "Lustre" does its work, though it builds itself on
Ext3, and Lustre targets http://www.lustre.org/docs/SGSRFP.pdf .
>
> A clustered filesystem (or layer?) that allows all machines in the
> cluster to see the same filesystem as if it were local, with read/write
> access.  The cluster will need cache coherency across all nodes, and
> there will need to be some sort of lock manager on each node to
> communicate with all the other nodes to coordinate file locking.  The
> filesystem will have to support journaling.
>
> I'm wondering if one could make a pseudo filesystem something like
> nullfs that sits on top of a UFS2 partition, and essentially monitors
> all VFS operations to the filesystem, and communicates them over TCP/IP
> to the other nodes in the cluster.  That way, each node would know which
> inodes and blocks are changing, so they can flush those buffers, and
> they would know which blocks (or partial blocks) to view as locked as
> another node locks it. This could be done via multicast, so all nodes in
> the cluster would have to be running a distributed lock manager daemon
> (dlmd) that would coordinate this.  I think also that the UFS2
> filesystem would have to have a bit set upon mount that tracked its
> mount as a 'clustered' filesystem mount.  The reason for that is so that
> we could modify mount to only mount 'clustered' filesystems (mount -o
> clustered) if the dlmd was running, since that would be a dependency for
> stable coherent file control on a mount point.
>
> Does anyone have any insight as to whether a layer would work?  Or maybe
> I'm way off here and I need to do more reading :)
>
> Eric
>
>
--
yf-263
Unix-driver.org
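beyondsky replied on 2005-07-20 11:17:28
I have installation documents that I wrote myself for GFS and OGFS,
and detailed internal test documents for PVFS2.
But as I said, there is no concrete platform for communication.
Few people open this web page every day, and even those who follow it daily cannot discuss things in real time.
I think setting up a mailing list and a discussion group is necessary.
yftty replied on 2005-07-23 19:52:38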
http://www.lustre.org/docs/dfsprotocols.pdf
Peter J. Braam
School of Computer Science Carnegie Mellon University
Abstract:
The protocols used by distributed file systems vary widely. The aim of this talk is to give an overview of these protocols and discuss their applicability for a cluster environment. File systems like NFS have weak semantics, making tight sharing difficult. AFS, Coda and InterMezzo give a great deal of autonomy to cluster members, and involve a persistent file cache for each system. True cluster file systems such as found in VMS VAXClusters, XFS, GFS introduce a shared single image, but introduce complex dependencies on cluster membership.
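soway replied on 2005-07-28 13:17:58
Quote: originally posted by "beyondsky":
I have installation documents that I wrote myself for GFS and OGFS,
and detailed internal test documents for PVFS2.
But as I said, there is no concrete platform for communication.
Few people open this web page every day, and even those who follow it daily cannot discuss things in real time.
I think a mailing list and a discuss.........
I agree with setting up a mailing list or a QQ group (which is probably used more in China).
I also came by to take a look a while ago, and then did not come back.
I have been very busy the last few days, so I followed it even less.
For now, though, for the stability of my own system, I will probably stick with NFS.
As for a local write "cache" with NFS, you can enable async on the NFS server to improve performance.
But I think that in the future storage will have to be clustered in commercial computing, because its bottlenecks and weaknesses are already clearly showing.
yftty replied on 2005-08-03 09:55:36
Quote: originally posted by "soway":
For now, though, for the stability of my own system, I will probably stick with NFS.
As for a local write "cache" with NFS, you can enable async on the NFS server to improve performance.
But I think that in the future storage will have to be clustered in commercial computing, because its bottlenecks and weaknesses are already clearly showing. ..........
The problems with NFS:
From:    Eric Anderson
To:      freebsd-fs@freebsd.org
Subject: Re: Cluster Filesystem for FreeBSD - any interest?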
Hmm.  I'm not sure if it can or not.  I'll try to explain what I'm
dreaming of.  I currently have about 1000 clients needing access to the
same pools of data (read/write) all the time.  The data changes
constantly.  There is a lot of this data.  We use NFS currently.
FreeBSD is *very* fast and stable at serving NFS data.  The problem is,
that even though it is very fast and stable, I still cannot pump out
enough bits fast enough with one machine, and if that one machine fails
(hardware problems, etc), then all my machines are hung waiting for me
to bring it back online.
...
final fantasy replied on 2005-08-03 10:40:25
My abilities are limited,
so here is a friendly bump.
yftty replied on 2005-08-03 14:25:09
http://sources.redhat.com/cgi-bin/cvsweb.cgi/cluster/doc/journaling.txt?rev=1.1&content-type=text/x-cvsweb-markup&cvsroot=cluster
o  Journaling & Replay
The fundamental problem with a journaled cluster filesystem is
handling journal replay with multiple journals.  A single block of
metadata can be modified sequentially by many different nodes in the
cluster.  As the block is modified by each node, it gets logged in the
journal for each node.  If care is not taken, it's possible to get
into a situation where a journal replay can actually corrupt a
filesystem.  The error scenario is:
1) Node A modifies a metadata block by putting an updated copy into its
incore log.
2) Node B wants to read and modify the block so it requests the lock
and a blocking callback is sent to Node A.
3) Node A flushes its incore log to disk, and then syncs out the
metadata block to its inplace location.
4) Node A then releases the lock.
5) Node B reads in the block and puts a modified copy into its ondisk
log and then the inplace block location.
6) Node A crashes.
At this point, Node A's journal needs to be replayed.  Since there is
a newer version of block inplace, if that block is replayed, the
filesystem will be corrupted.  There are a few different ways of
avoiding this problem.
1) Generation Numbers (GFS1)
Each metadata block has a header in it that contains a 64-bit
generation number.  As each block is logged into a journal, the
generation number is incremented.  This provides a strict ordering
of the different versions of the block as they are logged in the FS'
different journals.  When journal replay happens, each block in the
journal is not replayed if generation number in the journal is less
than the generation number in place.  This ensures that a newer
version of a block is never replaced with an older version.  So,
this solution basically allows multiple copies of the same block in
different journals, but it allows you to always know which is the
correct one.
Pros:
A) This method allows the fastest callbacks.  To release a lock,
the incore log for the lock must be flushed and then the inplace
data and metadata must be synced.  That's it.  The sync
operations involved are: start the log body and wait for it to
become stable on the disk, synchronously write the commit block,
start the inplace metadata and wait for it to become stable on
the disk.
Cons:
A) Maintaining the generation numbers is expensive.  All newly
allocated metadata blocks must be read off the disk in order to
figure out what the previous value of the generation number was.
When deallocating metadata, extra work and care must be taken to
make sure dirty data isn't thrown away in such a way that the
generation numbers stop doing their thing.
B) You can't continue to modify the filesystem during journal
replay.  Basically, replay of a block is a read-modify-write
operation: the block is read from disk, the generation number is
compared, and (maybe) the new version is written out.  Replay
requires that the R-M-W operation is atomic with respect to
other R-M-W operations that might be happening (say by a normal
I/O process).  Since journal replay doesn't (and can't) play by
the normal metadata locking rules, you can't count on them to
protect replay.  Hence GFS1 quiesces all writes on a filesystem
before starting replay.  This provides the mutual exclusion
required, but it's slow and unnecessarily interrupts service on
the whole cluster.
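The generation-number check of method 1 can be illustrated with a toy model. This is not GFS1 code; it assumes a journal is a list of (block number, generation, data) records and the disk is a dict, and the helper names are made up. In particular, taking the next generation from the in-place copy is a simplification of "incremented as each block is logged".

```python
def log_block(journal, disk, block_no, data):
    """Log a new version of a block; its generation is one past the in-place copy's."""
    gen = disk.get(block_no, (0, None))[0] + 1
    journal.append((block_no, gen, data))
    return gen

def write_inplace(disk, block_no, gen, data):
    """Sync a logged version out to its in-place location."""
    disk[block_no] = (gen, data)

def replay(journal, disk):
    """Replay a crashed node's journal, skipping any record that is older than
    the version already in place.  Returns the block numbers actually replayed."""
    replayed = []
    for block_no, gen, data in journal:
        inplace_gen = disk.get(block_no, (0, None))[0]
        if gen < inplace_gen:
            continue  # a newer version is already on disk; replaying would corrupt it
        disk[block_no] = (gen, data)
        replayed.append(block_no)
    return replayed
```

Running the error scenario from the text through this model (A logs and syncs the block, B modifies it, A crashes) leaves B's version in place, because A's journal record carries the lower generation.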
2) Total Metadata Sync (OCFS2)
This method is really simple in that it uses exactly the same
infrastructure that a local journaled filesystem uses.  Every time
a node receives a callback, it stops all metadata modification,
syncs out the whole incore journal, syncs out any dirty data, marks
the journal as being clean (unmounted), and then releases the lock.
Because the journal is marked as clean and recovery won't look at any
of the journaled blocks in it, a valid copy of any particular block
only exists in one journal at a time and that journal is always the
journal that modified it last.
Pros:
A) Very simple to implement.
B) You can reuse journaling code from other places (such as JBD).
C) No quiesce necessary for replay.
D) No need for generation numbers sprinkled throughout the metadata.
Cons:
A) This method has the slowest possible callbacks.  The sync
operations are: stop all metadata operations, start and wait for
the log body, write the log commit block, start and wait for all
the FS' dirty metadata, write an unmount block.  Writing the
metadata for the whole filesystem can be particularly expensive
because it can be scattered all over the disk and there can be a
whole journal's worth of it.
3) Revocation of a lock's buffers (GFS2)
This method prevents a block from appearing in more than one
journal by canceling out the metadata blocks in the journal that
belong to the lock being released.  Journaling works very similarly
to a local filesystem or to #2 above.
The biggest difference is you have to keep track of buffers in the
active region of the ondisk journal, even after the inplace blocks
have been written back.  This is done in GFS2 by adding a second
part to the Active Items List.  The first part (in GFS2 called
AIL1) contains a list of all the blocks which have been logged to
the journal, but not written back to their inplace location.  Once
an item in AIL1 has been written back to its inplace location, it
is moved to AIL2.  Once the tail of the log moves past the block's
transaction in the log, it can be removed from AIL2.
When a callback occurs, the log is flushed to the disk and the
metadata for the lock is synced to disk.  At this point, any
metadata blocks for the lock that are in the current active region
of the log will be in the AIL2 list.  We then build a transaction
that contains revoke tags for each buffer in the AIL2 list that
belongs to that lock.
Pros:
A) No quiesce necessary for Replay
B) No need for generation numbers sprinkled throughout the
metadata.
C) The sync operations are: stop all metadata operations, start and
wait for the log body, write the log commit block, start and
wait for all the FS' dirty metadata, start and wait for the log
body of a transaction that revokes any of the lock's metadata
buffers in the journal's active region, and write the commit
block for that transaction.
Cons:
A) Recovery takes two passes, one to find all the revoke tags in
the log and one to replay the metadata blocks using the revoke
tags as a filter.  This is necessary for a local filesystem and
the total sync method, too.  It's just that there will probably
be more tags.
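The two-pass recovery described for method 3 might be sketched as follows. This is an illustration, not GFS2's log format: it assumes journal entries are tuples tagged "block" or "revoke" with a sequence number, and that a revoke only cancels copies of the block logged before it, which is an assumption borrowed from how JBD-style revoke records are usually described.

```python
def recover(journal, disk):
    """Two-pass replay with revoke tags.

    journal entries are ("block", seq, block_no, data) or ("revoke", seq, block_no).
    Returns the (seq, block_no) pairs that were replayed.
    """
    # Pass 1: collect the newest revoke tag seen for each block.
    revoked = {}
    for entry in journal:
        if entry[0] == "revoke":
            _, seq, block_no = entry
            revoked[block_no] = max(seq, revoked.get(block_no, -1))

    # Pass 2: replay logged blocks, using the revoke tags as a filter.
    replayed = []
    for entry in journal:
        if entry[0] != "block":
            continue
        _, seq, block_no, data = entry
        if seq < revoked.get(block_no, -1):
            continue  # this copy was revoked when the lock was released
        disk[block_no] = data
        replayed.append((seq, block_no))
    return replayed
```

A block that was revoked and later logged again under a re-acquired lock is still replayed, since its record is newer than the revoke tag.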
Comparing #2 and #3, both do extra I/O during a lock callback to make
sure that any metadata blocks in the log for that lock will be
removed.  I believe #2 will be slower because syncing out all the
dirty metadata for entire filesystem requires lots of little,
scattered I/O across the whole disk.  The extra I/O done by #3 is a
log write to the disk.  So, not only should it be less I/O, but it
should also be better suited to get good performance out of the disk
subsystem.
KWP 07/06/05
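mafa replied on 2005-08-03 17:52:17
I'm dizzy. My eyesight was poor to begin with, and after scrolling through all these screens my old eyes are blurry.
yftty replied on 2005-08-05 13:12:20
http://bbs.chinaunix.net/forum/viewtopic.php?t=588832
The FreeBSD development team reviews kernel code very strictly. Most of it is written by the team itself and the code is thoroughly optimized, which largely preserves its efficiency and security. The code also ensures that the BSD license is honored: all kernel code and the techniques it uses belong to the FreeBSD team, so there is no risk of infringement, and software developed on FreeBSD may be open source or not, and may be free, shareware or commercial. But because FreeBSD's code review and development are so strict, many new technologies appear on FreeBSD slowly; FreeBSD's SMP support is one example.
A large part of the Linux code is contributed by developers around the world. New technology shows up quickly, but Linux faces maintenance difficulties and copyright problems. Maintenance is difficult because developers may stop working on their code for all sorts of reasons and because their skill levels vary, so optimization and security cannot be guaranteed. The copyright problem is that the code itself and the techniques it uses may involve commercial copyrights, or even deliberate corporate behavior in which large companies contribute their own code, exposing Linux users to the risk of infringement; SCO suing IBM is a good example.
Since Linux set up a development team, later versions are still built on earlier ones, so the situation cannot change overnight, but with strict review and release control I believe it will gradually improve. Linus Torvalds said some time ago that once 2.6 is out, the focus will shift to optimization, security and stability of the code.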
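suran007 replied on 2005-08-05 16:56:47
Our servers used to share storage over NFS and we later switched to GFS, but now it is very slow and consumes a lot of server resources, even though everyone online says GFS performs better than NFS. Has anyone worked with GFS who can explain how to tune GFS for better performance?
Waiting anxiously for a reply. Could some expert help me? My QQ is 24242546
yftty replied on 2005-08-05 17:09:03
Quote: originally posted by "suran007":
Our servers used to share storage over NFS and we later switched to GFS, but now it is very slow and consumes a lot of server resources, even though everyone online says GFS performs better than NFS. Has anyone worked with GFS who can explain how to tune GFS for better performance?
Waiting anxiously for a reply. Could some expert..........
There is a problem with how you are using it. Let's talk on QQ.
yftty replied on 2005-08-06 10:33:00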
Daniel Phillips
Hi guys,
I'm interested in helping pull on the oars to help get the rest of the way to
Posix compliance.  On the "talk is cheap" principle, I'll start with some
cheap talk.
First, you don't have to add range locks to your DLM just to handle Posix
locks.  Posix locking can be handled by a completely independent and crude
mechanism.  You can take your time merging this with your DLM, or switching
to another DLM if that's what you really want to do.
In the long run, the DLM has to have range locking in order to implement
efficient, sub-file granularity caching, but the VFS still needs work to
support this properly anyway, so there is no immediate urgency.  And Posix
locking is even less of a reason to panic.
Now some comments on how the hookable Posix locking VFS interface works.
First off, it is not a thing of beauty.  The principle is simple.  By
default, the VFS maintains a simple minded range lock accounting structure.
It is just a linear list of range locks per inode.  When somebody wants a new
lock, the VFS searches the whole list to find collisions.  If a lock needs to
be split, merged or whatever, these are very basic list operations.  The code
is short.  It is obviously prone to efficiency bugs, but I have not heard
people complaining.
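The "linear list of range locks per inode" idea is easy to picture in a few lines. The sketch below is not the kernel's fs/locks.c; it only shows the collision search, assumes inclusive byte ranges and integer owner IDs, and leaves out the splitting and merging of a single owner's locks.

```python
from dataclasses import dataclass

@dataclass
class RangeLock:
    owner: int       # hypothetical owner ID (a process, or a node in a cluster)
    start: int       # first byte covered
    end: int         # last byte covered (inclusive)
    exclusive: bool  # True for a write lock, False for a read lock

def conflicts(a, b):
    """Two locks collide if different owners overlap and at least one is exclusive."""
    if a.owner == b.owner:
        return False
    if a.end < b.start or b.end < a.start:
        return False
    return a.exclusive or b.exclusive

def try_lock(locks, new):
    """Search the inode's whole lock list for a collision; append if there is none."""
    for held in locks:
        if conflicts(held, new):
            return False
    locks.append(new)
    return True
```

The search is O(n) in the number of locks on the inode, which is the "efficiency bug" the mail jokes about; it only hurts when one file carries very many locks.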
The hook works by simply short-circuiting into your filesystem just before the
VFS touches any of its own lock accounting data.  Your filesystem gets the
original ioctl arguments and you get to replicate a bunch of work that the
VFS would have done, using its own data structure.  There are about 5
shortcircuits like this.  See what I mean by not beautiful?  But it is fairly
obvious how to use this interface.
On a cluster, every node needs to have the same view of Posix locks for inodes
it is sharing.  The easiest way I know of to accomplish this is to have a
Posix lock server in the cluster that keeps the same, bad old linear list of
locks per Posix-locked inode, but exports operations on those locks over the
network.  Not much challenge: a user space server and some tcp messaging.
The server will count the nodes locking a given inode, and let the inode fall
away when all are done.  I think inode deletion just works: the filesystem
just has to wait for confirmation from the lock server that the Posix locks
are dropped before it allows the normal delete path to continue.
Inodes that aren't shared don't require consulting the server, an easy
optimization.
For failover, each node will keep its own list of Posix locks.  Exactly the
same code the VFS already uses will do, just cut and paste it and insert the
server messaging.  To fail over a lock server, the new server has to rollcall
all cluster nodes to upload their Posix locks.
Now, obviously this locking should really be distributed, right?  Sure, but
notice: the failover algorithm is the same regardless of single-server vs
distributed; in the DLM case it just has to be executed on every node for the
locks mastered on that node.  And oh, you have to deal with a new layer of
bookkeeping to keep track of lock masters, and fail that over too.  And of
course you want lock migration...
So my purpose today is to introduce the VFS interface and to point out that
this doesn't have to blow up into a gigantic project just to add correct
Posix locking.
See here:
http://lxr.linux.no/source/fs/locks.c#L1560
if (filp->f_op && filp->f_op->lock) {
Regards,
Daniel
_______________________________________________
Ocfs2-devel mailing list
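yftty replied on 2005-08-10 15:18:27
http://bbs.chinaunix.net/forum/46/050807/589800.html
Sina Tech news, Beijing time, August 6
AOL announced recently that it has acquired the online storage company Xdrive to meet consumers' need to collect digital music, pictures and other files. The amount of the deal has not been disclosed.
AOL said Xdrive will become a wholly owned subsidiary and continue to sell storage and backup services through Xdrive.com. Xdrive's 34 employees have already moved to AOL's offices in Santa Monica, California. Analysts believe AOL may integrate Xdrive technology into its existing services, as it did after acquiring the anti-spam company Mailblocks last year. With the Xdrive platform, AOL could centralize storage for e-mail, blogs, pictures and other services. AOL has not yet given more details.
AOL's storage needs have kept growing in recent years. Not long ago AOL began offering free e-mail accounts through "aim.com", and it also gives paying users unlimited mailbox storage. AOL hopes this will attract more users and make it more competitive. Google, AOL's main competitor, has already launched free e-mail with 2 GB of storage. AOL spokesman Nicholas Graham said that as more and more households adopt broadband, transferring large files such as digital video and films over the network is becoming feasible, and consumer demand for storage will keep growing.
Graham also said AOL will not build a new storage system from scratch, because Xdrive has already invested millions of dollars and the work of hundreds of engineers in this area. (Moore)
CU_runner replied on 2005-08-10 15:22:05
Running a storage company with just 34 people; those foreigners are really impressive.
secondself replied on 2005-08-14 23:03:25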
rtsp://vodreal.stanford.edu/sns/0506/jobs.rm
Stay hungry, stay foolish !
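yftty replied on 2005-08-15 11:55:38
http://tech.sina.com.cn/i/2005-08-15/0924692923.shtml
...
A survey by Microsoft found that what users need most from a mail system is stability, second is security, and mailbox capacity only comes third. Luo Chuan believes that mail must first satisfy the most basic function: sending and receiving reliably, so that you receive what others send you and what you send goes out smoothly. Second, security today largely means being able to filter out spam. Luo Chuan said: "If a mailbox is very large but receives a lot of spam, so that users cannot find their real messages, they will not be satisfied with it."
...
Microsoft currently has two sets of mail products: the web-based Hotmail and the client-based Outlook family. Compared with the current Microsoft mail products, the upcoming next-generation web-based mail system will be as powerful as client-side mail. That is, information such as the address book that used to live on the personal computer will be stored in the web mail system.
...
yftty replied on 2005-08-16 09:12:18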
From: Payal Rathod
Hi,
The whole filesystem idea is seemingly interesting to me. Can someone
point to some good documentation which explains superblock, inode block
and data blocks in details without going into too much of technicalities
(as I am a person with no technical background academically).
==================================================================
From: Matt Stegman
This page is all about ext2, not reiserfs, but I think it works as a good
primer for general information:
http://web.mit.edu/tytso/www/linux/ext2intro.html
This page provides more detailed information about reiserfs (that's
version 3, not 4, just to be clear):
http://homes.cerias.purdue.edu/~florian/reiser/reiserfs.php
If you're not familiar with the basic ideas behind hard disks and
partitions, and the difference between a partition and a filesystem,
Ranish's partitioning primer does a good job of explaining that.
http://www.ranish.com/part/primer.htm
yftty replied on 2005-08-16 09:54:21
http://www.coda.cs.cmu.edu/maillists/codalist/codalist-2005/7292.html
From: Ivan Popov
Hi,
some days ago I found an insightful comment on a slashdot story about Google.
If Coda-alike is going to be written from scratch, it might be Google who does that... or anyone who sees the potential and has enough money.
Coda did it right, just evolved too slowly to come in time, or may be we still have a chance?
I mean a chance to have a full-scale free technology instead of a proprietary one
which otherwise would fill the niche and attract (and force) the community to use and pay for the technology, besides the services?
Coda is still unique, and there is a clear interest for global file service... Now it _is_ possible to use Coda for making money, offer services and be paid for them, not for the technology.
http://daltonlp.com/daltonlp.cgi?item_type=0&item_id=424
...
Comment by Mystified
...
Personally, I think it makes massive sense to build a "desktop" stack of applications delivered over the internet and with centrally stored, encrypted data and to sell that service on a recurring subscription basis. If the apps worked across platforms from my laptop to my cell phone and provided easy access to the knowledge of the world, I would seriously think about paying that monthly fee (provided they did not spam me with targetted ads or sell my PII).
...
The applications delivery is already taken care of by Konvalo.org. A decent "homedir" service with an application environment will fit nicely for those needs.
=====================================================================
From: Gabriel B.
I think that Plan9 would be better suited for that Google OS idea, but it's getting way off topic :) Back to Coda:
...
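thomasie replied on 2005-08-16 11:31:49
Hi all, has anyone successfully set up a GFS 6.1 environment? I ran into some problems during setup and hope to get some help. My QQ number is 463974306.
yftty replied on 2005-08-16 12:52:25
Quote: originally posted by "thomasie":
Hi all, has anyone successfully set up a GFS 6.1 environment? I ran into some problems during setup and hope to get some help. My QQ number is 463974306.
Have a chat with this friend --> "Our servers used to share storage over NFS and we later switched to GFS, but now it is very slow and consumes a lot of server resources, even though everyone online says GFS performs better than NFS. Has anyone worked with GFS who can explain how to tune GFS for better performance? Waiting anxiously for a reply. Could some expert help me? My QQ is 24242546"
yftty replied on 2005-08-17 10:04:46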
Talk on writing a file system module
http://www.eecs.umich.edu/~ppadala/pubs/fs/slide001.html
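ceacdong replied on 2005-08-17 10:42:18
Diverse stored content; data that cannot be centralized; data stored around users/groups/systems.
Very interesting.
yftty replied on 2005-08-17 14:54:59
Quote: originally posted by "ceacdong":
Diverse stored content; data that cannot be centralized; data stored around users/groups/systems.
Very interesting.
Heh, a true professional. Nicely summarized :em02:
guotie replied on 2005-08-17 16:43:13
Why on BSD rather than on Linux?
yftty replied on 2005-08-17 17:28:45
Quote: originally posted by "guotie":
Why on BSD rather than on Linux?
Then let's change it to "on Unix", e.g. Linux, FreeBSD, Darwin, Solaris, AIX, etc.
Heh, but editing the title dropped the "poll". Should I add it back? :roll: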
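suran007 replied on 2005-08-24 16:40:50
I want to set up two gnbd servers plus a disk array, without multipath. Would that improve the performance of the GFS cluster? (Previously I used a single gnbd server.) I cannot find any articles on running multiple gnbd servers; could someone recommend a few? Thanks.
yftty replied on 2005-08-25 10:39:58
Quote: originally posted by "suran007":
I want to set up two gnbd servers plus a disk array, without multipath. Would that improve the performance of the GFS cluster? (Previously I used a single gnbd server.) I cannot find any articles on running multiple gnbd servers; could someone recommend a few? Thanks.
See whether this is what you need:
http://www.redhat.com/magazine/008jun05/features/gfs/
suran007 replied on 2005-08-26 15:27:56
Thanks to the thread starter. After reading it I am still lost: it only covers concepts. An example of a multi-gnbd-server setup would be great, but I cannot find one anywhere. Very few people work on this, after all. Frustrating.
yftty replied on 2005-08-26 22:26:36
Quote: originally posted by "suran007":
Thanks to the thread starter. After reading it I am still lost: it only covers concepts. An example of a multi-gnbd-server setup would be great, but I cannot find one anywhere. Very few people work on this, after all. Frustrating.
It is feasible in theory, but it reduces availability, as described in the following mail:
https://www.redhat.com/archives/linux-cluster/2005-August/msg00270.html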
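artifly replied on 2005-08-29 16:41:03
I am a graduate student in a computer science department, working on cluster file systems.
I worked with pvfs2 for a while and am now focusing on lustre.
While looking for material online I came across this project,
got interested, and read through all the replies. I think I could contribute some work too.
What stage has the project reached, and may I take part?
yftty replied on 2005-08-30 13:02:21
Quote: originally posted by "artifly":
I am a graduate student in a computer science department, working on cluster file systems.
I worked with pvfs2 for a while and am now focusing on lustre.
While looking for material online I came across this project,
got interested, and read through all the replies. I think I could contribute some work too.
..........
"Read through all the replies", very good :) --> My wife asked me whether I had actually read everything I posted here. Heh, I read every one of these word by word before posting ;)
The first version can be released this month.
Let's find a time for you to come by the company and talk first.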
yftty replied on 2005-08-31 10:14:18
From the Coda list:
On Mon, Aug 15, 2005 at 01:31:43PM -0400, Kris Maglione wrote:
> There's nothing stopping Coda (in theory. I haven't seen the code
> relating to this) from implementing both partial and full file caching.
> Whether it be a knob between two modes of caching, a switch to require
> the fetching of all blocks (with unneeded ones at a lower priority, put
> off until essential data is retrieved), or just a program like hoard
> deciding what files need to be cached fully, and doing so. I'm not
> saying that this should or will be implemented, but it is possible, in
> theory. For Coda and AFS.
Actually there are many reasons to not have block level caching in Coda.
- VM deadlocks
Because we have a userspace cache manager we could get into the
situation where we are told to write out dirty data, but this causes
us to request one or more memory pages from the kernel, either
because we allocate memory, or are simply paging in some of the
application/library code.  The kernel might then decide to give us
pages that would require write-back of more dirty state to the
userspace daemon. We would have to push venus into the kernel, which
is what AFS did, but they aren't dealing with a lot of the same
complexities like replication and reintegration.
- Code complexity
It is already a hard enough problem to do optimistic replication and
reintegration with whole files. The last thing I need right now is
to add additional complexity so we suddenly have to reason about
situations where we only happen to have parts of a locally modified
file, which might already have been partially reintegrated, but
then overwritten on the server by another client and how to commit,
revert or merge these local changes in the global replica(s). As
well as effectively maintaining the required data structures. The
current RVM limitations are on number of file objects and not
dependent on file size. You can cache 100 zero length files with the
same overhead as far as the client is concerned as 100 files that
are 1GB in size.
- Network performance
It is more efficient to fetch a large file at once compared to
requesting individual blocks. Available network bandwidth keeps
increasing, but latency is bounded by the laws of physics. So the
60ms roundtrip from coast-to-coast will remain. So requesting 1000
individual 4KB blocks will always cost at least 60 seconds, while
fetching the same 4MB as a single file will become cheaper over
time.
- Local performance
Handling upcalls is quite expensive, there are at least 2 context
switches and possibly some swapping/paging involved to get the
request up to the cache manager and the response back to the
application. Doing this on individual read and write operations
would make the system a lot less responsive.
- Consistency model
It is really easy to explain Coda's consistency model wrt other
clients. You fetch a copy of the file when it is opened, and it is
written back to the servers when it is closed (and it was modified).
Now try to do the same if the client uses block-level caching. The
picture quickly becomes very blurry, and Transarc AFS actually had
(has?) a serious bug that leads to unexpected data loss in this area
if people were assuming that it actually still provides AFS semantics.
Also once a system provides block-level access, people start to
expect the file system provides something close to UNIX semantics,
which is really not a very usable model for any distributed
filesystem.
Jan
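The arithmetic behind the network-performance point in the mail above can be sketched as follows. This is a minimal model, assuming blocks are requested strictly one after another (no pipelining or read-ahead) and that a whole-file fetch costs one round trip plus transfer time. The function names and the bandwidth figures are illustrative assumptions, not measurements and not part of Coda.

```python
RTT_S = 0.060            # 60 ms coast-to-coast round trip, from the mail above
N_BLOCKS = 1000          # number of individual block requests
BLOCK_BYTES = 4 * 1024   # 4KB per block, about 4MB in total


def blockwise_seconds(n_blocks: int, rtt_s: float) -> float:
    """Lower bound when every block waits for its own round trip."""
    return n_blocks * rtt_s


def whole_file_seconds(size_bytes: int, rtt_s: float, bandwidth_bps: float) -> float:
    """One round trip plus the time needed to stream the whole file."""
    return rtt_s + size_bytes * 8 / bandwidth_bps


if __name__ == "__main__":
    size = N_BLOCKS * BLOCK_BYTES
    print(f"block by block: at least {blockwise_seconds(N_BLOCKS, RTT_S):.0f} s")
    # Assumed link speeds; only the whole-file cost shrinks as they grow.
    for mbit in (10, 100, 1000):
        t = whole_file_seconds(size, RTT_S, mbit * 1_000_000)
        print(f"whole file at {mbit} Mbit/s: {t:.2f} s")
```

The block-by-block bound depends only on latency, so faster links do not help it, while the whole-file cost keeps falling as bandwidth rises.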
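suran007 replied on 2005-09-02 14:54:47
I am experimenting with lustre, version 1.0.4.
My kernel is 2.4.20-28.9_lustre.1.0.4smp.
My topology is one OST (n01) (OSS), one MDS (n03), and one client (n04).
I want to add another OST (n02) (OSS) alongside the existing one while keeping the data already stored on the OSS.
When I run lconf --write_conf newconfig.xml on the MDS so that it rereads the new OSS information, and then continue starting the MDS, I get the following errors: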
MDSDEV: n03-mds1 n03-mds1_UUID /dev/sda1 ext3 200000 no
Lustre:2195 (fsfilt_ext3.c:815:fsfilt_ext3_setup () ) Enable PDIROPS
LustreError:2195:(obd_config.c:101:class_attach ()) obd OSC_n03_n01-ost1_n03-mds 1 already attached
LustreError:2197:(obd_config.c:285:class_cleanup ()) Device 1 not setup ! /usr/sbin/lctl (17):error:setup: File exists
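Why is this? If I use lconf --reformat --node n03 newconfig.xml instead, the MDS starts, the client can mount, and the mounted capacity becomes the sum of the two OSTs, but then the data I previously had in /mnt/lustre/ is lost.
Could anyone who has worked with lustre help me? My QQ is 24242546.
Below is my newconfig.sh script: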
#!/bin/sh
#config.sh
#Create nodes
rm -f newconfig.xml
lmc -m newconfig.xml --add net --node n03 --nid n03 --nettype tcp
lmc -m newconfig.xml --add net --node n01 --nid n01 --nettype tcp
lmc -m newconfig.xml --add net --node n02 --nid n02 --nettype tcp
lmc -m newconfig.xml --add net --node client --nid '*' --nettype tcp
#Configure mds
lmc -m newconfig.xml --add mds --node n03 --mds n03-mds1 --fstype ext3 --dev /dev/sda1 --size 200000
#Configure ost
lmc -m newconfig.xml --add lov --lov lov1 --mds n03-mds1 --stripe_sz 1048576 --stripe_cnt 0 --stripe_pattern 0
lmc -m newconfig.xml --add ost --node n01 --lov lov1 --ost n01-ost1 --fstype ext3 --dev /dev/sda1 --size 1000000
lmc -m newconfig.xml --add ost --node n02 --lov lov1 --ost n02-ost1 --fstype ext3 --dev /dev/sda1 --size 1000000
#Configure client
lmc -m newconfig.xml --add mtpt --node client --path /mnt/lustre --mds n03-mds1 --lov lov1
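Randome replied on 2005-09-06 11:22:09
A while ago I translated the GoogleFS white paper. The translation is incomplete, but since many of you are interested I am posting it so we can work on it together. It is fairly long, so I opened a separate page:
http://bbs.chinaunix.net/forum/viewtopic.php?show_type=&p=4054433#4054433
Greetings also to You (yftty), and thanks for the guidance on FS architecture.
yftty replied on 2005-09-06 11:56:53
hehe, you did a good job :D
I'll read it line by line :)
Randome replied on 2005-09-06 21:11:03
Thanks a lot :) ,
these days I'm preparing my graduation thesis. I hate writing docs, so it's terrible work for me. God save me! :(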
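szfrank replied on 2005-09-12 18:39:12
After a first read of the GoogleFS paper I still have some questions:
1. How is the file size made to be 64M?
2. How does the master handle primary/standby failover?
3. How is load balancing across chunk servers handled?
Could yftty please answer? Thanks.
yftty replied on 2005-09-12 22:24:09
Quote: originally posted by "szfrank":
After a first read of the GoogleFS paper I still have some questions:
1. How is the file size made to be 64M?
2. How does the master handle primary/standby failover?
3. How is load balancing across chunk servers handled?
Could yftty please answer? Thanks.
1. How is the file size made to be 64M?
It is not the file size that is 64MB but the chunk size. A big file is split into multiple chunks, while a small file occupies a single chunk whose on-disk replica is extended only as needed.
2. How does the master handle primary/standby failover?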
"If its machine or disk fails, monitoring infrastructure outside GFS starts a new master process elsewhere with the replicated operation log."
3. How is load balancing across chunk servers handled?
"4.3 Creation, Re-replication, Rebalancing
Chunk replicas are created for three reasons: chunk creation, re-replication, and rebalancing.
...
Finally, the master rebalances replicas periodically: it examines the current replica distribution and moves replicas for better disk space and load balancing. ..."
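To make the first answer concrete, here is a minimal sketch of how a byte range in a file maps onto fixed 64MB chunks. Only the chunk size comes from the discussion above; the function names are hypothetical and are not part of any GFS interface.

```python
CHUNK_SIZE = 64 * 1024 * 1024  # 64MB chunk size, as stated in the answer above


def chunk_index(offset: int) -> int:
    """Index of the chunk that holds the given byte offset."""
    return offset // CHUNK_SIZE


def chunks_for_file(file_size: int) -> int:
    """Number of chunks needed for a file of file_size bytes (ceiling division)."""
    return (file_size + CHUNK_SIZE - 1) // CHUNK_SIZE


def split_range(offset: int, length: int):
    """Split a byte range into (chunk index, offset within chunk, length) pieces."""
    pieces = []
    end = offset + length
    while offset < end:
        within = offset % CHUNK_SIZE
        n = min(CHUNK_SIZE - within, end - offset)
        pieces.append((chunk_index(offset), within, n))
        offset += n
    return pieces


if __name__ == "__main__":
    print(chunks_for_file(200 * 1024 * 1024))   # a 200MB file
    print(split_range(CHUNK_SIZE - 10, 20))     # a read that crosses a chunk boundary
```

A read or write that crosses a chunk boundary becomes one request per chunk, which is why the split is done on the client side in this sketch.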
yftty replied on 2005-09-14 15:57:22
An anonymous reader writes "KernelTrap has an interesting interview with Hans Reiser, the author of two revolutionary Linux filesystems, Reiser3 and Reiser4. Reiser3 was the first journaling Linux filesystem. Reiser4 is a complete rewrite that is claimed to offer amazing performance and a new plugin architecture offering semantic enhancements to rival Microsoft's WinFS and Apple's Spotlight. Comparing Reiser4 to WinFS, Reiser says in the interview, "Reiser4 is a much more mature design, representing a 10 year effort"."
http://kerneltrap.org/node/5654?comments_per_page=1
http://hardware.slashdot.org/article.pl?sid=05/09/13/1455238
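suran007 replied on 2005-09-15 12:56:29
Below is my plan for a GFS cluster with two gnbd servers. Please check whether anything is wrong with it. I cannot post the topology diagram: two gnbd servers are attached to the disk array, with three GFS clients above them.
1. Partitioning
Four partitions in total:
sdb1 (primary quorum partition)
sdb2 (secondary quorum partition)
sdb3 (cca partition)
sdb4 (gfs partition)
2. Export the block devices on the two gnbd servers
On data:
gnbd_export -e cca -d /dev/sdb3 -c
gnbd_export -e gfs -d /dev/sdb4 -c
On data2:
gnbd_export -e cca2 -d /dev/sdb3 -c
gnbd_export -e gfs2 -d /dev/sdb4 -c
3. Import the block devices on every node
gnbd_import -i data
gnbd_import -i data2
4. On node1, create a pool configuration file for each of the four block devices and create the pools
pool_tool -c alpha_cca.cf alpha_cca2.cf pool_gfs.cf pool_gfs2.cf
5. Activate the pools on every node
pool_assemble -a
6. Create the cluster configuration files (cluster.ccs, fence.ccs, nodes.ccs)
Because there are two gnbd servers, the configuration files must be changed accordingly.
cluster.ccs: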
cluster {
name = "alpha"
lock_gulm {
servers = ["node01", "node02", "node03"]
}
}
fence.ccs:
fence_devices {
gnbd {
agent = "fence_gnbd"
server = "data"
server = "data2"
}
}
nodes.ccs:
nodes {
node01 {
ip_interfaces {
eth0 = "192.168.70.1"
}
fence {
server {
gnbd {
ipaddr = "192.168.70.200"
}
gnbd {
ipaddr = "192.168.70.300"
}
}
}
}
node02 {
ip_interfaces {
eth0 = "192.168.70.2"
}
fence {
server {
gnbd {
ipaddr = "192.168.70.200"
}
gnbd {
ipaddr = "192.168.70.300"
}
}
}
}
node03 {
ip_interfaces {
eth0 = "192.168.70.3"
}
fence {
server {
gnbd {
ipaddr = "192.168.70.200"
}
gnbd {
ipaddr = "192.168.70.300"
}
}
}
}
}
7. Create the cca archives on the cca devices
ccs_tool create /root/alpha /dev/pool/alpha_cca01
ccs_tool create /root/alpha /dev/pool/alpha_cca02
8. Start the cluster configuration daemon on every node
ccsd -d /dev/pool/alpha_cca01
9. Start the lock_gulm service on every node
lock_gulmd
10. Format the GFS file systems on node1
gfs_mkfs -p lock_gulm -t alpha:gfs01 -j 3 /dev/pool/pool_gfs01
gfs_mkfs -p lock_gulm -t alpha:gfs02 -j 3 /dev/pool/pool_gfs02
11. Mount the file systems on every node
mount -t gfs /dev/pool/pool_gfs01 /gfs01
mount -t gfs /dev/pool/pool_gfs02 /gfs02
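Before building the cca archives it may be worth sanity-checking the addresses used in the nodes file of this plan. The following is a minimal sketch using Python's standard ipaddress module; the helper name invalid_ipv4 is hypothetical, and the address list is copied by hand from the plan rather than parsed from a .ccs file. It flags 192.168.70.300, because an IPv4 octet cannot exceed 255.

```python
import ipaddress


def invalid_ipv4(addresses):
    """Return the entries that are not valid dotted-quad IPv4 addresses."""
    bad = []
    for addr in addresses:
        try:
            ipaddress.IPv4Address(addr)
        except ValueError:  # AddressValueError is a subclass of ValueError
            bad.append(addr)
    return bad


if __name__ == "__main__":
    # Addresses as they appear in the eth0 and ipaddr entries of the plan.
    plan_addresses = [
        "192.168.70.1",
        "192.168.70.2",
        "192.168.70.3",
        "192.168.70.200",
        "192.168.70.300",
    ]
    for addr in invalid_ipv4(plan_addresses):
        print(f"not a valid IPv4 address: {addr}")
```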
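suran007 replied on 2005-09-15 14:58:36
Option 1: I have a single GFS pool and put the directories of three websites into it, i.e. three site directories in one pool.
Option 2: I create three GFS pools and put one site directory in each pool.
Is there any performance difference between option 1 and option 2?
suran007 replied on 2005-09-22 09:45:54
A question for yftty:
In a lustre cluster, can our two data servers act as two OSTs with a single disk array attached behind them? Can the two OSTs share one disk array?
yftty replied on 2005-09-22 10:51:01
Quote: originally posted by "suran007":
A question for yftty:
In a lustre cluster, can our two data servers act as two OSTs with a single disk array attached behind them? Can the two OSTs share one disk array?
Yes. It is equivalent to one machine exporting two NFS share dirs.
[ Last edited by yftty on 2005-12-13 17:20 ]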
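zyangj replied on 2005-10-13 18:55:31
Everyone, I have a question:
I have three machines, each with 300G of disk. I want to merge all the space left over after the operating system on these three machines into one unified store that all three can write to and whose data all three can read. That is what you get after clustering, i.e. what GFS provides, right?
yftty replied on 2005-10-13 20:03:27
Quote: originally posted by "zyangj":
Everyone, I have a question:
I have three machines, each with 300G of disk. I want to merge all the space left over after the operating system on these three machines into one unified store that all three can write to and whose data all three can..........
http://bbs.chinaunix.net/forum/viewtopic.php?t=544517&show_type=&postdays=0&postorder=asc&start=160
Quote: originally posted by "yftty":
It is feasible in theory, but it reduces availability, as described in the following mail:
https://www.redhat.com/archives/linux-cluster/2005-August/msg00270.html
RH GFS performance also still needs tuning:
https://www.redhat.com/archives/linux-cluster/2005-September/msg00250.html
Data backup and disaster recovery also have to be considered here.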
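xichen replied on 2005-10-13 21:58:52
This thread has drifted way off topic.
Well, others seem to be trying to build a FS too, such as one called jer-something, and Sina is said to be developing a Mail FS, which probably already exists in some working form.
Trying to build a FS is a good thing. Following with interest.
zyangj replied on 2005-10-14 11:07:19
This very DIY kind of cluster is useful to companies large and small, because its biggest advantage is that it can reuse all current and older equipment, so nothing is wasted and the investment is far below that of dedicated equipment of the same compute and storage scale. So whatever environment your company or you personally work in, as long as you have the skills you can make use of it; that is why I consider it technology-driven.
There is really too little material on this, and what exists is almost all in English, so Chinese material urgently needs people fluent in English. Otherwise Chinese companies will keep paying large sums to foreign equipment vendors...
But it still depends on the awareness of the decision makers...
yftty replied on 2005-10-14 12:12:57
Quote: originally posted by "zyangj":
This very DIY kind of cluster is useful to companies large and small, because its biggest advantage is that it can reuse all current and older equipment, so nothing is wasted and the investment is far below that of dedicated equipment of the same compute and storage scale, so whatever your company or..........
RH GFS was designed for FC SAN environments. Its reliability on today's IP-attached JBOD setups still needs improvement, though work on this is ongoing.
I think decision makers care more about risk than about cost; besides, in many cases it is not their own money being spent.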
yftty replied on 2005-10-14 13:20:38
http://bbs.chinaunix.net/forum/viewtopic.php?show_type=&p=4223184#4223184
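coolzhh replied on 2005-10-20 00:42:46
Friend, maybe start the project somewhere else ;)
This thread already runs to dozens of pages and is getting hard to read.
yftty replied on 2005-10-20 08:58:12
Heh, OK, time to set up a website.
xichen replied on 2005-10-20 10:10:02
I suggest you look at
http://wiki.woodpecker.org.cn/moin/Compass/CompassSysfile
for some ideas.
Yeah.
yftty replied on 2005-10-20 11:10:03
OK, thanks! Studying it now...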
yftty replied on 2005-10-21 00:48:04
http://www.phy.duke.edu/resources/computing/brahma/Resources/beowulf_book.php
is a good start;
http://www.beowulf.org is another good place, and it is also the home of the
original beowulf mailing list.
Generally I would recommend digging through recent mailing list postings,
because there are often very informed answers to questions.
Lon just answered a fencing question a few days ago:
"STONITH, STOMITH, etc. are indeed implementations of I/O fencing.
Fencing is the act of forcefully preventing a node from being able to
access resources after that node has been evicted from the cluster in an
attempt to avoid corruption.
The canonical example of when it is needed is the live-hang scenario, as
you described:
1. node A hangs with I/Os pending to a shared file system
2. node B and node C decide that node A is dead and recover resources
allocated on node A (including the shared file system)
3. node A resumes normal operation
4. node A completes I/Os to shared file system
At this point, the shared file system is probably corrupt.  If you're
lucky, fsck will fix it -- if you're not, you'll need to restore from
backup.  I/O fencing (STONITH, or whatever we want to call it) prevents
the last step (step 4) from happening.
How fencing is done (power cycling via external switch, SCSI
reservations, FC zoning, integrated methods like IPMI, iLO, manual
intervention, etc.) is unimportant - so long as whatever method is used
can guarantee that step 4 can not complete."
"GFS can use fabric-level fencing - that is, you can tell the iSCSI
server to cut a node off, or ask the fiber-channel switch to disable a
port.  This is in addition to "power-cycle" fencing."
Michael
--
Linux-cluster mailing list
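The live-hang scenario quoted above can be walked through with a toy model. This is only an illustrative sketch: real fencing is enforced by a power switch, FC switch, SCSI reservation or similar, not by application code, and the SharedStorage class and its methods are hypothetical names invented for this example.

```python
class SharedStorage:
    """Toy shared disk that refuses I/O from fenced nodes."""

    def __init__(self):
        self.blocks = {}
        self.fenced = set()

    def fence(self, node):
        """Cut the node off from the storage (stands in for STONITH, FC zoning, ...)."""
        self.fenced.add(node)

    def write(self, node, block, data):
        if node in self.fenced:
            raise PermissionError(f"node {node} is fenced")
        self.blocks[block] = data


def live_hang_scenario(use_fencing: bool) -> str:
    disk = SharedStorage()
    # Step 1: node A hangs with a write to the shared file system still pending.
    pending = ("A", 0, "stale data from A")
    # Step 2: nodes B and C decide A is dead and recover its resources.
    if use_fencing:
        disk.fence("A")
    disk.write("B", 0, "recovered by B")
    # Steps 3 and 4: A resumes and tries to complete its pending I/O.
    try:
        disk.write(*pending)
    except PermissionError:
        pass  # fencing stopped step 4
    return disk.blocks[0]


if __name__ == "__main__":
    print("without fencing:", live_hang_scenario(False))
    print("with fencing:   ", live_hang_scenario(True))
```

Without fencing the stale write from node A lands on top of the recovered state; with fencing it is rejected, which is the guarantee the quoted mail asks of any fencing method.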
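yftty replied on 2005-10-21 21:22:40
Sign-up for the Google search technology exchange meeting
http://bbs.chinaunix.net/forum/viewtopic.php?t=631871
Please send comments and suggestions on the content of this meeting, or topics you would like added, so that we can learn more, at the technical level, about this front-runner of the Internet and indeed of the whole computer industry.
Note: this meeting is a grassroots event and its content is publicly available information from the web. If anything conflicts with matters concerning an organization or individual, please contact me immediately and I will revise it accordingly.
yftty replied on 2005-11-08 10:30:10
The discussion is moving to www.googlefs.com from now on. Please read this first: http://bbs.chinaunix.net/viewthread.php?tid=643771&extra=page%3D1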
yftty replied on 2005-12-05 21:21:33
http://channel9.msdn.com/Showpost.aspx?postid=142120
A journaling file system will ensure that individual metadata changes are atomic (such as creating directory entries, renaming files, etc.).  It only applies to one change.  If your system crashes, the file structure itself will be correct, but the data in your files might not be.
A logging file system (terminology is sometimes different), both the metadata and the actual data are journaled.  This allows it to make sure the contents of the file are correct, but it still only works on one operation.
A transactional file system (AFAICT for Vista) will do both data and metadata over many modifications.  So, you can create and modify many files atomically (either everything is done or nothing is done).
It's nice to see Microsoft moving us out of the stone age.  Trying to write safe file-system modification routines on most file systems is tricky (using atomic rename tricks and such) and requires a lot of "check for cleanup" code during startup if you want to handle it properly.  I work on systems that cannot fail (embedded systems where nobody can administrate them), and would really love something like this.
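The "atomic rename trick" mentioned in the quoted post can be sketched as below for a single file. This is a minimal example, assuming a POSIX-style file system where a rename within one directory replaces the target atomically; the function name atomic_write is hypothetical. It covers one file only, which is exactly the limitation the post describes: updating several files atomically needs a transactional file system or extra cleanup logic.

```python
import os
import tempfile


def atomic_write(path: str, data: bytes) -> None:
    """Replace the contents of path so readers see the old or the new file, never a mix."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temporary file must live in the same directory (same file system) as the target.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())   # push the data to disk before the rename
        os.replace(tmp_path, path)   # the atomic step
    except BaseException:
        # Clean up the leftover temporary file if anything failed before the rename.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise


if __name__ == "__main__":
    workdir = tempfile.mkdtemp()
    target = os.path.join(workdir, "settings.conf")
    atomic_write(target, b"version=1\n")
    atomic_write(target, b"version=2\n")
    with open(target, "rb") as f:
        print(f.read())
```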
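veiven26 replied on 2006-02-08 10:53:28
Showing my support first. What would I need to learn before I could take part in this project?
yftty replied on 2006-02-08 23:44:23
Quote: originally posted by veiven26 on 2006-2-8 10:53:
Showing my support first. What would I need to learn before I could take part in this project?
Have a look at MogileFS (http://www.danga.com/mogilefs/) and FUSE (http://fuse.sourceforge.net/), plus sockets, RPC, multi-threaded programming, etc.
veiven26 replied on 2006-02-14 23:13:12
I have seen the website mentioned above.
[ Last edited by veiven26 on 2006-2-15 16:51 ]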
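shellcode replied on 2006-04-03 11:05:29
[2006/03/21]
I have been thinking along these lines: suppose I am given 100 ordinary servers, each with 2 CPUs, 2G of memory and two 143G
disks. Used the ordinary way, 100 servers are simply 100 separate machines. Looked at from another angle,
in terms of the resources they can provide:
1. Storage: 200 disks of 143G, about 28T in total
2. Memory: 200G in total
3. CPU: 200 CPUs
Judging from how we use servers today, many of them are underused: on some the CPU
sits largely idle, on some the memory, on some the storage. To get the most out of the investment,
we first need a distributed architecture. That is, through some development work of our own, we treat the disks of the 100
servers as one disk, the 200G of memory as one pool, and the 200 CPUs as one
unit. We can then regard these 100 servers as one dedicated group providing three kinds of
service, so that every kind of resource is put to full use, overall investment drops, and resource utilization
(reuse) goes up.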
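The pooled totals in this thought experiment are plain multiplication, shown here as a small sketch. The per-server figures come from the post; the function name pooled is hypothetical. With 2G of memory per server, 100 servers give 200G in total, and 200 disks of 143G give 28600G, roughly 28T, of raw capacity before any replication.

```python
SERVERS = 100
CPUS_PER_SERVER = 2
MEM_GB_PER_SERVER = 2
DISKS_PER_SERVER = 2
DISK_GB = 143


def pooled(servers: int = SERVERS) -> dict:
    """Aggregate resources if all servers are treated as one pool."""
    return {
        "cpus": servers * CPUS_PER_SERVER,
        "memory_gb": servers * MEM_GB_PER_SERVER,
        "disks": servers * DISKS_PER_SERVER,
        "raw_storage_gb": servers * DISKS_PER_SERVER * DISK_GB,
    }


if __name__ == "__main__":
    for name, value in pooled().items():
        print(f"{name}: {value}")
```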
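I suspect this is what Google does. Otherwise, if they built their storage out of a large number of servers, would
the CPUs and memory on those servers sit wholly or partly idle? Amazon now offers this kind of massive
storage service as well; could that storage be a way of exploiting their idle server resources?
[2006/03/30]
Today I read several more Google technical papers, which further confirmed my earlier guess that Google uses its many servers for both storage
and computation. They really are using distributed computing more and more.
The design of BigTable in particular inspired me a great deal: on top of these technologies one can
build enormous computing capacity and enormous storage without any visible performance bottleneck. The only
bottleneck left is electrical power.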
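From the technical material Google has disclosed so far, we can see the following technologies:
BigTable:  a memory-based distributed database, built on GoogleFS and the following 2 technologies
Cluster Scheduling Master:
monitors cluster nodes and handles errors; this system appears in the description of BigTable.
Lock service:
a system that appears in the description of BigTable; it manages the metadata in BigTable
Sawzall:   an interpreted language for distributed data processing, built on the following technologies:
http://labs.google.com/papers/sawzall.html
Protocol Buffers:
similar to XML but somewhat more complex; it describes and organizes binary data formats.
WorkQueue: a distributed task scheduling queue; it schedules computing tasks, allocates resources, reports and collects
status, and it can place a computation on the nodes that store the data, avoiding data transfer.
MapReduce: a library called by applications, built on WorkQueue; it does 3 things:
provides a model for parallel data processing,
isolates applications from the underlying distributed machinery, such as data distribution, scheduling and fault tolerance,
places computation as far as possible on the nodes that hold the GFS data, to reduce network load.
http://labs.google.com/papers/mapreduce.html
GoogleFS:  Google's distributed file system; data is spread over thousands of ordinary PC servers, each
server stores data in 64M chunks, every piece of data has replicas on at least 3
machines, and large amounts of text data are stored compressed.
http://labs.google.com/papers/gfs.html
http://labs.google.com/papers/googlecluster.html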
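The MapReduce programming model listed above can be illustrated with a single-process word count. This is only a sketch of the map, shuffle and reduce steps under simplifying assumptions: it runs in one Python process, has no distribution, scheduling or fault tolerance, and all function names are hypothetical rather than Google's API.

```python
from collections import defaultdict


def map_words(document: str):
    """Map step: emit a (word, 1) pair for every word in one document."""
    for word in document.split():
        yield word.lower(), 1


def shuffle(pairs):
    """Shuffle step: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(key, values):
    """Reduce step: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    pairs = (pair for doc in documents for pair in map_words(doc))
    return dict(reduce_counts(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["GFS stores chunks", "chunks of GFS"]))
```

In a real deployment the map calls would run on the machines that already hold the input data, which is the locality point made in the list above; here everything runs locally for clarity.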
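With these technologies Google can naturally squeeze the most out of every server. Put simply,
taking the same 100 servers, they can use all the disks as one (GoogleFS), the 200G of memory
as one pool (BigTable), and the 200 CPUs as one unit (MapReduce),
with WorkQueue scheduling the tasks and Sawzall used to write the distributed programs that draw on these resources.
Seen this way, the whole cluster is effectively a supercomputer.
So they can buy large numbers of cheap PCs and use them for both storage and computation, and those PCs can be very
low-cost. You can work out the exact prices yourself; in short, with such cheap PCs we can build a storage system whose cost per T
matches that of a hardware NAS. The difference is that this storage system also comes with substantial computing
power and a large amount of usable memory, but it consumes more electricity and more IDC resources.
My worry is that while technology can push the cost of storage and computation very low, the cost of IDC space will
multiply as machines are added.
Why did Google gradually arrive at such an architecture? Was it designed this way from the start, or did it evolve?
Does their model suit us?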
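I do not know the answer either; clearing up the following questions might lead to one:
1. IDC resources in China are very scarce, so renting IDC space is not necessarily cheaper than buying servers, and
a large number of servers consumes a lot of electricity.
2. Do we really have that much computation to do? Perhaps only the search service involves heavy computation;
the processing load of other applications is not very large.
3. If we build a service cluster out of many cheap PCs, the storage can probably be fully used, but can the CPUs and
memory be fully used as well? Perhaps the memory can, but not necessarily all the CPUs.
4. How much would it cost to develop the related software and applications?
5. If I use many cheap PCs for distributed storage, is it feasible to use them as Web front ends at the same time?
I think this may be feasible.
6. Designing such a cluster purely for storage does not seem worthwhile, but once the computing
power and memory it brings are counted in, it seems worth it.
7. And others...
Is Google's model really low-cost? Assuming they can make full use of all the resources, do they
really have enough computation to need that many CPUs?
When I first read Google's GFS paper and the search cluster paper, I simply assumed that their cluster mainly relied on
load balancing across many servers, but it now appears they rely much more on distributed and parallel processing.
I think Google's line of business means their distributed computing model is not complicated, because most of it is
text processing.
References: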
http://homepages.inf.ed.ac.uk/mic/Skeletons/
http://www.cnblogs.com/tianchunfeng/archive/2005/03/17/120722.html
http://wikipedia.cnblog.org/wiki/Google
http://wikipedia.cnblog.org/wiki/Google_File_System
http://wikipedia.cnblog.org/wiki/MapReduce
http://labs.google.com/papers/mapreduce.html
http://labs.google.com/papers/sawzall.html
