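I have been writing since March 2010 simply because I want to write: to set down my observations and reflections on society, the humanities, technology, industry, and education. Each time I write down what I am thinking at that moment, my head seems a little clearer and my mind a little lighter. Most posts are shared with friends on Facebook first and then reposted here. Facebook: https://www.facebook.com/shihhaohung
Tuesday, May 25, 2010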
Latest SSL acceleration paper
Mohamed Khalil-Hani, Vishnu P. Nambiar, M. N. Marsono, "Hardware
Acceleration of OpenSSL Cryptographic Functions for High-Performance
Internet Security," isms, pp.374-379, 2010 International Conference on
Intelligent Systems, Modelling and Simulation, 2010
Tuesday, May 18, 2010
Clouds and MapReduce for Scientific Applications
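Is cloud computing suitable for scientific and engineering computation? That is a big question, and no one can say definitively yes or no. This report from Indiana University lists some key points and is a good reference.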
Cloud computing is at the peak of the Gartner technology hype curve[2] but there are good reasons to believe that as it matures that it will not disappear into their trough of disillusionment but rather move into the plateau of productivity as have for example service oriented architectures. Clouds are driven by large commercial markets where IDC estimates that clouds will represent 14% of IT expenditure in 2012 and there is rapidly growing interest from government and industry. There are several reasons why clouds should be important for large scale scientific computing...
http://grids.ucs.indiana.edu/ptliupages/publications/CloudsandMR.pdf
CMU's Parallel Data Lab
CMU's Parallel Data Lab is well known for storage systems, and in the era of cloud computing it has also produced a good deal of solid research on cloud storage. Its performance tools in particular are well worth our attention.
Note that many of the papers appear in conferences or as technical reports, with some overlap between them. Since conferences are used as an opportunity to publicize the work, this is not surprising. CMU's emphasis on hands-on implementation is also a hallmark of this lab.
http://www.pdl.cmu.edu/index.shtml
Latest development:
- Open Cirrus: A Global Cloud Computing Testbed. Arutyun I. Avetisyan, Roy Campbell, Indranil Gupta, Michael T. Heath, Steven Y. Ko, Gregory R. Ganger, Michael A. Kozuch, David O’Hallaron, Marcel Kunze, Thomas T. Kwan, Kevin Lai, Martha Lyons, Dejan S. Milojicic, Hing Yan Lee, Ng Kwang Ming, Jing-Yuan Luke, Han Namgong, Yeng Chai Soh. IEEE Computer, April 2010.
Selected papers for performance tools:
- Visual, Log-based Causal Tracing for Performance Debugging of MapReduce Systems. Jiaqi Tan*, Soila Kavulya, Rajeev Gandhi and Priya Narasimhan. 30th IEEE International Conference on Distributed Computing Systems (ICDCS) 2010, Genoa, Italy, Jun 2010.
- An Analysis of Traces from a Production MapReduce Cluster. Soila Kavulya, Jiaqi Tan, Rajeev Gandhi and Priya Narasimhan. 10th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid 2010). May 17-20, 2010, Melbourne, Victoria, Australia. Supersedes Carnegie Mellon University Parallel Data Lab Technical Report CMU-PDL-09-107, December, 2009.
- Kahuna: Problem Diagnosis for MapReduce-Based Cloud Computing Environments. Jiaqi Tan, Xinghao Pan, Eugene Marinelli, Soila Kavulya, Rajeev Gandhi, Priya Narasimhan. Proceedings of the 12th IEEE/IFIP Network Operations and Management Symposium (NOMS) 2010, Osaka, Japan, Apr 2010.
- DiscFinder: A data-intensive scalable cluster finder for astrophysics. Bin Fu, Kai Ren, Julio López, Eugene Fink, and Garth Gibson. In Proceedings of the ACM International Symposium on High Performance Distributed Computing (HPDC), Chicago, IL. June, 2010. Supersedes Carnegie Mellon University Parallel Data Lab Technical Report CMU-PDL-10-104.
- File System Virtual Appliances: Portable File System Implementations. Michael Abd-El-Malek, Matthew Wachs, James Cipar, Karan Sanghi, Gregory R. Ganger, Garth A. Gibson, Michael K. Reiter. Carnegie Mellon University Parallel Data Lab Technical Report CMU-PDL-10-105, April 2010.
- PLFS: A Checkpoint Filesystem for Parallel Applications. John Bent, Garth Gibson, Gary Grider, Ben McClelland, Paul Nowoczynski, James Nunez, Milo Polte, Meghan Wingate. Supercomputing '09, November 15, 2009. Portland, Oregon.
- Understanding and Maturing the Data-Intensive Scalable Computing Storage Substrate. Garth Gibson, Bin Fan, Swapnil Patil, Milo Polte, Wittawat Tantisiriroj, Lin Xiao. Microsoft Research eScience Workshop 2009, Pittsburgh, PA, October 16-17, 2009.
Friday, May 7, 2010
Michael Franklin (UC Berkeley) Visiting NTU
Michael Franklin (UC Berkeley) came to NTU to give a talk. My students and I presented our work to him. His talk introduced the cloud computing research done in the RAD Lab (http://radlab.cs.berkeley.edu/).
Interesting References available: http://radlab.cs.berkeley.edu/publications
Michael focused on this paper in his talk: PIQL: A Performance Insightful Query Language For Interactive Applications (http://www.eecs.berkeley.edu/Pubs/TechRpts/2010/EECS-2010-8.pdf)
HADOOP discussion found on the web
HADOOP discussion
http://cloudcamp.pbworks.com/Hadoop
A room of attendees (est. > 20) joined this discussion on Hadoop. Two people had extensive discussions: Chris Wensel (Concurrent, Inc.) and Adrian Cockcroft (Netflix).
Summary of topics:
1. HDFS
a. Designed for write-once, read-many (more appropriate for applications depending on this characteristic)
b. Data is accessed serially
c. Access has latency
d. Multiple Hadoop instances lose seek time
e. Hadoop is not a transaction system
2. Virtualization
a. Few companies are doing Hadoop-level virtualization (no data on who is doing this)
3. MAP-REDUCE
a. To use Hadoop, you need to program in map-reduce, which might not be easy; may need 2-3 people to write in map-reduce
b. Also generates lots of data
c. Concurrent offers an abstraction on top of map-reduce so programmers can think in terms of streams and can now use Groovy or Jython; it also offers options to deal with node failure (whether to continue or fail immediately); also compatible with PIG
d. One Hadoop implementation does not have just one map-reduce program, but 5, 10, or 20 such programs to manage
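To make "programming in map-reduce" concrete, here is a minimal sketch of the model in plain Python. It does not use Hadoop or any Hadoop API; the function names (map_phase, shuffle, reduce_phase, word_count) are hypothetical and chosen only for illustration. It shows the three steps a programmer has to think in: emit key/value pairs, group them by key, then reduce each group.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Mapper: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group values by key, standing in for the framework's shuffle step."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reducer: sum the counts collected for one word."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence on an in-memory input."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    # Hypothetical log lines, used only as sample input.
    logs = ["GET /index.html", "GET /about.html", "POST /login"]
    print(word_count(logs))
```

Even this toy job needs three cooperating functions, which hints at why a real deployment ends up with many such programs to manage.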
4. Database on Hadoop?
a. PIG
b. Facebook HIVE
c. Business.com CloudBase
d. Endpoint reducer writes to the file system; people are loading HBASE that way
e. HBASE is a big-table implementation in Apache (built by Powerset; Powerset was purchased by Microsoft; Powerset went dark for a while, but is now back)
5. Running on Amazon
a. Amazon is opaque: not much control over servers, so the rack-awareness feature in Hadoop is not useful
6. Performance
a. Block size at 128MB (Yahoo, Facebook)
b. Rack-awareness policy is fairly static; build tree of data center to rack to …
c. EC2's biggest problem – data is stored in S3 in small chunks, which could impact HDFS block sizes
d. Some people are using HDFS rather than S3 on 1000-node EC2
e. Hadoop usage offers infinite file size, caching is setting attributes, paying additional penalty with preloading
f. Data replication of name node is an issue – higher replication -> increases latency; replication is pipelined
g. HDFS does not work well with lots of small files (name node gets crushed)
h. New Hadoop has a way to bundle a bunch of small files into one file and still have a way to seek into them to find the files
i. It takes a minute to start or stop Hadoop, as there is a high latency to starting JVMs – “If the time to start a JVM is significant for your application, then your problem is not big enough to use Hadoop”
j. Gating factor is bandwidth - it gets I/O bound (even with data locality) because replication still has to be done
k. Scales linearly as you increase in size
l. Set number of map and reduce tasks higher than number of cores
m. At the map-reduce levels, several schedulers / strategies are available
n. A big issue is run-time monitoring to avoid the current situation of idle nodes in a Hadoop system (e.g. a reduce task could be waiting for a map task, but sitting idle)
o. George Porter (Sun Labs) is working on sensing Xtrace to do run-time monitoring and accumulate traces to optimize (you do the optimization). Could be most interesting for debugging Hadoop applications. Would not know about CPU utilization (that's where Ganglia data come in)
p. SSD could make sense for the MAP phase
q. Currently no data (simulation or otherwise) on where to partition compute vs disk vs I/O
r. A supercomputer is a computer that turns a compute-bound problem into an I/O-bound problem
7. Usage of Hadoop
a. Log analysis
b. Powerset search
c. Metaweb uses Hadoop (heavy Python user)
d. Hadoop is pluggable (KFS, …)
e. Using Hypertable
f. Companies running 1K nodes - primarily using it for storage (likely using Amazon small nodes - cost is low)
g. Google sponsors an approximately 1,065-node cluster (hosted at IBM) for research purposes. Given Hadoop job, many tenants (universities) already… You ask for the number of nodes you want (e.g. Maryland asks for 40 nodes); the number of universities that get access to it is controlled. The number of nodes you can get is not stable: one day you could get 200 nodes, another day 500 nodes. The problem is how you split up your data, since you don't move it around.
8. Hadoop single point of failure
a. Hadoop name node is a single node, with no clustering; the name node holds all metadata
b. Recovering from a failure could take up to 30 minutes for a system with 100 million files
c. A lot of people want redundant name nodes, but no one is working on it
d. No logging in HDFS
e. Concurrent using THRIFT?, which is type safe, … THRIFT? is a great archival system
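The notes above say that the name node holds all metadata, that blocks are 128 MB, and that lots of small files crush the name node. A rough back-of-the-envelope sketch shows why. It assumes, purely for illustration, that the name node tracks one entry per file plus one entry per block; this is a simplification and not HDFS's actual accounting. The function names are hypothetical.

```python
BLOCK_SIZE = 128 * 1024 ** 2  # 128 MB, the block size cited in the notes


def blocks_for_file(size_bytes, block_size=BLOCK_SIZE):
    """Number of blocks needed for one file (integer ceiling division)."""
    return -(-size_bytes // block_size)


def namenode_objects(file_sizes, block_size=BLOCK_SIZE):
    """Entries to track under the simplifying assumption of
    one entry per file plus one entry per block."""
    return sum(1 + blocks_for_file(size, block_size) for size in file_sizes)


if __name__ == "__main__":
    one_tib = 1024 ** 4
    one_mib = 1024 ** 2

    # The same 1 TiB of data, stored two different ways.
    single_large_file = [one_tib]
    many_small_files = [one_mib] * (one_tib // one_mib)

    print("one 1 TiB file:  ", namenode_objects(single_large_file))
    print("1 MiB files only:", namenode_objects(many_small_files))
```

Under this assumption the single large file costs 8,193 entries, while the same data split into 1 MiB files costs 2,097,152, about 256 times as many. This is the motivation for bundling small files into one larger file.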
Sunday, May 2, 2010
IBM summer intern program
The job is in Nankang Software Park (南港軟體園區). Let me know if you are interested. I can direct you to my contact person at IBM...
2010 Blue Gene (Internship Program)
Join IBM and spend your summer with global talent! IBM provides not only an excellent working environment, but also fantastic training for your future development.
1. Program Period : 1st Jul. – 31st Aug. of 2010 (up to 9 months)
2. Documents required to apply
1) Resume.
2) Autobiography
3) Transcript.
3. Selection Process
1) Paper Test : 30th Apr. – 28th May of 2010.
2) Interview : 31st May – 21st Jun. of 2010.
* Job Category: software engineer, hardware engineer, supply chain assistant, business consultant.
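2010 Blue Gene Summer Internship Program
We invite you to join IBM and spend this summer alongside outstanding talent from around the world! We offer not only the best working environment but also excellent training courses. Let IBM rock your summer!
1. Program period: July 1 – August 31 (up to nine months)
2. Documents required: 1) Resume 2) Autobiography 3) Transcript
3. Selection process
1) Written test: April 30 - May 28
2) Interview: May 31 - June 21
* Job categories: software engineer, hardware engineer, engineering assistant, technology consultant.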