Hadoop Study Notes

Part 1: Hadoop Study Notes

1. FileInputFormat splits only large files. Here “large” means larger than an HDFS block. The split size is normally the size of an HDFS block, which is appropriate for most applications; however, it is possible to control this value by setting various Hadoop properties.

2. So by default the split size is blockSize.

3. Making the minimum split size greater than the block size increases the split size, but at the cost of locality.

4. Hadoop works better with a small number of large files than with a large number of small files. One reason for this is that FileInputFormat generates splits in such a way that each split is all or part of a single file. If the files are very small (“small” means significantly smaller than an HDFS block) and there are a lot of them, then each map task will process very little input, and there will be a lot of map tasks (one per file), each of which imposes extra bookkeeping overhead.

Hadoop handles large numbers of small data files poorly:

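Hadoop processes data block by block, and by default a block is 64 MB. If there are many small files (for example, files of 2-3 MB each), every one of them is far smaller than a block, yet each still has to be handled as a block of its own.

This has two consequences:

1. Storing a large number of small files takes up storage space, so storage is inefficient and lookups are slower than with large files.

2. During MapReduce computation these small files waste computing capacity, because map tasks are assigned per block by default. (This is probably the main drawback of small files.)

So how can this problem be solved?

1. Use the HAR (Hadoop archive) files that Hadoop provides; the Hadoop command manual describes how to archive small files.

2. Preprocess the data yourself, combining a number of small files into large files of more than 64 MB.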
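To make the cost of small files concrete, here is a rough sketch in plain Java (no Hadoop dependency). It assumes the split-size rule usually given for FileInputFormat, max(minimumSize, min(maximumSize, blockSize)), and counts splits by simple ceiling division, ignoring the small slack Hadoop allows on the last split of a file. The class and method names are made up for illustration.

```java
import java.util.Arrays;

class SplitCountSketch {
    static final long MB = 1024L * 1024L;

    // Assumed split-size rule: max(minimumSize, min(maximumSize, blockSize)).
    static long computeSplitSize(long minSize, long maxSize, long blockSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    // Each split is all or part of a single file, so every non-empty file
    // yields at least one split, and each split becomes one map task.
    // Simplification: plain ceiling division per file.
    static long countSplits(long[] fileSizes, long splitSize) {
        long splits = 0;
        for (long size : fileSizes) {
            splits += (size + splitSize - 1) / splitSize;
        }
        return splits;
    }

    public static void main(String[] args) {
        long splitSize = computeSplitSize(1, Long.MAX_VALUE, 64 * MB);

        long[] smallFiles = new long[1000];   // 1000 files of 3 MB each
        Arrays.fill(smallFiles, 3 * MB);
        long[] mergedFile = { 3000 * MB };    // the same data as one file

        System.out.println("small files: " + countSplits(smallFiles, splitSize) + " map tasks");
        System.out.println("merged file: " + countSplits(mergedFile, splitSize) + " map tasks");
    }
}
```

Under these assumptions the same 3000 MB of data needs 1000 map tasks as small files but only 47 as a single merged file, which is why archiving or merging small files pays off.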

FileInputFormat is the base class for all implementations of InputFormat that use files as their data source (see Figure 7-2). It provides two things: a place to define which files are included as the input to a job, and an implementation for generating splits for the input files. The job of dividing splits into records is performed by subclasses.

An InputSplit has a length in bytes, and a set of storage locations, which are just hostname strings. Notice that a split doesn’t contain the input data; it is just a reference to the data. As a MapReduce application writer, you don’t need to deal with InputSplits directly, as they are created by an InputFormat. An InputFormat is responsible for creating the input splits, and dividing them into records.

Before we see some concrete examples of InputFormat, let’s briefly examine how it is used in MapReduce. Here’s the interface:

    public interface InputFormat<K, V> {
      InputSplit[] getSplits(JobConf job, int numSplits) throws IOException;

      RecordReader<K, V> getRecordReader(InputSplit split,
                                         JobConf job,
                                         Reporter reporter) throws IOException;
    }

The JobClient calls the getSplits() method. On a tasktracker, the map task passes the split to the getRecordReader() method on InputFormat to obtain a RecordReader for that split.

A related requirement that sometimes crops up is for mappers to have access to the full contents of a file. Not splitting the file gets you part of the way there, but you also need to have a RecordReader that delivers the file contents as the value of the record.

Example 7-2. An InputFormat for reading a whole file as a record

    import java.io.IOException;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.mapred.*;

    public class WholeFileInputFormat
        extends FileInputFormat<NullWritable, BytesWritable> {

      // Never split: each file must reach a single mapper in one piece.
      @Override
      protected boolean isSplitable(FileSystem fs, Path filename) {
        return false;
      }

      @Override
      public RecordReader<NullWritable, BytesWritable> getRecordReader(
          InputSplit split, JobConf job, Reporter reporter) throws IOException {
        return new WholeFileRecordReader((FileSplit) split, job);
      }
    }

We implement getRecordReader() to return a custom implementation of RecordReader.

Example 7-3. The RecordReader used by WholeFileInputFormat for reading a whole file as a record

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.IOUtils;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.mapred.FileSplit;
    import org.apache.hadoop.mapred.RecordReader;

    class WholeFileRecordReader
        implements RecordReader<NullWritable, BytesWritable> {

      private FileSplit fileSplit;
      private Configuration conf;
      private boolean processed = false;   // true once the single record has been returned

      public WholeFileRecordReader(FileSplit fileSplit, Configuration conf)
          throws IOException {
        this.fileSplit = fileSplit;
        this.conf = conf;
      }

      @Override
      public NullWritable createKey() {
        return NullWritable.get();
      }

      @Override
      public BytesWritable createValue() {
        return new BytesWritable();
      }

      @Override
      public long getPos() throws IOException {
        return processed ? fileSplit.getLength() : 0;
      }

      @Override
      public float getProgress() throws IOException {
        return processed ? 1.0f : 0.0f;
      }

      // Reads the whole file into the value on the first call; returns false afterwards.
      @Override
      public boolean next(NullWritable key, BytesWritable value) throws IOException {
        if (!processed) {
          byte[] contents = new byte[(int) fileSplit.getLength()];
          Path file = fileSplit.getPath();
          FileSystem fs = file.getFileSystem(conf);
          FSDataInputStream in = null;
          try {
            in = fs.open(file);
            IOUtils.readFully(in, contents, 0, contents.length);
            value.set(contents, 0, contents.length);
          } finally {
            IOUtils.closeStream(in);
          }
          processed = true;
          return true;
        }
        return false;
      }

      @Override
      public void close() throws IOException {
        // do nothing
      }
    }
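The one-record-per-file contract of the reader above (return true once, then false) can be tried without a Hadoop installation. The following is only a sketch under that assumption: it uses java.nio instead of Hadoop's FileSystem, and WholeFileReaderSketch and its demo() helper are hypothetical names, not Hadoop classes.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class WholeFileReaderSketch {
    private final Path file;
    private boolean processed = false;   // same role as the flag in WholeFileRecordReader
    private byte[] value = new byte[0];

    WholeFileReaderSketch(Path file) {
        this.file = file;
    }

    // Returns true exactly once, after loading the entire file as the record value.
    boolean next() {
        if (processed) {
            return false;
        }
        try {
            value = Files.readAllBytes(file);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        processed = true;
        return true;
    }

    byte[] getValue() {
        return value;
    }

    float getProgress() {
        return processed ? 1.0f : 0.0f;
    }

    // Writes a small temporary file, reads it back, and returns
    // {number of records seen, size in bytes of the record}.
    static int[] demo() {
        try {
            Path tmp = Files.createTempFile("whole-file", ".txt");
            try {
                Files.write(tmp, "hello hadoop".getBytes(StandardCharsets.UTF_8));
                WholeFileReaderSketch reader = new WholeFileReaderSketch(tmp);
                int records = 0;
                while (reader.next()) {
                    records++;
                }
                return new int[] { records, reader.getValue().length };
            } finally {
                Files.delete(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        int[] result = demo();
        System.out.println(result[0] + " record, " + result[1] + " bytes");
    }
}
```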

Input splits are represented by the Java interface InputSplit (which, like all of the classes mentioned in this section, is in the org.apache.hadoop.mapred package):

    public interface InputSplit extends Writable {
      long getLength() throws IOException;
      String[] getLocations() throws IOException;
    }

The storage locations of a split are used by the MapReduce system to place map tasks as close to the split’s data as possible, and the size is used to order the splits so that the largest get processed first, in an attempt to minimize the job runtime (this is an instance of a greedy approximation algorithm).
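The largest-first ordering of splits can be sketched in a few lines of plain Java. This is an illustration only: Split is a hypothetical stand-in for InputSplit (a length plus hostname strings, no data), and the sort is not taken from Hadoop's source.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class SplitOrderSketch {
    // Hypothetical stand-in for InputSplit: a reference to data, not the data itself.
    static final class Split {
        final long length;
        final String[] locations;

        Split(long length, String... locations) {
            this.length = length;
            this.locations = locations;
        }
    }

    // Greedy heuristic: hand out the largest splits first so that the
    // long-running map tasks start as early as possible.
    static List<Split> largestFirst(List<Split> splits) {
        List<Split> ordered = new ArrayList<>(splits);
        ordered.sort(Comparator.comparingLong((Split s) -> s.length).reversed());
        return ordered;
    }

    public static void main(String[] args) {
        List<Split> splits = Arrays.asList(
            new Split(10, "host-a"),
            new Split(30, "host-b", "host-c"),
            new Split(20, "host-a"));
        for (Split s : largestFirst(splits)) {
            System.out.println(s.length + " bytes on " + Arrays.toString(s.locations));
        }
    }
}
```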

Having calculated the splits, the client sends them to the jobtracker, which uses their storage locations to schedule map tasks to process them on the tasktrackers.

A path may represent a file, a directory, or, by using a glob, a collection of files and directories. A path representing a directory includes all the files in the directory as input to the job. See “File patterns” on page 60 for more on using globs.

It is a common requirement to process sets of files in a single operation. For example, a MapReduce job for log processing might analyze a month’s worth of files, contained in a number of directories. Rather than having to enumerate each file and directory to specify the input, it is convenient to use wildcard characters to match multiple files with a single expression, an operation that is known as globbing. Hadoop provides two FileSystem methods for processing globs:

    public FileStatus[] globStatus(Path pathPattern) throws IOException
    public FileStatus[] globStatus(Path pathPattern, PathFilter filter) throws IOException
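To get a feel for glob patterns without a cluster, the JDK's own glob matcher can serve as a stand-in. This is only an approximation: java.nio's glob syntax is similar to, but not guaranteed identical with, the one globStatus() accepts, and the date-style directory layout below is an assumed example.

```java
import java.nio.file.FileSystems;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;

class GlobSketch {
    // Matches a path against a glob using the JDK matcher (not Hadoop's).
    static boolean matches(String glob, String path) {
        PathMatcher matcher = FileSystems.getDefault().getPathMatcher("glob:" + glob);
        return matcher.matches(Paths.get(path));
    }

    public static void main(String[] args) {
        // Assumed layout: year/month/day directories of log files.
        String[] paths = { "2007/12/30", "2007/12/31", "2008/01/01" };
        String[] globs = { "2007/*/*", "200?/*/*", "*/12/*", "200[78]/01/*" };
        for (String glob : globs) {
            for (String path : paths) {
                System.out.println(glob + " vs " + path + ": " + matches(glob, path));
            }
        }
    }
}
```

A single `*` matches within one path component only, so one expression such as `2007/*/*` selects a whole year of daily directories without listing them.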

Part 2: An Analysis of the Hadoop JobTracker

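1. After the client has set the job's configuration parameters, it calls job.waitForCompletion(true), which submits the job to the JobTracker through submit() and then waits for the job to finish.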

    public void submit() throws IOException, InterruptedException,
                                ClassNotFoundException {
      ensureState(JobState.DEFINE);   // check the job state
      setUseNewAPI();                 // check and set whether the new MapReduce API is used

      // Connect to the JobTracker and submit the job
      connect();                                   // connect to the JobTracker
      info = jobClient.submitJobInternal(conf);    // submit the job information
      super.setJobID(info.getID());
      state = JobState.RUNNING;                    // update the job state
    }

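The code above does two main things: it connects to the JobTracker and it submits the job information. The connect method mainly instantiates a JobClient object, which includes setting the JobConf and running init: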

    public void init(JobConf conf) throws IOException {
      // Read the configuration to decide whether the job runs in
      // local (single-machine) mode or in distributed mode
      String tracker = conf.get("mapred.job.tracker", "local");
      tasklogtimeout = conf.getInt(
          TASKLOG_PULL_TIMEOUT_KEY, DEFAULT_TASKLOG_TIMEOUT);
      this.ugi = UserGroupInformation.getCurrentUser();
      if ("local".equals(tracker)) {   // local mode: create a LocalJobRunner
        conf.setNumMapTasks(1);
        this.jobSubmitClient = new LocalJobRunner(conf);
      } else {
        this.jobSubmitClient = createRPCProxy(JobTracker.getAddress(conf), conf);
      }
    }
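The two ideas in submit() and init(), a state check before submission and a local-versus-remote decision driven by mapred.job.tracker, can be sketched in plain Java as follows. Everything here is a simplified stand-in: the Map replaces JobConf, the runner is just a descriptive string, and JobSubmitSketch is a hypothetical class rather than Hadoop's Job or JobClient.

```java
import java.util.HashMap;
import java.util.Map;

class JobSubmitSketch {
    enum JobState { DEFINE, RUNNING }

    private JobState state = JobState.DEFINE;
    private final Map<String, String> conf;   // stand-in for JobConf
    private String runner;                    // stand-in for jobSubmitClient

    JobSubmitSketch(Map<String, String> conf) {
        this.conf = conf;
    }

    // Refuses to proceed unless the job is in the expected state.
    private void ensureState(JobState expected) {
        if (state != expected) {
            throw new IllegalStateException(
                "Job in state " + state + " instead of " + expected);
        }
    }

    // Mirrors init(): "local" (the default) selects an in-process runner,
    // anything else is treated as the address of a remote JobTracker.
    private void connect() {
        String tracker = conf.getOrDefault("mapred.job.tracker", "local");
        runner = "local".equals(tracker)
            ? "LocalJobRunner"
            : "RPC proxy to " + tracker;
    }

    void submit() {
        ensureState(JobState.DEFINE);
        connect();
        state = JobState.RUNNING;
    }

    String getRunner() {
        return runner;
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        conf.put("mapred.job.tracker", "jt.example.com:9001");   // assumed address
        JobSubmitSketch job = new JobSubmitSketch(conf);
        job.submit();
        System.out.println(job.getRunner());
    }
}
```

Calling submit() a second time fails the state check, which is the point of ensureState: a job can only be submitted while it is still being defined.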

In distributed mode, an RPC proxy connection is created:

    public static VersionedProtocol getProxy(
        Class protocol,
        long clientVersion, InetSocketAddress addr, UserGroupInformation ticket,
        Configuration conf, SocketFactory factory, int rpcTimeout) throws IOException {

      if (UserGroupInformation.isSecurityEnabled()) {
        SaslRpcServer.init(conf);
      }
      VersionedProtocol proxy =
          (VersionedProtocol) Proxy.newProxyInstance(
              protocol.getClassLoader(), new Class[] { protocol },
              new Invoker(protocol, addr, ticket, conf, factory, rpcTimeout));
      long serverVersion = proxy.getProtocolVersion(protocol.getName(),
                                                    clientVersion);
      if (serverVersion == clientVersion) {
        return proxy;
      } else {
        throw new VersionMismatch(protocol.getName(), clientVersion,
                                  serverVersion);
      }
    }

As the code above shows, Hadoop actually uses Java's built-in Proxy API to implement remote procedure calls (RPC). Once initialization is complete, the job has to be submitted:

    info = jobClient.submitJobInternal(conf);   // submit the job information
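The dynamic-proxy technique behind getProxy() can be tried in isolation with java.lang.reflect.Proxy. The sketch below is an assumption-laden miniature: EchoProtocol is a made-up interface (Hadoop's real one is VersionedProtocol), and the handler answers calls locally where Hadoop's Invoker would send them over the network.

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.Proxy;

class RpcProxySketch {
    // Hypothetical protocol interface, used only for this illustration.
    interface EchoProtocol {
        long getProtocolVersion();
        String echo(String message);
    }

    // Builds a proxy whose method calls are routed through one handler,
    // then verifies the protocol version as getProxy() does.
    static EchoProtocol getProxy(long clientVersion, final long serverVersion) {
        // Stand-in for the network Invoker: dispatches by method name locally.
        InvocationHandler invoker = (self, method, args) -> {
            switch (method.getName()) {
                case "getProtocolVersion":
                    return serverVersion;
                case "echo":
                    return "server says: " + args[0];
                default:
                    throw new UnsupportedOperationException(method.getName());
            }
        };
        EchoProtocol proxy = (EchoProtocol) Proxy.newProxyInstance(
            EchoProtocol.class.getClassLoader(),
            new Class<?>[] { EchoProtocol.class },
            invoker);
        long reported = proxy.getProtocolVersion();
        if (reported != clientVersion) {
            throw new IllegalStateException(
                "version mismatch: client " + clientVersion + ", server " + reported);
        }
        return proxy;
    }

    public static void main(String[] args) {
        EchoProtocol client = getProxy(1L, 1L);
        System.out.println(client.echo("submit job"));
    }
}
```

The caller only ever sees the interface; whether the call is served locally or remotely is hidden inside the handler, which is what lets JobClient talk to a LocalJobRunner or a remote JobTracker through the same reference.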

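The submit method does the following:

1. Replaces the directory names in conf with the names used by the HDFS proxy.

2. Checks that the output is valid, for example whether the path already exists and whether it has been specified explicitly.

3. Divides the data into multiple splits, places them on HDFS, and writes the job.xml file.

4. Calls the JobTracker's submitJob method.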

That method mainly creates a new JobInProgress object, then checks whether the access permissions and system parameters satisfy the job, and finally calls addJob:

    private synchronized JobStatus addJob(JobID jobId, JobInProgress job)
        throws IOException {
      totalSubmissions++;

      synchronized (jobs) {
        synchronized (taskScheduler) {
          jobs.put(job.getProfile().getJobID(), job);
          for (JobInProgressListener listener : jobInProgressListeners) {
            listener.jobAdded(job);
          }
        }
      }
      myInstrumentation.submitJob(job.getJobConf(), jobId);
      job.getQueueMetrics().submitJob(job.getJobConf(), jobId);

      LOG.info("Job " + jobId + " added successfully for user '"
               + job.getJobConf().getUser() + "' to queue '"
               + job.getJobConf().getQueueName() + "'");
      AuditLogger.logSuccess(job.getUser(),
          Operation.SUBMIT_JOB.name(), jobId.toString());
      return job.getStatus();
    }

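totalSubmissions records how many jobs clients have submitted to the JobTracker, and jobs is the map of all jobs that the JobTracker manages:

    Map<JobID, JobInProgress> jobs =
        Collections.synchronizedMap(new TreeMap<JobID, JobInProgress>());

taskScheduler implements the policy that decides the order in which jobs run. [Figure: class diagram of TaskScheduler]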

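Hadoop job scheduling mechanisms:

    public enum SchedulingMode { FAIR, FIFO }

1. Fair scheduling (FairScheduler): distributed resources are shared fairly among users, and each user has a job pool. If one user currently holds a large share of the resources, which is unfair to the other users, the scheduler kills some of that user's tasks to free resources for others.

2. Capacity scheduling (CapacityTaskScheduler; note that JobQueueTaskScheduler is the default FIFO scheduler, not the capacity scheduler): the system maintains multiple queues, each with a given capacity, and the jobs within each queue are scheduled FIFO. Queues can contain other queues.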

Both schedulers must implement TaskScheduler's public synchronized List<Task> assignTasks(TaskTracker tracker) method, which works out the tasks that can be assigned to the given tracker.
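The difference between FIFO and fair ordering can be shown with a toy selection function. This is a deliberately simplified sketch, not the logic of any Hadoop scheduler: it picks a single next job, FIFO by submission time and FAIR by whichever owner currently runs the fewest tasks, and it ignores pools, weights, capacities, and preemption. All names are hypothetical.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class SchedulingSketch {
    enum SchedulingMode { FAIR, FIFO }

    static final class Job {
        final String id;
        final String user;
        final long submitTime;

        Job(String id, String user, long submitTime) {
            this.id = id;
            this.user = user;
            this.submitTime = submitTime;
        }
    }

    // Picks the next job to serve; 'pending' must not be empty.
    static Job pickNext(List<Job> pending,
                        Map<String, Integer> runningTasksPerUser,
                        SchedulingMode mode) {
        Comparator<Job> bySubmitTime = Comparator.comparingLong((Job j) -> j.submitTime);
        if (mode == SchedulingMode.FIFO) {
            return Collections.min(pending, bySubmitTime);
        }
        // FAIR (toy version): favour the user with the fewest running tasks,
        // falling back to submission order on ties.
        Comparator<Job> byUserLoad = Comparator.comparingInt(
            (Job j) -> runningTasksPerUser.getOrDefault(j.user, 0));
        return Collections.min(pending, byUserLoad.thenComparing(bySubmitTime));
    }

    public static void main(String[] args) {
        List<Job> pending = Arrays.asList(
            new Job("job_1", "alice", 1L),
            new Job("job_2", "bob", 2L));
        Map<String, Integer> running = new HashMap<>();
        running.put("alice", 10);   // alice already holds most of the slots

        System.out.println("FIFO picks " + pickNext(pending, running, SchedulingMode.FIFO).id);
        System.out.println("FAIR picks " + pickNext(pending, running, SchedulingMode.FAIR).id);
    }
}
```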

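Next, look at the JobTracker's own work. First it records and updates the number of times the JobTracker has been restarted, retrying until it succeeds: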

    while (true) {
      try {
        recoveryManager.updateRestartCount();
        break;
      } catch (IOException ioe) {
        LOG.warn("Failed to initialize recovery manager.", ioe);
        // wait for some time
        Thread.sleep(FS_ACCESS_RETRY_PERIOD);
        LOG.warn("Retrying...");
      }
    }
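The retry-until-success loop above is a general pattern and can be written as a small reusable helper. The sketch below is one possible shape, with hypothetical names (RetrySketch, IoAction); like the original it retries forever, so it only suits operations that are expected to succeed eventually.

```java
import java.io.IOException;

class RetrySketch {
    // An action that may fail with an IOException, e.g. a file system access.
    interface IoAction {
        void run() throws IOException;
    }

    // Runs the action until it succeeds, sleeping between attempts.
    // Returns the number of attempts that were needed.
    static int retryUntilSuccess(IoAction action, long retryPeriodMillis) {
        int attempts = 0;
        while (true) {
            attempts++;
            try {
                action.run();
                return attempts;
            } catch (IOException ioe) {
                System.err.println("Attempt " + attempts + " failed: " + ioe.getMessage());
                try {
                    Thread.sleep(retryPeriodMillis);   // wait for some time
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException("interrupted while retrying", ie);
                }
            }
        }
    }

    public static void main(String[] args) {
        final int[] calls = { 0 };
        // Simulated flaky operation: fails twice, then succeeds.
        int attempts = retryUntilSuccess(() -> {
            if (++calls[0] < 3) {
                throw new IOException("file system not ready");
            }
        }, 10L);
        System.out.println("succeeded after " + attempts + " attempts");
    }
}
```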

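Then the job scheduler is started with taskScheduler.start(). The code below is from FairScheduler (in stock Hadoop 1.x the default scheduler is JobQueueTaskScheduler unless mapred.jobtracker.taskScheduler selects another one). Startup mainly initializes management objects such as the job pool manager: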

    // Initialize other pieces of the scheduler
    jobInitializer = new JobInitializer(conf, taskTrackerManager);
    taskTrackerManager.addJobInProgressListener(jobListener);
    poolMgr = new PoolManager(this);
    poolMgr.initialize();
    loadMgr = (LoadManager) ReflectionUtils.newInstance(
        conf.getClass("mapred.fairscheduler.loadmanager",
            CapBasedLoadManager.class, LoadManager.class), conf);
    loadMgr.setTaskTrackerManager(taskTrackerManager);
    loadMgr.setEventLog(eventLog);
    loadMgr.start();
    taskSelector = (TaskSelector) ReflectionUtils.newInstance(
        conf.getClass("mapred.fairscheduler.taskselector",
            DefaultTaskSelector.class, TaskSelector.class), conf);
    taskSelector.setTaskTrackerManager(taskTrackerManager);
    taskSelector.start();
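The conf.getClass(key, default, interface) plus ReflectionUtils.newInstance combination above is a plug-in mechanism: the class to instantiate comes from configuration, with a built-in default. A rough plain-Java equivalent is sketched below. The Map stands in for the Hadoop configuration, and the two LoadManager implementations are toy classes invented for this example; they merely borrow the names from the snippet.

```java
import java.util.HashMap;
import java.util.Map;

class PluggableSketch {
    // Toy interface and implementations, for illustration only.
    interface LoadManager {
        boolean canAssign(int runningTasks);
    }

    static class CapBasedLoadManager implements LoadManager {
        public boolean canAssign(int runningTasks) {
            return runningTasks < 2;   // assumed cap of two tasks
        }
    }

    static class UnlimitedLoadManager implements LoadManager {
        public boolean canAssign(int runningTasks) {
            return true;
        }
    }

    // Instantiates the class named under 'key', or the default if the key is unset.
    static <T> T newInstance(Map<String, String> conf, String key,
                             Class<? extends T> defaultClass, Class<T> iface) {
        try {
            String name = conf.get(key);
            Class<? extends T> cls;
            if (name == null) {
                cls = defaultClass;
            } else {
                // asSubclass fails fast if the configured class has the wrong type.
                cls = Class.forName(name).asSubclass(iface);
            }
            return cls.getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("cannot create instance for " + key, e);
        }
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        String key = "mapred.fairscheduler.loadmanager";

        LoadManager byDefault = newInstance(conf, key, CapBasedLoadManager.class, LoadManager.class);
        System.out.println(byDefault.getClass().getSimpleName());

        conf.put(key, UnlimitedLoadManager.class.getName());
        LoadManager configured = newInstance(conf, key, CapBasedLoadManager.class, LoadManager.class);
        System.out.println(configured.getClass().getSimpleName());
    }
}
```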

JobInitializer has a fixed-size ExecutorService threadPool, and each thread in it is used to initialize jobs.
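A fixed-size pool of initializer threads, as just described, can be sketched with java.util.concurrent. This is an illustration under assumptions, not JobInitializer's code: job IDs are plain strings, and "initializing" a job is reduced to recording its ID where the real code would call job.initTasks().

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

class JobInitPoolSketch {
    // Initializes every job on a pool of 'numThreads' worker threads and
    // returns the IDs of the jobs that were initialized (in completion order).
    static List<String> initializeAll(List<String> jobIds, int numThreads) {
        ExecutorService threadPool = Executors.newFixedThreadPool(numThreads);
        List<String> initialized = Collections.synchronizedList(new ArrayList<String>());
        for (String jobId : jobIds) {
            // Stand-in for job.initTasks(): just record that the job was handled.
            threadPool.execute(() -> initialized.add(jobId));
        }
        threadPool.shutdown();   // accept no new work, finish what was queued
        try {
            if (!threadPool.awaitTermination(10, TimeUnit.SECONDS)) {
                throw new IllegalStateException("initialization timed out");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        }
        return initialized;
    }

    public static void main(String[] args) {
        List<String> done = initializeAll(Arrays.asList("job_1", "job_2", "job_3"), 2);
        System.out.println("initialized " + done.size() + " jobs");
    }
}
```

Bounding the pool size keeps a burst of submissions from starting an unbounded number of expensive initializations at once; extra jobs simply wait in the executor's queue.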

The initialization of a single job looks like this:

    try {
      JobStatus prevStatus = (JobStatus) job.getStatus().clone();
      LOG.info("Initializing " + job.getJobID());
      job.initTasks();
      // Inform the listeners if the job state has changed
      // Note : that the job will be in PREP state.
      JobStatus newStatus = (JobStatus) job.getStatus().clone();
      if (prevStatus.getRunState() != newStatus.getRunState()) {
        JobStatusChangeEvent event =
            new JobStatusChangeEvent(job, EventType.RUN_STATE_CHANGED, prevStatus,
                newStatus);
        synchronized (JobTracker.this) {
          updateJobInProgressListeners(event);
        }
      }
    }   // the excerpt ends here; the catch clauses are not shown
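The compare-then-notify step in the code above (snapshot the status, do the work, fire an event only if the run state changed) is sketched below in plain Java. RunState, Listener, and notifyIfChanged are hypothetical simplifications of JobStatus, JobInProgressListener, and updateJobInProgressListeners.

```java
import java.util.ArrayList;
import java.util.List;

class StatusListenerSketch {
    enum RunState { PREP, RUNNING }

    // Simplified listener: receives the job ID and both states.
    interface Listener {
        void jobUpdated(String jobId, RunState before, RunState after);
    }

    // Notifies the listeners only when the run state actually changed.
    // Returns true if an event was fired.
    static boolean notifyIfChanged(String jobId, RunState before, RunState after,
                                   List<Listener> listeners) {
        if (before == after) {
            return false;
        }
        for (Listener listener : listeners) {
            listener.jobUpdated(jobId, before, after);
        }
        return true;
    }

    public static void main(String[] args) {
        List<Listener> listeners = new ArrayList<>();
        listeners.add((id, before, after) ->
            System.out.println(id + ": " + before + " -> " + after));

        notifyIfChanged("job_1", RunState.PREP, RunState.PREP, listeners);      // no event
        notifyIfChanged("job_1", RunState.PREP, RunState.RUNNING, listeners);   // one event
    }
}
```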

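The initialization step mainly generates the tasks and then notifies the other listeners so that they can carry out further work. initTasks mainly handles the following: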

    // Record information about the running job submitted by the user
    try {
      userUGI.doAs(new PrivilegedExceptionAction
