Thanks for your visit!

The following articles have been moved to the new website:

Talend ETL Tutorial : Data Validation
Talend ETL Tutorial : How To Check For Reject Row Option
Talend ETL Tutorial : How To Run Talend Jobs Externally
Talend ETL Tutorial : How To Run Jobs In Sequential and Parallel Mode




Yesterday, I was going through the unanswered posts on the Talendforge forum and one question caught my eye. It was titled "Problem with Joins". At first, I thought it would be some settings problem, but it was not: it was a scenario that required a Right Outer Join. In Talend, we can join data using two components: tJoin and tMap. The problem is that tJoin supports only Inner Join, whereas tMap supports only Inner Join and Left Outer Join. So there is no direct way to perform a Right Outer Join; you have to implement it another way. This is my workaround for the problem; you may well come up with a different approach.

Talendforge forum question: Problem with the joins

Requirement 



Final Job Design:


Steps For Processing:

First, I created three delimited files with a comma (,) as the delimiter and filled them with the records from the requirement. To read these input files, I am using the tFileInputDelimited component. The idea is to create two datasets and then merge them in a way that mimics a Right Outer Join. To build the first dataset, I join the first two input files, Literacy_rate.csv and Graduation_rate.csv. As I don't have to skip any records here, I am using the tJoin component with INNER JOIN.


We want all columns from both input files. To get this, go to Edit Schema and add the columns you need; you can simply use the -> (arrow) button to move columns from one panel to the other. We get the Year, Grade_Rate and Literacy_Rate columns from the input files, but we also have to add one extra column so that the schema works with the tUnite component, which expects the same schema structure from all of its sources. This gives us the first dataset, containing all records from both input files.


In the second dataset, we capture the records from Grade_count.csv. In the output, we should have matched records for the years 2011 and 2012, as these years are present in the above-mentioned input files, with values for the respective columns. We should also have an entry for the year 2013. The Grade_count.csv data will be our main flow, the other two files will be lookups, and the join model will be Left Outer Join.



The next step is to merge this dataset with the first one. To merge them, I am using the tUnite component. As per our requirement, set the merge order as mentioned in the below screenshot.



The next step is to remove the duplicate records. We get two rows each for the years 2011 and 2012, one from the first dataset and one from the second, and we have to keep only the values from the second dataset, as it has the correct values for all the columns. I have defined Year as the key to filter out records: tUniqRow keeps the first record per key and drops the second occurrence. To make sure it picks the dataset coming from tMap, we defined the merge order as mentioned in the above screenshot.
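The unite-and-dedup step can be sketched in plain Java outside Talend. The column values below are invented for illustration; the point is that rows are appended in merge order and only the first row per Year survives, so the dataset listed first in the merge order wins:

```java
import java.util.*;

public class RightJoinWorkaround {
    // Emulates tUnite followed by dedup on Year: append rows in merge
    // order, keep only the first occurrence of each Year.
    static List<String[]> uniteAndDedup(List<String[]> first, List<String[]> second) {
        Map<String, String[]> seen = new LinkedHashMap<>();
        for (String[] row : first) seen.putIfAbsent(row[0], row);   // row[0] = Year
        for (String[] row : second) seen.putIfAbsent(row[0], row);  // duplicates dropped
        return new ArrayList<>(seen.values());
    }

    public static void main(String[] args) {
        // Dataset from tMap: Grade_count.csv as main with the lookups left-joined.
        // Columns: Year, Grad_Rate, Literacy_Rate, Grade_Count (values invented).
        List<String[]> fromTMap = List.of(
            new String[]{"2011", "80", "75", "500"},
            new String[]{"2012", "82", "78", "520"},
            new String[]{"2013", null, null, "510"});
        // Dataset from tJoin: inner join of the two lookups, padded with a null count.
        List<String[]> fromTJoin = List.of(
            new String[]{"2011", "80", "75", null},
            new String[]{"2012", "82", "78", null});
        // tMap's dataset comes first, so its complete rows win for 2011 and 2012.
        for (String[] r : uniteAndDedup(fromTMap, fromTJoin))
            System.out.println(Arrays.toString(r));
    }
}
```

This is only a model of the job's behaviour, not Talend code, but it shows why the merge order matters: swap the two arguments and the padded rows from tJoin would survive instead.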



Next, I am using the tSortRow component just to arrange the data in ascending order of the years.



Once you execute this job, you will get an output like this:






In the last post, we covered how to write records into multiple sheets of an Excel file from a single delimited input file. This post is about reading records from multiple sheets of an Excel file. You can't directly read and write Excel sheets that have different schemas, so we need a few more Talend components during the transformation. We have already seen how to use the tBufferInput and tBufferOutput components to store records temporarily while writing into Excel. For reading, we will use the tMap component to define the different schemas as per our output requirement.

Requirement

The requirement is to read multiple sheets present in an Excel file, where each sheet has a different schema. Extract all records available in the sheets and place them into two different CSV files based on the sheet name.

Input 



Output


Final Job Diagram 


Steps For Processing 

The first step is to read the Excel file, and for that we create a Metadata entry for it, defining the schema at the Repository level. Our Excel file has sheets with different schemas, so we can't read both sheets with a predefined schema specific to one sheet. The trick is, first, to select All Sheets while defining the schema and, second, to take the schema of the sheet with the largest number of columns. Please check the below screenshots:



On the next screen, you will get a preview of the records. You may not see the sheet you expect; don't worry about which records appear, just make sure you get the exact number of columns you need. In my input, the maximum number of columns (4) is on the Details sheet. Even though the preview shows the Name sheet's records here, that is fine, as the number of columns is 4.


On the next screen, you will get a preview of the Metadata. You can rename the columns as you like, but in this scenario keep the column names as generic as possible, as these 4 columns will store the records of both sheets. Later, we will define the exact schema as per our need.



The next step is to split the records. In our requirement, we have to split the records based on their sheet. To split them, we are going to use the tMap component. To read the CURRENT SHEET, we use the global variable: ((String)globalMap.get("tFileInputExcel_1_CURRENT_SHEET"))
I am storing this expression in the variable var1. If you want, you can use it directly in the out1 and out2 expression editors; using a variable here just keeps the code clean and concise.


So, once done with the mapping, you can rename the columns based on your output requirement. As you can see here, the input columns hold the records of both sheets. I detect the current sheet being read and then map the input columns to the output based on that sheet's schema. This is the only trick you have to apply while reading multiple sheets of an Excel file.
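Outside Talend, the routing that tMap performs with this variable is just a dispatch on the sheet name. A minimal plain-Java sketch (sheet names Name and Details as in this post; the row values are invented):

```java
import java.util.*;

public class SheetRouter {
    // Mirrors the tMap filter expressions: rows whose current sheet is "Name"
    // go to out1, rows from "Details" go to out2. The sheet name here stands
    // in for ((String) globalMap.get("tFileInputExcel_1_CURRENT_SHEET")).
    static Map<String, List<String[]>> route(List<String> sheetNames, List<String[]> rows) {
        Map<String, List<String[]>> out = new HashMap<>();
        out.put("out1", new ArrayList<>());
        out.put("out2", new ArrayList<>());
        for (int i = 0; i < rows.size(); i++) {
            String var1 = sheetNames.get(i);           // "var1" as in the tMap
            if ("Name".equals(var1)) out.get("out1").add(rows.get(i));
            else if ("Details".equals(var1)) out.get("out2").add(rows.get(i));
        }
        return out;
    }

    public static void main(String[] args) {
        // Invented rows; in the job they arrive with the 4 generic columns.
        Map<String, List<String[]>> out = route(
            List.of("Name", "Details"),
            List.of(new String[]{"Ravi", "Kumar"},
                    new String[]{"Ravi", "30", "Delhi", "Engineer"}));
        System.out.println("out1 rows: " + out.get("out1").size()
                + ", out2 rows: " + out.get("out2").size());
    }
}
```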

The next step is to map this output to the two different CSV files. The configuration would be like this:









In my previous posts, I covered how to read an Excel file and put data into it, but only for a single sheet. What if we have to generate multiple sheets for the records coming from an input file? It's possible, and there is a way to do it. You can create any number of sheets based on your requirement. In this post, I am going to create two sheets in an Excel file. Let's check how to do it.

Requirement 

The requirement is to process a fixed-length file and generate the output in an Excel file, displayed in two different sheets of the same file.

Input 



Output 


The Final Job Diagram 


Steps For The Processing 

The first step is to read our input file, which is a fixed-length file. To read it, define the Metadata using the File Positional option and set the positions as per your input file. I have defined the Metadata as per my input file:



Once you define the Metadata, you can read the input file using the tFileInputPositional component. The configuration would be as mentioned in the below screenshot:


As we have to split the single input file into multiple sheets, we are going to use the tMap component to split the contents. The configuration of the tMap component would be as mentioned in the below screenshot:



Now, the next thing is to put these records into the Excel file. For this, we are going to use tFileOutputExcel. The configuration would be like this:



We can't define more than one sheet name here, but there is a workaround. We are going to use tBufferOutput to store the records for the second sheet in memory, and later use the tBufferInput component to retrieve those values. The tBuffer* components provide a temporary location to store and retrieve records.


You can automatically populate the schema for tBufferOutput by pressing Sync columns. This can't be done for tBufferInput; you have to define its schema manually. As we are going to retrieve what we stored in memory via tBufferOutput, we have to use the same schema as in tBufferOutput. The schema would be as mentioned in the below screenshot:
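Conceptually, the tBufferOutput/tBufferInput pair behaves like a shared in-memory list that one subjob fills and a later subjob drains. A rough plain-Java analogy (this is only the idea, not the Talend API; the sample rows are invented):

```java
import java.util.*;

public class BufferSketch {
    // In-memory store standing in for the buffer behind tBufferOutput/tBufferInput.
    private static final List<String[]> BUFFER = new ArrayList<>();

    static void bufferOutput(String[] row) {    // first subjob stores a row
        BUFFER.add(row);
    }

    static List<String[]> bufferInput() {       // later subjob reads them back
        return new ArrayList<>(BUFFER);         // same rows, same schema
    }

    public static void main(String[] args) {
        // While writing sheet 1, stash the rows meant for sheet 2.
        bufferOutput(new String[]{"EMP1", "Sales"});
        bufferOutput(new String[]{"EMP2", "HR"});
        // The second subjob retrieves them and writes the second sheet.
        for (String[] row : bufferInput())
            System.out.println(String.join(",", row));
    }
}
```

This also makes clear why tBufferInput needs the same manually defined schema: it hands back exactly the rows that tBufferOutput stored.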




The last step is to link tBufferInput to the Excel file using the tFileOutputExcel component. We have to point to the same Excel file, as we want to write into the same file but in a different sheet. Give the new sheet a name and don't forget to tick Append Existing File and Sheet.









In my last Talend tutorial, I covered how to use the tMap component to reject rows from an input file. There is an alternative: instead of tMap, we can use the tFilterRow component to filter the records. I am going to use the same scenario as in the previous post and will achieve the same output.

Requirement 

In this demo, we are going to process an input file that has only one column: NAME. We are going to pick the records which start with A or B and reject the rest. The input file is as mentioned below:

The Final Job Diagram :




Steps For The Processing :

The first step is to read the input file. As in the previous posts, I am going to use the tFileInputDelimited component to read the file. The configuration of this component is as below:



The next step is to process the records based on our requirement. We have to check whether the first character of each input record is equal to A or B. For the processing, we are going to use the tFilterRow component. The whole configuration will look like this:


The code :

input_row.Name.substring(0,1).equals("A")||input_row.Name.substring(0,1).equals("B")

Talend already provides a template for writing the filter condition in advanced mode, which hints at how to reference the columns. This window gives us a lot of flexibility, as we can write our own custom condition in Java instead of relying only on the built-in functions, which are few and basic for filter criteria.

Please check this link to learn how to use Substring in Java.
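As a variation, the same check can be written with startsWith, which also avoids a StringIndexOutOfBoundsException when a NAME value is empty. This is an alternative to, not a copy of, the expression shown above:

```java
public class NameFilter {
    // Same test as the advanced-mode condition, written with startsWith:
    // "".substring(0, 1) would throw, but "".startsWith("A") is simply false.
    static boolean accept(String name) {
        return name.startsWith("A") || name.startsWith("B");
    }

    public static void main(String[] args) {
        for (String n : new String[]{"Alice", "Bob", "Carol", ""})
            System.out.println(n + " -> " + (accept(n) ? "Filter" : "Reject"));
    }
}
```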

tFilterRow Demystified..

tFilterRow provides two modes of filtering: Simple and Advanced. Simple mode requires you to select values from the given options; to add your columns and filter criteria, just click the (+) button. But you won't get many options to choose from, as the choices under Function and Operator are very basic. The other option is Advanced mode. This mode adds a lot of flexibility to tFilterRow and makes it very powerful: here you can use your basic Java skills and add criteria at will. If you want to use both modes, select the And condition from the dropdown under Logical Operator Used To Combine Conditions.

You can get more details from here : tFilterRow In Detail



Once you are done with the tFilterRow configuration, just link it to the desired output component. I am using the tLogRow component here; similarly, you could use tFileOutputDelimited to put the records into a text file.


Also, you don't have to define anything special for the reject rows or their link to the output component. First, click on the tFilterRow component, go to Row -> Filter and connect the first tLogRow component. Then repeat the process; this time Talend will only offer the Reject option, so use it to link the other tLogRow component.



Click on the Reject row to get more details about it. The Reject row comes with a default Error message column.



The Final Job Execution and Output :

You will get an output like this once you run the Job :











We often get requirements where we have to pick some rows and drop others based on some conditions, which requires processing an input file. If you have worked with other ETL tools such as DataStage, the Transformer is your processing stage; in the same way, Talend provides the tMap component for processing. In this demo, we are going to filter rows using the tMap component.

Let's check how to get it done...

Requirement 

In this demo, we are going to process an input file. This input file has only one column: CITY. We are going to pick the records which start with A or B and reject the rest. The input file is as mentioned below:



The Final Job Diagram:



 Steps For The Processing 

  • The first step is to read the input file. As in the previous posts, I am going to use the tFileInputDelimited component to read the file. The configuration of this component is as below:

  • The next step is to process the records based on our requirement. We have to check whether the first character of each input record is equal to A or B. For the processing, we are going to use the tMap component. The whole configuration will look like this:


  • On the left side of the screen, we get the input columns. As soon as we connect the input to the tMap, all the columns are populated automatically.



  • On the right side of the screen, you can see two output windows: out1 and out2. This is required as we are going to filter the records based on the condition. The out1 window is for the valid records (those matching our criteria); the out2 window is for the rejected records.



  • To add the conditional statement, first open the expression editor. Inside it, you can use the different functions provided by Talend; here, I am using the LEFT function under the StringHandling category.




  • To reject records, we don't have to add any condition specifically; Talend provides a feature to catch rejected records, and we are just using that. Open the tMap settings and set Catch Output Reject to True.

  • Finally, we just have to write the records coming out of out1 and out2 to output files. For this, we are going to use the tFileOutputDelimited component. The VALID.txt file will store the records matching our criteria, and the rest will go to INVALID.txt.
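The split performed by the tMap boils down to: take the first character of CITY, send A/B matches to the valid output, and let everything else fall through to the reject output. A plain-Java sketch with invented city names:

```java
import java.util.*;

public class CitySplit {
    // Reproduces the tMap logic: StringHandling.LEFT(row.CITY, 1) is roughly
    // city.substring(0, 1) here; matches go to out1 (VALID.txt), the rest are
    // caught by the output reject (INVALID.txt).
    static Map<String, List<String>> split(List<String> cities) {
        Map<String, List<String>> out = new HashMap<>();
        out.put("valid", new ArrayList<>());
        out.put("invalid", new ArrayList<>());
        for (String city : cities) {
            String left = city.substring(0, 1);
            if (left.equals("A") || left.equals("B")) out.get("valid").add(city);
            else out.get("invalid").add(city);
        }
        return out;
    }

    public static void main(String[] args) {
        // Invented city names; the real input comes from tFileInputDelimited.
        Map<String, List<String>> out =
            split(List.of("Agra", "Berlin", "Chennai", "Boston", "Delhi"));
        System.out.println("VALID.txt   -> " + out.get("valid"));
        System.out.println("INVALID.txt -> " + out.get("invalid"));
    }
}
```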





The Final Job Execution:


It's time to execute our designed job. The final result would be like this:



© 2015 Techie's House. Designed by Bloggertheme9