Technical Bulletins

The Output Count of a Union Sheet Changes If Some Sheets Aren't Kept

Details

Hierarchical unions whose union sheets aren't kept output partial results when combined with map-side-join sheets. The problem only occurs in a very specific configuration that meets all of the following requirements:

  • The workbook contains at least three union sheets:
    • The results of two union sheets are the sources for the third union sheet.
    • The union sheets aren't kept.
  • The workbook contains at least one join sheet:
    • The join sheet must process as a map-side-join at runtime.
    • The join sheet must process data before at least one of the union sheets.

If all of the above requirements are met, the problem is as follows: if two union sheets ("UnionA" and "UnionB") feed another union sheet ("UnionResult") downstream in the same workbook, and the source union sheets (UnionA and/or UnionB) aren't kept, the final union (UnionResult) lacks the records of three of the four initial input sources.
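The at-risk configuration described above can be sketched as a simple check over sheet metadata. The following Python sketch is purely illustrative: the field names ("name", "sheet_type", "kept", "sources") are assumptions for this example, not Datameer's internal data model, and the check simplifies the "join processes before a union" requirement to the mere presence of a join sheet.

```python
# Illustrative check for the at-risk configuration described above.
# A workbook is modeled as a list of sheet dicts; the field names are
# assumptions for this sketch, not Datameer's actual data model.

def is_at_risk(sheets):
    unions = [s for s in sheets if s["sheet_type"] == "union"]
    joins = [s for s in sheets if s["sheet_type"] == "join"]
    # At least three union sheets, not all of them kept,
    # and at least one join sheet present (simplification: we do not
    # verify that the join actually runs before a union).
    if len(unions) < 3 or all(s["kept"] for s in unions) or not joins:
        return False
    # At least one union sheet must be fed by two other union sheets
    # (the hierarchical-union layout: UnionA + UnionB -> UnionResult).
    union_names = {s["name"] for s in unions}
    return any(len(union_names.intersection(s["sources"])) >= 2 for s in unions)

workbook = [
    {"name": "UnionA", "sheet_type": "union", "kept": False, "sources": ["Src1", "Src2"]},
    {"name": "UnionB", "sheet_type": "union", "kept": False, "sources": ["Src3", "JoinX"]},
    {"name": "UnionResult", "sheet_type": "union", "kept": False, "sources": ["UnionA", "UnionB"]},
    {"name": "JoinX", "sheet_type": "join", "kept": True, "sources": ["Src4", "Src5"]},
]
print(is_at_risk(workbook))  # True: hierarchical unkept unions plus a join sheet
```

Keeping all union sheets (the workaround described below under "Immediate action required") makes the check return False.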

Versions affected

  • Datameer 7.2: 7.2.3
  • Datameer 7.1: 7.1.3, 7.1.4, 7.1.5, 7.1.6, 7.1.7 and 7.1.8
  • Datameer 6.4: 6.4.0, 6.4.1, 6.4.2, 6.4.3, 6.4.4, 6.4.5, 6.4.6, 6.4.7, 6.4.8, 6.4.9, 6.4.10, 6.4.11 and 6.4.12
  • Datameer 6.3: All released versions

Users affected

Datameer users who build workbooks with hierarchical unions where the union sheets aren't kept may be affected. Users of downstream workbooks may be indirectly impacted.

Datameer administrators can verify whether any workbooks are affected by running the following SQL query:

SELECT t.workbook_fk WorkbookID
FROM (
         SELECT
             SUM(sheet_type = "das.internal.UnionSheetType") union_sheets,
             SUM(sheet_type = "das.internal.UnionSheetType" AND keep) kept_union_sheets,
             SUM(sheet_type = "das.internal.JoinedSheetType") join_sheets,
             workbook_fk
         FROM sheet
         GROUP BY workbook_fk) t
WHERE t.union_sheets > 2 AND t.union_sheets != t.kept_union_sheets AND t.join_sheets > 0;

If the query returns 0 rows, no workbooks are affected. If it returns workbook IDs, those workbooks may be affected and need attention on affected versions of Datameer.
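To illustrate what the query detects, here is a self-contained sketch that replays its logic against mock data in SQLite. The table contents below are invented for illustration; only the table and column names mirror the query above. Note that string literals use single quotes here (standard SQL), whereas the query above is written for MySQL, which also accepts double quotes.

```python
import sqlite3

# Mock "sheet" table demonstrating the detection query's logic.
# The rows below are invented for illustration purposes only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sheet (workbook_fk INTEGER, sheet_type TEXT, keep INTEGER);
-- Workbook 1: three union sheets (one unkept) plus a join sheet -> flagged.
-- Workbook 2: all three union sheets kept -> not flagged.
INSERT INTO sheet VALUES
  (1, 'das.internal.UnionSheetType', 0),
  (1, 'das.internal.UnionSheetType', 1),
  (1, 'das.internal.UnionSheetType', 1),
  (1, 'das.internal.JoinedSheetType', 1),
  (2, 'das.internal.UnionSheetType', 1),
  (2, 'das.internal.UnionSheetType', 1),
  (2, 'das.internal.UnionSheetType', 1),
  (2, 'das.internal.JoinedSheetType', 1);
""")
rows = conn.execute("""
SELECT t.workbook_fk WorkbookID
FROM (SELECT
          SUM(sheet_type = 'das.internal.UnionSheetType') union_sheets,
          SUM(sheet_type = 'das.internal.UnionSheetType' AND keep) kept_union_sheets,
          SUM(sheet_type = 'das.internal.JoinedSheetType') join_sheets,
          workbook_fk
      FROM sheet
      GROUP BY workbook_fk) t
WHERE t.union_sheets > 2 AND t.union_sheets != t.kept_union_sheets
  AND t.join_sheets > 0
""").fetchall()
print(rows)  # [(1,)] -- only workbook 1 matches
```

Workbook 1 is reported because it has more than two union sheets, not all of them kept, and at least one join sheet; workbook 2 escapes because all of its union sheets are kept.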

Severity

High

Resolved versions

It is recommended to install the latest Datameer Maintenance Release. At a minimum, this issue is resolved in the following Datameer versions:

  • 6.4.13
  • 7.1.9
  • 7.2.4

Datameer versions beyond 7.2.4 will also include the fix. 

Immediate action required

To work around this issue, there is an immediately available option:

  • In affected workbooks, configure all union sheets to be kept. 

This workaround might increase the execution time of workbooks that contain the newly kept sheets. It is recommended to schedule an update to the latest maintenance patch as soon as possible.

More information and updates may be available through the following KB article: Output count of Union Sheet changes if some sheets are not kept


Duplicate Records Processed in Workbook Functions 

Details

Functions output duplicate records on some workbook sheets. Specifically, source Parquet files smaller than a size threshold may be read twice. Any downstream calculations then include the duplicated data.

The problem only occurs if the job processes at least one small file (smaller than the threshold) and at least one large file (larger than the threshold). The default threshold is 256 MB but can be altered via system settings.
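The trigger condition can be expressed as a simple check over the input file sizes. The following minimal Python sketch is illustrative only; the 256 MB default comes from this bulletin, and the function name is hypothetical:

```python
# Flags a job whose Parquet inputs mix files below and above the split
# threshold -- the combination that triggers the duplicate-read issue.
DEFAULT_THRESHOLD = 256 * 1024 * 1024  # 256 MB default; configurable via system settings

def job_at_risk(file_sizes, threshold=DEFAULT_THRESHOLD):
    has_small = any(size < threshold for size in file_sizes)
    has_large = any(size > threshold for size in file_sizes)
    return has_small and has_large

mb = 1024 * 1024
print(job_at_risk([100 * mb, 300 * mb]))  # True: one small and one large file
print(job_at_risk([100 * mb, 200 * mb]))  # False: no file exceeds the threshold
```

A job whose input files are all below (or all above) the threshold is not affected; only the mixed case triggers the duplicate reads.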

Versions affected

  • Datameer 7.1: 7.1.3, 7.1.4 and 7.1.5
  • Datameer 6.4: 6.4.7, 6.4.8 and 6.4.9
  • Datameer 6.3: 6.3.9 and 6.3.10

Users affected

Datameer users who reference data from sources stored in Parquet format, where at least one file is below the size threshold, may be affected. Users of downstream workbooks may be indirectly impacted.

Severity

High

Resolved versions

It is recommended to install the latest Datameer Maintenance Release. At a minimum, this issue is resolved in the following Datameer versions:

  • 6.4.10
  • 7.1.6

Datameer versions beyond 7.1.6 also include the fix. 

Immediate action required

To work around this issue, there is an immediately available option:

  • Add the custom property das.splitting.disable-individual-file-splitting=true to the Hadoop Cluster's Custom Properties section. (Note: this option may cause a significant performance impact for very large Parquet files, which will not be split and can therefore no longer be processed in parallel.)

More information and updates may be available through the following KB article: Duplicate Records Processed in Workbook Functions