Summary
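This course covered many common data traps, spanning dataset quality, reasoning, visualization, and statistical analysis.
ML practitioners should ask:
- How well do I understand the characteristics of my datasets and the conditions under which the data was collected?
- What quality or bias issues exist in my data? Are confounding factors present?
- What downstream problems could arise from using these particular datasets?
- When training a model that makes predictions or classifications, does the training dataset contain all relevant variables?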
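Some of the quality questions above can be partly checked in code before any model is trained. The following is a minimal sketch, not a method prescribed by the course: the function name `audit_rows`, the column names, and the sample records are all hypothetical and synthetic, and records are assumed to be flat dictionaries with scalar values. It counts missing values per column, exact duplicate records, and the share of each label.

```python
from collections import Counter


def audit_rows(rows, label_key):
    """Summarize basic quality signals for a list of dict records.

    Hypothetical helper for illustration; assumes flat, hashable values.
    """
    columns = sorted({key for row in rows for key in row})

    # Count missing values (absent key, None, or empty string) per column.
    missing = {
        col: sum(1 for row in rows if row.get(col) in (None, ""))
        for col in columns
    }

    # Count exact duplicate records beyond the first occurrence of each.
    seen = Counter(tuple(row.get(col) for col in columns) for row in rows)
    duplicates = sum(count - 1 for count in seen.values())

    # Compute the share of each label to surface class imbalance.
    labels = Counter(
        row.get(label_key)
        for row in rows
        if row.get(label_key) not in (None, "")
    )
    total = sum(labels.values())
    label_share = (
        {label: count / total for label, count in labels.items()}
        if total
        else {}
    )

    return {
        "missing": missing,
        "duplicates": duplicates,
        "label_share": label_share,
    }


# Synthetic records, invented purely to exercise the helper.
rows = [
    {"age": 34, "region": "north", "label": "yes"},
    {"age": None, "region": "north", "label": "no"},
    {"age": 34, "region": "north", "label": "yes"},
    {"age": 51, "region": "", "label": "no"},
    {"age": 29, "region": "south", "label": "no"},
]

report = audit_rows(rows, "label")
print(report)
```

A report like this only flags symptoms. It cannot reveal how the data was collected, whether a confounding factor is present, or whether a relevant variable was never recorded, so those questions still call for domain knowledge.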
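Whatever their findings, ML practitioners should always examine themselves for confirmation bias, then check their findings against intuition and common sense, and investigate wherever the data conflicts with them.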
Additional reading
Cairo, Alberto. How Charts Lie: Getting Smarter about Visual Information. New York: W.W. Norton, 2019.
Huff, Darrell. How to Lie with Statistics. New York: W.W. Norton, 1954.
Monmonier, Mark. How to Lie with Maps, 3rd ed. Chicago: University of Chicago Press, 2018.
Jones, Ben. Avoiding Data Pitfalls. Hoboken, NJ: Wiley, 2020.
Wheelan, Charles. Naked Statistics: Stripping the Dread from the Data. New York: W.W. Norton, 2013.
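Except as otherwise noted, the content of this page is licensed under the Creative Commons Attribution 4.0 License, and code samples are licensed under the Apache 2.0 License. For details, see the Google Developers Site Policies. Java is a registered trademark of Oracle and/or its affiliates.
Last updated 2025-07-27 UTC.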